ExtractTracker - Python

Below is a deeper look at the capabilities of the ExtractTracker submodule in the Python implementation. Note that ProcessTracker MUST be used in conjunction with ExtractTracker.

Registering Extracts

Once the process run has been registered, an extract can be registered by supplying the following variables:

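# Import paths are assumed from the process_tracker package layout; adjust if your install differs.
from process_tracker.process_tracker import ProcessTracker
from process_tracker.extract_tracker import ExtractTracker

# Register the process run first; the extract is attached to it.
process_run = ProcessTracker(process_name='Lahman Teams Load',
                             process_type='Stage Load',
                             actor_name='New ProcessTracker User',
                             tool_name='Spark',
                             source_name='Lahman Baseball Dataset')

# Register the extract file against that process run.
extract = ExtractTracker(process_run=process_run,
                         filename='Teams.csv',
                         location_name='Lahman Baseball Databank 2018',
                         location_path='~/baseballdatabank-master_2018-03-28/baseballdatabank-master/core/')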

These variables populate the data store backend as described in the following table:

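ExtractTracker object initialization variables

=============  =========================================================================  ================  ============================
Variable Name  Variable Description                                                       Reference Object  Object Created If Not Exist?
=============  =========================================================================  ================  ============================
process_run    An instance of ProcessTracker                                              Process Tracking  No
filename       The extract file’s filename                                                Extract Tracking  Yes
location       An instance of Extract Location, optional if created already.              Location          No
location_name  The given name of the location. Optional.                                  Location          Yes
location_path  The filepath of the location. Required if location instance not provided.  Location          Yes
status         The extract file status. Optional.                                         Extract Status    Yes
=============  =========================================================================  ================  ============================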
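
The rules in the table can be checked before the ExtractTracker call is made. The sketch below is one way to do that, assuming nothing about ProcessTracker's internals: build_extract_kwargs is a hypothetical helper (not part of the library) that only enforces what the table states, namely that process_run is required and that either a location instance or a location_path must be supplied.

```python
def build_extract_kwargs(process_run, filename, location=None,
                         location_name=None, location_path=None, status=None):
    """Collect ExtractTracker keyword arguments, enforcing the table's rules.

    Hypothetical helper for illustration; not part of ProcessTracker.
    """
    if process_run is None:
        raise ValueError("process_run is required.")
    if location is None and location_path is None:
        raise ValueError("Provide a location instance or a location_path.")

    kwargs = {"process_run": process_run, "filename": filename}
    optional = {
        "location": location,
        "location_name": location_name,
        "location_path": location_path,
        "status": status,
    }
    # Pass along only the optional arguments the caller actually set.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


# A plain object stands in for a real ProcessTracker instance here.
kwargs = build_extract_kwargs(
    process_run=object(),
    filename="Teams.csv",
    location_name="Lahman Baseball Databank 2018",
    location_path="~/baseballdatabank-master_2018-03-28/baseballdatabank-master/core/",
)
print(sorted(kwargs))
```

With a real process run in place of the stand-in object, the resulting dictionary could be passed on as ExtractTracker(**kwargs).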

Changing Extract Status

As extract files are used within a process run, their status will need to be updated:

extract.change_extract_status(status='loading')

Custom extract statuses can be entered, but ProcessTracker only knows what to do with files that are in one of the default status types. As long as the file’s status is eventually changed to one of the defaults, the process flow will continue.
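
One way to make sure a custom status never becomes a file's final status is to wrap the work in a context manager that always finishes on a default status. The following is a minimal sketch under stated assumptions: StubExtract is a stand-in that only records calls to change_extract_status, extract_status is a hypothetical helper, and the status names 'loaded' and 'error' are assumed for illustration; check which default status types your installation defines.

```python
from contextlib import contextmanager


class StubExtract:
    """Stand-in for an ExtractTracker instance; it only records status changes."""

    def __init__(self):
        self.history = []

    def change_extract_status(self, status):
        self.history.append(status)


@contextmanager
def extract_status(extract, working, on_success, on_failure):
    """Set a working status, then always finish on a final status."""
    extract.change_extract_status(status=working)
    try:
        yield extract
    except Exception:
        # The work failed: record the failure status and re-raise.
        extract.change_extract_status(status=on_failure)
        raise
    else:
        extract.change_extract_status(status=on_success)


extract = StubExtract()
# 'validating' is a custom status; 'loaded' and 'error' are assumed default names.
with extract_status(extract, working="validating",
                    on_success="loaded", on_failure="error"):
    pass  # process the file here

print(extract.history)
```

Because the final status is set in both the success and failure paths, the file does not stay in the custom status once the block exits.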