Data is stored securely in \\wsrm020inf\DataVal\
Before any script execution, ensure that you have setup a python environment with the requirements installed from requirements.txt on all relevant machines.
The current ETL is set by modifying ETL_NAME in CURRENT.env, the environment by modifying ENVIRONMENT_NAME.
HTTP_PROXY and HTTPS_PROXY should both be configured to the qrm-services proxy, unless running the script from a machine based at GPO.
This determines both the extract and test details pertaining to each ETL & Environment.
For live ETLs, ensure COMMON: BASEPATH is set to \\wsrm020inf\dataval\, and the LOG_PATH_PATTERN and EXTRACT_PATH_PATTERN include the following values:
{ETL_NAME}: The currentETL_NAMEdetermined byCURRENT.env{ENVIRONMENT}: The currentENVIRONMENT_NAMEdetermined byCURRENT.env{VALIDATION_TYPE}: The current validation type, eitherDATA_LOADorDATA_QUALITY_UPLIFT{SYSTEM}: The system being extracted/tested, e.g.DataRemedy,EC,ECP,UKG
Under ETL: (ETL name): (Environment name), the following values should be set:
WRITE_LOGS: Whether logs should be written to the log path. Should always beTrueUKG_BASE_PATH: Base url for UKG extract.SECRETS:.envfile used for credentials and secretsLOAD_EC_CONFIG_MAP: Configuration file used for matching files during EC/ECP Data Load validations.TRANSFORM_EC_CONFIG_MAP: Same asLOAD_EC_CONFIG_MAP, but for transformation validation. Unused.DATAREMEDY_EXTRACT_DB: Database used for DataRemedy extracts.EC_API_URL: Base URL for the EC API.GRANULAR_EXTRACT_FORCE_LIST: Used during extraction to prevent extracting all HRMIS data at once.DATA_QUALITY_UPLIFT_OUTPUT: Local URL for outputting excel results for Data Quality Uplift validations.DATA_QUALITY_UPLIFT_SHAREPOINT: Sharepoint URL for DQU validation output, used to generate sharepoint links.LATTICE_EXTRACT_DATE: Extract date for the given ETL, used during some Data Quality validations as a proxy "Current Date"
Secrets file should include credentials for DataRemedy, EC, ECP, and UKG.
Extraction should only be executed from the HRMIS NUK, as this is the only QRM machine with access to all target systems.
At the beginning of a new ETL, before extracting any tables from any system, the granular extract force list should be created. Ensure the GRANULAR_EXTRACT_FORCE_LIST file for the current etl has been created under the extract/targeted_extract/ path, with the following text:
clear: []
ignore: []
queue: []This file is used for tracking which files have been recognised in DataRemedy, which files have been extracted, and which files are queued for extraction. This can be populated with python extract/extract.py -p. This should only be done ONCE per ETL, as this will overwrite the entire granular extract file.
When extracting from DataRemedy, refer to the GRANULAR_EXTRACT_FORCE_LIST file. Cut-and-paste the relevant files from the ignore list, and add them to the queue. Make sure all files related to the required table have been added to the queue (E.G. if the extract calls for basic_user_info_01_unscrambled, ensure BOTH mapping.basic_user_info_01_scrambled AND mapping.basic_user_info_01_unscrambled are queued for extraction).
Once the relevant files have been queued up, the extract can be run with python extract/extract.py -f. IMPORTANT: Do not omit the -f instruction when extracting - omitting -f will cause the extract script to ignore the granular force list, and attempt to extract everything.
When extracting from DataRemedy mapping files, the corresponding EC and/or ECP files will automatically be extracted according to the pointers_ec.yaml and pointers_ecp.yaml files respectively.
Once the relevant data has been extracted, validation can begin. VSCode should be able to find the relevant tests with pytest:
For Data Quality Uplift runs: Ensure that all data quality uplift tests are run at once, rather than individually running DQ rules. This ensures that pytest can collect a list of failing rules when providing a defect summary.
Pytest will report all tests as Passed, regardless of the amount of defects found by the script - Pytest will only report a test as Failed if a script failure has occurred.
Results from running each test should be logged according to the BASE_PATH and LOG_PATH_PATTERN values found in the environment_config.yaml.
Individual folders will be created for each data quality rule run, as well as a single defects.csv file containing the results from each rule run.
Each folder will contain the following:
__summary.jsoncontains information regarding the test execution itself.default.jsonwill contain a technical log with all the internal defects recorded. If there are a lot of defects, there will be multipledefault_X.jsonfiles.errors.csvcontains a list of all internal defects found during validation.errors.xlsxis a more detailed spreadsheet, containing both the defective data found during validation, and information about the rule being tested. This file should be used for logging and reporting defects.
Individual folders will be created for each table validated. Each folder will contain the following:
__summary.jsoncontains information regarding the test execution itself.test_[table name].jsonwill contain a technical log with all the internal defects recorded. If there are a lot of defects, there will be multipletest_[table name]_X.jsonfiles.errors.csvcontains a list of all internal defects found during validation.test_[table name]_logs.xlsxis a more detailed spreadsheet, containing the defective data found during testing, with information on both the source & target data compared. This file should be used for logging and reporting defects.
In the test_[table name]_logs.xlsx file, create a filter by selecting all the rows/columns in the spreadsheet, and selecting "Filter" from the Data tab.
When logging Data Load validations, individual reports should be created for each mismatching field found in the "Mismatch in Target" rows. E.G. If a table was found to have mismatches for both the standardHours and seqNumber fields, a seperate ETL bug should be raised encapsulating all standardHours mismatches, and all seqNumber mismatches.
For non-mismatch Error Types, an individual report should be created for each Error Type found, I.E. There should be an ETL bug raised encapsulating all Missing Rows, and all Duplicate Rows.
When reading reports, fields highlighted in bold in the topmost row are the primary key used to match source rows to their target entries.
For fields compared using a custom field format - such as a unique datatype or a picklist - cells can be hovered to show how the validation script has compared them.


