Skip to content

common-systems/test

Repository files navigation

HRMIS Data Validation

Link to Test Plan

Data is stored securely in \\wsrm020inf\DataVal\

Setup

Python Environment & requirements.txt

Before any script execution, ensure that you have setup a python environment with the requirements installed from requirements.txt on all relevant machines.

CURRENT.env

The current ETL is set by modifying ETL_NAME in CURRENT.env, the environment by modifying ENVIRONMENT_NAME.

HTTP_PROXY and HTTPS_PROXY should both be configured to the qrm-services proxy, unless running the script from a machine based at GPO.

environment_config.yml

This determines both the extract and test details pertaining to each ETL & Environment.

For live ETLs, ensure COMMON: BASEPATH is set to \\wsrm020inf\dataval\, and the LOG_PATH_PATTERN and EXTRACT_PATH_PATTERN include the following values:

  • {ETL_NAME}: The current ETL_NAME determined by CURRENT.env
  • {ENVIRONMENT}: The current ENVIRONMENT_NAME determined by CURRENT.env
  • {VALIDATION_TYPE}: The current validation type, either DATA_LOAD or DATA_QUALITY_UPLIFT
  • {SYSTEM}: The system being extracted/tested, e.g. DataRemedy, EC, ECP, UKG

Under ETL: (ETL name): (Environment name), the following values should be set:

  • WRITE_LOGS: Whether logs should be written to the log path. Should always be True
  • UKG_BASE_PATH: Base url for UKG extract.
  • SECRETS: .env file used for credentials and secrets
  • LOAD_EC_CONFIG_MAP: Configuration file used for matching files during EC/ECP Data Load validations.
  • TRANSFORM_EC_CONFIG_MAP: Same as LOAD_EC_CONFIG_MAP, but for transformation validation. Unused.
  • DATAREMEDY_EXTRACT_DB: Database used for DataRemedy extracts.
  • EC_API_URL: Base URL for the EC API.
  • GRANULAR_EXTRACT_FORCE_LIST: Used during extraction to prevent extracting all HRMIS data at once.
  • DATA_QUALITY_UPLIFT_OUTPUT: Local URL for outputting excel results for Data Quality Uplift validations.
  • DATA_QUALITY_UPLIFT_SHAREPOINT: Sharepoint URL for DQU validation output, used to generate sharepoint links.
  • LATTICE_EXTRACT_DATE: Extract date for the given ETL, used during some Data Quality validations as a proxy "Current Date"

Secrets

Secrets file should include credentials for DataRemedy, EC, ECP, and UKG.

Test Execution

Extracting from DataRemedy & Target systems

Extract Privileges

Extraction should only be executed from the HRMIS NUK, as this is the only QRM machine with access to all target systems.

Initialising repository for extraction

At the beginning of a new ETL, before extracting any tables from any system, the granular extract force list should be created. Ensure the GRANULAR_EXTRACT_FORCE_LIST file for the current etl has been created under the extract/targeted_extract/ path, with the following text:

clear: []
ignore: []
queue: []

This file is used for tracking which files have been recognised in DataRemedy, which files have been extracted, and which files are queued for extraction. This can be populated with python extract/extract.py -p. This should only be done ONCE per ETL, as this will overwrite the entire granular extract file.

Extracting data from DataRemedy

When extracting from DataRemedy, refer to the GRANULAR_EXTRACT_FORCE_LIST file. Cut-and-paste the relevant files from the ignore list, and add them to the queue. Make sure all files related to the required table have been added to the queue (E.G. if the extract calls for basic_user_info_01_unscrambled, ensure BOTH mapping.basic_user_info_01_scrambled AND mapping.basic_user_info_01_unscrambled are queued for extraction).

Once the relevant files have been queued up, the extract can be run with python extract/extract.py -f. IMPORTANT: Do not omit the -f instruction when extracting - omitting -f will cause the extract script to ignore the granular force list, and attempt to extract everything.

Extracting data from target systems

When extracting from DataRemedy mapping files, the corresponding EC and/or ECP files will automatically be extracted according to the pointers_ec.yaml and pointers_ecp.yaml files respectively.

Validation

Running the validations

Once the relevant data has been extracted, validation can begin. VSCode should be able to find the relevant tests with pytest:

A screenshot of the HRMIS test configuration from VSCode's "Testing" tab. A hierarchy displays a list of tests under the following path: data-migration-validation -> validation -> dataload -> ecp -> test_ECP.py -> test_ECP_file. The tests are named as follows: mapping.bsb_ECP_unscrambled, mapping.costcentre_ECP_unscrambled, mapping.leave_balance_ECP_unscrambled, mapping.pregolive_payroll_results_t558b_unscrambled, mapping.pregolive_payroll_results_t558c_unscrambled, mapping.counter_leave_shift_ECP_unscrambled.

For Data Quality Uplift runs: Ensure that all data quality uplift tests are run at once, rather than individually running DQ rules. This ensures that pytest can collect a list of failing rules when providing a defect summary.

Reporting logs

Pytest will report all tests as Passed, regardless of the amount of defects found by the script - Pytest will only report a test as Failed if a script failure has occurred.

Results from running each test should be logged according to the BASE_PATH and LOG_PATH_PATTERN values found in the environment_config.yaml.

Data Quality Uplift Validations

Individual folders will be created for each data quality rule run, as well as a single defects.csv file containing the results from each rule run. Each folder will contain the following:

  • __summary.json contains information regarding the test execution itself.
  • default.json will contain a technical log with all the internal defects recorded. If there are a lot of defects, there will be multiple default_X.json files.
  • errors.csv contains a list of all internal defects found during validation.
  • errors.xlsx is a more detailed spreadsheet, containing both the defective data found during validation, and information about the rule being tested. This file should be used for logging and reporting defects.
Data Load Validations

Individual folders will be created for each table validated. Each folder will contain the following:

  • __summary.json contains information regarding the test execution itself.
  • test_[table name].json will contain a technical log with all the internal defects recorded. If there are a lot of defects, there will be multiple test_[table name]_X.json files.
  • errors.csv contains a list of all internal defects found during validation.
  • test_[table name]_logs.xlsx is a more detailed spreadsheet, containing the defective data found during testing, with information on both the source & target data compared. This file should be used for logging and reporting defects.

In the test_[table name]_logs.xlsx file, create a filter by selecting all the rows/columns in the spreadsheet, and selecting "Filter" from the Data tab.

A screenshot of the excel toolbar, on the "Data" tab. The "Filter" button has been circled.

When logging Data Load validations, individual reports should be created for each mismatching field found in the "Mismatch in Target" rows. E.G. If a table was found to have mismatches for both the standardHours and seqNumber fields, a seperate ETL bug should be raised encapsulating all standardHours mismatches, and all seqNumber mismatches.

For non-mismatch Error Types, an individual report should be created for each Error Type found, I.E. There should be an ETL bug raised encapsulating all Missing Rows, and all Duplicate Rows.

When reading reports, fields highlighted in bold in the topmost row are the primary key used to match source rows to their target entries.

For fields compared using a custom field format - such as a unique datatype or a picklist - cells can be hovered to show how the validation script has compared them.

A screenshot of the hover functionality for Data Load docs. A field with "/Date(1735689600000)/" has a red triangle in the bottom right corner, and have been hovered to reveal some additional text that states "Item compared as: 2025-01-01". This screenshot is of a table that has been previously scrambled, and should contain no sensitive data.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages