HRMIS Data Validation

Data is stored securely in \\wsrm020inf\DataVal\

Setup

Python Environment & `requirements.txt`

Before any script execution, ensure that you have setup a python environment with the requirements installed from requirements.txt on all relevant machines.

`CURRENT.env`

The current ETL is set by modifying ETL_NAME in CURRENT.env, the environment by modifying ENVIRONMENT_NAME.

HTTP_PROXY and HTTPS_PROXY should both be configured to the qrm-services proxy, unless running the script from a machine based at GPO.

`environment_config.yml`

This determines both the extract and test details pertaining to each ETL & Environment.

For live ETLs, ensure COMMON: BASEPATH is set to \\wsrm020inf\dataval\, and the LOG_PATH_PATTERN and EXTRACT_PATH_PATTERN include the following values:

{ETL_NAME}: The current ETL_NAME determined by CURRENT.env
{ENVIRONMENT}: The current ENVIRONMENT_NAME determined by CURRENT.env
{VALIDATION_TYPE}: The current validation type, either DATA_LOAD or DATA_QUALITY_UPLIFT
{SYSTEM}: The system being extracted/tested, e.g. DataRemedy, EC, ECP, UKG

Under ETL: (ETL name): (Environment name), the following values should be set:

WRITE_LOGS: Whether logs should be written to the log path. Should always be True
UKG_BASE_PATH: Base url for UKG extract.
SECRETS: .env file used for credentials and secrets
LOAD_EC_CONFIG_MAP: Configuration file used for matching files during EC/ECP Data Load validations.
TRANSFORM_EC_CONFIG_MAP: Same as LOAD_EC_CONFIG_MAP, but for transformation validation. Unused.
DATAREMEDY_EXTRACT_DB: Database used for DataRemedy extracts.
EC_API_URL: Base URL for the EC API.
GRANULAR_EXTRACT_FORCE_LIST: Used during extraction to prevent extracting all HRMIS data at once.
DATA_QUALITY_UPLIFT_OUTPUT: Local URL for outputting excel results for Data Quality Uplift validations.
DATA_QUALITY_UPLIFT_SHAREPOINT: Sharepoint URL for DQU validation output, used to generate sharepoint links.
LATTICE_EXTRACT_DATE: Extract date for the given ETL, used during some Data Quality validations as a proxy "Current Date"

Secrets

Secrets file should include credentials for DataRemedy, EC, ECP, and UKG.

Test Execution

Extracting from DataRemedy & Target systems

Extract Privileges

Extraction should only be executed from the HRMIS NUK, as this is the only QRM machine with access to all target systems.

Initialising repository for extraction

At the beginning of a new ETL, before extracting any tables from any system, the granular extract force list should be created. Ensure the GRANULAR_EXTRACT_FORCE_LIST file for the current etl has been created under the extract/targeted_extract/ path, with the following text:

clear: []
ignore: []
queue: []

This file is used for tracking which files have been recognised in DataRemedy, which files have been extracted, and which files are queued for extraction. This can be populated with python extract/extract.py -p. This should only be done ONCE per ETL, as this will overwrite the entire granular extract file.

Extracting data from DataRemedy

When extracting from DataRemedy, refer to the GRANULAR_EXTRACT_FORCE_LIST file. Cut-and-paste the relevant files from the ignore list, and add them to the queue. Make sure all files related to the required table have been added to the queue (E.G. if the extract calls for basic_user_info_01_unscrambled, ensure BOTH mapping.basic_user_info_01_scrambled AND mapping.basic_user_info_01_unscrambled are queued for extraction).

Once the relevant files have been queued up, the extract can be run with python extract/extract.py -f. IMPORTANT: Do not omit the -f instruction when extracting - omitting -f will cause the extract script to ignore the granular force list, and attempt to extract everything.

Extracting data from target systems

When extracting from DataRemedy mapping files, the corresponding EC and/or ECP files will automatically be extracted according to the pointers_ec.yaml and pointers_ecp.yaml files respectively.

Validation

Running the validations

Once the relevant data has been extracted, validation can begin. VSCode should be able to find the relevant tests with pytest:

For Data Quality Uplift runs: Ensure that all data quality uplift tests are run at once, rather than individually running DQ rules. This ensures that pytest can collect a list of failing rules when providing a defect summary.

Reporting logs

Pytest will report all tests as Passed, regardless of the amount of defects found by the script - Pytest will only report a test as Failed if a script failure has occurred.

Results from running each test should be logged according to the BASE_PATH and LOG_PATH_PATTERN values found in the environment_config.yaml.

Data Quality Uplift Validations

Individual folders will be created for each data quality rule run, as well as a single defects.csv file containing the results from each rule run. Each folder will contain the following:

__summary.json contains information regarding the test execution itself.
default.json will contain a technical log with all the internal defects recorded. If there are a lot of defects, there will be multiple default_X.json files.
errors.csv contains a list of all internal defects found during validation.
errors.xlsx is a more detailed spreadsheet, containing both the defective data found during validation, and information about the rule being tested. This file should be used for logging and reporting defects.

Data Load Validations

Individual folders will be created for each table validated. Each folder will contain the following:

__summary.json contains information regarding the test execution itself.
test_[table name].json will contain a technical log with all the internal defects recorded. If there are a lot of defects, there will be multiple test_[table name]_X.json files.
errors.csv contains a list of all internal defects found during validation.
test_[table name]_logs.xlsx is a more detailed spreadsheet, containing the defective data found during testing, with information on both the source & target data compared. This file should be used for logging and reporting defects.

In the test_[table name]_logs.xlsx file, create a filter by selecting all the rows/columns in the spreadsheet, and selecting "Filter" from the Data tab.

When logging Data Load validations, individual reports should be created for each mismatching field found in the "Mismatch in Target" rows. E.G. If a table was found to have mismatches for both the standardHours and seqNumber fields, a seperate ETL bug should be raised encapsulating all standardHours mismatches, and all seqNumber mismatches.

For non-mismatch Error Types, an individual report should be created for each Error Type found, I.E. There should be an ETL bug raised encapsulating all Missing Rows, and all Duplicate Rows.

When reading reports, fields highlighted in bold in the topmost row are the primary key used to match source rows to their target entries.

For fields compared using a custom field format - such as a unique datatype or a picklist - cells can be hovered to show how the validation script has compared them.

Name		Name	Last commit message	Last commit date
Latest commit History 570 Commits
.vscode		.vscode
docs		docs
extract		extract
reporting		reporting
validation		validation
.env.template		.env.template
.gitignore		.gitignore
CURRENT.env		CURRENT.env
README.md		README.md
dev_rfc.log		dev_rfc.log
environment_config.yml		environment_config.yml
pytest.ini		pytest.ini
pytest_profiler.py		pytest_profiler.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HRMIS Data Validation

Setup

Python Environment & `requirements.txt`

`CURRENT.env`

`environment_config.yml`

Secrets

Test Execution

Extracting from DataRemedy & Target systems

Extract Privileges

Initialising repository for extraction

Extracting data from DataRemedy

Extracting data from target systems

Validation

Running the validations

Reporting logs

Data Quality Uplift Validations

Data Load Validations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

HRMIS Data Validation

Setup

Python Environment & requirements.txt

CURRENT.env

environment_config.yml

Secrets

Test Execution

Extracting from DataRemedy & Target systems

Extract Privileges

Initialising repository for extraction

Extracting data from DataRemedy

Extracting data from target systems

Validation

Running the validations

Reporting logs

Data Quality Uplift Validations

Data Load Validations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Python Environment & `requirements.txt`

`CURRENT.env`

`environment_config.yml`

Packages