In a traditional extract, load, and transform (ELT) or data warehouse solution, you need to ingest data from various sources into a staging area and prepare it before it can be processed further by downstream applications. If data quality is overlooked, data warehouse users end up with inaccurate and incomplete data, and analytical queries run against the “dirty” data set produce inaccurate results.
This data quality framework is based on configurable data quality rules applied to a specific column or set of columns of a Snowflake (staging) table, curating the data set by eliminating bad records.
The objective of this post is to help you create a configuration-driven data quality framework that can be leveraged to cleanse any source system feed with minimal or no code changes, reducing time to market (TTM).
Business benefits:
You can get the following benefits by using this framework:
- Curate any Snowflake table by placing the rules as configurations. This keeps your TTM short, as your developers don't need to build any code. Additionally, this framework can give them a jump start to quickly customize and shorten the build phase.
- Developers only have to apply the data quality rule details to a CONFIG table. This means no code change is involved to cleanse any new data source.
- This framework supports schema evolution. Any change in the table structure of any existing table doesn’t have an impact on the solution framework, eliminating the need for any code change.
- Developers/users don’t need to have expertise in Snowflake to use this framework. Basic SQL knowledge is sufficient.
Solution framework overview
Data quality rules:
A JavaScript stored procedure is created for each data quality rule. When a rule is applied to a column of a source table, the framework inserts into the DQ_RULE_VALIDATION_RESULTS table every record of that source table whose column value fails the rule, along with metadata such as TABLE_NAME, COL_NAME, INVALID_VALUE, DQ_RULE, and ERR_MSG.
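For reference, here is a minimal sketch of what the DQ_RULE_VALIDATION_RESULTS table could look like, with the columns inferred from the insert statement in the RULE_DATE procedure shown later in this post; the exact data types are assumptions.

create table if not exists ANALYTICS.SNOWDQ.DQ_RULE_VALIDATION_RESULTS (
    table_name    varchar,  -- fully qualified source table name
    col_name      varchar,  -- column that failed the rule
    invalid_value varchar,  -- the offending value
    dq_rule       varchar,  -- rule name plus its parameter, e.g. 'RULE_DATE: <format>'
    err_msg       varchar,  -- human-readable error message
    err_rec       variant   -- entire failing record captured via OBJECT_CONSTRUCT(*)
);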
Each data quality rule stored procedure takes the following five parameters (a sample call follows the list):
- DB_NAME: Source DB name
- SCHEMA_NAME: Source schema name
- TABLE_NAME: Source table name
- CONFIG_NAME: A JSON object (passed as a VARIANT) containing the rule parameters, such as the column name and format
- REQD_CLEANSE_RECORD: A flag to denote whether to create a CLEANSED table
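For example, a direct call to one of the rule procedures might look like the following sketch; the column name SALE_DATE and the date format are placeholders, and the format convention ultimately depends on the validate_date Java UDF described later in this post.

call RULE_DATE(
    'ANALYTICS',
    'SNOWDQ',
    'STORE_DQ',
    parse_json('{"COL": "SALE_DATE", "FORMAT": "yyyy-MM-dd"}'),  -- hypothetical column and format
    true  -- also capture the bad records in a temporary table (used for the CLEANSED output)
);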
The following data quality rules have been created (a few illustrative predicates are sketched after the list):
- RULE_DATE: Used to check the date value conforming to the pattern supplied
- RULE_DECIMAL: Used to check a decimal value
- RULE_INTEGER: Used to check an integer value
- RULE_LENGTH: Used to check whether the length of a field is within the supplied limit
- RULE_NOT_NULL: Used to check whether a field contains a NULL value
- RULE_REGEX: Used to check whether a field conforms to the supplied regex pattern
- RULE_SQL_FILTER: Used to check whether a record satisfies a SQL predicate
- RULE_UNIQUE: Used to validate whether a field contains unique values
- RULE_VALID_VALUES: Used to check whether a field contains values specified in the supplied value array
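To make the rules concrete, here are a few illustrative predicates for selecting the records that fail a rule, mirroring the pattern used by the RULE_DATE procedure below. The table and column names are placeholders, and the SQL each rule procedure actually generates may differ.

-- RULE_NOT_NULL: records where the column is NULL
select * from STORE_DQ where STORE_ID is null;

-- RULE_INTEGER: records where the value cannot be cast to an integer
select * from STORE_DQ where try_cast(STORE_ID as integer) is null and STORE_ID is not null;

-- RULE_REGEX: records where the value does not match the supplied pattern
select * from STORE_DQ where not regexp_like(ZIP_CODE, '^[0-9]{5}$');

-- RULE_VALID_VALUES: records outside the supplied value list
select * from STORE_DQ where STORE_TYPE not in ('RETAIL', 'OUTLET');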
Sample data quality rule stored procedure:
Database: ANALYTICS
Schema: SNOWDQ
Table: STORE_DQ
Rule Name: RULE_DATE
Description: Validates a date/datetime field as per the supplied format
CREATE OR REPLACE PROCEDURE RULE_DATE (
    DB_NAME VARCHAR,
    SCHEMA_NAME VARCHAR,
    TABLE_NAME VARCHAR,
    CONFIG_NAME VARIANT,
    REQD_CLEANSE_RECORD BOOLEAN
)
RETURNS VARIANT NOT NULL
LANGUAGE JAVASCRIPT
AS
$$
var error_msg = [];

// validate_date is a Java UDF used to check whether a date is valid.
var qry = `
    insert into analytics.snowdq.DQ_RULE_VALIDATION_RESULTS
        (table_name, col_name, invalid_value, DQ_RULE, err_msg, err_rec)
    SELECT
        CONCAT('${DB_NAME}', '.', '${SCHEMA_NAME}', '.', '${TABLE_NAME}'),
        '${CONFIG_NAME["COL"]}',
        ${CONFIG_NAME["COL"]},
        concat('RULE_DATE: ', '${CONFIG_NAME["FORMAT"]}'),
        '${CONFIG_NAME["COL"]} HAS INVALID DATE' AS ERR_MSG,
        object_construct(*)
    FROM "${DB_NAME}"."${SCHEMA_NAME}"."${TABLE_NAME}"
    WHERE validate_date(${CONFIG_NAME["COL"]}, '${CONFIG_NAME["FORMAT"]}') = false
      and ${CONFIG_NAME["COL"]} IS NOT NULL;`;

try {
    var rs = snowflake.execute({ sqlText: qry });

    if (REQD_CLEANSE_RECORD) {
        // Create a temporary table to store the BAD records
        var qry = `
            create temporary table if not exists ${DB_NAME}.${SCHEMA_NAME}.${TABLE_NAME}_TEMP
            like ${DB_NAME}.${SCHEMA_NAME}.${TABLE_NAME};`;
        var rs_temp = snowflake.execute({ sqlText: qry });

        var qry = `
            insert into ${DB_NAME}.${SCHEMA_NAME}.${TABLE_NAME}_TEMP
            select * from ${DB_NAME}.${SCHEMA_NAME}.${TABLE_NAME}
            WHERE validate_date(${CONFIG_NAME["COL"]}, '${CONFIG_NAME["FORMAT"]}') = false
              and ${CONFIG_NAME["COL"]} IS NOT NULL;`;
        var rs_ins = snowflake.execute({ sqlText: qry });
    }

    return "RULE_DATE VALIDATION COMPLETED Successfully";
} catch (err) {
    error_msg.push(`{
        sql statement : '${qry}',
        error_code : '${err.code}',
        error_state : '${err.state}',
        error_message : '${err.message}',
        stack_trace : '${err.stackTraceTxt}'
    }`);
    return error_msg;
}
$$;
Code snippet for the validate_date Java UDF, whose JAR is stored in a Snowflake internal stage:
create or replace function validate_date(datestring varchar, format varchar)
returns boolean
language JAVA
IMPORTS = ('@dq_udf_jar/dq_udf_utility.jar')
handler = 'IsDate.apply';
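A quick sanity check of the UDF might look like this; the format-pattern convention is an assumption (Java SimpleDateFormat style, since the handler is a Java class), so adjust it to match the IsDate implementation.

select validate_date('2022-08-29', 'yyyy-MM-dd');  -- expected: true
select validate_date('2022-02-31', 'yyyy-MM-dd');  -- expected: false (invalid calendar date)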
A master stored procedure named DATA_QUALITY_RULE_VALIDATION, as mentioned previously in the solution framework overview diagram, is created to call the individual data quality rule stored procedures for a given source table, based on the entries in the DQ_RULE_CONFIG table (a snapshot of the table data is given below) whose APPLY_RULE flag is set to TRUE. It takes the following arguments:
- DB_NAME: Source DB name
- SCHEMA_NAME: Source schema name
- TABLE_NAME: Source table name
- REQD_CLEANSE_RECORD: A flag to denote whether to create a CLEANSED table
If REQD_CLEANSE_RECORD is set to TRUE, the procedure dynamically creates a CLEANSED table with the exact same structure as the source table, containing only the clean records.
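As a rough sketch of how such a CLEANSED table could be derived (an assumption, not the framework's actual code), the bad records collected in the per-rule temporary table could simply be subtracted from the source table:

create or replace table ANALYTICS.SNOWDQ.STORE_DQ_CLEANSED as
select * from ANALYTICS.SNOWDQ.STORE_DQ
minus
select * from ANALYTICS.SNOWDQ.STORE_DQ_TEMP;  -- temporary table of bad records built by the rule procedures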
A sample call to this master stored procedure would look like this:
call DATA_QUALITY_RULE_VALIDATION('ANALYTICS', 'SNOWDQ', 'STORE_DQ', true);
This stored procedure calls all the data quality rule stored procedures configured in the DQ_RULE_CONFIG table for the source table undergoing validation. A snapshot of the DQ_RULE_CONFIG table data is given below.
Data quality framework features:
- DQ_RULE_CONFIG table will hold the rule mapping for a table, including rule name, rule parameter, and apply rule flag (a sketch of this table follows the list).
- Adding or removing rules on a data set doesn’t require any code changes. Only CONFIG table entries are required to be updated.
- If any data quality rule for a table is to be skipped, only the APPLY_RULE flag should be set to FALSE for that entry.
- A master stored procedure is created to call the data quality rule procedures based upon the entries made in the DQ_RULE_CONFIG table for a table to be validated.
- All the validated records can optionally be loaded into a CLEANSED table for downstream processing; the REQD_CLEANSE_RECORD input parameter of the master procedure controls this.
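As an illustration, here is a minimal sketch of the DQ_RULE_CONFIG table and a couple of rule mappings for the STORE_DQ table; the column names, types, and sample parameters are assumptions and may differ from the actual framework.

create table if not exists ANALYTICS.SNOWDQ.DQ_RULE_CONFIG (
    database_name  varchar,
    schema_name    varchar,
    table_name     varchar,
    rule_name      varchar,   -- e.g. RULE_DATE, RULE_NOT_NULL
    rule_parameter variant,   -- JSON passed to the rule procedure as CONFIG_NAME
    apply_rule     boolean    -- set to FALSE to skip the rule without deleting the entry
);

-- Hypothetical rule mappings (INSERT ... SELECT is used because PARSE_JSON
-- cannot appear in a VALUES clause).
insert into ANALYTICS.SNOWDQ.DQ_RULE_CONFIG
select 'ANALYTICS', 'SNOWDQ', 'STORE_DQ', 'RULE_DATE',
       parse_json('{"COL": "FIRST_OPENED_DATE", "FORMAT": "yyyy-MM-dd"}'), true
union all
select 'ANALYTICS', 'SNOWDQ', 'STORE_DQ', 'RULE_NOT_NULL',
       parse_json('{"COL": "STORE_ID"}'), true;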
Snapshot of DQ_RULE_CONFIG table
All the DQ rules applied to the STORE_DQ table are shown below:
Snapshot of DQ_RULE_VALIDATION_RESULTS table
Below are some of the records of the STORE_DQ table that didn’t satisfy the DQ rule checks:
Below is a sample bad record that didn’t satisfy the RULE_DATE DQ rule, in record-view mode:
Conclusion
This data quality framework can be extended to include more complex cleansing rules to fit your system’s requirements. The DQ_RULE_VALIDATION_RESULTS table can be used to create dashboards in Snowsight or in any other BI tool to capture a bad-record summary at the table level, DQ rule level, or record level.
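For example, a simple rule-level summary for such a dashboard could be built with a query like the following sketch (column names follow the insert statement shown earlier in this post):

select table_name,
       dq_rule,
       count(*) as bad_record_count
from ANALYTICS.SNOWDQ.DQ_RULE_VALIDATION_RESULTS
group by table_name, dq_rule
order by bad_record_count desc;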
The framework can be scheduled with a Snowflake task for a full-fledged Snowflake solution, or it can be integrated with ETL/ELT tools such as Talend, Informatica, or dbt. With the growing need for accurate and clean data, this framework should help curate data after it is acquired into the data platform.
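A minimal sketch of such a task follows, assuming a warehouse named DQ_WH exists; the schedule and object names are illustrative.

create or replace task ANALYTICS.SNOWDQ.STORE_DQ_VALIDATION_TASK
    warehouse = DQ_WH
    schedule = 'USING CRON 0 2 * * * UTC'  -- run daily at 02:00 UTC
as
    call DATA_QUALITY_RULE_VALIDATION('ANALYTICS', 'SNOWDQ', 'STORE_DQ', true);

-- Tasks are created in a suspended state and must be resumed to start running.
alter task ANALYTICS.SNOWDQ.STORE_DQ_VALIDATION_TASK resume;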