Data Quality Recipe

Data quality is paramount for ensuring the reliability, integrity, and utility of data. High-quality data underpins accurate analysis and informed decision-making. Organizations can reduce risks associated with errors, inefficiencies, and regulatory non-compliance by ensuring data is accurate, complete, consistent, and timely.

Configuration

Configuration | Description
Recipe Name | A freeform, user-assigned name for the recipe.
Input | A previously constructed recipe. The input recipe itself is not processed; instead, the recipe processes the table configured below.
Schema | The schema that contains the table to perform data quality checks against.
Base Table | The table to perform data quality checks against.
Add Rules | Each rule consists of the following:
  ●  Description: a user-assigned value that declares the intent of the data quality rule.
  ●  Category: a high-level categorization of what the rule is intended to address.
  ●  Column: the column evaluated by the rule; this column is considered in violation when the rule fails.
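To make the rule structure concrete, here is a minimal sketch of how a rule's parts might fit together. The field values and the predicate-style expression are invented for illustration (the document later confirms rules are written as Spark SQL expressions); the tiny evaluator is a hypothetical stand-in for the real rule engine, not its actual implementation.

```python
# Hypothetical rule declaration mirroring the Add Rules configuration.
# All values here are invented examples.
rule = {
    "description": "Order amount must be positive",
    "category": "Validity",
    "column": "order_amount",
    "expression": "order_amount > 0",  # rows failing this are violations
}

def is_violation(row: dict, rule: dict) -> bool:
    """Naive stand-in for the rule engine: a row violates the rule
    when the rule's predicate evaluates to False for that row."""
    return not eval(rule["expression"], {}, row)

print(is_violation({"order_amount": -5}, rule))   # True: violation
print(is_violation({"order_amount": 120}, rule))  # False: passes
```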

Category definitions

Each data quality rule must have an accompanying category. While this is required, the selected category does not change the function of the rule itself. Instead, it helps identify what the rule is trying to achieve without requiring a user to read the rule syntax.

Completeness - the degree to which the necessary data is available for use.
Uniqueness - the degree to which data records are unique and not duplicates.
Timeliness - the degree to which data is up to date and available when it is needed.
Validity - the degree of records’ conformance to format, type, and range.
Accuracy - the degree to which data values align with real values.
Consistency - the degree to which data is consistent across different sources.
Relevance - the degree to which the dataset’s level of detail aligns with its intended purpose.
Conformity - the degree to which data follows standard data definitions such as data type, size, and format.
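The categories above map naturally onto recognizable kinds of checks. The snippet below pairs each category with a hypothetical Spark SQL expression of the sort a rule in that category might use; every column name and expression is invented for illustration, not taken from the product.

```python
# Illustrative (hypothetical) Spark SQL expressions showing the kind of
# check each category typically describes; column names are invented.
examples = {
    "Completeness": "customer_id IS NOT NULL",
    "Uniqueness":   "COUNT(*) OVER (PARTITION BY order_id) = 1",
    "Timeliness":   "updated_at >= date_sub(current_date(), 7)",
    "Validity":     "email RLIKE '^[^@]+@[^@]+$'",
    "Accuracy":     "abs(total - quantity * unit_price) < 0.01",
    "Consistency":  "ship_date >= order_date",
    "Conformity":   "length(country_code) = 2",
}

for category, expr in examples.items():
    print(f"{category}: {expr}")
```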

Output

In the result set, a new column named Incorta_DQ_Violations is prepended to the dataset. This column shows the number of rule violations for each row. Additionally, hovering over the row value displays the data quality rules in violation. The information in the hover-over includes:

  • A unique identifier
  • The description
  • The category

The unique identifier can be used as a list item in the remainder of the workflow. For example, to filter all records that contain a certain violation, copy the unique identifier from the Incorta_DQ_Violations hover-over and paste it as the operand in a filter tool.
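The filtering step can be sketched as follows. This assumes, hypothetically, that each row's Incorta_DQ_Violations value lists the identifiers of the violated rules; the identifier format and delimiter are invented, and the real column may encode violations differently.

```python
# Sketch of filtering rows by a violation identifier copied from the
# Incorta_DQ_Violations hover-over. Data and ID format are invented.
rows = [
    {"Incorta_DQ_Violations": "DQ_001;DQ_007", "order_id": 1},
    {"Incorta_DQ_Violations": "",              "order_id": 2},
    {"Incorta_DQ_Violations": "DQ_007",        "order_id": 3},
]

target_rule = "DQ_001"  # the unique identifier to filter on
violating = [
    r for r in rows
    if target_rule in r["Incorta_DQ_Violations"].split(";")
]
print([r["order_id"] for r in violating])  # [1]
```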

Data quality rule storage

When exporting the MV to a schema, a CSV file is built and saved in Data/Rules/{workflowname}. This file contains the following attributes from the data flow:

Column Name | Description
Rule_ID | The automatically assigned unique identifier for the data quality rule.
Schema_Name | The base schema in which the data quality rules were applied.
Table_Name | The base table in which the data quality rules were applied.
Rule_SQL_Expression | The written Spark SQL expression for the data quality rule.
Rule_Description | The written description of the purpose of the data quality rule.
Rule_Category | The assigned category to which the data quality rule belongs.
Rule_Owner | The name of the user who created the data quality rule.
Is_Active_Flag | A true/false value indicating whether the data quality rule is active in the MV.
Output_Table | The schema and table where the data quality rule is deployed.
Timestamp | The time at which the data quality rule was deployed.
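A minimal sketch of producing and reading back a rules CSV with the columns listed above. The column names follow the table; every value is an invented example, and the real export may differ in ordering, quoting, or timestamp format.

```python
import csv
import io

# Column names from the rule-storage table above; row values are invented.
fieldnames = [
    "Rule_ID", "Schema_Name", "Table_Name", "Rule_SQL_Expression",
    "Rule_Description", "Rule_Category", "Rule_Owner",
    "Is_Active_Flag", "Output_Table", "Timestamp",
]
row = {
    "Rule_ID": "DQ_001",
    "Schema_Name": "SALES",
    "Table_Name": "ORDERS",
    "Rule_SQL_Expression": "order_amount > 0",
    "Rule_Description": "Order amount must be positive",
    "Rule_Category": "Validity",
    "Rule_Owner": "jdoe",
    "Is_Active_Flag": "true",
    "Output_Table": "SALES.ORDERS_DQ",
    "Timestamp": "2024-01-01T00:00:00Z",
}

# Write the CSV to an in-memory buffer, then parse it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)

parsed = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(parsed[0]["Rule_ID"])  # DQ_001
```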