Data Quality Recipe

Data quality is paramount for ensuring the reliability, integrity, and utility of data. High-quality data underpins accurate analysis and informed decision-making. Organizations can reduce risks associated with errors, inefficiencies, and regulatory non-compliance by ensuring data is accurate, complete, consistent, and timely.

Configuration

Configuration | Description
Recipe Name | A freeform, user-assigned name for the recipe.
Input | A previously constructed recipe. The input recipe itself is not processed; instead, the recipe processes the table configured below.
Schema | The schema that contains the table to perform data quality checks against.
Base Table | The table to perform data quality checks against.
Add Rules | Each rule consists of the following:
  ●  Description: a user-assigned value that declares the intent of the data quality rule.
  ●  Category: a high-level categorization of what the rule is intended to address.
  ●  Column: the column evaluated by the rule; this column is considered in violation when the rule fails.
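To make the rule structure concrete, here is a minimal sketch of how a rule's parts might fit together. The field values and the predicate-style expression are invented for illustration (the document later confirms rules are written as Spark SQL expressions); the tiny evaluator is a hypothetical stand-in for the real rule engine, not its actual implementation.

```python
# Hypothetical rule declaration mirroring the Add Rules configuration.
# All values here are invented examples.
rule = {
    "description": "Order amount must be positive",
    "category": "Validity",
    "column": "order_amount",
    "expression": "order_amount > 0",  # rows failing this are violations
}

def is_violation(row: dict, rule: dict) -> bool:
    """Naive stand-in for the rule engine: a row violates the rule
    when the rule's predicate evaluates to False for that row."""
    return not eval(rule["expression"], {}, row)

print(is_violation({"order_amount": -5}, rule))   # True: violation
print(is_violation({"order_amount": 120}, rule))  # False: passes
```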

Category definitions

Each data quality rule must have an accompanying category. While this is required, the selected category does not change the function of the rule itself. Instead, it helps identify what the rule is trying to achieve without requiring a user to read the rule syntax.

Completeness - the degree to which the necessary data is available for use.
Uniqueness - the degree to which data records are unique and not duplicates.
Timeliness - the degree to which data is up to date and available when it is needed.
Validity - the degree of records’ conformance to format, type, and range.
Accuracy - the degree to which data values align with real values.
Consistency - the degree to which data is consistent across different sources.
Relevance - the degree to which the dataset’s level of detail aligns with its intended purpose.
Conformity - the degree to which data follows standard data definitions such as data type, size, and format.
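The categories above map naturally onto recognizable kinds of checks. The snippet below pairs each category with a hypothetical Spark SQL expression of the sort a rule in that category might use; every column name and expression is invented for illustration, not taken from the product.

```python
# Illustrative (hypothetical) Spark SQL expressions showing the kind of
# check each category typically describes; column names are invented.
examples = {
    "Completeness": "customer_id IS NOT NULL",
    "Uniqueness":   "COUNT(*) OVER (PARTITION BY order_id) = 1",
    "Timeliness":   "updated_at >= date_sub(current_date(), 7)",
    "Validity":     "email RLIKE '^[^@]+@[^@]+$'",
    "Accuracy":     "abs(total - quantity * unit_price) < 0.01",
    "Consistency":  "ship_date >= order_date",
    "Conformity":   "length(country_code) = 2",
}

for category, expr in examples.items():
    print(f"{category}: {expr}")
```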

Output

In the result set, a new column named Incorta_DQ_Violations is prepended to the dataset. This column shows the number of rule violations for each row. Additionally, hovering over the row value displays the data quality rules in violation. The information in the hover-over includes:

  • A unique identifier
  • The description
  • The category

The unique identifier can be used as a list item in the remainder of the workflow. For example, to filter all records that contain a certain violation, copy the unique identifier from the Incorta_DQ_Violations hover-over and paste it as the operand in a filter tool.
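The filtering step can be sketched as follows. This assumes, hypothetically, that each row's Incorta_DQ_Violations value lists the identifiers of the violated rules; the identifier format and delimiter are invented, and the real column may encode violations differently.

```python
# Sketch of filtering rows by a violation identifier copied from the
# Incorta_DQ_Violations hover-over. Data and ID format are invented.
rows = [
    {"Incorta_DQ_Violations": "DQ_001;DQ_007", "order_id": 1},
    {"Incorta_DQ_Violations": "",              "order_id": 2},
    {"Incorta_DQ_Violations": "DQ_007",        "order_id": 3},
]

target_rule = "DQ_001"  # the unique identifier to filter on
violating = [
    r for r in rows
    if target_rule in r["Incorta_DQ_Violations"].split(";")
]
print([r["order_id"] for r in violating])  # [1]
```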

Data quality rule storage

When exporting the MV to a schema, a CSV file is built and saved in Data/Rules/{workflowname}. This file contains the following attributes from the data flow:

Column Name | Description
Rule_ID | The automatically assigned unique identifier for the data quality rule.
Schema_Name | The base schema in which the data quality rules were applied.
Table_Name | The base table in which the data quality rules were applied.
Rule_SQL_Expression | The written Spark SQL expression for the data quality rule.
Rule_Description | The written description of the purpose of the data quality rule.
Rule_Category | The assigned category to which the data quality rule belongs.
Rule_Owner | The name of the user who created the data quality rule.
Is_Active_Flag | A true/false value indicating whether the data quality rule is active in the MV.
Output_Table | The schema and table where the data quality rule is deployed.
Timestamp | The time at which the data quality rule was deployed.
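A minimal sketch of producing and reading back a rules CSV with the columns listed above. The column names follow the table; every value is an invented example, and the real export may differ in ordering, quoting, or timestamp format.

```python
import csv
import io

# Column names from the rule-storage table above; row values are invented.
fieldnames = [
    "Rule_ID", "Schema_Name", "Table_Name", "Rule_SQL_Expression",
    "Rule_Description", "Rule_Category", "Rule_Owner",
    "Is_Active_Flag", "Output_Table", "Timestamp",
]
row = {
    "Rule_ID": "DQ_001",
    "Schema_Name": "SALES",
    "Table_Name": "ORDERS",
    "Rule_SQL_Expression": "order_amount > 0",
    "Rule_Description": "Order amount must be positive",
    "Rule_Category": "Validity",
    "Rule_Owner": "jdoe",
    "Is_Active_Flag": "true",
    "Output_Table": "SALES.ORDERS_DQ",
    "Timestamp": "2024-01-01T00:00:00Z",
}

# Write the CSV to an in-memory buffer, then parse it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)

parsed = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(parsed[0]["Rule_ID"])  # DQ_001
```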