Concepts → Full Load

About a full load

The full load strategy is one of multiple data load strategies available in Incorta. You can load data for a given physical schema or a given object, and you can run a load job on demand or schedule it. During a load job, data can be loaded from the source (full load or incremental load) or from staging (Shared Storage).

You can perform a full load for a physical schema or an object. Physical schema full load jobs can run on demand or on a schedule, while object full load jobs are available only on demand.

For more information about loading data in Incorta, refer to References → Data Ingestion and Loading.

How to start or schedule a full load job

A Super User tenant administrator or a user who belongs to a group with the SuperRole or Schema Manager role can start a load job or create a scheduled job to run one or more unattended load jobs for the same physical schema. As a schema developer, you can start or schedule a full load job from the Schema Designer. You can also create a full load scheduled job from the Schema Manager or the Scheduler.

Note

When a full load job starts, the Loader Service, by default, performs a full load for all physical schema tables and materialized views (MVs). However, the Loader Service skips physical schema tables and MVs that have incremental load enabled and full load disabled. Typically, a schema developer performs a full load of an object at least once before enabling the Disable Full Load property.

Warning

Incorta does not recommend running concurrent schema model update jobs and load jobs on the same schema or dependent schemas as this may result in errors or inaccurate data.

Schema updates that require a full load

Some updates that you make to physical schema objects require loading data fully from the source to ensure data consistency.

The following updates require a full load:

  • Adding a new physical schema table or MV
  • Changing the data type of a physical schema table column or an MV column
  • Changing the source of a physical schema table or MV, whether by selecting another source file in the Data Source properties dialog or editing the query
  • Adding or changing a key column (changing the column function from key to dimension or measure and vice versa) in a physical schema table or MV
  • Adding a new physical schema table column
  • Adding a new MV column
  • Changing the object type, for example, changing a physical schema table to an Incorta Analyzer table or MV
  • Removing a physical schema table column or an MV column that functions as a key
  • Changing the encryption status of one or more columns in a physical schema table or MV

The full load job cycle

During a full load job, the following stages occur:

  • The Loader Service extracts data from the data source for each physical schema table, or for the single specified table in the case of an object load, according to the table data source properties.

  • The Loader Service creates new source parquet files in the source directory, saving them in a subdirectory of a new parquet version directory.
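
    As an illustration only, the extract-and-write steps above are conceptually similar to the following PySpark sketch. The Loader Service performs this work internally; the connection details, table name, and target path below are hypothetical, not actual Incorta paths or APIs.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("full-load-sketch").getOrCreate()

        # Hypothetical source connection and target parquet version directory
        jdbc_url = "jdbc:postgresql://source-host:5432/sales"
        target = "/shared-storage/demo/source/SALES/ORDERS/v2_full"

        # Full load: extract every row of the table as defined in its data source properties
        orders = (spark.read.format("jdbc")
                  .option("url", jdbc_url)
                  .option("dbtable", "ORDERS")
                  .option("user", "loader")
                  .option("password", "****")
                  .load())

        # Save the extracted rows as new source parquet files under the new version directory
        orders.write.mode("overwrite").parquet(target)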

  • When the Table Editor → Enforce Primary Key Constraint property is enabled for an object, primary key index calculations (deduplication) start to mark duplicate records that must be deleted so that only unique data records exist.
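
    Conceptually, and not as a description of the Engine's actual implementation, marking duplicates by a key resembles keeping a single surviving record per key value. The key column and ordering column below are hypothetical:

        from pyspark.sql import SparkSession, Window, functions as F

        spark = SparkSession.builder.getOrCreate()
        orders = spark.read.parquet("/shared-storage/demo/source/SALES/ORDERS/v2_full")

        # Keep one record per ORDER_ID; every other record is marked as a duplicate
        w = Window.partitionBy("ORDER_ID").orderBy(F.col("LOAD_TIME").desc())
        flagged = orders.withColumn("is_duplicate", F.row_number().over(w) > 1)

        deduped = flagged.filter(~F.col("is_duplicate")).drop("is_duplicate")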

  • If the Cluster Management Console (CMC) → Tenant Configurations → Data Loading → Enable Always Compact option is enabled, a compaction job starts to remove duplicate rows and create a compacted version of the object parquet files in the object’s _rewritten directory in the source area. The consumers of compacted parquet files are MVs, SQLi queries on the Spark port, internal and external Notebook services, and the Preview data function.

    Note

    In releases prior to 6.0, a compaction job both rewrote a compacted version of each parquet file that had duplicates and copied the other extracted parquet files. Copied and rewritten parquet files were saved to the compacted directory under the tenant directory. The compacted directory might contain multiple versions of compacted files for the same object. Consumers of compacted parquet files were directed to read data from the latest committed compacted version of the parquet files in the compacted directory.
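
    For illustration only, the compaction step is conceptually similar to writing a duplicate-free copy of the extracted parquet files to the object's _rewritten directory. The paths and key column below are hypothetical, and the real compaction job is managed entirely by the Loader Service:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        extracted = spark.read.parquet("/shared-storage/demo/source/SALES/ORDERS/v2_full")

        # Remove duplicate rows by the key column and write the compacted version
        compacted = extracted.dropDuplicates(["ORDER_ID"])
        compacted.write.mode("overwrite").parquet(
            "/shared-storage/demo/source/SALES/ORDERS/_rewritten/v2_full")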

  • When the Enforce Primary Key Constraint property is disabled for an object, both the deduplication and compaction calculations for this object are skipped. The Enforce Primary Key Constraint option is disabled by default for newly created physical schema tables and MVs with one or more key columns. You can enable it to enforce the calculation of the primary key index to ensure record uniqueness during a full load job or disable it to skip this calculation and optimize data load time and performance. Disable it only if your dataset has unique records.
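
    Before disabling the property, you can check whether the intended key columns already guarantee uniqueness. A minimal sketch, assuming a hypothetical ORDERS table with an ORDER_ID key column:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()
        orders = spark.read.parquet("/shared-storage/demo/source/SALES/ORDERS/v2_full")

        # Count key values that appear more than once; 0 means the key is already unique
        duplicate_keys = (orders.groupBy("ORDER_ID")
                                .count()
                                .filter(F.col("count") > 1)
                                .count())
        print(duplicate_keys)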

    Note

    In releases before 6.0.3, when the Enforce Primary Key Constraint option was disabled for physical tables or MVs and the selected key columns resulted in duplicate key values, unique index calculations would not fail; instead, the first matching value was returned whenever a single value of the key columns was required.

    Starting with release 6.0.3, in such a case, the unique index calculation fails and the load job finishes with errors. You must either select key columns that ensure row uniqueness and perform a full load, or enable the Enforce Primary Key Constraint option and load the tables from staging to have the unique index correctly calculated.

  • At the end of the compaction job, a group of metadata files is generated in Delta Lake file formats to point to all parquet files (whether extracted or rewritten) that constitute a compacted version. Consumers of compacted parquet files use the Delta Lake metadata files to determine which extracted or compacted parquet file versions to read data from.
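
    For example, a Spark-based consumer can rely on the Delta Lake transaction log instead of listing parquet files directly. A minimal sketch, assuming the delta-spark package is available and using a hypothetical object path that contains the Delta Lake metadata:

        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .config("spark.sql.extensions",
                         "io.delta.sql.DeltaSparkSessionExtension")
                 .config("spark.sql.catalog.spark_catalog",
                         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                 .getOrCreate())

        # The Delta Lake metadata resolves which extracted or rewritten parquet files
        # make up the latest committed compacted version of the object
        orders = spark.read.format("delta").load("/shared-storage/demo/source/SALES/ORDERS")
        orders.show(5)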

  • For an MV, the Loader Service passes the query of the MV Script to Spark. Spark reads data from the parquet files of the underlying physical schema objects and creates new parquet files for the MV in a new parquet version directory in the source directory. A compacted version of the MV parquet files is also created in the object’s _rewritten directory if compaction is enabled.

    Note

    Because a materialized view can reference columns from Incorta SQL tables or Incorta Analyzer tables in other physical schemas, Spark reads data from the source parquet files of these Incorta tables, which do not have a compacted version.

    For each of these tables, a _delta_log directory exists in the object directory and includes a group of metadata files that compacted parquet consumers (such as Spark) use to determine which parquet files of each object version to read from.
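
    As a simplified, hypothetical example of the kind of query Spark executes for an MV, the following sketch joins two underlying objects and writes the MV result as new parquet files. The object names and paths are illustrative, and the actual read and write locations are managed by the Loader Service:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Underlying physical schema objects (read from their parquet or compacted files)
        orders    = spark.read.parquet("/shared-storage/demo/source/SALES/ORDERS")
        customers = spark.read.parquet("/shared-storage/demo/source/SALES/CUSTOMERS")

        # The MV query: total order amount per customer
        mv = (orders.join(customers, "CUSTOMER_ID")
                    .groupBy("CUSTOMER_ID", "CUSTOMER_NAME")
                    .agg(F.sum("AMOUNT").alias("TOTAL_AMOUNT")))

        # The result becomes the MV's new parquet version
        mv.write.mode("overwrite").parquet(
            "/shared-storage/demo/source/SALES/CUSTOMER_TOTALS/v1_full")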

  • For Incorta Analyzer tables and Incorta SQL tables, the Loader Service creates full parquet files in the source directory. Starting with 6.0.3, which introduced support for key columns in derived tables, the Loader Service also creates snapshot DDM files for the unique index each time the derived table's key columns are updated or the schema or table is loaded.

  • For physical schema tables and MVs with performance optimization enabled, the Loader Service loads data into the Engine memory. The Engine then calculates any formula columns, key columns, or load filters for each object and creates snapshot DDM files. These files are saved to the schemas directory that exists in the ddm directory.
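
    The Engine performs these calculations internally using Incorta's own formula language. Purely as a conceptual illustration of what a formula column and a load filter compute, a hypothetical PySpark equivalent might look like this:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()
        orders = spark.read.parquet("/shared-storage/demo/source/SALES/ORDERS")

        # Formula column: derive a value from existing columns
        orders = orders.withColumn("NET_AMOUNT", F.col("AMOUNT") - F.col("DISCOUNT"))

        # Load filter: keep only the rows that should be loaded into memory
        orders = orders.filter(F.col("STATUS") != "CANCELLED")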

  • If there is a join relationship in which one of the physical schema objects is the child table, the Engine creates a new version of the join DDM files and saves them to the joins directory that exists in the ddm directory.