About the Apache Parquet Merge Tool

Over time, an Incorta table configured for incremental loads can generate hundreds of small Apache Parquet files. The Apache Parquet Merge Tool merges multiple Parquet table increment files into a single table increment file that contains the merged segments. Because the merged files are typically about 1 GB in size, reading Parquet files from Shared Storage becomes faster.

The Apache Parquet Merge tool runs as part of the Tenant Management tool (TMT).

There are three options for running the Parquet Merge Tool:

  • For a specific Tenant, merge the table increments of all Schemas
  • For a specific Tenant, merge all table increments for one or more Schemas
  • For a specific Tenant and Schema, merge one or more table increments
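The three scopes above can be sketched as three command lines. This is a minimal sketch, not a confirmed recipe: it assumes that omitting -sn and -tn widens the scope to every schema in the tenant, which the optional status of those parameters suggests. The cluster and tenant names (localCluster, demo) come from the example on this page, and the schema and table arguments are hypothetical placeholders. The functions only print the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch only: prints (does not run) a tmt.sh command line for each merge scope.
# CLUSTER and TENANT reuse the example values from this page; adjust as needed.
CLUSTER="localCluster"
TENANT="demo"

# Scope 1: all schemas of the tenant (assumes no -sn / -tn means "everything").
all_schemas_cmd() {
  printf './tmt.sh -clnm %s -mp %s\n' "$CLUSTER" "$TENANT"
}

# Scope 2: all tables of the given schema. $1 is a hypothetical schema name.
schema_cmd() {
  printf './tmt.sh -clnm %s -mp %s -sn %s\n' "$CLUSTER" "$TENANT" "$1"
}

# Scope 3: one table of one schema. $1 = schema name, $2 = table name.
table_cmd() {
  printf './tmt.sh -clnm %s -mp %s -sn %s -tn %s\n' "$CLUSTER" "$TENANT" "$1" "$2"
}
```

For example, calling table_cmd with a schema named Sales and a table named Orders prints a command that ends in -sn Sales -tn Orders. How several schema or table names are separated in a single -sn or -tn value is not stated here, so the sketch passes one name at a time.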

To start using the Parquet Merge Tool, follow these steps:

The tool creates a new version for each table. After merging the files of a table, the tool updates the table record in the metadata database with the new uncommitted version.

Important

After running the Parquet Merge Tool, you must load the respective schemas from staging.

Apache Parquet Merge Tool Input Parameters

The interactive shell script has the following parameters:

Parameter   Description
-clnm       Mandatory. Specify the cluster name.
-mp         Mandatory. Specify the Tenant name.
-sn         Optional. Specify the Schema name(s).
-tn         Optional. Specify the Table name(s).
-mi         Optional. Specify the minimum table increments count for the merge. The default is 100. If a table has fewer increments than this minimum, the tool does not merge them.
-mt         Optional. Specify the merge type. The available values are append and readWrite. The default is append.
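The defaults and allowed values in the table can be checked before the tool is called. The following is a minimal sketch of such a pre-flight check, not part of the tool itself: the function names are hypothetical, and the rule that -mi must consist of digits only is an assumption, since the page only says it is a count. The script prints the argument string rather than running tmt.sh.

```shell
#!/bin/sh
# Sketch only: validates values against the parameter table and prints the arguments.

# Accept exactly the two merge types listed in the table (case-sensitive).
is_valid_merge_type() {
  case "$1" in
    append|readWrite) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumption: the minimum increments count is a plain non-negative integer.
is_valid_min_increments() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# $1 = cluster (mandatory), $2 = tenant (mandatory),
# $3 = minimum increments (default 100), $4 = merge type (default append).
build_merge_args() {
  cluster="$1"; tenant="$2"; min="${3:-100}"; mtype="${4:-append}"
  if [ -z "$cluster" ] || [ -z "$tenant" ]; then
    echo "cluster name (-clnm) and tenant name (-mp) are mandatory" >&2
    return 1
  fi
  is_valid_min_increments "$min" || { echo "invalid -mi value: $min" >&2; return 1; }
  is_valid_merge_type "$mtype" || { echo "invalid -mt value: $mtype" >&2; return 1; }
  printf '%s %s %s %s %s %s %s %s\n' \
    -clnm "$cluster" -mp "$tenant" -mi "$min" -mt "$mtype"
}
```

Called with only a cluster and a tenant, build_merge_args fills in the documented defaults (100 and append). An unknown merge type or a non-numeric count makes it return a non-zero status.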

To migrate and merge all Parquet files of a specific table under a schema for a specific Tenant, run the following command, specifying the minimum number of increments that the table must have:

./tmt.sh -clnm localCluster -mp demo -sn <SchemaName> -tn <TableName> -mi <MinIncrementsNumber> -mt readWrite

View the Apache Parquet Merge Tool Log Files

The Parquet Merge Tool generates a log file, whose name starts with merge-tool.log, in the following CMC installation directory:

/home/incorta/IncortaAnalytics/cmc/tenant

The file contains the Source Segment, Target Segment, and Source Offset. Use this log file to determine the results of the merge activities.
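Because the log file name only starts with merge-tool.log, more than one matching file may exist in the directory. The following is a minimal sketch for locating the most recently modified one. The function name is hypothetical, and the directory is the one shown above, which may differ on other installations.

```shell
#!/bin/sh
# Sketch only: print the path of the newest file whose name starts with
# merge-tool.log in the given directory (nothing is printed if none exists).
# Note: parsing ls output assumes file names without newlines.
latest_merge_log() {
  ls -t "$1"/merge-tool.log* 2>/dev/null | head -n 1
}

# Default location taken from this page; override LOG_DIR for other installs.
LOG_DIR="${LOG_DIR:-/home/incorta/IncortaAnalytics/cmc/tenant}"
```

Running latest_merge_log "$LOG_DIR" prints the path of the newest log, which can then be opened with a pager such as less to review the Source Segment, Target Segment, and Source Offset entries.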

You can also find the .csv report file under cmc\tmt\work.