
About the Apache Parquet Merge Tool

Over time, an Incorta table configured for incremental loads can generate hundreds of small Apache Parquet files. The Apache Parquet Merge tool merges multiple Parquet table increment files into a single table increment file that contains the merged segments. Because the merged files are typically about 1 GB in size, reading Parquet files from Shared Storage becomes faster.

In releases prior to 5.1, the Apache Parquet Merge tool was a standalone, interactive command-line tool. Starting with the Incorta 5.1 release, the Apache Parquet Merge tool runs as part of the TMT.

There are three options for running the Parquet Merge Tool:

  • For a specific Tenant, merge all schema table increments
  • For a specific Tenant, merge all table increments for one or more Schemas
  • For a specific Tenant and Schema, merge one or more table increments
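The three scopes above appear to differ only in whether the -sn and -tn parameters are supplied. The following is a minimal dry-run sketch that prints, but does not run, a command line for each scope. The helper build_merge_cmd is hypothetical and not part of the tool, and the schema and table names SALES and ORDERS are placeholders. How to pass more than one schema or table name is not covered here.

```shell
# Hypothetical helper: print (do not run) a tmt.sh command line for a given scope.
# Usage: build_merge_cmd <cluster> <tenant> [schema] [table]
build_merge_cmd() {
  cluster="$1"; tenant="$2"; schema="${3:-}"; table="${4:-}"
  # A table-level merge is scoped to a schema, so reject a table without one.
  if [ -n "$table" ] && [ -z "$schema" ]; then
    echo "a table name requires a schema name" >&2
    return 1
  fi
  cmd="./tmt.sh -clnm $cluster -mp $tenant"
  if [ -n "$schema" ]; then cmd="$cmd -sn $schema"; fi
  if [ -n "$table" ]; then cmd="$cmd -tn $table"; fi
  printf '%s\n' "$cmd"
}

build_merge_cmd localCluster demo               # all schemas in the tenant
build_merge_cmd localCluster demo SALES         # one schema (placeholder name)
build_merge_cmd localCluster demo SALES ORDERS  # one table in one schema (placeholder names)
```

Printing the command first makes it easy to review the scope before running a merge that rewrites files in the original directory.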

The tool backs up the original files into a backup directory and merges the table increments into the original directory.

To start using the Parquet Merge Tool, follow these steps:

Apache Parquet Merge Tool Input Parameters

The interactive shell script has the following parameters:

Parameter   Description
-clnm       Mandatory. Specify the cluster name.
-mp         Mandatory. Specify the Tenant name.
-sn         Optional. Specify the Schema name(s).
-tn         Optional. Specify the Table name(s).
-mi         Optional. Specify the minimum table increment count for the merge. The default is 100. If a table has fewer increments than the minimum, the tool does not merge them.
-mt         Optional. Specify the merge type. Available values are append and readWrite. The default is append.
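The -mi rule can be illustrated with a small sketch. It only mirrors the documented behavior and is not the tool's own code. It assumes that a table whose increment count equals the minimum is still merged, since only counts below the minimum are described as skipped. The helper name meets_min_increments is hypothetical, and how to obtain a table's increment count is not covered here.

```shell
# Hypothetical helper mirroring the documented -mi rule; not the tool's own code.
# Usage: meets_min_increments <increment_count> [minimum]
# The minimum defaults to 100, the documented default for -mi.
meets_min_increments() {
  count="$1"
  min="${2:-100}"
  # Assumption: a count equal to the minimum still qualifies for a merge.
  [ "$count" -ge "$min" ]
}

if meets_min_increments 250; then echo "250 increments: merge"; fi
if ! meets_min_increments 40; then echo "40 increments: skip"; fi
if meets_min_increments 40 25; then echo "40 increments with -mi 25: merge"; fi
```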

For a specific Tenant, to migrate and merge all Parquet files of a table under a Schema, run the following command, specifying the minimum number of increments that the table must have:

./tmt.sh -clnm localCluster -mp demo -sn <SchemaName> -tn <TableName> -mi <MinIncrementsNumber> -mt readWrite
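Because -mt accepts only two documented values, a wrapper script might check the value before calling the tool. The sketch below is one hedged way to do that. The helper is_valid_merge_type is hypothetical, and treating the values as case-sensitive is an assumption based on how they are written in the parameter list. The example only prints the command it would run.

```shell
# Hypothetical pre-flight check; not part of tmt.sh.
# Succeeds only for the two documented -mt values.
# Assumption: the values are case-sensitive, exactly as written.
is_valid_merge_type() {
  case "$1" in
    append|readWrite) return 0 ;;
    *) return 1 ;;
  esac
}

merge_type="readWrite"
if is_valid_merge_type "$merge_type"; then
  # Print the command instead of running it; the <...> values remain placeholders.
  echo "would run: ./tmt.sh -clnm localCluster -mp demo -sn <SchemaName> -tn <TableName> -mi <MinIncrementsNumber> -mt $merge_type"
else
  echo "unsupported merge type: $merge_type" >&2
fi
```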

When the merge is complete, you will see the following message:

Merge is Done ! Do you want to delete the backup source files ? (press y to continue ,otherwise exit)

View the Apache Parquet Merge Tool Log Files

The Parquet Merge Tool generates a log file, whose name starts with merge-tool.log, in the following CMC installation directory:

/home/incorta/IncortaAnalytics/cmc/tenant

The file contains the Source Segment, Target Segment, and Source Offset. Use this log file to determine the results of the merge activities.
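To locate the log files from a terminal, a sketch like the following may help. It lists every file whose name starts with merge-tool.log in a given directory, defaulting to the directory shown above. The helper name list_merge_logs is hypothetical, and the exact layout of the entries inside each log is not assumed here.

```shell
# Hypothetical helper: list files whose names start with merge-tool.log.
# Usage: list_merge_logs [directory]
# The default directory is the CMC path given in this section.
list_merge_logs() {
  dir="${1:-/home/incorta/IncortaAnalytics/cmc/tenant}"
  for f in "$dir"/merge-tool.log*; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -e "$f" ] || continue
    printf '%s\n' "$f"
  done
}

# Prints one path per line, or nothing if no log file exists yet.
list_merge_logs
```

Each listed path can then be opened with a pager such as less to review the recorded segments and offsets.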

You can also find the .csv report file under cmc/tmt/work.