
About Apache Hadoop

Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers using simple programming models. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but it also differs from them significantly: HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. With the Apache Hadoop Web UI, you can typically browse the HDFS file system.

Apache Hadoop Connector Updates

This section describes the updates in newer versions of the Apache Hadoop connector available on the Incorta connectors marketplace.

To get the newer version of the connector, update the connector using the marketplace.

Version | Updates
2.0.1.8 | Fixed an issue in versions 2.0.1.0 through 2.0.1.7 of the Apache Hadoop connector that might have affected users of Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times.

Recommendation

Keep your connector up to date with the latest released version to get all fixes and enhancements.

Apache Hadoop Connector

The Apache Hadoop Connector enables Incorta to access files stored in an HDFS directory. Incorta is able to load the following file types from HDFS:

  • Text (csv, tsv, tab, txt)
  • Excel (xlsx)
  • Parquet
  • Optimized Row Columnar (ORC)

The Apache Hadoop connector supports the following Incorta-specific functionality:

Feature | Supported
Chunking
Data Agent
Encryption at Ingest
Incremental Load
Multi-Source
OAuth
Performance Optimized
Remote
Single-Source
Spark Extraction
Webhook Callbacks

Steps to connect Apache Hadoop and Incorta

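Here are the high-level steps to connect Apache Hadoop and Incorta:

  • Create an external data source
  • Create a schema with the Schema Wizard, or create a schema with the Schema Designer
  • Load the schema
  • Explore the schema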

Create an external data source

Here are the steps to create an external data source with the Apache Hadoop connector:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Data.
  • In the Action bar, select + New → Add Data Source.
  • In the Choose a Data Source dialog, in Data lake, select Data Lake - HDFS.
  • In the New Data Source dialog, specify the applicable connector properties.
  • To test, select Test Connection.
  • Select Ok to save your changes.

Apache Hadoop connector properties

Here are the properties for the Apache Hadoop connector:

Property | Control | Description
Data Source Name | text box | Enter the name of the data source
Directory | text box | Enter the path to the public HDFS user directory using an HDFS URL such as hdfs://<PUBLIC_IP>:<PORT>/user/<USER_NAME_OR_PATH>

Replace <PUBLIC_IP> with the HDFS website's public IP address or public DNS name.

The default <PORT> is 50700.

<USER_NAME_OR_PATH> represents the user and/or path to the HDFS directory.
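
The Directory value can be composed and sanity-checked before it is entered in the dialog. The following Python sketch is only an illustration: the helper names, the example address 203.0.113.10, and the path incorta/sales are hypothetical, and the check covers the URL shape only, not whether the cluster is reachable.

```python
from urllib.parse import urlparse


def build_hdfs_directory(host: str, port: int, user_or_path: str) -> str:
    """Compose hdfs://<host>:<port>/user/<user_or_path> (hypothetical helper)."""
    return f"hdfs://{host}:{port}/user/{user_or_path.strip('/')}"


def check_hdfs_directory(url: str) -> bool:
    """Shape check only: hdfs scheme, a host, an explicit port, a /user/ path."""
    parts = urlparse(url)
    return (
        parts.scheme == "hdfs"
        and bool(parts.hostname)
        and parts.port is not None
        and parts.path.startswith("/user/")
    )


# Example values are placeholders, not real cluster settings.
url = build_hdfs_directory("203.0.113.10", 50700, "incorta/sales")
print(url)
print(check_hdfs_directory(url))
```

A check like this catches a missing port or a wrong scheme early; Test Connection in the New Data Source dialog remains the actual verification.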

Create a schema with the Schema Wizard

Here are the steps to create an Apache Hadoop schema with the Schema Wizard:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Schema Wizard.
  • In (1) Choose a Source, specify the following:
    • For Enter a name, enter the schema name.
    • For Select a Datasource, select the Apache Hadoop external data source.
    • Optionally create a description.
  • In the Schema Wizard footer, select Next.
  • In (2) Manage Tables, in the Data Panel, navigate the directory tree as necessary to select the Apache Hadoop files. You can either check the Select All checkbox or select individual files.
  • In the Schema Wizard footer, select Next.
  • In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create an Apache Hadoop schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Create Schema.
  • In Name, specify the schema name, and select Save.
  • In Start adding tables to your schema, select Data Lake.
  • In the Data Source dialog, specify the Apache Hadoop table data source properties.
  • Select Add.
  • In the Table Editor, in the Table Summary section, enter the table name.
  • To save your changes, select Done in the Action bar.

Apache Hadoop table data source properties

For a schema table in Incorta, you can define the following Apache Hadoop-specific data source properties:

Property | Control | Description
Type | drop down list | Default is Data Lake
Data Source | drop down list | Select the Apache Hadoop external data source
Remote | toggle | Enable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized property affects data accessibility.
File Type | drop down list | Select a file type option: Text (csv, tsv, tab, txt); Excel (xlsx), which is not an option with Remote enabled; Parquet; ORC
Incremental | toggle | Enables incremental loading for the schema table
Has Header? | toggle | This property appears when Remote is disabled and the File Type is Text. Enable this property to indicate the data source has a header row.
Rows to Skip | text box | This property appears when Remote is disabled and the File Type is Text. Enter the number of rows to skip from the top of the file.
Wildcard Union | toggle | Enable this property to get incremental data file updates from an existing directory
File Path | text box | This property appears when Wildcard Union is disabled. Enter the path to the data file, relative to the root directory configured in the data source.
Worksheet | text box | This property appears when Wildcard Union is disabled and the File Type is Excel. Select the data file worksheet of interest.
Update File | text box | This property appears when Incremental is enabled and Wildcard Union is disabled. Enter the path to the update file, relative to the root directory configured in the data source.
Update Worksheet | text box | This property appears when the File Type is Excel, Incremental is enabled, and Wildcard Union is disabled. Select the update file worksheet of interest.
Incremental Extract Using | drop down list | This property appears when Incremental and Wildcard Union are enabled. Select an incremental load method.
Timestamp format in file name | drop down list | This property appears when the Timestamp in File Name option is selected for the Incremental Extract Using property. Select the timestamp format that appears in the file name.
Directory Path | text box | This property appears when Wildcard Union is enabled. Enter the path to the directory, relative to the root directory configured in the data source. To use the root directory, enter ./ or .
Apply Include Pattern on | drop down list | This property appears when Wildcard Union is enabled. Select either File Name (apply the pattern to all file names in the selected directory path) or File Relative Path (apply the pattern to the relative path in the selected directory path).
Include | text box | This property appears when Wildcard Union is enabled. To include only certain files in the load process, enter a prefix to compare against. If Apply Include Pattern on has a value of File Name, the prefix is compared against the names of the files in a directory; for example, entering sales*.parquet will load only those files that start with the word sales and end with .parquet. If Apply Include Pattern on has a value of File Relative Path, the prefix is compared against the relative path in a directory; for example, entering sales will load those files in the sales directory.
Exclude | text box | This property appears when Wildcard Union is enabled. To exclude files from the load process, enter a prefix to compare against. Files that match the prefix will not be loaded.
Include Sub-Directories | toggle | This property appears when Wildcard Union is enabled. Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include prefix is specified, only files or relative paths in the subdirectories matching the prefix will be loaded.
Include Filename as a Column | toggle | This property appears when Wildcard Union is enabled. Enable this property to add the file name as the first column in the schema table.
Date Format | drop down list | This property appears when the File Type is Text. Select the text file date format.
Timestamp Format | drop down list | This property appears when the File Type is Text. Select the text file timestamp format.
Character Set | drop down list | This property appears when the File Type is Text. Select the text file character set.
Separator | drop down list | This property appears when the File Type is Text. Select the text file separator.
Enable Chunking | toggle | This property appears when the File Type is Text. Turn this property on to process the text file in chunks.
Callback | toggle | Enable this property to expose the Callback URL field
Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL.

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | MV/Notebooks | Analytics
Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No
Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless populated via MV/Notebook
Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes

Incremental Extract Methods

  • Last Successful Extract Time: This option loads files whose last modified timestamp is more recent than the time of the last successful extract.

    Here is an example use case of Last Successful Extract Time:

    • A directory containing all the sales data is located at /path/to/sales
    • The directory contains the following files: /path/to/sales/sales_california.parquet, /path/to/sales/sales_newyork.parquet, /path/to/sales/illinois.parquet

    When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_ohio.parquet, the next incremental load will pick up this file since its last modified timestamp will be more recent than that of the files extracted in the previous full load.

  • Timestamp in File Name: This option will load data from the time specified in the file name.

    Here is an example use case of Timestamp in File Name:

    • A directory containing all the sales data is located at /path/to/sales
    • The directory receives a new file on a daily basis: /path/to/sales/sales_2020-04-01.parquet, /path/to/sales/sales_2020-04-02.parquet, /path/to/sales/sales_2020-04-03.parquet

    When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_2020-04-04.parquet, the next incremental load will pick up this file since the timestamp in the file name is more recent than that of the files extracted in the previous full load.
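
The two incremental extract methods above can be pictured with a short Python sketch. It is a simplified model, not the connector's implementation: the function names and the modification times are hypothetical, and the file-name variant assumes the yyyy-MM-dd format used in the example.

```python
import re
from datetime import datetime


def new_by_modified_time(modified_times, last_extract_time):
    """Last Successful Extract Time: keep files modified after the last extract."""
    return sorted(
        path
        for path, modified in modified_times.items()
        if modified > last_extract_time
    )


DATE_IN_NAME = re.compile(r"\d{4}-\d{2}-\d{2}")  # yyyy-MM-dd


def new_by_name_timestamp(paths, latest_loaded):
    """Timestamp in File Name: keep files whose name carries a later date."""
    picked = []
    for path in paths:
        match = DATE_IN_NAME.search(path)
        if match and datetime.strptime(match.group(0), "%Y-%m-%d") > latest_loaded:
            picked.append(path)
    return sorted(picked)


# Hypothetical modification times; the full load ran at noon on 2020-04-01.
modified_times = {
    "/path/to/sales/sales_california.parquet": datetime(2020, 4, 1, 9, 0),
    "/path/to/sales/sales_newyork.parquet": datetime(2020, 4, 1, 9, 5),
    "/path/to/sales/sales_ohio.parquet": datetime(2020, 4, 2, 8, 0),
}
by_mtime = new_by_modified_time(modified_times, datetime(2020, 4, 1, 12, 0))

# Files up to 2020-04-03 were loaded previously; 2020-04-04 is new.
daily = [f"/path/to/sales/sales_2020-04-0{day}.parquet" for day in range(1, 5)]
by_name = new_by_name_timestamp(daily, datetime(2020, 4, 3))

print(by_mtime)
print(by_name)
```

The practical difference is what drives the selection: the first method depends on file system metadata, so a rewritten old file would be picked up again, while the second depends only on the naming convention.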

Timestamp Formats in File Name

  • yyyy-MM-dd
  • dd.MM.yyyy
  • dd-MMM-yy
  • dd-MMM-yyyy
  • yyyy-MM-dd HH.mm.ss
  • Unix Epoch (seconds)
  • Unix Epoch (milliseconds)
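
The formats listed above resemble Java-style date patterns. As a rough aid for checking which format a file name uses, the following Python sketch maps each one to an equivalent strptime format. The mapping and the function name are assumptions made for illustration; month abbreviations such as Apr are parsed according to the current locale, and the epoch variants are interpreted as UTC here.

```python
from datetime import datetime, timezone

# Assumed equivalents of the listed patterns in Python's strptime syntax.
STRPTIME_FORMATS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd.MM.yyyy": "%d.%m.%Y",
    "dd-MMM-yy": "%d-%b-%y",
    "dd-MMM-yyyy": "%d-%b-%Y",
    "yyyy-MM-dd HH.mm.ss": "%Y-%m-%d %H.%M.%S",
}


def parse_file_timestamp(text: str, fmt: str) -> datetime:
    """Parse the timestamp portion of a file name (hypothetical helper)."""
    if fmt == "Unix Epoch (seconds)":
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    if fmt == "Unix Epoch (milliseconds)":
        return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc)
    # Pattern-based formats yield naive datetimes (no time zone attached).
    return datetime.strptime(text, STRPTIME_FORMATS[fmt])


print(parse_file_timestamp("2020-04-04", "yyyy-MM-dd"))
print(parse_file_timestamp("04.04.2020", "dd.MM.yyyy"))
print(parse_file_timestamp("2020-04-04 13.30.00", "yyyy-MM-dd HH.mm.ss"))
print(parse_file_timestamp("1585958400", "Unix Epoch (seconds)"))
```

Note that yyyy-MM-dd HH.mm.ss uses periods rather than colons in the time part, which keeps the timestamp usable in file names on systems that do not allow colons.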

Text File Date Format

  • yyyy-MM-dd
  • dd/MM/yyyy
  • dd.MM.yyyy
  • dd/MMM/yyyy
  • dd-MMM-yy
  • dd-MMM-yyyy
  • MM/dd/yyyy
  • yyyy/MM/dd
  • Unix Epoch (seconds)
  • Unix Epoch (milliseconds)
  • Other

Text File Timestamp Format

  • yyyy-MM-dd HH:mm:ss
  • yyyy-MM-dd HH.mm.ss
  • yyyy-MM-dd HH:mm:ss.SSS
  • dd/MM/yyyy HH:mm:ss
  • dd/MM/yyyy HH.mm.ss
  • dd/MM/yyyy HH:mm:ss.SSS
  • Unix Epoch (seconds)
  • Unix Epoch (milliseconds)
  • Other

Text File Character Set

  • US-ASCII
  • ISO-8859-1
  • UTF-8
  • UTF-16BE
  • UTF-16LE
  • UTF-16
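
Choosing the wrong character set usually shows up as garbled non-ASCII characters rather than as an error. The following Python sketch uses Python's own codecs, which go by the same names as the options above, to show the effect on a hypothetical sample value; it says nothing about how the connector itself decodes files.

```python
CHARSETS = ["US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"]

sample = "café"  # hypothetical value containing one non-ASCII character


def encoded_size(text: str, charset: str):
    """Return the encoded length in bytes, or None if the charset cannot represent the text."""
    try:
        return len(text.encode(charset))
    except UnicodeEncodeError:
        return None


for charset in CHARSETS:
    print(charset, encoded_size(sample, charset))

# Bytes written as UTF-8 but read as ISO-8859-1 decode without error,
# yet the accented character is garbled.
misread = sample.encode("UTF-8").decode("ISO-8859-1")
print(misread)
```

If accented or non-Latin characters look wrong after a load, comparing the file's actual encoding with the selected Character Set is a reasonable first check.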

Text File Separator

  • Comma
  • Tab
  • Other
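
The Separator, Has Header?, and Rows to Skip properties together determine how a text file is split into columns and rows. The following Python sketch models that interaction with the standard csv module. It is an approximation for illustration only: the function name, the generated column names, and the assumption that rows are skipped before the header is read are not taken from the connector.

```python
import csv


def read_text_file(content, separator=",", has_header=True, rows_to_skip=0):
    """Split text into a header and data rows (simplified model)."""
    # Assumption: skipped rows are dropped first, then the header is read.
    lines = content.splitlines()[rows_to_skip:]
    rows = list(csv.reader(lines, delimiter=separator))
    if has_header and rows:
        return rows[0], rows[1:]
    # Without a header row, generate placeholder column names.
    width = len(rows[0]) if rows else 0
    return [f"column_{i + 1}" for i in range(width)], rows


# Hypothetical tab-separated file with one banner line before the header.
content = "exported 2020-04-04\nstate\tamount\nCA\t100\nNY\t250\n"

header, data = read_text_file(content, separator="\t", rows_to_skip=1)
print(header)
print(data)
```

Running the same content with the default comma separator would leave each line as a single column, which is a quick way to recognize a wrong Separator setting.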

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the Apache Hadoop schema.
  • In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the Apache Hadoop schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the Apache Hadoop schema.
  • In the Schema Designer, in the Action bar, select Load → Full Load.
  • To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the Apache Hadoop schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps:

  • In the Navigation bar, select Schema.
  • In the Schema Manager, in the List view, select the Apache Hadoop schema.
  • In the Schema Designer, in the Action bar, select Explore Data.