About Google Cloud Storage
Cloud Storage is a service for storing your data in Google Cloud. Your data can be a file of any format. Cloud Storage treats each object as an immutable piece of data. You store objects in containers called buckets. Buckets are associated with a project, and projects exist under an organization. You can store and retrieve any amount of data, regardless of the size of your company.
Google Cloud Storage Connector Updates
This section describes the updates in newer versions of the Google Cloud Storage connector available on the Incorta connectors marketplace.
To get the newer version of the connector, update the connector using the marketplace.
Version | Updates |
---|---|
2.0.1.8 | Fixed an issue with versions from 2.0.1.0 to 2.0.1.7 of the Google Cloud Storage connector that might have affected users who use Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times |
Keep your connector up to date with the latest released version to get all fixes and enhancements.
About the Google Cloud Storage Connector
With the Google Cloud Storage (GCS) connector, you can create a data source for Cloud Storage files of any format.
You can access all folders and files that you have in your bucket.
The GCS connector supports the following Incorta-specific functionality:
Feature | Supported |
---|---|
Chunking | |
Data Agent | |
Encryption at Ingest | |
Incremental Load | ✔ |
Multi-Source | ✔ |
OAuth | ✔ |
Performance Optimized | ✔ |
Remote | ✔ |
Single-Source | ✔ |
Spark Extraction | |
Webhook Callbacks | ✔ |
The GCS connector allows two types of authentication:
- OAUTH
- Service Account
OAUTH Requirements
The System Administrator who manages your organization’s GCS accounts as well as your Incorta Cluster creates your project on Google Cloud Platform and your credentials as a web application. To use the GCS connector, do the following in Google Cloud Platform:
- Find your account key (the OAuth2 client ID) in APIs & Services > Credentials
- Find your secret key (the OAuth2 client secret) in APIs & Services > Credentials
- Enable the GCS JSON API
- Make sure you have a bucket that contains your data
Service Account Requirements
The System Administrator who manages your organization’s GCS accounts as well as your Incorta Cluster creates your project on Google Cloud Platform and creates your credentials as a web application. To use the GCS connector, do the following in Google Cloud Platform:
- In IAM & Admin > Service Accounts, choose your project and generate your service account JSON key
- In the downloaded service account JSON file, find your private key ID, private key, and client email
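As an illustration of the last step, the following minimal sketch reads the three values from a service account key file. The field names (`client_email`, `private_key_id`, `private_key`) are those of a standard Google service account JSON key; the helper name `read_connector_fields` and the sample content are hypothetical placeholders, not part of Incorta or Google tooling.

```python
import json

# Field names as they appear in a standard Google service account key file.
REQUIRED_FIELDS = ("client_email", "private_key_id", "private_key")


def read_connector_fields(key_json: str) -> dict:
    """Return the three values that the connector properties ask for."""
    key = json.loads(key_json)
    missing = [name for name in REQUIRED_FIELDS if not key.get(name)]
    if missing:
        raise ValueError(f"service account key is missing: {', '.join(missing)}")
    return {name: key[name] for name in REQUIRED_FIELDS}


# Placeholder content for illustration only; this is not a real key.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key_id": "0123456789abcdef",
    "private_key": "-----BEGIN PRIVATE KEY-----\nplaceholder\n-----END PRIVATE KEY-----\n",
    "client_email": "loader@example-project.iam.gserviceaccount.com",
})

fields = read_connector_fields(sample_key)
print(fields["client_email"])
```

In practice you would read the downloaded file from disk and pass its text to the helper, then copy each value into the matching connector property.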
Steps to connect GCS and Incorta
To connect GCS and Incorta, here are the high-level steps, tools, and procedures:
- Create an external data source
- Create a schema with the Schema Wizard
- or, Create a schema with the Schema Designer
- Load the schema
- Explore the schema
Create an external data source
Here are the steps to create an external data source with the GCS connector:
- Sign in to the Incorta Direct Data Platform™.
- In the Navigation bar, select Data.
- In the Action bar, select + New → Add Data Source.
- In the Choose a Data Source dialog, in Data lake, select Data lake-GCS.
- In the New Data Source dialog, specify the applicable connector properties.
- To test, select Test Connection.
- Select Ok to save your changes.
GCS connector properties
Here are the properties for the GCS connector:
Property | Control | Description |
---|---|---|
Name Your Data Source | text box | Enter a name for your data source |
Project ID | text box | Enter your project ID. You can find it in your GCS Settings. |
Authentication Type | text box | Select your authentication type. The options are: ● OAUTH 2.0 ● Service Account. To learn more, refer to OAUTH Requirements and Service Account Requirements. |
Google OAuth2 Client ID | text box | Choose OAUTH 2.0 Authentication type to configure this property. Enter your GCS account key. You can find it in your Google Cloud Platform > APIs & Services > Credentials. |
Google OAuth2 Client Secret | text box | Choose OAUTH 2.0 Authentication type to configure this property. Enter your GCS secret key. You can find it in your Google Cloud Platform > APIs & Services > Credentials. |
Authorize | button/link | Click this link to allow GCS access |
Account Client Email | text box | Choose Service Account Authentication type to configure this property. Enter your Service Account Client Email. Copy it from the service account JSON file. |
Account Private Key ID | text box | Choose Service Account Authentication type to configure this property. Enter your service account private key id. Copy it from the service account JSON file. |
Account Private Key | text box | Choose Service Account Authentication type to configure this property. Enter your service account private key. Copy it from the service account JSON file. |
Directory | text box | Enter your GCS bucket name and path to your target folder. Example: gs://bucket-name/path/to/root/directory |
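The Directory value combines the bucket name and the root folder in a single `gs://` URI. The sketch below shows one way to check such a value and split it into its two parts before entering it. The function name `split_gcs_directory` is hypothetical, and the validation is an assumption about what a well-formed value looks like, not a description of the connector's own checks.

```python
from urllib.parse import urlparse


def split_gcs_directory(directory: str) -> tuple[str, str]:
    """Split gs://bucket/path into (bucket, path relative to the bucket)."""
    parsed = urlparse(directory)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"expected gs://bucket-name/path, got: {directory!r}")
    # Drop the leading and trailing slashes so the path is bucket-relative.
    return parsed.netloc, parsed.path.strip("/")


bucket, root = split_gcs_directory("gs://bucket-name/path/to/root/directory")
print(bucket)
print(root)
```

The File Path and Directory Path table properties are then interpreted relative to the second part, for example `sales/Q1.csv` under `path/to/root/directory`.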
Create a schema with the Schema Wizard
Here are the steps to create a GCS schema with the Schema Wizard:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Schema Manager, in the Action bar, select + New → Schema Wizard.
- In (1) Choose a Source, specify the following:
  - For Enter a name, enter the schema name.
  - For Select a Datasource, select the GCS source.
  - Optionally create a description.
- In the Schema Wizard footer, select Next.
- In (2) Manage Tables, in the Data Panel, first select the name of the Data Source, and then check the Select All checkbox.
- In the Schema Wizard footer, select Next.
- In (3) Finalize, in the Schema Wizard footer, select Create Schema.
Create a schema with the Schema Designer
Here are the steps to create a GCS schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Create Schema.
- In Name, specify the schema name, optionally create a description, and select Save.
- In Start adding tables to your schema, select Data Lake.
- In the Data Source dialog, specify the GCS table data source properties.
- Select Add.
- In the Table Editor, in the Table Summary section, enter the table name.
- To save your changes, select Done in the Action bar.
GCS table data source properties
For a schema table in Incorta, you can define the following GCS-specific data source properties:
Property | Control | Description |
---|---|---|
Type | drop down list | Default is Data Lake |
Data Source | drop down list | Select the GCS external data source |
Remote | toggle | Enable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized option affects data accessibility. |
File Type | drop down list | Select a file type option: ● Text (csv, tsv, tab, txt) ● Excel (xlsx), which is not an option with Remote enabled ● Parquet ● ORC |
Incremental | toggle | Enables incremental loading for the schema table |
Has header? | toggle | This property appears when Remote is disabled and the File Type is Text. Enable this property to indicate the data source has a header row. |
Rows to Skip | text box | This property appears when Remote is disabled and the File Type is Text. Enter the number of rows to skip from the top of the file. |
Wildcard Union | toggle | Enable this property to get incremental data file updates from an existing directory |
Incremental Extract Using | drop down list | Enable Incremental and Wildcard Union to configure this property. Choose the incremental extract method. |
Directory Path | text box | Enable Wildcard Union to configure this property. Enter the path to the directory, relative to the root directory configured in the data source. For example: sales/branches |
Apply Include Pattern on | drop down list | This property appears when Wildcard Union is enabled. Select either: ● File Name - apply pattern on all file names in the selected directory path ● File Relative Path - apply pattern on relative path in the selected directory path |
Include | text box | This property appears when Wildcard Union is enabled. To include only certain files in the load process, enter a prefix to compare against: ● The names of the files in a directory if Apply Include Pattern on has a value of File Name. For example, entering sales*.parquet will load only files that start with the word sales and end with .parquet. ● The relative path in a directory if Apply Include Pattern on has a value of File Relative Path. For example, entering sales will load files in the sales directory. |
Exclude | text box | This property appears when Wildcard Union is enabled. To exclude files from the load process, enter a prefix to compare against. Files that match the prefix will not be loaded. |
Include Sub-Directories | toggle | Enable Wildcard Union to configure this property. Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include value is specified, only files or relative paths in the subdirectories that match it will be loaded. |
Include Filename as a Column | toggle | Enable Wildcard Union to configure this property. Enable this property to add the file name as the first column in the schema table. |
File Path | text box | Enter the path to the data file, relative to the root directory configured in the data source. For example: sales/Q1.csv |
Update File | text box | Enter the path to the update file, relative to the root directory configured in the data source. For example: sales/Q1_updates.csv |
Filename column | text box | Enable Include Filename as a Column to configure this property. Enter the filename column. |
Date Format | drop down list | This property appears when the File Type is Text. Choose the date format from the available options. |
Timestamp Format | drop down list | This property appears when the File Type is Text. Choose the timestamp format from the available options. |
Character Set | drop down list | This property appears when the File Type is Text. Choose the character set from the available options. |
Separator | drop down list | This property appears when the File Type is Text. Choose the separator from the available options. |
Enable Chunking | toggle | This property appears when the File Type is Text. Enable this property to process the text file in chunks, which reduces the extract time. |
Chunk Size (MB) | text box | Enter the chunk size in megabytes |
Callback | toggle | Enable this property to expose the Callback URL field |
Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL. |
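To make the Include, Exclude, and Apply Include Pattern on properties more concrete, here is a minimal sketch of how such filtering could behave. It assumes glob-style matching as implemented by Python's `fnmatch` module; the connector's actual matching rules are not specified in this document, and the function `select_files` and the sample file list are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def select_files(relative_paths, include, apply_on="File Name", exclude=None):
    """Keep paths whose name (or relative path) matches include and not exclude."""
    selected = []
    for path in relative_paths:
        # "File Name" compares only the last path segment;
        # "File Relative Path" compares the whole relative path.
        target = basename(path) if apply_on == "File Name" else path
        if not fnmatchcase(target, include):
            continue
        if exclude and fnmatchcase(target, exclude):
            continue
        selected.append(path)
    return selected


# Hypothetical directory listing, relative to the configured Directory Path.
files = [
    "sales_california.parquet",
    "sales_newyork.parquet",
    "illinois.parquet",
    "archive/sales_2019.parquet",
]

by_name = select_files(files, "sales*.parquet")
by_path = select_files(files, "archive/*", apply_on="File Relative Path")
print(by_name)
print(by_path)
```

With File Name, the pattern is compared to each file's name only, so `illinois.parquet` is skipped; with File Relative Path, the pattern can also target a subdirectory.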
Summary of Data Access Methods Based on Remote and Performance Optimized Settings
Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | MV/ Notebooks | Analytics |
---|---|---|---|---|---|---|---|
Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No |
Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless populated via MV/Notebook |
Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes |
Incremental Extract Methods
Last Successful Extract Time: This option will load data from the time the last successful extract occurred.
Here is an example use case of Last Successful Extract Time:
- A directory containing all the sales data is located at /path/to/sales
- The directory contains the following files: /path/to/sales/sales_california.parquet, /path/to/sales/sales_newyork.parquet, /path/to/sales/illinois.parquet
When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_ohio.parquet, the next incremental load will pick up this file since its last modified timestamp will be more recent than that of the files extracted in the previous full load.
Timestamp in File Name: This option will load data from the time specified in the file name.
Here is an example use case of Timestamp in File Name:
- A directory containing all the sales data is located at /path/to/sales
- The directory receives a new file on a daily basis: /path/to/sales/sales_2020-04-01.parquet, /path/to/sales/sales_2020-04-02.parquet, /path/to/sales/sales_2020-04-03.parquet
When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_2020-04-04.parquet, the next incremental load will pick up this file since the timestamp in the file name is more recent than that of the files extracted in the previous full load.
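The two incremental extract methods can be sketched as simple file filters. This is an illustration of the selection logic described above, not the connector's implementation: the function names, the `YYYY-MM-DD` file name pattern, and the sample modification times are assumptions made for the example.

```python
import re
from datetime import datetime

# Assumed file name convention: a YYYY-MM-DD date somewhere in the name.
DATE_IN_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})")


def new_by_modified_time(modified_times, last_extract):
    """Last Successful Extract Time: files modified after the last extract."""
    return sorted(p for p, m in modified_times.items() if m > last_extract)


def new_by_name_timestamp(paths, latest_loaded):
    """Timestamp in File Name: files whose name carries a more recent date."""
    picked = []
    for path in paths:
        match = DATE_IN_NAME.search(path)
        if match is None:
            continue  # no date in the name, so nothing to compare
        if datetime.strptime(match.group(1), "%Y-%m-%d") > latest_loaded:
            picked.append(path)
    return sorted(picked)


# Hypothetical modification times; the full load ran on 2020-04-03 at 12:00.
modified = {
    "/path/to/sales/sales_california.parquet": datetime(2020, 4, 1, 8, 0),
    "/path/to/sales/sales_newyork.parquet": datetime(2020, 4, 2, 8, 0),
    "/path/to/sales/illinois.parquet": datetime(2020, 4, 3, 8, 0),
    "/path/to/sales/sales_ohio.parquet": datetime(2020, 4, 4, 8, 0),
}
by_mtime = new_by_modified_time(modified, datetime(2020, 4, 3, 12, 0))

daily = [
    "/path/to/sales/sales_2020-04-01.parquet",
    "/path/to/sales/sales_2020-04-02.parquet",
    "/path/to/sales/sales_2020-04-03.parquet",
    "/path/to/sales/sales_2020-04-04.parquet",
]
by_name = new_by_name_timestamp(daily, datetime(2020, 4, 3))

print(by_mtime)
print(by_name)
```

In both cases only the newly arrived file is selected, which matches the behavior described in the two use cases.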
View the schema diagram with the Schema Diagram Viewer
Here are the steps to view the schema diagram using the Schema Diagram Viewer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the list of schemas, select the GCS schema.
- In the Schema Designer, in the Action bar, select Diagram.
Load the schema
Here are the steps to perform a Full Load of the GCS schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the list of schemas, select the GCS schema.
- In the Schema Designer, in the Action bar, select Load → Full Load.
- To review the load status, in Last Load Status, select the date.
Explore the schema
With the full load of the GCS schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.
To open the Analyzer from the schema, follow these steps:
- In the Navigation bar, select Schema.
- In the Schema Manager, in the List view, select the GCS schema.
- In the Schema Designer, in the Action bar, select Explore Data.