About the Avro Extractor Tool
The Apache Kafka connector requires an `.avro` file for a given Kafka topic. The Avro file describes how to serialize and flatten a semi-structured, schema-less JSON message into structured data consisting of one or more related physical schema tables within Incorta.

The Avro Extractor Tool is a JAR file, `avroExtractor.jar`, that you can use to generate an Avro file from a single JSON message sample.

The resulting `.avro` file defines how a specific JSON object maps to one or more physical schema tables, and how specific key-value pairs map to table columns. The Avro file itself is in JSON format. It contains Avro annotations that define the mapping of JSON key-value pairs to tables, columns, data types, and keys. Annotations for keys describe the relationships between tables.
Deployment
The default location of the Avro Extractor Tool is:

/home/incorta/IncortaAnalytics/IncortaNode/bin/avroExtractor.jar

The `avroExtractor.jar` file requires a dependent JAR file, `javax.json-1.0.4.jar`, in the same parent directory.

Here are the steps to copy `javax.json-1.0.4.jar` to the same parent directory as `avroExtractor.jar`:
- On an Incorta Node where you created the sample JSON file, as the incorta user, copy the `javax.json-1.0.4.jar` file to the same directory as `../IncortaAnalytics/IncortaNode/bin/`.

sudo su incorta
cp /home/incorta/IncortaAnalytics/IncortaNode/runtime/lib/javax.json-1.0.4.jar /home/incorta/IncortaAnalytics/IncortaNode/bin/
- Confirm that you copied the file. The following command will show `javax.json-1.0.4.jar`:

ls -l /home/incorta/IncortaAnalytics/IncortaNode/bin/ | grep javax
If you have Java installed on your local machine, you can run the Avro Extractor Tool as an independent JAR file on your machine. Simply copy both the `avroExtractor.jar` and `javax.json-1.0.4.jar` files to the same parent directory.
How to use the Avro Extractor Tool
The tool requires a JSON file that contains a single JSON message from a Kafka topic. After creating the sample file, you can run the Avro Extractor Tool as deployed on the Incorta Node. You can then copy the `.avro` file to your local desktop. Creating an external data source that uses the Apache Kafka connector for the related Kafka topic requires this `.avro` file.
Create a sample JSON file
In order to generate a `.avro` file, the Avro Extractor Tool requires a `.json` file that contains a single JSON message. Here is an example of a JSON message for the FruitSales topic in Kafka:

{
  "messageId": 143,
  "entity": "tblFruitSales",
  "data": {
    "id": 143,
    "dt": "2019-09-30 19:28:23",
    "item": "jackfruit",
    "quantity": 35,
    "price": 5.72
  }
}
The JSON message itself must contain:

- a message type key-value pair, such as `{ ... "entity": "tblData" ... }`
- a `data` key-value pair, where the value of the `data` key contains the actual data you want to store in Incorta, for example: `{ ... "data": { "name1": "value1" } }`

The Avro Extractor Tool will ignore the `data` key, and you will not see this key in the resulting `.avro` file.
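Before creating the sample file, you can sanity-check that a message meets these two requirements. The following Python sketch is only an illustration, not part of the Avro Extractor Tool; it assumes the message type key is named `entity`, as in the examples on this page:

```python
import json

def validate_sample_message(raw, message_type_key="entity"):
    """Check that a JSON sample meets the Avro Extractor Tool's requirements:
    a message type key-value pair and a "data" object.

    Returns a list of problems; an empty list means the sample looks usable.
    """
    problems = []
    message = json.loads(raw)
    if message_type_key not in message:
        problems.append(f'missing message type key "{message_type_key}"')
    if not isinstance(message.get("data"), dict):
        problems.append('missing or non-object "data" key')
    return problems

sample = '''
{ "messageId": 143, "entity": "tblFruitSales",
  "data": { "id": 143, "item": "jackfruit", "quantity": 35 } }
'''
print(validate_sample_message(sample))  # [] means no problems found
```

A message that lacks the `data` object, for example, would be reported before you spend time running the tool against it.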
To create a sample JSON file for the Avro Extractor Tool, follow these steps:
- Switch to the incorta user on the same host where you will run the Avro Extractor Tool.

sudo su incorta
- In the `/tmp` directory, create the `mySampleMessage.json` file using vim.

cd /tmp
vim mySampleMessage.json
- Using the `i` keystroke, switch to Insert mode, and copy your message into the file. Here is an example:

{
  "messageId": 143,
  "entity": "tblFruitSales",
  "data": {
    "id": 143,
    "dt": "2019-09-30 19:28:23",
    "item": "jackfruit",
    "quantity": 35,
    "price": 5.72
  }
}
- Using the `ESC` keystroke, switch back to command mode.
- With the keystroke `:wq!`, write the file and quit.
- Verify the console output of the message contents.

cat mySampleMessage.json
Generate an Avro file for the sample JSON message
After deploying the Avro Extractor Tool and creating the sample JSON message file, you can run the Avro Extractor Tool. By default, the tool generates an `.avro` file in the same directory as the input `.json` file. The Avro Extractor Tool has several input parameters that you can specify.

The resulting file in the following example is `mySampleMessage.json.avro`. Here are the steps to create the `.avro` file as the incorta user:
- Change directories to the location of the `avroExtractor.jar` file.

cd /home/incorta/IncortaAnalytics/IncortaNode/bin/
- Run the Avro Extractor Tool from the command line and specify the required values for the Avro Extractor Tool input parameters.
java -jar avroExtractor.jar -messageType entity -input /tmp/mySampleMessage.json
- The console output will show the location of the `.avro` file:

Extracting the schema for: /tmp/mySampleMessage.json
output to: /tmp/mySampleMessage.json.avro
messageType key is: entity
trimAfterDash: false
- Read the content of the `.avro` file in the `/tmp` directory.

cat /tmp/mySampleMessage.json.avro
- Verify that the file is similar to the following:
{
  "avroSchema": [{
    "type": "record",
    "name": "tblFruitSales",
    "fields": [
      { "name": "id", "type": "long" },
      { "name": "dt", "type": "string" },
      { "name": "item", "type": "string" },
      { "name": "quantity", "type": "long" },
      { "name": "price", "type": "double" }
    ]
  }],
  "annotations": []
}
- Secure copy the `.avro` file to your desktop. Here is an example of using scp to copy a file from the remote host to the desktop:

INCORTA_NODE_HOST=100.101.102.103
scp -i ~/.ssh/host_pemkey.pem incorta@${INCORTA_NODE_HOST}:/tmp/mySampleMessage.json.avro ~/Desktop/mySampleMessage.json.avro
- Edit the `.avro` file and add Avro annotations that describe keys and relationships between tables.
- Save your changes.
Avro Extractor Tool input parameters
Here are the various input parameters available for the Avro Extractor Tool:
Parameter | Description |
---|---|
-input | Specify the sample JSON message file name and path, if not in the current directory |
-output | Optional. Specify this parameter to designate the name of the output file and path, if not the current directory. If not specified, the tool will output the .avro file to the same directory as the input file and will use the input filename, appending the .avro file extension to the name. |
-messageType | Specify the key whose value Incorta uses as the root table name for the JSON messages. If the key name is entity , as in this example {"entity" : "Employee"} , then specify -messageType entity . Incorta will create an Employee table as the root table. |
-trimAfterDash | Optional. Specify true or false . If the value of the -messageType key contains a dash, Incorta removes the dash and everything after it. For example, if the key name is entity , as in this example {"entity" : "Employee-USA"} , use -trimAfterDash true so that Incorta trims the value Employee-USA to Employee and creates an Employee table as the root table. Incorta does not allow a dash (- ) in a physical schema table name. |
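The trimming rule that -trimAfterDash describes can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the tool's actual implementation:

```python
def trim_after_dash(message_type_value):
    """Mimic -trimAfterDash true: keep only the text before the first dash,
    since Incorta does not allow a dash in a physical schema table name."""
    return message_type_value.split("-", 1)[0]

# "Employee-USA" becomes "Employee", which Incorta uses as the root table name.
print(trim_after_dash("Employee-USA"))  # Employee
print(trim_after_dash("Employee"))      # Employee (no dash, value unchanged)
```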
Avro Annotations
The Avro Extractor Tool discerns the data types for key-value pairs within the JSON sample. In many cases, the Avro Extractor Tool must flatten nested JSON objects into tables.
In order to describe relationships between tables, you must edit the `.avro` file and add Avro annotations. You can use various annotations to instruct Incorta to ignore certain columns, define columns as keys, and create both child and parent tables.

All annotations start with a `path` attribute that describes a field or a record, using dots to form a fully qualified name. Here is an example of defining a column as a key in the `mySampleMessage.json.avro` file:
{
  "avroSchema": [{
    "type": "record",
    "name": "tblFruitSales",
    "fields": [
      { "name": "id", "type": "long" },
      { "name": "dt", "type": "string" },
      { "name": "item", "type": "string" },
      { "name": "quantity", "type": "long" },
      { "name": "price", "type": "double" }
    ]
  }],
  "annotations": [{
    "path": "tblFruitSales.id",
    "isKey": "true"
  }]
}
Here are the supported Avro Annotations:
Annotation | Name | Description | Example |
---|---|---|---|
isKey | Primary Key | Sets the given field or column to be part of the primary key of its parent table. This flag is set on the field level. | Example: { "path": "person.id", "isKey": "true" } |
isIgnored | Ignored | Any field marked as "isIgnored": "true" will not be shown in schema discovery. | Example:{ "path": "person.corrections", "isIgnored": "true" } |
isTable | Set as table | The default behavior of Incorta is to flatten the fields of a nested child JSON object into its parent. To make a nested child a separate table, set the isTable flag to true. | Example:{"path": "person.qualifications", "isTable": "true" } |
isParent | Parent table | A nested child object in JSON is by default a child table in Incorta. If the table is instead a parent table, set the isParent flag to true. Note that isParent must be used in conjunction with "isTable": "true" . | Example:{ "path": "person.organization", "isTable": "true", "isParent": "true" } |
isPersistent | Persistent | If the data for a nested JSON object will actually arrive in another Kafka message, then the nested object does not need to be saved as part of the current message. In this case, set isPersistent to false. Note that isPersistent must be used in conjunction with "isTable": "true" and "isParent": "true" . | Example:{ "path": "person.organization", "isTable" : "true", "isParent": "true", "isPersistent": "false" } |
isMap | Map | If a nested JSON object maps keys in the JSON to a set of child records, use "isMap": "true" to define a child table. Note that isMap cannot be used in conjunction with isTable . | Example:{ "path": "person.addresses", "isMap": "true" } |
<KEY> | Map Key | In the "path" , when referencing fields or records that are children of a map, then the path needs to include the variable element, <KEY> . | Example:{ "path": "person.addresses.<KEY>.typeName", "isKey": "true" } |
tableName | Table Name | An alias table name can be added to any table, map, or array. | Example:{ "path": "country.lastModification", "isTable": "true", "tableName": "coun_lsat_mod"} |
tableName | Array Map | If a map is actually a set of fields inside a record, then the annotation needs to specify the field names and the corresponding table name. The list of field names inside the "path" must be comma-separated. Note that tableName must be used in conjunction with "isMap": "true" . | Example:{ "path": "person.addresses.<KEY>.local1,local2", "isMap": "true", "tableName": "person.addresses.LocalAddresses" } |
isOneToOne | One To One | While annotating a table as "isTable": "true" , you can also annotate it as a one-to-one table. During schema discovery, Incorta will only mark key columns that are defined in and inherited from the parent table. | Example:{ "path": "person.demographic", "isTable": "true", "isOneToOne": "true"} |
isEncrypted | Source encrypted | Annotating a field as "isEncrypted": "true" means that it is encrypted at the source and needs to be decrypted by Incorta using a custom crypto class. | Example:{ "path": "person.basic.firstName", "isEncrypted": "true" } |
encryptionName | Encryption name | This annotation must accompany "isEncrypted": "true" . You must specify the name of the crypto class as defined in the kafka-crypto.properties file. | Example:{ "path": "person.basic.firstName", "isEncrypted" : "true", "encryptionName" : "person"} |
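Because the generated `.avro` file is plain JSON, you can also add annotations with any JSON-aware tool instead of hand-editing. The following Python sketch is a convenience helper, not part of the Avro Extractor Tool; it appends an isKey annotation to a parsed Avro document in the string `"true"`/`"false"` form the annotations above use:

```python
import json

def add_annotation(avro_doc, path, **flags):
    """Append one annotation object (isKey, isTable, isParent, ...) to a
    parsed .avro document. Boolean flag values are written as the strings
    "true"/"false", matching the annotation examples in the table above."""
    annotation = {"path": path}
    for name, value in flags.items():
        if isinstance(value, bool):
            annotation[name] = "true" if value else "false"
        else:
            annotation[name] = value
    avro_doc.setdefault("annotations", []).append(annotation)
    return avro_doc

# A minimal document shaped like the tblFruitSales example above.
doc = {
    "avroSchema": [{"type": "record", "name": "tblFruitSales",
                    "fields": [{"name": "id", "type": "long"}]}],
    "annotations": [],
}
add_annotation(doc, "tblFruitSales.id", isKey=True)
print(json.dumps(doc["annotations"]))
```

Writing the modified document back with `json.dump` produces the same result as editing the `.avro` file by hand.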