User Guide for GridDB Embulk Output Plugin
Revision: 2.1.0-12973-a96d5d04
1 GridDB output plugin for Embulk
The GridDB output plugin for Embulk loads data records into a GridDB database.
1.1 Overview
The GridDB output plugin for Embulk transfers data from other databases (or data files) to a GridDB database.
It accepts any data format that the Embulk input plugin in use can read and parse.
- Plugin type: output
- Resume supported: no (If a transaction fails, clean all data and run it again.)
Note: To learn more about the usage of Embulk, see https://github.com/embulk/embulk
1.2 Development Environment
Name | Version |
---|---|
Embulk | 0.9.x |
Java | Oracle JDK 8 |
Gradle | Gradle 6.3 |
OS | CentOS 7 |
1.3 Configuration
1.3.1 Configuration File
The GridDB output plugin for Embulk is configured through a configuration file (config.yaml) written in YAML format. A sample of this configuration file is given below. Here the GridDB output plugin corresponds to the "out:" section:
    in:
      type: file
      path_prefix: sample_file.csv
      parser:
        type: csv
        columns:
        - {name: column1, type: long }
        - {name: column2, type: string }
        - {name: column3, type: double }
    out:
      type: griddb
      mode_cluster: PROVIDER
      provider_url: http://example.com/my_cluster.json
      cluster: myCluster
      database: myDB
      container: myContainer
      user: admin
      password: admin
      mode_insert: replace
      column_options:
        column1: { type: long }
        column2: { type: string }
        column3: { type: double }
Note: The "in:" section in the sample above corresponds to the Embulk input plugin. For more information about the Embulk configuration, refer to https://www.embulk.org/docs/built-in.html
The following table describes all the options of the GridDB output plugin (options used in the "out:" section), where the value N/A means not applicable.
Key | Data Type | Accepted Value | Default Value | Description |
---|---|---|---|---|
type | String | griddb | (required) | The name of the Embulk output plugin. Specify the plugin name "griddb", which denotes the GridDB output plugin for Embulk. |
mode_cluster | String | PROVIDER | PROVIDER | Mode of the cluster. Currently, only "PROVIDER" is supported. |
provider_url | String | N/A | (required) | URL of the address provider used in the PROVIDER method. |
cluster | String | N/A | (required) | The GridDB cluster name. |
database | String | public | (required) | The name of the GridDB database that stores the data. |
container | String | N/A | (required) | The target container name (also called the table name). |
mode_insert | Enumeration | append, replace | append | append: inserts data into an existing container. replace: deletes all existing data (without dropping the container) and inserts the new records. |
user | String | N/A | (required) | The GridDB administrator username. |
password | String | N/A | (required) | The GridDB administrator password. |
batch_size | Integer | N/A | 1000000 | Number of records in a single batch for insertion. If the number of records is greater than batch_size, the plugin does not insert them all at once: it splits them into ceil(N_records / batch_size) batches and inserts them on a per-batch basis. |
default_timezone | String | N/A | UTC | If the type of a column is TIMESTAMP and the Embulk type is string, column values are formatted using the time zone given in default_timezone. The time zone for each column can be overridden with the column_options option. |
column_options | Object | N/A | N/A | A map of source column types and target column types. |
column_options/column_name | String | N/A | N/A | Column name. |
column_options/column_name/type | Enumeration | string, long, double, float, boolean, timestamp | (same as the input type) | Column data type. |
column_options/column_name/timestamp_format | String | N/A | N/A | If the input type is timestamp and the output type is string, this timestamp_format is used to format the string value. For further information about the timestamp format, see https://docs.oracle.com/javase/8/docs/api/index.html?java/text/SimpleDateFormat.html |
column_options/column_name/timezone | String | N/A | (same as default_timezone) | If the input type is timestamp and the output type is string, the time zone is appended after the timestamp value. |
column_options/column_name/time_precision | String | MILLISECOND, MICROSECOND, NANOSECOND | MILLISECOND | The three values MILLISECOND, MICROSECOND, and NANOSECOND correspond to the GridDB column types timestamp/timestamp(3), timestamp(6), and timestamp(9), respectively. |
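The required and optional keys in the table above can be summarized with a small Python sketch. It is not part of the plugin: the helper names `missing_required_options` and `with_defaults` are hypothetical, and the "out:" section is represented as a plain dictionary rather than parsed from config.yaml. Only the key names and default values are taken from the table.

```python
# Hypothetical helpers for illustration only; the plugin performs its own
# validation in Java. Key names and defaults are copied from the options table.
REQUIRED_KEYS = (
    "type", "provider_url", "cluster", "database",
    "container", "user", "password",
)
DEFAULTS = {
    "mode_cluster": "PROVIDER",
    "mode_insert": "append",
    "batch_size": 1000000,
    "default_timezone": "UTC",
}


def missing_required_options(out_section):
    """Return the required keys that are absent from the out: section."""
    return [key for key in REQUIRED_KEYS if key not in out_section]


def with_defaults(out_section):
    """Return a copy of the out: section with documented defaults filled in."""
    merged = dict(DEFAULTS)
    merged.update(out_section)
    return merged


# The out: section of the sample configuration, written as a dictionary.
sample_out = {
    "type": "griddb",
    "mode_cluster": "PROVIDER",
    "provider_url": "http://example.com/my_cluster.json",
    "cluster": "myCluster",
    "database": "myDB",
    "container": "myContainer",
    "user": "admin",
    "password": "admin",
    "mode_insert": "replace",
}

print(missing_required_options(sample_out))       # no required key is missing
print(with_defaults(sample_out)["batch_size"])    # default batch size applies
```

Because the sample does not set batch_size or default_timezone, the merged dictionary falls back to the documented defaults, while the explicit mode_insert value (replace) is kept.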
1.3.2 Environment Preparation
1. Download Embulk from https://github.com/embulk/embulk/releases/
2. Grant executable permission to the embulk.jar file:
    $ chmod +x ./embulk.jar
3. Install the JDK 1.8.0 (Development Version).
    RHEL-like (Red Hat, CentOS, Fedora, etc.):
    # yum install java-1.8.0-openjdk-devel
    Debian-like (Debian, Ubuntu, Linux Mint, etc.):
    # apt-get install openjdk-8-jdk
4. Change the working directory to the embulk-output-griddb directory and run the command:
    $ gradle package
Note:
- Configure your proxy settings in gradle.properties, if needed, before building the plugin with Gradle.
- If ./gradlew does not have the executable permission, run the command chmod +x ./gradlew.
Now, the plugin is ready to use.
1.4 Example
Prepare the following files and the directory:
test
├── product.csv
├── config.yaml
└── embulk-output-griddb
- product.csv
"id","name","price","deleted","join_date","end_date"
"1","Café","10000.000000000000001","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"2","北京烤鸭","3.14159265358979","true","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"3","サーモン","0.66666666666666","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
- config.yaml
    in:
      type: file
      path_prefix: product.csv
      parser:
        charset: UTF-8
        newline: LF
        type: csv
        delimiter: ','
        quote: '"'
        trim_if_not_quoted: false
        skip_header_lines: 1
        allow_extra_columns: false
        allow_optional_columns: false
        columns:
        - {name: id, type: string }
        - {name: name, type: string }
        - {name: price, type: string }
        - {name: deleted, type: string }
        - {name: join_date, type: string }
        - {name: end_date, type: timestamp, format: "%Y-%m-%d %H:%M:%S.%N" }
    out:
      type: griddb
      mode_cluster: PROVIDER
      provider_url: http://example.com/api/griddb/mycluster.json
      cluster: mycluster
      database: public
      container: product
      user: admin
      password: admin
      mode_insert: replace
      batch_size: 1
      column_options:
        id: { type: long }
        name: { type: string }
        price: { type: double }
        deleted: { type: boolean }
        join_date: { type: timestamp(6) }
        end_date: { type: string, timestamp_format: yyyy-MM-dd HH:mm:ss.SSSSSSXXX, timezone: UTC, time_precision: MICROSECOND }
Note: Change the values of provider_url and cluster above to values suitable for your environment.
Run the plugin:
$ ./embulk.jar run -L ./embulk-output-griddb/ config.yaml
Observe the result in the public database. A container named product has been created:
id | name | price | deleted | join_date | end_date |
---|---|---|---|---|---|
1 | Café | 10000.0 | false | 2020-07-14T00:00:00.000000+07:00 | 2015-07-12T15:00:00.000000UTC |
2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T00:00:00.000000+07:00 | 2015-07-12T15:00:00.000000UTC |
3 | サーモン | 0.66666666666666 | false | 2020-07-14T00:00:00.000000+07:00 | 2015-07-12T15:00:00.000000UTC |
In this example, the input and output types of each column are defined as follows:
Column | Input Type | Output Type |
---|---|---|
id | string | long |
name | string | string |
price | string | double |
deleted | string | boolean |
join_date | string | timestamp(6) |
end_date | timestamp | string |
It is preferable to set the same input type and output type for the columns id, name, price, deleted, and join_date. Even if they differ, the embulk-output-griddb plugin can convert the values into an appropriate type based on the information about the output destination. The end_date column demonstrates a timestamp with custom formatting when the input type is timestamp and the output type is string. Because column_options defines timestamp_format as yyyy-MM-dd HH:mm:ss.SSSSSSXXX and timezone as UTC, the end_date column contains 2015-07-12T15:00:00.000000UTC.
Note: The timestamp format needs to follow the RFC3339 specification. If it doesn't, the output will be converted based on the input timestamp format.
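To make the shape of the end_date value concrete, the following Python sketch builds the same string from a timestamp and a time zone name. It only mimics the result shown in the table: the plugin itself formats values in Java, Python's strftime directives are not the Java pattern letters used in timestamp_format, and the function name `format_with_zone` is hypothetical.

```python
from datetime import datetime, timezone


def format_with_zone(value, zone_name):
    """Format a timestamp with microsecond precision and append the zone name.

    This is an illustration of the output shape only, not the plugin's
    formatter.
    """
    return value.strftime("%Y-%m-%dT%H:%M:%S.%f") + zone_name


# The end_date value from product.csv, interpreted as UTC.
end_date = datetime(2015, 7, 12, 15, 0, 0, tzinfo=timezone.utc)
print(format_with_zone(end_date, "UTC"))  # 2015-07-12T15:00:00.000000UTC
```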
Now, make sure batch_size is set to 1 in config.yaml; then run the command again and observe the output on the terminal. Notice that the data is split into three batches:
[INFO] (0015:task-0000): put rows:0-1
[INFO] (0015:task-0000): convert STRING:Café
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:1-1
[INFO] (0015:task-0000): put rows:1-2
[INFO] (0015:task-0000): convert STRING:北京烤鸭
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:2-2
[INFO] (0015:task-0000): put rows:2-3
[INFO] (0015:task-0000): convert STRING:サーモン
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:3-3
[INFO] (0015:task-0000): Finish page: 3
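The row ranges in the log above (0-1, 1-2, 2-3) follow from splitting three records with a batch_size of 1. A minimal Python sketch of that arithmetic is shown below; the function name `batch_ranges` is hypothetical, and the plugin's actual batching loop is not shown in this guide, so treat this as an illustration of the splitting rule rather than the plugin's implementation.

```python
import math


def batch_ranges(n_records, batch_size):
    """Return half-open (start, end) row ranges, one per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [
        (start, min(start + batch_size, n_records))
        for start in range(0, n_records, batch_size)
    ]


# Three records with batch_size 1, as in the example run.
print(batch_ranges(3, 1))   # [(0, 1), (1, 2), (2, 3)]

# The number of batches is the record count divided by batch_size, rounded up.
print(math.ceil(3 / 1))     # 3
print(batch_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

With the default batch_size of 1000000, the same three records would fit in a single batch.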
Next, change mode_insert to "append" and run the command again. Observe the result in the public database. Notice that the product container now has 6 records instead of 3, because the append mode inserts additional data into the existing container:
id | name | price | deleted | join_date | end_date |
---|---|---|---|---|---|
1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 2015-07-12T15:00:00.000000UTC |
2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 2015-07-12T15:00:00.000000UTC |
3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 2015-07-12T15:00:00.000000UTC |
1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 2015-07-12T15:00:00.000000UTC |
2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 2015-07-12T15:00:00.000000UTC |
3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 2015-07-12T15:00:00.000000UTC |
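The difference between the two mode_insert values can be modeled in a few lines of Python. This is a toy model that uses a list in place of a GridDB container; the function name `load` is hypothetical and nothing here calls the plugin or GridDB.

```python
def load(container, records, mode_insert="append"):
    """Toy model of mode_insert: a Python list stands in for the container."""
    if mode_insert == "replace":
        # replace: delete all existing rows but keep the container itself
        container.clear()
    elif mode_insert != "append":
        raise ValueError("mode_insert must be 'append' or 'replace'")
    # both modes then insert the new records
    container.extend(records)
    return container


rows = [(1, "Café"), (2, "北京烤鸭"), (3, "サーモン")]
product = []

load(product, rows, "replace")
print(len(product))  # 3

load(product, rows, "append")
print(len(product))  # 6

load(product, rows, "replace")
print(len(product))  # 3
```

Running the load in replace mode again returns the container to three records, which matches the behavior described for replace in the options table.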
2 References
- Embulk on GitHub - https://github.com/embulk/embulk
- Embulk Homepage - https://www.embulk.org/
- Embulk Documentation - https://www.embulk.org/docs/built-in.html#csv-parser-plugin