User Guide for GridDB Embulk Output Plugin
Revision: 1.5.0-9417-7fa03c1d
1 GridDB output plugin for Embulk
The GridDB output plugin for Embulk loads data records into a GridDB database.
1.1 Overview
The GridDB output plugin for Embulk transfers data from other databases (or data files) to a GridDB database.
- Plugin type: output
- Resume supported: no (if a transaction fails, clear all loaded data and run it again)
Note: To learn more about the usage of Embulk, see https://github.com/embulk/embulk
1.2 Configuration
1.2.1 Configuration File
The GridDB output plugin for Embulk can be configured by a configuration file (config.yaml). This file is written in YAML format. A sample of this configuration file is given below. Here the GridDB output plugin corresponds to the "out:" section:
in:
  type: file
  path_prefix: sample_file.csv
  parser:
    type: csv
    columns:
    - {name: column1, type: long }
    - {name: column2, type: string }
    - {name: column3, type: double }
out:
  type: griddb
  mode_cluster: PROVIDER
  provider_url: http://example.com/my_cluster.json
  cluster: myCluster
  database: myDB
  container: myContainer
  user: admin
  password: admin
  mode_insert: replace
  column_options:
    column1: { type: long }
    column2: { type: string }
    column3: { type: double }
Note: The "in:" section in the sample above corresponds to the Embulk input plugin. For more information about the Embulk configuration, refer to https://www.embulk.org/docs/built-in.html
The following table describes all the options of the GridDB output plugin (options used in the "out:" section) where the value N/A means not applicable.
| Key | Data Type | Accepted Value | Default Value | Description |
|---|---|---|---|---|
| type | String | griddb | (required) | The name of the Embulk output plugin. Specify the plugin name "griddb" which denotes the GridDB output plugin for Embulk. |
| mode_cluster | String | PROVIDER | PROVIDER | Cluster connection mode. Only "PROVIDER" is currently supported. |
| provider_url | String | N/A | (required) | URL of an address provider used in the PROVIDER method |
| cluster | String | N/A | (required) | The GridDB cluster name |
| database | String | public | (required) | The name of a GridDB database that stores data |
| container | String | N/A | (required) | The target container name. Also called table name. |
| mode_insert | Enumeration | append, replace | append | append: inserts data into an existing container. replace: deletes all existing data (without dropping the container) and inserts the new records. |
| user | String | N/A | (required) | The GridDB administrator username |
| password | String | N/A | (required) | The GridDB administrator password |
| batch_size | Integer | N/A | 1000000 | Number of records in a single insertion batch. If the number of data records is greater than batch_size, the records are separated into smaller batches: instead of inserting all data records at once, the GridDB output plugin splits them into N_records/batch_size parts and inserts them batch by batch. |
| default_timezone | String | N/A | UTC | If the type of a column is TIMESTAMP and the Embulk type is string, column values are formatted using the time zone specified in default_timezone. The time zone for each column can be overridden with the column_options option. |
| column_options | Object | N/A | N/A | A map of source column types and target column types. |
| column_options/column_name | String | N/A | N/A | Column name. |
| column_options/column_name/type | Enumeration | string, long, double, float, boolean, timestamp | (same as the input type) | Column data type |
| column_options/column_name/timestamp_format | String | N/A | N/A | If the input type is timestamp, and the output type is string, this timestamp_format is used to format the string value. For further information about timestamp format, see https://docs.oracle.com/javase/8/docs/api/index.html?java/text/SimpleDateFormat.html |
| column_options/column_name/timezone | String | N/A | (same as default_timezone) | If the input type is timestamp and the output type is string, the timezone is appended after the timestamp value. |
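The batch_size behaviour described in the table can be pictured with a short Python sketch. This is an illustration only, not the plugin's code: the function names split_into_batches and batch_count are hypothetical, and rounding the batch count up (so that a partial final batch is still inserted) is an assumption.

```python
import math


def split_into_batches(records, batch_size):
    """Yield consecutive slices holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def batch_count(n_records, batch_size):
    """Number of batches needed, assuming a partial last batch is allowed."""
    return math.ceil(n_records / batch_size)


# Three records with batch_size 1 give three single-record batches.
batches = list(split_into_batches(["row1", "row2", "row3"], 1))
```

With the default batch_size of 1000000, a small input such as the three-record sample used later in this guide would fit in a single batch.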
1.2.2 Environment Preparation
- Download Embulk from https://github.com/embulk/embulk/releases/
- Grant executable permission to the embulk.jar file:
  $ chmod +x ./embulk.jar
- Install JDK 1.8.0 (development version).
  RHEL-like (Red Hat, CentOS, Fedora, etc.):
  # yum install java-1.8.0-openjdk-devel
  Debian-like (Debian, Ubuntu, Linux Mint, etc.):
  # apt-get install openjdk-8-jdk
- Change the working directory to the embulk-output-griddb directory and run the command:
  $ ./gradlew package
Note: If ./gradlew does not have executable permission, run the command chmod +x ./gradlew.
Now, the plugin is ready to use.
1.3 Example
Prepare the following files and directory:
test
├── product.csv
├── config.yaml
└── embulk-output-griddb
- product.csv
"id","name","price","deleted","join_date","end_date"
"1","Café","10000.000000000000001","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"2","北京烤鸭","3.14159265358979","true","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"3","サーモン","0.66666666666666","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
- config.yaml
in:
  type: file
  path_prefix: product.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: string }
    - {name: name, type: string }
    - {name: price, type: string }
    - {name: deleted, type: string }
    - {name: join_date, type: string }
    - {name: end_date, type: timestamp, format: "%Y-%m-%d %H:%M:%S" }
out:
  type: griddb
  mode_cluster: PROVIDER
  provider_url: http://example.com/api/griddb/mycluster.json
  cluster: mycluster
  database: public
  container: product
  user: admin
  password: admin
  mode_insert: replace
  batch_size: 1
  column_options:
    id: { type: long }
    name: { type: string }
    price: { type: double }
    deleted: { type: boolean }
    join_date: { type: timestamp }
    end_date: { type: string, timestamp_format: yyyyMMdd, timezone: UTC+06 }
Note: Change the values of provider_url and cluster above to suitable values.
Run the plugin:
$ ./embulk.jar run -L ./embulk-output-griddb/ config.yaml
Observe the result in the public database. A container named product was created:
| id | name | price | deleted | join_date | end_date |
|---|---|---|---|---|---|
| 1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
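The values in the table follow from ordinary string-to-type conversion. The Python sketch below is only an illustration (the plugin is not written in Python, and the helper name to_boolean is hypothetical). It shows why the price 10000.000000000000001 appears as 10000.0 and why 00:00 at +07:00 is the same instant as 02:00 at +09:00. The +09:00 offset is taken from the table; that it reflects the time zone in which the result was displayed is an assumption.

```python
from datetime import datetime, timedelta, timezone


def to_boolean(text):
    """Hypothetical helper: treat the text 'true' (any case) as True."""
    return text.strip().lower() == "true"


# String values from the first row of product.csv.
record_id = int("1")
price = float("10000.000000000000001")  # nearest double is exactly 10000.0
deleted = to_boolean("false")

# Parse join_date together with its +07:00 offset.
join_date = datetime.strptime(
    "2020-07-14 00:00:00.000 +07:00", "%Y-%m-%d %H:%M:%S.%f %z"
)

# Express the same instant at +09:00 (offset assumed from the result table).
shown = join_date.astimezone(timezone(timedelta(hours=9)))
shown_text = shown.isoformat(timespec="milliseconds")
```

Here shown_text evaluates to 2020-07-14T02:00:00.000+09:00, matching the join_date column above.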
In this example, the input and output types of each column are defined as follows:
| Column | Input Type | Output Type |
|---|---|---|
| id | string | long |
| name | string | string |
| price | string | double |
| deleted | string | boolean |
| join_date | string | timestamp |
| end_date | timestamp | string |
It is preferable to set the same input type and output type for the columns id, name, price, deleted and join_date. Even if they differ, the embulk-output-griddb plugin can convert each value to an appropriate output type based on the information about the output destination. The end_date column demonstrates custom timestamp formatting when the input type is timestamp and the output type is string. Because column_options defines timestamp_format as yyyyMMdd and timezone as UTC+06, the end_date column contains 20150712UTC+06.
Now, confirm that batch_size is set to 1 in config.yaml (as in the sample above), run the command again, and observe the output on the terminal. Notice that the data is split into three batches:
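The end_date result can be reproduced with a small Python sketch. It is a rough model under stated assumptions, not the plugin's implementation: the function name format_timestamp_column is hypothetical, the strftime pattern %Y%m%d is assumed to be the equivalent of the SimpleDateFormat pattern yyyyMMdd, and the time zone text is simply appended after the formatted value, as the options table describes. Whether the plugin also shifts the date by the time zone offset is not modelled here.

```python
from datetime import datetime


def format_timestamp_column(value, strftime_pattern, timezone_label):
    """Format a timestamp and append the time zone text after it."""
    return value.strftime(strftime_pattern) + timezone_label


# end_date as parsed by the input side with format "%Y-%m-%d %H:%M:%S".
end_date = datetime.strptime("2015-07-12 15:00:00", "%Y-%m-%d %H:%M:%S")

# "%Y%m%d" stands in for the SimpleDateFormat pattern "yyyyMMdd" (assumption).
end_date_text = format_timestamp_column(end_date, "%Y%m%d", "UTC+06")
```

Under these assumptions end_date_text is 20150712UTC+06, the value shown in the result table.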
[INFO] (0015:task-0000): put rows:0-1
[INFO] (0015:task-0000): convert STRING:Café
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:1-1
[INFO] (0015:task-0000): put rows:1-2
[INFO] (0015:task-0000): convert STRING:北京烤鸭
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:2-2
[INFO] (0015:task-0000): put rows:2-3
[INFO] (0015:task-0000): convert STRING:サーモン
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:3-3
[INFO] (0015:task-0000): Finish page: 3
Next, change mode_insert to "append" and run the command again. Observe the result in the public database. The product container now has 6 records instead of 3 because append mode inserts additional data into the existing container:
| id | name | price | deleted | join_date | end_date |
|---|---|---|---|---|---|
| 1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
| 3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06 |
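The difference between the two mode_insert values can be modelled with an in-memory list standing in for the container. This is a simplified sketch, not the plugin's code: the function name load_rows is hypothetical, and a real GridDB container is of course not a Python list.

```python
def load_rows(container, new_rows, mode_insert="append"):
    """Model of mode_insert: 'replace' empties the container first,
    'append' keeps the rows already stored."""
    if mode_insert == "replace":
        container.clear()  # existing rows are deleted, the container stays
    elif mode_insert != "append":
        raise ValueError("mode_insert must be 'append' or 'replace'")
    container.extend(new_rows)
    return container


rows = [(1, "Café"), (2, "北京烤鸭"), (3, "サーモン")]
product = []

load_rows(product, rows, "replace")  # first run: 3 records
after_replace = len(product)
load_rows(product, rows, "append")   # second run: 3 more records
after_append = len(product)
```

In this model the first run leaves 3 records and the second run leaves 6, mirroring the table above. Running again with replace would bring the count back to 3.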
2 References
- Embulk on GitHub - https://github.com/embulk/embulk
- Embulk Homepage - https://www.embulk.org/
- Embulk Documentation - https://www.embulk.org/docs/built-in.html#csv-parser-plugin



