User Guide for GridDB Embulk Output Plugin

Revision: 1.5.0-9417-7fa03c1d

1 GridDB output plugin for Embulk

The GridDB output plugin for Embulk loads data records into a GridDB database.

1.1 Overview

The GridDB output plugin for Embulk transfers data from other databases (or data files) to a GridDB database.

The GridDB output plugin for Embulk can load any data format that the Embulk input plugin is able to read and parse.

Plugin type: output
Resume supported: no (if a transaction fails, delete all loaded data and run the transfer again)

Note: To learn more about the usage of Embulk, see https://github.com/embulk/embulk

1.2 Configuration

1.2.1 Configuration File

The GridDB output plugin for Embulk is configured through a configuration file (config.yaml) written in YAML format. A sample configuration file is given below; the "out:" section configures the GridDB output plugin:

in:
    type: file
    path_prefix: sample_file.csv
    parser:
        type: csv
        columns:
          - {name: column1, type: long }
          - {name: column2, type: string }
          - {name: column3, type: double }

out:
    type: griddb
    mode_cluster: PROVIDER
    provider_url: http://example.com/my_cluster.json
    cluster: myCluster
    database: myDB
    container: myContainer
    user: admin
    password: admin
    mode_insert: replace
    column_options:
        column1: { type: long    }
        column2: { type: string  }
        column3: { type: double  }

Note: The "in:" section in the sample above corresponds to the Embulk input plugin. For more information about the Embulk configuration, refer to https://www.embulk.org/docs/built-in.html

The following table describes all options of the GridDB output plugin (the options used in the "out:" section). The value N/A means not applicable.

Key	Data Type	Accepted Value	Default Value	Description
type	String	griddb	(required)	The name of the Embulk output plugin. Specify `griddb` to select the GridDB output plugin for Embulk.
mode_cluster	String	PROVIDER	PROVIDER	Mode of the cluster. Currently, only `PROVIDER` is supported.
provider_url	String	N/A	(required)	URL of an address provider used in the PROVIDER method
cluster	String	N/A	(required)	The GridDB cluster name
database	String	public	(required)	The name of a GridDB database that stores data
container	String	N/A	(required)	The target container name. Also called table name.
mode_insert	Enumeration	append, replace	append	- `append`: inserts data into an existing container. - `replace`: deletes all existing data (without dropping the container) and inserts the new records.
user	String	N/A	(required)	The GridDB administrator username
password	String	N/A	(required)	The GridDB administrator password
batch_size	Integer	N/A	1000000	Size of a single batch for insertion. If the number of data records is greater than `batch_size`, the GridDB output plugin does not insert all records at once; it splits them into `N_records`/`batch_size` batches (rounded up) and inserts them batch by batch.
default_timezone	String	N/A	UTC	If the type of a column is `TIMESTAMP` and the Embulk type is `string`, column values are formatted as specified in `default_timezone`. The `timezone` for each column can be overridden by using the `column_options` option.
column_options	Object	N/A	N/A	A map of source column types and target column types.
column_options/column_name	String	N/A	N/A	Column name.
column_options/column_name/type	Enumeration	string, long, double, float, boolean, timestamp	(same as the input type)	Column data type
column_options/column_name/timestamp_format	String	N/A	N/A	If the input type is `timestamp`, and the output type is `string`, this `timestamp_format` is used to format the string value. For further information about timestamp format, see https://docs.oracle.com/javase/8/docs/api/index.html?java/text/SimpleDateFormat.html
column_options/column_name/timezone	String	N/A	(same as default_timezone)	If the input type is `timestamp` and the output type is `string`, the `timezone` is appended after the timestamp value.
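
The required keys, accepted values, and defaults listed in the table can be checked before Embulk is run. The following Python sketch is only an illustration: `check_out_section` is a hypothetical helper that is not part of the plugin or of Embulk, and the "out:" section is written as a plain dictionary so that no YAML library is needed.

```python
# Hypothetical pre-flight check for the "out:" section; not part of the plugin.
REQUIRED_KEYS = ["type", "provider_url", "cluster", "database",
                 "container", "user", "password"]
# Defaults and accepted values copied from the options table.
DEFAULTS = {"mode_cluster": "PROVIDER", "mode_insert": "append",
            "batch_size": 1000000, "default_timezone": "UTC"}
ACCEPTED = {"type": {"griddb"}, "mode_cluster": {"PROVIDER"},
            "mode_insert": {"append", "replace"}}


def check_out_section(out):
    """Return (config with defaults applied, list of problems found)."""
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in out]
    merged = {**DEFAULTS, **out}
    for key, allowed in ACCEPTED.items():
        if key in merged and merged[key] not in allowed:
            problems.append(f"unsupported value for {key}: {merged[key]!r}")
    return merged, problems


out_section = {
    "type": "griddb", "provider_url": "http://example.com/my_cluster.json",
    "cluster": "myCluster", "database": "myDB", "container": "myContainer",
    "user": "admin", "password": "admin", "mode_insert": "replace",
}
config, problems = check_out_section(out_section)
print(problems)              # empty list when nothing is missing
print(config["batch_size"])  # default applied because the key was omitted
```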

1.2.2 Environment Preparation

Download Embulk from https://github.com/embulk/embulk/releases/

Grant execute permission to the embulk.jar file:
```
$ chmod +x ./embulk.jar
```
Install JDK 1.8.0 (the development package).

RHEL-like (Red Hat, CentOS, Fedora, etc.):
```
# yum install java-1.8.0-openjdk-devel
```
Debian-like (Debian, Ubuntu, Linux Mint, etc.):
```
# apt-get install openjdk-8-jdk
```

Change the working directory to the embulk-output-griddb directory and build the plugin:
```
$ ./gradlew package
```

Note: If ./gradlew does not have execute permission, run chmod +x ./gradlew first.

Now, the plugin is ready to use.

1.3 Example

Prepare the following files and directory:

test
├── product.csv
├── config.yaml
└── embulk-output-griddb

product.csv

"id","name","price","deleted","join_date","end_date"
"1","Café","10000.000000000000001","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"2","北京烤鸭","3.14159265358979","true","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"3","サーモン" ,"0.66666666666666","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"

config.yaml

in:
    type: file
    path_prefix: product.csv
    parser:
        charset: UTF-8
        newline: LF
        type: csv
        delimiter: ','
        quote: '"'
        trim_if_not_quoted: false
        skip_header_lines: 1
        allow_extra_columns: false
        allow_optional_columns: false
        columns:
          - {name: id,        type: string }
          - {name: name,      type: string }
          - {name: price,     type: string }
          - {name: deleted,   type: string }
          - {name: join_date, type: string }
          - {name: end_date,  type: timestamp, format: "%Y-%m-%d %H:%M:%S" }

out:
    type: griddb
    mode_cluster: PROVIDER
    provider_url: http://example.com/api/griddb/mycluster.json
    cluster: mycluster
    database: public
    container: product
    user: admin
    password: admin
    mode_insert: replace
    batch_size: 1
    column_options:
        id:        { type: long      }
        name:      { type: string    }
        price:     { type: double    }
        deleted:   { type: boolean   }
        join_date: { type: timestamp }
        end_date:  { type: string, timestamp_format: yyyyMMdd, timezone: UTC+06 }

Note: Change the values of provider_url and cluster above to match your environment.

Run the plugin:

$ ./embulk.jar run -L ./embulk-output-griddb/ config.yaml

Check the result in the public database. A container named product has been created:

id	name	price	deleted	join_date	end_date
1	Café	10000.0	false	2020-07-14T02:00:00.000+09:00	20150712UTC+06
2	北京烤鸭	3.14159265358979	true	2020-07-14T02:00:00.000+09:00	20150712UTC+06
3	サーモン	0.66666666666666	false	2020-07-14T02:00:00.000+09:00	20150712UTC+06
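
Two values in this result differ in appearance from the CSV input: the price of the first row and every join_date. The following Python sketch reproduces both effects with the standard library only. It does not use the plugin's code, and it assumes that the +09:00 offset shown in the result is simply the offset at which the stored instant is displayed.

```python
from datetime import datetime, timedelta, timezone

# price: the CSV text carries more digits than a 64-bit double can hold,
# so the nearest representable value is 10000.0.
price = float("10000.000000000000001")
print(price)

# join_date: parse the CSV value (offset +07:00), then express the same
# instant at a +09:00 offset (assumed display offset, taken from the result).
src = datetime.strptime("2020-07-14 00:00:00.000 +07:00",
                        "%Y-%m-%d %H:%M:%S.%f %z")
shown = src.astimezone(timezone(timedelta(hours=9)))
print(shown.isoformat(timespec="milliseconds"))
```

Both timestamps denote the same instant; only the offset used for display differs.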

In this example, the input and output types of each column are defined as follows:

Column	Input Type	Output Type
id	string	long
name	string	string
price	string	double
deleted	string	boolean
join_date	string	timestamp
end_date	timestamp	string

It is preferable to set the same input type and output type for the columns id, name, price, deleted, and join_date. Even when the types differ, the embulk-output-griddb plugin can convert each value to an appropriate output type based on information about the output destination. The end_date column demonstrates custom timestamp formatting when the input type is timestamp and the output type is string: because column_options sets timestamp_format to yyyyMMdd and timezone to UTC+06, the end_date column contains 20150712UTC+06.
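
The end_date value can be reproduced with a short Python sketch. It rests on several assumptions that are not confirmed by the plugin: the input value is interpreted as UTC, the value is shifted to the configured offset before formatting, and the strftime pattern "%Y%m%d" stands in for the Java pattern yyyyMMdd. `format_end_date` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def format_end_date(value, offset_hours, tz_label):
    """Format a timestamp as yyyyMMdd at the given offset, then append the label."""
    local = value.astimezone(timezone(timedelta(hours=offset_hours)))
    # "%Y%m%d" is the strftime counterpart of the Java pattern yyyyMMdd.
    return local.strftime("%Y%m%d") + tz_label


# Assumption: the CSV value 2015-07-12 15:00:00 is interpreted as UTC.
end_date = datetime(2015, 7, 12, 15, 0, 0, tzinfo=timezone.utc)
print(format_end_date(end_date, 6, "UTC+06"))
```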

Next, make sure batch_size is set to 1 in config.yaml (as in the sample above), run the command again, and observe the output on the terminal. Notice that the data is split into three batches:

[INFO] (0015:task-0000): put rows:0-1
[INFO] (0015:task-0000): convert STRING:Café
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:1-1
[INFO] (0015:task-0000): put rows:1-2
[INFO] (0015:task-0000): convert STRING:北京烤鸭
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:2-2
[INFO] (0015:task-0000): put rows:2-3
[INFO] (0015:task-0000): convert STRING:サーモン
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:3-3
[INFO] (0015:task-0000): Finish page: 3
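
The batching seen in this log can be modeled with a few lines of Python. This is a sketch of the splitting rule described for batch_size, not the plugin's implementation; `split_into_batches` is a hypothetical name.

```python
import math


def split_into_batches(records, batch_size):
    """Yield consecutive slices holding at most batch_size records each."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = ["Café", "北京烤鸭", "サーモン"]
batches = list(split_into_batches(records, 1))
print(len(batches))
# The batch count equals the record count divided by batch_size, rounded up.
print(math.ceil(len(records) / 1))
```

With the default batch_size of 1000000, the same three records would be inserted as a single batch.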

Next, change mode_insert to append and run the command again. Check the result in the public database. The product container now has 6 records instead of 3, because the append mode inserts the new data into the existing container without deleting anything:

id	name	price	deleted	join_date	end_date
1	Café	10000.0	false	2020-07-14T02:00:00.000+09:00	20150712UTC+06
2	北京烤鸭	3.14159265358979	true	2020-07-14T02:00:00.000+09:00	20150712UTC+06
3	サーモン	0.66666666666666	false	2020-07-14T02:00:00.000+09:00	20150712UTC+06
1	Café	10000.0	false	2020-07-14T02:00:00.000+09:00	20150712UTC+06
2	北京烤鸭	3.14159265358979	true	2020-07-14T02:00:00.000+09:00	20150712UTC+06
3	サーモン	0.66666666666666	false	2020-07-14T02:00:00.000+09:00	20150712UTC+06
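
The difference between the two insert modes can be summarized with an in-memory model. In this Python sketch a plain list stands in for the product container; `load` is a hypothetical function and does not talk to GridDB.

```python
def load(container, rows, mode_insert="append"):
    """Model the two insert modes on a list that stands in for a container."""
    if mode_insert == "replace":
        container.clear()  # delete existing rows but keep the container itself
    elif mode_insert != "append":
        raise ValueError(f"unsupported mode_insert: {mode_insert}")
    container.extend(rows)
    return container


rows = [(1, "Café"), (2, "北京烤鸭"), (3, "サーモン")]
product = []
load(product, rows, "replace")
load(product, rows, "append")
print(len(product))  # 6
load(product, rows, "replace")
print(len(product))  # 3
```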

2 References

Embulk on GitHub - https://github.com/embulk/embulk
Embulk Homepage - https://www.embulk.org/
Embulk Documentation - https://www.embulk.org/docs/built-in.html#csv-parser-plugin