User Guide for GridDB Embulk Output Plugin

Revision: 1.5.0-9417-7fa03c1d

1 GridDB output plugin for Embulk

The GridDB output plugin for Embulk loads data records into a GridDB database.

1.1 Overview

The GridDB output plugin for Embulk transfers data from other databases (or data files) to a GridDB database.

The GridDB output plugin for Embulk can load any data format that the Embulk input plugin is able to read and parse.
  • Plugin type: output
  • Resume supported: no (if a transaction fails, delete all loaded data and run the job again)

Note: To learn more about the usage of Embulk, see https://github.com/embulk/embulk

1.2 Configuration

1.2.1 Configuration File

The GridDB output plugin for Embulk is configured through a configuration file (config.yaml) written in YAML format. A sample configuration file is given below; the "out:" section holds the settings of the GridDB output plugin:

in:
    type: file
    path_prefix: sample_file.csv
    parser:
        type: csv
        columns:
          - {name: column1, type: long }
          - {name: column2, type: string }
          - {name: column3, type: double }

out:
    type: griddb
    mode_cluster: PROVIDER
    provider_url: http://example.com/my_cluster.json
    cluster: myCluster
    database: myDB
    container: myContainer
    user: admin
    password: admin
    mode_insert: replace
    column_options:
        column1: { type: long    }
        column2: { type: string  }
        column3: { type: double  }

Note: The "in:" section in the sample above corresponds to the Embulk input plugin. For more information about the Embulk configuration, refer to https://www.embulk.org/docs/built-in.html

The following table describes all the options of the GridDB output plugin (options used in the "out:" section) where the value N/A means not applicable.

Key | Data Type | Accepted Value | Default Value | Description
type | String | griddb | (required) | The name of the Embulk output plugin. Specify the plugin name "griddb" which denotes the GridDB output plugin for Embulk.
mode_cluster | String | PROVIDER | PROVIDER | Mode of the cluster. Currently only supports "PROVIDER".
provider_url | String | N/A | (required) | URL of an address provider used in the PROVIDER method
cluster | String | N/A | (required) | The GridDB cluster name
database | String | public | (required) | The name of a GridDB database that stores data
container | String | N/A | (required) | The target container name. Also called table name.
mode_insert | Enumeration | append, replace | append | append: inserts data into an existing container. replace: deletes all existing data (but does not drop the container) and inserts new records.
user | String | N/A | (required) | The GridDB administrator username
password | String | N/A | (required) | The GridDB administrator password
batch_size | Integer | N/A | 1000000 | Size of a single batch for insertion. If the size of data records is greater than batch_size, those records are separated into smaller batches. Instead of inserting all data records at once, the GridDB output plugin splits records into N_records/batch_size parts and inserts them on a per batch basis.
default_timezone | String | N/A | UTC | If the type of a column is TIMESTAMP and the Embulk type is string, column values are formatted as specified in default_timezone. The timezone for each column can be overwritten by using the column_options option.
column_options | Object | N/A | N/A | A map of source column types and target column types.
column_options/column_name | String | N/A | N/A | Column name.
column_options/column_name/type | Enumeration | string, long, double, float, boolean, timestamp | (same as the input type) | Column data type
column_options/column_name/timestamp_format | String | N/A | N/A | If the input type is timestamp, and the output type is string, this timestamp_format is used to format the string value. For further information about timestamp format, see https://docs.oracle.com/javase/8/docs/api/index.html?java/text/SimpleDateFormat.html
column_options/column_name/timezone | String | N/A | (same as default_timezone) | If the input type is timestamp and the output type is string, the timezone is appended after the timestamp value.
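
The option table above can be read as a small set of rules: some keys are required, some fall back to defaults, and some accept only listed values. The following Python sketch illustrates those rules on an "out:" section that has already been loaded into a dictionary. It is not part of the plugin; the function name check_out_section and the constant names are hypothetical, and the checks only mirror what the table states.

```python
# Hypothetical helper, not part of embulk-output-griddb.
# It mirrors the option table: required keys, defaults, accepted values.

# Keys marked "(required)" in the table ("database" is listed as required there).
REQUIRED = ("type", "provider_url", "cluster", "database",
            "container", "user", "password")

# Default values listed in the table.
DEFAULTS = {
    "mode_cluster": "PROVIDER",
    "mode_insert": "append",
    "batch_size": 1000000,
    "default_timezone": "UTC",
}

# Options whose accepted values are enumerated in the table.
ALLOWED = {
    "mode_cluster": {"PROVIDER"},
    "mode_insert": {"append", "replace"},
}


def check_out_section(out):
    """Return the "out:" options with defaults filled in, or raise ValueError."""
    missing = [key for key in REQUIRED if key not in out]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    if out["type"] != "griddb":
        raise ValueError('type must be "griddb"')

    merged = dict(DEFAULTS)
    merged.update(out)

    for key, allowed in ALLOWED.items():
        if merged[key] not in allowed:
            raise ValueError("unsupported value for " + key + ": " + str(merged[key]))
    if not isinstance(merged["batch_size"], int) or merged["batch_size"] < 1:
        raise ValueError("batch_size must be a positive integer")
    return merged


sample = {
    "type": "griddb",
    "provider_url": "http://example.com/my_cluster.json",
    "cluster": "myCluster",
    "database": "myDB",
    "container": "myContainer",
    "user": "admin",
    "password": "admin",
    "mode_insert": "replace",
}
options = check_out_section(sample)
print(options["mode_cluster"], options["batch_size"])
```

Running such a check before starting Embulk is only a convenience; the plugin itself remains the authority on which configurations it accepts.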

1.2.2 Environment Preparation

  1. Download Embulk from https://github.com/embulk/embulk/releases/

    Grant execute permission to the embulk.jar file:

    $ chmod +x ./embulk.jar
    
  2. Install the JDK 1.8.0 (Development Version).

    • RHEL-like (Red Hat, CentOS, Fedora, etc.):

      # yum install java-1.8.0-openjdk-devel

    • Debian-like (Debian, Ubuntu, Linux Mint, etc.):

      # apt-get install openjdk-8-jdk

  3. Change the working directory to the embulk-output-griddb directory. Run the command:

    $ ./gradlew package
    

Note: If ./gradlew does not have execute permission, run the command chmod +x ./gradlew.

Now, the plugin is ready to use.

1.3 Example

Prepare the following files and directory:

test
├── product.csv
├── config.yaml
└── embulk-output-griddb
  • product.csv
"id","name","price","deleted","join_date","end_date"
"1","Café","10000.000000000000001","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"2","北京烤鸭","3.14159265358979","true","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
"3","サーモン","0.66666666666666","false","2020-07-14 00:00:00.000 +07:00","2015-07-12 15:00:00"
  • config.yaml
in:
    type: file
    path_prefix: product.csv
    parser:
        charset: UTF-8
        newline: LF
        type: csv
        delimiter: ','
        quote: '"'
        trim_if_not_quoted: false
        skip_header_lines: 1
        allow_extra_columns: false
        allow_optional_columns: false
        columns:
          - {name: id,       type: string }
          - {name: name,     type: string }
          - {name: price,    type: string }
          - {name: deleted,  type: string }
          - {name: join_date, type: string }
          - {name: end_date,  type: timestamp, format: "%Y-%m-%d %H:%M:%S" }

out:
    type: griddb
    mode_cluster: PROVIDER
    provider_url: http://example.com/api/griddb/mycluster.json
    cluster: mycluster
    database: public
    container: product
    user: admin
    password: admin
    mode_insert: replace
    batch_size: 1
    column_options:
        id:         { type: long      }
        name:       { type: string    }
        price:      { type: double    }
        deleted:    { type: boolean   }
        join_date:  { type: timestamp }
        end_date:   { type: string, timestamp_format: yyyyMMdd, timezone: UTC+06 }

Note: Change the values of provider_url and cluster above to suitable values.

Run the plugin:

$ ./embulk.jar run -L ./embulk-output-griddb/ config.yaml

Observe the result in the public database. A container named product was created:

id | name | price | deleted | join_date | end_date
1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06

In this example, the input and output types of each column are defined as follows:

Column | Input Type | Output Type
id | string | long
name | string | string
price | string | double
deleted | string | boolean
join_date | string | timestamp
end_date | timestamp | string

It is preferable to set the same input type and output type for the columns id, name, price, deleted and join_date. Even when they differ, the embulk-output-griddb plugin can convert each value into an appropriate output type based on the information about the output destination. The end_date column demonstrates custom timestamp formatting when the input type is timestamp and the output type is string: because column_options defines timestamp_format as yyyyMMdd and timezone as UTC+06, the end_date column contains 20150712UTC+06.
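
The two date columns in the result can be reproduced by hand. The Python sketch below is an illustration only, not the plugin's code (the plugin follows Java's SimpleDateFormat patterns). It rests on three assumptions: the end_date input carries no offset and is treated as UTC, the value is shifted to +06:00 before yyyyMMdd is applied, and the +09:00 offset shown for join_date is simply the zone in which the result happened to be displayed. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def format_end_date(text):
    """Timestamp input -> string output, as configured for end_date."""
    # Input format from the "in:" section: "%Y-%m-%d %H:%M:%S".
    # Assumption: a value without an offset is interpreted as UTC.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Assumption: the value is shifted to UTC+06 before formatting.
    shifted = parsed.astimezone(timezone(timedelta(hours=6)))
    # yyyyMMdd corresponds to %Y%m%d; the timezone text is appended after it.
    return shifted.strftime("%Y%m%d") + "UTC+06"


def convert_join_date(text, display_offset_hours=9):
    """String input -> timestamp output, shown in a +09:00 display zone."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f %z")
    shown = parsed.astimezone(timezone(timedelta(hours=display_offset_hours)))
    return shown.isoformat(timespec="milliseconds")


print(format_end_date("2015-07-12 15:00:00"))
print(convert_join_date("2020-07-14 00:00:00.000 +07:00"))
```

With the sample data, 15:00 UTC becomes 21:00 at +06:00 on the same day, so the formatted date does not change; a value close to midnight would show whether the shift is applied.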

Now, make sure batch_size is set to 1 in config.yaml (as in the sample above), run the command again, and observe the output on the terminal. Notice that the data is split into three batches:

[INFO] (0015:task-0000): put rows:0-1
[INFO] (0015:task-0000): convert STRING:Café
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:1-1
[INFO] (0015:task-0000): put rows:1-2
[INFO] (0015:task-0000): convert STRING:北京烤鸭
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:2-2
[INFO] (0015:task-0000): put rows:2-3
[INFO] (0015:task-0000): convert STRING:サーモン
[INFO] (0015:task-0000): convert TIMESTAMP:2020-07-14 00:00:00.000
[INFO] (0015:task-0000): convert STRING:2015-07-12 15:00:00
[INFO] (0015:task-0000): put rows success:3-3
[INFO] (0015:task-0000): Finish page: 3
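
The batching visible in the log above can be sketched in a few lines of Python. This is a simplified illustration, not the plugin's implementation; the function name split_into_batches is hypothetical, and the sketch assumes that a final, partially filled batch is still sent (so the number of batches is N_records / batch_size rounded up).

```python
def split_into_batches(records, batch_size):
    """Split records into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [records[start:start + batch_size]
            for start in range(0, len(records), batch_size)]


rows = ["Café", "北京烤鸭", "サーモン"]

# batch_size: 1 gives one batch per row, as in the log above.
for number, batch in enumerate(split_into_batches(rows, 1), start=1):
    print("batch", number, batch)

# With the default batch_size of 1000000, all three rows fit in one batch.
print(len(split_into_batches(rows, 1000000)))
```

A small batch_size is useful for observing the behavior, as in this example; for real loads a larger value means fewer round trips to the cluster.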

Next, change mode_insert to "append" and run the command again. Observe the result in the public database. The product container now has 6 records instead of 3 because the append mode inserts additional data into the existing container:

id | name | price | deleted | join_date | end_date
1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
1 | Café | 10000.0 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
2 | 北京烤鸭 | 3.14159265358979 | true | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
3 | サーモン | 0.66666666666666 | false | 2020-07-14T02:00:00.000+09:00 | 20150712UTC+06
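
The difference between the two mode_insert values can be modeled with an in-memory list standing in for the container. This Python sketch is a toy model under that assumption and does not talk to GridDB; the function name load_rows is hypothetical.

```python
def load_rows(container, rows, mode_insert="append"):
    """Apply one load run to a list that stands in for a container."""
    if mode_insert == "replace":
        # replace: delete all existing rows but keep the container itself.
        container.clear()
    elif mode_insert != "append":
        raise ValueError("mode_insert must be append or replace")
    # Both modes then insert the new rows.
    container.extend(rows)
    return container


product = []
rows = [(1, "Café"), (2, "北京烤鸭"), (3, "サーモン")]

load_rows(product, rows, "replace")
print(len(product))  # row count after the first run with replace

load_rows(product, rows, "append")
print(len(product))  # row count after a second run with append
```

Running the job once more with replace would bring the container back to three rows, which is why replace is the safer choice when a failed load has to be repeated.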

2 References

  1. Embulk on GitHub - https://github.com/embulk/embulk
  2. Embulk Homepage - https://www.embulk.org/
  3. Embulk Documentation - https://www.embulk.org/docs/built-in.html#csv-parser-plugin