bireme

Chinese documentation

Getting Started Guide

Bireme is an incremental synchronization tool for the Greenplum / HashData data warehouse. It currently supports MySQL, PostgreSQL and MongoDB data sources.

Greenplum is an advanced, fully featured open source data warehouse that provides powerful, fast analytics on petabyte-scale data volumes. It is geared toward big data analytics and is powered by the world's most advanced cost-based query optimizer, which delivers high query performance on large data volumes.

HashData is a flexible cloud data warehouse built on Greenplum.

Bireme uses DELETE + COPY to synchronize the modification records of the data source to Greenplum / HashData. This mode is faster than applying the changes with INSERT + UPDATE + DELETE.

Features and Constraints:

  • Uses small-batch loading to improve synchronization performance. The default load delay is 10 seconds.
  • All tables must have primary keys in the target database.

1.1 Data Flow

[Figure: data flow]

Bireme can synchronize multiple data sources at once. It reads records from the data sources in parallel and loads them into the target database.

1.2 Data Source

1.2.1 Maxwell + Kafka

Maxwell + Kafka is a data source type that bireme currently supports. The structure is as follows:

[Figure: maxwell]

  • Maxwell is an application that reads MySQL binlogs and writes row updates to Kafka as JSON.

1.2.2 Debezium + Kafka

Debezium + Kafka is another data source type that bireme currently supports. The structure is as follows:

[Figure: debezium]

  • Debezium is a distributed platform that turns your existing databases into event streams, so that applications can see and respond immediately to each row-level change in the databases.

1.3 How does bireme work

Bireme reads records from the data source and dispatches them to separate pipelines. In each pipeline, bireme converts the records into an internal format and caches them. When the cache reaches a certain size, the cached records are merged into a task. Each task contains two collections, a delete collection and an insert collection. Finally, the task is applied to the target database.

Each data source may have several pipelines. For Maxwell, each Kafka partition corresponds to a pipeline; for Debezium, each Kafka topic corresponds to a pipeline.

[Figure: bireme]

The following picture depicts how change data is processed in a pipeline.

[Figure: pipeline]
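
The merge step in this section can be pictured with a short sketch. This is not bireme's own code: the `Change` type, its `op` values, and `merge_changes` are hypothetical names, and the sketch assumes every change record carries the primary key of the row it touches. It folds an ordered batch of changes into the two collections of a task, the keys to delete and the rows to insert, so that the target can be updated with one DELETE followed by one COPY.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass
class Change:
    """Hypothetical internal record for one row-level change."""
    op: str                     # "insert", "update" or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row image; None for deletes


def merge_changes(changes: Iterable[Change]) -> Tuple[Set[int], Dict[int, dict]]:
    """Fold an ordered batch of changes into (delete set, insert set)."""
    deletes: Set[int] = set()
    inserts: Dict[int, dict] = {}
    for change in changes:
        if change.op in ("update", "delete"):
            # The old version of the row must be removed from the target,
            # and any row queued earlier in this batch is now stale.
            deletes.add(change.key)
            inserts.pop(change.key, None)
        if change.op in ("insert", "update"):
            # Only the latest image of each key is kept for loading.
            inserts[change.key] = change.row
        elif change.op != "delete":
            raise ValueError(f"unknown operation: {change.op}")
    return deletes, inserts


batch = [
    Change("insert", 1, {"id": 1, "v": "a"}),
    Change("update", 1, {"id": 1, "v": "b"}),
    Change("insert", 2, {"id": 2, "v": "c"}),
    Change("delete", 2),
    Change("delete", 3),
]
deletes, inserts = merge_changes(batch)
# deletes == {1, 2, 3}; inserts == {1: {"id": 1, "v": "b"}}
```

Because the delete set is applied before the insert set, each key ends up with at most one row in the target, which is why the sketch needs a primary key on every table.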

1.4 Introduction to configuration files

The configuration files consist of two parts:

  • Basic configuration file: The default is config.properties, which contains the basic configuration of bireme.
  • Table mapping file: <source_name>.properties. Each data source has its own file, which lists the tables to be synchronized and the corresponding tables in the target database. <source_name> is specified in the config.properties file.

1.4.1 config.properties

Required parameters

| Parameters | Description |
| --- | --- |
| target.url | Address of the target database. Format: jdbc:postgresql://<ip>:<port>/<database> |
| target.user | The user name used to connect to the target database |
| target.passwd | The password used to connect to the target database |
| data.source | Specify the data source, which is <source_name>, with multiple data sources separated by commas, ignoring whitespace |
| <source_name>.type | Specify the type of data source, for example maxwell |

Note: The data source name is just a label for convenience. It can be changed as needed.

Parameters for Maxwell data source

| Parameters | Description |
| --- | --- |
| <source_name>.kafka.server | Kafka address. Format: <ip>:<port> |
| <source_name>.kafka.topic | Corresponding topic of data source |
| <source_name>.kafka.groupid | Kafka consumer group id. Default value is bireme |

Parameters for Debezium data source

| Parameters | Description |
| --- | --- |
| <source_name>.kafka.server | Kafka address. Format: <ip>:<port> |
| <source_name>.kafka.groupid | Kafka consumer group id. Default value is bireme |
| <source_name>.kafka.namespace | Debezium's name. |

Other parameters

| Parameters | Description | Default |
| --- | --- | --- |
| pipeline.thread_pool.size | Thread pool size for Pipeline | 5 |
| transform.thread_pool.size | Thread pool size for Transform | 10 |
| merge.thread_pool.size | Thread pool size for Merge | 10 |
| merge.interval | Maximum interval between Merges, in milliseconds | 10000 |
| merge.batch.size | Maximum number of Rows in one Merge | 50000 |
| loader.conn_pool.size | Number of connections to the target database, which is less than or equal to the number of Change Loaders | 10 |
| loader.task_queue.size | The length of the task queue in each Change Loader | 2 |
| metrics.reporter | Monitoring mode, either console or jmx. Set to none if you do not need monitoring | jmx |
| metrics.reporter.console.interval | Time interval between metrics outputs, in seconds. Only takes effect when metrics.reporter is console | 10 |
| state.server.port | Port for the state server | 8080 |
| state.server.addr | IP address for the state server | 0.0.0.0 |
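
To see how the parameters above fit together, here is a minimal sketch that reads a small config.properties text, checks the required parameters, and fills in the documented defaults. It is an illustration only, not how bireme loads its configuration. The source name `mysql1`, the addresses, the credentials, and the topic name are hypothetical values, and `parse_properties` and `load_config` are hypothetical helpers that handle only simple `key = value` lines.

```python
REQUIRED = ("target.url", "target.user", "target.passwd", "data.source")

# Defaults copied from the "Other parameters" table.
DEFAULTS = {
    "pipeline.thread_pool.size": "5",
    "transform.thread_pool.size": "10",
    "merge.thread_pool.size": "10",
    "merge.interval": "10000",
    "merge.batch.size": "50000",
    "loader.conn_pool.size": "10",
    "loader.task_queue.size": "2",
    "metrics.reporter": "jmx",
    "metrics.reporter.console.interval": "10",
    "state.server.port": "8080",
    "state.server.addr": "0.0.0.0",
}

# All values below are made up for the example.
SAMPLE = """\
target.url = jdbc:postgresql://127.0.0.1:5432/postgres
target.user = gpadmin
target.passwd = secret
data.source = mysql1
mysql1.type = maxwell
mysql1.kafka.server = 127.0.0.1:9092
mysql1.kafka.topic = maxwell
merge.interval = 5000
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def load_config(text):
    """Return (properties with defaults applied, list of data source names)."""
    props = {**DEFAULTS, **parse_properties(text)}
    missing = [key for key in REQUIRED if key not in props]
    # data.source is a comma-separated list; whitespace is ignored.
    sources = [s.strip() for s in props.get("data.source", "").split(",") if s.strip()]
    missing += [f"{s}.type" for s in sources if f"{s}.type" not in props]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    for s in sources:
        # The consumer group id defaults to "bireme".
        props.setdefault(f"{s}.kafka.groupid", "bireme")
    return props, sources


props, sources = load_config(SAMPLE)
```

In this sample only merge.interval overrides a default; every other optional parameter keeps the value listed in the table.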

1.4.2 <source_name>.properties

In the configuration file for each data source, list each table that the data source includes, together with the corresponding table in the target database.

<OriginTable_1> = <MappedTable_1>
<OriginTable_2> = <MappedTable_2>
...
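
As a rough illustration of this format, the sketch below parses a mapping file into a dictionary from origin table to mapped table. The table names `demo.orders`, `demo.users`, `public.orders`, and `public.users` are hypothetical, and `parse_table_mapping` is a hypothetical helper, not part of bireme.

```python
# Hypothetical mapping file content for one data source.
SAMPLE_MAPPING = """\
demo.orders = public.orders
demo.users = public.users
"""


def parse_table_mapping(text):
    """Map each origin table to its table in the target database."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        origin, sep, mapped = line.partition("=")
        origin, mapped = origin.strip(), mapped.strip()
        if not sep or not origin or not mapped:
            raise ValueError(f"line {number}: expected '<origin> = <mapped>'")
        mapping[origin] = mapped
    return mapping


mapping = parse_table_mapping(SAMPLE_MAPPING)
# mapping == {"demo.orders": "public.orders", "demo.users": "public.users"}
```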

1.5 Monitoring

HTTP Server

Bireme starts a lightweight HTTP server that reports the current load state.

Once the HTTP server is running, the following endpoints are exposed:

| Endpoint | Description |
| --- | --- |
| / | Get the load state for all data sources. |
| /<data source> | Get the load state for the given data source. |

The result is returned in JSON format. Adding the pretty parameter prints the result in a human-readable layout.
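
A small client for these endpoints might look like the sketch below. It assumes the state server listens on the default state.server.port of 8080 and that pretty is passed as a bare query-string parameter (`?pretty`); the second point is an assumption that the text above does not confirm. `state_url` and `fetch_state` are hypothetical helper names.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen


def state_url(host: str, port: int = 8080,
              source: Optional[str] = None, pretty: bool = False) -> str:
    """Build the URL for all data sources ("/") or one ("/<data source>")."""
    path = "/" + (quote(source, safe="") if source else "")
    # Assumption: "pretty" is sent as a query-string parameter without a value.
    query = "?pretty" if pretty else ""
    return f"http://{host}:{port}{path}{query}"


def fetch_state(url: str, timeout: float = 5.0):
    """Fetch the load state and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


all_sources_url = state_url("localhost")
one_source_url = state_url("localhost", source="mysql1", pretty=True)
```

Calling `fetch_state(one_source_url)` against a running instance would return the decoded load state for the hypothetical data source `mysql1`.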

Example

The following is an example of Load State:

{
  "source_name": "XXX",
  "type": "XXX",
  "pipelines": [
    {
      "name": "XXXXXX",
      "latest": "yyyy-MM-ddTHH:mm:ss.SSSZ",
      "delay": XX.XXX,
      "state": "XXXXX"
    },
    {
      "name": "XXXXXX",
      "latest": "yyyy-MM-ddTHH:mm:ss.SSSZ",
      "delay": XX.XXX,
      "state": "XXXXX"
    }
  ]
}
  • source_name is the name of the queried data source, as designated in the configuration file.
  • type is the type of the data source.
  • pipelines is an array in which each element corresponds to a pipeline. (A data source may have several separate pipelines.)
  • name is the pipeline's name.
  • latest is the production time of the most recent change data that has been successfully loaded into HashData.
  • delay is the time that change data takes from entering bireme to being committed to the target database.
  • state is the pipeline's state.
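
The fields above lend themselves to a simple lag check. The sketch below parses a load state document and lists the pipelines whose delay exceeds a threshold. Every value in `SAMPLE_STATE` is made up for the example, including the pipeline names and the state strings, and the threshold is expressed in whatever unit the delay field uses. `slow_pipelines` is a hypothetical helper.

```python
import json

# Hypothetical load state for one data source; all values are illustrative.
SAMPLE_STATE = """
{
  "source_name": "mysql1",
  "type": "maxwell",
  "pipelines": [
    {"name": "pipeline-0", "latest": "2018-01-01T00:00:00.000Z",
     "delay": 0.8, "state": "OK"},
    {"name": "pipeline-1", "latest": "2018-01-01T00:00:00.000Z",
     "delay": 42.5, "state": "OK"}
  ]
}
"""


def slow_pipelines(state: dict, max_delay: float) -> list:
    """Return the names of pipelines whose delay is above max_delay."""
    return [p["name"] for p in state["pipelines"] if p["delay"] > max_delay]


state = json.loads(SAMPLE_STATE)
lagging = slow_pipelines(state, max_delay=10.0)
# lagging == ["pipeline-1"]
```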

1.6 Reference

Maxwell Reference
Debezium Reference
Kafka Reference