DMT is an end-to-end automation of data warehouse migration, focused on DDL extraction, SQL translation, data migration and data validation. Its main goal is to reduce migration delivery time.


Data Migration Tool Introduction

Why you want to use the Data Migration Tool

As customers migrate from on-premises enterprise data warehouses to the Cloud, and specifically to BigQuery, there is no single end-to-end migration path or solution that makes the move easy and scalable. Several tools are available for SQL translation, data transfer and data validation, but each must be configured, run and managed separately, which adds time, complexity and risk to the data warehouse migration effort.

What the Data Migration Tool provides

The goal of this tool is to provide an end-to-end automated solution for customers to transfer data and to translate and validate the DDL, DML and SQL queries needed to support data warehouse migrations to BigQuery. The solution is built with Google Cloud tools and services such as the Data Validation Tool, the BigQuery Translation Service, the BigQuery Data Transfer Service and Cloud Composer orchestrated DAGs. It should help accelerate migrations, reduce risk and improve time to value.

Supported Sources

Sources  | Translation | Schema Migration | Data Transfer (Bulk Load) | Incremental Load | Validation
Teradata | Yes         | Yes              | Yes                       | No               | Yes
Hive     | Yes         | Yes              | Yes                       | Yes              | Yes
Redshift | Yes         | Yes              | Yes                       | No               | Yes
Oracle   | Yes         | Yes              | No                        | No               | Yes

User Story

The Data Migration Tool (DMT) is an end-to-end orchestrated solution for migrating a data warehouse to BigQuery. It helps users translate, create and validate tables, and transfer data from on-premises data warehouses such as Teradata and Hive to native BigQuery tables. These operations include:

  1. Automated extraction of DDLs
  2. DDL translation, schema migration and schema validation
  3. Data migration and validation
  4. DML translation and dry run validation
  5. SQL translation and validation (#5 can be triggered only after #3)
  6. Looker Studio dashboard templates for monitoring the migration steps listed above

Extract and translate DDL from on-prem data warehouses into BigQuery DDL syntax.

  1. Users copy their on-premises DDL files (manually or via an automated method), along with a config file, into a GCS bucket.
  2. The config file upload triggers Pub/Sub > Cloud Run > Cloud Composer > the BQ Translation API for DDL translation.
  3. The translated DDL will land in the GCS bucket (see the sketch below).
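
For orientation, here is a minimal sketch of step 1 using the google-cloud-storage Python client. The bucket name, object paths and config keys are illustrative assumptions, not the tool's actual configuration schema.

```python
# Minimal sketch: upload DDL files and a trigger config to GCS.
# Bucket name, paths, and config keys are illustrative assumptions,
# not the Data Migration Tool's actual configuration schema.
import json

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dmt-trigger-bucket")  # hypothetical bucket name

# Upload the extracted on-prem DDL file.
bucket.blob("input/ddl/orders.sql").upload_from_filename("orders.sql")

# Upload a config file; its arrival (OBJECT_FINALIZE) starts the pipeline:
# Pub/Sub -> Cloud Run -> Cloud Composer -> BQ Translation API.
config = {
    "type": "ddl",                      # hypothetical keys, for illustration only
    "source_dialect": "teradata",
    "gcs_source_path": "gs://dmt-trigger-bucket/input/ddl/",
    "gcs_target_path": "gs://dmt-trigger-bucket/output/ddl/",
}
bucket.blob("config/ddl_config.json").upload_from_string(
    json.dumps(config), content_type="application/json"
)
```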

Create tables in BigQuery.

  1. Cloud Composer will create BQ tables using the translated DDLs (sketched below).
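
As a rough illustration of what this step amounts to (not the actual DAG code), a translated DDL file can be read from GCS and executed as a BigQuery query job:

```python
# Rough sketch of the table-creation step: read a translated DDL file from GCS
# and execute it in BigQuery. Bucket and object names are illustrative.
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

# Hypothetical path to one translated DDL file produced by the translation step.
ddl_text = (
    storage_client.bucket("dmt-trigger-bucket")
    .blob("output/ddl/orders.sql")
    .download_as_text()
)

# BigQuery DDL (e.g. CREATE TABLE) is executed as a query job.
bq_client.query(ddl_text).result()
```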

Validate Schema in BigQuery

  1. A Composer DAG will validate the schema in BQ.
  2. Data validation results will be written to BQ audit logging tables.

Transfer data from on-premise to BigQuery.

  1. Choose one of the supported ways to upload your data to BQ. The Data Migration Tool provides data transfer via DTS for Teradata. For Hive, the current prerequisite is that the data files are already present in GCS buckets; Composer DAGs then load the data into BQ (sketched below).
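
The sketch below illustrates the Hive path under the stated prerequisite: data files already staged in GCS are loaded into a BigQuery table with a load job. The URIs, table ID and file format are assumptions.

```python
# Minimal sketch of the Hive-style bulk load: data files already staged in GCS
# are loaded into a BigQuery table. URIs, table IDs and file format are
# illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # e.g. Parquet files exported from Hive
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://dmt-data-bucket/hive/orders/*.parquet",   # hypothetical staging path
    "my-project.sales_dataset.orders",              # hypothetical target table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```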

Translate SQL SELECT statements into BigQuery syntax.

  1. Users copy their on-premises SQL files, along with a config file, into a GCS bucket.
  2. The config file automatically triggers Pub/Sub > Cloud Run > Cloud Composer > the BQ Translation API for SQL translation.
  3. The translated SQL and/or failed translation files will land in the GCS bucket.

Validate SQL SELECT statements in BigQuery.

  1. Data should be in BQ before this step.
  2. A Composer DAG will validate the SELECT statement in BQ.
  3. Validation results will be written into BQ.

Translate DML statements into BigQuery syntax.

  1. This step has no dependency on the step "Translate SQL SELECT statements into BigQuery syntax".
  2. This is a dry-run feature that translates the provided DML queries from the Teradata/Oracle/Redshift dialect to the BigQuery dialect, similar to the DDL and SQL translations.

Validate DML statements in BigQuery.

  1. After the DML statements are translated into the BQ dialect, the tool uses the BigQuery dry run feature to test the execution of the DML files (dry run only) and confirm that the translated queries would run without errors on the BQ side (sketched below).
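
A minimal sketch of the dry-run check using the BigQuery Python client's dry_run option; the project, dataset and DML statement are placeholders.

```python
# Minimal sketch of the dry-run check: submit a translated DML statement with
# dry_run=True so BigQuery validates it without executing it or billing.
from google.cloud import bigquery

client = bigquery.Client()
translated_dml = (
    "UPDATE `my-project.sales_dataset.orders` SET status = 'done' WHERE id = 1"
)

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(translated_dml, job_config=job_config)

# If the statement is invalid, client.query() raises an error; otherwise the
# dry run reports how many bytes a real run would process.
print(f"Dry run OK, would process {job.total_bytes_processed} bytes")
```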

Design

The following design proposes how to orchestrate an end-to-end pipeline to run Translation, Migration and Validation for enterprise data warehouse migration to BigQuery.

Overview

(Architecture overview diagram)

Controller

(Controller diagram)

DAG Overview

(DAG overview diagram)

Infrastructure/Services

The following products/tools are used to build an end-to-end orchestrated pipeline that both migrates a data warehouse and validates that the migration was successful:

Activity             | Component
Orchestration        | Cloud Composer
Event-based Triggers | Cloud Run, Cloud Pub/Sub
SQL Translation      | BigQuery Batch SQL Translation Service
SQL Validation       | Data Validation Tool (DVT)
Data Transfer        | BigQuery Data Transfer Service (Teradata); Composer GCS-to-BQ load DAGs (Hive)
Infrastructure       | Terraform/Terragrunt
Assets Storage       | BigQuery, Cloud Storage
Secrets              | Secret Manager
Results/Summary      | Looker Studio

Cloud Composer

Cloud Composer / Airflow connects the different components together in a DAG; a file sensor can detect new file arrival events in GCS.

The Composer instance should be created with the following characteristics:

  • Version 2
  • Private endpoint
  • Custom service account (not default compute service account)
  • Workload Identity binding GCP service account to ‘default’ GKE namespace

Refer to this example of using the Google-provided Composer V2 Terraform module as a starting point.
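
As a rough illustration of the file-sensor pattern mentioned above (not the tool's actual DAG code), an Airflow DAG on Composer 2 can wait for a config object to appear in GCS before starting downstream tasks; the DAG id, bucket and object names are assumptions.

```python
# Rough illustration of the GCS file-sensor pattern (not the tool's actual DAG):
# wait for a config object to appear in a bucket, then start downstream work.
# Assumes Airflow 2.4+ as shipped with current Composer 2 versions.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="wait_for_translation_config",     # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    wait_for_config = GCSObjectExistenceSensor(
        task_id="wait_for_config",
        bucket="dmt-trigger-bucket",           # hypothetical bucket
        object="config/ddl_config.json",       # hypothetical object
    )
    start_translation = EmptyOperator(task_id="start_translation")

    wait_for_config >> start_translation
```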

Cloud Run

Cloud Run will intercept GCS OBJECT_FINALIZE events and then decide whether to trigger Composer DAGs. The Cloud Run endpoint should be created with the following characteristics:

  • Triggered via Cloud Storage Object Finalize events
  • Private network; Internal traffic only
  • Custom service account (not default compute service account and not the same as custom Composer service account)
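
The sketch below shows the general shape of such an endpoint, assuming a Pub/Sub push subscription that wraps the GCS OBJECT_FINALIZE notification: it decodes the event and applies a simple, purely illustrative rule to decide whether a DAG run should be started. The actual DAG-trigger call against the Composer environment's Airflow REST API is stubbed out.

```python
# Hedged sketch of the Cloud Run handler: receive the Pub/Sub push message that
# wraps a GCS OBJECT_FINALIZE notification and decide whether a Composer DAG
# should be triggered. The trigger call itself is stubbed out.
import base64
import json

from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def handle_gcs_event():
    envelope = request.get_json()
    message = envelope["message"]
    event = json.loads(base64.b64decode(message["data"]).decode("utf-8"))

    bucket, name = event["bucket"], event["name"]

    # Only config-file uploads should start a pipeline run (illustrative rule).
    if name.startswith("config/") and name.endswith(".json"):
        trigger_composer_dag(bucket, name)

    return ("", 204)


def trigger_composer_dag(bucket: str, name: str) -> None:
    # Placeholder: the real service would authenticate to the Composer
    # environment's Airflow web server and POST a dagRun via its REST API.
    print(f"Would trigger DAG for gs://{bucket}/{name}")
```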

Cloud Pub/Sub

Cloud Pub/Sub will serve as the message bus for the OBJECT_FINALIZE Cloud Storage events.

Pub/Sub invokes the Cloud Run endpoint, which in turn decides whether to trigger the Composer DAGs.

Cloud Storage

Cloud Storage will hold the following assets:

  • Input DDL and SQL files (to be translated into BigQuery SQL)
  • Configuration trigger file (holds settings for end-to-end translation pipeline)
  • Output BigQuery SQL files (Output from Batch SQL translator)

Batch SQL Translator

The Batch SQL Translator service converts a customer’s source SQL assets (DDLs, queries, UDFs, stored procedures) into BigQuery SQL assets.

The Batch SQL Translator tool supports translation of the following SQL dialects:

  • Amazon Redshift SQL
  • Apache HiveQL
  • Apache Spark SQL
  • Azure Synapse T-SQL
  • IBM Netezza SQL/NZPLSQL
  • Oracle SQL, PL/SQL, Exadata
  • Snowflake SQL
  • SQL Server T-SQL
  • Teradata SQL
  • Vertica SQL
  • Presto SQL
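
For orientation, the following sketch starts a Teradata-to-BigQuery batch translation with the BigQuery Migration API Python client; the project, location and GCS paths are illustrative, and the tool itself drives this service from its Composer DAGs rather than from standalone scripts.

```python
# Sketch of starting a batch SQL translation (Teradata -> BigQuery) with the
# BigQuery Migration API Python client. Project, location and GCS paths are
# illustrative assumptions.
from google.cloud import bigquery_migration_v2

client = bigquery_migration_v2.MigrationServiceClient()
parent = "projects/my-project/locations/us"

source_dialect = bigquery_migration_v2.Dialect(
    teradata_dialect=bigquery_migration_v2.TeradataDialect(
        mode=bigquery_migration_v2.TeradataDialect.Mode.SQL
    )
)
target_dialect = bigquery_migration_v2.Dialect(
    bigquery_dialect=bigquery_migration_v2.BigQueryDialect()
)

task = bigquery_migration_v2.MigrationTask(
    type_="Translation_Teradata2BQ",
    translation_config_details=bigquery_migration_v2.TranslationConfigDetails(
        gcs_source_path="gs://dmt-trigger-bucket/input/ddl",
        gcs_target_path="gs://dmt-trigger-bucket/output/ddl",
        source_dialect=source_dialect,
        target_dialect=target_dialect,
    ),
)

workflow = bigquery_migration_v2.MigrationWorkflow(display_name="dmt-ddl-translation")
workflow.tasks["translation-task"] = task

response = client.create_migration_workflow(
    request=bigquery_migration_v2.CreateMigrationWorkflowRequest(
        parent=parent, migration_workflow=workflow
    )
)
print(response.name)  # resource name of the created translation workflow
```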

BigQuery

The Data Validation Tool and the Composer DAGs need BigQuery datasets and tables for storing validation results and audit logs.

Data Validation Tool

The Data Validation Tool (DVT) is an open-source Python CLI tool that compares source tables with the equivalent migrated tables in BigQuery.

DVT (as it is used for this SQL migration service) has the following infrastructure dependencies:

  • Connector for source database
  • Connector for BigQuery as a target database
  • BigQuery dataset/tables for writing validation results
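
A minimal sketch of a DVT invocation, assuming the source and BigQuery connections (MY_TD_CONN, MY_BQ_CONN) have already been created with `data-validation connections add`; the connection names, table IDs and results table are placeholders.

```python
# Hedged sketch: invoke the DVT CLI from Python to compare a source table with
# its migrated BigQuery counterpart and write results to a BigQuery table.
# Connection names, table IDs and the results table are placeholders.
import subprocess

subprocess.run(
    [
        "data-validation", "validate", "column",
        "--source-conn", "MY_TD_CONN",
        "--target-conn", "MY_BQ_CONN",
        "--tables-list", "sales.orders=my-project.sales_dataset.orders",
        "--bq-result-handler", "my-project.dmt_logs.dvt_results",
    ],
    check=True,  # raise if the validation run itself fails
)
```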

Looker Studio

Results of operations and statistics are curated in Looker Studio.

A Looker Studio template is provided; customers can copy it, switch the data source, and point it at their own project -> dataset -> logging tables.

Operations include:

  • Summary result
  • DDL Translation results
  • Schema creation & validation results
  • Data migration & validation results
  • SQL Translation & validation results
  • DML Translation & validation results

Statistics include:

  • Total counts of successes and failures
  • Filters on error details, tables, SQL, DDL and DML files

Appendix

DAG Parallelization & Composer Sizing

Airflow config overrides can help optimize parallelization for the Schema and Validation DAGs. Please refer to the information below to size your Composer environments optimally, depending on the scaling required and the number of files to be translated and validated.

References:

  • Composer optimization documentation
  • Airflow configuration documentation

Composer 1.0 Sizing

We recommend setting up a Cloud Composer cluster with a relatively small number of powerful machines, keeping in mind that if the number of machines is too small, a failure of one machine will impact the cluster severely.


Airflow Configuration Override Parameter | Default Value                | Recommended Value
[scheduler] parsing_processes            | 2                            | >= number_of_cores_per_node (e.g. 2 x number_of_cores_per_node)
[celery] worker_concurrency              | 12 * number of workers' CPUs | 6-8 * cores_per_node or per 3.75 GB RAM (ignored if worker_autoscale is set)
[core] parallelism                       | 32                           | [worker_concurrency] * number of workers
[core] max_active_tasks_per_dag          | 100                          | Equal to [parallelism]
[core] default_pool_task_slot_count      | 128                          | Equal to [parallelism]
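
In Composer, these settings are supplied as section-option override pairs. A minimal sketch of the corresponding airflow_config_overrides mapping, assuming a hypothetical environment with 3 workers and 8 cores per node:

```python
# Illustrative airflow_config_overrides mapping for a Composer environment,
# using Composer's "section-option" key format. Values assume a hypothetical
# 3-worker environment with 8 cores per node; tune to your own sizing.
airflow_config_overrides = {
    "scheduler-parsing_processes": "16",         # >= cores per node (here 2 x 8)
    "celery-worker_concurrency": "48",           # ~6-8 x cores per node
    "core-parallelism": "144",                   # worker_concurrency x number of workers
    "core-max_active_tasks_per_dag": "144",      # equal to parallelism
    "core-default_pool_task_slot_count": "144",  # equal to parallelism
}
```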
