cassandra-data-migrator

Migrate and Validate Tables between Origin and Target Cassandra Clusters.

⚠️ Please note that this job has been tested with Spark version 2.4.8.

Container Image

Get the latest image that includes all dependencies from DockerHub
- If you use this route, all migration tools (cassandra-data-migrator + dsbulk + cqlsh) are available in the /assets/ folder of the container
OR follow the build steps below (and the Prerequisite section) to build the jar locally

Prerequisite

Install Java 8, as the Spark binaries are compiled with it.
Install Maven 3.8.x.
Install a single instance of Spark on the node where you want to run this job. Spark can be installed by running the following:

wget https://downloads.apache.org/spark/spark-2.4.8/
tar -xvzf <spark downloaded file name>

Build

Clone this repo
Move to the repo folder: cd cassandra-data-migrator
Run the build: mvn clean package
The fat jar (cassandra-data-migrator-2.x.x.jar) should now be present in the target folder
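
Because the version in the jar name changes between releases, it can help to resolve the file name rather than type it by hand. The following is a minimal sketch, assuming the build was run inside the repo folder and that the fat jar follows the cassandra-data-migrator-2.x.x.jar naming described above. The function name find_cdm_jar is hypothetical and not part of the tool.

```shell
# Hypothetical helper: print the path of the newest fat jar under <repo>/target.
# Assumes the jar is named cassandra-data-migrator-<version>.jar.
find_cdm_jar() {
  local repo_dir="${1:-.}"
  local jar
  # ls -t sorts newest first; errors are silenced when nothing matches.
  jar=$(ls -t "$repo_dir"/target/cassandra-data-migrator-*.jar 2>/dev/null | head -n 1)
  if [ -z "$jar" ]; then
    echo "No fat jar found under $repo_dir/target; did 'mvn clean package' succeed?" >&2
    return 1
  fi
  echo "$jar"
}
```

For example, `find_cdm_jar .` run from the repo folder prints the jar path, or returns a non-zero status if the build has not produced one.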

Steps for Data-Migration:

The sparkConf.properties file needs to be configured as applicable for the environment.

A sample Spark conf file configuration can be found here
Place the conf file where it can be accessed while running the job via spark-submit.
Run the job using the spark-submit command as shown below:

./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.Migrate cassandra-data-migrator-2.x.x.jar &> logfile_name.txt

Note: The above command redirects all output to the log file logfile_name.txt instead of printing it to the console.

Steps for Data-Validation:

To run the job in data validation mode, use the class option --class datastax.astra.migrate.DiffData as shown below

./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.DiffData cassandra-data-migrator-2.x.x.jar &> logfile_name.txt

The validation job reports differences as ERROR entries in the log file, as shown below

22/10/27 23:25:29 ERROR DiffJobSession: Missing target row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
22/10/27 23:25:29 ERROR DiffJobSession: Inserted missing row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
22/10/27 23:25:30 ERROR DiffJobSession: Mismatch row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam Data:  (Index: 8 Origin: Hello 3 Target: Hello 2 )
22/10/27 23:25:30 ERROR DiffJobSession: Updated mismatch row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam

Grep for all ERROR entries in the output log files to get the list of missing and mismatched records.
- Note that differences are listed by partition key values.
The validation job can also be run in an AutoCorrect mode. This mode can:
- Add any missing records from origin to target
- Fix any inconsistencies between origin and target (makes the target the same as the origin).
Enable or disable this feature using one or both of the settings below in the config file

spark.target.autocorrect.missing                    true|false
spark.target.autocorrect.mismatch                   true|false
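
To get a quick count of the differences reported by a validation run, the log can be summarized with grep. The following is a minimal sketch: the message prefixes are copied from the sample log output above and may be worded differently in other releases, and the function name summarize_diff_log is hypothetical.

```shell
# Sketch: count the differences reported by the validation job in a log file.
# The patterns below are taken from the sample log lines in this README.
summarize_diff_log() {
  local log_file="$1"
  local missing mismatch
  if [ ! -r "$log_file" ]; then
    echo "Cannot read log file: $log_file" >&2
    return 1
  fi
  # grep -c prints the number of matching lines; "|| true" keeps a zero count
  # from being treated as a failure when the caller uses "set -e".
  missing=$(grep -c 'ERROR DiffJobSession: Missing target row found for key:' "$log_file" || true)
  mismatch=$(grep -c 'ERROR DiffJobSession: Mismatch row found for key:' "$log_file" || true)
  echo "missing=$missing mismatch=$mismatch"
}
```

Calling `summarize_diff_log logfile_name.txt` prints a single line with both counts. Plain `grep ERROR logfile_name.txt` still gives the full list of entries, including the rows fixed by AutoCorrect.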

Migrating specific partition ranges

You can also use the tool to migrate specific partition ranges. To do so, use the class option --class datastax.astra.migrate.MigratePartitionsFromFile as shown below

./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-2.x.x.jar &> logfile_name.txt

When running in this mode, the tool expects a partitions.csv file in the current folder in the format below, where each line (min,max) represents a partition range

-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540

This mode is particularly useful for processing a subset of partition ranges that may have generated errors during a previous long-running job to migrate a large table.
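
Since a malformed partitions.csv would only surface once the job is running, a quick format check beforehand can save time. The following is a minimal sketch, assuming each line must contain exactly two integers separated by a comma and that a range with min greater than max is unintended (the tool's own handling of such a range is not documented here). The function names is_integer and check_partitions_file are hypothetical.

```shell
# Hypothetical helper: succeed only for an optionally negative decimal integer.
is_integer() {
  case "$1" in
    '' | '-') return 1 ;;     # empty string or a lone minus sign
    *[!0-9-]*) return 1 ;;    # any character other than a digit or minus
    ?*-*) return 1 ;;         # a minus sign anywhere but the first position
  esac
  return 0
}

# Sketch: report every line that is not of the form min,max with min <= max.
# Returns 0 when all lines look well formed, 1 otherwise.
# Assumes values fit in a signed 64-bit integer.
check_partitions_file() {
  local file="${1:-partitions.csv}"
  local line_no=0 bad=0 min max extra
  if [ ! -r "$file" ]; then
    echo "Cannot read partitions file: $file" >&2
    return 1
  fi
  # The "|| [ -n "$min" ]" part also processes a last line without a newline.
  while IFS=, read -r min max extra || [ -n "$min" ]; do
    line_no=$((line_no + 1))
    if [ -n "$extra" ] || ! is_integer "$min" || ! is_integer "$max"; then
      echo "line $line_no: expected min,max with integer values" >&2
      bad=1
    elif [ "$min" -gt "$max" ]; then
      echo "line $line_no: min is greater than max" >&2
      bad=1
    fi
  done < "$file"
  return "$bad"
}
```

Running `check_partitions_file partitions.csv` before spark-submit prints the offending line numbers to stderr and returns a non-zero status if any line does not match the expected format.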

Additional features

Counter tables
Preserve writetimes and TTL
Advanced DataTypes (Sets, Lists, Maps, UDTs)
Filter records from origin using writetimes, CQL conditions, token-ranges
Fully containerized (Docker and K8s friendly)
SSL Support (including custom cipher algorithms)
Migrate from any Cassandra origin (Apache Cassandra / DataStax Enterprise / DataStax Astra DB) to any Cassandra target (Apache Cassandra / DataStax Enterprise / DataStax Astra DB)
Validate migration accuracy and performance using a smaller randomized data-set
Custom writetime
