
Migrating our Hadoop Cluster with Apache DistCp

eranelbaz · Originally published at Medium · 3 min read

From an insecure cluster to a remote, secure cluster

What's Our Mission?

We need to migrate our entire Hadoop cluster from an insecure CDH 5.7.1 cluster to a secure CDH 5.13.1 cluster in a remote data center.

And we wanted to find a complete product for our use case.

The Product Requirements

  • Migrate our data hermetically, with end-to-end verification
  • Migrate our Apache Hive Metastore
  • Provide BI on the transfer process

So What's The Big Deal?

There are a couple of frameworks for handling this type of problem.

However, with each framework we encountered another drawback:

  • Cloudera BDR does not support migrating from an insecure cluster to a secure cluster
  • Apache Falcon does not support the Hadoop version Cloudera currently uses
  • Oracle told us that this is not the use case for Oracle GoldenGate and recommended Oracle Data Integrator instead
  • Our experience with Oracle Data Integrator wasn't that great…

So we found ourselves facing this mission without any framework that would work for us, and it was obvious we had to go back to the drawing board.

What About Apache DistCp With Some Python?

We always knew we could use Apache DistCp from an insecure to a secure cluster, but we wanted a complete product, tested and reviewed, that works; we didn't want to develop our own "BDR". Still, it became clear that we would have to build our own product.
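For reference, the basic DistCp invocation looks roughly like this (hostnames and paths here are hypothetical). It is run from the secure cluster, and the fallback flag lets the Kerberized client fall back to simple authentication when talking to the insecure source:

```shell
# Run from an edge node of the secure cluster.
# -pb preserves block size, -update skips files already in sync.
hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  -pb -update \
  hdfs://insecure-nn.example.com:8020/data/warehouse \
  hdfs://secure-nn.example.com:8020/data/warehouse
```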

Our Flow

Data Copy

  1. Scan the insecure HDFS recursively — we also need to know the sub-paths to synchronize: how many files do we have? What is each path's size?
  2. Save the HDFS scan results to a database table.
  3. Run the Hadoop DistCp command from the secure cluster.
  4. We want the entire process to run as our team's user, so we need to change the directory permissions on the target.
  5. Validate that each and every byte has been synchronized across the clusters.
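A minimal sketch of steps 1–2, assuming the scan shells out to `hdfs dfs -count` (whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) and that the resulting records are inserted into the scan table by some database layer not shown here:

```python
import subprocess

def parse_count_line(line):
    """Parse one line of `hdfs dfs -count` output:
    DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME."""
    dir_count, file_count, size, path = line.split(None, 3)
    return {"path": path, "dirs": int(dir_count),
            "files": int(file_count), "bytes": int(size)}

def scan_hdfs(paths):
    """Run `hdfs dfs -count` against the insecure cluster and return
    one record per path, ready to be saved to the scan table."""
    out = subprocess.run(["hdfs", "dfs", "-count", *paths],
                         capture_output=True, text=True,
                         check=True).stdout
    return [parse_count_line(l) for l in out.splitlines() if l.strip()]
```

The same per-path file and byte counts, re-collected on the secure side after the copy, give you the validation in step 5: the two scan records must match exactly.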

We decided to run the DistCp command in batches of the 10 deepest sub-directories at a time; this decision was made due to resource limitations.
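The batching above can be sketched as follows — the helper names and hosts are hypothetical, and the fallback flag is the one that allows the secure client to use simple auth against the insecure source:

```python
def batch_deepest(paths, batch_size=10):
    """Order source directories deepest-first and yield them in
    fixed-size batches, one batch per DistCp run."""
    ordered = sorted(paths, key=lambda p: p.rstrip("/").count("/"),
                     reverse=True)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

def distcp_command(batch, src_nn, dst_root):
    """Build one DistCp invocation for a batch of source paths.
    DistCp accepts multiple sources followed by one target directory."""
    srcs = [f"hdfs://{src_nn}{p}" for p in batch]
    return ["hadoop", "distcp",
            "-D", "ipc.client.fallback-to-simple-auth-allowed=true",
            "-update", *srcs, dst_root]
```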

Apache Hive Metastore Synchronization

  1. List all the Hive databases in our insecure cluster.
  2. List all the tables in each database.
  3. Query SHOW CREATE TABLE for each table.
  4. Run the resulting DDL against the secure cluster to create the tables.
  5. Run INVALIDATE METADATA so Impala "loads" the new metadata, and MSCK REPAIR TABLE so Hive picks up the partitions.
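The statement plan those steps produce per table can be sketched like this; actually executing the statements (via beeline, impala-shell, or a JDBC driver) is left out, and which side each statement runs on is tracked explicitly:

```python
def metastore_sync_statements(db, tables):
    """For one database, emit (side, statement) pairs in the order the
    sync runs them: read the DDL on the insecure source, replay it on
    the secure target, then refresh Impala and Hive metadata."""
    plan = []
    for table in tables:
        plan.append(("source", f"SHOW CREATE TABLE {db}.{table}"))
        # The DDL returned by the statement above is executed
        # verbatim on the target to create the table there.
        plan.append(("target", f"INVALIDATE METADATA {db}.{table}"))
        plan.append(("target", f"MSCK REPAIR TABLE {db}.{table}"))
    return plan
```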

Summary

You don't need a fancy product to migrate or copy your data; you can always use the same old Hadoop distcp command with some Hive querying.

Enjoy!
