
Migrating our Hadoop Cluster with Apache DistCp

eranelbaz · Originally published at Medium · 3 min read

From an insecure cluster to a remote, secure cluster

What's Our Mission?

We need to migrate our entire Hadoop cluster from an insecure CDH 5.7.1 cluster to a secure CDH 5.13.1 cluster in a remote data center.

And we wanted to find a complete product for our use case.

The Product Requirements

  • Migrate our data hermetically, with end-to-end verification
  • Migrate our Apache Hive Metastore
  • Provide BI on the transfer process

So What's The Big Deal?

There are a couple of frameworks for handling this type of problem.

However, with each framework we encountered another drawback:

  • Cloudera BDR does not support migrating from an insecure cluster to a secure cluster
  • Apache Falcon does not support the Hadoop version Cloudera currently uses
  • Oracle told us that this is not the use case for Oracle GoldenGate and recommended Oracle Data Integrator instead
  • Our experience with Oracle Data Integrator wasn't that great…

So we found ourselves facing this mission without any framework that would work for us, and it was obvious we had to go back to the drawing board.

What About Apache DistCp With Some Python?

We always knew we could use Apache DistCp from an insecure to a secure cluster, but we wanted a complete product, tested and reviewed, that works; we didn't want to develop our own "BDR". Still, it became clear that we would have to build our own product.
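For reference, the basic DistCp invocation looks roughly like this (hostnames and paths here are hypothetical). It is run from the secure cluster, and the fallback flag lets the Kerberized client fall back to simple authentication when talking to the insecure source:

```shell
# Run from an edge node of the secure cluster.
# -pb preserves block size, -update skips files already in sync.
hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  -pb -update \
  hdfs://insecure-nn.example.com:8020/data/warehouse \
  hdfs://secure-nn.example.com:8020/data/warehouse
```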

Our Flow

Data Copy

  1. Scan the insecure HDFS recursively — we also need to know the sub-paths to synchronize: how many files do we have? What is each path's size?
  2. Save the HDFS scan results to a database table.
  3. Run the Hadoop DistCp command from the secure cluster.
  4. We want the entire process to run as our team's user, so we need to change the directory permissions on the target.
  5. Validate that each and every byte has been synchronized across the clusters.
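A minimal sketch of steps 1–2, assuming the scan shells out to `hdfs dfs -count` (whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) and that the resulting records are inserted into the scan table by some database layer not shown here:

```python
import subprocess

def parse_count_line(line):
    """Parse one line of `hdfs dfs -count` output:
    DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME."""
    dir_count, file_count, size, path = line.split(None, 3)
    return {"path": path, "dirs": int(dir_count),
            "files": int(file_count), "bytes": int(size)}

def scan_hdfs(paths):
    """Run `hdfs dfs -count` against the insecure cluster and return
    one record per path, ready to be saved to the scan table."""
    out = subprocess.run(["hdfs", "dfs", "-count", *paths],
                         capture_output=True, text=True,
                         check=True).stdout
    return [parse_count_line(l) for l in out.splitlines() if l.strip()]
```

The same per-path file and byte counts, re-collected on the secure side after the copy, give you the validation in step 5: the two scan records must match exactly.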

We decided to run the DistCp command in batches of the 10 deepest sub-directories at a time; this decision was made due to resource limitations.
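The batching above can be sketched as follows — the helper names and hosts are hypothetical, and the fallback flag is the one that allows the secure client to use simple auth against the insecure source:

```python
def batch_deepest(paths, batch_size=10):
    """Order source directories deepest-first and yield them in
    fixed-size batches, one batch per DistCp run."""
    ordered = sorted(paths, key=lambda p: p.rstrip("/").count("/"),
                     reverse=True)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

def distcp_command(batch, src_nn, dst_root):
    """Build one DistCp invocation for a batch of source paths.
    DistCp accepts multiple sources followed by one target directory."""
    srcs = [f"hdfs://{src_nn}{p}" for p in batch]
    return ["hadoop", "distcp",
            "-D", "ipc.client.fallback-to-simple-auth-allowed=true",
            "-update", *srcs, dst_root]
```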

Apache Hive Metastore Synchronization

  1. List all the Hive databases in our insecure cluster.
  2. List all the tables in each database.
  3. Query SHOW CREATE TABLE for each table.
  4. Run the resulting DDL against the secure cluster to create the tables.
  5. Run INVALIDATE METADATA so Impala "loads" the new metadata, and MSCK REPAIR TABLE so Hive picks up the partitions.
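The statement plan those steps produce per table can be sketched like this; actually executing the statements (via beeline, impala-shell, or a JDBC driver) is left out, and which side each statement runs on is tracked explicitly:

```python
def metastore_sync_statements(db, tables):
    """For one database, emit (side, statement) pairs in the order the
    sync runs them: read the DDL on the insecure source, replay it on
    the secure target, then refresh Impala and Hive metadata."""
    plan = []
    for table in tables:
        plan.append(("source", f"SHOW CREATE TABLE {db}.{table}"))
        # The DDL returned by the statement above is executed
        # verbatim on the target to create the table there.
        plan.append(("target", f"INVALIDATE METADATA {db}.{table}"))
        plan.append(("target", f"MSCK REPAIR TABLE {db}.{table}"))
    return plan
```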

Summary

You don't need a fancy product to migrate or copy your data; you can always use the same old Hadoop distcp command with some Hive querying.

Enjoy!
