<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Premchand G</title>
    <description>The latest articles on DEV Community by Premchand G (@premchand_g_b701825d22ef9).</description>
    <link>https://dev.to/premchand_g_b701825d22ef9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3038758%2F16ec0f56-bb93-40c0-bf9a-aeb9a00e0601.png</url>
      <title>DEV Community: Premchand G</title>
      <link>https://dev.to/premchand_g_b701825d22ef9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/premchand_g_b701825d22ef9"/>
    <language>en</language>
    <item>
      <title>Migrating Hadoop Workloads to AWS: On-Premises HDFS, Spark, Kafka, Airflow to AWS S3, Iceberg, and EMR</title>
      <dc:creator>Premchand G</dc:creator>
      <pubDate>Fri, 11 Apr 2025 11:05:59 +0000</pubDate>
      <link>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-on-premises-hdfs-spark-kafka-airflow-to-aws-s3-iceberg-and-452e</link>
      <guid>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-on-premises-hdfs-spark-kafka-airflow-to-aws-s3-iceberg-and-452e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Table of Contents&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction
&lt;/li&gt;
&lt;li&gt;
Why Migrate from On-Premises Hadoop to AWS?
&lt;/li&gt;
&lt;li&gt;
Target AWS Architecture with Iceberg
&lt;/li&gt;
&lt;li&gt;
Step-by-Step Migration Process
&lt;/li&gt;
&lt;li&gt;
Code Snippets &amp;amp; Implementation
&lt;/li&gt;
&lt;li&gt;
Lessons Learned &amp;amp; Best Practices
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many enterprises still run &lt;strong&gt;on-premises Hadoop&lt;/strong&gt; (HDFS, Spark, Kafka, Airflow) for big data processing. However, challenges like &lt;strong&gt;high operational costs, scalability bottlenecks, and maintenance overhead&lt;/strong&gt; make cloud migration attractive.  &lt;/p&gt;

&lt;p&gt;This blog provides a &lt;strong&gt;6-step guide&lt;/strong&gt; for migrating to &lt;strong&gt;AWS S3, Apache Iceberg, and EMR&lt;/strong&gt;, including:&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Architecture diagrams&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Code snippets for Spark, Kafka, and Iceberg&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Lessons learned from real-world migrations&lt;/strong&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Why Migrate from On-Premises Hadoop to AWS?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenges with On-Prem Hadoop&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Issue&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AWS Solution&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Expensive hardware &amp;amp; maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Pay-as-you-go pricing (EMR, S3)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual scaling (YARN/HDFS)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auto-scaling EMR clusters&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HDFS limitations (durability, scaling)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 (11 9’s durability) + Iceberg (ACID tables)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex Kafka &amp;amp; Airflow management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS MSK (Managed Kafka) &amp;amp; MWAA (Managed Airflow)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Benefits of AWS + Iceberg&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost savings&lt;/strong&gt; (no upfront hardware, spot instances)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modern table format&lt;/strong&gt; (Iceberg for schema evolution, time travel)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless options&lt;/strong&gt; (Glue, Athena, EMR Serverless)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Target AWS Architecture with Iceberg&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Current On-Premises Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F583z3vcclunfobjj41cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F583z3vcclunfobjj41cr.png" alt="Image description" width="332" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;New AWS Architecture (Iceberg + EMR)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh970ra4fqxvxorch1z1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh970ra4fqxvxorch1z1w.png" alt="Image description" width="473" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Key AWS Services&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; – Data lake storage (replaces HDFS)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EMR&lt;/strong&gt; – Managed Spark with Iceberg support
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; – Metastore for Iceberg tables
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MSK&lt;/strong&gt; – Managed Kafka for streaming
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MWAA&lt;/strong&gt; – Managed Airflow for orchestration
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Step-by-Step Migration Process&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: Assessment &amp;amp; Planning&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inventory existing workloads&lt;/strong&gt; (HDFS paths, Spark SQL, Kafka topics; a sizing sketch follows this list)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose Iceberg for table format&lt;/strong&gt; (supports schema evolution, upserts)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan networking&lt;/strong&gt; (VPC, security groups, IAM roles)
&lt;/li&gt;
&lt;/ul&gt;
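
&lt;p&gt;As a starting point for the inventory, it helps to size the HDFS paths you plan to move. This is a minimal sketch, assuming the &lt;code&gt;hdfs&lt;/code&gt; CLI is available where the script runs; the paths are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Size candidate HDFS paths before planning the transfer.
import subprocess

PATHS_TO_MIGRATE = ["/data/transactions", "/data/events"]  # hypothetical paths

for path in PATHS_TO_MIGRATE:
    # "hdfs dfs -du -s" prints the total size in bytes as the first field
    out = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", path],
        capture_output=True, text=True, check=True,
    ).stdout
    size_bytes = int(out.split()[0])
    print(f"{path}: {size_bytes / 1024**3:.1f} GiB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;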

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2: Data Migration (HDFS → S3 + Iceberg)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1:&lt;/strong&gt; Use &lt;code&gt;distcp&lt;/code&gt; to copy data from HDFS to S3
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  hadoop distcp hdfs://namenode/path s3a://bucket/path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2:&lt;/strong&gt; Use Spark to rewrite data as Iceberg
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hdfs://path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
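
&lt;p&gt;If you also want the new table registered in the &lt;strong&gt;Glue Data Catalog&lt;/strong&gt; rather than written by path, Spark's DataFrameWriterV2 API can create it through an Iceberg catalog. A sketch, assuming the &lt;code&gt;glue_catalog&lt;/code&gt; catalog configured in Phase 3 and placeholder database/table names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rewrite legacy Parquet data as a catalog-managed Iceberg table.
# Assumes a "glue_catalog" Iceberg catalog is configured (see Phase 3);
# the database and table names are placeholders.
df = spark.read.parquet("hdfs://namenode/data/transactions")

# DataFrameWriterV2 creates glue_catalog.db.transactions as an Iceberg table
df.writeTo("glue_catalog.db.transactions").using("iceberg").create()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;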



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3: Compute Migration (Spark → EMR with Iceberg)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure EMR with Iceberg&lt;/strong&gt; (use bootstrap script):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;#!/bin/bash  &lt;/span&gt;
  &lt;span class="nb"&gt;sudo &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pyiceberg  
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/spark/conf/spark-defaults.conf  
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"spark.sql.catalog.glue_catalog.warehouse=s3://bucket/warehouse"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/spark/conf/spark-defaults.conf  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 4: Streaming Migration (Kafka → MSK)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mirror topics using Kafka Connect&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"msk-mirror"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.kafka.connect.mirror.MirrorSourceConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source.cluster.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"on-prem-kafka:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target.cluster.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b-1.msk.aws:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 5: Orchestration Migration (Airflow → MWAA)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Export DAGs and update paths&lt;/strong&gt; (replace &lt;code&gt;hdfs://&lt;/code&gt; with &lt;code&gt;s3://&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use AWS Secrets Manager for credentials&lt;/strong&gt; (see the sketch after this list)
&lt;/li&gt;
&lt;/ul&gt;
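
&lt;p&gt;A minimal sketch of both changes in a migrated DAG: S3 paths instead of &lt;code&gt;hdfs://&lt;/code&gt;, and credentials fetched from Secrets Manager instead of being hard-coded. The secret id, bucket, and schedule are placeholders, and the MWAA execution role is assumed to allow &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

INPUT_PATH = "s3://bucket/data/transactions"  # was hdfs:///data/transactions

def fetch_db_credentials():
    # Pull connection credentials at runtime instead of hard-coding them
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId="prod/warehouse/db")  # placeholder
    return json.loads(secret["SecretString"])

with DAG("migrated_pipeline", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    fetch_credentials = PythonOperator(
        task_id="fetch_credentials",
        python_callable=fetch_db_credentials,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;MWAA can also be configured to use Secrets Manager as its native secrets backend, which keeps Airflow connections out of DAG code entirely.&lt;/p&gt;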

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 6: Validation &amp;amp; Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verify data consistency&lt;/strong&gt; (compare row counts, checksums; see the PySpark sketch below)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Iceberg&lt;/strong&gt; (compact files, partition pruning)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;glue_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_data_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'db.table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'binpack'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
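
&lt;p&gt;For the consistency check, a small PySpark sketch comparing the HDFS source against the Iceberg copy (the paths and the key column are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare row counts and an order-independent checksum between source and target.
from pyspark.sql import functions as F

src = spark.read.parquet("hdfs:///data/transactions")
dst = spark.read.format("iceberg").load("s3://bucket/iceberg_db/transactions")

assert src.count() == dst.count(), "row counts differ"

def checksum(df):
    # Sum of per-row CRC32 over a stable key column; insensitive to row order
    return df.select(F.sum(F.crc32(F.col("transaction_id").cast("string")))).first()[0]

assert checksum(src) == checksum(dst), "checksums differ"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;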






&lt;h2&gt;
  
  
  &lt;strong&gt;5. Code Snippets &amp;amp; Implementation&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Reading/Writing Iceberg Tables in Spark&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from HDFS (old)  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hdfs:///data/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# Write to Iceberg (new)  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_db/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# Query with time travel  
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snapshot-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_db/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Kafka to Iceberg (Structured Streaming)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b-1.msk.aws:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="c1"&gt;# Write to Iceberg in Delta Lake format  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_db/streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Airflow DAG for Iceberg Maintenance&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.amazon.aws.operators.emr&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EmrAddStepsOperator&lt;/span&gt;  

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg_maintenance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@weekly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;compact_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmrAddStepsOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compact_iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;job_flow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;j-EMRCLUSTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compact Iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HadoopJarStep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command-runner.jar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark-sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--executor-memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8G&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-e&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CALL glue_catalog.system.rewrite_data_files(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;db.transactions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}]&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;6. Lessons Learned &amp;amp; Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Challenges &amp;amp; Fixes&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Issue&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slow S3 writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;EMRFS S3-optimized committer&lt;/strong&gt; (see the sketch below the table)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hive metastore conflicts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Migrate to &lt;strong&gt;Glue Data Catalog&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kafka consumer lag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Increase MSK broker size &amp;amp; optimize partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
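
&lt;p&gt;For the first fix in the table, the committer is controlled by a Spark setting (it is enabled by default on recent EMR releases). A sketch of setting it explicitly at session level:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Explicitly enable the EMRFS S3-optimized committer for Parquet writes
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-writes")
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;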

&lt;h3&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Use EMR 6.8+ for native Iceberg support&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Partition Iceberg tables by time for better performance&lt;/strong&gt; (see the sketch after this list)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Enable S3 lifecycle policies to save costs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Monitor MSK lag with CloudWatch&lt;/strong&gt;  &lt;/p&gt;
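
&lt;p&gt;A sketch of the time-partitioning practice, creating a table through the Glue catalog with Iceberg's hidden partition transform (table and column names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Create a time-partitioned Iceberg table via the configured Glue catalog
spark.sql("""
    CREATE TABLE glue_catalog.db.events (
        event_id STRING,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;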

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Migrating to &lt;strong&gt;AWS S3 + Iceberg + EMR&lt;/strong&gt; modernizes data infrastructure, reduces costs, and improves scalability. By following this guide, enterprises can &lt;strong&gt;minimize downtime and maximize performance&lt;/strong&gt;.  &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Next Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html" rel="noopener noreferrer"&gt;AWS Iceberg Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/best-practices-for-amazon-emr/" rel="noopener noreferrer"&gt;EMR Best Practices&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would you like a deeper dive into &lt;strong&gt;Iceberg optimizations&lt;/strong&gt; or &lt;strong&gt;Kafka migration strategies&lt;/strong&gt;? Let me know in the comments!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Migrating Hadoop Workloads to AWS: Pig, Hive, and Oozie to EMR</title>
      <dc:creator>Premchand G</dc:creator>
      <pubDate>Fri, 11 Apr 2025 10:50:30 +0000</pubDate>
      <link>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-migrationemr-oozie-4483</link>
      <guid>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-migrationemr-oozie-4483</guid>
      <description>&lt;p&gt;Many organizations rely on Hadoop-based workflows for big data processing, leveraging tools like &lt;strong&gt;Apache Pig&lt;/strong&gt;, &lt;strong&gt;Apache Hive&lt;/strong&gt;, and &lt;strong&gt;Apache Oozie&lt;/strong&gt; for data transformation, querying, and workflow orchestration. However, managing on-premises Hadoop clusters can be complex and costly. Migrating these workflows to &lt;strong&gt;AWS Elastic MapReduce (EMR)&lt;/strong&gt; offers scalability, cost-efficiency, and reduced operational overhead.&lt;/p&gt;

&lt;p&gt;This blog explores the key considerations, steps, and best practices for migrating Hadoop workflows (Pig, Hive, and Oozie) to &lt;strong&gt;AWS EMR&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. Understanding AWS EMR and Migration Benefits&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AWS EMR?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AWS EMR is a managed big data platform that simplifies running distributed frameworks like &lt;strong&gt;Hadoop, Spark, Hive, Pig, and Oozie&lt;/strong&gt; in the cloud. It automatically handles provisioning, scaling, and cluster management.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Migrate to AWS EMR?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Auto-scaling adjusts resources based on workload demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Pay-as-you-go pricing reduces infrastructure costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Service&lt;/strong&gt;: AWS handles cluster setup, maintenance, and updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with AWS Ecosystem&lt;/strong&gt;: Seamless connectivity with &lt;strong&gt;S3, Glue, Lambda, and Redshift&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster Processing&lt;/strong&gt;: Optimized performance with AWS hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Components in Migration&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;On-Premises Hadoop&lt;/th&gt;
&lt;th&gt;AWS EMR Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HDFS&lt;/td&gt;
&lt;td&gt;Amazon S3 / EMRFS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pig Scripts&lt;/td&gt;
&lt;td&gt;EMR Pig (or Spark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hive Queries&lt;/td&gt;
&lt;td&gt;EMR Hive / Athena&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oozie Workflows&lt;/td&gt;
&lt;td&gt;AWS Step Functions / Managed Workflows for Apache Airflow (MWAA)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here’s a high-level architecture for the migrated solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Data Sources] --&amp;gt; [AWS DataSync/DistCp] --&amp;gt; [Amazon S3]
                      |
                      v
              [AWS Glue ETL or EMR with Pig/Spark]
                      |
                      v
              [AWS Step Functions or MWAA (Airflow)]
                      |
                      v
              [Data Destinations: S3, Redshift, RDS, etc.]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;2. Migration Steps: Pig, Hive, and Oozie to AWS EMR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools and Services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Migration&lt;/strong&gt;: AWS DataSync, DistCp, S3 CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt;: AWS Glue, EMR, Spark, PySpark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: AWS Step Functions, Apache Airflow (MWAA).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Amazon CloudWatch, AWS CloudTrail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: IAM, KMS, VPC.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Assess Existing Workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Document current &lt;strong&gt;Pig scripts, Hive queries, and Oozie workflows&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Identify dependencies (e.g., external databases, custom UDFs).&lt;/li&gt;
&lt;li&gt;Evaluate data storage (HDFS → S3 migration strategy).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Set Up AWS EMR Cluster&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose EMR Release&lt;/strong&gt;: Select a version supporting Pig, Hive, and Oozie.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   aws emr create-cluster &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Hadoop Migration Cluster"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--release-label&lt;/span&gt; emr-6.9.0 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--applications&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Pig &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Hive &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Oozie &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--ec2-attributes&lt;/span&gt; &lt;span class="nv"&gt;KeyName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-key-pair &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--instance-type&lt;/span&gt; m5.xlarge &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--instance-count&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--use-default-roles&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configure Storage&lt;/strong&gt;: Replace HDFS with &lt;strong&gt;Amazon S3&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;   &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;fs.defaultFS&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;s3://my-data-bucket/&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Migrate Pig Scripts&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt;: Run Pig scripts directly on EMR.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  -- Example: WordCount.pig
  data = LOAD 's3://input-data/wordcount.txt' AS (line:chararray);
  words = FOREACH data GENERATE FLATTEN(TOKENIZE(line)) AS word;
  grouped = GROUP words BY word;
  count = FOREACH grouped GENERATE group, COUNT(words);
  STORE count INTO 's3://output-data/wordcount_result';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt;: Convert Pig to &lt;strong&gt;Spark SQL&lt;/strong&gt; for better performance (a PySpark sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
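
&lt;p&gt;A PySpark sketch equivalent to the WordCount Pig script above (same placeholder bucket paths):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# PySpark equivalent of WordCount.pig: tokenize, group, count, store
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("s3://input-data/wordcount.txt")  # single column: value
counts = (
    lines
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
)
counts.write.mode("overwrite").csv("s3://output-data/wordcount_result")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;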

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Migrate Hive Queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt;: Use EMR Hive with S3 as storage.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-hive-tables/logs/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt;: Use &lt;strong&gt;AWS Athena&lt;/strong&gt; for serverless HiveQL queries (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
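
&lt;p&gt;A sketch of the Athena option via boto3; it assumes the &lt;code&gt;logs&lt;/code&gt; table is registered in the Glue Data Catalog, and the database and output location are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Run the same HiveQL serverlessly with Athena
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT message, count(*) AS n FROM logs GROUP BY message",
    QueryExecutionContext={"Database": "my_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)
print("query id:", resp["QueryExecutionId"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;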

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Replace Oozie with AWS Workflow Solutions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt;: &lt;strong&gt;AWS Step Functions&lt;/strong&gt; for orchestration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RunHiveQuery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"RunHiveQuery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::elasticmapreduce:addStep.sync"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ClusterId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"j-2AXXXXXX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Step"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HiveQueryStep"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ActionOnFailure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONTINUE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"HadoopJarStep"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"Jar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command-runner.jar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"Args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"hive-script"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--run-hive-script"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--args"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://scripts/query.hql"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt;: &lt;strong&gt;Managed Workflows for Apache Airflow (MWAA)&lt;/strong&gt; for complex DAGs (a minimal DAG sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
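
&lt;p&gt;A minimal MWAA sketch of the same idea: a chained Oozie workflow becomes a DAG of sequential EMR steps. The cluster id and script locations are placeholders, and the Hive step mirrors the &lt;code&gt;hive-script&lt;/code&gt; arguments from the Step Functions example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Oozie-style action chain rebuilt as an Airflow DAG: Hive step, then Spark step
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

with DAG("oozie_replacement", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    run_hive = EmrAddStepsOperator(
        task_id="run_hive",
        job_flow_id="j-2AXXXXXX",  # placeholder cluster id
        steps=[{
            "Name": "HiveQueryStep",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://scripts/query.hql"],
            },
        }],
    )

    run_spark = EmrAddStepsOperator(
        task_id="run_spark",
        job_flow_id="j-2AXXXXXX",
        steps=[{
            "Name": "SparkJobStep",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://scripts/job.py"],
            },
        }],
    )

    run_hive &amp;gt;&amp;gt; run_spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;EmrAddStepsOperator&lt;/code&gt; only submits steps; to make the second task wait for the first step to finish, pair each with an &lt;code&gt;EmrStepSensor&lt;/code&gt;.&lt;/p&gt;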

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Validate and Optimize&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test&lt;/strong&gt;: Run sample workflows in EMR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize&lt;/strong&gt;: Adjust EMR configurations (e.g., instance types, spot instances).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt;: Use &lt;strong&gt;CloudWatch&lt;/strong&gt; for logging and performance tracking (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
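
&lt;p&gt;For the monitoring step, a boto3 sketch that pulls a basic EMR metric from CloudWatch (the cluster id is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fetch an EMR cluster metric from CloudWatch after a test run
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")

resp = cw.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="AppsRunning",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-2AXXXXXX"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;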

&lt;p&gt;&lt;strong&gt;Migrate data from an on-premises Hadoop environment&lt;/strong&gt;&lt;br&gt;
     Running traditional Hadoop DistCp on the source cluster can consume significant resources there. Instead, use S3DistCp over AWS Direct Connect to migrate terabytes of data from an on-premises Hadoop environment to Amazon S3. S3DistCp runs the copy job on the target EMR cluster, which reduces the load on the source cluster.&lt;br&gt;
&lt;strong&gt;Transfer data using S3DistCp&lt;/strong&gt;&lt;br&gt;
To transfer the source HDFS folder to the target S3 bucket, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://&amp;lt;BUCKET_NAME&amp;gt;/user/hive/warehouse/test.db/test_table01&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;To transfer large files in multipart chunks, use the following command to set the chunk size:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://&amp;lt;BUCKET_NAME&amp;gt;/user/hive/warehouse/test.db/test_table01 --multipartUploadChunkSize=1024&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This will invoke a MapReduce job on the target EMR cluster. Depending on the volume of the data and the available bandwidth, the job can take from a few minutes up to a few hours to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Best Practices and Challenges&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;✔ &lt;strong&gt;Use S3 Instead of HDFS&lt;/strong&gt;: Cheaper and more durable.&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Leverage Spot Instances&lt;/strong&gt;: Reduce costs for non-critical workloads.&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Automate Cluster Lifecycle&lt;/strong&gt;: Use &lt;strong&gt;AWS EMR Serverless&lt;/strong&gt; or &lt;strong&gt;EMR Steps API&lt;/strong&gt; for transient clusters (see the sketch after this list).&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Security&lt;/strong&gt;: Enable &lt;strong&gt;IAM roles, encryption (KMS), and VPC isolation&lt;/strong&gt;.  &lt;/p&gt;
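
&lt;p&gt;A sketch of the transient-cluster practice with boto3: submit the steps at creation time and let the cluster terminate when the last step finishes. Names, roles, and script locations are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Transient EMR cluster: runs the submitted steps, then auto-terminates
import boto3

emr = boto3.client("emr")

resp = emr.run_job_flow(
    Name="transient-hive-job",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the last step
    },
    Steps=[{
        "Name": "HiveQueryStep",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://scripts/query.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("cluster:", resp["JobFlowId"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;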

&lt;h3&gt;
  
  
  &lt;strong&gt;Common Challenges&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;⚠ &lt;strong&gt;Script Compatibility&lt;/strong&gt;: Some Pig/Hive scripts may need adjustments for S3.&lt;br&gt;&lt;br&gt;
⚠ &lt;strong&gt;Oozie Dependency Replacement&lt;/strong&gt;: Step Functions/MWAA may require workflow redesign.&lt;br&gt;&lt;br&gt;
⚠ &lt;strong&gt;Performance Tuning&lt;/strong&gt;: Optimize partition strategies for S3-based queries.  &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Migrating Hadoop workflows from on-premises to &lt;strong&gt;AWS EMR&lt;/strong&gt; improves scalability, reduces costs, and leverages AWS-managed services. By following the steps outlined—assessing workflows, setting up EMR, migrating Pig/Hive scripts, and replacing Oozie with AWS-native orchestration—you can ensure a smooth transition.&lt;/p&gt;

&lt;p&gt;For further optimization, consider &lt;strong&gt;EMR Serverless&lt;/strong&gt; for sporadic workloads or &lt;strong&gt;AWS Glue&lt;/strong&gt; for ETL automation. Start with a &lt;strong&gt;proof-of-concept migration&lt;/strong&gt; to validate performance before full-scale deployment.&lt;/p&gt;

&lt;p&gt;Would you like a deeper dive into any specific migration step? Let us know in the comments!  &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
