<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Premchand G</title>
    <description>The latest articles on DEV Community by Premchand G (@premchand_g_b701825d22ef9).</description>
    <link>https://dev.to/premchand_g_b701825d22ef9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3038758%2F16ec0f56-bb93-40c0-bf9a-aeb9a00e0601.png</url>
      <title>DEV Community: Premchand G</title>
      <link>https://dev.to/premchand_g_b701825d22ef9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/premchand_g_b701825d22ef9"/>
    <language>en</language>
    <item>
      <title>Migrating Hadoop Workloads to AWS: On-Premises HDFS, Spark, Kafka, Airflow to AWS S3, Iceberg, and EMR</title>
      <dc:creator>Premchand G</dc:creator>
      <pubDate>Fri, 11 Apr 2025 11:05:59 +0000</pubDate>
      <link>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-on-premises-hdfs-spark-kafka-airflow-to-aws-s3-iceberg-and-452e</link>
      <guid>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-on-premises-hdfs-spark-kafka-airflow-to-aws-s3-iceberg-and-452e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Table of Contents&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
Introduction
&lt;/li&gt;
&lt;li&gt;
Why Migrate from On-Premises Hadoop to AWS?
&lt;/li&gt;
&lt;li&gt;
Target AWS Architecture with Iceberg
&lt;/li&gt;
&lt;li&gt;
Step-by-Step Migration Process
&lt;/li&gt;
&lt;li&gt;
Code Snippets &amp;amp; Implementation
&lt;/li&gt;
&lt;li&gt;
Lessons Learned &amp;amp; Best Practices
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many enterprises still run &lt;strong&gt;on-premises Hadoop&lt;/strong&gt; (HDFS, Spark, Kafka, Airflow) for big data processing. However, challenges like &lt;strong&gt;high operational costs, scalability bottlenecks, and maintenance overhead&lt;/strong&gt; make cloud migration attractive.  &lt;/p&gt;

&lt;p&gt;This blog provides a &lt;strong&gt;6-step guide&lt;/strong&gt; for migrating to &lt;strong&gt;AWS S3, Apache Iceberg, and EMR&lt;/strong&gt;, including:&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Architecture diagrams&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Code snippets for Spark, Kafka, and Iceberg&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Lessons learned from real-world migrations&lt;/strong&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Why Migrate from On-Premises Hadoop to AWS?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenges with On-Prem Hadoop&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Issue&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AWS Solution&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Expensive hardware &amp;amp; maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Pay-as-you-go pricing (EMR, S3)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual scaling (YARN/HDFS)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auto-scaling EMR clusters&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HDFS limitations (durability, scaling)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;S3 (11 9’s durability) + Iceberg (ACID tables)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex Kafka &amp;amp; Airflow management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;AWS MSK (Managed Kafka) &amp;amp; MWAA (Managed Airflow)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Benefits of AWS + Iceberg&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost savings&lt;/strong&gt; (no upfront hardware, spot instances)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modern table format&lt;/strong&gt; (Iceberg for schema evolution, time travel)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless options&lt;/strong&gt; (Glue, Athena, EMR Serverless)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Target AWS Architecture with Iceberg&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Current On-Premises Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F583z3vcclunfobjj41cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F583z3vcclunfobjj41cr.png" alt="Image description" width="332" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;New AWS Architecture (Iceberg + EMR)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh970ra4fqxvxorch1z1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh970ra4fqxvxorch1z1w.png" alt="Image description" width="473" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Key AWS Services&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; – Data lake storage (replaces HDFS)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EMR&lt;/strong&gt; – Managed Spark with Iceberg support
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; – Metastore for Iceberg tables
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MSK&lt;/strong&gt; – Managed Kafka for streaming
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MWAA&lt;/strong&gt; – Managed Airflow for orchestration
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Step-by-Step Migration Process&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: Assessment &amp;amp; Planning&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inventory existing workloads&lt;/strong&gt; (HDFS paths, Spark SQL, Kafka topics; a sizing sketch follows this list)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose Iceberg for table format&lt;/strong&gt; (supports schema evolution, upserts)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan networking&lt;/strong&gt; (VPC, security groups, IAM roles)
&lt;/li&gt;
&lt;/ul&gt;
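
&lt;p&gt;As a starting point for the inventory, it helps to size the HDFS paths you plan to move. This is a minimal sketch, assuming the &lt;code&gt;hdfs&lt;/code&gt; CLI is available where the script runs; the paths are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Size candidate HDFS paths before planning the transfer.
import subprocess

PATHS_TO_MIGRATE = ["/data/transactions", "/data/events"]  # hypothetical paths

for path in PATHS_TO_MIGRATE:
    # "hdfs dfs -du -s" prints the total size in bytes as the first field
    out = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", path],
        capture_output=True, text=True, check=True,
    ).stdout
    size_bytes = int(out.split()[0])
    print(f"{path}: {size_bytes / 1024**3:.1f} GiB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;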

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2: Data Migration (HDFS → S3 + Iceberg)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1:&lt;/strong&gt; Use &lt;code&gt;distcp&lt;/code&gt; to copy data from HDFS to S3
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  hadoop distcp hdfs://namenode/path s3a://bucket/path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2:&lt;/strong&gt; Use Spark to rewrite data as Iceberg
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hdfs://path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
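
&lt;p&gt;If you also want the new table registered in the &lt;strong&gt;Glue Data Catalog&lt;/strong&gt; rather than written by path, Spark's DataFrameWriterV2 API can create it through an Iceberg catalog. A sketch, assuming the &lt;code&gt;glue_catalog&lt;/code&gt; catalog configured in Phase 3 and placeholder database/table names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rewrite legacy Parquet data as a catalog-managed Iceberg table.
# Assumes a "glue_catalog" Iceberg catalog is configured (see Phase 3);
# the database and table names are placeholders.
df = spark.read.parquet("hdfs://namenode/data/transactions")

# DataFrameWriterV2 creates glue_catalog.db.transactions as an Iceberg table
df.writeTo("glue_catalog.db.transactions").using("iceberg").create()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;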



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3: Compute Migration (Spark → EMR with Iceberg)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure EMR with Iceberg&lt;/strong&gt; (use bootstrap script):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;#!/bin/bash  &lt;/span&gt;
  &lt;span class="nb"&gt;sudo &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pyiceberg  
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/spark/conf/spark-defaults.conf  
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"spark.sql.catalog.glue_catalog.warehouse=s3://bucket/warehouse"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/spark/conf/spark-defaults.conf  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 4: Streaming Migration (Kafka → MSK)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mirror topics using Kafka Connect&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"msk-mirror"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.kafka.connect.mirror.MirrorSourceConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source.cluster.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"on-prem-kafka:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target.cluster.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b-1.msk.aws:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 5: Orchestration Migration (Airflow → MWAA)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Export DAGs and update paths&lt;/strong&gt; (replace &lt;code&gt;hdfs://&lt;/code&gt; with &lt;code&gt;s3://&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use AWS Secrets Manager for credentials&lt;/strong&gt; (see the sketch after this list)
&lt;/li&gt;
&lt;/ul&gt;
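
&lt;p&gt;A minimal sketch of both changes in a migrated DAG: S3 paths instead of &lt;code&gt;hdfs://&lt;/code&gt;, and credentials fetched from Secrets Manager instead of being hard-coded. The secret id, bucket, and schedule are placeholders, and the MWAA execution role is assumed to allow &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

INPUT_PATH = "s3://bucket/data/transactions"  # was hdfs:///data/transactions

def fetch_db_credentials():
    # Pull connection credentials at runtime instead of hard-coding them
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId="prod/warehouse/db")  # placeholder
    return json.loads(secret["SecretString"])

with DAG("migrated_pipeline", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    fetch_credentials = PythonOperator(
        task_id="fetch_credentials",
        python_callable=fetch_db_credentials,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;MWAA can also be configured to use Secrets Manager as its native secrets backend, which keeps Airflow connections out of DAG code entirely.&lt;/p&gt;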

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 6: Validation &amp;amp; Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verify data consistency&lt;/strong&gt; (compare row counts, checksums; see the PySpark sketch below)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Iceberg&lt;/strong&gt; (compact files, partition pruning)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;glue_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_data_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'db.table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'binpack'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
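
&lt;p&gt;For the consistency check, a small PySpark sketch comparing the HDFS source against the Iceberg copy (the paths and the key column are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare row counts and an order-independent checksum between source and target.
from pyspark.sql import functions as F

src = spark.read.parquet("hdfs:///data/transactions")
dst = spark.read.format("iceberg").load("s3://bucket/iceberg_db/transactions")

assert src.count() == dst.count(), "row counts differ"

def checksum(df):
    # Sum of per-row CRC32 over a stable key column; insensitive to row order
    return df.select(F.sum(F.crc32(F.col("transaction_id").cast("string")))).first()[0]

assert checksum(src) == checksum(dst), "checksums differ"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;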






&lt;h2&gt;
  
  
  &lt;strong&gt;5. Code Snippets &amp;amp; Implementation&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Reading/Writing Iceberg Tables in Spark&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from HDFS (old)  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hdfs:///data/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# Write to Iceberg (new)  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_db/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# Query with time travel  
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snapshot-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_db/transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Kafka to Iceberg (Structured Streaming)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b-1.msk.aws:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="c1"&gt;# Write to Iceberg in Delta Lake format  
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/iceberg_db/streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Airflow DAG for Iceberg Maintenance&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.amazon.aws.operators.emr&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EmrAddStepsOperator&lt;/span&gt;  

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg_maintenance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@weekly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;compact_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmrAddStepsOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compact_iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;job_flow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;j-EMRCLUSTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compact Iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HadoopJarStep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command-runner.jar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark-sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--executor-memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8G&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-e&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CALL glue_catalog.system.rewrite_data_files(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;db.transactions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}]&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;6. Lessons Learned &amp;amp; Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Challenges &amp;amp; Fixes&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Issue&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slow S3 writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;EMRFS S3-optimized committer&lt;/strong&gt; (see the sketch below the table)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hive metastore conflicts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Migrate to &lt;strong&gt;Glue Data Catalog&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kafka consumer lag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Increase MSK broker size &amp;amp; optimize partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
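
&lt;p&gt;For the first fix in the table, the committer is controlled by a Spark setting (it is enabled by default on recent EMR releases). A sketch of setting it explicitly at session level:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Explicitly enable the EMRFS S3-optimized committer for Parquet writes
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-writes")
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;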

&lt;h3&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Use EMR 6.8+ for native Iceberg support&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Partition Iceberg tables by time for better performance&lt;/strong&gt; (see the sketch after this list)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Enable S3 lifecycle policies to save costs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Monitor MSK lag with CloudWatch&lt;/strong&gt;  &lt;/p&gt;
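
&lt;p&gt;A sketch of the time-partitioning practice, creating a table through the Glue catalog with Iceberg's hidden partition transform (table and column names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Create a time-partitioned Iceberg table via the configured Glue catalog
spark.sql("""
    CREATE TABLE glue_catalog.db.events (
        event_id STRING,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;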

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Migrating to &lt;strong&gt;AWS S3 + Iceberg + EMR&lt;/strong&gt; modernizes data infrastructure, reduces costs, and improves scalability. By following this guide, enterprises can &lt;strong&gt;minimize downtime and maximize performance&lt;/strong&gt;.  &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Next Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html" rel="noopener noreferrer"&gt;AWS Iceberg Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/big-data/best-practices-for-amazon-emr/" rel="noopener noreferrer"&gt;EMR Best Practices&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would you like a deeper dive into &lt;strong&gt;Iceberg optimizations&lt;/strong&gt; or &lt;strong&gt;Kafka migration strategies&lt;/strong&gt;? Let me know in the comments!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Migrating Hadoop Workloads to AWS: Pig, Hive, and Oozie to EMR</title>
      <dc:creator>Premchand G</dc:creator>
      <pubDate>Fri, 11 Apr 2025 10:50:30 +0000</pubDate>
      <link>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-migrationemr-oozie-4483</link>
      <guid>https://dev.to/premchand_g_b701825d22ef9/migrating-hadoop-workloads-to-aws-migrationemr-oozie-4483</guid>
      <description>&lt;p&gt;Many organizations rely on Hadoop-based workflows for big data processing, leveraging tools like &lt;strong&gt;Apache Pig&lt;/strong&gt;, &lt;strong&gt;Apache Hive&lt;/strong&gt;, and &lt;strong&gt;Apache Oozie&lt;/strong&gt; for data transformation, querying, and workflow orchestration. However, managing on-premises Hadoop clusters can be complex and costly. Migrating these workflows to &lt;strong&gt;AWS Elastic MapReduce (EMR)&lt;/strong&gt; offers scalability, cost-efficiency, and reduced operational overhead.&lt;/p&gt;

&lt;p&gt;This blog explores the key considerations, steps, and best practices for migrating Hadoop workflows (Pig, Hive, and Oozie) to &lt;strong&gt;AWS EMR&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. Understanding AWS EMR and Migration Benefits&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AWS EMR?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AWS EMR is a managed big data platform that simplifies running distributed frameworks like &lt;strong&gt;Hadoop, Spark, Hive, Pig, and Oozie&lt;/strong&gt; in the cloud. It automatically handles provisioning, scaling, and cluster management.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Migrate to AWS EMR?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Auto-scaling adjusts resources based on workload demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Pay-as-you-go pricing reduces infrastructure costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Service&lt;/strong&gt;: AWS handles cluster setup, maintenance, and updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with AWS Ecosystem&lt;/strong&gt;: Seamless connectivity with &lt;strong&gt;S3, Glue, Lambda, and Redshift&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster Processing&lt;/strong&gt;: Optimized performance with AWS hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Components in Migration&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;On-Premises Hadoop&lt;/th&gt;
&lt;th&gt;AWS EMR Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HDFS&lt;/td&gt;
&lt;td&gt;Amazon S3 / EMRFS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pig Scripts&lt;/td&gt;
&lt;td&gt;EMR Pig (or Spark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hive Queries&lt;/td&gt;
&lt;td&gt;EMR Hive / Athena&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oozie Workflows&lt;/td&gt;
&lt;td&gt;AWS Step Functions / Managed Workflows for Apache Airflow (MWAA)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here’s a high-level architecture for the migrated solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Data Sources] --&amp;gt; [AWS DataSync/DistCp] --&amp;gt; [Amazon S3]
                      |
                      v
              [AWS Glue ETL or EMR with Pig/Spark]
                      |
                      v
              [AWS Step Functions or MWAA (Airflow)]
                      |
                      v
              [Data Destinations: S3, Redshift, RDS, etc.]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;2. Migration Steps: Pig, Hive, and Oozie to AWS EMR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools and Services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Migration&lt;/strong&gt;: AWS DataSync, DistCp, S3 CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt;: AWS Glue, EMR, Spark, PySpark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: AWS Step Functions, Apache Airflow (MWAA).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Amazon CloudWatch, AWS CloudTrail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: IAM, KMS, VPC.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Assess Existing Workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Document current &lt;strong&gt;Pig scripts, Hive queries, and Oozie workflows&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Identify dependencies (e.g., external databases, custom UDFs).&lt;/li&gt;
&lt;li&gt;Evaluate data storage (HDFS → S3 migration strategy).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Set Up AWS EMR Cluster&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose EMR Release&lt;/strong&gt;: Select a version supporting Pig, Hive, and Oozie.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   aws emr create-cluster &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Hadoop Migration Cluster"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--release-label&lt;/span&gt; emr-6.9.0 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--applications&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Pig &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Hive &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Oozie &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--ec2-attributes&lt;/span&gt; &lt;span class="nv"&gt;KeyName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-key-pair &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--instance-type&lt;/span&gt; m5.xlarge &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--instance-count&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--use-default-roles&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configure Storage&lt;/strong&gt;: Replace HDFS with &lt;strong&gt;Amazon S3&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;   &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;fs.defaultFS&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;s3://my-data-bucket/&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
   &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Migrate Pig Scripts&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt;: Run Pig scripts directly on EMR.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  -- Example: WordCount.pig
  data = LOAD 's3://input-data/wordcount.txt' AS (line:chararray);
  words = FOREACH data GENERATE FLATTEN(TOKENIZE(line)) AS word;
  grouped = GROUP words BY word;
  count = FOREACH grouped GENERATE group, COUNT(words);
  STORE count INTO 's3://output-data/wordcount_result';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt;: Convert Pig to &lt;strong&gt;Spark SQL&lt;/strong&gt; for better performance (a PySpark sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
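
&lt;p&gt;A PySpark sketch equivalent to the WordCount Pig script above (same placeholder bucket paths):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# PySpark equivalent of WordCount.pig: tokenize, group, count, store
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("s3://input-data/wordcount.txt")  # single column: value
counts = (
    lines
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
)
counts.write.mode("overwrite").csv("s3://output-data/wordcount_result")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;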

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Migrate Hive Queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt;: Use EMR Hive with S3 as storage.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-hive-tables/logs/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt;: Use &lt;strong&gt;AWS Athena&lt;/strong&gt; for serverless HiveQL queries (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
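
&lt;p&gt;A sketch of the Athena option via boto3; it assumes the &lt;code&gt;logs&lt;/code&gt; table is registered in the Glue Data Catalog, and the database and output location are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Run the same HiveQL serverlessly with Athena
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT message, count(*) AS n FROM logs GROUP BY message",
    QueryExecutionContext={"Database": "my_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)
print("query id:", resp["QueryExecutionId"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;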

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Replace Oozie with AWS Workflow Solutions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt;: &lt;strong&gt;AWS Step Functions&lt;/strong&gt; for orchestration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StartAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RunHiveQuery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"States"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"RunHiveQuery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:states:::elasticmapreduce:addStep.sync"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ClusterId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"j-2AXXXXXX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Step"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HiveQueryStep"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ActionOnFailure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONTINUE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"HadoopJarStep"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"Jar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command-runner.jar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"Args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"hive-script"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--run-hive-script"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--args"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://scripts/query.hql"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt;: &lt;strong&gt;Managed Workflows for Apache Airflow (MWAA)&lt;/strong&gt; for complex DAGs (a minimal DAG sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
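
&lt;p&gt;A minimal MWAA sketch of the same idea: a chained Oozie workflow becomes a DAG of sequential EMR steps. The cluster id and script locations are placeholders, and the Hive step mirrors the &lt;code&gt;hive-script&lt;/code&gt; arguments from the Step Functions example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Oozie-style action chain rebuilt as an Airflow DAG: Hive step, then Spark step
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

with DAG("oozie_replacement", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    run_hive = EmrAddStepsOperator(
        task_id="run_hive",
        job_flow_id="j-2AXXXXXX",  # placeholder cluster id
        steps=[{
            "Name": "HiveQueryStep",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://scripts/query.hql"],
            },
        }],
    )

    run_spark = EmrAddStepsOperator(
        task_id="run_spark",
        job_flow_id="j-2AXXXXXX",
        steps=[{
            "Name": "SparkJobStep",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://scripts/job.py"],
            },
        }],
    )

    run_hive &amp;gt;&amp;gt; run_spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that &lt;code&gt;EmrAddStepsOperator&lt;/code&gt; only submits steps; to make the second task wait for the first step to finish, pair each with an &lt;code&gt;EmrStepSensor&lt;/code&gt;.&lt;/p&gt;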

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Validate and Optimize&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test&lt;/strong&gt;: Run sample workflows in EMR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize&lt;/strong&gt;: Adjust EMR configurations (e.g., instance types, spot instances).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt;: Use &lt;strong&gt;CloudWatch&lt;/strong&gt; for logging and performance tracking (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
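
&lt;p&gt;For the monitoring step, a boto3 sketch that pulls a basic EMR metric from CloudWatch (the cluster id is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fetch an EMR cluster metric from CloudWatch after a test run
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")

resp = cw.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="AppsRunning",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-2AXXXXXX"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;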

&lt;p&gt;&lt;strong&gt;Migrate data from an on-premises Hadoop environment&lt;/strong&gt;&lt;br&gt;
     Running traditional Hadoop DistCp on the source cluster can consume significant resources there. Instead, use S3DistCp over AWS Direct Connect to migrate terabytes of data from an on-premises Hadoop environment to Amazon S3. S3DistCp runs the copy job on the target EMR cluster, which reduces the load on the source cluster.&lt;br&gt;
&lt;strong&gt;Transfer data using S3DistCp&lt;/strong&gt;&lt;br&gt;
To transfer the source HDFS folder to the target S3 bucket, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://&amp;lt;BUCKET_NAME&amp;gt;/user/hive/warehouse/test.db/test_table01&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;To transfer large files in multipart chunks, use the following command to set the chunk size:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;s3-dist-cp --src hdfs://hadoopcluster01.test.amazon.local/user/hive/warehouse/test.db/test_table01 --dest s3://&amp;lt;BUCKET_NAME&amp;gt;/user/hive/warehouse/test.db/test_table01 --multipartUploadChunkSize=1024&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This will invoke a MapReduce job on the target EMR cluster. Depending on the volume of the data and the available bandwidth, the job can take from a few minutes up to a few hours to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Best Practices and Challenges&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;✔ &lt;strong&gt;Use S3 Instead of HDFS&lt;/strong&gt;: Cheaper and more durable.&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Leverage Spot Instances&lt;/strong&gt;: Reduce costs for non-critical workloads.&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Automate Cluster Lifecycle&lt;/strong&gt;: Use &lt;strong&gt;AWS EMR Serverless&lt;/strong&gt; or &lt;strong&gt;EMR Steps API&lt;/strong&gt; for transient clusters (see the sketch after this list).&lt;br&gt;&lt;br&gt;
✔ &lt;strong&gt;Security&lt;/strong&gt;: Enable &lt;strong&gt;IAM roles, encryption (KMS), and VPC isolation&lt;/strong&gt;.  &lt;/p&gt;
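
&lt;p&gt;A sketch of the transient-cluster practice with boto3: submit the steps at creation time and let the cluster terminate when the last step finishes. Names, roles, and script locations are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Transient EMR cluster: runs the submitted steps, then auto-terminates
import boto3

emr = boto3.client("emr")

resp = emr.run_job_flow(
    Name="transient-hive-job",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the last step
    },
    Steps=[{
        "Name": "HiveQueryStep",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://scripts/query.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("cluster:", resp["JobFlowId"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;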

&lt;h3&gt;
  
  
  &lt;strong&gt;Common Challenges&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;⚠ &lt;strong&gt;Script Compatibility&lt;/strong&gt;: Some Pig/Hive scripts may need adjustments for S3.&lt;br&gt;&lt;br&gt;
⚠ &lt;strong&gt;Oozie Dependency Replacement&lt;/strong&gt;: Step Functions/MWAA may require workflow redesign.&lt;br&gt;&lt;br&gt;
⚠ &lt;strong&gt;Performance Tuning&lt;/strong&gt;: Optimize partition strategies for S3-based queries.  &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Migrating Hadoop workflows from on-premises to &lt;strong&gt;AWS EMR&lt;/strong&gt; improves scalability, reduces costs, and leverages AWS-managed services. By following the steps outlined—assessing workflows, setting up EMR, migrating Pig/Hive scripts, and replacing Oozie with AWS-native orchestration—you can ensure a smooth transition.&lt;/p&gt;

&lt;p&gt;For further optimization, consider &lt;strong&gt;EMR Serverless&lt;/strong&gt; for sporadic workloads or &lt;strong&gt;AWS Glue&lt;/strong&gt; for ETL automation. Start with a &lt;strong&gt;proof-of-concept migration&lt;/strong&gt; to validate performance before full-scale deployment.&lt;/p&gt;

&lt;p&gt;Would you like a deeper dive into any specific migration step? Let us know in the comments!  &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
