<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sagar Lakshmipathy</title>
    <description>The latest articles on DEV Community by Sagar Lakshmipathy (@sagarlakshmipathy).</description>
    <link>https://dev.to/sagarlakshmipathy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1211552%2F85f1f7cd-4cf8-444e-9a08-ce8651afe90e.JPG</url>
      <title>DEV Community: Sagar Lakshmipathy</title>
      <link>https://dev.to/sagarlakshmipathy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagarlakshmipathy"/>
    <language>en</language>
    <item>
      <title>Apache Kafka on Amazon Linux EC2</title>
      <dc:creator>Sagar Lakshmipathy</dc:creator>
      <pubDate>Wed, 31 Jul 2024 03:49:05 +0000</pubDate>
      <link>https://dev.to/sagarlakshmipathy/apache-kafka-on-amazon-linux-ec2-579j</link>
      <guid>https://dev.to/sagarlakshmipathy/apache-kafka-on-amazon-linux-ec2-579j</guid>
      <description>&lt;p&gt;In this article, we will walk through the steps to set up Apache Kafka on an Amazon EC2 instance (Amazon Linux distribution). We'll start with updating the system, installing Java, and then proceed with the installation and configuration of Kafka. By the end, you'll have a running Kafka instance, ready for producing and consuming messages.&lt;/p&gt;

&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An Amazon EC2 instance (running Amazon Linux 2 or a similar distribution)&lt;/li&gt;
&lt;li&gt;Basic knowledge of using the terminal&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Steps&lt;/h2&gt;
&lt;h3&gt;Step 1: Update the System&lt;/h3&gt;

&lt;p&gt;First, ensure your system is up to date by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum update -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 2: Install Java&lt;/h3&gt;

&lt;p&gt;Apache Kafka requires Java to run. Install Java 11 using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum install java-11-amazon-corretto-devel -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
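&lt;p&gt;You can confirm the installation by checking the Java version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;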



&lt;h3&gt;Step 3: Download and Extract Kafka&lt;/h3&gt;

&lt;p&gt;Next, download Kafka from the official Apache archive and extract it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.12-3.3.1.tgz
tar -xzf kafka_2.12-3.3.1.tgz
cd kafka_2.12-3.3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 4: Start ZooKeeper&lt;/h3&gt;

&lt;p&gt;Kafka requires ZooKeeper to manage its cluster. Start ZooKeeper in the background with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup bin/zookeeper-server-start.sh config/zookeeper.properties &amp;gt; zookeeper.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
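&lt;p&gt;To verify that ZooKeeper came up cleanly, check its log; you should see it bind to the default client port 2181:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail zookeeper.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;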



&lt;h3&gt;Step 5: Configure Kafka&lt;/h3&gt;

&lt;p&gt;Edit the Kafka configuration file to set the advertised listeners. Replace &lt;code&gt;ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com&lt;/code&gt; with your EC2 instance's public DNS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vi config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add (or uncomment and update) the following line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;advertised.listeners=PLAINTEXT://ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
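&lt;p&gt;Because clients will connect to the broker through this public DNS name, the instance's security group must also allow inbound traffic on port 9092. As a sketch (the security group ID and source CIDR below are placeholders for your own values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9092 --cidr 203.0.113.10/32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;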



&lt;h3&gt;Step 6: Start Kafka Server&lt;/h3&gt;

&lt;p&gt;Now, start the Kafka server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup bin/kafka-server-start.sh config/server.properties &amp;gt; kafka-server.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
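&lt;p&gt;Again, check the log to confirm the broker is up; you should see a line similar to &lt;code&gt;[KafkaServer id=0] started&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail kafka-server.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;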



&lt;h3&gt;Step 7: Create a Topic&lt;/h3&gt;

&lt;p&gt;Create a new Kafka topic named &lt;code&gt;test&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092 --partitions 1 --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
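&lt;p&gt;You can verify the topic was created by describing it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --describe --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;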



&lt;h3&gt;Step 8: Produce Messages&lt;/h3&gt;

&lt;p&gt;Start a Kafka producer to send messages to the &lt;code&gt;test&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-producer.sh --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now type messages into the console, and they will be sent to the Kafka topic.&lt;/p&gt;
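&lt;p&gt;For example, each line you type becomes one message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;hello kafka
&amp;gt;this is my first message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;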

&lt;h3&gt;Step 9: Consume Messages&lt;/h3&gt;

&lt;p&gt;Start a Kafka consumer to read messages from the &lt;code&gt;test&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092 --from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the messages you produced earlier displayed in the console.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;By following these steps, you've successfully set up Apache Kafka on an Amazon EC2 instance. You can now produce and consume messages, enabling you to build real-time data streaming applications. This setup forms the foundation for more advanced Kafka configurations and use cases.&lt;/p&gt;

&lt;p&gt;Feel free to explore Kafka's extensive features and tailor the configuration to suit your specific requirements. Happy streaming!&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>amazonlinux</category>
      <category>ec2</category>
    </item>
    <item>
      <title>Apache Hudi on AWS Glue</title>
      <dc:creator>Sagar Lakshmipathy</dc:creator>
      <pubDate>Sun, 19 May 2024 13:16:34 +0000</pubDate>
      <link>https://dev.to/sagarlakshmipathy/apache-hudi-on-aws-glue-450l</link>
      <guid>https://dev.to/sagarlakshmipathy/apache-hudi-on-aws-glue-450l</guid>
      <description>&lt;p&gt;Have you wondered how to write Hudi tables (Scala) in AWS Glue?&lt;br&gt;
Look no further.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a Glue Database called &lt;code&gt;hudi_db&lt;/code&gt; from the &lt;code&gt;Databases&lt;/code&gt; page under the &lt;code&gt;Data Catalog&lt;/code&gt; menu in the Glue Console&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's pick the &lt;a href="https://hudi.apache.org/docs/quick-start-guide"&gt;Apache Hudi Spark QuickStart guide&lt;/a&gt; to drive this example.&lt;/p&gt;
&lt;h3&gt;Configuring the job&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the Glue console, choose &lt;code&gt;ETL Jobs&lt;/code&gt;, then choose &lt;code&gt;Script Editor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Now, in the tabs above, choose &lt;code&gt;Job details&lt;/code&gt; and set &lt;code&gt;Language&lt;/code&gt; to &lt;code&gt;Scala&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Feel free to make any infrastructure changes as required.&lt;/li&gt;
&lt;li&gt;Click on &lt;code&gt;Advanced properties&lt;/code&gt;, navigate to &lt;code&gt;Job parameters&lt;/code&gt;, and add the following parameters one by one. Of course, change these values as you prefer.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--S3_OUTPUT_PATH&lt;/code&gt; as &lt;code&gt;s3://hudi-spark-quickstart/write-path/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--class&lt;/code&gt; as &lt;code&gt;GlueApp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--conf&lt;/code&gt; as &lt;code&gt;spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--datalake-formats&lt;/code&gt; as &lt;code&gt;hudi&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: In this example, I'm using the default Hudi version - &lt;strong&gt;0.12.0&lt;/strong&gt; - that comes with Glue 4.0. If you want to use a different Hudi version, you might have to add the jar to the classpath by adding one more property, &lt;code&gt;--extra-jars&lt;/code&gt;, pointing to the S3 path of the Hudi JAR file (see the example below).&lt;/p&gt;
&lt;/blockquote&gt;
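&lt;p&gt;For example, a hypothetical extra parameter (the bucket and jar version here are placeholders): &lt;code&gt;--extra-jars&lt;/code&gt; as &lt;code&gt;s3://your-bucket/jars/hudi-spark3-bundle_2.12-0.14.0.jar&lt;/code&gt;&lt;/p&gt;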

&lt;p&gt;On to the cool stuff now. &lt;/p&gt;
&lt;h3&gt;Scripting&lt;/h3&gt;

&lt;p&gt;Navigate to the &lt;code&gt;Script&lt;/code&gt; tab and add the Scala code below.&lt;/p&gt;

&lt;p&gt;Let's add the boilerplate imports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import com.amazonaws.services.glue.{GlueContext, DynamicFrame}
import com.amazonaws.services.glue.util.GlueArgParser
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.types._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import com.amazonaws.services.glue.log.GlueLogger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the Glue-specific code, i.e. to parse the job parameters and to create a &lt;code&gt;glueContext&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "S3_OUTPUT_PATH").toArray)
    val spark: SparkSession = SparkSession.builder().appName("AWS Glue Hudi Job").getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark.sparkContext)
    val logger = new GlueLogger()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's prep the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import spark.implicits._

    val tableName = "trips"
    val recordKeyColumn = "uuid"
    val precombineKeyColumn = "ts"
    val partitionKeyColumn = "city"
    val s3OutputPath = args("S3_OUTPUT_PATH")
    val glueDbName = "hudi_db"
    val writePath = s"$s3OutputPath/$tableName"


    val columns = Seq("ts","uuid","rider","driver","fare","city")
    val data =
      Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
        (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
        (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
        (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
        (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the options required by Hudi to write the table and sync it to the Glue Data Catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    val hudiOptions = Map[String, String](
      "hoodie.table.name" -&amp;gt; tableName,
      "hoodie.datasource.write.recordkey.field" -&amp;gt; recordKeyColumn,
      "hoodie.datasource.write.precombine.field" -&amp;gt; precombineKeyColumn,
      "hoodie.datasource.write.partitionpath.field" -&amp;gt; partitionKeyColumn,
      "hoodie.datasource.write.hive_style_partitioning" -&amp;gt; "true",
      "hoodie.datasource.write.storage.type" -&amp;gt; "COPY_ON_WRITE",
      "hoodie.datasource.write.operation" -&amp;gt; "upsert",
      "hoodie.datasource.hive_sync.enable" -&amp;gt; "true",
      "hoodie.datasource.hive_sync.database" -&amp;gt; glueDbName,
      "hoodie.datasource.hive_sync.table" -&amp;gt; tableName,
      "hoodie.datasource.hive_sync.partition_fields" -&amp;gt; partitionKeyColumn,
      "hoodie.datasource.hive_sync.partition_extractor_class" -&amp;gt; "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.use_jdbc" -&amp;gt; "false",
      "hoodie.datasource.hive_sync.mode" -&amp;gt; "hms",
      "path" -&amp;gt; writePath
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, create the DataFrame and write it to S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    var inserts = spark.createDataFrame(data).toDF(columns:_*)

    inserts.write
      .format("hudi")
      .options(hudiOptions)
      .mode("overwrite")
      .save()

    logger.info("Data successfully written to S3 using Hudi")
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
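&lt;p&gt;As an optional sanity check, you could also read the table back inside the same job. This snippet is a sketch and would go just before the closing braces above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Read the Hudi table back from S3 and show a few columns
    val tripsDF = spark.read.format("hudi").load(writePath)
    tripsDF.select("uuid", "rider", "driver", "fare", "city").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;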



&lt;h3&gt;Querying&lt;/h3&gt;

&lt;p&gt;Now that we have written the table to S3, we can query it from Athena:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM "hudi_db"."trips" limit 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
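&lt;p&gt;Alongside the data columns, you should also see Hudi's metadata columns in the results, such as &lt;code&gt;_hoodie_commit_time&lt;/code&gt; and &lt;code&gt;_hoodie_record_key&lt;/code&gt;.&lt;/p&gt;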



</description>
      <category>awsglue</category>
      <category>spark</category>
      <category>apachehudi</category>
      <category>scala</category>
    </item>
  </channel>
</rss>
