<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sagar Lakshmipathy</title>
    <description>The latest articles on DEV Community by Sagar Lakshmipathy (@sagarlakshmipathy).</description>
    <link>https://dev.to/sagarlakshmipathy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1211552%2F85f1f7cd-4cf8-444e-9a08-ce8651afe90e.JPG</url>
      <title>DEV Community: Sagar Lakshmipathy</title>
      <link>https://dev.to/sagarlakshmipathy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagarlakshmipathy"/>
    <language>en</language>
    <item>
      <title>Apache Kafka on Amazon Linux EC2</title>
      <dc:creator>Sagar Lakshmipathy</dc:creator>
      <pubDate>Wed, 31 Jul 2024 03:49:05 +0000</pubDate>
      <link>https://dev.to/sagarlakshmipathy/apache-kafka-on-amazon-linux-ec2-579j</link>
      <guid>https://dev.to/sagarlakshmipathy/apache-kafka-on-amazon-linux-ec2-579j</guid>
      <description>&lt;p&gt;In this article, we will walk through the steps to set up Apache Kafka on an Amazon EC2 instance (Amazon Linux distribution). We'll start with updating the system, installing Java, and then proceed with the installation and configuration of Kafka. By the end, you'll have a running Kafka instance, ready for producing and consuming messages.&lt;/p&gt;

&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An Amazon EC2 instance (running Amazon Linux 2 or a similar distribution)&lt;/li&gt;
&lt;li&gt;Basic knowledge of using the terminal&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Steps&lt;/h2&gt;
&lt;h3&gt;Step 1: Update the System&lt;/h3&gt;

&lt;p&gt;First, ensure your system is up to date by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum update -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 2: Install Java&lt;/h3&gt;

&lt;p&gt;Apache Kafka requires Java to run. Install Java 11 using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo yum install java-11-amazon-corretto-devel -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
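&lt;p&gt;You can confirm the installation by checking the Java version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;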



&lt;h3&gt;Step 3: Download and Extract Kafka&lt;/h3&gt;

&lt;p&gt;Next, download Kafka from the official Apache archive and extract it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.12-3.3.1.tgz
tar -xzf kafka_2.12-3.3.1.tgz
cd kafka_2.12-3.3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 4: Start ZooKeeper&lt;/h3&gt;

&lt;p&gt;Kafka requires ZooKeeper to manage its cluster. Start ZooKeeper in the background with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup bin/zookeeper-server-start.sh config/zookeeper.properties &amp;gt; zookeeper.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
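&lt;p&gt;To verify that ZooKeeper came up cleanly, check its log; you should see it bind to the default client port 2181:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail zookeeper.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;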



&lt;h3&gt;Step 5: Configure Kafka&lt;/h3&gt;

&lt;p&gt;Edit the Kafka configuration file to set the advertised listeners. Replace &lt;code&gt;ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com&lt;/code&gt; with your EC2 instance's public DNS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vi config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add (or uncomment and update) the following line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;advertised.listeners=PLAINTEXT://ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
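&lt;p&gt;Because clients will connect to the broker through this public DNS name, the instance's security group must also allow inbound traffic on port 9092. As a sketch (the security group ID and source CIDR below are placeholders for your own values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9092 --cidr 203.0.113.10/32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;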



&lt;h3&gt;Step 6: Start Kafka Server&lt;/h3&gt;

&lt;p&gt;Now, start the Kafka server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nohup bin/kafka-server-start.sh config/server.properties &amp;gt; kafka-server.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
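&lt;p&gt;Again, check the log to confirm the broker is up; you should see a line similar to &lt;code&gt;[KafkaServer id=0] started&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail kafka-server.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;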



&lt;h3&gt;Step 7: Create a Topic&lt;/h3&gt;

&lt;p&gt;Create a new Kafka topic named &lt;code&gt;test&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092 --partitions 1 --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
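&lt;p&gt;You can verify the topic was created by describing it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --describe --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;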



&lt;h3&gt;Step 8: Produce Messages&lt;/h3&gt;

&lt;p&gt;Start a Kafka producer to send messages to the &lt;code&gt;test&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-producer.sh --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now type messages into the console, and they will be sent to the Kafka topic.&lt;/p&gt;
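&lt;p&gt;For example, each line you type becomes one message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;hello kafka
&amp;gt;this is my first message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;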

&lt;h3&gt;Step 9: Consume Messages&lt;/h3&gt;

&lt;p&gt;Start a Kafka consumer to read messages from the &lt;code&gt;test&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --topic test --bootstrap-server ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com:9092 --from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the messages you produced earlier displayed in the console.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;By following these steps, you've successfully set up Apache Kafka on an Amazon EC2 instance. You can now produce and consume messages, enabling you to build real-time data streaming applications. This setup forms the foundation for more advanced Kafka configurations and use cases.&lt;/p&gt;

&lt;p&gt;Feel free to explore Kafka's extensive features and tailor the configuration to suit your specific requirements. Happy streaming!&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>amazonlinux</category>
      <category>ec2</category>
    </item>
    <item>
      <title>Apache Hudi on AWS Glue</title>
      <dc:creator>Sagar Lakshmipathy</dc:creator>
      <pubDate>Sun, 19 May 2024 13:16:34 +0000</pubDate>
      <link>https://dev.to/sagarlakshmipathy/apache-hudi-on-aws-glue-450l</link>
      <guid>https://dev.to/sagarlakshmipathy/apache-hudi-on-aws-glue-450l</guid>
      <description>&lt;p&gt;Have you wondered how to write Hudi tables (Scala) in AWS Glue?&lt;br&gt;
Look no further.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a Glue Database called &lt;code&gt;hudi_db&lt;/code&gt; from the &lt;code&gt;Databases&lt;/code&gt; page under the &lt;code&gt;Data Catalog&lt;/code&gt; menu in the Glue Console&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's pick the &lt;a href="https://hudi.apache.org/docs/quick-start-guide"&gt;Apache Hudi Spark QuickStart guide&lt;/a&gt; to drive this example.&lt;/p&gt;
&lt;h3&gt;Configuring the job&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the Glue console, choose &lt;code&gt;ETL Jobs&lt;/code&gt;, then choose &lt;code&gt;Script Editor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Now, in the tabs above, choose &lt;code&gt;Job details&lt;/code&gt; and set &lt;code&gt;Language&lt;/code&gt; to &lt;code&gt;Scala&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Feel free to make any infrastructure changes as required.&lt;/li&gt;
&lt;li&gt;Click on &lt;code&gt;Advanced properties&lt;/code&gt;, navigate to &lt;code&gt;Job parameters&lt;/code&gt;, and add the following parameters one by one. Of course, change these values as you prefer.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--S3_OUTPUT_PATH&lt;/code&gt; as &lt;code&gt;s3://hudi-spark-quickstart/write-path/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--class&lt;/code&gt; as &lt;code&gt;GlueApp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--conf&lt;/code&gt; as &lt;code&gt;spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--datalake-formats&lt;/code&gt; as &lt;code&gt;hudi&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: In this example, I'm using the default Hudi version - &lt;strong&gt;0.12.0&lt;/strong&gt; - that comes with Glue 4.0. If you want to use a different Hudi version, you might have to add the jar to the classpath by adding one more property, &lt;code&gt;--extra-jars&lt;/code&gt;, pointing to the S3 path of the Hudi JAR file (see the example below).&lt;/p&gt;
&lt;/blockquote&gt;
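&lt;p&gt;For example, a hypothetical extra parameter (the bucket and jar version here are placeholders): &lt;code&gt;--extra-jars&lt;/code&gt; as &lt;code&gt;s3://your-bucket/jars/hudi-spark3-bundle_2.12-0.14.0.jar&lt;/code&gt;&lt;/p&gt;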

&lt;p&gt;On to the cool stuff now. &lt;/p&gt;
&lt;h3&gt;Scripting&lt;/h3&gt;

&lt;p&gt;Navigate to the &lt;code&gt;Script&lt;/code&gt; tab and add the Scala code below.&lt;/p&gt;

&lt;p&gt;Let's add the boilerplate imports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import com.amazonaws.services.glue.{GlueContext, DynamicFrame}
import com.amazonaws.services.glue.util.GlueArgParser
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.types._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import com.amazonaws.services.glue.log.GlueLogger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the Glue-specific code, i.e. to parse the job parameters and to create a &lt;code&gt;glueContext&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "S3_OUTPUT_PATH").toArray)
    val spark: SparkSession = SparkSession.builder().appName("AWS Glue Hudi Job").getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark.sparkContext)
    val logger = new GlueLogger()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's prep the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import spark.implicits._

    val tableName = "trips"
    val recordKeyColumn = "uuid"
    val precombineKeyColumn = "ts"
    val partitionKeyColumn = "city"
    val s3OutputPath = args("S3_OUTPUT_PATH")
    val glueDbName = "hudi_db"
    val writePath = s"$s3OutputPath/$tableName"


    val columns = Seq("ts","uuid","rider","driver","fare","city")
    val data =
      Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
        (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
        (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
        (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
        (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the options required by Hudi to write the table and sync it to the Glue Data Catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    val hudiOptions = Map[String, String](
      "hoodie.table.name" -&amp;gt; tableName,
      "hoodie.datasource.write.recordkey.field" -&amp;gt; recordKeyColumn,
      "hoodie.datasource.write.precombine.field" -&amp;gt; precombineKeyColumn,
      "hoodie.datasource.write.partitionpath.field" -&amp;gt; partitionKeyColumn,
      "hoodie.datasource.write.hive_style_partitioning" -&amp;gt; "true",
      "hoodie.datasource.write.storage.type" -&amp;gt; "COPY_ON_WRITE",
      "hoodie.datasource.write.operation" -&amp;gt; "upsert",
      "hoodie.datasource.hive_sync.enable" -&amp;gt; "true",
      "hoodie.datasource.hive_sync.database" -&amp;gt; glueDbName,
      "hoodie.datasource.hive_sync.table" -&amp;gt; tableName,
      "hoodie.datasource.hive_sync.partition_fields" -&amp;gt; partitionKeyColumn,
      "hoodie.datasource.hive_sync.partition_extractor_class" -&amp;gt; "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.use_jdbc" -&amp;gt; "false",
      "hoodie.datasource.hive_sync.mode" -&amp;gt; "hms",
      "path" -&amp;gt; writePath
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, create the DataFrame and write it to S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    var inserts = spark.createDataFrame(data).toDF(columns:_*)

    inserts.write
      .format("hudi")
      .options(hudiOptions)
      .mode("overwrite")
      .save()

    logger.info("Data successfully written to S3 using Hudi")
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
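&lt;p&gt;As an optional sanity check, you could also read the table back inside the same job. This snippet is a sketch and would go just before the closing braces above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Read the Hudi table back from S3 and show a few columns
    val tripsDF = spark.read.format("hudi").load(writePath)
    tripsDF.select("uuid", "rider", "driver", "fare", "city").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;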



&lt;h3&gt;Querying&lt;/h3&gt;

&lt;p&gt;Now that we have written the table to S3, we can query it from Athena:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM "hudi_db"."trips" limit 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
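&lt;p&gt;Alongside the data columns, you should also see Hudi's metadata columns in the results, such as &lt;code&gt;_hoodie_commit_time&lt;/code&gt; and &lt;code&gt;_hoodie_record_key&lt;/code&gt;.&lt;/p&gt;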



</description>
      <category>awsglue</category>
      <category>spark</category>
      <category>apachehudi</category>
      <category>scala</category>
    </item>
  </channel>
</rss>
