
Wash away the CRUD with Kafka Streaming, Part 1

John Bull · 5 min read

Kafka Streams with your web app.


What is Kafka?

A streaming platform has three key capabilities:

  1. Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
  2. Store streams of records in a fault-tolerant, durable way.
  3. Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:

  1. Building real-time streaming data pipelines that reliably get data between systems or applications
  2. Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

This description is from the Kafka website.
https://kafka.apache.org/intro

OK, so why might we want to use it?

I can think of a few use cases.

  1. It's highly effective in leveraging legacy systems that can't easily be modified or removed.
  2. Migrating large clusters of data to the cloud.
  3. Creating highly reactive systems based on events.

When I first started looking into Kafka and its event-based architecture, it was a bit confusing. I didn't really see how I would write programs to use the technology, and its benefits seemed ethereal at best. But once I wrote a simple app to demonstrate how I could use events, it became much clearer. So let's go through a simple example.

Note: I come from a web dev background, and though I am working on improving my skills with other languages, I am most comfortable with Node.js, so that's what I reached for in this case. You will notice that the Kafka ecosystem is built with Java in mind and has a lot of tooling for both Java and Python; I would probably go with either of those languages in production because of the tooling options.

Step one: Setting up the Kafka Environment with Confluent.

It's a bit of a high bar to get this all playing nice, but the documentation provided by Confluent is excellent.

Link to the Ubuntu/Debian instructions because Linux: https://docs.confluent.io/current/installation/installing_cp/deb-ubuntu.html#systemd-ubuntu-debian-install

wget -qO - https://packages.confluent.io/deb/5.2/archive.key | sudo apt-key add -

sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/5.2 stable main"

sudo apt-get update && sudo apt-get install confluent-community-2.12

Once installed, we can run

confluent start

and you should see all the components start.

For this example to work with Kafka, we are going to need to set up a stream of data called a producer and something that reads that data called a consumer. We also will want some interactive element to our application.

First, we will need to navigate to the Kafka install location and run a command to create the topic we will use for our messages. The trick here might be finding the install location to run the script, but once you do we will set up a simple topic as follows.

./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic employees
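Before moving on, it's worth confirming the topic was actually created. The `--list` and `--describe` flags below are standard `kafka-topics.sh` options; this is just a sanity check against the same ZooKeeper instance (it needs your cluster running, so output will vary).

```shell
# List all topics registered in ZooKeeper; "employees" should appear.
./kafka-topics.sh --list --zookeeper localhost:2181

# Show partition and replication details for the new topic.
./kafka-topics.sh --describe --zookeeper localhost:2181 --topic employees
```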

Then, we will create a Kafka Producer to get some of our data flowing into the events system.

//employeeProducer.js
const fs = require("fs");
const parse = require("csv-parse");

const kafka = require("kafka-node");
const Producer = kafka.Producer;
const KeyedMessage = kafka.KeyedMessage;
const client = new kafka.Client("localhost:2181");

const topic = "employees";

const producer = new Producer(client);
let testProducerReady = false;

This is the config stuff that sets up which Kafka topic we will be dealing with and where to find the Kafka cluster. Think of a topic as the equivalent of a database, and a message as the equivalent of a record.

Next, we are going to use some of the methods provided by the kafka-node Producer we imported. We are also going to parse a CSV file with some employee data. I used the site Mockaroo to create some data for me with just first name, last name, and hire date, and saved it in a directory in my application folder, /dataFiles/.
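For reference, the mock file is plain comma-separated rows of first name, last name, and hire date. Using names that appear in the console output later in the post, the first few lines might look like this (your own Mockaroo export will differ):

```
Fax,Giraudat,7/20/2014
Brande,Jewks,6/4/2016
Bel,Bromfield,8/10/2016
```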

producer.on("ready", function() {
  console.log("Producer for tests is ready");
  testProducerReady = true;
});

producer.on("error", function(err) {
  console.error("Problem with producing Kafka message " + err);
});

// We are bringing the data in from a CSV file, since in our legacy-environment example that is how it arrives.

const inputFile = "./dataFiles/MOCK_Employee_DATA.csv";

let dataArray = [];

let parser = parse({ delimiter: "," }, function(err, data) {
  dataArray = data;
  handleData(1);
});

fs.createReadStream(inputFile).pipe(parser);

The function handleData will create a data node of key-value pairs for us, and we use that in the produceDataMessage function to pass to our Kafka producer. I am using a setTimeout on the data just to see it stream in the console so we can see what's going on.

const handleData = currentData => {
  // Stop once we run out of rows in the parsed CSV.
  if (currentData >= dataArray.length) return;
  let line = dataArray[currentData];
  let dataNode = {
    f_name: line[0],
    l_name: line[1],
    hire_date: line[2],
    event_id: "emp_chg_01"
  };
  console.log(JSON.stringify(dataNode));
  produceDataMessage(dataNode);
  let delay = 20;
  setTimeout(handleData.bind(null, currentData + 1), delay);
};

// Create a keyed message from the parsed JSON data and send it to Kafka.
const produceDataMessage = dataNode => {
  // Key each message by its event id and carry the record as the JSON payload.
  const dataNodeKM = new KeyedMessage(dataNode.event_id, JSON.stringify(dataNode));
  const payloads = [{ topic: topic, messages: dataNodeKM, partition: 0 }];
  if (testProducerReady) {
    producer.send(payloads, function(err, data) {
      console.log(data);
      if (err) {
        console.error("Failed to produce message to Kafka with error ==== " + err);
      }
    });
  } else {
    console.log("Sorry, the producer is not ready yet.");
  }
};

If we run this now, we should see messages with the data from our file, something like this:

{ employees: { '0': 948 } }
{"f_name":"Fax","l_name":"Giraudat","hire_date":"7/20/2014","event_id":"emp_chg_01"}
{ employees: { '0': 949 } }
{"f_name":"Brande","l_name":"Jewks","hire_date":"6/4/2016","event_id":"emp_chg_01"}
{ employees: { '0': 950 } }
{"f_name":"Bel","l_name":"Bromfield","hire_date":"8/10/2016","event_id":"emp_chg_01"}
{ employees: { '0': 951 } }
{"f_name":"Christophorus","l_name":"Kimbrey","hire_date":"6/4/2015","event_id":"emp_chg_01"}
{ employees: { '0': 952 } }
{"f_name":"Inga","l_name":"Reedyhough","hire_date":"7/12/2012","event_id":"emp_chg_01"}

If we have that, we are in pretty good shape. So we have Kafka set up, and we can stream data to the cluster from our script. Next, we will create a consumer to expose these messages to a front end, and finally we will create an event-based system that can react to events based on event IDs.
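As a preview of that event-based reaction, here is a minimal sketch of how a consumer callback could route messages by their event_id field. This is hypothetical illustration, not the Part 2 code: the handler table and routeMessage function are my own names, and only the "emp_chg_01" id and the message shape come from the producer above.

```javascript
// A sketch of dispatching on the event_id carried in each message.
const handlers = {
  // "emp_chg_01" is the event id our producer attaches to employee changes.
  emp_chg_01: employee => `updated ${employee.f_name} ${employee.l_name}`
};

// kafka-node delivers messages whose .value is the JSON string the
// producer serialized, so we parse it and look up a handler by event_id.
const routeMessage = rawValue => {
  const event = JSON.parse(rawValue);
  const handler = handlers[event.event_id];
  return handler ? handler(event) : `no handler for ${event.event_id}`;
};
```

In Part 2, a lookup like this could sit inside the consumer's message callback, so new event types become a matter of adding one entry to the handler table.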

Part 2
Part 3
