<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhoomil B Sheta</title>
    <description>The latest articles on DEV Community by Dhoomil B Sheta (@dbsheta).</description>
    <link>https://dev.to/dbsheta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F112468%2Fc84895aa-d6c8-4953-a199-7534f1c16215.jpeg</url>
      <title>DEV Community: Dhoomil B Sheta</title>
      <link>https://dev.to/dbsheta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dbsheta"/>
    <language>en</language>
    <item>
      <title>Processing Streaming Twitter Data using Kafka and Spark - Part 2: Creating Kafka Twitter producer</title>
      <dc:creator>Dhoomil B Sheta</dc:creator>
      <pubDate>Mon, 05 Nov 2018 19:08:44 +0000</pubDate>
      <link>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark-part-2-creating-kafka-twitter-streamproducer-12ko</link>
      <guid>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark-part-2-creating-kafka-twitter-streamproducer-12ko</guid>
      <description>&lt;p&gt;Processing Streaming Twitter Data using Kafka and Spark series.&lt;br&gt;
Part 0: &lt;a href="https://medium.com/dhoomil-sheta/processing-streaming-twitter-data-using-kafka-and-spark-the-plan-58b893e42403" rel="noopener noreferrer"&gt;The Plan&lt;/a&gt;&lt;br&gt;
Part 1: &lt;a href="https://medium.com/dhoomil-sheta/processing-streaming-twitter-data-using-kafka-and-spark-part-1-setting-up-kafka-cluster-6e491809fa6d" rel="noopener noreferrer"&gt;Setting Up Kafka&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Before we start implementing any component, let’s lay out an architecture, or block diagram, that we will build up throughout this series one piece at a time. Since our intention is to learn several technologies through a single use case, this fits just right.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzTHQFs9KDNV24Phd98u60w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzTHQFs9KDNV24Phd98u60w.png" width="526" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram covers all points I laid out in &lt;a href="https://medium.com/dhoomil-sheta/processing-streaming-twitter-data-using-kafka-and-spark-the-plan-58b893e42403" rel="noopener noreferrer"&gt;The Plan&lt;/a&gt;. We already finished setting up a Kafka Cluster in Part 1. &lt;/p&gt;

&lt;p&gt;In this article, we’ll focus on building a producer which will fetch the latest tweets on &lt;em&gt;#bigdata&lt;/em&gt; and push them to our cluster.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is a Producer?
&lt;/h2&gt;

&lt;p&gt;Different people want to use Kafka for different purposes: some as a queue, some as a message bus, and some as a data storage platform. Whatever the case, you will always use Kafka by writing a producer that writes data to Kafka, a consumer that reads data from Kafka, or an application that serves both roles.&lt;/p&gt;

&lt;p&gt;Kafka has built-in client APIs that developers can use when building applications that interact with Kafka. In this article, we’ll use the Producer API to create a client which fetches tweets from Twitter and sends them to Kafka.&lt;/p&gt;

&lt;p&gt;A Note from &lt;strong&gt;Kafka: The Definitive Guide:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In addition to the built-in clients, Kafka has a binary wire protocol which you can implement in programming language of your choice. This means that it is possible for applications to read messages from Kafka or write messages to Kafka simply by sending the correct byte sequences to Kafka’s network port. Such clients are not part of Apache Kafka project, but a list of non-Java clients is maintained in the &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/Clients" rel="noopener noreferrer"&gt;project wiki&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Following are the features of the Java Producer API that ships with Kafka:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The producer has a pool of buffer space that holds records that haven’t yet been transmitted to the server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also has a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Failure to close the producer after use will leak these resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  How to fetch latest Tweets?
&lt;/h2&gt;

&lt;p&gt;Twitter provides an open-source client called Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s &lt;a href="https://dev.twitter.com/docs/streaming-apis" rel="noopener noreferrer"&gt;Streaming API&lt;/a&gt;. It enables clients to receive Tweets in near real-time. Every Twitter account has access to the Streaming API, and any developer can build applications with it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Generating Twitter API Keys
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If you don’t have developer access yet, go to &lt;a href="https://dev.twitter.com/apps/new" rel="noopener noreferrer"&gt;https://dev.twitter.com/apps/new&lt;/a&gt; and apply for it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;a href="https://developer.twitter.com/en/apps" rel="noopener noreferrer"&gt;https://developer.twitter.com/en/apps&lt;/a&gt; and create a new application. (Leave the callback URL blank.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to the &lt;strong&gt;&lt;em&gt;Keys and tokens&lt;/em&gt;&lt;/strong&gt; tab and copy the consumer key and secret pair to a file for later use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on “Create” to generate the access token and secret. Copy both of them to a file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now you have all things needed for developing the producer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s go ahead and start implementing a Kafka producer client which will use this library. For all those who want to see the completed code, here is the link: &lt;a href="https://github.com/dbsheta/kafka-twitter-producer" rel="noopener noreferrer"&gt;https://github.com/dbsheta/kafka-twitter-producer&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Create Maven Project
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open the IDE of your choice and create a new Maven project. I’ll name mine &lt;em&gt;kafka-twitter-producer&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the Kafka, Twitter (hbc), and Gson dependencies to &lt;em&gt;pom.xml&lt;/em&gt; and rebuild the project.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
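&lt;p&gt;The dependency block originally appeared here as an embedded gist. A minimal sketch of the &lt;em&gt;pom.xml&lt;/em&gt; entries might look like the following; the version numbers are only examples, so use whatever is current for you:&lt;/p&gt;

```xml
&lt;dependencies&gt;
    &lt;!-- Kafka producer/consumer client APIs --&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.apache.kafka&lt;/groupId&gt;
        &lt;artifactId&gt;kafka-clients&lt;/artifactId&gt;
        &lt;version&gt;2.0.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;!-- Twitter Hosebird client for the Streaming API --&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.twitter&lt;/groupId&gt;
        &lt;artifactId&gt;hbc-core&lt;/artifactId&gt;
        &lt;version&gt;2.2.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;!-- Gson for mapping tweet JSON onto POJOs --&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.google.code.gson&lt;/groupId&gt;
        &lt;artifactId&gt;gson&lt;/artifactId&gt;
        &lt;version&gt;2.8.5&lt;/version&gt;
    &lt;/dependency&gt;
&lt;/dependencies&gt;
```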




&lt;h2&gt;
  
  
  Implement Producer
&lt;/h2&gt;

&lt;p&gt;First of all, let’s define constants to configure Kafka Producer.&lt;/p&gt;
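&lt;p&gt;The constants were originally embedded here as a gist; a minimal sketch, with illustrative names and placeholder broker addresses, could look like this:&lt;/p&gt;

```java
// Illustrative names; the broker list should point at the cluster from Part 1
public final class ProducerConstants {
    public static final String BOOTSTRAP_SERVERS = "X.X.X.X:9092,Y.Y.Y.Y:9092";
    public static final String TOPIC = "bigdata-tweets";
    public static final String ACKS = "1";   // wait for the leader only
    public static final int RETRIES = 3;     // retry transient send failures

    private ProducerConstants() { }          // no instances; constants only
}
```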



&lt;p&gt;Now, we’ll copy the secrets and tokens from Twitter Developer console.&lt;/p&gt;
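&lt;p&gt;Again as a sketch (the names are placeholders; never commit your real credentials to source control):&lt;/p&gt;

```java
// Placeholders; paste the values you saved from the Keys and tokens tab
public final class TwitterConstants {
    public static final String CONSUMER_KEY = "YOUR_CONSUMER_KEY";
    public static final String CONSUMER_SECRET = "YOUR_CONSUMER_SECRET";
    public static final String ACCESS_TOKEN = "YOUR_ACCESS_TOKEN";
    public static final String ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET";

    private TwitterConstants() { }
}
```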



&lt;p&gt;The tweet returned by the Twitter API is a very large JSON string and contains all the details we require for our project. You can find a full sample response &lt;a href="https://github.com/dbsheta/kafka-twitter-producer/blob/master/src/main/resources/tweet.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We create two entities, &lt;em&gt;Tweet&lt;/em&gt; and &lt;em&gt;User&lt;/em&gt;, to hold the JSON responses, since it is easier to work with POJOs than with raw String responses. For now, while sending tweets to Kafka, we’ll call &lt;strong&gt;&lt;em&gt;toString()&lt;/em&gt;&lt;/strong&gt; on the Tweet object so we don’t have to write a serializer for our custom class.&lt;/p&gt;
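&lt;p&gt;A trimmed sketch of the &lt;em&gt;Tweet&lt;/em&gt; entity (the real classes in the linked repo carry more fields, including the nested &lt;em&gt;User&lt;/em&gt;); the field names match the Twitter JSON keys so Gson can populate them directly:&lt;/p&gt;

```java
public class Tweet {
    private long id;
    private String text;
    private String lang;

    public Tweet(long id, String text, String lang) {
        this.id = id;
        this.text = text;
        this.lang = lang;
    }

    public long getId() { return id; }

    @Override
    public String toString() {
        // This string form is what we send to Kafka, so no custom serializer is needed
        return "Tweet{id=" + id + ", text='" + text + "', lang='" + lang + "'}";
    }
}
```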

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It is better to use a serialization library in such scenarios. We’ll see in a future post, how we can use Avro to serialize/de-serialize java objects while sending to or consuming from Kafka. We’ll discuss benefits of using Avro with Schema registry at that point.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;Now, we have all the basic things needed for implementing producer. Let’s start creating TwitterKafkaProducer.&lt;/p&gt;

&lt;p&gt;We will initialize our Twitter client in the constructor of our producer class. We have to pass the consumer key, consumer secret, access token, and token secret for authentication, and then a list of terms which we want to track. Currently, I’m tracking only &lt;em&gt;#bigdata&lt;/em&gt;.&lt;/p&gt;
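&lt;p&gt;Using hbc’s builder, the constructor body might look roughly like this; the variable names (&lt;em&gt;consumerKey&lt;/em&gt;, &lt;em&gt;queue&lt;/em&gt;, etc.) are illustrative, and the credentials are the ones saved earlier from the developer console:&lt;/p&gt;

```java
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;

import java.util.Collections;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Track only #bigdata for now; add more terms to the list to widen the stream
BlockingQueue&lt;String&gt; queue = new LinkedBlockingQueue&lt;&gt;(10000);
StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
endpoint.trackTerms(Collections.singletonList("#bigdata"));

// Credentials saved earlier from the developer console
Authentication auth = new OAuth1(consumerKey, consumerSecret, accessToken, accessTokenSecret);

Client client = new ClientBuilder()
        .hosts(Constants.STREAM_HOST)
        .endpoint(endpoint)
        .authentication(auth)
        .processor(new StringDelimitedProcessor(queue))  // raw JSON strings land in the queue
        .build();
```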



&lt;p&gt;This completes the configuration of the Twitter client. Now we have to configure the Kafka producer. Below is a fairly simple producer configuration.&lt;/p&gt;
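&lt;p&gt;As a sketch (broker addresses are placeholders; the knob values match the choices explained below):&lt;/p&gt;

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "X.X.X.X:9092,Y.Y.Y.Y:9092");
props.put(ProducerConfig.ACKS_CONFIG, "1");      // leader acknowledgement is enough here
props.put(ProducerConfig.RETRIES_CONFIG, 3);     // retry transient failures
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

// Long keys (tweet IDs) and String values (tweet.toString())
Producer&lt;Long, String&gt; producer = new KafkaProducer&lt;&gt;(props);
```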



&lt;p&gt;Let’s go over the main knobs we turned here. The rest you can easily find in the Kafka documentation; they are pretty much self-explanatory.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;BOOTSTRAP_SERVERS_CONFIG&lt;/em&gt;&lt;/strong&gt;: The list of brokers that act as the initial contact points to the cluster. It is advisable to pass more than one broker so that if one goes down, the producer still has other options for connecting to the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;ACKS_CONFIG&lt;/em&gt;&lt;/strong&gt;: ‘0’, ‘1’ or ‘all’. ‘0’ means the producer doesn’t wait for any acknowledgement. ‘1’ means the producer waits for the leader to acknowledge that it has written the message to its log. ‘all’ means the producer waits until all the in-sync replicas have persisted the message. We have used ‘1’ because our data is not that sensitive and does not require strict delivery guarantees; a minor loss of data is acceptable for this use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;RETRIES_CONFIG&lt;/em&gt;&lt;/strong&gt;: The number of times the producer retries when a message fails to be acknowledged (in case &lt;em&gt;acks&lt;/em&gt; is set to ‘1’ or ‘all’). Note that setting this to more than 0 may lead to retried messages being delivered out of sequence. You would need to turn a few more knobs to guarantee ordering, which is out of scope for this article; interested folks can ask in the comments section.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Streaming Tweets to Kafka Cluster
&lt;/h2&gt;

&lt;p&gt;Now, having configured both the Twitter client and the producer, we only need to open a connection to Twitter using the client and wait for someone to tweet with #bigdata. Once we get a tweet, we send it to Kafka using the producer.&lt;/p&gt;
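&lt;p&gt;The run loop can be sketched as follows (exception handling is trimmed; &lt;em&gt;queue.take()&lt;/em&gt; throws &lt;em&gt;InterruptedException&lt;/em&gt;, and the &lt;em&gt;client&lt;/em&gt;, &lt;em&gt;queue&lt;/em&gt;, and &lt;em&gt;producer&lt;/em&gt; are the objects configured above):&lt;/p&gt;

```java
import com.google.gson.Gson;
import org.apache.kafka.clients.producer.ProducerRecord;

client.connect();  // open the streaming connection to Twitter

Gson gson = new Gson();
while (!client.isDone()) {
    String json = queue.take();                      // blocks until a new tweet arrives
    Tweet tweet = gson.fromJson(json, Tweet.class);  // map the raw JSON onto our POJO
    ProducerRecord&lt;Long, String&gt; record =
            new ProducerRecord&lt;&gt;("bigdata-tweets", tweet.getId(), tweet.toString());
    producer.send(record, (metadata, exception) -&gt; {
        if (exception != null) exception.printStackTrace();  // minor loss is acceptable here
    });
}
producer.close();  // flush buffered records and release the background I/O thread
```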



&lt;p&gt;The client is responsible for fetching the latest tweets on #bigdata and pushing them to a BlockingQueue. In the infinite loop, we take one tweet at a time from the queue and push it to Kafka, using the tweet ID as the &lt;strong&gt;&lt;em&gt;key&lt;/em&gt;&lt;/strong&gt; and the whole tweet as the &lt;strong&gt;&lt;em&gt;value&lt;/em&gt;&lt;/strong&gt;. Since we have used a BlockingQueue, &lt;em&gt;queue.take()&lt;/em&gt; will block until the Twitter client fetches a new tweet.&lt;/p&gt;

&lt;p&gt;Full code available at: &lt;a href="https://github.com/dbsheta/kafka-twitter-producer" rel="noopener noreferrer"&gt;https://github.com/dbsheta/kafka-twitter-producer&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Lights. Camera. Action.
&lt;/h2&gt;

&lt;p&gt;Let’s see our code in action! First, I will create a new topic, &lt;em&gt;bigdata-tweets&lt;/em&gt;, with a replication factor of 2 and 3 partitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; X.X.X.X:2181 &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 2 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 3 &lt;span class="nt"&gt;--topic&lt;/span&gt; bigdata-tweets


&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/kafka-topics.sh &lt;span class="nt"&gt;--describe&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; X.X.X.X:2181 &lt;span class="nt"&gt;--topic&lt;/span&gt; bigdata-tweets

    Topic:bigdata-tweets    PartitionCount:3    ReplicationFactor:2    Configs:
    Topic: bigdata-tweets    Partition: 0    Leader: 0    Replicas: 0,1    Isr: 0,1
    Topic: bigdata-tweets    Partition: 1    Leader: 1    Replicas: 1,2    Isr: 1,2
    Topic: bigdata-tweets    Partition: 2    Leader: 2    Replicas: 2,0    Isr: 2,0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, just to verify that the tweets really were persisted by Kafka, we’ll start the simple console consumer provided with the Kafka distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; bigdata-1:9092 &lt;span class="nt"&gt;--topic&lt;/span&gt; bigdata-tweets &lt;span class="nt"&gt;--from-beginning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the TwitterKafkaProducer app. It should start sending data to Kafka.&lt;/p&gt;

&lt;p&gt;You should see something like this on your console consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Tweet&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1059434252306210817, &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'I want to assist to meet you and see your latest tools'&lt;/span&gt;, &lt;span class="nv"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'en'&lt;/span&gt;, &lt;span class="nv"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;User&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;198639877, &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Antonio Molina'&lt;/span&gt;, &lt;span class="nv"&gt;screenName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'amj_69'&lt;/span&gt;, &lt;span class="nv"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Moralzarzal-Madrid-Spain'&lt;/span&gt;, &lt;span class="nv"&gt;followersCount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;399&lt;span class="o"&gt;}&lt;/span&gt;, &lt;span class="nv"&gt;retweetCount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;favoriteCount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0&lt;span class="o"&gt;}&lt;/span&gt;

Tweet&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1059434263232348160, &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'RT @InclineZA: #AI &amp;amp;amp; #MachineLearning: Building use cases &amp;amp;amp; creating Real-Life Benefits &amp;amp;gt;&amp;amp;gt;  https://t.co/noWy1NS3OU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see tweets like these, congrats my friend, you have created a data pipeline! You fetched data from a source (Twitter), pushed it to a message queue, and ultimately consumed it (printed it on the console).&lt;/p&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We used the Twitter Streaming API along with the Kafka Clients API to implement a producer app which fetches data from Twitter and sends it to Kafka in real-time. In the next part, we’ll see how we can consume this data and collect some stats on the stream in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Until then&lt;/strong&gt;…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOwkfEOCQTC0YGxZLWx2juQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOwkfEOCQTC0YGxZLWx2juQ.gif" alt="Peace out" width="465" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>spark</category>
      <category>firebase</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Processing Streaming Twitter Data using Kafka and Spark — Part 1: Setting Up Kafka Cluster</title>
      <dc:creator>Dhoomil B Sheta</dc:creator>
      <pubDate>Mon, 05 Nov 2018 13:52:42 +0000</pubDate>
      <link>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark--part-1-setting-up-kafka-cluster-32lh</link>
      <guid>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark--part-1-setting-up-kafka-cluster-32lh</guid>
      <description>&lt;p&gt;As per the plan I laid out in my previous post, I’ll start by setting up a Kafka Cluster. I’ll primarily be working on Google Cloud instances throughout this series, however, I’ll also lay down steps to setup the same in your local machines as well.&lt;/p&gt;

&lt;p&gt;Also, in this series the main focus will be on the how-to rather than the how-does-it-work. We’ll spend most of our time learning how to implement various use cases rather than how Kafka/Spark/Zookeeper work under the hood. However, we’ll go into theory mode if there aren’t any good sources already available on the web.&lt;/p&gt;



&lt;h2&gt;
  
  
  Apache Zookeeper
&lt;/h2&gt;

&lt;p&gt;Kafka uses &lt;a href="https://zookeeper.apache.org/" rel="noopener noreferrer"&gt;Zookeeper&lt;/a&gt; to store metadata about the Kafka cluster, as well as consumer client details. There are many articles online which explain why Kafka needs Zookeeper. &lt;a href="https://data-flair.training/blogs/zookeeper-in-kafka/" rel="noopener noreferrer"&gt;This&lt;/a&gt; article by Data-Flair does it very well.&lt;/p&gt;

&lt;p&gt;While you can get a quick-and-dirty single-node Zookeeper server up and running directly using scripts contained in the Kafka distribution, it is also trivial to install a full version of Zookeeper from its own distribution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwewletm4to1gz1dnj90e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwewletm4to1gz1dnj90e.png" alt="Source: Kafka- The Definitive Guide" width="530" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I assume you have JDK 1.8 installed. If not, Linux/macOS users can install OpenJDK using their package manager, and Windows users can download it from Oracle’s website.&lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Zookeeper Standalone Mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Those who don’t have any cloud resources available, like Google Cloud, Azure, or AWS, can run a single-node standalone Zookeeper instance. Spinning up such an instance is fairly simple.&lt;/p&gt;

&lt;p&gt;Download the latest version of &lt;a href="https://www.apache.org/dyn/closer.cgi/zookeeper/" rel="noopener noreferrer"&gt;Zookeeper&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; zookeeper-X.X.X.tar.gz &lt;span class="nt"&gt;-C&lt;/span&gt; /opt
    &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /opt/zookeeper-X.X.X /opt/zookeeper
    &lt;span class="nb"&gt;cd&lt;/span&gt; /opt/zookeeper
    &lt;span class="nb"&gt;cat &lt;/span&gt;conf/zoo_sample.cfg &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; zookeeper.properties
    bin/zkServer.sh start conf/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h3&gt;
  
  
  &lt;strong&gt;Zookeeper Ensemble&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Zookeeper cluster is called an &lt;em&gt;ensemble&lt;/em&gt;. Due to the algorithm used, it is recommended that ensembles contain an odd number of servers (3, 5,…) as a majority of ensemble members (a quorum) must be working in order for Zookeeper to respond to requests. It is also &lt;em&gt;not&lt;/em&gt; &lt;em&gt;recommended&lt;/em&gt; to run more than seven nodes, as performance can start to degrade due to the nature of the consensus protocol.&lt;/p&gt;

&lt;p&gt;To configure a Zookeeper ensemble, all servers must have a common configuration, and each server needs a &lt;strong&gt;&lt;em&gt;myid&lt;/em&gt;&lt;/strong&gt; file in the data directory that specifies the ID number of the server.&lt;/p&gt;

&lt;p&gt;Run all of the commands from the standalone section, except the final start command, on every server. In addition, the following steps must be performed on all servers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the list of your servers (hostname/IP) to the bottom of the Zookeeper configuration file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;    &lt;span class="py"&gt;server.1&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;X.X.X.X:2888:3888&lt;/span&gt;
    &lt;span class="py"&gt;server.2&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Y.Y.Y.Y:2888:3888&lt;/span&gt;
    &lt;span class="py"&gt;server.3&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Z.Z.Z.Z:2888:3888&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Add a &lt;em&gt;myid&lt;/em&gt; file at the dataDir location, which in my case is &lt;em&gt;/tmp/zookeeper&lt;/em&gt;. The ID must be different on each server (1 on the first, 2 on the second, and so on):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;touch&lt;/span&gt; /tmp/zookeeper/myid
    &lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/zookeeper/myid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;After making the above changes, start Zookeeper on all servers, one by one.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/zkServer.sh start conf/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Setting up Kafka
&lt;/h2&gt;

&lt;p&gt;Download the latest version of &lt;a href="http://mirrors.wuchna.com/apachemirror/kafka/2.0.0/kafka_2.11-2.0.0.tgz" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; on all your servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; kafka_2.11-0.11.0.0.tgz –C /opt
    &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /opt/kafka_2.XX /opt/kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update the Kafka &lt;em&gt;server.properties&lt;/em&gt; file on all instances to contain the line below. This file is located at &lt;em&gt;/opt/kafka/config/server.properties&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;    &lt;span class="py"&gt;zookeeper.connect&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;X.X.X.X:2181,Y.Y.Y.Y:2181,Z.Z.Z.Z:2181&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h3&gt;
  
  
  &lt;strong&gt;Single Node Multi Broker (SNMB):&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For folks who don’t have cloud instances handy, you can set up a cluster locally. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Copy the &lt;em&gt;server.properties&lt;/em&gt; file three times with different names, like server1.properties, server2.properties, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every Kafka broker must have an integer identifier. Open each configuration file and set &lt;em&gt;broker.id=1&lt;/em&gt; for the 1st broker, &lt;em&gt;broker.id=2&lt;/em&gt; for the 2nd, and so on. A good guideline is to set this value to something intrinsic to the host, so that it is easier to map broker IDs to hosts when performing maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As you will be running multiple instances on the same machine, change the port configuration so that each process uses a different port number: set &lt;em&gt;port=9092&lt;/em&gt; on the 1st broker, 9093 on the 2nd, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, in the future, whenever I give a command for the MNMB setup, you should automatically map it to your SNMB configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
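&lt;p&gt;Putting the steps above together, the first broker’s file might contain something like the following sketch (the values are illustrative; note that &lt;em&gt;log.dirs&lt;/em&gt; must also be unique per broker so the local processes don’t fight over the same directory):&lt;/p&gt;

```properties
# server1.properties -- repeat with broker.id=2 / port 9093, broker.id=3 / port 9094
broker.id=1
port=9092
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=localhost:2181
```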



&lt;h3&gt;
  
  
  &lt;strong&gt;Multi Node Multi Broker (MNMB):&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open configuration file and change &lt;em&gt;broker.id=1&lt;/em&gt; for 1st server, &lt;em&gt;broker.id=2&lt;/em&gt; for 2nd and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the canonical hostnames of your servers to your hosts file if they are not public. Otherwise, you’ll need to override &lt;em&gt;advertised.listeners=PLAINTEXT://your.host.name:9092&lt;/em&gt; on each server instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;



&lt;h2&gt;
  
  
  Testing Full Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;We have already started a Zookeeper ensemble; now let’s start Kafka on all our servers as well.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;cd&lt;/span&gt; /opt/kafka
    bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Let’s create a sample topic with 3 partitions and 2 replicas:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; X.X.X.X:2181 &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 2 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 3--topic sample_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Kafka has a command line consumer that will dump out messages to standard output.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/kafka-console-consumer.sh — zookeeper X.X.X.X:2181 — topic sample_test — from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Run the producer and then type a few messages into the console to send to the server.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/kafka-console-producer.sh &lt;span class="nt"&gt;--broker-list&lt;/span&gt;  X.X.X.X:9092 &lt;span class="nt"&gt;--topic&lt;/span&gt; sample_test
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Hello, World!
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Hello from the other side.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see the output in the console consumer window, congratulations! You have successfully set up a Zookeeper and Kafka cluster, locally or on the cloud. If for some reason you are getting errors or are not able to get the desired output, please leave a comment.&lt;/p&gt;

&lt;p&gt;We will use the same setup in the upcoming few articles. In the next article, we will see how we can implement a Kafka Client which will read latest tweets from Twitter and push them into Kafka.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Until then,&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthumbs.gfycat.com%2FGenuineMenacingImperialeagle-size_restricted.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthumbs.gfycat.com%2FGenuineMenacingImperialeagle-size_restricted.gif" alt="Goodbye" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>zookeeper</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Processing Streaming Twitter Data using Kafka and Spark — The Plan</title>
      <dc:creator>Dhoomil B Sheta</dc:creator>
      <pubDate>Mon, 05 Nov 2018 13:49:45 +0000</pubDate>
      <link>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-sparkthe-plan-129a</link>
      <guid>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-sparkthe-plan-129a</guid>
      <description>&lt;h2&gt;
  
  
  What is Apache Kafka?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Apache Kafka is a publish/subscribe messaging system. It is often described as a “distributed commit log” or more recently as a “distributed streaming platform.”&lt;br&gt;
Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged streaming platform&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2706%2F0%2Af_7HXjtx0Nva3RQR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2706%2F0%2Af_7HXjtx0Nva3RQR.png" alt="Source: [https://kafka.apache.org/images/kafka_diagram.png](https://kafka.apache.org/images/kafka_diagram.png)" width="800" height="771"&gt;&lt;/a&gt;&lt;em&gt;Source: &lt;a href="https://kafka.apache.org/images/kafka_diagram.png" rel="noopener noreferrer"&gt;https://kafka.apache.org/images/kafka_diagram.png&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  The Inspiration
&lt;/h2&gt;

&lt;p&gt;I recently read the book &lt;a href="https://www.confluent.io/resources/kafka-the-definitive-guide/" rel="noopener noreferrer"&gt;Kafka: The Definitive Guide&lt;/a&gt; by the creators of Kafka. It is truly a wonderful book for anyone who wants to start developing applications with Kafka as well as anyone who wants to know the internals of such a unique platform which is used by most of the Fortune 500 companies.&lt;/p&gt;



&lt;h2&gt;
  
  
  The Plan
&lt;/h2&gt;

&lt;p&gt;In this series, I’ll be exploring various aspects of Apache Kafka by implementing a cool data pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We’ll start by setting up a Kafka cluster in the cloud or locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After that, we’ll write a producer client which will continuously fetch the latest tweets using the Twitter API and push them to Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, we will implement an app using the Kafka Streams API, which will consume the tweets from Kafka in real-time and do basic processing on them, like finding the number of tweets per user and the most-used words (i.e. word count).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ll then venture into more cool stuff, like writing our own Kafka connector which uses Twitter as a data source, and learning to use Apache NiFi to achieve the same with less effort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ll use Spark Streaming to do sentiment analysis on real-time Twitter data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, if everything goes well, we’ll tweak our architecture and implement a notification service using Firebase and Kafka which will send a push notification to a user if his/her tweet has negative sentiment!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;



&lt;h2&gt;
  
  
  Let’s begin!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2AU_EwY9N-91IXddnk" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2AU_EwY9N-91IXddnk" alt="[By Amine Rock Hoovr](https://unsplash.com/@hoovr01?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>spark</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
