<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhoomil B Sheta</title>
    <description>The latest articles on DEV Community by Dhoomil B Sheta (@dbsheta).</description>
    <link>https://dev.to/dbsheta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F112468%2Fc84895aa-d6c8-4953-a199-7534f1c16215.jpeg</url>
      <title>DEV Community: Dhoomil B Sheta</title>
      <link>https://dev.to/dbsheta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dbsheta"/>
    <language>en</language>
    <item>
      <title>Processing Streaming Twitter Data using Kafka and Spark - Part 2: Creating Kafka Twitter producer</title>
      <dc:creator>Dhoomil B Sheta</dc:creator>
      <pubDate>Mon, 05 Nov 2018 19:08:44 +0000</pubDate>
      <link>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark-part-2-creating-kafka-twitter-streamproducer-12ko</link>
      <guid>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark-part-2-creating-kafka-twitter-streamproducer-12ko</guid>
      <description>&lt;p&gt;Processing Streaming Twitter Data using Kafka and Spark series.&lt;br&gt;
Part 0: &lt;a href="https://medium.com/dhoomil-sheta/processing-streaming-twitter-data-using-kafka-and-spark-the-plan-58b893e42403" rel="noopener noreferrer"&gt;The Plan&lt;/a&gt;&lt;br&gt;
Part 1: &lt;a href="https://medium.com/dhoomil-sheta/processing-streaming-twitter-data-using-kafka-and-spark-part-1-setting-up-kafka-cluster-6e491809fa6d" rel="noopener noreferrer"&gt;Setting Up Kafka&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Before we start implementing any component, let’s lay out an architecture, or block diagram, that we will build up throughout this series one piece at a time. Since our intention is to learn several technologies through a single use case, this fits just right.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzTHQFs9KDNV24Phd98u60w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AzTHQFs9KDNV24Phd98u60w.png" width="526" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram covers all points I laid out in &lt;a href="https://medium.com/dhoomil-sheta/processing-streaming-twitter-data-using-kafka-and-spark-the-plan-58b893e42403" rel="noopener noreferrer"&gt;The Plan&lt;/a&gt;. We already finished setting up a Kafka Cluster in Part 1. &lt;/p&gt;

&lt;p&gt;In this article, we’ll focus on building a producer which will fetch the latest tweets on &lt;em&gt;#bigdata&lt;/em&gt; and push them to our cluster.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is a Producer?
&lt;/h2&gt;

&lt;p&gt;Different people want to use Kafka for different purposes: some as a queue, some as a message bus, and some as a data storage platform. Whatever the case, you will always use Kafka by writing a producer that writes data to Kafka, a consumer that reads data from Kafka, or an application that serves both roles.&lt;/p&gt;

&lt;p&gt;Kafka has built-in client APIs that developers can use when building applications that interact with Kafka. In this article, we’ll use the Producer API to create a client which fetches tweets from Twitter and sends them to Kafka.&lt;/p&gt;

&lt;p&gt;A Note from &lt;strong&gt;Kafka: The Definitive Guide:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In addition to the built-in clients, Kafka has a binary wire protocol which you can implement in programming language of your choice. This means that it is possible for applications to read messages from Kafka or write messages to Kafka simply by sending the correct byte sequences to Kafka’s network port. Such clients are not part of Apache Kafka project, but a list of non-Java clients is maintained in the &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/Clients" rel="noopener noreferrer"&gt;project wiki&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Following are the features of the Java Producer API that ships with Kafka:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The producer has a pool of buffer space that holds records that haven’t yet been transmitted to the server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also has a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Failure to close the producer after use will leak these resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  How to fetch latest Tweets?
&lt;/h2&gt;

&lt;p&gt;Twitter provides an open-source client called Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s &lt;a href="https://dev.twitter.com/docs/streaming-apis" rel="noopener noreferrer"&gt;Streaming API&lt;/a&gt;. It enables clients to receive Tweets in near real-time. Every Twitter account has access to the Streaming API, and any developer can build applications with it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Generating Twitter API Keys
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If you don’t have developer access yet, go to &lt;a href="https://dev.twitter.com/apps/new" rel="noopener noreferrer"&gt;https://dev.twitter.com/apps/new&lt;/a&gt; and apply for it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;a href="https://developer.twitter.com/en/apps" rel="noopener noreferrer"&gt;https://developer.twitter.com/en/apps&lt;/a&gt; and create a new application. (Leave the callback URL blank.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to the &lt;strong&gt;&lt;em&gt;Keys and tokens&lt;/em&gt;&lt;/strong&gt; tab and copy the consumer key and secret pair to a file for later use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on “Create” to generate the access token and secret. Copy both of them to a file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now you have all things needed for developing the producer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s go ahead and start implementing a Kafka producer client which will use this library. For all those who want to see the completed code, here is the link: &lt;a href="https://github.com/dbsheta/kafka-twitter-producer" rel="noopener noreferrer"&gt;https://github.com/dbsheta/kafka-twitter-producer&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Create Maven Project
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open the IDE of your choice and create a new Maven project. I’ll name mine &lt;em&gt;kafka-twitter-producer&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the Kafka, Twitter (hbc), and Gson dependencies to &lt;em&gt;pom.xml&lt;/em&gt; and rebuild the project.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
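&lt;p&gt;The dependency block originally appeared here as an embedded gist. A minimal sketch of the &lt;em&gt;pom.xml&lt;/em&gt; entries might look like the following; the version numbers are only examples, so use whatever is current for you:&lt;/p&gt;

```xml
&lt;dependencies&gt;
    &lt;!-- Kafka producer/consumer client APIs --&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;org.apache.kafka&lt;/groupId&gt;
        &lt;artifactId&gt;kafka-clients&lt;/artifactId&gt;
        &lt;version&gt;2.0.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;!-- Twitter Hosebird client for the Streaming API --&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.twitter&lt;/groupId&gt;
        &lt;artifactId&gt;hbc-core&lt;/artifactId&gt;
        &lt;version&gt;2.2.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;!-- Gson for mapping tweet JSON onto POJOs --&gt;
    &lt;dependency&gt;
        &lt;groupId&gt;com.google.code.gson&lt;/groupId&gt;
        &lt;artifactId&gt;gson&lt;/artifactId&gt;
        &lt;version&gt;2.8.5&lt;/version&gt;
    &lt;/dependency&gt;
&lt;/dependencies&gt;
```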




&lt;h2&gt;
  
  
  Implement Producer
&lt;/h2&gt;

&lt;p&gt;First of all, let’s define constants to configure Kafka Producer.&lt;/p&gt;
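&lt;p&gt;The constants were originally embedded here as a gist; a minimal sketch, with illustrative names and placeholder broker addresses, could look like this:&lt;/p&gt;

```java
// Illustrative names; the broker list should point at the cluster from Part 1
public final class ProducerConstants {
    public static final String BOOTSTRAP_SERVERS = "X.X.X.X:9092,Y.Y.Y.Y:9092";
    public static final String TOPIC = "bigdata-tweets";
    public static final String ACKS = "1";   // wait for the leader only
    public static final int RETRIES = 3;     // retry transient send failures

    private ProducerConstants() { }          // no instances; constants only
}
```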



&lt;p&gt;Now, we’ll copy the secrets and tokens from Twitter Developer console.&lt;/p&gt;
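&lt;p&gt;Again as a sketch (the names are placeholders; never commit your real credentials to source control):&lt;/p&gt;

```java
// Placeholders; paste the values you saved from the Keys and tokens tab
public final class TwitterConstants {
    public static final String CONSUMER_KEY = "YOUR_CONSUMER_KEY";
    public static final String CONSUMER_SECRET = "YOUR_CONSUMER_SECRET";
    public static final String ACCESS_TOKEN = "YOUR_ACCESS_TOKEN";
    public static final String ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET";

    private TwitterConstants() { }
}
```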



&lt;p&gt;The tweet returned by the Twitter API is a very large JSON string and contains all the details we require for our project. You can find a full sample response &lt;a href="https://github.com/dbsheta/kafka-twitter-producer/blob/master/src/main/resources/tweet.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We create two entities, &lt;em&gt;Tweet&lt;/em&gt; and &lt;em&gt;User&lt;/em&gt;, to hold the JSON responses, since it is easier to work with POJOs than with raw String responses. For now, while sending tweets to Kafka, we’ll call &lt;strong&gt;&lt;em&gt;toString()&lt;/em&gt;&lt;/strong&gt; on the Tweet object so we don’t have to write a serializer for our custom class.&lt;/p&gt;
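&lt;p&gt;A trimmed sketch of the &lt;em&gt;Tweet&lt;/em&gt; entity (the real classes in the linked repo carry more fields, including the nested &lt;em&gt;User&lt;/em&gt;); the field names match the Twitter JSON keys so Gson can populate them directly:&lt;/p&gt;

```java
public class Tweet {
    private long id;
    private String text;
    private String lang;

    public Tweet(long id, String text, String lang) {
        this.id = id;
        this.text = text;
        this.lang = lang;
    }

    public long getId() { return id; }

    @Override
    public String toString() {
        // This string form is what we send to Kafka, so no custom serializer is needed
        return "Tweet{id=" + id + ", text='" + text + "', lang='" + lang + "'}";
    }
}
```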

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It is better to use a serialization library in such scenarios. We’ll see in a future post, how we can use Avro to serialize/de-serialize java objects while sending to or consuming from Kafka. We’ll discuss benefits of using Avro with Schema registry at that point.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;Now, we have all the basic things needed for implementing producer. Let’s start creating TwitterKafkaProducer.&lt;/p&gt;

&lt;p&gt;We will initialize our Twitter client in the constructor of our producer class. We have to pass the consumer key, consumer secret, access token, and token secret for authentication, and then a list of terms which we want to track. Currently, I’m tracking only &lt;em&gt;#bigdata&lt;/em&gt;.&lt;/p&gt;
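&lt;p&gt;Using hbc’s builder, the constructor body might look roughly like this; the variable names (&lt;em&gt;consumerKey&lt;/em&gt;, &lt;em&gt;queue&lt;/em&gt;, etc.) are illustrative, and the credentials are the ones saved earlier from the developer console:&lt;/p&gt;

```java
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;

import java.util.Collections;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Track only #bigdata for now; add more terms to the list to widen the stream
BlockingQueue&lt;String&gt; queue = new LinkedBlockingQueue&lt;&gt;(10000);
StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
endpoint.trackTerms(Collections.singletonList("#bigdata"));

// Credentials saved earlier from the developer console
Authentication auth = new OAuth1(consumerKey, consumerSecret, accessToken, accessTokenSecret);

Client client = new ClientBuilder()
        .hosts(Constants.STREAM_HOST)
        .endpoint(endpoint)
        .authentication(auth)
        .processor(new StringDelimitedProcessor(queue))  // raw JSON strings land in the queue
        .build();
```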



&lt;p&gt;This completes the configuration of the Twitter client. Now we have to configure the Kafka producer. Below is a fairly simple producer configuration.&lt;/p&gt;
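&lt;p&gt;As a sketch (broker addresses are placeholders; the knob values match the choices explained below):&lt;/p&gt;

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "X.X.X.X:9092,Y.Y.Y.Y:9092");
props.put(ProducerConfig.ACKS_CONFIG, "1");      // leader acknowledgement is enough here
props.put(ProducerConfig.RETRIES_CONFIG, 3);     // retry transient failures
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

// Long keys (tweet IDs) and String values (tweet.toString())
Producer&lt;Long, String&gt; producer = new KafkaProducer&lt;&gt;(props);
```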



&lt;p&gt;Let’s go over the main knobs we turned here. The rest you can easily find in the Kafka documentation; they are pretty much self-explanatory.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;BOOTSTRAP_SERVERS_CONFIG&lt;/em&gt;&lt;/strong&gt;: The list of brokers that act as the initial contact points to the cluster. It is advisable to pass more than one broker so that if one goes down, the producer still has other options for connecting to the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;ACKS_CONFIG&lt;/em&gt;&lt;/strong&gt;: ‘0’, ‘1’ or ‘all’. ‘0’ means the producer doesn’t wait for any acknowledgement. ‘1’ means the producer waits for the leader to acknowledge that it has written the message to its log. ‘all’ means the producer waits until all the in-sync replicas have persisted the message. We have used ‘1’ because our data is not that sensitive and does not require strict delivery guarantees; a minor loss of data is acceptable for this use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;RETRIES_CONFIG&lt;/em&gt;&lt;/strong&gt;: The number of times the producer retries when a message fails to be acknowledged (in case &lt;em&gt;acks&lt;/em&gt; is set to ‘1’ or ‘all’). Note that setting this to more than 0 may lead to retried messages being delivered out of sequence. You would need to turn a few more knobs to guarantee ordering, which is out of scope for this article; interested folks can ask in the comments section.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Streaming Tweets to Kafka Cluster
&lt;/h2&gt;

&lt;p&gt;Now, having configured both the Twitter client and the producer, we only need to open a connection to Twitter using the client and wait for someone to tweet with #bigdata. Once we get a tweet, we send it to Kafka using the producer.&lt;/p&gt;
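&lt;p&gt;The run loop can be sketched as follows (exception handling is trimmed; &lt;em&gt;queue.take()&lt;/em&gt; throws &lt;em&gt;InterruptedException&lt;/em&gt;, and the &lt;em&gt;client&lt;/em&gt;, &lt;em&gt;queue&lt;/em&gt;, and &lt;em&gt;producer&lt;/em&gt; are the objects configured above):&lt;/p&gt;

```java
import com.google.gson.Gson;
import org.apache.kafka.clients.producer.ProducerRecord;

client.connect();  // open the streaming connection to Twitter

Gson gson = new Gson();
while (!client.isDone()) {
    String json = queue.take();                      // blocks until a new tweet arrives
    Tweet tweet = gson.fromJson(json, Tweet.class);  // map the raw JSON onto our POJO
    ProducerRecord&lt;Long, String&gt; record =
            new ProducerRecord&lt;&gt;("bigdata-tweets", tweet.getId(), tweet.toString());
    producer.send(record, (metadata, exception) -&gt; {
        if (exception != null) exception.printStackTrace();  // minor loss is acceptable here
    });
}
producer.close();  // flush buffered records and release the background I/O thread
```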



&lt;p&gt;The client is responsible for fetching the latest tweets on #bigdata and pushing them to a BlockingQueue. In the infinite loop, we take one tweet at a time from the queue and push it to Kafka, using the tweet ID as the &lt;strong&gt;&lt;em&gt;key&lt;/em&gt;&lt;/strong&gt; and the whole tweet as the &lt;strong&gt;&lt;em&gt;value&lt;/em&gt;&lt;/strong&gt;. Since we have used a BlockingQueue, &lt;em&gt;queue.take()&lt;/em&gt; will block until the Twitter client fetches a new tweet.&lt;/p&gt;

&lt;p&gt;Full code available at: &lt;a href="https://github.com/dbsheta/kafka-twitter-producer" rel="noopener noreferrer"&gt;https://github.com/dbsheta/kafka-twitter-producer&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Lights. Camera. Action.
&lt;/h2&gt;

&lt;p&gt;Let’s see our code in action! First, I will create a new topic, &lt;em&gt;bigdata-tweets&lt;/em&gt;, with a replication factor of 2 and 3 partitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; X.X.X.X:2181 &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 2 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 3 &lt;span class="nt"&gt;--topic&lt;/span&gt; bigdata-tweets


&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/kafka-topics.sh &lt;span class="nt"&gt;--describe&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; X.X.X.X:2181 &lt;span class="nt"&gt;--topic&lt;/span&gt; bigdata-tweets

    Topic:bigdata-tweets    PartitionCount:3    ReplicationFactor:2    Configs:
    Topic: bigdata-tweets    Partition: 0    Leader: 0    Replicas: 0,1    Isr: 0,1
    Topic: bigdata-tweets    Partition: 1    Leader: 1    Replicas: 1,2    Isr: 1,2
    Topic: bigdata-tweets    Partition: 2    Leader: 2    Replicas: 2,0    Isr: 2,0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, just to verify that the tweets really were persisted by Kafka, we’ll start the simple console consumer provided with the Kafka distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; bigdata-1:9092 &lt;span class="nt"&gt;--topic&lt;/span&gt; bigdata-tweets &lt;span class="nt"&gt;--from-beginning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the TwitterKafkaProducer app. It should start sending data to Kafka.&lt;/p&gt;

&lt;p&gt;You should see something like this on your console consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Tweet&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1059434252306210817, &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'I want to assist to meet you and see your latest tools'&lt;/span&gt;, &lt;span class="nv"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'en'&lt;/span&gt;, &lt;span class="nv"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;User&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;198639877, &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Antonio Molina'&lt;/span&gt;, &lt;span class="nv"&gt;screenName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'amj_69'&lt;/span&gt;, &lt;span class="nv"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Moralzarzal-Madrid-Spain'&lt;/span&gt;, &lt;span class="nv"&gt;followersCount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;399&lt;span class="o"&gt;}&lt;/span&gt;, &lt;span class="nv"&gt;retweetCount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;favoriteCount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0&lt;span class="o"&gt;}&lt;/span&gt;

Tweet&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1059434263232348160, &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'RT @InclineZA: #AI &amp;amp;amp; #MachineLearning: Building use cases &amp;amp;amp; creating Real-Life Benefits &amp;amp;gt;&amp;amp;gt;  https://t.co/noWy1NS3OU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see tweets like these, congrats my friend, you have created a data pipeline! You fetched data from a source (Twitter), pushed it to a message queue, and ultimately consumed it (printed it on the console).&lt;/p&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We used the Twitter Streaming API along with the Kafka Clients API to implement a producer app which fetches data from Twitter and sends it to Kafka in real-time. In the next part, we’ll see how we can consume this data and collect some stats on the stream in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Until then&lt;/strong&gt;…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOwkfEOCQTC0YGxZLWx2juQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOwkfEOCQTC0YGxZLWx2juQ.gif" alt="Peace out" width="465" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>spark</category>
      <category>firebase</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Processing Streaming Twitter Data using Kafka and Spark — Part 1: Setting Up Kafka Cluster</title>
      <dc:creator>Dhoomil B Sheta</dc:creator>
      <pubDate>Mon, 05 Nov 2018 13:52:42 +0000</pubDate>
      <link>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark--part-1-setting-up-kafka-cluster-32lh</link>
      <guid>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-spark--part-1-setting-up-kafka-cluster-32lh</guid>
      <description>&lt;p&gt;As per the plan I laid out in my previous post, I’ll start by setting up a Kafka Cluster. I’ll primarily be working on Google Cloud instances throughout this series, however, I’ll also lay down steps to setup the same in your local machines as well.&lt;/p&gt;

&lt;p&gt;Also, in this series the main focus will be on the how-to rather than the how-does-it-work. We’ll spend most of our time learning how to implement various use cases rather than how Kafka/Spark/Zookeeper work under the hood. However, we’ll go into theory mode if there aren’t any good sources already available on the web.&lt;/p&gt;



&lt;h2&gt;
  
  
  Apache Zookeeper
&lt;/h2&gt;

&lt;p&gt;Kafka uses &lt;a href="https://zookeeper.apache.org/" rel="noopener noreferrer"&gt;Zookeeper&lt;/a&gt; to store metadata about the Kafka cluster, as well as consumer client details. There are many articles online which explain why Kafka needs Zookeeper. &lt;a href="https://data-flair.training/blogs/zookeeper-in-kafka/" rel="noopener noreferrer"&gt;This&lt;/a&gt; article by Data-Flair does it very well.&lt;/p&gt;

&lt;p&gt;While you can get a quick-and-dirty single-node Zookeeper server up and running directly using scripts contained in the Kafka distribution, it is also trivial to install a full version of Zookeeper from its own distribution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwewletm4to1gz1dnj90e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwewletm4to1gz1dnj90e.png" alt="Source: Kafka- The Definitive Guide" width="530" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I assume you have JDK 1.8 installed. If not, Linux/macOS users can install OpenJDK using their package manager, and Windows users can download it from Oracle’s website.&lt;/p&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Zookeeper Standalone Mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Those who don’t have any cloud resources available, like Google Cloud, Azure, or AWS, can run a single-node standalone Zookeeper instance. Spinning up such an instance is fairly simple.&lt;/p&gt;

&lt;p&gt;Download the latest version of &lt;a href="https://www.apache.org/dyn/closer.cgi/zookeeper/" rel="noopener noreferrer"&gt;Zookeeper&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; zookeeper-X.X.X.tar.gz &lt;span class="nt"&gt;-C&lt;/span&gt; /opt
    &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /opt/zookeeper-X.X.X /opt/zookeeper
    &lt;span class="nb"&gt;cd&lt;/span&gt; /opt/zookeeper
    &lt;span class="nb"&gt;cat &lt;/span&gt;conf/zoo_sample.cfg &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; zookeeper.properties
    bin/zkServer.sh start conf/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h3&gt;
  
  
  &lt;strong&gt;Zookeeper Ensemble&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Zookeeper cluster is called an &lt;em&gt;ensemble&lt;/em&gt;. Due to the algorithm used, it is recommended that ensembles contain an odd number of servers (3, 5,…) as a majority of ensemble members (a quorum) must be working in order for Zookeeper to respond to requests. It is also &lt;em&gt;not&lt;/em&gt; &lt;em&gt;recommended&lt;/em&gt; to run more than seven nodes, as performance can start to degrade due to the nature of the consensus protocol.&lt;/p&gt;

&lt;p&gt;To configure a Zookeeper ensemble, all servers must have a common configuration, and each server needs a &lt;strong&gt;&lt;em&gt;myid&lt;/em&gt;&lt;/strong&gt; file in the data directory that specifies the ID number of the server.&lt;/p&gt;

&lt;p&gt;Run all of the commands from the standalone section, except the final start command, on every server. In addition, the following steps must be performed on all servers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the list of your servers (hostname/IP) to the bottom of the Zookeeper configuration file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;    &lt;span class="py"&gt;server.1&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;X.X.X.X:2888:3888&lt;/span&gt;
    &lt;span class="py"&gt;server.2&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Y.Y.Y.Y:2888:3888&lt;/span&gt;
    &lt;span class="py"&gt;server.3&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Z.Z.Z.Z:2888:3888&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Add a &lt;em&gt;myid&lt;/em&gt; file at the dataDir location, which in my case is &lt;em&gt;/tmp/zookeeper&lt;/em&gt;. The ID must be different on each server (1 on the first, 2 on the second, and so on):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;touch&lt;/span&gt; /tmp/zookeeper/myid
    &lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/zookeeper/myid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;After making the above changes, start Zookeeper on all servers, one by one.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/zkServer.sh start conf/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Setting up Kafka
&lt;/h2&gt;

&lt;p&gt;Download the latest version of &lt;a href="http://mirrors.wuchna.com/apachemirror/kafka/2.0.0/kafka_2.11-2.0.0.tgz" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; on all your servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; kafka_2.11-0.11.0.0.tgz –C /opt
    &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /opt/kafka_2.XX /opt/kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update the Kafka &lt;em&gt;server.properties&lt;/em&gt; file on all instances to contain the line below. This file is located at &lt;em&gt;/opt/kafka/config/server.properties&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;    &lt;span class="py"&gt;zookeeper.connect&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;X.X.X.X:2181,Y.Y.Y.Y:2181,Z.Z.Z.Z:2181&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h3&gt;
  
  
  &lt;strong&gt;Single Node Multi Broker (SNMB):&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For folks who don’t have cloud instances handy, you can set up a cluster locally. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Copy the &lt;em&gt;server.properties&lt;/em&gt; file three times with different names, like server1.properties, server2.properties, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every Kafka broker must have an integer identifier. Open each configuration file and set &lt;em&gt;broker.id=1&lt;/em&gt; for the 1st broker, &lt;em&gt;broker.id=2&lt;/em&gt; for the 2nd, and so on. A good guideline is to set this value to something intrinsic to the host, so that it is easier to map broker IDs to hosts when performing maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As you will be running multiple instances on the same machine, change the port configuration so that each process uses a different port number: set &lt;em&gt;port=9092&lt;/em&gt; on the 1st broker, 9093 on the 2nd, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, in the future, whenever I give a command for the MNMB setup, you should automatically map it to your SNMB configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
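&lt;p&gt;Putting the steps above together, the first broker’s file might contain something like the following sketch (the values are illustrative; note that &lt;em&gt;log.dirs&lt;/em&gt; must also be unique per broker so the local processes don’t fight over the same directory):&lt;/p&gt;

```properties
# server1.properties -- repeat with broker.id=2 / port 9093, broker.id=3 / port 9094
broker.id=1
port=9092
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=localhost:2181
```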



&lt;h3&gt;
  
  
  &lt;strong&gt;Multi Node Multi Broker (MNMB):&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open configuration file and change &lt;em&gt;broker.id=1&lt;/em&gt; for 1st server, &lt;em&gt;broker.id=2&lt;/em&gt; for 2nd and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the canonical hostnames of your servers to your hosts file if they are not public. Otherwise, you’ll need to override &lt;em&gt;advertised.listeners=PLAINTEXT://your.host.name:9092&lt;/em&gt; on each server instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;



&lt;h2&gt;
  
  
  Testing Full Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;We have already started a Zookeeper ensemble; now let’s start Kafka on all our servers as well.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nb"&gt;cd&lt;/span&gt; /opt/kafka
    bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Let’s create a sample topic with 3 partitions and 2 replicas:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; X.X.X.X:2181 &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 2 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 3--topic sample_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Kafka has a command line consumer that will dump out messages to standard output.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/kafka-console-consumer.sh — zookeeper X.X.X.X:2181 — topic sample_test — from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Run the producer and then type a few messages into the console to send to the server.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    bin/kafka-console-producer.sh &lt;span class="nt"&gt;--broker-list&lt;/span&gt;  X.X.X.X:9092 &lt;span class="nt"&gt;--topic&lt;/span&gt; sample_test
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Hello, World!
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Hello from the other side.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see the output in the console consumer window, congratulations! You have successfully set up a Zookeeper and Kafka cluster, locally or on the cloud. If for some reason you are getting errors or are not able to get the desired output, please leave a comment.&lt;/p&gt;

&lt;p&gt;We will use the same setup in the upcoming few articles. In the next article, we will see how we can implement a Kafka Client which will read latest tweets from Twitter and push them into Kafka.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Until then,&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthumbs.gfycat.com%2FGenuineMenacingImperialeagle-size_restricted.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthumbs.gfycat.com%2FGenuineMenacingImperialeagle-size_restricted.gif" alt="Goodbye" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>zookeeper</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Processing Streaming Twitter Data using Kafka and Spark — The Plan</title>
      <dc:creator>Dhoomil B Sheta</dc:creator>
      <pubDate>Mon, 05 Nov 2018 13:49:45 +0000</pubDate>
      <link>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-sparkthe-plan-129a</link>
      <guid>https://dev.to/dbsheta/processing-streaming-twitter-data-using-kafka-and-sparkthe-plan-129a</guid>
      <description>&lt;h2&gt;
  
  
  What is Apache Kafka?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Apache Kafka is a publish/subscribe messaging system. It is often described as a “distributed commit log” or more recently as a “distributed streaming platform.”&lt;br&gt;
Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged streaming platform&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2706%2F0%2Af_7HXjtx0Nva3RQR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2706%2F0%2Af_7HXjtx0Nva3RQR.png" alt="Source: [https://kafka.apache.org/images/kafka_diagram.png](https://kafka.apache.org/images/kafka_diagram.png)" width="800" height="771"&gt;&lt;/a&gt;&lt;em&gt;Source: &lt;a href="https://kafka.apache.org/images/kafka_diagram.png" rel="noopener noreferrer"&gt;https://kafka.apache.org/images/kafka_diagram.png&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  The Inspiration
&lt;/h2&gt;

&lt;p&gt;I recently read the book &lt;a href="https://www.confluent.io/resources/kafka-the-definitive-guide/" rel="noopener noreferrer"&gt;Kafka: The Definitive Guide&lt;/a&gt; by the creators of Kafka. It is truly a wonderful book for anyone who wants to start developing applications with Kafka as well as anyone who wants to know the internals of such a unique platform which is used by most of the Fortune 500 companies.&lt;/p&gt;



&lt;h2&gt;
  
  
  The Plan
&lt;/h2&gt;

&lt;p&gt;In this series, I’ll be exploring various aspects of Apache Kafka by implementing a cool data pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We’ll start by setting up a Kafka cluster in the cloud or locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After that, we’ll write a producer client which will continuously fetch the latest tweets using the Twitter API and push them to Kafka.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, we will implement an app using the Kafka Streams API, which will consume the tweets from Kafka in real-time and do basic processing on them, like finding the number of tweets per user and the most-used words (i.e. word count).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ll then venture into more cool stuff, like writing our own Kafka connector which uses Twitter as a data source, and learning to use Apache NiFi to achieve the same with less effort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ll use Spark Streaming to do sentiment analysis on real-time Twitter data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, if everything goes well, we’ll tweak our architecture and implement a notification service using Firebase and Kafka which will send a push notification to a user if his/her tweet has negative sentiment!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;



&lt;h2&gt;
  
  
  Let’s begin!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2AU_EwY9N-91IXddnk" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F10368%2F0%2AU_EwY9N-91IXddnk" alt="[By Amine Rock Hoovr](https://unsplash.com/@hoovr01?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>spark</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
