bm.ptoo
BEGINNER'S GUIDE TO STREAM REAL-TIME DATA USING APACHE KAFKA

NOTE:

  • In this project, I will be using an Ubuntu terminal; to follow along exactly, use the same.

  • We will also be using fake data generated with the Faker package (https://pypi.org/project/Faker/).

  • You will need to open four different terminals: one each for ZooKeeper, the Kafka server, the producer, and the consumer.

-Let us begin-

Step 1:
Log in to your server:

ssh <user>@<your_server_ip_address> e.g. ssh main@70.158.43.200

You will then be prompted for your password; once you enter it correctly, you will be logged in to the server.

Step 2:
Once in the server, first update the package lists with the following command.

sudo apt update

Then check whether Java and Python are available on your server:

java --version

python3 --version

If both Java and Python are present, we can proceed to the next step. Otherwise, install the missing one(s) using the command your terminal suggests, e.g. sudo apt install default-jdk for Java.

Step 3:

Go to your browser and navigate to this page: https://kafka.apache.org/downloads

Choose the version you want and then copy the link address from among the binary downloads.

For this, I will be using version 3.7.0 (Scala 2.12), via the link below.
https://archive.apache.org/dist/kafka/3.7.0/kafka_2.12-3.7.0.tgz

Back on our terminal, we download the Kafka release whose link address we copied, using wget (GNU Wget is a program that retrieves content from web servers; its name derives from "World Wide Web" and "get", an HTTP request method, and it supports downloading via HTTP, HTTPS, and FTP).

wget https://archive.apache.org/dist/kafka/3.7.0/kafka_2.12-3.7.0.tgz

As the downloaded Kafka is in a compressed format (.tgz), we will need to extract it.

tar -xvzf kafka_2.12-3.7.0.tgz

When you list the directory contents with ls, you will see two entries: the old compressed archive and the newly extracted directory (one with .tgz and one without).

We then remove the compressed file (.tgz), as we no longer need it.

rm kafka_2.12-3.7.0.tgz

If you run ls again, you will see that only the extracted directory remains.

The remaining directory has an unnecessarily long name (kafka_2.12-3.7.0), so we rename it to something shorter (kafka).

mv kafka_2.12-3.7.0 kafka

Step 4
As we already have Kafka on our server, we proceed to start ZooKeeper and the Kafka server.

Zookeeper

NOTE: This is run in its own different terminal

Navigate into the kafka folder (cd kafka) and then list its contents (ls). You will see several files and folders, but for now we are only interested in two folders: bin and config.

Inside the bin folder, run ls to see the files therein. In the context of ZooKeeper, we are interested in one file: zookeeper-server-start.sh

And inside the config folder, we need one file zookeeper.properties

With those two files identified, we move back to the kafka folder and run the following command:

bin/zookeeper-server-start.sh config/zookeeper.properties

HINT: This command gives the paths to the script and configuration file needed to start ZooKeeper.
Qn: Why do we need to start ZooKeeper before starting our Kafka server?
A: In this setup, the Kafka broker registers itself with ZooKeeper and stores cluster metadata there, so ZooKeeper must already be running when the broker starts.

Kafka Server

NOTE: This is run in its own different terminal

The next step is to start our Kafka server. Again, we need one file from each of the folders we mentioned.

Inside the bin folder, we need one file: kafka-server-start.sh
And inside the config folder, we need: server.properties

We then move to the kafka folder and pass the following code:

bin/kafka-server-start.sh config/server.properties

Running the two commands in their respective terminals starts ZooKeeper and the Kafka server.
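Once both are up, a quick way to confirm the broker is accepting connections is to probe its port from Python. This is a minimal sketch, assuming the broker is listening on the default localhost:9092; the `port_open` helper is just an illustration, not part of Kafka.

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Kafka's default listener is localhost:9092
print("broker reachable:", port_open("localhost", 9092))
```

If this prints False, check the Kafka server terminal for startup errors before moving on.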

Step 5

Now on to the producer and the consumer

Faker_producer code:

```python
from kafka import KafkaProducer
from faker import Faker
import json
import time

fake = Faker()

# Serialize each value to UTF-8 JSON bytes before sending
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def generate_users():
    return {
        'name': fake.name(),
        'email': fake.email(),
        'address': fake.address(),
        'phone_no': fake.phone_number()
    }

# Send a fresh fake user to the 'fakes' topic every 5 seconds
while True:
    user = generate_users()
    producer.send('fakes', user)
    print(f"sent: {user}")
    time.sleep(5)
```
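The value_serializer is what turns each Python dict into the UTF-8 JSON bytes that actually travel to Kafka. You can see its effect in isolation; this is a standalone sketch with a made-up sample record, no broker required:

```python
import json

# The same serializer the producer uses
serialize = lambda v: json.dumps(v).encode('utf-8')

user = {'name': 'Jane Doe', 'email': 'jane@example.com'}
payload = serialize(user)

print(type(payload))  # the dict has become bytes, ready for the wire
print(payload)
```

Kafka itself only ever sees bytes; the serializer is where your Python objects cross that boundary.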

Faker_consumer code:

```python
from kafka import KafkaConsumer
import json

# Deserialize each message from UTF-8 JSON bytes back into a dict
consumer = KafkaConsumer(
    'fakes',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    print(f"received: {message.value}")
```
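The consumer's value_deserializer simply reverses the producer's serializer, so a record survives the trip unchanged. The round trip can be checked without a broker; a minimal sketch with a made-up record:

```python
import json

serialize = lambda v: json.dumps(v).encode('utf-8')    # producer side
deserialize = lambda m: json.loads(m.decode('utf-8'))  # consumer side

original = {'name': 'John Smith', 'phone_no': '555-0100'}
# What the consumer prints is exactly what the producer sent
assert deserialize(serialize(original)) == original
print("round trip ok")
```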

Afterwards, we push our code to the server. It is best to first create a folder on the server (mkdir your_name) to push the code into.

On your local terminal, run the following to copy the file(s) to the server:

scp name_of_your_file <user>@your_server_ip_address:your_path

scp faker_producer.py main@70.100.22.221:/home/main/Ptoo

scp faker_consumer.py main@70.100.22.221:/home/main/Ptoo

Step 6

After ascertaining that our code has successfully been pushed to the server, we create a virtual environment, preferably in the folder we created.

python3 -m venv env_name

python3 -m venv fakerenv

Activate it so that pip installs into the environment rather than system-wide:

source fakerenv/bin/activate

We then install kafka-python, Faker, and any other missing dependencies using pip.

pip install kafka-python
pip install faker

And so on.

Step 7

We then move to our third terminal, the one meant for the producer, and log in to the server as before.

We will then activate our virtual environment:

source fakerenv/bin/activate

After activation, we will then run our producer file:

python3 faker_producer.py

Our stream of never-ending data should be visible on our terminal.

Step 8

On our fourth terminal, meant for the consumer, we log in to the server as usual and confirm that our files are there.

We will then activate our virtual environment:

source fakerenv/bin/activate

And moving to the folder we created, we will run our faker_consumer file.

python3 faker_consumer.py

If you open the producer and consumer terminals side by side, you will see the data stream on both screens in real time.

And that is it.
