<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Boyu</title>
    <description>The latest articles on DEV Community by Boyu (@boyu1997).</description>
    <link>https://dev.to/boyu1997</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390833%2F909f51aa-7baa-495a-bc53-0ee09328146a.jpeg</url>
      <title>DEV Community: Boyu</title>
      <link>https://dev.to/boyu1997</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/boyu1997"/>
    <language>en</language>
    <item>
      <title>Intro to Kafka using Docker and Python</title>
      <dc:creator>Boyu</dc:creator>
      <pubDate>Wed, 25 Nov 2020 05:38:09 +0000</pubDate>
      <link>https://dev.to/boyu1997/intro-to-kafka-4hn2</link>
      <guid>https://dev.to/boyu1997/intro-to-kafka-4hn2</guid>
      <description>&lt;p&gt;&lt;em&gt;Note: To get a copy of all code used in this tutorial, please clone this &lt;a href="https://github.com/Boyu1997/kafka-light-bulb" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. Reference the README file on how to quickly run all experiments mentioned in this tutorial.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Object vs Log
&lt;/h2&gt;

&lt;p&gt;In database design, the common practice is to model the world in terms of things. This way of thinking underlies the design of SQL databases like MySQL and PostgreSQL: developers define objects to describe things in the world, and the objects are stored in tables with a defined schema. For example, to describe a light bulb, we might record its brand, manufacturer, brightness, energy consumption, and current state (on or off). This works well when the focus is storing data or creating a digital copy of a real-world object. However, object databases face challenges with streaming data. In the light bulb example, if the bulb is used to transmit Morse code, its state is constantly switching between on and off, and it becomes difficult to capture the changing bulb in a single object representation. This is when thinking of data as event logs becomes useful.&lt;/p&gt;
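&lt;p&gt;To make the contrast concrete, here is a minimal Python sketch of the two ways of modeling the light bulb. The field names and event values are hypothetical, chosen only for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass

# Object view: one mutable record, updated in place.
@dataclass
class LightBulb:
    brand: str
    brightness_lm: int
    is_on: bool

bulb = LightBulb(brand="Acme", brightness_lm=800, is_on=False)
bulb.is_on = True  # the previous state is overwritten and lost

# Log view: an append-only list of immutable events.
event_log = [
    {"ts": 1.0, "state": "on"},
    {"ts": 1.2, "state": "off"},
    {"ts": 1.5, "state": "on"},
]
# The full on/off history (e.g. a Morse transmission) stays recoverable.
states = [e["state"] for e in event_log]
```

&lt;p&gt;With the object view, only the latest state survives; with the log view, every transition is an event, which is exactly what a Morse-code signal needs.&lt;/p&gt;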

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5syuxej2fgmszndke5k0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5syuxej2fgmszndke5k0.png" alt="Light Bulb Example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kafka?
&lt;/h2&gt;

&lt;p&gt;Kafka is a distributed system designed for streaming data. It replaces the traditional object-based database storage with an event-based system. Kafka can be thought of as a distributed log, with information stored as events with data attached to it.&lt;/p&gt;

&lt;p&gt;Kafka's design works well with the modern containerization trend and the shift from one big software that operates everything to many small containers that operate independent tasks. By thinking of data in terms of logs, processes can be easily decoupled by splitting an application into many independent read and write tasks. Kafka facilitates data communication by providing write access to data producers and read access to data consumers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Felr341e6g2cl16ktbvql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Felr341e6g2cl16ktbvql.png" alt="Kafka Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kafka can handle large traffic by being a distributed system. A Kafka deployment can have many broker servers, which allows for horizontal scaling. Event data in Kafka is organized by &lt;strong&gt;topic&lt;/strong&gt;, and each topic is split into multiple &lt;strong&gt;partitions&lt;/strong&gt;. This allows Kafka to scale horizontally: new broker nodes can be assigned to handle partitions, and the partition load can be rebalanced across brokers. Kafka ensures data fault tolerance by replicating each partition multiple times.&lt;/p&gt;
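&lt;p&gt;Conceptually, a message's partition is chosen by hashing its key. The sketch below illustrates the idea only: Kafka's real default partitioner (in the Java client) uses a murmur2 hash of the key, so Python's built-in &lt;code&gt;hash()&lt;/code&gt; here is just a stand-in:&lt;/p&gt;

```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner, which hashes the
    # message key (murmur2 in the Java client) modulo partition count.
    return hash(key) % num_partitions

# All events with the same key land in the same partition,
# which preserves per-key ordering.
p1 = pick_partition(b"light-1", 3)
p2 = pick_partition(b"light-1", 3)
```

&lt;p&gt;This is why a key is chosen per light bulb later in the tutorial: events from one bulb always arrive in order within its partition.&lt;/p&gt;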

&lt;h2&gt;
  
  
  Deploy Kafka on Docker
&lt;/h2&gt;

&lt;p&gt;We will deploy a Kafka cluster with three broker nodes using Docker. For this example, we will use the Kafka-Zookeeper setup. An additional Kafdrop node will provide a web user interface for monitoring the Kafka cluster. The architecture looks like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi2idrh2r927kv6nmxrou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi2idrh2r927kv6nmxrou.png" alt="Kafka Interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;, information on installing Docker &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt;, information on installing Docker Compose &lt;a href="https://docs.docker.com/compose/install/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Docker Compose
&lt;/h3&gt;

&lt;p&gt;First, the configuration for Zookeeper: we use the zookeeper Docker image and expose port 2181, the default Zookeeper client port.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

zookeeper:
  image: zookeeper:3.4.9
  hostname: zookeeper
  ports:
    - "2181:2181"
  environment:
    ZOO_MY_ID: 1
    ZOO_PORT: 2181
    ZOO_SERVERS: server.1=zookeeper:2888:3888
  volumes:
    - ./data/zookeeper/data:/data
    - ./data/zookeeper/datalog:/datalog


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, the configuration for the Kafka broker nodes. We use the Confluent Docker image for Kafka and configure each broker to communicate with the Zookeeper node. The following configuration is replicated three times, with hostnames kafka1, kafka2, and kafka3 on ports 9091, 9092, and 9093.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kafka1:
  image: confluentinc/cp-kafka:5.3.0
  hostname: kafka1
  ports:
    - "9091:9091"
  environment:
    KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka1:19091,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9091
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
    KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
    KAFKA_BROKER_ID: 1
  volumes:
    - ./data/kafka1/data:/var/lib/kafka/data
  depends_on:
    - zookeeper


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, the configuration for Kafdrop. We only need to point Kafdrop at one of the Kafka brokers; it discovers the other brokers from the cluster metadata.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kafdrop:
  image: obsidiandynamics/kafdrop
  restart: "no"
  ports:
    - "9000:9000"
  environment:
    KAFKA_BROKERCONNECT: "kafka1:19091"
  depends_on:
    - kafka1
    - kafka2
    - kafka3


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The final configuration should look like this &lt;a href="https://github.com/Boyu1997/kafka-light-bulb/blob/master/docker-compose.yml" rel="noopener noreferrer"&gt;docker-compose.yml&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;Start the Kafka cluster by running &lt;code&gt;docker-compose up&lt;/code&gt;. This will deploy five Docker containers, which you can verify with &lt;code&gt;docker ps&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Go to &lt;a href="http://localhost:9000/" rel="noopener noreferrer"&gt;localhost:9000&lt;/a&gt; and you should see the Kafdrop page showing your Kafka deployment with three broker nodes named kafka1, kafka2, and kafka3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6t4im385ykncal21ez9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6t4im385ykncal21ez9.png" alt="Kafdrop interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Dataflow
&lt;/h2&gt;

&lt;p&gt;With the local Kafka service running, we can start interacting with Kafka. In this part, we will build a publisher using the light bulb Morse code example mentioned at the beginning, and a consumer to count the number of "dots" and "dashes" being transmitted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a Kafka Topic
&lt;/h3&gt;

&lt;p&gt;We first need to create a topic in Kafka for the publisher to send data to. In Kafdrop, click &lt;em&gt;New&lt;/em&gt; at the bottom of the page, name the new topic &lt;code&gt;light_bulb&lt;/code&gt;, and set the number of partitions to 3. For the replication factor, we can keep the default setting of 3, which means each partition of &lt;em&gt;light_bulb&lt;/em&gt; will be replicated three times. Back on the home page, you should see the topic &lt;em&gt;light_bulb&lt;/em&gt; with 3 partitions.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpvomebcj08eauvi9u25x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpvomebcj08eauvi9u25x.png" alt="Kafka Interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Publisher
&lt;/h3&gt;

&lt;p&gt;To add data to the &lt;em&gt;light_bulb&lt;/em&gt; topic, we need a publisher that talks to Kafka. A simple way is to use a Python client library. Install Confluent's Kafka Python client with &lt;code&gt;pip install confluent-kafka&lt;/code&gt;, and we can start sending data to Kafka:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'localhost:9091'})
p.produce('light_bulb', key='hello', value='world')
p.flush(30)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;Producer&lt;/code&gt; class takes a configuration dictionary in which we specify the address of a Kafka broker. Only one address is needed because each broker holds metadata for reaching the others. The &lt;code&gt;produce&lt;/code&gt; function sends data asynchronously, without waiting for confirmation. It takes three inputs: the Kafka topic name, a key used to determine which partition the data is added to, and a value string with the log data. The &lt;code&gt;flush&lt;/code&gt; function is called at the end to wait until all outstanding messages are delivered before the process exits.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://github.com/Boyu1997/kafka-light-bulb/blob/master/producer.py" rel="noopener noreferrer"&gt;producer.py&lt;/a&gt; file implements the Morse-code-sending light bulb example. We can have the light bulb send Morse code by publishing on/off status logs to the Kafka cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

python3 producer.py --key="light-1" --string="XYZ"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
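&lt;p&gt;The actual encoding logic lives in producer.py in the repository; the following is only a sketch of how a string might become timed on/off events. The &lt;code&gt;MORSE&lt;/code&gt; table and the timing units here are illustrative, covering just the three demo letters:&lt;/p&gt;

```python
# Hypothetical subset of the Morse alphabet for the demo string "XYZ".
MORSE = {"X": "-..-", "Y": "-.--", "Z": "--.."}

def string_to_events(s):
    """Turn a string into (state, duration) events: a dot is 1 unit on,
    a dash 3 units on, with 1 unit off between symbols."""
    events = []
    for letter in s:
        for symbol in MORSE[letter]:
            on_time = 1 if symbol == "." else 3
            events.append(("on", on_time))
            events.append(("off", 1))
    return events

events = string_to_events("XYZ")
# 12 Morse symbols, each producing one "on" and one "off" event
```

&lt;p&gt;Each of these events would then be published to the &lt;em&gt;light_bulb&lt;/em&gt; topic with the bulb's id as the key.&lt;/p&gt;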

&lt;p&gt;With some data published, we can see information about it in the Kafdrop interface.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy6ia13086vy6imslm7xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy6ia13086vy6imslm7xn.png" alt="Kafka Interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Consumer
&lt;/h3&gt;

&lt;p&gt;Here is a simple example of consuming messages using the Python library.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from confluent_kafka import Consumer, KafkaError

c = Consumer({
    'bootstrap.servers': 'localhost:9091',
    'group.id': 'counting-group',
    'enable.auto.commit': True,
    'session.timeout.ms': 6000,
    'default.topic.config': {'auto.offset.reset': 'smallest'}
})

c.subscribe(['light_bulb'])
while True:
    msg = c.poll(0.1)
    if msg is None:
        continue
    elif not msg.error():
        print(msg.value())
    elif msg.error().code() == KafkaError._PARTITION_EOF:
        print("End of partition reached")
    else:
        print("Error")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;Consumer&lt;/code&gt; class also takes a configuration dictionary. We specify the name of the consumer group, enable auto-commit of the consumer offset, set a session timeout, and configure the offset to start at the smallest element. The consumer object subscribes to a list of topics, and the while loop continuously checks the subscribed topics for new log data.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://github.com/Boyu1997/kafka-light-bulb/blob/master/consumer.py" rel="noopener noreferrer"&gt;consumer.py&lt;/a&gt; file provides an implementation of counting the number of "dots" and "dashes" in the Morse code transmitted in the light bulb example.&lt;/p&gt;
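&lt;p&gt;The counting itself can be as simple as the sketch below. The real consumer.py in the repository reads the symbols from Kafka messages rather than from a Python list, which stands in for them here:&lt;/p&gt;

```python
def count_symbols(messages):
    """Tally dots and dashes from consumed Morse symbol messages,
    ignoring anything that is not a recognized symbol."""
    counts = {".": 0, "-": 0}
    for msg in messages:
        if msg in counts:
            counts[msg] += 1
    return counts

# A stand-in for messages polled from the light_bulb topic.
counts = count_symbols([".", "-", ".", ".", "-"])
```

&lt;p&gt;In the real consumer this function would be fed from the &lt;code&gt;c.poll&lt;/code&gt; loop shown above, one decoded message value at a time.&lt;/p&gt;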

&lt;h2&gt;
  
  
  Failure Handling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Node Failure
&lt;/h3&gt;

&lt;p&gt;Because Kafka is designed to be fault-tolerant, we can simulate node failure by stopping a container. This simulates a server being brought down by faults in software, hardware, or the network. To stop broker node number 3, run &lt;code&gt;docker stop kafka_kafka3_1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa79wteofhalpbl6x90na.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa79wteofhalpbl6x90na.png" alt="Kafka without broker node 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Refreshing the Kafdrop page, we should see that host kafka3 is missing and the partitions it served are now allocated between kafka1 and kafka2. In this case, each of the two remaining broker nodes is handling 2 partitions, summing to 4 partitions in the system, which indicates no data was lost with the failure of broker node kafka3. Clicking into the topic &lt;em&gt;light_bulb&lt;/em&gt;, under partitions we can see that each partition previously led by kafka3 now has another available node as its leader. Also, the loss of broker node 3 leaves all of the partitions under-replicated, as the configuration calls for three replicas but only two broker nodes are available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Node Failure
&lt;/h3&gt;

&lt;p&gt;We will also stop broker node number 2 by running &lt;code&gt;docker stop kafka_kafka2_1&lt;/code&gt;. Because we configured a replication factor of three, failing two nodes still does not result in data loss. We can see that the leader for all partitions has switched to the remaining broker node 1.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzsebufmgi8cuyfaxm9r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzsebufmgi8cuyfaxm9r6.png" alt="Kafka without broker node 2 and 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  All Nodes Failure
&lt;/h3&gt;

&lt;p&gt;To stop the last remaining node, run &lt;code&gt;docker stop kafka_kafka1_1&lt;/code&gt;. In this case we do risk data loss, as all replicas of the data are gone. However, this is an unlikely situation: a total outage is more often caused by network issues than by every broker node actually failing, so when the network recovers the data can likely be retrieved. Having no broker nodes does mean, however, that the Kafka service is unavailable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In general, Kafka is a highly configurable distributed system that fits the needs of modern agile applications. It serves as the data layer between the publishers and consumers of application data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Run Python MapReduce on local Docker Hadoop Cluster</title>
      <dc:creator>Boyu</dc:creator>
      <pubDate>Mon, 05 Oct 2020 21:54:59 +0000</pubDate>
      <link>https://dev.to/boyu1997/run-python-mapreduce-on-local-docker-hadoop-cluster-1g46</link>
      <guid>https://dev.to/boyu1997/run-python-mapreduce-on-local-docker-hadoop-cluster-1g46</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This post covers how to deploy a local Docker Hadoop cluster and run custom Python mapper and reducer functions using the classic word count example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;, get Docker &lt;a href="https://docs.docker.com/get-docker/"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt;, get Docker Compose &lt;a href="https://docs.docker.com/compose/install/"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Git&lt;/strong&gt;, get Git &lt;a href="https://git-scm.com/book/en/v2/Getting-Started-Installing-Git"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy Hadoop Cluster using Docker
&lt;/h2&gt;

&lt;p&gt;We will use the Docker image from the &lt;a href="https://github.com/big-data-europe/docker-hadoop"&gt;big-data-europe repository&lt;/a&gt; to set up Hadoop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:big-data-europe/docker-hadoop.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the Docker image for Hadoop on your local machine, we can use docker-compose to configure the local Hadoop cluster. Replace the &lt;code&gt;docker-compose.yml&lt;/code&gt; file with the following file from &lt;a href="https://gist.github.com/nathan815/a938b3f7a4d06b2811cf2b1a917800e1"&gt;this GitHub Gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This docker-compose file configures a Hadoop cluster with a master node (namenode) and three worker nodes; it also configures the network ports to allow communication between the nodes. To start the cluster, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;docker ps&lt;/code&gt; to verify the containers are up; you should see a container list similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IMAGE                           PORTS                    NAMES
docker-hadoop_resourcemanager                            resourcemanager
docker-hadoop_nodemanager1      0.0.0.0:8042-&amp;gt;8042/tcp   nodemanager1
docker-hadoop_historyserver     0.0.0.0:8188-&amp;gt;8188/tcp   historyserver
docker-hadoop_datanode3         9864/tcp                 datanode3
docker-hadoop_datanode2         9864/tcp                 datanode2
docker-hadoop_datanode1         9864/tcp                 datanode1
docker-hadoop_namenode          0.0.0.0:9870-&amp;gt;9870/tcp   namenode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The current status of the local Hadoop cluster will be available at &lt;a href="http://localhost:9870/"&gt;localhost:9870&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Python MapReduce function
&lt;/h2&gt;

&lt;p&gt;For this simple MapReduce program, we will use the classical word count example. The program reads text files and counts how often each word occurs.&lt;/p&gt;

&lt;p&gt;The mapper function will read the text and emit the key-value pair, which in this case is &lt;code&gt;&amp;lt;word, 1&amp;gt;&lt;/code&gt;. Copy the following code into &lt;code&gt;mapper.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print ('%s\t%s' % (word, 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reducer function processes the result from the mapper and returns the word count. Copy the following code into &lt;code&gt;reducer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python
"""reducer.py"""

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print ('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print ('%s\t%s' % (current_word, current_count))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that because Hadoop is built in Java, a MapReduce job normally takes a Java JAR file as input. To execute Python in Hadoop, we need the &lt;a href="https://hadoop.apache.org/docs/r1.2.1/streaming.html"&gt;Hadoop Streaming library&lt;/a&gt; to pipe data through the Python executables. As a result, our Python scripts read their input from STDIN and write their output to STDOUT.&lt;/p&gt;
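&lt;p&gt;Before submitting to Hadoop, the streaming contract can be simulated in plain Python: map each line to key-value pairs, sort them (as Hadoop's shuffle phase does between map and reduce), then reduce. This sketch mirrors the mapper/reducer logic above in-process, which is a handy way to sanity-check the scripts:&lt;/p&gt;

```python
def map_lines(lines):
    # Mirrors mapper.py: emit (word, 1) for every word on every line.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_sorted(pairs):
    # Mirrors reducer.py: sum the counts per word. A dict is used here
    # instead of the run-length logic, which relies on sorted input.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["Hello World", "Hello Docker", "Hello Hadoop", "Hello MapReduce"]
pairs = sorted(map_lines(lines))  # stand-in for the shuffle/sort phase
word_counts = reduce_sorted(pairs)
```

&lt;p&gt;The equivalent shell check is piping the input files through mapper.py, &lt;code&gt;sort&lt;/code&gt;, and reducer.py.&lt;/p&gt;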

&lt;p&gt;Copy the local &lt;code&gt;mapper.py&lt;/code&gt; and &lt;code&gt;reducer.py&lt;/code&gt; to the namenode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker cp LOCAL_PATH/mapper.py namenode:mapper.py
docker cp LOCAL_PATH/reducer.py namenode:reducer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enter the namenode container of the Hadoop cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it namenode bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;ls&lt;/code&gt; and you should find &lt;code&gt;mapper.py&lt;/code&gt; and &lt;code&gt;reducer.py&lt;/code&gt; in the namenode container.&lt;/p&gt;

&lt;p&gt;Now let's prepare the input. For this simple example, we will use a set of text files each containing a short string. For a more realistic example, you can use an e-book from &lt;a href="http://www.gutenberg.org/"&gt;Project Gutenberg&lt;/a&gt;; download the &lt;code&gt;Plain Text UTF-8&lt;/code&gt; encoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir input
echo "Hello World" &amp;gt;input/f1.txt
echo "Hello Docker" &amp;gt;input/f2.txt
echo "Hello Hadoop" &amp;gt;input/f3.txt
echo "Hello MapReduce" &amp;gt;input/f4.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MapReduce program accesses files from the Hadoop Distributed File System (HDFS). Run the following to transfer the input directory and files to HDFS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hadoop fs -mkdir -p input
hdfs dfs -put ./input/* input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;find / -name 'hadoop-streaming*.jar'&lt;/code&gt; to locate the Hadoop Streaming library JAR file. The path should look something like &lt;code&gt;PATH/hadoop-streaming-3.2.1.jar&lt;/code&gt;.&lt;br&gt;
Finally, we can execute the MapReduce program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hadoop jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
-file mapper.py    -mapper mapper.py \
-file reducer.py   -reducer reducer.py \
-input input -output output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To safely shut down the cluster and remove containers, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;p&gt;Yen V. (2019). How to set up a Hadoop cluster in Docker.&lt;/p&gt;

&lt;p&gt;Noll M. Writing An Hadoop MapReduce Program In Python.&lt;br&gt;
Retrieved from: &lt;a href="https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>mapreduce</category>
    </item>
    <item>
      <title>Creating an interactive application for explaining machine learning project to non-cs major students</title>
      <dc:creator>Boyu</dc:creator>
      <pubDate>Tue, 26 May 2020 03:51:48 +0000</pubDate>
      <link>https://dev.to/boyu1997/creating-an-interactive-application-for-explaining-machine-learning-project-to-non-cs-major-students-1bf5</link>
      <guid>https://dev.to/boyu1997/creating-an-interactive-application-for-explaining-machine-learning-project-to-non-cs-major-students-1bf5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I am a CS + statistics major working on my graduation project, and in 2020 a graduation project would feel almost out of place if it were not somehow related to machine learning. So I ran an experiment with the DenseNet architecture, testing different levels of dense connections and how they influence network performance. (To learn more about DenseNet, read the &lt;a href="https://arxiv.org/abs/1608.06993"&gt;paper&lt;/a&gt; and this &lt;a href="https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803"&gt;article&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;However, my school also requires me to teach my project to classmates from different majors in a 60-minute session. I can't assume they know what a neural network is, and there is no time to give a crash course in machine learning before explaining my project. What better way to explain a technical project to a non-technical audience than visualization? This reminded me of the amazing &lt;a href="https://playground.tensorflow.org/"&gt;TensorFlow Playground&lt;/a&gt; I played with when first learning machine learning. So I started building a similar, though much simpler, playground application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="http://cnn-dense-connection.boyu.io/"&gt;http://cnn-dense-connection.boyu.io/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Link to Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Boyu1997/cnn-dense-connection"&gt;https://github.com/Boyu1997/cnn-dense-connection&lt;/a&gt;&lt;br&gt;
navigate to &lt;code&gt;playground&lt;/code&gt; for code specific to this application&lt;/p&gt;

&lt;h2&gt;
  
  
  How I built it
&lt;/h2&gt;

&lt;p&gt;The idea of this project is simple: give users the ability to add and remove dense connections in the network by clicking the connection edges, and show an instant update of the prediction result.&lt;/p&gt;

&lt;p&gt;To simplify the project, all of the models are trained locally using code similar to my dense connection experiment, and all the prediction results are saved into a JSON file. Besides being much smaller, the main change to the model is switching the final-layer activation to Softmax, so that each prediction is between 0% and 100% and the predictions sum nicely to one, rather than taking the maximum of the unnormalized outputs as the top class label. The web application then only needs to be a frontend application that handles click input and updates the page render. The main challenge of this project for me was learning and using D3.js quickly. Examining the code for TensorFlow Playground, I saw that their network configuration component is built on D3. I felt hesitant to start learning D3 for a side project with a tight deadline, but after going through some alternative network visualization libraries, it became clear that the higher-level libraries do not give sufficient control, so D3 was the most sensible option.&lt;/p&gt;

&lt;p&gt;A front-end application is mostly about managing state and rendering the page accordingly, so my initial thought was to use React, which I have experience with. However, I soon found that using D3 inside React is not the most intuitive way to code. As this is only a one-page application, I decided to use D3 to render all the interactive parts of the page into an SVG element and keep the rest as plain HTML. For hosting, I used Webpack to build the JavaScript bundle, placed all the static files in a GCP storage bucket, and set up website access.&lt;/p&gt;

&lt;p&gt;Another interesting thing I completely missed while making the first iteration of this project is how large the number of possible model configurations is. I originally went for a network with three dense blocks and six total dense layers, as in the illustration used in CondenseNet (see figure 2 in the &lt;a href="https://github.com/ShichenLiu/CondenseNet"&gt;repository&lt;/a&gt;). This model has 21 dense connections (the curved edges in the plot), giving &lt;code&gt;2^21 = 2097152&lt;/code&gt; possible configurations, since each dense connection can be set on or off. Given that I planned to pre-train the models locally, this was impossible to carry out. So I reduced the network to two dense blocks with four total layers, as in the application. This gives 10 dense connections and 1024 possible configurations, and training all 1024 models takes a few hours on a GPU.&lt;/p&gt;
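&lt;p&gt;The configuration count is just a power of two, doubling with every connection. A quick sketch of the arithmetic and of enumerating the on/off masks for the reduced network:&lt;/p&gt;

```python
from itertools import product

def num_configurations(n_connections):
    # Each dense connection can independently be on or off.
    return 2 ** n_connections

assert num_configurations(21) == 2097152  # original 3-block, 6-layer design
assert num_configurations(10) == 1024     # reduced 2-block, 4-layer design

# Enumerate all 1024 on/off masks for the 10 connections; each mask
# corresponds to one model to pre-train.
configs = list(product([0, 1], repeat=10))
```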

</description>
      <category>octograd2020</category>
    </item>
  </channel>
</rss>
