<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Boyu</title>
    <description>The latest articles on DEV Community by Boyu (@boyu1997).</description>
    <link>https://dev.to/boyu1997</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390833%2F909f51aa-7baa-495a-bc53-0ee09328146a.jpeg</url>
      <title>DEV Community: Boyu</title>
      <link>https://dev.to/boyu1997</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/boyu1997"/>
    <language>en</language>
    <item>
      <title>Intro to Kafka using Docker and Python</title>
      <dc:creator>Boyu</dc:creator>
      <pubDate>Wed, 25 Nov 2020 05:38:09 +0000</pubDate>
      <link>https://dev.to/boyu1997/intro-to-kafka-4hn2</link>
      <guid>https://dev.to/boyu1997/intro-to-kafka-4hn2</guid>
      <description>&lt;p&gt;&lt;em&gt;Note: To get a copy of all code used in this tutorial, please clone this &lt;a href="https://github.com/Boyu1997/kafka-light-bulb" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. Reference the README file on how to quickly run all experiments mentioned in this tutorial.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Object vs Log
&lt;/h2&gt;

&lt;p&gt;In database design, the common practice is to model the world in terms of things. This way of thinking underlies the design of SQL databases like MySQL and PostgreSQL: developers define objects to describe things in the world, and the objects are stored in tables with a defined schema. For example, to describe a light bulb, we might record its brand, manufacturer, brightness, energy consumption, and current state (on or off). This works well when the focus is storing data or creating a digital copy of a real-world object. However, object databases face challenges with streaming data. In the light bulb example, if the bulb is used to transmit Morse code, its state is constantly switching between on and off, and it becomes difficult to capture the changing bulb in a single object representation. This is when thinking of data as event logs becomes useful.&lt;/p&gt;
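&lt;p&gt;To make the contrast concrete, here is a minimal Python sketch of the two ways of modeling the light bulb. The field names and event values are hypothetical, chosen only for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass

# Object view: one mutable record, updated in place.
@dataclass
class LightBulb:
    brand: str
    brightness_lm: int
    is_on: bool

bulb = LightBulb(brand="Acme", brightness_lm=800, is_on=False)
bulb.is_on = True  # the previous state is overwritten and lost

# Log view: an append-only list of immutable events.
event_log = [
    {"ts": 1.0, "state": "on"},
    {"ts": 1.2, "state": "off"},
    {"ts": 1.5, "state": "on"},
]
# The full on/off history (e.g. a Morse transmission) stays recoverable.
states = [e["state"] for e in event_log]
```

&lt;p&gt;With the object view, only the latest state survives; with the log view, every transition is an event, which is exactly what a Morse-code signal needs.&lt;/p&gt;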

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5syuxej2fgmszndke5k0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5syuxej2fgmszndke5k0.png" alt="Light Bulb Example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kafka?
&lt;/h2&gt;

&lt;p&gt;Kafka is a distributed system designed for streaming data. It replaces the traditional object-based database storage with an event-based system. Kafka can be thought of as a distributed log, with information stored as events with data attached to it.&lt;/p&gt;

&lt;p&gt;Kafka's design works well with the modern containerization trend and the shift from one big software that operates everything to many small containers that operate independent tasks. By thinking of data in terms of logs, processes can be easily decoupled by splitting an application into many independent read and write tasks. Kafka facilitates data communication by providing write access to data producers and read access to data consumers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Felr341e6g2cl16ktbvql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Felr341e6g2cl16ktbvql.png" alt="Kafka Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kafka can handle large traffic by being a distributed system. A Kafka deployment can have many broker servers, which allows for horizontal scaling. Event data in Kafka is organized by &lt;strong&gt;topic&lt;/strong&gt;, and each topic is split into multiple &lt;strong&gt;partitions&lt;/strong&gt;. This allows Kafka to scale horizontally: new broker nodes can be assigned to handle partitions, and the partition load can be rebalanced across brokers. Kafka ensures data fault tolerance by replicating each partition multiple times.&lt;/p&gt;
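&lt;p&gt;Conceptually, a message's partition is chosen by hashing its key. The sketch below illustrates the idea only: Kafka's real default partitioner (in the Java client) uses a murmur2 hash of the key, so Python's built-in &lt;code&gt;hash()&lt;/code&gt; here is just a stand-in:&lt;/p&gt;

```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner, which hashes the
    # message key (murmur2 in the Java client) modulo partition count.
    return hash(key) % num_partitions

# All events with the same key land in the same partition,
# which preserves per-key ordering.
p1 = pick_partition(b"light-1", 3)
p2 = pick_partition(b"light-1", 3)
```

&lt;p&gt;This is why a key is chosen per light bulb later in the tutorial: events from one bulb always arrive in order within its partition.&lt;/p&gt;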

&lt;h2&gt;
  
  
  Deploy Kafka on Docker
&lt;/h2&gt;

&lt;p&gt;We will deploy a Kafka cluster with three broker nodes using Docker. For this example, we will use the Kafka-Zookeeper setup. An additional Kafdrop node will provide a web user interface for monitoring the Kafka cluster. The architecture looks like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi2idrh2r927kv6nmxrou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi2idrh2r927kv6nmxrou.png" alt="Kafka Interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;, information on installing Docker &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt;, information on installing Docker Compose &lt;a href="https://docs.docker.com/compose/install/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Docker Compose
&lt;/h3&gt;

&lt;p&gt;First, the configuration for Zookeeper: we use the zookeeper Docker image and expose port 2181, the default Zookeeper client port.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

zookeeper:
  image: zookeeper:3.4.9
  hostname: zookeeper
  ports:
    - "2181:2181"
  environment:
    ZOO_MY_ID: 1
    ZOO_PORT: 2181
    ZOO_SERVERS: server.1=zookeeper:2888:3888
  volumes:
    - ./data/zookeeper/data:/data
    - ./data/zookeeper/datalog:/datalog


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, the configuration for the Kafka broker nodes. We use the Confluent Docker image for Kafka and configure each broker to communicate with the Zookeeper node. The following configuration is replicated three times, with hostnames kafka1, kafka2, and kafka3 on ports 9091, 9092, and 9093.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kafka1:
  image: confluentinc/cp-kafka:5.3.0
  hostname: kafka1
  ports:
    - "9091:9091"
  environment:
    KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka1:19091,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9091
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
    KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
    KAFKA_BROKER_ID: 1
  volumes:
    - ./data/kafka1/data:/var/lib/kafka/data
  depends_on:
    - zookeeper


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, the configuration for Kafdrop. We only need to point Kafdrop at one of the Kafka brokers; it discovers the other brokers from the cluster metadata.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kafdrop:
  image: obsidiandynamics/kafdrop
  restart: "no"
  ports:
    - "9000:9000"
  environment:
    KAFKA_BROKERCONNECT: "kafka1:19091"
  depends_on:
    - kafka1
    - kafka2
    - kafka3


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The final configuration should look like this &lt;a href="https://github.com/Boyu1997/kafka-light-bulb/blob/master/docker-compose.yml" rel="noopener noreferrer"&gt;docker-compose.yml&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;Start the Kafka cluster by running &lt;code&gt;docker-compose up&lt;/code&gt;. This will deploy five Docker containers, which you can verify with &lt;code&gt;docker ps&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Go to &lt;a href="http://localhost:9000/" rel="noopener noreferrer"&gt;localhost:9000&lt;/a&gt; and you should see the Kafdrop page showing your Kafka deployment with three broker nodes named kafka1, kafka2, and kafka3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6t4im385ykncal21ez9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6t4im385ykncal21ez9.png" alt="Kafdrop interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Dataflow
&lt;/h2&gt;

&lt;p&gt;With the local Kafka service running, we can start interacting with Kafka. In this part, we will build a publisher using the light bulb Morse code example mentioned at the beginning, and a consumer to count the number of "dots" and "dashes" being transmitted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a Kafka Topic
&lt;/h3&gt;

&lt;p&gt;We first need to create a topic in Kafka for the publisher to send data to. In Kafdrop, click &lt;em&gt;New&lt;/em&gt; at the bottom of the page, name the new topic &lt;code&gt;light_bulb&lt;/code&gt;, and set the number of partitions to 3. For the replication factor, we can keep the default setting of 3, which means each partition of &lt;em&gt;light_bulb&lt;/em&gt; will be replicated three times. Back on the home page, you should see the topic &lt;em&gt;light_bulb&lt;/em&gt; with 3 partitions.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpvomebcj08eauvi9u25x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpvomebcj08eauvi9u25x.png" alt="Kafka Interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Publisher
&lt;/h3&gt;

&lt;p&gt;To add data to the &lt;em&gt;light_bulb&lt;/em&gt; topic, we need a publisher that talks to Kafka. A simple way is to use a Python client library. Install Confluent's Kafka Python client with &lt;code&gt;pip install confluent-kafka&lt;/code&gt;, and we can start sending data to Kafka:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'localhost:9091'})
p.produce('light_bulb', key='hello', value='world')
p.flush(30)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;Producer&lt;/code&gt; class takes a configuration dictionary in which we specify the address of a Kafka broker. Only one address is needed because each broker holds metadata for reaching the others. The &lt;code&gt;produce&lt;/code&gt; function sends data asynchronously, without waiting for confirmation. It takes three inputs: the Kafka topic name, a key used to determine which partition the data is added to, and a value string with the log data. The &lt;code&gt;flush&lt;/code&gt; function is called at the end to wait until all outstanding messages are delivered before the process exits.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://github.com/Boyu1997/kafka-light-bulb/blob/master/producer.py" rel="noopener noreferrer"&gt;producer.py&lt;/a&gt; file implements the Morse-code-sending light bulb example. We can have the light bulb send Morse code by publishing on/off status logs to the Kafka cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

python3 producer.py --key="light-1" --string="XYZ"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
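&lt;p&gt;The actual encoding logic lives in producer.py in the repository; the following is only a sketch of how a string might become timed on/off events. The &lt;code&gt;MORSE&lt;/code&gt; table and the timing units here are illustrative, covering just the three demo letters:&lt;/p&gt;

```python
# Hypothetical subset of the Morse alphabet for the demo string "XYZ".
MORSE = {"X": "-..-", "Y": "-.--", "Z": "--.."}

def string_to_events(s):
    """Turn a string into (state, duration) events: a dot is 1 unit on,
    a dash 3 units on, with 1 unit off between symbols."""
    events = []
    for letter in s:
        for symbol in MORSE[letter]:
            on_time = 1 if symbol == "." else 3
            events.append(("on", on_time))
            events.append(("off", 1))
    return events

events = string_to_events("XYZ")
# 12 Morse symbols, each producing one "on" and one "off" event
```

&lt;p&gt;Each of these events would then be published to the &lt;em&gt;light_bulb&lt;/em&gt; topic with the bulb's id as the key.&lt;/p&gt;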

&lt;p&gt;With some data published, we can see information about it in the Kafdrop interface.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy6ia13086vy6imslm7xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy6ia13086vy6imslm7xn.png" alt="Kafka Interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Consumer
&lt;/h3&gt;

&lt;p&gt;Here is a simple example of consuming messages using the Python library.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from confluent_kafka import Consumer, KafkaError

c = Consumer({
    'bootstrap.servers': 'localhost:9091',
    'group.id': 'counting-group',
    'enable.auto.commit': True,
    'session.timeout.ms': 6000,
    'default.topic.config': {'auto.offset.reset': 'smallest'}
})

c.subscribe(['light_bulb'])
while True:
    msg = c.poll(0.1)
    if msg is None:
        continue
    elif not msg.error():
        print(msg.value())
    elif msg.error().code() == KafkaError._PARTITION_EOF:
        print("End of partition reached")
    else:
        print("Error")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;Consumer&lt;/code&gt; class also takes a configuration dictionary. We specify the name of the consumer group, enable auto-commit of the consumer offset, set a session timeout, and configure the offset to start at the smallest element. The consumer object subscribes to a list of topics, and the while loop continuously checks the subscribed topics for new log data.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://github.com/Boyu1997/kafka-light-bulb/blob/master/consumer.py" rel="noopener noreferrer"&gt;consumer.py&lt;/a&gt; file provides an implementation of counting the number of "dots" and "dashes" in the Morse code transmitted in the light bulb example.&lt;/p&gt;
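&lt;p&gt;The counting itself can be as simple as the sketch below. The real consumer.py in the repository reads the symbols from Kafka messages rather than from a Python list, which stands in for them here:&lt;/p&gt;

```python
def count_symbols(messages):
    """Tally dots and dashes from consumed Morse symbol messages,
    ignoring anything that is not a recognized symbol."""
    counts = {".": 0, "-": 0}
    for msg in messages:
        if msg in counts:
            counts[msg] += 1
    return counts

# A stand-in for messages polled from the light_bulb topic.
counts = count_symbols([".", "-", ".", ".", "-"])
```

&lt;p&gt;In the real consumer this function would be fed from the &lt;code&gt;c.poll&lt;/code&gt; loop shown above, one decoded message value at a time.&lt;/p&gt;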

&lt;h2&gt;
  
  
  Failure Handling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Node Failure
&lt;/h3&gt;

&lt;p&gt;Because Kafka is designed to be fault-tolerant, we can simulate node failure by stopping a container. This simulates a server being brought down by faults in software, hardware, or the network. To stop broker node number 3, run &lt;code&gt;docker stop kafka_kafka3_1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa79wteofhalpbl6x90na.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa79wteofhalpbl6x90na.png" alt="Kafka without broker node 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Refreshing the Kafdrop page, we should see that host kafka3 is missing and the partitions it served are now allocated between kafka1 and kafka2. In this case, each of the two remaining broker nodes is handling 2 partitions, summing to 4 partitions in the system, which indicates no data was lost with the failure of broker node kafka3. Clicking into the topic &lt;em&gt;light_bulb&lt;/em&gt;, under partitions we can see that each partition previously led by kafka3 now has another available node as its leader. Also, the loss of broker node 3 leaves all of the partitions under-replicated, as the configuration calls for three replicas but only two broker nodes are available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Node Failure
&lt;/h3&gt;

&lt;p&gt;We will also stop broker node number 2 by running &lt;code&gt;docker stop kafka_kafka2_1&lt;/code&gt;. Because we configured a replication factor of three, failing two nodes still does not result in data loss. We can see that the leader for all partitions has switched to the remaining broker node 1.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzsebufmgi8cuyfaxm9r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzsebufmgi8cuyfaxm9r6.png" alt="Kafka without broker node 2 and 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  All Nodes Failure
&lt;/h3&gt;

&lt;p&gt;To stop the last remaining node, run &lt;code&gt;docker stop kafka_kafka1_1&lt;/code&gt;. In this case we do risk data loss, as all replicas of the data are gone. However, this is an unlikely situation: a total outage is more often caused by network issues than by every broker node actually failing, so when the network recovers the data can likely be retrieved. Having no broker nodes does mean, however, that the Kafka service is unavailable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In general, Kafka is a highly configurable distributed system that fits the needs of modern agile applications. It serves as the data layer between the publishers and consumers of application data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Run Python MapReduce on local Docker Hadoop Cluster</title>
      <dc:creator>Boyu</dc:creator>
      <pubDate>Mon, 05 Oct 2020 21:54:59 +0000</pubDate>
      <link>https://dev.to/boyu1997/run-python-mapreduce-on-local-docker-hadoop-cluster-1g46</link>
      <guid>https://dev.to/boyu1997/run-python-mapreduce-on-local-docker-hadoop-cluster-1g46</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This post covers how to deploy a local Docker Hadoop cluster and run custom Python mapper and reducer functions using the classic word count example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;, get Docker &lt;a href="https://docs.docker.com/get-docker/"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt;, get Docker Compose &lt;a href="https://docs.docker.com/compose/install/"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Git&lt;/strong&gt;, get Git &lt;a href="https://git-scm.com/book/en/v2/Getting-Started-Installing-Git"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploy Hadoop Cluster using Docker
&lt;/h2&gt;

&lt;p&gt;We will use the Docker image from the &lt;a href="https://github.com/big-data-europe/docker-hadoop"&gt;big-data-europe repository&lt;/a&gt; to set up Hadoop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:big-data-europe/docker-hadoop.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the Docker image for Hadoop on your local machine, we can use docker-compose to configure the local Hadoop cluster. Replace the &lt;code&gt;docker-compose.yml&lt;/code&gt; file with the following file from &lt;a href="https://gist.github.com/nathan815/a938b3f7a4d06b2811cf2b1a917800e1"&gt;this GitHub Gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This docker-compose file configures a Hadoop cluster with a master node (namenode) and three worker nodes; it also configures the network ports to allow communication between the nodes. To start the cluster, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;docker ps&lt;/code&gt; to verify the containers are up; you should see a container list similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IMAGE                           PORTS                    NAMES
docker-hadoop_resourcemanager                            resourcemanager
docker-hadoop_nodemanager1      0.0.0.0:8042-&amp;gt;8042/tcp   nodemanager1
docker-hadoop_historyserver     0.0.0.0:8188-&amp;gt;8188/tcp   historyserver
docker-hadoop_datanode3         9864/tcp                 datanode3
docker-hadoop_datanode2         9864/tcp                 datanode2
docker-hadoop_datanode1         9864/tcp                 datanode1
docker-hadoop_namenode          0.0.0.0:9870-&amp;gt;9870/tcp   namenode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The current status of the local Hadoop cluster will be available at &lt;a href="http://localhost:9870/"&gt;localhost:9870&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Python MapReduce function
&lt;/h2&gt;

&lt;p&gt;For this simple MapReduce program, we will use the classical word count example. The program reads text files and counts how often each word occurs.&lt;/p&gt;

&lt;p&gt;The mapper function will read the text and emit the key-value pair, which in this case is &lt;code&gt;&amp;lt;word, 1&amp;gt;&lt;/code&gt;. Copy the following code into &lt;code&gt;mapper.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print ('%s\t%s' % (word, 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reducer function processes the result from the mapper and returns the word count. Copy the following code into &lt;code&gt;reducer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python
"""reducer.py"""

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print ('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print ('%s\t%s' % (current_word, current_count))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that because Hadoop is built in Java, a MapReduce job normally takes a Java JAR file as input. To execute Python in Hadoop, we need the &lt;a href="https://hadoop.apache.org/docs/r1.2.1/streaming.html"&gt;Hadoop Streaming library&lt;/a&gt; to pipe data through the Python executables. As a result, our Python scripts read their input from STDIN and write their output to STDOUT.&lt;/p&gt;
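&lt;p&gt;Before submitting to Hadoop, the streaming contract can be simulated in plain Python: map each line to key-value pairs, sort them (as Hadoop's shuffle phase does between map and reduce), then reduce. This sketch mirrors the mapper/reducer logic above in-process, which is a handy way to sanity-check the scripts:&lt;/p&gt;

```python
def map_lines(lines):
    # Mirrors mapper.py: emit (word, 1) for every word on every line.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_sorted(pairs):
    # Mirrors reducer.py: sum the counts per word. A dict is used here
    # instead of the run-length logic, which relies on sorted input.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["Hello World", "Hello Docker", "Hello Hadoop", "Hello MapReduce"]
pairs = sorted(map_lines(lines))  # stand-in for the shuffle/sort phase
word_counts = reduce_sorted(pairs)
```

&lt;p&gt;The equivalent shell check is piping the input files through mapper.py, &lt;code&gt;sort&lt;/code&gt;, and reducer.py.&lt;/p&gt;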

&lt;p&gt;Copy the local &lt;code&gt;mapper.py&lt;/code&gt; and &lt;code&gt;reducer.py&lt;/code&gt; to the namenode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker cp LOCAL_PATH/mapper.py namenode:mapper.py
docker cp LOCAL_PATH/reducer.py namenode:reducer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enter the namenode container of the Hadoop cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it namenode bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;ls&lt;/code&gt; and you should find &lt;code&gt;mapper.py&lt;/code&gt; and &lt;code&gt;reducer.py&lt;/code&gt; in the namenode container.&lt;/p&gt;

&lt;p&gt;Now let's prepare the input. For this simple example, we will use a set of text files each containing a short string. For a more realistic example, you can use an e-book from &lt;a href="http://www.gutenberg.org/"&gt;Project Gutenberg&lt;/a&gt;; download the &lt;code&gt;Plain Text UTF-8&lt;/code&gt; encoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir input
echo "Hello World" &amp;gt;input/f1.txt
echo "Hello Docker" &amp;gt;input/f2.txt
echo "Hello Hadoop" &amp;gt;input/f3.txt
echo "Hello MapReduce" &amp;gt;input/f4.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MapReduce program accesses files from the Hadoop Distributed File System (HDFS). Run the following to transfer the input directory and files to HDFS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hadoop fs -mkdir -p input
hdfs dfs -put ./input/* input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;find / -name 'hadoop-streaming*.jar'&lt;/code&gt; to locate the Hadoop Streaming library JAR file. The path should look something like &lt;code&gt;PATH/hadoop-streaming-3.2.1.jar&lt;/code&gt;.&lt;br&gt;
Finally, we can execute the MapReduce program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hadoop jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
-file mapper.py    -mapper mapper.py \
-file reducer.py   -reducer reducer.py \
-input input -output output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To safely shut down the cluster and remove containers, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;p&gt;Yen V. (2019). How to set up a Hadoop cluster in Docker.&lt;/p&gt;

&lt;p&gt;Noll M. Writing An Hadoop MapReduce Program In Python.&lt;br&gt;
Retrieved from: &lt;a href="https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>mapreduce</category>
    </item>
    <item>
      <title>Creating an interactive application for explaining machine learning project to non-cs major students</title>
      <dc:creator>Boyu</dc:creator>
      <pubDate>Tue, 26 May 2020 03:51:48 +0000</pubDate>
      <link>https://dev.to/boyu1997/creating-an-interactive-application-for-explaining-machine-learning-project-to-non-cs-major-students-1bf5</link>
      <guid>https://dev.to/boyu1997/creating-an-interactive-application-for-explaining-machine-learning-project-to-non-cs-major-students-1bf5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I am a CS + statistics major working on my graduation project, and in 2020 a graduation project would feel almost out of place if it were not somehow related to machine learning. So I ran an experiment with the DenseNet architecture, testing different levels of dense connections and how they influence network performance. (To learn more about DenseNet, read the &lt;a href="https://arxiv.org/abs/1608.06993"&gt;paper&lt;/a&gt; and this &lt;a href="https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803"&gt;article&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;However, my school also requires me to teach my project to classmates from different majors in a 60-minute session. I can't assume they know what a neural network is, and there is no time to give a crash course in machine learning before explaining my project. What better way to explain a technical project to a non-technical audience than visualization? This reminded me of the amazing &lt;a href="https://playground.tensorflow.org/"&gt;TensorFlow Playground&lt;/a&gt; I played with when first learning machine learning. So I started building a similar, though much simpler, playground application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="http://cnn-dense-connection.boyu.io/"&gt;http://cnn-dense-connection.boyu.io/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Link to Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Boyu1997/cnn-dense-connection"&gt;https://github.com/Boyu1997/cnn-dense-connection&lt;/a&gt;&lt;br&gt;
navigate to &lt;code&gt;playground&lt;/code&gt; for code specific to this application&lt;/p&gt;

&lt;h2&gt;
  
  
  How I built it
&lt;/h2&gt;

&lt;p&gt;The idea of this project is simple: give users the ability to add and remove dense connections in the network by clicking the connection edges, and show an instant update of the prediction result.&lt;/p&gt;

&lt;p&gt;To simplify the project, all of the models are trained locally using code similar to my dense connection experiment, and all the prediction results are saved into a JSON file. Besides being much smaller, the main change to the model is switching the final-layer activation to Softmax, so that each prediction is between 0% and 100% and the predictions sum nicely to one, rather than taking the maximum of the unnormalized outputs as the top class label. The web application then only needs to be a frontend application that handles click input and updates the page render. The main challenge of this project for me was learning and using D3.js quickly. Examining the code for TensorFlow Playground, I saw that their network configuration component is built on D3. I felt hesitant to start learning D3 for a side project with a tight deadline, but after going through some alternative network visualization libraries, it became clear that the higher-level libraries do not give sufficient control, so D3 was the most sensible option.&lt;/p&gt;

&lt;p&gt;A front-end application is mostly about managing state and rendering the page accordingly, so my initial thought was to use React, which I have experience with. However, I soon found that using D3 inside React is not the most intuitive way to code. As this is only a one-page application, I decided to use D3 to render all the interactive parts of the page into an SVG element and keep the rest as plain HTML. For hosting, I used Webpack to build the JavaScript bundle, placed all the static files in a GCP storage bucket, and set up website access.&lt;/p&gt;

&lt;p&gt;Another interesting thing I completely missed while making the first iteration of this project is how large the number of possible model configurations is. I originally went for a network with three dense blocks and six total dense layers, as in the illustration used in CondenseNet (see figure 2 in the &lt;a href="https://github.com/ShichenLiu/CondenseNet"&gt;repository&lt;/a&gt;). This model has 21 dense connections (the curved edges in the plot), giving &lt;code&gt;2^21 = 2097152&lt;/code&gt; possible configurations, since each dense connection can be set on or off. Given that I planned to pre-train the models locally, this was impossible to carry out. So I reduced the network to two dense blocks with four total layers, as in the application. This gives 10 dense connections and 1024 possible configurations, and training all 1024 models takes a few hours on a GPU.&lt;/p&gt;
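&lt;p&gt;The configuration count is just a power of two, doubling with every connection. A quick sketch of the arithmetic and of enumerating the on/off masks for the reduced network:&lt;/p&gt;

```python
from itertools import product

def num_configurations(n_connections):
    # Each dense connection can independently be on or off.
    return 2 ** n_connections

assert num_configurations(21) == 2097152  # original 3-block, 6-layer design
assert num_configurations(10) == 1024     # reduced 2-block, 4-layer design

# Enumerate all 1024 on/off masks for the 10 connections; each mask
# corresponds to one model to pre-train.
configs = list(product([0, 1], repeat=10))
```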

</description>
      <category>octograd2020</category>
    </item>
  </channel>
</rss>
