Zishuo Ding

Posted on Mar 25, 2022

Create a Hadoop playground with Docker Desktop on Windows in minutes

#docker #java #bigdata #codenewbie

This semester, I've chosen to take a course about parallel computing. One of the projects involves writing a MapReduce program on Hadoop in Java. Connecting to the school's computing resources might be difficult at times, especially when a due date is approaching. As a result, I looked for an easy way to set up a local Hadoop environment with Docker on my Windows laptop, so that I could conduct some experiments quickly.

Preparations

Docker

Docker saves us from having to go through complicated installation procedures for certain software (including Hadoop in this post), and it also allows us to remove that software cleanly if we need to free up some disk space.

The first step is to download and install Docker Desktop for Windows (or Docker Desktop for Mac if you are on macOS). Docker Desktop now supports WSL 2 (Windows Subsystem for Linux 2) as its backend instead of Hyper-V. If you do not have WSL on your Windows machine, you can follow this official guide to enable it.

Once Docker Desktop is installed and running, confirm that both Docker and Docker Compose are installed correctly by checking their versions in a terminal (PowerShell or a WSL shell).

$ docker --version
Docker version 20.10.13, build a224086
$ docker-compose --version
Docker Compose version v2.3.3

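You can also verify that Docker works by launching a sample container. The command below starts an nginx server in the background and publishes it on port 80, so http://localhost should show the nginx welcome page.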

$ docker run -d -p 80:80 --name myserver nginx

VS Code

I'm sure every developer already has VS Code installed, so you only need to make sure the Remote Development extension is installed as well. It lets you develop inside a container, on a remote machine, or in WSL.

Let’s go Hadoop

Setup

As you can see from the steps below, setting up a Hadoop environment with Docker is rather simple.

Clone the repo big-data-europe/docker-hadoop into a directory of your choice, then set up the Hadoop cluster via docker-compose.

git clone https://github.com/big-data-europe/docker-hadoop.git
cd docker-hadoop
docker-compose up -d

Now, you are all set :). The containers need a few moments to start; after that, you can check that the cluster is working properly by visiting the NameNode web UI at http://localhost:9870/.
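If you would rather not keep refreshing the browser, the "after a few moments" wait can be scripted. The following is only a minimal sketch: the class name NameNodeUiCheck, the retry count, and the timeouts are my own assumptions and not part of the docker-hadoop repo. It polls http://localhost:9870/ using the JDK's built-in HttpURLConnection and treats any 2xx or 3xx response as "up".

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper (not part of docker-hadoop): polls the NameNode web UI.
class NameNodeUiCheck {

    // Returns true if the URL answers with a 2xx/3xx status within the timeout.
    // Any I/O problem (connection refused, timeout, malformed URL) counts as "not up".
    static boolean isUp(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status >= 200 && status < 400;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String url = "http://localhost:9870/"; // port taken from the article
        // Assumed retry policy: up to 10 attempts, one second apart.
        for (int attempt = 1; attempt <= 10; attempt++) {
            if (isUp(url, 1000)) {
                System.out.println("NameNode web UI is reachable at " + url);
                return;
            }
            Thread.sleep(1000);
        }
        System.out.println("NameNode web UI did not respond; check the containers with docker-compose ps.");
    }
}
```

Assuming a JDK 11 or later on the host, saving this as NameNodeUiCheck.java and running `java NameNodeUiCheck.java` should be enough to try it.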

Get on the train

It's finally time to meet your new Hadoop cluster. Start VS Code, then go to the left panel and select the Remote Development plugin. Select "Containers" from the dropdown at the top, then locate the container named "namenode" and connect to it by clicking the "Attach to the container" icon. You've arrived in the Hadoop world!

Hello "WordCount"

To test the Hadoop cluster, we will run the Word Count example (from its Java source code) and see how it works.

But not so fast: we first need to create some sample input data.

Open a terminal in VS Code. Then run:

mkdir input
echo "Hello World" > input/f1.txt
echo "Hello Docker" > input/f2.txt

The input files we have created are stored on the local filesystem (more precisely, the local filesystem of the Docker container), so we also need to copy them into HDFS.

hadoop fs -mkdir -p input
hdfs dfs -put ./input/* input

After the preparation, you can get the official WordCount example from this link.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Save it with the filename WordCount.java.

Now, let's find out if it works.

export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wordcount.jar WordCount*.class
hadoop jar wordcount.jar WordCount input output

The job prints a lot of logs, but what we care about is the output. Print it with the cat command.

$ hdfs dfs -cat output/part-r-0000*
Docker 1
Hello 2
World 1

Hooray! We did it!
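To see why the output looks like this, it can help to replay the same data flow without Hadoop. The sketch below is only an illustration under my own assumptions (the class WordCountSketch and its helper methods are hypothetical names, not part of the official example). It mimics TokenizerMapper by emitting a (word, 1) pair for every whitespace-separated token, and mimics IntSumReducer by summing the values per word. A TreeMap stands in for the sorted-by-key order of the reducer output.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the WordCount data flow; it does not use Hadoop.
class WordCountSketch {

    // Map step: like TokenizerMapper, emit (word, 1) for each token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: like IntSumReducer, sum all values that share a key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // keeps keys sorted
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs together.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // Same contents as input/f1.txt and input/f2.txt from the article.
        Map<String, Integer> counts = count(Arrays.asList("Hello World", "Hello Docker"));
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Compiled and run with any JDK 8 or later, this prints the same three word/count pairs as the Hadoop job. Because summing gives the same result no matter how the pairs are grouped, the real job can reuse IntSumReducer as its combiner to pre-aggregate counts on the map side.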

Clean-up

Cleaning up is simple. The following command stops and removes the cluster's containers, which will make your computer's life easier.

docker-compose down

Acknowledgement

This post draws heavily on this article by José Lise. I've added some content on how to develop with VS Code.
