Lucas M. Ríos
How to install and work with Hadoop in notebooks?

🐘📝 Basic commands to work with Hadoop in Notebooks


🔗Related content

You can find the related notebook at:

📀Google Colab

You can find the related video at:

📺YouTube

You can find the related repo at:

🐱‍🏍GitHub

You can connect with me on:

🧬LinkedIn


Summary 🧾

I will install Hadoop and use a Python library to write a job that answers the question: how many rows exist for each rating?


1st - Install Hadoop 🐘

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

You can get a different version if you need it from https://downloads.apache.org/hadoop/common/ and then replace it in the command above.


2nd - Unzip and copy 🔓

I use the following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/

3rd - Set up Hadoop's Java ☕

I use the following command:

# Find the default Java path and add an export line to hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
!echo export JAVA_HOME=$java_home_text >> /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
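The shell pipeline above only strips the trailing `bin/java` from the resolved path of the Java binary. The same transform can be sketched in pure Python (the JVM path below is just a typical Debian/Ubuntu example, not necessarily what your system resolves to):

```python
# Mimic `readlink -f /usr/bin/java | sed "s:bin/java::"` in pure Python.
# The resolved path below is an example; yours may differ.
resolved = "/usr/lib/jvm/java-11-openjdk-amd64/bin/java"

# Drop the trailing "bin/java" to obtain the JAVA_HOME directory.
suffix = "bin/java"
java_home = resolved[: -len(suffix)] if resolved.endswith(suffix) else resolved
print(java_home)  # /usr/lib/jvm/java-11-openjdk-amd64/
```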

4th - Set Hadoop home variables 🏡

I use the following command:

# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text
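Optionally, you can also prepend Hadoop's bin directory to PATH so you can call hadoop without typing the full path each time. This is a small optional sketch, assuming the install location from step 2:

```python
import os

# Assumes Hadoop was copied to /usr/local/hadoop-3.3.4 in step 2.
os.environ['HADOOP_HOME'] = "/usr/local/hadoop-3.3.4"

# Prepend Hadoop's bin directory so `hadoop` resolves without a full path.
os.environ['PATH'] = (
    os.environ['HADOOP_HOME'] + "/bin" + os.pathsep + os.environ.get('PATH', '')
)
print(os.environ['PATH'].split(os.pathsep)[0])  # /usr/local/hadoop-3.3.4/bin
```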

5th - Run Hadoop 🏃‍♂️

I use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop


6th - Create a folder with HDFS 🌎📂

I use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir ml-100k


7th - Remove folder with HDFS ♻

Maybe later you will need to remove it. To do that, apply the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r ml-100k


8th - Getting a dataset to analyze with Hadoop 💾

I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. I extract them into the path given after -d in the command:

!unzip "/content/ml-100k.zip" -d "/content/ml-100k_folder"

You can copy the files into the ml-100k HDFS folder using:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -copyFromLocal /content/ml-100k_folder/ml-100k/* ml-100k/

To list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ml-100k


9th - Installing dependencies to use Python 📚

We can install the dependency to use MapReduce from Python using:

!pip install mrjob

I recommend learning more about mrjob at: https://mrjob.readthedocs.io/en/latest/
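Before writing the real job, it may help to see the MapReduce model in plain Python: map each record to a (key, 1) pair, group the pairs by key, then reduce each group with a sum. mrjob automates exactly this shuffle-and-reduce plumbing on top of Hadoop. The records below are invented sample data for illustration:

```python
from itertools import groupby

# Toy records in (userID, movieID, rating, timestamp) form — invented samples.
records = [
    ("196", "242", "3", "881250949"),
    ("186", "302", "3", "891717742"),
    ("22",  "377", "1", "878887116"),
]

# Map: emit a (rating, 1) pair for every record.
mapped = [(rating, 1) for (_, _, rating, _) in records]

# Shuffle: group pairs by key (groupby needs the pairs sorted first).
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts inside each key's group.
counts = {
    key: sum(v for _, v in group)
    for key, group in groupby(mapped, key=lambda kv: kv[0])
}
print(counts)  # {'1': 1, '3': 2}
```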


10th - Managing the temp folder 💿

You can create it anywhere you want.

We create a temp folder to use when we run the job:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///tmp

We assign permissions to the temp folder with:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -chmod 777 file:///tmp

We list the files in the temp folder:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls file:///tmp


11th - Creating the process to use with mrjob in Python 🐍

To create the job in Python, you must look at the structure of the dataset to configure it.
In this case the dataset looks like:

!head -n 10 /content/ml-100k_folder/ml-100k/u.data

I can get the following information from the dataset:

  • The first column is the userID.
  • The second column is the movieID.
  • The third column is the rating.
  • The fourth column is the timestamp.
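Since the file is tab-separated, each line splits cleanly into those four fields. A quick sketch using an invented sample line, doing the same split the job's mapper will do:

```python
# A sample line in u.data's tab-separated format; the values are invented.
line = "196\t242\t3\t881250949"

# Unpack the four tab-separated fields, as the mapper will do.
userID, movieID, rating, timestamp = line.split('\t')
print(rating)  # 3
```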
%%writefile RatingBreakdown.py
# import modules
from mrjob.job import MRJob
from mrjob.step import MRStep

# create a class inherited from MRJob
class RatingBreakdown(MRJob):
  # define the steps: first the mapper, then the reducer
  def steps(self):
    return [
            MRStep(mapper=self.mapper_get_rating,
                   reducer=self.reducer_count_ratings)
    ]

  # creating the mapper: unpack the fields of each dataset row
  def mapper_get_rating(self, _, line):
    (userID, movieID, rating, timestamp) = line.split('\t')
    # assign like the key rating and assign each row value 1
    yield rating, 1

  # creating the reducer: sum the counts for each rating
  def reducer_count_ratings(self, key, values):
    # in function of each key we sum values
    yield key, sum(values)

if __name__ == '__main__':
  RatingBreakdown.run()
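To see what the job computes without involving Hadoop at all, the same count can be sketched with collections.Counter (again over invented sample lines, not the real dataset):

```python
from collections import Counter

# Invented sample lines in u.data's tab-separated format.
lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]

# Count how many rows exist for each rating — the question the MapReduce job answers.
rating_counts = Counter(line.split('\t')[2] for line in lines)
print(dict(rating_counts))  # {'3': 2, '1': 1}
```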

12th - Running the process 🙈

Here we run the process, specifying some parameters:

  • The Python program: !python RatingBreakdown.py
  • The path to Hadoop's streaming .jar: /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar
  • The temp folder and the dataset: file:///tmp /content/ml-100k_folder/ml-100k/u.data

Running the process may take a few minutes...
I run it with:

!python RatingBreakdown.py -r hadoop --hadoop-streaming-jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar --hadoop-tmp-dir file:///tmp /content/ml-100k_folder/ml-100k/u.data


13th - Listing results 🥂

I run the process again and write the results to results.txt:

!python RatingBreakdown.py /content/ml-100k_folder/ml-100k/u.data > results.txt

Use the following command to show the file:

!cat results.txt
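Each output line is a tab-separated key/value pair, which mrjob JSON-encodes by default. They can be loaded back into Python like this (the sample lines below are invented for illustration):

```python
import json

# Invented sample lines in mrjob's default output format: JSON key <TAB> JSON value.
output_lines = [
    '"1"\t6110',
    '"5"\t21201',
]

# Decode each line back into a (rating, count) pair.
results = {}
for line in output_lines:
    key, value = line.split('\t')
    results[json.loads(key)] = json.loads(value)

print(results)  # {'1': 6110, '5': 21201}
```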


14th - Say thanks, give a like, and share if this has been helpful or interesting 😁🖖

