🐘📝 Basic commands to work with Hadoop in Notebooks
🔗Related content
You can find the related notebook in:
You can find the related video in:
You can find the related repo in:
🐱🏍GitHub
You can connect with me at:
Summary 🧾
I will install Hadoop and use a Python library to write a job that answers the question: how many rows exist for each rating?
1st - Install Hadoop 🐘
I use the following command, but you can change it to get the latest version:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
You can get another version if you need it from https://downloads.apache.org/hadoop/common/ and then replace it in the command above.
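If you prefer, here is a minimal sketch that stores the version in a variable so the download stays easy to update (assuming an IPython/Colab cell, where $var expands Python variables):
# Keep the version in one place; change it here to download another release
HADOOP_VERSION = "3.3.4"
!wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz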
2nd - Unzip and copy 🔓
I use the following command:
!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/
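You can verify the copy by listing the new folder:
!ls /usr/local/hadoop-3.3.4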
3rd - Set up Hadoop's Java ☕
I use the following command:
# Find the default Java path so we can append the export to hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
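To check that the export was appended, you can print the last line of hadoop-env.sh:
!tail -n 1 /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh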
4th - Set Hadoop home variables 🏡
I use the following command:
# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text
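Optionally, you can also add Hadoop's bin folder to the PATH so you can call hadoop without the full path (a minimal sketch):
# Optional: lets us call "hadoop" directly instead of using the full path
os.environ['PATH'] = os.environ['HADOOP_HOME'] + "/bin:" + os.environ['PATH']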
5th - Run Hadoop 🏃♂️
I use the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop
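Running the binary without arguments prints the usage help. You can also confirm the installed version with:
!/usr/local/hadoop-3.3.4/bin/hadoop version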
6th - Create a folder with HDFS 🌎📂
I use the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir ml-100k
7th - Remove folder with HDFS ♻
Later, you may need to remove it. To do that, apply the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r ml-100k
8th - Getting a dataset to analyze with Hadoop 💾
I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/
This time I use MovieLens, which you can download using:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
To use the data, extract the files. I extract them to the path given after -d in the command:
!unzip "/content/ml-100k.zip" -d "/content/ml-100k_folder"
You can copy them into the working directory using HDFS like this:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -copyFromLocal /content/ml-100k_folder/ml-100k/* ml-100k/
To list them:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ml-100k
9th - Installing dependencies to use Python 📚
We can install the dependency for MapReduce with:
!pip install mrjob
I recommend learning more about mrjob at: https://mrjob.readthedocs.io/en/latest/
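As a quick check that the install worked, you can print the library version (assuming mrjob exposes __version__, which current releases do):
import mrjob
print(mrjob.__version__)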
10th - Managing the temp folder 💿
You can create it anywhere you want.
We create a temp folder to use when we run the job:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///tmp
We assign permissions to the temp folder with:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -chmod 777 file:///tmp
We list the files in the temp folder:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls file:///tmp
11th - Creating the process to use with mrjob in Python 🐍
To create the job in Python, you must look at the structure of the dataset to configure it.
In this case the dataset looks like this:
!head -n 10 /content/ml-100k_folder/ml-100k/u.data
I can get the following information from the dataset (a quick check follows the list):
- The first column is the userID.
- The second column is the movieID.
- The third column is the rating.
- The fourth column is the timestamp.
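As the quick check, you can split one line in plain Python before writing the job (assuming the extraction path used above):
# Assumption: u.data is tab-separated, in the extraction path used earlier
with open('/content/ml-100k_folder/ml-100k/u.data') as f:
    userID, movieID, rating, timestamp = f.readline().strip().split('\t')
print(userID, movieID, rating, timestamp)
Now we can write the job: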
%%writefile RatingBreakdown.py
# import modules
from mrjob.job import MRJob
from mrjob.step import MRStep

# create a class that inherits from MRJob
class RatingBreakdown(MRJob):
    # define the steps: first the mapper, then the reducer
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_rating,
                   reducer=self.reducer_count_ratings)
        ]

    # mapper: split each tab-separated line into its fields
    def mapper_get_rating(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        # use the rating as the key and emit the value 1 for each row
        yield rating, 1

    # reducer: for each key, sum the values
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingBreakdown.run()
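To see what the job computes, here is a minimal pure-Python sketch of the same logic without Hadoop (assuming the tab-separated format described above):
from collections import Counter

counts = Counter()
with open('/content/ml-100k_folder/ml-100k/u.data') as f:
    for line in f:
        userID, movieID, rating, timestamp = line.strip().split('\t')
        counts[rating] += 1  # same as yielding (rating, 1) and summing in the reducer
print(counts)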
12th - Running the process 🙈
Here we run the process, specifying some parameters:
- The Python program file:
!python RatingBreakdown.py
- The path of the streaming .jar used to run Hadoop:
/usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar
- The temp folder and the dataset:
file:///tmp /content/ml-100k_folder/ml-100k/u.data
When you run the process, it may take a few minutes...
I run it with:
!python RatingBreakdown.py -r hadoop --hadoop-streaming-jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar --hadoop-tmp-dir file:///tmp /content/ml-100k_folder/ml-100k/u.data
13th - Listing results 🥂
I run the process again and send the results to results.txt:
!python RatingBreakdown.py /content/ml-100k_folder/ml-100k/u.data > results.txt
Use the following command to display the file:
!cat results.txt
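If you want to work with the results in Python, here is a minimal sketch that parses them into a dict (mrjob writes JSON-encoded keys, so the ratings come quoted):
# Parse lines like "3"\t<count> into {rating: count}
counts = {}
with open('results.txt') as f:
    for line in f:
        rating, count = line.split('\t')
        counts[rating.strip('"')] = int(count)
print(counts)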
14th - Say thanks, give a like, and share if this has been helpful or interesting 😁🖖