Lucas M. Ríos


How to install and work with Hadoop in notebooks?

๐Ÿ˜๐Ÿ“ Basic commands to work with Hadoop in Notebooks


🔗 Related content

You can find the related notebook at:

📀 Google Colab

You can find the related video at:

📺 YouTube

You can find the related repo at:

🐱 GitHub

You can connect with me on:

🧬 LinkedIn


Summary 🧾

I will install Hadoop and use a Python library to write a job that answers the question: how many rows exist for each rating?


1st - Install Hadoop 🐘

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

You can get another version if you need it from https://downloads.apache.org/hadoop/common/ and substitute it in the command above.
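For example, to grab a different release (the version placeholder below is mine; confirm the exact version exists in that listing first):

!wget https://downloads.apache.org/hadoop/common/hadoop-<version>/hadoop-<version>.tar.gz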


2nd - Unzip and copy 🔓

I use the following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/
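To confirm the extract and copy worked, you can list the installation directory (an optional sanity check):

!ls /usr/local/hadoop-3.3.4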

3rd - Set up Hadoop's Java ☕

I use the following commands:

# Find the default Java path and append an export line to hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
!echo export JAVA_HOME=$java_home_text >> /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
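If you want to confirm the export line was appended (optional), print the last line of the file:

!tail -n 1 /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh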

4th - Set Hadoop home variables 🏡

I use the following commands:

# Set environment variables
import os
os.environ['HADOOP_HOME'] = "/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME'] = java_home_text
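A quick optional check that both variables are visible to the notebook:

print(os.environ['HADOOP_HOME'])
print(os.environ['JAVA_HOME'])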

5th - Run Hadoop 🏃‍♂️

I use the following command, which prints Hadoop's usage help:

!/usr/local/hadoop-3.3.4/bin/hadoop
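To confirm the installation, you can also print the version:

!/usr/local/hadoop-3.3.4/bin/hadoop version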


6th - Create a folder with HDFS 🌎📂

I use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir ml-100k


7th - Remove a folder with HDFS ♻

You may need to remove it later. To do that, run the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r ml-100k


8th - Getting a dataset to analyze with Hadoop 💾

I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, which you can download using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. I extract them to the path given after -d in the command:

!unzip "/content/ml-100k.zip" -d "/content/ml-100k_folder"

You can copy them into the HDFS folder we created earlier like this:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -copyFromLocal /content/ml-100k_folder/ml-100k/* ml-100k/

To list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ml-100k


9th - Installing dependencies to use Python 📚

We can install the library we will use to write MapReduce jobs:

!pip install mrjob

I recommend learning more about mrjob at: https://mrjob.readthedocs.io/en/latest/
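As a quick sanity check that mrjob works (this tiny word-count job is my own example, separate from the dataset workflow), you can run a job with the default local runner:

%%writefile WordCount.py
# Tiny mrjob sanity check: count occurrences of each word.
from mrjob.job import MRJob

class WordCount(MRJob):
  def mapper(self, _, line):
    # emit each word in the line with a count of 1
    for word in line.split():
      yield word.lower(), 1

  def reducer(self, word, counts):
    # add up the 1s emitted for each word
    yield word, sum(counts)

if __name__ == '__main__':
  WordCount.run()

Then run it against any small text file:

!echo "hadoop counts words and hadoop counts rows" > sample.txt
!python WordCount.py sample.txt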


10th - Managing the temp folder 💿

You can create it anywhere you want.

We create a temp folder to use when we run the job:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir -p file:///tmp

We assign permissions to the temp folder (here wide open; adjust the mode to taste) with:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -chmod 777 file:///tmp

We list the files in the temp folder:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls file:///tmp


11th - Creating the job with mrjob in Python 🐍

To create the job in Python, you must look at the structure of the dataset to configure the job.
In this case we can peek at the first rows with:

!head -n 10 /content/ml-100k_folder/ml-100k/u.data

I can get the following information about the dataset:

  • The first column is the userID.
  • The second column is the movieID.
  • The third column is the rating.
  • The fourth column is the timestamp.

With that structure in mind, I write the job to a file:
%%writefile RatingBreakdown.py
# import modules
from mrjob.job import MRJob
from mrjob.step import MRStep

# create a class inherited from MRJob
class RatingBreakdown(MRJob):
  # define the steps: first the mapper, then the reducer
  def steps(self):
    return [
            MRStep(mapper=self.mapper_get_rating,
                   reducer=self.reducer_count_ratings)
    ]

  # mapper: split each tab-separated line into its fields
  def mapper_get_rating(self, _, line):
    (userID, movieID, rating, timestamp) = line.split('\t')
    # use the rating as the key and give each row a value of 1
    yield rating, 1

  # reducer: sum the 1s emitted for each rating
  def reducer_count_ratings(self, key, values):
    yield key, sum(values)

if __name__ == '__main__':
  RatingBreakdown.run()

12th - Running the process 🙈

Here we run the process, specifying some parameters:

  • The Python program: python RatingBreakdown.py
  • The path of the Hadoop streaming .jar: /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar
  • The temp dir and the dataset: file:///tmp /content/ml-100k_folder/ml-100k/u.data

When you run the process, it may take a few minutes...
I run it with:

!python RatingBreakdown.py -r hadoop --hadoop-streaming-jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar --hadoop-tmp-dir file:///tmp /content/ml-100k_folder/ml-100k/u.data


13th - Listing results 🥂

I run the process again (this time with mrjob's default local runner) and send the results to results.txt:

!python RatingBreakdown.py /content/ml-100k_folder/ml-100k/u.data > results.txt

Use the following command to display the file:

!cat results.txt
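Each output line is in mrjob's default format: a JSON-encoded key, a tab, and a JSON-encoded value. Here is a small sketch to load the results back into Python, assuming that format:

import json

# parse mrjob's tab-separated, JSON-encoded output lines
counts = {}
with open('results.txt') as f:
  for line in f:
    rating, count = line.strip().split('\t')
    counts[json.loads(rating)] = json.loads(count)

for rating in sorted(counts):
  print(rating, counts[rating])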


14th - Say thanks, give a like, and share if this has been helpful or interesting 😁🖖



