🐘📝 Basic commands to work with Hadoop in Notebooks
🔗Related content
You can find the related notebook in:
You can find the related video in:
You can find the related repo in:
🐱🏍GitHub
You can connect with me at:
Summary 🧾
I will install Hadoop and use a Python library to write a job that answers the question: how many rows exist for each rating?
1st - Install Hadoop 🐘
I use the following command, but you can change it to get the latest version:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
You can get another version if you need it from https://downloads.apache.org/hadoop/common/ and then replace it in the command above.
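If you prefer, here is a minimal sketch that stores the version in a variable so the download stays easy to update (assuming an IPython/Colab cell, where $var expands Python variables):
# Keep the version in one place; change it here to download another release
HADOOP_VERSION = "3.3.4"
!wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz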
2nd - Unzip and copy 🔓
I use the following command:
!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/
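You can verify the copy by listing the new folder:
!ls /usr/local/hadoop-3.3.4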
3rd - Set up Hadoop's Java ☕
I use the following command:
# Find the default Java path so we can append the export to hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
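To check that the export was appended, you can print the last line of hadoop-env.sh:
!tail -n 1 /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh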
4th - Set Hadoop home variables 🏡
I use the following command:
# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text
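Optionally, you can also add Hadoop's bin folder to the PATH so you can call hadoop without the full path (a minimal sketch):
# Optional: lets us call "hadoop" directly instead of using the full path
os.environ['PATH'] = os.environ['HADOOP_HOME'] + "/bin:" + os.environ['PATH']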
5th - Run Hadoop 🏃♂️
I use the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop
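Running the binary without arguments prints the usage help. You can also confirm the installed version with:
!/usr/local/hadoop-3.3.4/bin/hadoop version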
6th - Create a folder with HDFS 🌎📂
I use the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir ml-100k
7th - Remove folder with HDFS ♻
Later, you may need to remove it. To do that, apply the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r ml-100k
8th - Getting a dataset to analyze with Hadoop 💾
I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/
This time I use MovieLens, which you can download using:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
To use the data, extract the files. I extract them to the path given after -d in the command:
!unzip "/content/ml-100k.zip" -d "/content/ml-100k_folder"
You can copy them into the working directory using HDFS like this:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -copyFromLocal /content/ml-100k_folder/ml-100k/* ml-100k/
To list them:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ml-100k
9th - Installing dependencies to use Python 📚
We can install the dependency for MapReduce with:
!pip install mrjob
I recommend learning more about mrjob at: https://mrjob.readthedocs.io/en/latest/
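As a quick check that the install worked, you can print the library version (assuming mrjob exposes __version__, which current releases do):
import mrjob
print(mrjob.__version__)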
10th - Managing the temp folder 💿
You can create it anywhere you want.
We create a temp folder to use when we run the job:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///tmp
We assign permissions to the temp folder with:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -chmod 777 file:///tmp
We list the files in the temp folder:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls file:///tmp
11th - Creating the process to use with mrjob in Python 🐍
To create the job in Python, you must look at the structure of the dataset to configure it.
In this case the dataset looks like this:
!head -n 10 /content/ml-100k_folder/ml-100k/u.data
I can get the following information from the dataset (a quick check follows the list):
- The first column is the userID.
- The second column is the movieID.
- The third column is the rating.
- The fourth column is the timestamp.
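As the quick check, you can split one line in plain Python before writing the job (assuming the extraction path used above):
# Assumption: u.data is tab-separated, in the extraction path used earlier
with open('/content/ml-100k_folder/ml-100k/u.data') as f:
    userID, movieID, rating, timestamp = f.readline().strip().split('\t')
print(userID, movieID, rating, timestamp)
Now we can write the job: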
%%writefile RatingBreakdown.py
# import modules
from mrjob.job import MRJob
from mrjob.step import MRStep

# create a class that inherits from MRJob
class RatingBreakdown(MRJob):
    # define the steps: first the mapper, then the reducer
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_rating,
                   reducer=self.reducer_count_ratings)
        ]

    # mapper: split each tab-separated line into its fields
    def mapper_get_rating(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        # use the rating as the key and emit the value 1 for each row
        yield rating, 1

    # reducer: for each key, sum the values
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingBreakdown.run()
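To see what the job computes, here is a minimal pure-Python sketch of the same logic without Hadoop (assuming the tab-separated format described above):
from collections import Counter

counts = Counter()
with open('/content/ml-100k_folder/ml-100k/u.data') as f:
    for line in f:
        userID, movieID, rating, timestamp = line.strip().split('\t')
        counts[rating] += 1  # same as yielding (rating, 1) and summing in the reducer
print(counts)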
12th - Running the process 🙈
Here we run the process, specifying some parameters:
- The Python program file:
!python RatingBreakdown.py
- The path of the streaming .jar used to run Hadoop:
/usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar
- The temp folder and the dataset:
file:///tmp /content/ml-100k_folder/ml-100k/u.data
When you run the process, it may take a few minutes...
I run it with:
!python RatingBreakdown.py -r hadoop --hadoop-streaming-jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar --hadoop-tmp-dir file:///tmp /content/ml-100k_folder/ml-100k/u.data
13th - Listing results 🥂
I run the process again and send the results to results.txt:
!python RatingBreakdown.py /content/ml-100k_folder/ml-100k/u.data > results.txt
Use the following command to display the file:
!cat results.txt
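If you want to work with the results in Python, here is a minimal sketch that parses them into a dict (mrjob writes JSON-encoded keys, so the ratings come quoted):
# Parse lines like "3"\t<count> into {rating: count}
counts = {}
with open('results.txt') as f:
    for line in f:
        rating, count = line.split('\t')
        counts[rating.strip('"')] = int(count)
print(counts)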
14th - Say thanks, give a like, and share if this has been helpful or interesting 😁🖖