🐷📝 Basic commands to work with Pig in Notebooks
🔗 Related content
You can find the related post in:
You can find the related repo in:
🐱🏍 GitHub
You can connect with me on:
Resume 🧾
I will install Hadoop with the Pig program and use a Python library to write a job that answers the question: how many rows exist for each rating?
First I install Hadoop using the same commands I have used before, but without numbering the steps.
Install Hadoop 🐘
I use the following command, but you can change it to get the current latest version:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
You can get another version if you need it at https://downloads.apache.org/hadoop/common/ and replace it in the previous command.
Unzip and copy 🔓
I use the following command:
!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/
Set up Hadoop's Java ☕
I use the following command:
# To find the default Java path and append an export line to hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
!echo export JAVA_HOME=$java_home_text >> /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
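A minimal sketch of what the `readlink | sed` pipeline does, using a hypothetical Java path (the real path comes from `readlink -f /usr/bin/java`):

```python
import re

# Hypothetical resolved java binary path; on Colab, readlink -f returns something like this
path = "/usr/lib/jvm/java-11-openjdk-amd64/bin/java"

# The sed expression s:bin/java:: strips the trailing "bin/java", leaving the JDK home
print(re.sub(r"bin/java", "", path))  # /usr/lib/jvm/java-11-openjdk-amd64/
```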
Set Hadoop home variables 🏡
I use the following command:
# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text
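A minimal sketch of why this works: values set through `os.environ` are inherited by any subprocess the notebook launches, including the `hadoop` and `pig` commands (the path below is assumed from the install step above):

```python
import os
import subprocess

# Path assumed from the Hadoop install step above
os.environ['HADOOP_HOME'] = "/usr/local/hadoop-3.3.4"

# Child processes inherit the notebook's environment
out = subprocess.run(["sh", "-c", "echo $HADOOP_HOME"],
                     capture_output=True, text=True)
print(out.stdout.strip())  # /usr/local/hadoop-3.3.4
```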
1st - Install Pig 🐷
I use the following command, but you can change it to get the current latest version:
!wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
You can get another version if you need it at https://downloads.apache.org/pig/ and replace it in the previous command.
2nd - Unzip and copy 🔓
I use the following command:
!tar -xzvf pig-0.17.0.tar.gz
3rd - Set Pig home variables 🏡
I use the following command:
# Set environment variables
import os
os.environ['PIG_HOME']="/content/pig-0.17.0"
os.environ['PIG_CLASSPATH']="/usr/local/hadoop-3.3.4/etc/hadoop"
os.environ["PATH"] += os.pathsep + "/content/pig-0.17.0/bin"
We can validate the installation with the command:
!pig -version
4th - Create a folder with HDFS 🌎📂
I use following command:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///content/data_pig
4.1 - Remove folder with HDFS ♻
If you later need to remove it, apply the following command:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r file:///content/data_pig
5th - Getting a dataset to analyze with Pig 💾
I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/
This time I use MovieLens, and you can download it using:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
To use the data, extract the files. I extract them into the path given after -d in the command:
!unzip "/content/ml-100k.zip" -d "/content/data_pig"
To list them:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /content/data_pig/ml-100k
6th - Creating process to use Pig with Pig Syntax 🐖
To create a job in Pig, you must first look at the structure of the dataset to configure the job.
In this case we print the dataset with the following command:
!head /content/data_pig/ml-100k/u.data
I can get the following information from the dataset:
- The first column is the userID.
- The second column is the movieID.
- The third column is the rating.
- The fourth column is the timestamp.
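The opening question, how many rows exist for each rating, can be checked in plain Python before writing the Pig job. A minimal sketch, assuming u.data is tab-separated as shown above (the sample rows are illustrative):

```python
from collections import Counter

# u.data lines are tab-separated: userID, movieID, rating, timestamp
def count_rows_per_rating(lines):
    """Return a Counter mapping rating -> number of rows with that rating."""
    return Counter(int(line.split('\t')[2]) for line in lines if line.strip())

# Sample rows in the same format as u.data
sample = ["196\t242\t3\t881250949",
          "186\t302\t3\t891717742",
          "22\t377\t1\t878887116"]
print(count_rows_per_rating(sample))  # Counter({3: 2, 1: 1})
```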
# Create pig script
%%writefile id.pig
/* id.pig */
student = LOAD 'file:///content/data_pig/ml-100k/u.data' USING PigStorage('\t')
as (userId:int, movieId:int, rating:int, timestamp:int);
student_order = ORDER student BY rating DESC;
DUMP student_order;
7th - Running the process 🙈
Here we run the process, specifying some parameters:
- The Pig program file is id.pig
- The dataset is in file:///content/data_pig/ml-100k/u.data
When you run the process, it may take a few minutes...
You can run script with:
!pig -x local id.pig
But here we run the script and save the results to a .txt file:
!pig -x local id.pig > results.txt
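A minimal Python sketch for peeking at the saved results without opening the whole file (the `head` helper is a hypothetical utility, not part of Pig):

```python
from itertools import islice

def head(path, n=5):
    """Return up to the first n lines of a text file, newline-stripped."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

# After the Pig run you could call: head("results.txt")
```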
8th - Advancing in the logic of the scripts 😎
Now we will advance the logic of the scripts to answer the following questions:
- What are the oldest 5-star movies?
- What are the worst movies?
8.1 - Find the oldest 5-star movies ⭐
%%writefile fiveStarMovies.pig
ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdblink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
ratingsByMovie = GROUP ratings BY movieID;
avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) as avgRating;
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;
DUMP oldestFiveStarMovies;
Run the script and save the results to a .txt file:
!pig -x local fiveStarMovies.pig > fiveStarMovies.txt
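The pipeline above (average rating per movie, filter above 4.0, join with titles and release dates, order oldest first) can be sketched in plain Python to check the logic; the sample data below is made up:

```python
from collections import defaultdict

# Made-up sample data: ratings are (movieID, rating); metadata maps movieID -> (title, releaseYear)
ratings = [(1, 5), (1, 5), (2, 3), (3, 5), (3, 4)]
metadata = {1: ("Movie A", 1950), 2: ("Movie B", 1990), 3: ("Movie C", 1940)}

# GROUP ratings BY movieID, then AVG(ratings.rating)
by_movie = defaultdict(list)
for movie_id, rating in ratings:
    by_movie[movie_id].append(rating)
avg = {m: sum(rs) / len(rs) for m, rs in by_movie.items()}

# FILTER avgRating > 4.0, JOIN with metadata, ORDER BY release year (oldest first)
five_star = [(metadata[m][0], metadata[m][1], a) for m, a in avg.items() if a > 4.0]
five_star.sort(key=lambda row: row[1])
print(five_star)  # [('Movie C', 1940, 4.5), ('Movie A', 1950, 5.0)]
```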
8.2 - Find the most-rated bad movies ⭐
%%writefile BadPopularMovies.pig
ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdblink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle;
groupedRating = GROUP ratings by movieID;
avgRatings = FOREACH groupedRating GENERATE group as movieID, AVG(ratings.rating) as avgRating, COUNT(ratings.rating) AS numRatings;
badMovies = FILTER avgRatings BY avgRating < 2.0;
namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;
results = FOREACH namedBadMovies GENERATE nameLookup::movieTitle as movieName,
badMovies::avgRating as avgRating, badMovies::numRatings as numRatings;
finalResults = ORDER results BY numRatings DESC;
DUMP finalResults;
Run the script and save the results to a .txt file:
!pig -x local BadPopularMovies.pig > BadPopularMovies.txt
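This script's logic (average and count per movie, filter below 2.0, attach titles, order by number of ratings descending) can also be sketched in plain Python on made-up sample data:

```python
from collections import defaultdict

# Made-up sample data: ratings are (movieID, rating); titles keyed by movieID
ratings = [(1, 1), (1, 2), (1, 1), (2, 1), (3, 5)]
titles = {1: "Movie A", 2: "Movie B", 3: "Movie C"}

# GROUP ratings BY movieID
by_movie = defaultdict(list)
for movie_id, rating in ratings:
    by_movie[movie_id].append(rating)

# FILTER avg < 2.0, keep COUNT, attach the title, ORDER BY numRatings DESC
bad = [(titles[m], sum(rs) / len(rs), len(rs))
       for m, rs in by_movie.items() if sum(rs) / len(rs) < 2.0]
bad.sort(key=lambda row: row[2], reverse=True)
print(bad)  # [('Movie A', 1.3333333333333333, 3), ('Movie B', 1.0, 1)]
```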