DEV Community

Cover image for How working/install Pig with Notebooks?
Lucas M. Ríos
Lucas M. Ríos

Posted on

3

How working/install Pig with Notebooks?

🐷📝 Basic commands to work with Pig in Notebooks


🔗Related content

You can find post related in:

📀Google Colab

You can find repo related in:

🐱‍🏍GitHub

You can connect with me in:

🧬LinkedIn


Resume 🧾

I will install Hadoop with Pig program and will use a library of Python to write a job that answer the question, how many row exists by each rating?

First I install Hadoop using same commands that I have used before but without put a number of step.


Install Hadoop 🐘

I use following command but you can change to get current last version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

You would can get other version if you need in: https://downloads.apache.org/hadoop/common/ and later replace it in the before command.


Unzip and copy 🔓

I use following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/


Set up Hadoop's Java ☕

I use following command:

#To find the default Java path and add export in hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
java_home_text_command = f"$ {JAVA_HOME[0]} "
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
Enter fullscreen mode Exit fullscreen mode

Set Hadoop home variables 🏡

I use following command:

# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text
Enter fullscreen mode Exit fullscreen mode

1st - Install Pig 🐷

I use following command but you can change to get current last version:

!wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

You would can get other version if you need in: https://downloads.apache.org/pig/ and later replace it in the before command.


2nd - Unzip and copy 🔓

I use following command:

!tar -xzvf pig-0.17.0.tar.gz


3rd - Set Pig home variables 🏡

I use following command:

# Set environment variables
import os
os.environ['PIG_HOME']="/content/pig-0.17.0"
os.environ['PIG_CLASSPATH']="/usr/local/hadoop-3.3.1/conf"
os.environ["PATH"] += os.pathsep + "/content/pig-0.17.0/bin"
Enter fullscreen mode Exit fullscreen mode

We can validate installation with command:

!pig -version


4th - Create a folder with HDFS 🌎📂

I use following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///content/data_pig


4.1 - Remove folder with HDFS ♻

Maybe, later you need remove it. To do that you must apply following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r file:///content/data_pig


5th - Getting a dataset to anlyze with Pig 💾

I use a dataset from grouplens. You can get other in:
http://files.grouplens.org/datasets/

This time I use movieslens and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use data extract files. I extract files in path later of -d in command:

!unzip "/content/ml-100k.zip" -d "file:///content/data_pig"

For list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /content/data_pig/ml-100k


6th - Creating process to use Pig with Pig Syntax 🐖

To create job in Pig, you must see structure of dataset to configure jobs.
In this case we print dataset with following command:

!head /content/data_pig/ml-100k/u.data

I can get following information of dataset:

  • First column reference to userID.
  • Second column reference to movieID.
  • Third column reference to rating.
  • Fourth column reference to timestamp.
# Create pig script
%%writefile id.pig
/* id.pig */

student = LOAD 'file:///content/data_pig/ml-100k/u.data' USING PigStorage(' ')
   as (userId:int, movieId:int, rating:int, timestamp:int);

student_order = ORDER student BY rating DESC;

Dump student_order;
Enter fullscreen mode Exit fullscreen mode

7th - Running the process 🙈

Here we run the process specifing some parameters:

  • Pig file program is id.pig
  • Dataset is in file:///content/data_pig/ml-100k/u.data

When run process, maybe take a few minutes...

You can run script with:

!pig -x local id.pig

But we run script and save results in a file .txt:

!pig -x local id.pig > results.txt


8th - Advancing in the logic of the scripts 😎

Now we will advance in logic of the script to get answer to next questions:

  • What are the oldest 5 star movies?
  • What are the worst movies?

8.1 - Find oldest 5 star movies start ⭐

%%writefile fiveStarMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

ratingsByMovie = GROUP ratings BY movieID;

avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) as avgRating;

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;

oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;
Enter fullscreen mode Exit fullscreen mode

Run script and save results in a file .txt:

!pig -x local fiveStarMovies.pig > fiveStarMovies.txt


8.2 - Find most rated bad movies ⭐

%%writefile BadPopularMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
  AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle;

groupedRating = GROUP ratings by movieID;

avgRatings = FOREACH groupedRating GENERATE group as movieID, AVG(ratings.rating) as avgRating, COUNT(ratings.rating) AS numRatings;  

badMovies = FILTER avgRatings BY avgRating < 2.0;

namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;

results = FOREACH namedBadMovies GENERATE nameLookup::movieTitle as movieName,
          badMovies::avgRating as avgRating, badMovies::numRatings as numRatings;

finalResults = ORDER results BY numRatings DESC;

DUMP finalResults;
Enter fullscreen mode Exit fullscreen mode

Run script and save results in a file .txt:

!pig -x local BadPopularMovies.pig > BadPopularMovies.txt


9th - Say thanks, give like and share if this has been of help/interest 😁🖖


AWS Q Developer image

Your AI Code Assistant

Automate your code reviews. Catch bugs before your coworkers. Fix security issues in your code. Built to handle large projects, Amazon Q Developer works alongside you from idea to production code.

Get started free in your IDE

Top comments (0)

Postmark Image

Speedy emails, satisfied customers

Are delayed transactional emails costing you user satisfaction? Postmark delivers your emails almost instantly, keeping your customers happy and connected.

Sign up

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay