Hi everyone, it's my first post ever on Dev.to!
I will talk about a school project I did with Hadoop MapReduce technology.
I had quite a struggle doing it properly because it was hard to find some good resources online.
The purpose is to go through each step and understand how to manage multiple Hadoop Jobs, run multiple mappers, and handle multiple input files.
I will assume here that you are a little bit familiar with Hadoop and MapReduce; if you are not, the official Apache Hadoop MapReduce tutorial is a good place to start.
The Project
Let's delve into the details of this project.
The starting point is two large CSV files, both available in the MovieLens dataset archive.
- Movies Information File (movies.csv):
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
- Ratings Information File (ratings.csv):
userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
The goal is to determine the number of users who have liked the provided list of movies.
The desired output should look like this:
"78 users have liked the following film(s): Lost In Translation (2003)"
While it may seem challenging initially, don't panic; we will go through it step by step.
- Retrieve Each User's Favorite Film:
  The first step is to gather, for each user, information about their favorite film.
- Count Users Liking Each Movie:
  The second step involves counting the number of users who like each movie. Here, we will link the movie ID with the movie name.
- Merge Films with the Same Number of Likes:
  The final step is to merge films with the same number of users who like them and create a list.
Step 1: Find each user's favourite film
For the initial job, we will use ratings.csv as input. Here's how the process will unfold:
- Mapper:
  - For each line, extract the userId, the movieId, and its rating.
  - Create an object called RatingInfo to store the rating of the film.
  - Write a new line with key: userId and value: RatingInfo(movieId:rating).
- Reducer:
  - Collect all of the user's rating info.
  - Find the highest rating.
  - Write the userId and the corresponding movieId.
The output file will contain each user's favorite film.
Here's the flow of the job:
Here you can see the MapReduce information more clearly.
This job is pretty straightforward because there is no special configuration; a rough sketch of the mapper and reducer is shown below.
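Since the code for this first job is not listed above, here is a minimal sketch of what the mapper and reducer could look like. It is only an illustration: the class names (FavouriteMovieMapper, FavouriteMovieReducer) are my own, and RatingInfo is simplified into a plain "movieId:rating" Text value instead of a custom Writable.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative names; in practice each class lives in its own file.
public class FavouriteMovieMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text userId = new Text();
    private final Text ratingInfo = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ratings.csv line: userId,movieId,rating,timestamp
        String[] fields = value.toString().split(",");
        if (fields.length >= 3 && !fields[0].trim().equals("userId")) { // skip the header
            userId.set(fields[0].trim());
            ratingInfo.set(fields[1].trim() + ":" + fields[2].trim()); // movieId:rating
            context.write(userId, ratingInfo);
        }
    }
}

public class FavouriteMovieReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String bestMovieId = null;
        double bestRating = Double.NEGATIVE_INFINITY;
        // Keep the movie with the highest rating for this user.
        for (Text value : values) {
            String[] parts = value.toString().split(":");
            double rating = Double.parseDouble(parts[1]);
            if (rating > bestRating) {
                bestRating = rating;
                bestMovieId = parts[0];
            }
        }
        if (bestMovieId != null) {
            // Output line: userId \t movieId (consumed by the second job)
            context.write(key, new Text(bestMovieId));
        }
    }
}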
Step 2: Merge movie ID and movie name
Here comes the fun part.
We will run two mappers here. The first mapper takes the file we created with the previous job as its input.
This mapper simply extracts the movieId and writes it with a value of one.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MovieIDUserIDMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text one = new Text("1");
    private final Text movieID = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Job 1 output is tab-separated: userId \t movieId
        String[] fields = value.toString().split("\t");
        if (fields.length >= 2) {
            movieID.set(fields[1]);
            context.write(movieID, one); // emit one "like" for this movie
        }
    }
}
The second mapper takes the movies.csv file as input.
It simply writes the movieId and the movie name.
The reducer will receive values from two different mappers, so it needs a way to know which kind of value it is dealing with; for that, we prefix the movie name value with a string.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MovieIDMovieNameMapper extends Mapper<Object, Text, Text, Text> {

    private final Text movieID = new Text();
    private final Text movieInfo = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // movies.csv line: movieId,title,genres (naive split: assumes no comma in the title)
        String[] fields = value.toString().split(",");
        if (fields.length >= 2) {
            movieID.set(fields[0]);
            movieInfo.set("info:" + fields[1]); // prefix so the reducer can tell the values apart
            context.write(movieID, movieInfo);
        }
    }
}
The reducer will then count the number of times a movie is referenced and link the movie ID with the movie name. The output file will contain, for each movie name, the number of users whose favourite film it is.
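The reducer code itself is not listed here, but below is a rough sketch of what UserCountMovieName could look like. The exact output layout (movie name first, count second) is an assumption on my part:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UserCountMovieName extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        String movieName = null;
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("info:")) {
                // Value from MovieIDMovieNameMapper: the movie name.
                movieName = v.substring("info:".length());
            } else {
                // Value from MovieIDUserIDMapper: one user whose favourite this is.
                count += Integer.parseInt(v.trim());
            }
        }
        // Only emit movies that are at least one user's favourite.
        if (movieName != null && count > 0) {
            context.write(new Text(movieName), new Text(String.valueOf(count)));
        }
    }
}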
The driver to run this job:
private static Job countNumberOfUserLikeEachMovie(String[] args) throws IOException {
    Configuration conf2 = new Configuration();
    Job job2 = Job.getInstance(conf2, "User Count by Movie Name");
    job2.setJarByClass(UserHighestRateMovieName.class);
    // Two mappers feed the same reducer: one reads movies.csv,
    // the other reads the output of job 1.
    MultipleInputs.addInputPath(job2, new Path(args[1]),
            TextInputFormat.class, MovieIDMovieNameMapper.class);
    MultipleInputs.addInputPath(job2, new Path(args[2] + "job1/part-r-00000"),
            TextInputFormat.class, MovieIDUserIDMapper.class);
    job2.setReducerClass(UserCountMovieName.class);
    // A single reducer keeps all the counts in one output file.
    job2.setNumReduceTasks(1);
    job2.setMapOutputKeyClass(Text.class);
    job2.setMapOutputValueClass(Text.class);
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(Text.class);
    TextOutputFormat.setOutputPath(job2, new Path(args[2] + "job2/"));
    return job2;
}
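To manage all three jobs from a single driver, they can be chained by waiting for each one to finish before submitting the next. Here is a hedged sketch of such a main() method; apart from countNumberOfUserLikeEachMovie shown above, the helper names and the argument order are assumptions:

public static void main(String[] args) throws Exception {
    // Assumed argument layout: args[0] = ratings.csv, args[1] = movies.csv,
    // args[2] = base output directory.
    Job job1 = findUsersFavouriteMovie(args);          // Step 1 (assumed helper)
    if (!job1.waitForCompletion(true)) {
        System.exit(1);
    }

    Job job2 = countNumberOfUserLikeEachMovie(args);   // Step 2 (shown above)
    if (!job2.waitForCompletion(true)) {
        System.exit(1);
    }

    Job job3 = groupMoviesByUserCount(args);           // Step 3 (assumed helper)
    System.exit(job3.waitForCompletion(true) ? 0 : 1);
}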
Here is the flow of the second job:
Step 3: Group by number of users and create the movie name list
This job will be pretty much the same as the first one.
We take the output of the previous job as our input.
The mapper will write numberOfUsers as the key and movieName as the value (a sketch is shown below).
The reducer will build a single string out of all the movie names, listing every movie with the same number of users.
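Here is a sketch of what that mapper could look like, assuming job 2 writes tab-separated lines of the form movieName \t numberOfUsers (the class name is an assumption):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UserCountMovieNameMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final IntWritable userCount = new IntWritable();
    private final Text movieName = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Job 2 output line: movieName \t numberOfUsers
        String[] fields = value.toString().split("\t");
        if (fields.length >= 2) {
            movieName.set(fields[0]);
            userCount.set(Integer.parseInt(fields[1].trim()));
            context.write(userCount, movieName);
        }
    }
}

The reducer then concatenates the movie names that share the same count: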
public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    StringBuilder movieList = new StringBuilder();
    // Concatenate every movie that has the same number of users.
    for (Text value : values) {
        movieList.append(value.toString()).append(", ");
    }
    // Drop the trailing ", " separator.
    String formattedMovieList = movieList.substring(0, movieList.length() - 2);
    context.write(new Text(key.toString() + " users have liked the following film(s): "),
            new Text(formattedMovieList));
}
The output file will then look like this:
"78 users have liked the following film(s): Lost In Translation (2003)"
We can see that the most popular favourite film is The Shawshank Redemption.
Here is the flow of the job:
You can find the full code for this project here: MovieFinder Repo
We are done! Don't be shy to ask as many questions as you want; I will try to answer them!