DEV Community

Edoka Chisom



Ever come home after a busy day at work and wondered what movie to watch? Or spent a weekend at home unable to decide? It's not as easy as you'd think, but here comes data to the rescue.
In this write-up, I will show you how to scrape data from Rotten Tomatoes, a popular movie rating website, and we will create word clouds for about 100 movies.
Rotten Tomatoes has a list of the top 100 movies of all time, ranked by critic score and number of critic reviews. But this metric is a little flawed: only critics' ratings count, and ordinary viewers like you and me are left out. Wouldn't it be awesome to compare the critic score with the audience score?

Import the necessary libraries

# import all the necessary libraries
from bs4 import BeautifulSoup
import os
import requests
import glob
import unicodedata
import pandas as pd

Next, we download the top 100 greatest movies of all time dataset by clicking here


Web scrape the Rotten Tomatoes site

Rotten Tomatoes site preview

Here, we will scrape the Rotten Tomatoes site to get the audience score, since it was not included in the original dataset.
Because web scraping is brittle, I've downloaded the movies' web pages and compiled them in this zipped folder, so you can reproduce this without issues. The folder contains the web pages for the top 100 greatest movies of all time, which you can view in your browser or code editor.

# Webscraping
# Webscrape rotten tomatoes site to get the movie title,
# audience score and number of audience ratings

# List of dictionaries to build file by file and later convert to a DataFrame
df2_list = []

# folder where the movie webpages are saved
folder = 'C:\\Users\\user\\Desktop\\UDACITY\\Greatest_Movies\\rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file,"lxml")
        title = soup.find('title').text
        title = unicodedata.normalize("NFKD",title)
        audience_score = soup.find(name='div',class_='meter-value').find(name='span',class_='superPageFontColor').text[:-1]
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')

        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')

        # Append to list of dictionaries
        df2_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})

# convert list of dictionaries to dataframe       
df2 = pd.DataFrame(df2_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])
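The scraper turns display strings into integers by slicing off the percent sign and removing thousands separators. Here is a minimal sketch of that conversion as two small helpers. The names parse_percent and parse_count are hypothetical and the sample values are made up; rstrip("%") is used in place of the [:-1] slice so a value without a trailing percent sign is not cut short.

```python
def parse_percent(text: str) -> int:
    """Turn a display string such as '89%' into the integer 89."""
    return int(text.strip().rstrip("%"))


def parse_count(text: str) -> int:
    """Turn a display string such as ' 874,425 ' into the integer 874425."""
    return int(text.strip().replace(",", ""))


# Hypothetical sample values, not taken from the real pages.
print(parse_percent("89%"))
print(parse_count(" 874,425 "))
```

Keeping the conversion in one place makes it easier to adjust if the page markup changes.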


Merge both dataframes together

Now we perform an inner join of both dataframes. To do this, both dataframes need a common column (the title column in this case), so a little cleaning has to be done on the title columns: reformat the title column of df2 to match the format of df1, then remove any trailing spaces.
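The title cleanup could look roughly like the sketch below. It assumes the scraped page titles end with " - Rotten Tomatoes" and may contain non-breaking spaces; both are assumptions about the saved pages, and clean_title is a hypothetical helper name, so check a few of your own titles first.

```python
import unicodedata

SUFFIX = " - Rotten Tomatoes"  # assumed page-title suffix; verify against your files


def clean_title(raw: str) -> str:
    """Normalize a scraped page title so it can serve as a join key."""
    # NFKD replaces compatibility characters such as the non-breaking
    # space (U+00A0) with their plain equivalents.
    title = unicodedata.normalize("NFKD", raw).strip()
    if title.endswith(SUFFIX):
        title = title[: -len(SUFFIX)]
    return title.strip()


# Made-up input for illustration.
print(clean_title("The Wizard of Oz\xa0(1939) - Rotten Tomatoes "))
```

Applying the same function to the title column of both dataframes, for example with df2["title"].map(clean_title), keeps the join keys consistent on both sides.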


Visualization using Tableau

We now have all we need to create our visualization. For this, we will use Tableau.

Link to Tableau visualization

Create a WordCloud for movie reviews

To create a word cloud, we need reviews for each movie. We will download the reviews of Roger Ebert (a popular American movie critic) as text files, loop through them to get each movie's title and review, and finally join the reviews to our dataframe.

folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name) # creates the directory ebert_reviews if it doesn't exist

ebert_review_urls = ['',
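Once the text files are downloaded, each one has to be turned into a title and a review. Here is a minimal sketch of that step, assuming each file holds the movie title on its first line and the review on the remaining lines. That layout is an assumption, and read_review is a hypothetical helper; the demo writes a made-up file so the snippet runs on its own.

```python
import os
import tempfile


def read_review(path: str) -> dict:
    """Read one review file: first line is the title, the rest is the review."""
    with open(path, encoding="utf-8") as f:
        title = f.readline().strip()
        review_text = f.read().strip()
    return {"title": title, "review_text": review_text}


# Demo with a throwaway file; the content is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Some Movie (1950)\nA made-up review line.\nAnother line.\n")
    record = read_review(sample)

print(record["title"])
```

Collecting these dictionaries in a list and passing the list to pd.DataFrame gives a dataframe with title and review_text columns, ready to join on title.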


Left Join with Roger Ebert's Reviews

df = new_df.merge(df3,how='left',on='title')


Some movies have no review by Roger Ebert.
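To see why a left join leaves gaps where an inner join would drop rows, here is a small sketch with made-up data. The dataframes movies and reviews are hypothetical stand-ins, not the real df1, df2, or df3.

```python
import pandas as pd

# Made-up rows for illustration only.
movies = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C"],
                       "audience_score": [90, 85, 70]})
reviews = pd.DataFrame({"title": ["Movie A", "Movie C"],
                        "review_text": ["Great.", "Fine."]})

inner = movies.merge(reviews, how="inner", on="title")  # only titles present in both
left = movies.merge(reviews, how="left", on="title")    # keeps every movie

# After a left join, movies without a review have NaN in review_text.
missing = left.loc[left["review_text"].isna(), "title"].tolist()
print(len(inner), len(left), missing)
```

The left join keeps all three movies, so the unreviewed one can still appear in the Tableau chart even though it gets no word cloud.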


We can now move on to creating the word clouds. First, we write the code for a single movie to check that it works before scaling it to all 100 movies.
Word cloud of the first movie

Below is the code to reproduce it for all 100 movies.

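# Generate a word cloud for each movie that has a review
# (assumes wordcloud_folder was defined in the single-movie step)
from wordcloud import WordCloud

for col, review in enumerate(df["review_text"]):
    if col > 100:  # stop once the list of about 100 movies is covered
        break
    if pd.isna(review):  # skip movies with no review text
        continue
    wc = WordCloud(background_color="white",
                   width=3000, height=2000).generate(review)  # create word cloud for this movie
    wc.to_file(wordcloud_folder + "/" + str(df.ranking[col]) + '_' + df.title[col] + '.png')  # save the word cloud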
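One thing to watch when saving the images: the file name is built from the movie title, and a title may contain characters such as a colon or slash that some operating systems reject in file names. A minimal sketch of a workaround follows; safe_filename is a hypothetical helper and the sample title is made up.

```python
import re


def safe_filename(ranking: int, title: str) -> str:
    """Build '<ranking>_<title>.png' with risky characters replaced."""
    # Replace characters that Windows or POSIX reject in file names.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{ranking}_{cleaned}.png"


# Made-up title for illustration.
print(safe_filename(7, "Some Title: Part 1/2 (1950)"))
```

The result can be joined to the output folder with os.path.join before it is passed to to_file.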


Now we have a better sense of what movies to watch, based on their quadrant in our Tableau visualization and their reviews in our word clouds. Enjoy your weekend! Link to my GitHub repo for the source code.
