I am between jobs! I have just finished working at Plantix and will soon be starting an exciting project in the energy sector. Because my vacation was cancelled due to Coronavirus, and because my leisure time between these two jobs therefore has to be spent almost entirely sat at my desk, I decided to finally follow one of the more stupid ideas I've had to an adequate conclusion.
I have no intention of proving it, but I think "{X} and {Y}" is one of the most common film/tv series title formats there is. A fun game is to try and make the longest title "{X} and {Y} and {Z} and …" where each of "{X} and {Y}", "{Y} and {Z}", etc. are titles of films or tv shows. Turning "Will and Grace" and "Grace and Frankie" into "Will and Grace and Frankie", for example. Let's make the computer do it and see how long we can make it.
What I have: a list of 2 million IMDB movies/TV series/shorts
What I'm going to do with it: make the longest chain of connected titles in the format "X and Y and Z", etc.
The full code is below, and so is the result. But for the sake of anyone interested I'm going to go through it piece by piece, so skip this if you don't care.
Data Reading and Cleaning
First things first, download the "title.basics.tsv" table hosted here https://www.imdb.com/interfaces/.
import os

import pandas as pd

cached_filename = 'title.basics-trimmed.tsv'
if not os.path.exists(cached_filename):
    # this is a heavy operation so it would be ideal to just do it once and save the result
    df = pd.read_csv('title.basics.tsv', delimiter='\t')
    df = df[df['titleType'].isin(['movie', 'video', 'tvMovie', 'tvSeries', 'tvMiniSeries', 'short'])]
    df.to_csv(cached_filename, sep='\t', index=False)
else:
    # the cache was written tab-separated, so read it back the same way
    df = pd.read_csv(cached_filename, delimiter='\t')
Here I've read the tsv, removed all of the TV episodes and then cached the result. I removed TV episodes as they're the vast majority of the original dataset and it feels like cheating, because no-one on earth gives a shit about the title of a midseason Scrubs episode. My laptop, frequently crashing when trying to include these episodes in the analysis, heartily agrees.
Afterwards, with the ugly one-liner:
df['primaryTitle'] = df.primaryTitle.map(lambda l: l.lower().replace('&', 'and'))
we do all the data cleaning needed: make every title lower case and replace "&" with "and". Further cool data cleaning would be to include "and" in different languages and see if we could have cross-lingual sets.
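I haven't actually done this, but a first stab might look something like the sketch below. The word list is purely a guess, and short words like "e" or "en" would surely produce false positives:

# hypothetical: a hand-picked set of "and" equivalents in other languages
AND_WORDS = {'and', '&', 'und', 'et', 'y', 'e', 'och', 'en'}

def normalize_title(title):
    # lower-case the title and map any recognized "and" word onto plain "and"
    words = title.lower().split(' ')
    return ' '.join('and' if word in AND_WORDS else word for word in words)

df['primaryTitle'] = df.primaryTitle.map(normalize_title)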
Find Eligible Titles
We still have a massive dataframe at this point, so the highest priority is to get rid of as many rows as we can while not changing the eventual result. Therefore we set up a regex filter and do a fullmatch on the title. Anything which doesn't return None is a match, so we drop everything which returns None.
import re

regex_pattern = '[A-Za-z0-9]* and [A-Za-z0-9]*'
pattern = re.compile(regex_pattern)
contains = df.primaryTitle.map(pattern.fullmatch)
# keep only the rows where fullmatch found something; .copy() lets us add columns later
and_df = df[~contains.isnull()].copy()
I hate regex. Hate hate hate it. My typical workflow with regex is googling more and more desperate things like "pattern match regex", "pattern match regex with and", "site:stackoverflow.com pattern match regex with "and"" until I find some saint who, with zero explanation, gives a regex string like "[a-9]*+)*" and I copy and paste it into my code. Sadly this time I had to actually work out how it works, and I still hate it, but basically [A-Za-z0-9]* means matching any word made out of only alphanumeric characters, so we're looking for two alphanumeric words connected by " and ". If this is actually doing something different, let me know! I won't bite, I'm used to being wrong.
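To convince myself (and hopefully you), here's roughly how fullmatch behaves on a few made-up inputs:

pattern = re.compile('[A-Za-z0-9]* and [A-Za-z0-9]*')
pattern.fullmatch('gnomeo and juliet')        # match: two single words joined by " and "
pattern.fullmatch('the old man and the sea')  # None: "the old man" is not one alphanumeric word
pattern.fullmatch('gnomeo & juliet')          # None here, but the "&" was already replaced earlier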
Represent films as edges
This is basically the only novel work I did here, so please temper any congratulations accordingly. The idea is that we can represent a film such as "Gnomeo and Juliet" as a directed edge joining two vertices. What I mean is that there exist two vertices, Gnomeo and Juliet, and they remained unconnected until some misguided fool pitched the film in 2011 and secured a $36 million budget.
Once the film is released, Gnomeo and Juliet are connected by a directed edge (a fancy word for an arrow). I.e., once the film is released, we can take our "X and Y and Z and Gnomeo" and move to "X and Y and Z and Gnomeo and Juliet", then look at the films downstream of Juliet to continue the chain. Before the film, we were stuck on Gnomeo. In the example of the graph above: if we release a film known as "7 and 2", we will then have an arrow pointing from 7 to 2.
This is a lot of words to explain these two lines:
# the word before " and " becomes the start vertex, the word after becomes the end vertex
and_df['start'] = and_df.primaryTitle.map(lambda l: l.split(' ')[0])
and_df['end'] = and_df.primaryTitle.map(lambda l: l.split(' ')[-1])
Bit of Fun: Most common starts/ends
Because we have this data lying around, here is a list of the most common pre-"and" words:
love 431
lost 220
black 142
hide 123
blood 122
and the most common post-"and" words:
me 299
found 194
seek 121
friends 111
son 102
For some reason I expected the pre's to be similar to the post's, but they aren't. It looks a lot like there are nearly 200 films called Lost & Found, and nearly 120 called Hide & Seek. At this point it's probably clear there is some duplication in the titles. Luckily it won't affect our analysis downstream, except by increasing the runtime.
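For anyone reproducing these numbers, they come from nothing fancier than pandas' value_counts on the start and end columns we just made:

and_df['start'].value_counts().head(5)
and_df['end'].value_counts().head(5)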
Longest path through a directed graph
This part I copied near-completely from a Stack Overflow answer, right here: https://stackoverflow.com/a/29321323. Thank you, er... jedwards. I'll go over what they are actually doing in this code, but just to note that if I have seen further it is by standing on the shoulders of Stack Overflow users.
Firstly we need a data structure representing this directed graph. Jedwards here uses a defaultdict:

from collections import defaultdict

graph_dict = defaultdict(list)
for _, row in and_df.iterrows():
    graph_dict[row['start']].append(row['end'])
For those uninitiated, a defaultdict can be used exactly like a dict, except if you try to access a key that doesn't already exist in the dictionary, instead of throwing a KeyError like a standard dict would, it initializes an object that can be used immediately. As an example, if you've used a Counter before, that is basically defaultdict(int).
In this specific case we are using a defaultdict to create an empty list and immediately append to it. With the loop above, we create a dictionary where the key is the starting vertex (the Gnomeo) and the value is a list of every vertex you can jump to from it (the Juliets).
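A toy demonstration, in case that's too abstract:

from collections import defaultdict

d = defaultdict(list)
d['gnomeo'].append('juliet')  # no KeyError: the missing key gets a fresh empty list first
d['gnomeo'].append('tybalt')
print(d['gnomeo'])  # ['juliet', 'tybalt']
print(d['paris'])   # [] -- merely reading a missing key also creates it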
Jedwards then provides us with this recursive Depth First Search:
def DFS(graph_dict, vertex, seen=None, path=None):
    if seen is None: seen = []
    if path is None: path = [vertex]
    seen.append(vertex)  # this means we only go to each word once which is kind of sad but :shrug:
    paths = []
    for next_vertex in graph_dict[vertex]:
        if next_vertex not in seen:
            t_path = path + [next_vertex]
            paths.append(tuple(t_path))
            paths.extend(DFS(graph_dict, next_vertex, seen, t_path))
    return paths
This takes a vertex (a Gnomeo) and a graph dict (made above) and creates every possible path through the graph without hitting the same vertex twice. The recursion here is very cool and makes sense: we move vertex by vertex, finding every path under more and more constraints, until there's nowhere left to go and we've explored every path.
I've added a comment here that this means that we can only hit each vertex once. This is a bit of a shame because if for example we had a cycle in the graph like "Gnomeo and Juliet and Romeo and Gnomeo" then we could include this and make the title even longer. But that would take tracking the usage of the individual edges and I have a feeling that this would push the runtime up from 20 minutes at present to about a year. I'm sticking with jedwards here.
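For the record, here's an untested sketch of what that edge-tracking variant might look like. I'm not running it on the full graph; note that every branch gets its own copy of used, which is exactly where I'd expect the runtime to explode:

def DFS_edges(graph_dict, vertex, used=None, path=None):
    # hypothetical variant: vertices may repeat, but each film (edge) is used at most once
    if used is None: used = frozenset()
    if path is None: path = [vertex]
    paths = []
    for next_vertex in graph_dict[vertex]:
        edge = (vertex, next_vertex)
        if edge not in used:
            t_path = path + [next_vertex]
            paths.append(tuple(t_path))
            paths.extend(DFS_edges(graph_dict, next_vertex, used | {edge}, t_path))
    return paths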
Another thing to note about this function is that it needs to be given a start. Therefore in the spirit of brute forcing this problem, we can simply run DFS on every vertex that is the start of an edge. On every Gnomeo.
from tqdm import tqdm

all_biggest = []
for start in tqdm(list(graph_dict.keys())):
    # keep the 5 longest chains starting from this vertex
    all_biggest.extend(sorted(DFS(graph_dict, start), key=len, reverse=True)[:5])
all_biggest = sorted(all_biggest, key=len, reverse=True)
This code goes Gnomeo by Gnomeo, takes the top 5 chains in terms of length and adds them to a list, finally sorting the whole lot. I originally thought I'd show a few here, but the reality of how long these chains turned out to be made that impossible. Here is the gist for the whole code. Please comment on it if there could be some improvement! :)
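Since each chain comes back as a tuple of words, turning the winner back into one glorious title is a one-liner:

print(' and '.join(all_biggest[0]))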
What you've been waiting for
The longest chain I found was 371 films/tv shows long. I've imdb'd the weird ones and they all seem real! I'm adding this as an image here because I couldn't find a nicer way of formatting it. The actual string itself can be found here.
EDIT: A title chain with Gnomeo in it
How did I leave out a title starting with Gnomeo? Here you go, anyway.