Alistair

Posted on Mar 18 • Edited on Apr 14

Cleaning Cinema Titles Before You Can Even Search

#javascript #datascience #webdev #opensource

When Clusterflick first started pulling listings, I assumed the hard part would be the scraping. Getting the data off 250+ different cinema websites, each with their own structure and quirks — that's where the complexity lives, right?

But before any of that work pays off, before a single TMDB search can happen, there's a problem sitting right at the start of the pipeline: cinema listings don't always give you a clean film title. They give you something like this:

BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap

Or:

(IMAX) Princess Mononoke: 2025 Re-Release Subtited

Or my personal favourite:

MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP & SING-ALONG

None of those are going to find anything useful in a TMDB search. So before matching can happen, there's a normalisation step — and it's grown into something with its own test suite of nearly 15,000 cases.

The Obvious Stuff

The easy wins are the patterns you see immediately once you start looking at real listings. Film clubs attach their branding, and cinemas love adding series names and event types to the front of a title:

Bar Trash:
DocHouse:
CLASSIC MATINEE:
Animation at War:
Family Film Club:

And the end of titles is just as cluttered:

… + Q&A with Director
… on 35mm film
… (4K Remaster)
… Special Screening
… with Introduction

For all of these, there's a known-removable-phrases.js file: a flat list of exact strings and patterns to strip. It currently has around 1,000 entries. The rule for adding to it is simple: if a phrase is a superfluous label added by a venue and isn't part of identifying the film, it goes here. Spelling corrections and encoding fixes are handled separately.

The list isn't pretty, but it works. After stripping known phrases, BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap becomes THE ZODIAC KILLER (1971). Progress.
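A minimal sketch of how that kind of stripping could work. The entries, the function names and the case-insensitive matching of plain strings are all assumptions for illustration, not the contents of the real known-removable-phrases.js:

```javascript
// Hypothetical entries: a mix of exact strings and patterns.
// The real list is far longer and will differ from this.
const knownRemovablePhrases = [
  /^bar trash\s*[–-]\s*/i, // series branding at the front
  /\s+at beer merchants tap$/i, // venue name at the end
  /\s*\+\s*q&a with director$/i, // event type at the end
  " on 35mm film",
  " (4K Remaster)",
];

// Escape regex metacharacters so a plain string can be matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Remove every known phrase from the title, then trim what is left.
function stripKnownPhrases(title, phrases = knownRemovablePhrases) {
  let result = title;
  for (const phrase of phrases) {
    const pattern =
      typeof phrase === "string"
        ? new RegExp(escapeRegExp(phrase), "gi") // assumption: strings match case-insensitively
        : phrase;
    result = result.replace(pattern, "");
  }
  return result.trim();
}

console.log(
  stripKnownPhrases("BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap")
);
// → THE ZODIAC KILLER (1971)
```

Anchoring the patterns to the start or end of the title (with ^ and $) is one way to avoid removing a phrase that happens to appear in the middle of a genuine film title.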

The Plus Problem

A lot of venues append extra information to titles using a + separator:

Slade in Flame + Q&A with Noddy Holder
TO A LAND UNKNOWN + PRE-RECORDED Q&A
Goodbye to the Past + pre-recorded intro by Annette Insdorf

The solution is obvious: split on + and take whatever's before it. Except — and this is where it gets awkward — some legitimate film titles contain a +:

Romeo + Juliet

That's the actual title of the Baz Luhrmann film. Split naively and you'd search for "Romeo" and find nothing useful. So there's a corrections list that pre-empts the split:

["Romeo + Juliet", "Romeo+Juliet"],

Removing the spaces makes it invisible to the splitter, then it gets normalised back correctly downstream. It's a bit of a hack, but it does the job.
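Sketched in code, the idea might look something like this. The function name and the exact separator rule (a + with whitespace on both sides) are assumptions, and the step that later turns Romeo+Juliet back into its proper form is left out:

```javascript
// Corrections applied before splitting, so protected titles survive the split.
// Only the Romeo + Juliet pair comes from the post; the rest is illustrative.
const preSplitCorrections = [["Romeo + Juliet", "Romeo+Juliet"]];

function takeBeforePlus(title) {
  let result = title;
  for (const [from, to] of preSplitCorrections) {
    // Plain string replacement, so this match is case-sensitive.
    result = result.replaceAll(from, () => to);
  }
  // Assumption: only a "+" surrounded by whitespace counts as a separator.
  const [beforePlus] = result.split(/\s+\+\s+/);
  return beforePlus.trim();
}

console.log(takeBeforePlus("Slade in Flame + Q&A with Noddy Holder"));
// → Slade in Flame
console.log(takeBeforePlus("Romeo + Juliet + intro"));
// → Romeo+Juliet
```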

The same logic applies to the – and / separators, which venues also use to attach event context. The pipeline strips what comes after the last separator — unless the result looks wrong, in which case there's probably a correction for it.

"Presents" and Other Sneaky Prefixes

Some patterns can't be handled with a fixed string list — there are too many variations. So instead we look for signal words to decide what information we can discard. If a title contains presents:, for example, everything before presents: is almost certainly not the film title:

Ghibliotheque presents... Spirited Away
VHS Late Tapes Takeover: LCVA presents POUT

These get handled with a regex match: if presents?:? appears mid-title, take whatever follows it.

The same approach works for premiere of:, screening of:, retrospective screening of:, and a handful of others. Each one is a named match rather than a blindly applied strip, so the code can be explicit about what it's doing.
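As a rough illustration, named signal-word matching could be sketched like this. The patterns, the names and the return shape are all assumptions rather than the project's actual code, and the presents pattern here is slightly looser than presents?:? so that it also copes with a trailing ellipsis:

```javascript
// Each entry has a name, so a match can report which rule fired.
// These patterns are illustrative guesses, not the real ones.
const signalPatterns = [
  // Needs at least one character before "presents", so it only fires mid-title.
  { name: "presents", pattern: /^.+?\bpresents?\b[:.…]*\s+(.+)$/i },
  { name: "premiere of", pattern: /^.*\bpremiere of:?\s+(.+)$/i },
  // Also covers "retrospective screening of".
  { name: "screening of", pattern: /^.*\bscreening of:?\s+(.+)$/i },
];

// Return whatever follows the first signal word found, plus the rule name.
function takeAfterSignalWord(title) {
  for (const { name, pattern } of signalPatterns) {
    const match = title.match(pattern);
    if (match) {
      return { matchedOn: name, title: match[1].trim() };
    }
  }
  return { matchedOn: null, title };
}

console.log(takeAfterSignalWord("Ghibliotheque presents... Spirited Away"));
console.log(takeAfterSignalWord("VHS Late Tapes Takeover: LCVA presents POUT"));
```

A pattern this loose would also fire on a genuine title that contains the word "present" followed by more text, which is one reason a rule like this needs a test case for every title it touches.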

The Corrections List

Even after removing known phrases and applying structural patterns, there are titles that are just wrong — or at least not in the form TMDB expects. That's where normalize-title.js comes in. It has a corrections array with around 500 entries, covering everything from typos to venue-specific quirks to completely misnamed films.

Some are straightforward spelling fixes:

["Carvaggio", "Caravaggio"],
["Seigfried", "Siegfried"],
["Labryinth", "Labyrinth"],

Some are encoding artefacts or odd formatting choices:

["&amp;", "&"],
["½", " 1/2"],

Some are venues getting the actual film title wrong. The BFI listed a film as "Battleground" as a translation from the original Italian — the film is called "Battlefield":

["Battleground + intro ", "Battlefield + intro "],

And then there are the genuinely weird ones. MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP & SING-ALONG — that's not a film, it's an event which includes a film.

["MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP & SING-ALONG", "Muppet Christmas Carol"],

With hindsight, this is the kind of thing I try to avoid: a one-off correction for a singular event. It probably shouldn't have had a correction at all, and should instead have relied on falling through to the LLM for identification using matching hints.

One entry I'm particularly fond of:

[/^Dr\.? Strangelove$/i, "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"],

Because cinemas almost never write the full title, but having the full title makes it much more likely to match on a TMDB search.
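For a sense of how a mixed list of string and regex corrections might be applied, here is a small sketch. The entries are taken from the examples above, but the applyCorrections function and its in-order application are assumptions, not the code in normalize-title.js:

```javascript
// A few of the corrections shown above, as [from, to] pairs.
const corrections = [
  ["Carvaggio", "Caravaggio"],
  ["&amp;", "&"],
  ["½", " 1/2"],
  [
    /^Dr\.? Strangelove$/i,
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
  ],
];

// Apply each correction in order. Strings replace every occurrence;
// regexes are used as written, so their anchors and flags decide the scope.
function applyCorrections(title, list = corrections) {
  return list.reduce(
    (current, [from, to]) =>
      typeof from === "string"
        ? current.replaceAll(from, () => to)
        : current.replace(from, () => to),
    title
  );
}

console.log(applyCorrections("Dr Strangelove"));
// → Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
console.log(applyCorrections("8½"));
// → 8 1/2
```

The ^ and $ anchors on the Strangelove pattern matter: without them, a listing that already uses the full title would be expanded a second time.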

What Gets Stripped Last

After the corrections and phrase removal, there's a final cleanup pass: diacritics get normalised, smart quotes become straight quotes, soft hyphens get removed, and trailing punctuation goes. Articles at the start (the, a) get stripped in most cases, though not all, so that The Big Lebowski and Big Lebowski match the same thing.
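A cleanup pass along those lines could be sketched as follows. The function name, the order of the steps and the exact set of trailing punctuation are assumptions, and the exceptions where an article is kept are not modelled:

```javascript
function finalCleanup(title) {
  return (
    title
      // Decompose accented characters, then drop the combining marks.
      .normalize("NFD")
      .replace(/\p{M}/gu, "")
      // Smart single and double quotes become straight quotes.
      .replace(/[\u2018\u2019]/g, "'")
      .replace(/[\u201C\u201D]/g, '"')
      // Soft hyphens are invisible, so remove them entirely.
      .replace(/\u00AD/g, "")
      // Trailing whitespace and punctuation (assumed set; brackets are kept).
      .replace(/[\s.,:;!?-]+$/, "")
      // Leading article. The real pipeline keeps it in some cases.
      .replace(/^(the|a)\s+/i, "")
      .trim()
  );
}

console.log(finalCleanup("The Big Lebowski") === finalCleanup("Big Lebowski"));
// → true
```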

Year suffixes in brackets like (1971) are kept, because they're genuinely useful for disambiguation: Psycho (1960) is a different film from Psycho (1998) (and you'll probably want to know which version you're about to watch 😉).
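One way a kept year could be used later is to pull it out alongside the title when building a search. This is only a sketch: the splitTitleAndYear helper and the restriction to years starting 19 or 20 are assumptions, and nothing here reflects how the real pipeline passes the year on:

```javascript
// Split "Title (YYYY)" into its parts. Titles without a bracketed
// year suffix come back with year set to null.
function splitTitleAndYear(title) {
  const match = title.match(/^(.*?)\s*\(((?:19|20)\d{2})\)\s*$/);
  if (!match) {
    return { title: title.trim(), year: null };
  }
  return { title: match[1].trim(), year: Number(match[2]) };
}

console.log(splitTitleAndYear("Psycho (1960)"));
console.log(splitTitleAndYear("Psycho (1998)"));
console.log(splitTitleAndYear("Big Lebowski"));
```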

There's also the theatre performance problem. Some venues list National Theatre Live and Royal Ballet screenings using the same listing format as regular films. A listing for NT Live: Dr Strangelove isn't after a film called "Dr Strangelove"; it's after the NT Live broadcast of it. Those listings get detected and normalised by a whole separate setup before this pipeline runs. But that's probably worth its own post.

Perfect Is the Enemy of Good

The list of corrections is never going to be finished. New venues bring new branding. Films get re-released with different title formats. Cinemas just spell things wrong.

What the normalisation step needs to do is get most titles into a clean enough state that the TMDB search returns the right film. The cases it misses — titles that are too ambiguous or too corrupted — fall through to the LLM matching stage, which can handle a messier input. That's the right place for those anyway: the normalisation step is supposed to be fast and cheap, not exhaustive.

The test suite in normalize-title.test.js keeps the list honest. Every correction and removable phrase is supposed to have a corresponding test case in test-titles.json, so there's a record of what each entry is for and a way to verify it doesn't break anything when the list changes. And it gets updated every day as new data comes in.
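The "every entry has a test case" rule could be checked mechanically. Below is a hedged sketch of such a check: the shape of the test cases (objects with input and expected) is a guess at what test-titles.json might contain, and findUntestedCorrections is a hypothetical helper, not something from the project:

```javascript
const corrections = [
  ["Carvaggio", "Caravaggio"],
  ["Seigfried", "Siegfried"],
  [
    /^Dr\.? Strangelove$/i,
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
  ],
];

// Assumed shape for the test data: one object per title.
const testTitles = [
  { input: "Carvaggio", expected: "Caravaggio" },
  {
    input: "Dr Strangelove",
    expected:
      "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
  },
];

// Return the corrections that no test input would ever trigger.
// Note: regex.test() is stateful for patterns with the g flag; none are used here.
function findUntestedCorrections(correctionList, cases) {
  return correctionList.filter(
    ([from]) =>
      !cases.some(({ input }) =>
        typeof from === "string" ? input.includes(from) : from.test(input)
      )
  );
}

console.log(findUntestedCorrections(corrections, testTitles));
// → [ [ 'Seigfried', 'Siegfried' ] ]
```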

It's not elegant. But the alternative — sending BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap to TMDB and hoping for the best — doesn't work. And now you know why 🍿

P.S. Shout out to Bar Trash for having some of the most consistent and standardised titles ❤️
Those titles make for a great example in this blog post, but they're far from being the most complex ones I need to deal with!

Next post: ~~Testing Your Prompts Like You Test Your Code~~
Unfortunately I haven't finished that work yet, so until then, the next post will be The Raspberry Pi Cluster in My Living Room
