<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alistair</title>
    <description>The latest articles on DEV Community by Alistair (@alistairjcbrown).</description>
    <link>https://dev.to/alistairjcbrown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3756419%2F8b8f7758-c6f0-4fd7-8547-14c43909cd4e.png</url>
      <title>DEV Community: Alistair</title>
      <link>https://dev.to/alistairjcbrown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alistairjcbrown"/>
    <language>en</language>
    <item>
      <title>I Tried to Automate a Manual Review Task with Claude. It Wasn't Worth It.</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Sat, 04 Apr 2026 16:12:07 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/i-tried-to-automate-a-manual-review-task-with-claude-it-wasnt-worth-it-13m9</link>
      <guid>https://dev.to/alistairjcbrown/i-tried-to-automate-a-manual-review-task-with-claude-it-wasnt-worth-it-13m9</guid>
      <description>&lt;p&gt;Every day, a CI job adds new entries to &lt;a href="https://github.com/clusterflick/scripts/blob/main/common/tests/test-titles.json" rel="noopener noreferrer"&gt;&lt;code&gt;test-titles.json&lt;/code&gt;&lt;/a&gt; in my Clusterflick repo. When it finds a cinema listing title the normaliser hasn't seen before, it records the input and the current output, then opens a pull request. Someone — usually me — then has to review whether those outputs are actually correct, fix anything that isn't, and merge.&lt;/p&gt;

&lt;p&gt;It's not complicated work. Review the output and confirm the normaliser has done the correct job. If it hasn't, fix the output (so the test now fails ❌) and then fix the normaliser (until the test passes ✅). But it happens twice a day, and "not complicated" doesn't mean "no context switching".&lt;/p&gt;

&lt;p&gt;So I decided to try automating it with Claude. Several hours and $5 later, I don't think it was worth it — and I think the reasons why are worth writing up 💸&lt;/p&gt;

&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;The normaliser — &lt;a href="https://github.com/clusterflick/scripts/blob/ed3f84d25486b84703b3fd6e2d89fbbdae3a1bf3/common/normalize-title.js" rel="noopener noreferrer"&gt;&lt;code&gt;normalize-title.js&lt;/code&gt;&lt;/a&gt; — converts raw cinema listing titles into a consistent string. I've written about it more in depth in my previous post, &lt;a href="https://dev.to/alistairjcbrown/cleaning-cinema-titles-before-you-can-even-search-1463"&gt;Cleaning Cinema Titles Before You Can Even Search&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When the CI job adds new test entries, it records whatever the normaliser currently produces. The reviewer's job is to decide whether that output is &lt;em&gt;correct&lt;/em&gt;. There's a &lt;a href="https://github.com/clusterflick/scripts/blob/ed3f84d25486b84703b3fd6e2d89fbbdae3a1bf3/docs/reviewing-title-normalisation-test-cases.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/reviewing-title-normalisation-test-cases.md&lt;/code&gt;&lt;/a&gt; file with detailed guidance on how to classify and fix different types of issues.&lt;/p&gt;

&lt;p&gt;The automation task: look at the new entries, use the guide to decide if they look correct, fix anything that's wrong, commit. Automating it with Claude seemed like a reasonable fit, especially as I'd been doing this semi-automated locally using a very basic prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In this branch we've had some automated updates to `common/tests/test-titles.json`.
Confirm these changes are correct, or if they're not correct then fix them.
There's details on how this setup works in `docs/reviewing-title-normalisation-test-cases.md`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;I set up the Claude platform and added $5 of credit, then set up a GitHub Actions workflow triggered by a &lt;code&gt;@claude review titles&lt;/code&gt; comment on any PR. The &lt;a href="https://github.com/anthropics/claude-code-action" rel="noopener noreferrer"&gt;Claude Code GitHub Action&lt;/a&gt; handles the Claude integration — it checks out the PR branch, runs Claude Code against it, and can commit fixes back to the branch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cof1mkaprdk57pr8ri3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cof1mkaprdk57pr8ri3.png" alt="Screenshot of Claude platform" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workflow was straightforward in principle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;issue_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;claude-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
      &lt;span class="s"&gt;contains(github.event.comment.body, '@claude review titles') &amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="s"&gt;github.event.issue.pull_request != null&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z9kyaz1vy4bo0rfn9qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z9kyaz1vy4bo0rfn9qt.png" alt="Screenshot of Claude in actions output" width="800" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude gets the diff, reads the documentation, checks each new entry, and either accepts it as correct or fixes it. Should be straightforward, and a manual trigger to kick it off so no surprises.&lt;/p&gt;

&lt;p&gt;For this, I was also going to double down on Claude: Claude.ai to guide me through the setup, and the Claude API (via the GitHub Action) to do the actual review. But getting there took a few attempts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems
&lt;/h2&gt;

&lt;p&gt;Something worth noting upfront: every failed run here cost money, especially if Claude spirals and chews through tokens. There's not a lot of feedback (or too much, once I figured out how to stream it back), so it's much harder than it is locally to see what Claude's thinking, and there's no reprompt to bring it back on track. On top of that, each run takes several minutes before you find out what went wrong, so the feedback loop is slow and expensive. Debugging a GitHub Actions workflow normally costs you time. Debugging this one cost time &lt;em&gt;and&lt;/em&gt; cash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions.&lt;/strong&gt; The first run failed with OIDC token errors. The Claude Code Action uses OIDC to generate a GitHub App token, which requires &lt;code&gt;id-token: write&lt;/code&gt; in the workflow permissions. I'm not sure why Claude.ai didn't include that in the initial workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Branch checkout.&lt;/strong&gt; The PR branch wasn't checked out by default — the runner was on &lt;code&gt;main&lt;/code&gt;, so Claude found no diff (and chewed through tokens). I added an explicit checkout step with &lt;code&gt;ref: refs/pull/${{ github.event.issue.number }}/head&lt;/code&gt; and &lt;code&gt;fetch-depth: 0&lt;/code&gt; so &lt;code&gt;git diff&lt;/code&gt; had something to work with. Again, I'm not sure why Claude.ai didn't include that in the initial workflow.&lt;/p&gt;

&lt;p&gt;I probably should have caught this one myself. Checking out the PR branch is a well-known requirement when working with pull requests in Actions. I assumed a language model with broad knowledge of GitHub Actions would have it covered. The lesson there is the same as always with LLM output: trust but verify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt;.&lt;/strong&gt; Without this flag, Claude keeps pausing to ask permission before running bash commands or editing files. In a non-interactive GitHub Actions environment that means it loops forever waiting for input it'll never get. Required flag for any autonomous use. Again, I'm not sure why Claude.ai didn't include that in the initial workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8g22ttv10gnkknoj3h7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8g22ttv10gnkknoj3h7.png" alt="Screenshot of Claude.ai after being queried about dangerously-skip-permissions flag" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;--allowedTools&lt;/code&gt; has a bug.&lt;/strong&gt; I initially used &lt;code&gt;--allowedTools Bash,Read,Edit,Write&lt;/code&gt; to restrict Claude to just the tools it needs. But there's a known issue where the init message still reports all available tools, which can confuse Claude into thinking it can use them. Swapped to &lt;code&gt;--disallowedTools&lt;/code&gt; instead, which works correctly.&lt;/p&gt;

&lt;p&gt;By this point I'd spent half my budget just getting the plumbing right, without the PR being updated at all. For context, this PR added 11 new titles, so it wasn't a huge amount of data to review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-Turn Failure
&lt;/h2&gt;

&lt;p&gt;The first run that got past all the setup issues hit the 30-turn limit and stopped without committing anything. It cost $0.59 and took about five minutes.&lt;/p&gt;

&lt;p&gt;What happened was actually Claude doing the right thing. It ran all 11 inputs through the normaliser, saw that every output matched what was recorded, and then — correctly — kept going. Because matching the normaliser isn't the same as being correct. The documentation I'd pointed it at says it plainly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;output&lt;/code&gt; field in &lt;code&gt;test-titles.json&lt;/code&gt; is what the test &lt;strong&gt;expects&lt;/strong&gt;, not necessarily what is correct.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So Claude spent the next 25+ turns reading through &lt;code&gt;normalize-title.js&lt;/code&gt;, &lt;code&gt;known-removable-phrases.js&lt;/code&gt;, and the existing test data, reasoning about whether each output was actually right. That's exactly the job. The problem was that it ran out of turns before committing anything useful.&lt;/p&gt;

&lt;p&gt;I asked Claude.ai to help diagnose this, and it suggested adding an explicit stopping condition to the prompt — something like "if it matches, accept it, don't investigate further." I took that suggestion at face value without thinking through what it actually meant. It would stop the spiralling. It would also stop the reasoning. Those are the same thing 🤦&lt;/p&gt;

&lt;p&gt;I added the stopping condition, dropped &lt;code&gt;--max-turns&lt;/code&gt; to 15, and declared the cost problem fixed. It wasn't — I'd just hidden it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A "Successful" Run That Wasn't
&lt;/h2&gt;

&lt;p&gt;With the prompt fixed and tools switched to &lt;code&gt;--disallowedTools&lt;/code&gt;, the next run completed in 6 turns and 45 seconds. Cost: $0.19.&lt;/p&gt;

&lt;p&gt;The full sequence: check the git log, get the diff, read the docs, run all 11 inputs through the normaliser in a single batch, conclude &lt;em&gt;"All 11 new entries match the recorded output exactly. No fixes needed."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The problem is that conclusion is &lt;em&gt;always&lt;/em&gt; true, by construction. The CI job that creates these PRs records &lt;code&gt;normalizer(input)&lt;/code&gt; as the output — so of course it matches when you run the normaliser again. Confirming that match only confirms the CI job recorded its own output correctly, nothing more.&lt;/p&gt;
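&lt;p&gt;The "always true by construction" point can be sketched in a few lines of JavaScript (toy names here, not the real implementation):&lt;/p&gt;

```javascript
// Hypothetical sketch of how the CI job records a new test entry:
// it stores the normaliser's own output as the expected output.
const recordEntry = (normalize, input) => ({ input, output: normalize(input) });

// Toy stand-in for normalize-title.js, not the real implementation.
const normalize = (title) => title.replace(/ at .*$/, "").trim();

const entry = recordEntry(normalize, "THE ZODIAC KILLER (1971) at Beer Merchants Tap");

// Re-running the same normaliser can only ever agree with what it recorded.
console.log(normalize(entry.input) === entry.output); // true, by construction
```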

&lt;p&gt;What I actually needed was the second step: reasoning about whether those outputs are &lt;em&gt;correct&lt;/em&gt;, spotting event prefixes that should be stripped, recognising real film titles that are getting mangled, and updating &lt;code&gt;known-removable-phrases.js&lt;/code&gt; accordingly. That's the work. In solving the cost problem by narrowing the prompt, I'd removed the work entirely.&lt;/p&gt;

&lt;p&gt;When I went back through the PR manually, I found several entries that still needed fixing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Problem Underneath
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What kept nagging at me:&lt;/strong&gt; the task is reviewing 11 strings. There's a large corpus of existing examples, a detailed instructions document, and an LLM with a vast amount of general knowledge. It shouldn't require 30 turns and $0.59 to do this — and the fact that it did suggests something isn't well-suited here, not just misconfigured.&lt;/p&gt;

&lt;p&gt;Part of it is a problem with visibility. With each run costing real money and taking several minutes, debugging is expensive. You can't easily see why Claude went down a particular path until you're staring at a full JSON trace of every tool call. Every misconfiguration costs you money and ten minutes before you understand what went wrong. Several of those cycles add up quickly — the $5 I spent getting here was just debugging, not doing useful work.&lt;/p&gt;

&lt;p&gt;And even when the infrastructure is right, the cost curve for this type of task is awkward. Simple cases (all outputs correct) should be cheap, but you can't know in advance whether the run will be simple. If Claude starts investigating an ambiguous case, you're back to 20+ turns and $0.50+. The unpredictability makes it hard to budget.&lt;/p&gt;

&lt;p&gt;For a task this focused — a small number of strings, a clear pattern to match against, a fixed corpus to consult — perhaps a deterministic script would be more reliable (and much cheaper). The Claude Code GitHub Action is well-suited to open-ended tasks where you're not sure what tools you'll need... and where you've ideally got a healthy budget to back that up. A free, open-source, personal project trying to automate reviewing normaliser outputs against a known pattern isn't really any of that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;I wouldn't abandon the idea entirely. The local Claude Code workflow — where I can watch it reason, reprompt when it goes off track, and apply fixes interactively — has worked well and saved real time. The problem is trying to make that fully autonomous in a way that's cost-effective.&lt;/p&gt;

&lt;p&gt;If I came back to this, I'd probably try a direct API call with a tighter prompt and explicit output format rather than the full Claude Code agentic setup. Something that gets the diff, asks Claude to classify each entry as "looks correct" or "has issue: [reason]", and only triggers the expensive autonomous work when there's actually something to fix.&lt;/p&gt;
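&lt;p&gt;That classify-then-escalate shape might look roughly like this — all names here are hypothetical, and &lt;code&gt;askClaude&lt;/code&gt; stands in for a single direct API call:&lt;/p&gt;

```javascript
// Hypothetical sketch of the cheaper two-step flow.
// askClaude is a stand-in for one direct API call that returns JSON like:
//   [{ input: "...", verdict: "looks correct" },
//    { input: "...", verdict: "has issue", reason: "event prefix not stripped" }]
async function reviewEntries(entries, askClaude) {
  const verdicts = await askClaude(
    "Classify each entry as 'looks correct' or 'has issue: [reason]':\n" +
      JSON.stringify(entries, null, 2)
  );
  const flagged = verdicts.filter((v) => v.verdict !== "looks correct");
  // Only trigger the expensive autonomous run when there's real work to do.
  return { flagged, needsAutonomousFix: flagged.length > 0 };
}
```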

&lt;p&gt;But for now, some things are still faster and cheaper done by hand. 🍿&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>github</category>
      <category>showdev</category>
    </item>
    <item>
      <title>The Raspberry Pi Cluster in My Living Room</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 25 Mar 2026 08:55:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/the-raspberry-pi-cluster-in-my-living-room-6ik</link>
      <guid>https://dev.to/alistairjcbrown/the-raspberry-pi-cluster-in-my-living-room-6ik</guid>
      <description>&lt;p&gt;There are six Raspberry Pi 4s on a shelf in my living room. They run 24/7, they're all wired directly into the router, and they exist for one fairly specific reason: some cinema websites block GitHub's IP ranges.&lt;/p&gt;

&lt;p&gt;GitHub Actions runners share IP space with a lot of automated traffic, and a handful of venues had decided they didn't want to serve requests from that space. The failures were inconsistent — empty responses, timeouts, bot-detection pages — which made them annoying to diagnose. Once I'd worked out what was actually happening, the fix was straightforward: residential IP addresses. Requests that look like they're coming from someone's home connection, because they are.&lt;/p&gt;

&lt;p&gt;Hence the Pis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e803a71a6usck861uyd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e803a71a6usck861uyd.jpg" alt="Raspberry Pis in mounts" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Pis, Not Just a Cheap PC?
&lt;/h2&gt;

&lt;p&gt;It's a fair question. I set myself a target: &lt;em&gt;£50 or less per Pi&lt;/em&gt;, all-in. That means the Pi 4 itself, an SD card, a power cable, and an ethernet cable. No wiggle room for a fancy case or anything optional. But six Pis at £50 each is £300 — you could buy a reasonable secondhand desktop for that and run six runners on it without breaking a sweat.&lt;/p&gt;

&lt;p&gt;The honest answer is that it didn't start as a deliberate architecture decision. I had one Pi spare, so I set it up as a runner. That was enough at first. As I added more venues and the pipeline got busier, I added another, then another. By the time I had three or four, I was actively buying more rather than reconsidering the approach — partly because they're cheap and low-power (running a desktop 24/7 would cost noticeably more on the electricity bill), but also because I'd started to like the fault tolerance story.&lt;/p&gt;

&lt;p&gt;Each Pi is independent. If one plays up, it takes one runner offline, not all of them. Better yet, there's nothing precious about any individual machine — the setup steps are &lt;a href="https://github.com/clusterflick/self-hosted-workflows?tab=readme-ov-file#setting-up-a-new-runner" rel="noopener noreferrer"&gt;fully documented&lt;/a&gt;, so if a Pi goes wrong I can wipe the SD card and have it back as a runner in under an hour. Cattle, not pets. A single PC running six processes doesn't give you that.&lt;/p&gt;

&lt;p&gt;Pi 4s aren't particularly cheap if you buy them new and in a hurry, but there's a reasonable secondhand market if you're patient. I watched eBay listings and Facebook Marketplace, picked them up when they matched the budget, and that's how I ended up with six of them. A few came without accessories, which meant sourcing cables separately — but even then, it worked out.&lt;/p&gt;

&lt;p&gt;One thing I learned the hard way: &lt;em&gt;the power supply matters more than you'd think&lt;/em&gt;. The Pi 4 is particular about voltage, and one of mine was on an underpowered cable. All the Pis are headless, so there's no screen to hint at what's wrong — it just showed up as one runner that was less reliable than the others, dropping jobs intermittently. It took longer than I'd like to admit to narrow it down to the power supply; swapping it fixed things immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  SD Cards: The Unexpected Bottleneck
&lt;/h2&gt;

&lt;p&gt;The other thing that surprised me was how much the SD cards matter for this use case.&lt;/p&gt;

&lt;p&gt;Most Raspberry Pi guides will tell you any Class 10 card is fine, and for general use that's probably true. But GitHub Actions runners do a lot of I/O — constant git checkouts, caches being read and written, files being created and deleted across every job. Slow cards can appear fine at first, but become a bottleneck once they pick up a job, especially one with a lot of smaller steps. Jobs that should take ten seconds start taking ten times as long, and you can't figure out why until you look at where the time is actually going.&lt;/p&gt;

&lt;p&gt;Swapping to &lt;em&gt;SanDisk Extreme Pro cards&lt;/em&gt; made a noticeable difference — runners were now consistently faster on anything I/O-heavy, which in practice is most jobs. I ended up writing &lt;a href="https://github.com/clusterflick/self-hosted-workflows/blob/f8109243ca07a0b5c5c39cd0b874e81fbf25eb5c/.github/workflows/check-sd-card.yml" rel="noopener noreferrer"&gt;a workflow to test SD card speed&lt;/a&gt; which uses &lt;a href="https://github.com/raspberrypi-ui/agnostics/blob/d77d0e053c884048f6656ee079bc5f3ed834c3e2/data/sdtest.sh" rel="noopener noreferrer"&gt;Raspberry Pi's own speed test script&lt;/a&gt;. It checks whether read and write speeds are fast enough to provide adequate performance, which saves finding out the hard way mid-pipeline (and I'm hoping will let me quickly diagnose if an SD card is degrading).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndc4iziru8wx1h78kxih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndc4iziru8wx1h78kxih.png" alt="Screenshot of self hosted runner status workflow showing SD card speed test results for a specific runner" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other SD card lesson: &lt;em&gt;16GB is too small&lt;/em&gt;. The GitHub Actions runner cache fills up in less than a week of regular use. I have a &lt;a href="https://github.com/clusterflick/self-hosted-workflows/blob/f8109243ca07a0b5c5c39cd0b874e81fbf25eb5c/.github/workflows/free-space.yml" rel="noopener noreferrer"&gt;scheduled workflow to free up space&lt;/a&gt; — it clears the npm cache, removes all Playwright browsers, then reinstalls the latest dependencies and pre-warms everything. It works, but it's a bit of a workaround for a storage problem. I've since bumped everything to 64GB cards, I still run the workflow weekly, and so far everything's running smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Physical Setup
&lt;/h2&gt;

&lt;p&gt;Six Pis sitting loose on a shelf with cables going everywhere is exactly as annoying as it sounds, so I designed a mount to keep things tidy. It's a &lt;em&gt;3D-printed mount&lt;/em&gt; that holds each Pi in place, with enough spacing for airflow and clean cable routing (power cable is supported, SD card is accessible from the top, ethernet cable is hidden underneath).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6u8zinljuhava7uqq5e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6u8zinljuhava7uqq5e.jpg" alt="The cluster — six Pi 4s, all wired, all tidy" width="800" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to print one yourself, I've uploaded the STL files to &lt;a href="https://www.printables.com/model/1571451-raspberry-pi-4-frame-base-stand" rel="noopener noreferrer"&gt;Printables&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujtrc2joz6m2bhrt283b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujtrc2joz6m2bhrt283b.png" alt="Rendering of the 3D model for mounting the Raspberry Pis" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is connected &lt;em&gt;directly to the router via ethernet&lt;/em&gt;. No Wi-Fi. I briefly considered Wi-Fi for the tidiness of it, but I've had too many experiences with Wi-Fi dropouts causing mysterious CI failures, and the whole point of this thing is reliability. Ethernet cables aren't pretty, but they don't drop connections.&lt;/p&gt;

&lt;p&gt;The full cluster sits inside an &lt;a href="https://www.ikea.com/gb/en/p/smarra-box-with-lid-natural-90348063/" rel="noopener noreferrer"&gt;IKEA SMARRA box&lt;/a&gt;. It runs quietly, doesn't generate much heat, and sits in a corner where it's easy to ignore — which is exactly what you want from infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Haven't Covered
&lt;/h2&gt;

&lt;p&gt;Getting the Pis onto the network is the easy bit. Actually registering them as self-hosted GitHub Actions runners, keeping those runners healthy, and managing the runner environment across six machines is its own topic — one for another day.&lt;/p&gt;

&lt;p&gt;The short version for the curious: GitHub provides a script you run on each machine, it registers itself in your repo's settings, and from that point on it just sits there waiting to pick up jobs. The initial setup is straightforward enough. It's everything that comes after — keeping them healthy, diagnosing npm cache issues, hunting down slow runners — where things get more interesting. I do have &lt;a href="https://github.com/clusterflick/self-hosted-workflows/blob/f8109243ca07a0b5c5c39cd0b874e81fbf25eb5c/.github/workflows/runner-stats.yml" rel="noopener noreferrer"&gt;a workflow that reports stats across all runners&lt;/a&gt; — uptime, temperature, disk space remaining — which at least makes it easy to spot a machine that's quietly having a bad time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8rlpal551mthj6s036b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8rlpal551mthj6s036b.png" alt="Screenshot of self hosted runner status workflow showing stats for a specific runner" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; GitHub as Infrastructure — self-hosted runners, secrets management, and using GitHub Actions as the backbone of a daily data pipeline.&lt;/p&gt;

</description>
      <category>raspberrypi</category>
      <category>cicd</category>
      <category>githubactions</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Cleaning Cinema Titles Before You Can Even Search</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 18 Mar 2026 08:55:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/cleaning-cinema-titles-before-you-can-even-search-1463</link>
      <guid>https://dev.to/alistairjcbrown/cleaning-cinema-titles-before-you-can-even-search-1463</guid>
      <description>&lt;p&gt;When &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; first started pulling listings, I assumed the hard part would be the scraping. Getting the data off 250+ different cinema websites, each with their own structure and quirks — that's where the complexity lives, right?&lt;/p&gt;

&lt;p&gt;But before any of that work pays off, before a single TMDB search can happen, there's a problem sitting right at the start of the pipeline: cinema listings don't always give you a clean film title. They give you something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(IMAX) Princess Mononoke: 2025 Re-Release Subtited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or my personal favourite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP &amp;amp; SING-ALONG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of those are going to find anything useful in a TMDB search. So before matching can happen, there's a normalisation step — and it's grown into something with its own test suite of nearly 15,000 cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Obvious Stuff
&lt;/h2&gt;

&lt;p&gt;The easy wins are the patterns you see immediately once you start looking at real listings. Film clubs will attach their branding, and cinemas love adding their series names and event types to the front of a title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bar Trash:
DocHouse:
CLASSIC MATINEE:
Animation at War:
Family Film Club:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the end of titles is just as cluttered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;… + Q&amp;amp;A with Director
… on 35mm film
… (4K Remaster)
… Special Screening
… with Introduction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For all of these, there's a &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/known-removable-phrases.js" rel="noopener noreferrer"&gt;&lt;code&gt;known-removable-phrases.js&lt;/code&gt;&lt;/a&gt; file — a flat list of exact strings and patterns to strip. It currently has around 1,000 entries. The rule for adding to it is simple: if a phrase is a superfluous label added by a venue that isn't part of identifying the film, it goes here. Spelling corrections and encoding fixes are handled separately.&lt;/p&gt;

&lt;p&gt;The list isn't pretty, but it works. After stripping known phrases, &lt;code&gt;BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap&lt;/code&gt; becomes &lt;code&gt;THE ZODIAC KILLER (1971)&lt;/code&gt;. Progress.&lt;/p&gt;
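&lt;p&gt;In JavaScript terms, the stripping step is roughly this — sample phrases only, the real list lives in &lt;code&gt;known-removable-phrases.js&lt;/code&gt;:&lt;/p&gt;

```javascript
// Minimal sketch of the phrase-stripping idea with a few sample entries.
// The real list has around 1,000 of these.
const removablePhrases = [
  "BAR TRASH – ",
  " at Beer Merchants Tap",
  "Family Film Club: ",
];

function stripKnownPhrases(title) {
  // Remove every known superfluous phrase, then tidy the whitespace.
  return removablePhrases
    .reduce((result, phrase) => result.split(phrase).join(""), title)
    .trim();
}

console.log(
  stripKnownPhrases("BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap")
);
// "THE ZODIAC KILLER (1971)"
```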

&lt;h2&gt;
  
  
  The Plus Problem
&lt;/h2&gt;

&lt;p&gt;A lot of venues append extra information to titles using a &lt;code&gt;+&lt;/code&gt; separator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Slade in Flame + Q&amp;amp;A with Noddy Holder
TO A LAND UNKNOWN + PRE-RECORDED Q&amp;amp;A
Goodbye to the Past + pre-recorded intro by Annette Insdorf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The solution is obvious: split on &lt;code&gt;+&lt;/code&gt; and take whatever's before it. Except — and this is where it gets awkward — some legitimate film titles contain a &lt;code&gt;+&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Romeo + Juliet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the actual title of the Baz Luhrmann film. Split naively and you'd search for "Romeo" and find nothing useful. So there's a corrections list that pre-empts the split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Romeo + Juliet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Romeo+Juliet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Removing the spaces makes it invisible to the splitter, then it gets normalised back correctly downstream. It's a bit of a hack, but it does the job.&lt;/p&gt;

&lt;p&gt;The same logic applies to the &lt;code&gt;–&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; separators, which venues also use to attach event context. The pipeline strips what comes after the last separator — unless the result looks wrong, in which case there's probably a correction for it.&lt;/p&gt;
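&lt;p&gt;Sketched out (with assumed helper names, and only the &lt;code&gt;+&lt;/code&gt; case shown), the correction-then-split order looks like this:&lt;/p&gt;

```javascript
// Sketch of the separator split. Corrections run first, collapsing
// legitimate "+" titles so the split can't break them; the collapsed
// form is expanded back to the canonical title downstream.
const plusCorrections = [["Romeo + Juliet", "Romeo+Juliet"]];

function stripAfterSeparator(title) {
  let result = title;
  for (const [from, to] of plusCorrections) {
    result = result.split(from).join(to);
  }
  // Take whatever comes before the last " + " separator
  const index = result.lastIndexOf(" + ");
  return (index === -1 ? result : result.slice(0, index)).trim();
}
```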

&lt;h2&gt;
  
  
  "Presents" and Other Sneaky Prefixes
&lt;/h2&gt;

&lt;p&gt;Some patterns can't be handled with a fixed string list — there are too many variations. So instead we look for signal words to decide what information we can discard. If a title contains &lt;code&gt;presents:&lt;/code&gt;, for example, everything before &lt;code&gt;presents:&lt;/code&gt; is almost certainly not the film title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ghibliotheque presents... Spirited Away
VHS Late Tapes Takeover: LCVA presents POUT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These get handled with a regex match: if &lt;code&gt;presents?:?&lt;/code&gt; appears mid-title, take whatever follows it.&lt;/p&gt;

&lt;p&gt;The same approach works for &lt;code&gt;premiere of:&lt;/code&gt;, &lt;code&gt;screening of:&lt;/code&gt;, &lt;code&gt;retrospective screening of:&lt;/code&gt;, and a handful of others. Each one is a named match rather than a blindly applied strip, so the code can be explicit about what it's doing.&lt;/p&gt;
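&lt;p&gt;A sketch of that pass — the exact patterns here are assumptions, but the shape is a list of named signal phrases tried in order:&lt;/p&gt;

```javascript
// Sketch of the signal-word pass. The exact regexes are assumptions;
// the real code keeps a named match per signal phrase.
const signalPhrases = [
  /\bpresents?\b[.:]*\s*/i,
  /\b(?:retrospective )?screening of:?\s*/i,
  /\bpremiere of:?\s*/i,
];

function takeTitleAfterSignal(title) {
  for (const pattern of signalPhrases) {
    const match = title.match(pattern);
    if (match) {
      // Everything before the signal phrase is venue/event branding
      return title.slice(match.index + match[0].length).trim();
    }
  }
  return title;
}
```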

&lt;h2&gt;
  
  
  The Corrections List
&lt;/h2&gt;

&lt;p&gt;Even after removing known phrases and applying structural patterns, there are titles that are just wrong — or at least not in the form TMDB expects. That's where &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/normalize-title.js" rel="noopener noreferrer"&gt;&lt;code&gt;normalize-title.js&lt;/code&gt;&lt;/a&gt; comes in. It has a &lt;code&gt;corrections&lt;/code&gt; array with around 500 entries, covering everything from typos to venue-specific quirks to completely misnamed films.&lt;/p&gt;

&lt;p&gt;Some are straightforward spelling fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Carvaggio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Caravaggio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Seigfried&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Siegfried&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Labryinth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Labyrinth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some are encoding artefacts or odd formatting choices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;amp;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;½&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; 1/2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
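&lt;p&gt;That first entry fixes double-encoded ampersands. A hedged sketch of the same idea, applied repeatedly until the string is stable (the real list applies each correction as a plain replacement):&lt;/p&gt;

```javascript
// Collapse double-encoded ampersands until none remain. Each pass
// strictly shortens the string, so the loop always terminates.
function fixDoubleEncodedAmpersands(title) {
  let result = title;
  while (result.includes("&amp;")) {
    result = result.split("&amp;").join("&");
  }
  return result;
}
```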



&lt;p&gt;Some are venues getting the actual film title wrong. The BFI listed a film as "Battleground", translating from the original Italian, but its English title is "Battlefield":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Battleground + intro &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Battlefield + intro &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then there are the genuinely weird ones. &lt;code&gt;MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP &amp;amp; SING-ALONG&lt;/code&gt; — that's not a film, it's an event which includes a film.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP &amp;amp; SING-ALONG&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Muppet Christmas Carol&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With hindsight, this is the kind of thing I try to avoid: a one-off correction for a singular event. This probably shouldn't have had a correction applied at all; instead it should have fallen through to the LLM for identification using matching hints.&lt;/p&gt;

&lt;p&gt;One entry I'm particularly fond of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;/^Dr&lt;/span&gt;&lt;span class="se"&gt;\.?&lt;/span&gt;&lt;span class="sr"&gt; Strangelove$/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cinemas almost never write the full title, but having the full title makes a TMDB search much more likely to return the right match.&lt;/p&gt;
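&lt;p&gt;Applying a mixed list like this is straightforward. The shape below is an assumption (mirroring the entries shown above): each correction is a pair whose match may be an exact string or a regex:&lt;/p&gt;

```javascript
// Assumed shape of the corrections list: [match, replacement] pairs,
// where the match is either an exact string or a regex.
const corrections = [
  ["Carvaggio", "Caravaggio"],
  [/^Dr\.? Strangelove$/i,
   "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"],
];

function applyCorrections(title) {
  let result = title;
  for (const [match, replacement] of corrections) {
    result = match instanceof RegExp
      ? result.replace(match, replacement)
      : result.split(match).join(replacement);
  }
  return result;
}
```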

&lt;h2&gt;
  
  
  What Gets Stripped Last
&lt;/h2&gt;

&lt;p&gt;After the corrections and phrase removal, there's a final cleanup pass: diacritics are normalised, smart quotes become straight quotes, soft hyphens are removed, trailing punctuation goes, and leading articles (&lt;code&gt;the&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt;) are stripped (in most cases, not all) so that &lt;code&gt;The Big Lebowski&lt;/code&gt; and &lt;code&gt;Big Lebowski&lt;/code&gt; match the same thing.&lt;/p&gt;
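&lt;p&gt;A hedged sketch of that cleanup pass, not the exact implementation (in particular, the real article-stripping has exceptions this version ignores):&lt;/p&gt;

```javascript
function finalCleanup(title) {
  return title
    .normalize("NFD").replace(/[\u0300-\u036f]/g, "") // fold diacritics away
    .replace(/[\u2018\u2019]/g, "'")                  // smart quotes to straight
    .replace(/\u00ad/g, "")                           // drop soft hyphens
    .replace(/[.,:;!]+$/, "")                         // drop trailing punctuation
    .replace(/^(?:the|a)\s+/i, "")                    // strip leading articles
    .trim();
}
```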

&lt;p&gt;Year suffixes in brackets like &lt;code&gt;(1971)&lt;/code&gt; are kept, because they're genuinely useful disambiguation — &lt;code&gt;Psycho (1960)&lt;/code&gt; is a different film from &lt;code&gt;Psycho (1998)&lt;/code&gt; (and you'll probably want to know which version you're about to watch 😉).&lt;/p&gt;

&lt;p&gt;There's also the theatre performance problem. Some venues list National Theatre Live and Royal Ballet screenings using the same listing format as regular films. &lt;code&gt;NT Live: Dr Strangelove&lt;/code&gt; isn't looking for a film called "Dr Strangelove" — it's looking for the NT Live broadcast of it. There's a whole separate setup for that which gets detected and normalised before this pipeline runs. But that's probably worth its own post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perfect Is the Enemy of Good
&lt;/h2&gt;

&lt;p&gt;The list of corrections is never going to be finished. New venues bring new branding. Films get re-released with different title formats. Cinemas just spell things wrong.&lt;/p&gt;

&lt;p&gt;What the normalisation step needs to do is get &lt;em&gt;most&lt;/em&gt; titles into a clean enough state that the TMDB search returns the right film. The cases it misses — titles that are too ambiguous or too corrupted — fall through to the LLM matching stage, which can handle a messier input. That's the right place for those anyway: the normalisation step is supposed to be fast and cheap, not exhaustive.&lt;/p&gt;

&lt;p&gt;The test suite in &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/tests/normalize-title.test.js" rel="noopener noreferrer"&gt;&lt;code&gt;normalize-title.test.js&lt;/code&gt;&lt;/a&gt; keeps the list honest. Every correction and removable phrase is supposed to have a corresponding test case in &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/tests/test-titles.json" rel="noopener noreferrer"&gt;&lt;code&gt;test-titles.json&lt;/code&gt;&lt;/a&gt;, so there's a record of what each entry is for and a way to verify it doesn't break anything when the list changes. And it gets updated every day as new data comes in.&lt;/p&gt;

&lt;p&gt;It's not elegant. But the alternative — sending &lt;code&gt;BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap&lt;/code&gt; to TMDB and hoping for the best — doesn't work. And now you know why 🍿&lt;/p&gt;

&lt;p&gt;P.S. Shout out to &lt;a href="https://clusterflick.com/film-clubs/bar-trash/" rel="noopener noreferrer"&gt;Bar Trash&lt;/a&gt; for having some of the most consistent and standardised titles ❤️&lt;br&gt;
Those titles make for a great example in this blog post, but they're far from being the most complex ones I need to deal with!&lt;/p&gt;

&lt;p&gt;🎬 A list of the movies mentioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/63959-the-zodiac-killer" rel="noopener noreferrer"&gt;The Zodiac Killer (1971)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/128" rel="noopener noreferrer"&gt;Princess Mononoke (1997)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/10437-the-muppet-christmas-carol" rel="noopener noreferrer"&gt;The Muppet Christmas Carol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/60808-flame" rel="noopener noreferrer"&gt;Flame (1975)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/1214052" rel="noopener noreferrer"&gt;To a Land Unknown (2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/242542-rozstanie" rel="noopener noreferrer"&gt;Goodbye to the Past (1961)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/454-romeo-juliet" rel="noopener noreferrer"&gt;Romeo + Juliet (1996)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/129" rel="noopener noreferrer"&gt;Spirited Away (2001)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/1174246-campo-di-battaglia" rel="noopener noreferrer"&gt;Battlefield (2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/935-dr-strangelove-or-how-i-learned-to-stop-worrying-and-love-the-bomb" rel="noopener noreferrer"&gt;Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/1401957-national-theatre-live-dr-strangelove" rel="noopener noreferrer"&gt;National Theatre Live: Dr. Strangelove (2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/539-psycho" rel="noopener noreferrer"&gt;Psycho (1960)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/11252-psycho" rel="noopener noreferrer"&gt;Psycho (1998)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; &lt;del&gt;Testing Your Prompts Like You Test Your Code&lt;/del&gt;&lt;br&gt;
Unfortunately I haven't finished that work yet, so the next post will instead be &lt;em&gt;The Raspberry Pi Cluster in My Living Room&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>datascience</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Site Performance: Loading 30,000 Showings in a Browser</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 11 Mar 2026 08:47:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/site-performance-loading-30000-showings-in-a-browser-30go</link>
      <guid>https://dev.to/alistairjcbrown/site-performance-loading-30000-showings-in-a-browser-30go</guid>
      <description>&lt;p&gt;At least twice a day, the pipeline scrapes 250+ London cinemas and produces a dataset of 1,500+ films with 30,000+ showings. Then I need to get all of that into a browser.&lt;/p&gt;

&lt;p&gt;Getting the raw data from venues is its own challenge (&lt;a href="https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk"&gt;covered in an earlier post&lt;/a&gt;) but even once you've got it, making it available to users fast and in a useful way has its own set of problems to solve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clusterflick.com/" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; runs entirely as a static site, served from GitHub Pages with no live server. That's a deliberate constraint — the whole project runs on GitHub's free tier, and I'd like to keep it that way (more on that in a future post). But it means the browser has to do more of the work, and that puts performance decisions front and centre.&lt;/p&gt;

&lt;p&gt;By the time data reaches the frontend, it's already been through several pipeline stages — each one producing a GitHub release that the next stage picks up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/clusterflick/data-retrieved/" rel="noopener noreferrer"&gt;&lt;strong&gt;Retrieve:&lt;/strong&gt;&lt;/a&gt; raw HTML, JSON APIs, and scraped pages from all 252 venues

&lt;ul&gt;
&lt;li&gt;~800 MB total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="//github.com/clusterflick/data-transformed/"&gt;&lt;strong&gt;Transform:&lt;/strong&gt;&lt;/a&gt; extracts structured showings from the raw data, matches films against &lt;a href="https://www.themoviedb.org/" rel="noopener noreferrer"&gt;TMDB&lt;/a&gt; and saves the ID of matches

&lt;ul&gt;
&lt;li&gt; down to ~15 MB total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://github.com/clusterflick/data-combined/" rel="noopener noreferrer"&gt;&lt;strong&gt;Combine:&lt;/strong&gt;&lt;/a&gt; merges the films from all venues together and hydrates films that have a TMDB ID with rich metadata (cast, genres, poster images, ratings)

&lt;ul&gt;
&lt;li&gt;~18 MB total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://github.com/clusterflick/clusterflick.com/blob/28ada56182d96a253e218133bcc5edcdd304cc64/scripts/process-combined-data.js" rel="noopener noreferrer"&gt;&lt;strong&gt;Process:&lt;/strong&gt;&lt;/a&gt; strips redundant data, extracts URL prefixes, splits into chunks

&lt;ul&gt;
&lt;li&gt;~5 MB raw, ~1.5 MB gzipped over the wire&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This post is about the decisions in that last step (and one I unmade): getting from the combined JSON to something a browser can load and render quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sl0v05k3t3nwispga3g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sl0v05k3t3nwispga3g.jpg" alt="Clusterflick main page" width="800" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compression Detour
&lt;/h2&gt;

&lt;p&gt;Before building anything clever on the frontend, I wanted to be sure the raw data was as small as possible. I'd been running the JSON through &lt;a href="https://www.npmjs.com/package/compress-json" rel="noopener noreferrer"&gt;&lt;code&gt;compress-json&lt;/code&gt;&lt;/a&gt;, a library that structurally transforms JSON — deduplicating repeated values into lookup tables, encoding types differently. It made the raw file dramatically smaller. As an example, for one of the runs the full dataset without it is 10.97 MB; with it, 4.85 MB. That's a real reduction.&lt;/p&gt;

&lt;p&gt;So I ran a benchmark across every optimisation in the pipeline to see which ones were actually earning their place.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimisation&lt;/th&gt;
&lt;th&gt;Gzipped impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Removing showing overviews&lt;/td&gt;
&lt;td&gt;💪 -6.1% (saves 109 KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;URL prefix extraction&lt;/td&gt;
&lt;td&gt;💪 -5.0% (saves 90 KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Removing IDs&lt;/td&gt;
&lt;td&gt;💪 -2.4% (saves 43 KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Removing false a11y flags&lt;/td&gt;
&lt;td&gt;🤷 ~0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trimming RT data&lt;/td&gt;
&lt;td&gt;🤷 ~0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compress-json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;😱 &lt;strong&gt;+18.5% (hurts!)&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline finding: &lt;code&gt;compress-json&lt;/code&gt; makes the gzipped output &lt;em&gt;larger&lt;/em&gt;. Without it, the gzipped total is 1.43 MB. With it, 1.76 MB. That's 333 KB I was paying to make things worse.&lt;/p&gt;

&lt;p&gt;The reason makes sense once you think about it. Gzip excels at finding repeated byte sequences — exactly what &lt;code&gt;compress-json&lt;/code&gt; was doing first. The two approaches fight each other: &lt;code&gt;compress-json&lt;/code&gt;'s transformed structure is actually &lt;em&gt;harder&lt;/em&gt; for gzip to compress than plain repetitive JSON. Gzip decompression is built into every browser's network stack — native C++ code that runs before JavaScript even sees the response. &lt;code&gt;compress-json&lt;/code&gt; decompression, by contrast, runs on the main thread in JavaScript. So the current pipeline was paying three times: larger transfer size, extra JS bundle weight for the decompress library, and CPU time running &lt;code&gt;decompress()&lt;/code&gt; on every chunk.&lt;/p&gt;

&lt;p&gt;So I deleted it. The "no compress-json" variant still has all the other optimisations applied and lands at 1.43 MB — 19% smaller than before. 🎉&lt;/p&gt;

&lt;p&gt;The two optimisations that turned out to have zero impact — removing false accessibility flags and trimming Rotten Tomatoes fields — were easy to rationalise after the fact. Accessibility data is sparse; very few performances have those flags set at all, so deleting &lt;code&gt;false&lt;/code&gt; values removes almost nothing. The RT fields are a handful of small values per movie. Neither gives gzip much to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting the Data into Chunks
&lt;/h2&gt;

&lt;p&gt;Even at 1.43 MB gzipped, serving the full dataset as a single file would mean users wait for everything before seeing anything. Instead, as part of the data processing it's split into chunks, with a metadata file written alongside them.&lt;/p&gt;

&lt;p&gt;The chunking isn't by movie count — it's by &lt;strong&gt;serialised byte size&lt;/strong&gt;, with a target of ~400 KB per chunk. Chunking by movie count would produce wildly uneven file sizes; a blockbuster showing at 50+ venues generates far more data than a one-week indie run. Chunking by performance count was an earlier approach, but it still produced too much variance — chunk files ranged from 65 KB to 1.2 MB. Switching to byte size brought that down to 16 KB to 727 KB, with the bulk of chunks clustering tightly between 324 KB and 436 KB.&lt;/p&gt;
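&lt;p&gt;Greedy bucketing by serialised size is only a few lines — this is a sketch under assumed names, not the pipeline's actual code:&lt;/p&gt;

```javascript
// Greedy bucketing by serialised byte size. A single movie larger than
// the target necessarily gets a bucket to itself.
const TARGET_BYTES = 400 * 1024;

function chunkBySize(movies, targetBytes = TARGET_BYTES) {
  const chunks = [];
  let current = [];
  let currentBytes = 0;
  for (const movie of movies) {
    const size = Buffer.byteLength(JSON.stringify(movie), "utf8");
    // Flush the current bucket if adding this movie would overflow it
    if (current.length > 0 && currentBytes + size > targetBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(movie);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```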

&lt;p&gt;The remaining outliers are expected. The small tail chunks at the end of the alphabet simply don't have enough movies left to fill a full bucket. The large ones contain individual films whose serialised data alone exceeds the target — a blockbuster with 50+ venues and thousands of performances will do that — so they necessarily get a bucket to themselves.&lt;/p&gt;

&lt;p&gt;Movies are sorted alphabetically by &lt;em&gt;normalised&lt;/em&gt; title before being bucketed — mirroring the default sort order on the site. The idea is that chunk 0 downloads first and contains the movies at the top of the list, which are visible on screen when the page first loads. So the data the user actually sees is most likely to arrive first, and there's less chance of visible updates as subsequent chunks load in below the fold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.meta.a1b2c3d4e5.json
data.0.f6g7h8i9j0.json
data.1.k1l3m5n7o9.json
...
data.&amp;lt;index&amp;gt;.&amp;lt;fingerprint&amp;gt;.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28x0o2opfub7tg5b9frp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28x0o2opfub7tg5b9frp.jpg" alt="Screenshot of the network web development tools showing data chunks loading in" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The metadata file carries the full lookup tables for genres, people, and venues (shared across all movies), the URL prefix table used to reconstruct booking links, and the &lt;code&gt;mapping&lt;/code&gt; that tells the client which chunk contains which movie ID. It's the one file the browser always fetches first — and it's hashed like the chunks, so its filename is baked into &lt;code&gt;NEXT_PUBLIC_DATA_FILENAME&lt;/code&gt; at build time.&lt;/p&gt;

&lt;p&gt;There's one catch with GitHub Pages: it sets a 10-minute cache TTL on everything at the browser level, which means even a fingerprinted file that hasn't changed for weeks gets revalidated every 10 minutes. Cloudflare sits in front of the site and fixes this in two ways: it caches the files at the edge, and it overrides GitHub's cache-control headers so browsers are told to store all JSON files for a year. Since every file — chunks and metadata alike — is fingerprinted, a changed file always means a new URL and a cache miss by design. A first-time visitor fetches from Cloudflare's edge and caches locally for a year. A repeat visitor gets it straight from their browser cache. Either way, they're only ever making a network request for files that have actually changed.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc56350nj7nyjmzbrluvp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc56350nj7nyjmzbrluvp.jpg" alt="Cloudflare cache control rule for JSON files, storing them at an edge cache and setting the browser cache header to 1 year" width="800" height="780"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Once the client has the metadata, &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/state/cinema-data-context.tsx#L204" rel="noopener noreferrer"&gt;&lt;code&gt;CinemaDataProvider&lt;/code&gt; handles the rest&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Priority chunk&lt;/strong&gt; — on a movie detail page, the client looks up the movie's chunk in the mapping and &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/state/cinema-data-context.tsx#L272-L275" rel="noopener noreferrer"&gt;fetches it immediately&lt;/a&gt;. Showings appear before the rest of the dataset has loaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All other chunks in parallel&lt;/strong&gt; — &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/state/cinema-data-context.tsx#L279-L285" rel="noopener noreferrer"&gt;via &lt;code&gt;Promise.allSettled()&lt;/code&gt;&lt;/a&gt;, so a single failed chunk doesn't block everything else from loading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand and prune&lt;/strong&gt; — IDs stripped before serialisation are re-added via &lt;code&gt;expandData()&lt;/code&gt; (restoring the keys that were removed to save bytes), and past performances are stripped before chunks enter React state.&lt;/li&gt;
&lt;/ol&gt;
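&lt;p&gt;That load order can be sketched in a few lines. The metadata shape and helper names (&lt;code&gt;chunkUrls&lt;/code&gt;, &lt;code&gt;mapping&lt;/code&gt;, &lt;code&gt;fetchJson&lt;/code&gt;) are assumptions, not the real provider's API:&lt;/p&gt;

```javascript
// Move the chunk containing the priority movie to the front of the queue.
function prioritiseChunks(chunkUrls, mapping, movieId) {
  const order = [...chunkUrls];
  const index = movieId == null ? undefined : mapping[movieId];
  if (index !== undefined) {
    order.unshift(...order.splice(index, 1));
  }
  return order;
}

// Fetch everything in parallel; one failed chunk doesn't block the rest.
async function loadChunks(fetchJson, meta, movieId) {
  const order = prioritiseChunks(meta.chunkUrls, meta.mapping, movieId);
  const results = await Promise.allSettled(order.map((url) => fetchJson(url)));
  return results
    .filter((result) => result.status === "fulfilled")
    .map((result) => result.value);
}
```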

&lt;h2&gt;
  
  
  Static Export Changes Everything
&lt;/h2&gt;

&lt;p&gt;Clusterflick uses &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; with &lt;code&gt;output: "export"&lt;/code&gt;. There's no live server. Every page is pre-rendered to static HTML during &lt;code&gt;npm run build&lt;/code&gt;, then served from GitHub Pages.&lt;/p&gt;

&lt;p&gt;This shapes every rendering decision. When Next.js docs talk about &lt;a href="https://nextjs.org/docs/app/getting-started/server-and-client-components" rel="noopener noreferrer"&gt;Server Components&lt;/a&gt;, in this context that means "code that runs at build time on a Node process" — not a server handling live requests. Whatever I pre-render is fixed until the next build.&lt;/p&gt;
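&lt;p&gt;For reference, the static-export mode is a single setting:&lt;/p&gt;

```javascript
// next.config.js (or equivalent): pre-render everything at build time
// to static HTML, with no live server.
module.exports = {
  output: "export",
};
```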

&lt;h2&gt;
  
  
  Two Grids on the Home Page
&lt;/h2&gt;

&lt;p&gt;The home page has a slightly odd architecture, and it's worth explaining why.&lt;/p&gt;

&lt;p&gt;At build time, &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/page.tsx" rel="noopener noreferrer"&gt;&lt;code&gt;app/page.tsx&lt;/code&gt;&lt;/a&gt; (a Server Component) reads the chunk files from disk, merges them, applies the default filters — films and shorts, 7-day window — and takes the first 72 results sorted by normalized title. These 72 movies are rendered as a static HTML grid of poster images and links. No JavaScript required. This grid is wrapped in an &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/ssr-only.tsx" rel="noopener noreferrer"&gt;&lt;code&gt;SSROnly&lt;/code&gt; component&lt;/a&gt; that removes itself after hydration.&lt;/p&gt;

&lt;p&gt;So during the initial paint, and for any crawler, there's a real grid of films with real titles and links in the HTML. Once JavaScript loads and mounts, &lt;code&gt;SSROnly&lt;/code&gt; cleans up that static content and hands off to the interactive grid.&lt;/p&gt;

&lt;p&gt;The 72 limit is deliberate. It's enough for a meaningful SEO payload — film titles, poster images, links — without bloating the HTML with hundreds of entries. The real, interactive grid that users actually browse is built entirely client-side with the full dataset, applying any filters which may be in effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtualising 1,500+ Posters
&lt;/h2&gt;

&lt;p&gt;The filter UI is designed to give immediate visual feedback as you change options — in the current design the filter overlay is semi-transparent, so you can see the poster grid updating behind it as you adjust. That only works if rendering is fast. On an earlier design, where the filter controls sat directly above a flat list of results, the lag was obvious and painful: every filter change triggered a re-render of the entire list.&lt;/p&gt;

&lt;p&gt;The solution is &lt;a href="https://github.com/bvaughn/react-virtualized" rel="noopener noreferrer"&gt;&lt;code&gt;react-virtualized&lt;/code&gt;&lt;/a&gt; — specifically its &lt;code&gt;Grid&lt;/code&gt; component combined with &lt;code&gt;WindowScroller&lt;/code&gt;. Rather than rendering the full list, it calculates which cells are currently visible in the viewport and only renders those, plus a small buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WindowScroller&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isScrolling&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;registerChild&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onChildScroll&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;scrollTop&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;registerChild&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt;
        &lt;span class="na"&gt;autoHeight&lt;/span&gt;
        &lt;span class="na"&gt;cellRenderer&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;cellRenderer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;columnCount&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;columnCount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;columnWidth&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;POSTER_WIDTH&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;// 208px per column&lt;/span&gt;
        &lt;span class="na"&gt;rowHeight&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;POSTER_HEIGHT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;    &lt;span class="c1"&gt;// 308px per row&lt;/span&gt;
        &lt;span class="na"&gt;rowCount&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;rowCount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;overscanRowCount&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;               &lt;span class="c1"&gt;// pre-render 3 rows above/below viewport&lt;/span&gt;
        &lt;span class="na"&gt;scrollTop&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;scrollTop&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;isScrolling&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isScrolling&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;onScroll&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;onChildScroll&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="err"&gt;...&lt;/span&gt;
      &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;WindowScroller&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~ &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/page-content.tsx#L186-L213" rel="noopener noreferrer"&gt;&lt;code&gt;src/app/page-content.tsx&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WindowScroller&lt;/code&gt; ties the grid's scroll position to the page's native scroll rather than creating a separate scrollable container. That keeps the browser scrollbar, avoids scroll-jank on mobile, and means the address bar hides naturally on iOS.&lt;/p&gt;

&lt;p&gt;Fixed cell dimensions (always 200×300px with an 8px gap) let react-virtualized calculate row and column positions with simple arithmetic, avoiding expensive DOM measurement. Window width isn't available at build time, so the component initialises with a single-column placeholder and sets real dimensions in a &lt;code&gt;useEffect&lt;/code&gt; after mount.&lt;/p&gt;
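&lt;p&gt;Sketched out (with illustrative helper names, not the repo's actual code), that arithmetic is just division over the fixed cell size:&lt;/p&gt;

```javascript
// Illustrative sketch of the fixed-dimension arithmetic described above.
// POSTER_WIDTH and GAP mirror the 200x300px cells with an 8px gap; the
// helper names are assumptions, not the actual component code.
const POSTER_WIDTH = 200;
const GAP = 8;

// How many fixed-width columns fit in the current viewport.
function getColumnCount(viewportWidth) {
  return Math.max(1, Math.floor(viewportWidth / (POSTER_WIDTH + GAP)));
}

// How many rows are needed to show every movie at that column count.
function getRowCount(movieCount, columnCount) {
  return Math.ceil(movieCount / columnCount);
}
```

A `useEffect` after mount would read the real window width and feed these values into the `Grid` props, replacing the single-column placeholder.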

&lt;p&gt;The first two rows are above the fold on most screens, so &lt;code&gt;next/image&lt;/code&gt; is told to load those eagerly with &lt;code&gt;fetchpriority="high"&lt;/code&gt;. Everything below row 2 is lazy-loaded as the user scrolls.&lt;/p&gt;
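&lt;p&gt;The per-cell decision can be sketched as a tiny helper (a simplification: &lt;code&gt;next/image&lt;/code&gt; expresses eager loading through its &lt;code&gt;priority&lt;/code&gt; prop rather than a prop bag like this, and the function name is hypothetical):&lt;/p&gt;

```javascript
// Hypothetical sketch of the loading strategy per grid row described above.
function getImageLoadingProps(rowIndex) {
  const aboveTheFold = rowIndex < 2; // first two rows are likely visible on load
  return aboveTheFold
    ? { loading: "eager", fetchpriority: "high" }
    : { loading: "lazy", fetchpriority: "auto" };
}
```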

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdwxo08asovdqvcx5317.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdwxo08asovdqvcx5317.jpg" alt="Poster grid showing that only the visible posters are in the DOM" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One wrinkle: the intro section above the grid can be collapsed or expanded, which shifts the grid's offset on the page. &lt;code&gt;WindowScroller&lt;/code&gt; needs to know about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="nf"&gt;requestAnimationFrame&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatchEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;resize&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A synthetic &lt;code&gt;resize&lt;/code&gt; event prompts &lt;code&gt;WindowScroller&lt;/code&gt; to recalculate its position. Not elegant, but it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Movie Detail Pages: Stripping Performances Before They Cross the Wire
&lt;/h2&gt;

&lt;p&gt;Each film has its own pre-rendered page. &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page.tsx#L15-L22" rel="noopener noreferrer"&gt;&lt;code&gt;generateStaticParams()&lt;/code&gt;&lt;/a&gt; iterates every movie at build time and Next.js generates a static HTML file for each — typically 1,500+ pages per build.&lt;/p&gt;
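&lt;p&gt;In rough shape (with stand-in data and a stand-in &lt;code&gt;slugify&lt;/code&gt;, not the repo's actual code), that build-time enumeration looks like:&lt;/p&gt;

```javascript
// Hedged sketch of build-time param generation for a route of this shape;
// `movies` and `slugify` here are illustrative stand-ins.
const movies = [
  { id: 27205, title: "Inception" },
  { id: 157336, title: "Interstellar" },
];

function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Next.js calls this at build time; each returned object becomes one
// pre-rendered /movies/[id]/[slug] page.
async function generateStaticParams() {
  return movies.map((movie) => ({
    id: String(movie.id),
    slug: slugify(movie.title),
  }));
}
```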

&lt;p&gt;The &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page.tsx#L223-L244" rel="noopener noreferrer"&gt;&lt;code&gt;app/movies/[id]/[slug]/page.tsx&lt;/code&gt; Server Component&lt;/a&gt; does the structurally stable work: resolves genres, people, and venues for the film; generates JSON-LD structured data (&lt;code&gt;Movie&lt;/code&gt;, &lt;code&gt;BreadcrumbList&lt;/code&gt;, &lt;code&gt;ScreeningEvent&lt;/code&gt;) for search engine rich results. Then — critically — it strips &lt;code&gt;performances&lt;/code&gt; from the movie prop before passing it to the client component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;performances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;_performances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;movieWithoutPerformances&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means the pre-rendered HTML — and the inline JSON Next.js serialises into it for hydration — only contains movie metadata (title, poster, ratings, cast). The actual showtimes are fetched at runtime by the data context.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page-content.tsx#L47-L54" rel="noopener noreferrer"&gt;&lt;code&gt;app/movies/[id]/[slug]/page-content.tsx&lt;/code&gt; Client Component&lt;/a&gt; calls &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page-content.tsx#L81" rel="noopener noreferrer"&gt;&lt;code&gt;getDataWithPriority(movie.id)&lt;/code&gt;&lt;/a&gt; on mount, which fetches the chunk containing &lt;em&gt;this&lt;/em&gt; film first before loading everything else in parallel. A &lt;code&gt;startTransition&lt;/code&gt; defers the showings computation until after the hero section has rendered — so the poster, title, and ratings appear immediately, with showtimes filling in shortly after.&lt;/p&gt;
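&lt;p&gt;The fetch ordering amounts to "await the chunk this page needs, then fan out for the rest". A minimal sketch (the signature and chunk layout here are assumptions, not the actual API):&lt;/p&gt;

```javascript
// Illustrative sketch of priority-first chunk loading described above.
async function getDataWithPriority(priorityChunk, allChunks, fetchChunk) {
  // Await the chunk containing this film before anything else...
  const first = await fetchChunk(priorityChunk);
  // ...then load the remaining chunks in parallel.
  const rest = await Promise.all(
    allChunks.filter((chunk) => chunk !== priorityChunk).map(fetchChunk),
  );
  return [first, ...rest];
}
```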

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr87rxvaeiwtbqbl5i0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr87rxvaeiwtbqbl5i0.gif" alt="Animation showing performancing loading in after the main page content on the Project Hail Mary movie page" width="1240" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Stands
&lt;/h2&gt;

&lt;p&gt;With all of this in place, I ran Lighthouse against the site across cold and warm cache — averaged over three runs on desktop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Cold cache&lt;/th&gt;
&lt;th&gt;Warm cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lighthouse score&lt;/td&gt;
&lt;td&gt;74/100&lt;/td&gt;
&lt;td&gt;92/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First Contentful Paint&lt;/td&gt;
&lt;td&gt;459ms&lt;/td&gt;
&lt;td&gt;23ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest Contentful Paint&lt;/td&gt;
&lt;td&gt;2.5s&lt;/td&gt;
&lt;td&gt;281ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed Index&lt;/td&gt;
&lt;td&gt;2.5s&lt;/td&gt;
&lt;td&gt;42ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cumulative Layout Shift&lt;/td&gt;
&lt;td&gt;0.197&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transfer size&lt;/td&gt;
&lt;td&gt;5.5 MB&lt;/td&gt;
&lt;td&gt;20 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb17g0ij4td50ozmwq6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb17g0ij4td50ozmwq6g.png" alt="Screenshot of the CLI output which has the same information as the above table" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The warm cache numbers are the point of everything in this post — 308 of 336 network requests served from cache, 5.5 MB down to 20 KB (less than 1% of the data going across the wire), LCP dropping from 2.5s to 281ms (about 10% of the original time). That's what content-hashed files plus a year-long browser TTL actually buys you.&lt;/p&gt;

&lt;p&gt;Cold cache is where there's still work to do. A 74/100 and a 2.5s LCP on first visit aren't bad, but they're not where I'd like them to be. The LCP is the main thing to improve — 2.5s sits right on Google's boundary between "good" and "needs improvement", and it's what's dragging the cold cache score down. The CLS (0.197) is a known trade-off from the SSR grid handing off to the virtualised interactive grid, but given warm cache sits at 0.18 and still scores 92/100, it's clearly not the bottleneck.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; Cleaning Cinema Titles Before You Can Even Search&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>frontend</category>
      <category>performance</category>
      <category>webdev</category>
    </item>
    <item>
      <title>A Brief Detour: Two Writing Challenges and What Came Out of Them</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 04 Mar 2026 08:30:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/a-brief-detour-two-writing-challenges-and-what-came-out-of-them-4h8h</link>
      <guid>https://dev.to/alistairjcbrown/a-brief-detour-two-writing-challenges-and-what-came-out-of-them-4h8h</guid>
      <description>&lt;p&gt;Regular Clusterflick series readers: I got distracted. Twice 😅&lt;/p&gt;

&lt;p&gt;In the last week I entered a couple of dev.to writing challenges, and both turned out to be good excuses to write about things that were already on the series roadmap — just earlier and in a slightly different shape than I'd originally planned.&lt;/p&gt;

&lt;p&gt;The first was the 1️⃣ &lt;a href="https://dev.to/challenges/weekend-2026-02-28"&gt;DEV Weekend Challenge: Community&lt;/a&gt;, which I used to write about the &lt;a href="https://clusterflick.com/film-clubs/" rel="noopener noreferrer"&gt;film club discovery&lt;/a&gt; and &lt;a href="https://clusterflick.com/near-me/" rel="noopener noreferrer"&gt;"near me"&lt;/a&gt; features I'd finally taken the time to build. The second was the 2️⃣ &lt;a href="https://dev.to/challenges/mlh/built-with-google-gemini-02-25-26"&gt;Built with Google Gemini: Writing Challenge&lt;/a&gt;, which pulled forward what was going to be a later post about using LLMs in the data pipeline.&lt;/p&gt;

&lt;p&gt;Both are standalone submissions, but they're very much part of this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/alistairjcbrown/i-built-a-film-club-discovery-tool-for-londons-cinema-community-2md"&gt;Making London's hidden film clubs discoverable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/alistairjcbrown/three-things-i-learned-using-llms-in-a-data-pipeline-51c3"&gt;Three Things I Learned Using LLMs in a Data Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM post in particular covers things I'd have gotten to eventually in this series — the matching pipeline, the &lt;code&gt;reason&lt;/code&gt; key trick, defensive JSON parsing. Worth a read if you've been following along!&lt;/p&gt;

&lt;p&gt;Back to the regular schedule next week 🫡&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; Site Performance: Loading 30,000+ Showings in a Browser&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>ai</category>
      <category>opensource</category>
      <category>clusterflick</category>
    </item>
    <item>
      <title>Three Things I Learned Using LLMs in a Data Pipeline</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Mon, 02 Mar 2026 19:44:27 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/three-things-i-learned-using-llms-in-a-data-pipeline-51c3</link>
      <guid>https://dev.to/alistairjcbrown/three-things-i-learned-using-llms-in-a-data-pipeline-51c3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/mlh-built-with-google-gemini-02-25-26"&gt;Built with Google Gemini: Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built with Google Gemini
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"Ghibliotheque Presents: My Neighbor Totoro + Intro"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a real cinema listing title, but it's not a title you can just search for. And as titles go, it's one of the more straightforward ones. Things get even messier when we get into cinema listing pages. I've seen venues that don't include a year, don't include the director, or give you little more than a title and a one-line description. If you're building an aggregator that needs to identify what's actually showing, you spend a lot of time staring at strings like this.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt;, a cinema aggregator for London that pulls listings from 250+ venues daily. I thought scraping would be the hard part. But figuring out what a listing actually &lt;em&gt;is&lt;/em&gt; — which film, matched to which entry in &lt;a href="https://themoviedb.org" rel="noopener noreferrer"&gt;The Movie DB&lt;/a&gt; — is where a lot of complexity lies. And it's where I've been using Gemini.&lt;/p&gt;

&lt;p&gt;There's a whole layer of work involved in cleaning raw listing strings down to something searchable — that's worth a post of its own — but even with a clean title, the matching problem doesn't go away. Many venues don't include the necessary information to programmatically search using The Movie DB API — just a title and maybe a vague description. Even when they do have more data, e.g. title plus year or even title plus director, it doesn't necessarily uniquely identify a film. And legitimate films with short or common names can be difficult to surface in TMDB search results at all.&lt;/p&gt;

&lt;p&gt;I use Gemini to help at four stages in the identification pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Match against TMDB&lt;/strong&gt; — given the cinema listing and a list of search results from TMDB, Gemini picks the best match. This handles the majority of cases.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/ask-llm-to-review-results.js" rel="noopener noreferrer"&gt;&lt;code&gt;common/ask-llm-to-review-results.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct identification&lt;/strong&gt; — if TMDB search returns nothing useful, I ask Gemini if it recognises the film from the listing alone. Its training data often knows about films that don't surface well through search.

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/ask-llm.js" rel="noopener noreferrer"&gt;&lt;code&gt;common/ask-llm.js&lt;/code&gt;&lt;/a&gt; (The original use of Gemini in the project — everything else has grown from this first step)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify the listing&lt;/strong&gt; — if we still can't identify a film, I ask Gemini what the listing actually &lt;em&gt;is&lt;/em&gt;: a film, a short, a double bill, a quiz night, a live event, a comedy show. That classification feeds into filters on the website, and it determines what happens next in the pipeline — a listing classified as multiple films or shorts triggers its own follow-up steps.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/ask-llm-to-categorise.js" rel="noopener noreferrer"&gt;&lt;code&gt;common/ask-llm-to-categorise.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract multiple films or shorts&lt;/strong&gt; — if a listing is identified as containing multiple films or shorts (a double bill, a shorts programme, a marathon), I ask Gemini to pull out the individual titles so each can be matched separately.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/scripts/transform/identify-multiple-movies.js" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/transform/identify-multiple-movies.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/scripts/transform/identify-shorts.js" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/transform/identify-shorts.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage only fires if the previous one didn't produce a result. That keeps costs down and means Gemini is only doing the hard work when simpler approaches have already failed.&lt;/p&gt;
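&lt;p&gt;That fall-through behaviour is a short-circuiting cascade. A minimal sketch with illustrative stage functions (not the pipeline's actual code):&lt;/p&gt;

```javascript
// Minimal sketch of the stage cascade described above: each stage runs only
// if the earlier stages produced nothing. Stages return null on no result.
async function identifyListing(listing, stages) {
  for (const stage of stages) {
    const result = await stage(listing);
    if (result !== null) return result; // first stage to succeed wins
  }
  return null; // nothing could identify the listing
}
```

Cheap deterministic checks sit early in the array, so the more expensive Gemini calls only run for the listings that genuinely need them.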

&lt;p&gt;The model I'm using is &lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt;. I'd been running on &lt;code&gt;gemini-2.0-flash&lt;/code&gt; for a while and recently upgraded — a one-line change in the code, and I saw no noticeable difference in the identification and categorisation output from the previous run. Free performance improvement!&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Clusterflick is live at &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;clusterflick.com&lt;/a&gt; — 250+ venues and thousands of films across London, updated daily.&lt;/p&gt;

&lt;p&gt;The pipeline code is open source (&lt;a href="https://github.com/clusterflick/scripts" rel="noopener noreferrer"&gt;github.com/clusterflick/scripts&lt;/a&gt;), and runs across GitHub's cloud runners and &lt;a href="https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk"&gt;a cluster of 6 Raspberry Pis in my living room&lt;/a&gt; — so if the judges are looking for a good home for that prize, I have a shelf ready! 🍿&lt;/p&gt;

&lt;p&gt;The parsing layer discussed below is in &lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/llm-client.js" rel="noopener noreferrer"&gt;llm-client.js&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Asking for a reason made the model more honest
&lt;/h3&gt;

&lt;p&gt;When I first started asking Gemini to match listings to TMDB results, I was asking it to return a match and a confidence score (I use 0–9). It worked, but I was getting too many confident wrong answers — the model would pick something and report high confidence even when it was clearly a stretch.&lt;/p&gt;

&lt;p&gt;The fix was adding a &lt;code&gt;reason&lt;/code&gt; key to the expected JSON response. Forcing the model to articulate &lt;em&gt;why&lt;/em&gt; it had chosen a match made it noticeably more cautious. It's like the difference between someone blurting out an answer and someone having to show their working. The false positives dropped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Listing matched description of magical forest spirits and animation style"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8392&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I now apply the same pattern wherever I need the model to make a judgement call. Structured output with a reason field is the single most effective prompt change I've made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Gemini to improve my prompts
&lt;/h3&gt;

&lt;p&gt;At some point I realised I was spending more time tweaking prompts than writing actual pipeline code. So I started asking Gemini to critique and rewrite them for me.&lt;/p&gt;

&lt;p&gt;It sounds circular, but it works. The model is better than I am at structuring instructions for itself — clearer constraints, better edge case handling, more consistent output. Now when a prompt isn't giving me the results I want, my first step is to paste it into a fresh conversation and ask the model what's wrong with it and how it would rewrite it.&lt;/p&gt;

&lt;p&gt;The results are often prompts I wouldn't have written myself. More explicit about edge cases. Better at specifying output format. And because the model wrote them, they tend to produce more predictable responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defensive parsing is non-negotiable
&lt;/h3&gt;

&lt;p&gt;Even with well-crafted prompts, LLM output in production will occasionally be malformed. I found this out when the model truncated a film overview mid-sentence and left a trailing backslash — one bad character broke &lt;code&gt;JSON.parse&lt;/code&gt; and failed the entire job.&lt;/p&gt;

&lt;p&gt;The longer the pipeline ran, the more edge cases surfaced. The model occasionally hallucinates fields that aren't in the schema (&lt;code&gt;backdrop_path&lt;/code&gt; appearing uninvited was a fun one). It sometimes leaves unescaped quotes inside string values. Markdown code fences show up often enough that stripping them became standard. Each of these is now a line in the sanitisation layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chatSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Unwrap the string if it's been wrapped in a markdown block&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;correctedJsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonString&lt;/span&gt;
  &lt;span class="c1"&gt;// Apply corrections for malformed escape characters (perhaps due to truncation)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\\(?![&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\/&lt;/span&gt;&lt;span class="sr"&gt;bfnrtu&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|u&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-9a-fA-F&lt;/span&gt;&lt;span class="se"&gt;]{4})&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Apply corrections for hallucinated invalid additions&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/"backdrop_path": "&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;,&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Fix unescaped quotes within the "reason" field value&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sr"&gt;/"reason"&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*:&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*"&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;,}&lt;/span&gt;&lt;span class="se"&gt;])&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reasonContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;terminator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fixed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;reasonContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(?&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;!&lt;/span&gt;&lt;span class="se"&gt;\\)&lt;/span&gt;&lt;span class="sr"&gt;"/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`"reason":"&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fixed&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;terminator&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;correctedJsonString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error parsing LLM answer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--- Original response: -----------------------&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--- Corrected response: ----------------------&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;correctedJsonString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line in there exists because of a real production issue. Treat LLM responses as untrusted input, sanitise before you parse, and log both the original and corrected response when things go wrong — you'll want that context when debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Gemini Feedback
&lt;/h2&gt;

&lt;p&gt;Flash-lite has been reliable and cheap, which matters when you're running a pipeline daily across hundreds of venues and thousands of films. Cost has stayed predictable as the number of venues has grown, which is exactly what I needed.&lt;/p&gt;

&lt;p&gt;One deliberate choice worth mentioning: I run with &lt;code&gt;temperature: 0&lt;/code&gt;. This is a data pipeline, not a creative writing tool — I want output that is as deterministic and consistent as possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;generationConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;topP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxOutputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The upgrade from 2.0 to 2.5 was painless — one line change, no prompt tuning needed. To confirm nothing had shifted, I ran the pipeline twice with each model version and compared the transformed output. No noticeable differences for any venues. That kind of stability is worth a lot in production.&lt;/p&gt;

&lt;p&gt;The main frustration I haven't fully solved is flip-flopping. The pipeline runs daily, and occasionally a listing that was confidently matched to film X on one run comes back as film Y the next. The confidence is right on the edge either way — only one can be right, or both can be wrong — and &lt;code&gt;temperature: 0&lt;/code&gt; helps but doesn't eliminate it. I'd love better signalling when the model is genuinely on the fence, rather than having to infer uncertainty from a confidence score that turns out not to be reliable enough to always act on.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>geminireflections</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Making London's hidden film clubs discoverable</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Sun, 01 Mar 2026 11:50:52 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/i-built-a-film-club-discovery-tool-for-londons-cinema-community-2md</link>
      <guid>https://dev.to/alistairjcbrown/i-built-a-film-club-discovery-tool-for-londons-cinema-community-2md</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/weekend-2026-02-28"&gt;DEV Weekend Challenge: Community&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Community
&lt;/h2&gt;

&lt;p&gt;I've spent the last year building &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; — a site that pulls together cinema listings from across London so you can see everything showing, everywhere, without jumping between a dozen different websites. It started as a personal itch: I just wanted to know what was on (for the backstory, &lt;a href="https://dev.to/alistairjcbrown/building-clusterflick-a-london-cinema-aggregator-kk3"&gt;see my intro post&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But the more I used it, the more I realised I was only solving half the problem. I could tell you &lt;em&gt;what&lt;/em&gt; was showing at &lt;em&gt;which venue&lt;/em&gt; — but I couldn't tell you if the screening was part of a &lt;strong&gt;film club&lt;/strong&gt;, whether the club screenings were accessible, or even that the club existed at all. London has a genuinely brilliant film club scene: community cinemas, genre nights, archive screenings, disability-led clubs. Most of them are invisible unless you already know to look for them.&lt;/p&gt;

&lt;p&gt;That felt wrong. These communities deserve better than a buried events page most people never find.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Two new features, both aimed at making London's film club community more discoverable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Film Club Pages
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://clusterflick.com/film-clubs" rel="noopener noreferrer"&gt;clusterflick.com/film-clubs&lt;/a&gt; gives each film club its own dedicated page. Each page shows their logo, a short description of who they are and what they programme, links back to their own site, and — crucially — pulls together their full upcoming lineup across &lt;em&gt;all&lt;/em&gt; the venues they screen at. A lot of clubs move around; they're not tied to a single cinema. Clusterflick now reflects that.&lt;/p&gt;

&lt;p&gt;To give a sense of the range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://clusterflick.com/film-clubs/bar-trash/" rel="noopener noreferrer"&gt;Bar Trash&lt;/a&gt; programmes cult and curiosity films for people who've exhausted the mainstream;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://clusterflick.com/film-clubs/pitchblack-playback/" rel="noopener noreferrer"&gt;Pitchblack Playback&lt;/a&gt; runs immersive listening sessions in the dark, using cinema sound systems the way most people never get to hear them;&lt;/li&gt;
&lt;li&gt;and &lt;a href="https://clusterflick.com/film-clubs/lost-reels/" rel="noopener noreferrer"&gt;Lost Reels&lt;/a&gt; specialises in bringing forgotten, lost, or otherwise unavailable films back to UK screens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three very different clubs, all doing something you won't find on a standard listings site, and all working across multiple venues.&lt;/p&gt;

&lt;p&gt;I also included accessibility information on each club page, surfaced directly from the screening data. If a club regularly programmes relaxed screenings or subtitled showings, that's highlighted. It shouldn't take three clicks to find out whether a club is somewhere you can actually go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Near Me
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://clusterflick.com/near-me" rel="noopener noreferrer"&gt;clusterflick.com/near-me&lt;/a&gt; uses the browser's location API to show you what's geographically closest to wherever you are right now — venues, films showing there, and the film clubs attached to those screenings. It's not trying to be Google Maps. The goal is simpler: give someone a starting point. "What's on near me tonight?" is one of the most natural questions in the world, and it's surprisingly hard to answer if you don't already know which cinemas are in your area. And alongside "what's on near me?", it now also answers "what film clubs are near me?" — surfacing the clubs connected to those local venues.&lt;/p&gt;

&lt;p&gt;Together, these two features turn Clusterflick from a listings aggregator into something closer to a community directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Both features are live now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎬 Film clubs: &lt;a href="https://clusterflick.com/film-clubs" rel="noopener noreferrer"&gt;clusterflick.com/film-clubs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📍 Near me: &lt;a href="https://clusterflick.com/near-me" rel="noopener noreferrer"&gt;clusterflick.com/near-me&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/9Kc8_OBBwic"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmsizp1bnltbiqvvx6o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmsizp1bnltbiqvvx6o7.png" alt="Bar Trash Film Club page on Clusterflick"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3uxpswcihzuikeqsexz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3uxpswcihzuikeqsexz.png" alt="Near You page on Clusterflick, showing Film Clubs in Hackney"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/clusterflick" rel="noopener noreferrer"&gt;
        clusterflick
      &lt;/a&gt; / &lt;a href="https://github.com/clusterflick/clusterflick.com" rel="noopener noreferrer"&gt;
        clusterflick.com
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Code for the clusterflick website
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Clusterflick&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://clusterflick.com" rel="nofollow noopener noreferrer"&gt;clusterflick.com&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://main--6984c607d80835bfe88c8309.chromatic.com" rel="nofollow noopener noreferrer"&gt;Storybook (Chromatic)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Every film, every cinema, one place.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Clusterflick is an open-source web app that aggregates film screenings from
across London cinemas into a single, searchable interface. Compare screenings,
find showtimes, and discover what's on — whether you're chasing new releases or
cult classics.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Cinema Listings&lt;/strong&gt; — Browse film screenings from 250+ London cinemas
in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Movie Data&lt;/strong&gt; — View ratings and reviews from IMDb, Letterboxd,
Metacritic, and Rotten Tomatoes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Event Types&lt;/strong&gt; — Find movies, TV screenings, comedy, music events,
talks, workshops, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Venues &amp;amp; Boroughs&lt;/strong&gt; — Browse all cinemas by venue or explore all 33 London
boroughs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Festival Pages&lt;/strong&gt; — Dedicated pages for London film festivals with full
programme listings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility Filters&lt;/strong&gt; — Filter by audio description, subtitles, hard of
hearing support, relaxed screenings, and baby-friendly showings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geolocation&lt;/strong&gt; — Sort venues by distance from your current location&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shareable Filters&lt;/strong&gt; —…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/clusterflick/clusterflick.com" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;And the data pipeline that feeds the cinema data the site relies on is here: &lt;a href="https://github.com/clusterflick/data-combined" rel="noopener noreferrer"&gt;github.com/clusterflick/data-combined&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;The site is built with &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; and TypeScript, hosted on GitHub Pages. The film club pages are statically generated — all the data is known ahead of time, so they can be fully built at deploy time (GitHub Pages only serves static files, so there's no server to render on). Near Me is the opposite: since it depends on the user's location, there's nothing to pre-render. The venue and screening data loads client-side, and the results appear once both that data and the user's location are available.&lt;/p&gt;
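
&lt;p&gt;As a rough sketch of the build-time half (the club list here is made up for illustration; the real data comes from the pipeline), a Next.js App Router page can export &lt;code&gt;generateStaticParams&lt;/code&gt; so every club page exists as static HTML at deploy:&lt;/p&gt;

```javascript
// Hypothetical sketch: in an App Router project this would live in
// something like app/film-clubs/[slug]/page.tsx. The club list is
// illustrative, not Clusterflick's actual data source.
const filmClubs = [
  { slug: "bar-trash", name: "Bar Trash" },
  { slug: "pitchblack-playback", name: "Pitchblack Playback" },
  { slug: "lost-reels", name: "Lost Reels" },
];

// Next.js calls this at build time, producing one static page per club.
function generateStaticParams() {
  return filmClubs.map(({ slug }) => ({ slug }));
}
```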

&lt;p&gt;The &lt;strong&gt;Near Me&lt;/strong&gt; logic is straightforward in principle: grab the user's coordinates from the browser's Geolocation API, load the cinema location data from the data pipeline, calculate distances, sort, render. The trickier part was deciding what "near" means when you're in London. After some trial and error, 2 miles turned out to be the sweet spot — enough to surface a decent set of options without stretching the definition of "nearby" too far.&lt;/p&gt;
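
&lt;p&gt;The distance step is small enough to sketch in full. Something like the following (names are illustrative, not the actual Clusterflick code) covers it, using the haversine formula:&lt;/p&gt;

```javascript
// Haversine distance plus the 2-mile "near" cutoff (illustrative sketch).
const EARTH_RADIUS_MILES = 3958.8;

function haversineMiles(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(h));
}

// Attach distances, keep anything within maxMiles, closest first.
function venuesNearby(userLocation, venues, maxMiles = 2) {
  return venues
    .map((venue) => ({ ...venue, distance: haversineMiles(userLocation, venue) }))
    .filter((venue) => maxMiles >= venue.distance)
    .sort((a, b) => a.distance - b.distance);
}
```

&lt;p&gt;In the browser, &lt;code&gt;navigator.geolocation.getCurrentPosition&lt;/code&gt; supplies the user's coordinates.&lt;/p&gt;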

&lt;p&gt;For the &lt;strong&gt;film club pages&lt;/strong&gt;, the main work was research and curation. I used Claude to help with the initial research pass — pulling together descriptions, verifying club details, and drafting copy — then reviewed and edited everything manually. The club-to-screening relationships come from the data pipeline, which already tags screenings with their organiser where that data is available. In the end I've added 22 clubs to the system, and over time I'll continue to add more.&lt;/p&gt;

&lt;p&gt;CI/CD runs via GitHub Actions. The data pipeline runs twice a day, and the site rebuilds automatically each time it finishes — so listings stay fresh without any manual intervention. I can also kick off a deployment manually when there are site updates to ship.&lt;/p&gt;
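
&lt;p&gt;The triggers are simple to express in a workflow file. A sketch (schedule times and names are invented, and this is not the real workflow; a &lt;code&gt;repository_dispatch&lt;/code&gt; event is the usual way to kick off a rebuild when a pipeline in another repository finishes):&lt;/p&gt;

```yaml
# Illustrative sketch of the site's rebuild triggers, not the actual workflow.
name: rebuild-site
on:
  schedule:
    - cron: "0 3,15 * * *"  # roughly matches a twice-daily pipeline
  workflow_dispatch: {}     # manual deploys when there are site updates
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build
```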

&lt;p&gt;This has been &lt;a href="https://github.com/orgs/clusterflick/projects/3/views/1" rel="noopener noreferrer"&gt;sitting in my GitHub issues&lt;/a&gt; for the last few months — five separate issues, all variations on the same ask: "what's nearby?" and "how do I find film clubs?". I kept kicking them down the road. This weekend challenge was the forcing function I needed to actually ship them. 🎉&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>weekendchallenge</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Getting the Data Model Right: Movie -&gt; Showings -&gt; Performances</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 25 Feb 2026 08:47:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/getting-the-data-model-right-movie-showings-performances-25pm</link>
      <guid>https://dev.to/alistairjcbrown/getting-the-data-model-right-movie-showings-performances-25pm</guid>
      <description>&lt;p&gt;When I started building cinema aggregation tooling — pulling listings from multiple independent cinemas — the first real decision was the data model. I've fought bad schemas before. So I sat with this one for a while before writing any code.&lt;/p&gt;

&lt;p&gt;The hierarchy I landed on is &lt;strong&gt;Movie → Showings → Performances&lt;/strong&gt;, and while it might sound over-engineered at first glance, every layer earns its place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just Movie → Performances?
&lt;/h2&gt;

&lt;p&gt;My first schema was essentially flat. A movie had a title, some overview metadata (directors, actors, duration), and an array of performances — times you could go and see it. Simple enough, and it worked fine when I was dealing with a single cinema's listings.&lt;/p&gt;

&lt;p&gt;But a cinema doesn't just &lt;em&gt;show a film&lt;/em&gt;. It shows &lt;strong&gt;variants&lt;/strong&gt; of a screening. Take &lt;a href="https://clusterflick.com/venues/hackney-picturehouse/" rel="noopener noreferrer"&gt;Hackney Picturehouse&lt;/a&gt;'s 40th anniversary run of &lt;em&gt;&lt;a href="https://letterboxd.com/film/labyrinth/" rel="noopener noreferrer"&gt;Labyrinth&lt;/a&gt;&lt;/em&gt;. They didn't just list it once with a bunch of times — they had regular showings, a "Kids' Club" baby-friendly screening, and a "Relaxed Screening" for folks needing additional support, including neurodivergent audiences and those living with dementia. These aren't just different times — they're fundamentally different experiences, each with their own listing page, their own description, and their own set of performance slots.&lt;/p&gt;

&lt;p&gt;That middle layer — the &lt;strong&gt;Showing&lt;/strong&gt; — captures this. A Showing represents one cinema's particular presentation of a movie. It carries the variant-specific context: the URL for that listing, any notes about what makes it different, and its own array of performances underneath. Hackney Picturehouse's &lt;em&gt;Labyrinth&lt;/em&gt; becomes three Showings, each with their own performances — rather than one flat list of times where you have to squint at freetext notes to figure out which screening is which.&lt;/p&gt;

&lt;h2&gt;
  
  
  The original schema
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/alistairjcbrown/hackney-cinema-calendar/blob/main/schema.json" rel="noopener noreferrer"&gt;The first version of my transform schema&lt;/a&gt; — the contract that each cinema's scraper had to produce — looked roughly like this: a flat array of objects, each with a &lt;code&gt;title&lt;/code&gt;, a &lt;code&gt;url&lt;/code&gt;, an &lt;code&gt;overview&lt;/code&gt; block of metadata, and an array of &lt;code&gt;performances&lt;/code&gt;. Each performance had a &lt;code&gt;time&lt;/code&gt;, optional &lt;code&gt;screen&lt;/code&gt;, freetext &lt;code&gt;notes&lt;/code&gt;, and a &lt;code&gt;bookingUrl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It got the job done for a single venue. But it was doing too much in too few layers. The "notes" field on each performance was carrying all the variant information as unstructured text. Categories lived in the overview, but there was no way to distinguish between a film, a live comedy night, and a quiz. Duration was required, which made sense &lt;a href="https://dev.to/alistairjcbrown/calendar-feeds-where-it-all-started-27o2"&gt;when we were only generating calendar events&lt;/a&gt;, but caused problems when the data was missing. And there was no hook for enriching the data with external sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The evolved schema introduces several things the original couldn't support cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;showingId&lt;/code&gt;&lt;/strong&gt; gives each showing a stable identity. This matters when you're deduplicating across sources or tracking what's changed between scrapes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;category&lt;/code&gt; enum&lt;/strong&gt; (&lt;code&gt;movie&lt;/code&gt;, &lt;code&gt;tv&lt;/code&gt;, &lt;code&gt;quiz&lt;/code&gt;, &lt;code&gt;comedy&lt;/code&gt;, &lt;code&gt;music&lt;/code&gt;, &lt;code&gt;talk&lt;/code&gt;, &lt;code&gt;workshop&lt;/code&gt;, &lt;code&gt;shorts&lt;/code&gt;, &lt;code&gt;event&lt;/code&gt;) acknowledges that modern independent cinemas are not just cinemas. They host all kinds of events, and your data model needs to represent that without shoehorning everything into a film-shaped hole. It also set the scene for going beyond cinemas to any venue that screens films and might have other interesting events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured accessibility data&lt;/strong&gt; at the performance level replaces freetext notes for things like audio description, baby-friendly screenings, hard-of-hearing support, relaxed sessions, and subtitles. This is crucial — accessibility isn't a property of the movie, or even the showing. It's a property of &lt;em&gt;that specific screening at that specific time&lt;/em&gt;. A Tuesday afternoon showing might be relaxed; the Saturday evening one isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;status&lt;/code&gt; object&lt;/strong&gt; on each performance captures things like whether it's sold out. Again, this is inherently performance-level data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External enrichment fields&lt;/strong&gt; — &lt;code&gt;themoviedb&lt;/code&gt; and &lt;code&gt;themoviedbs&lt;/code&gt; (plural) — provide the hook for hydrating listings with data from TMDB. The singular version covers standard films; the plural handles double bills or curated screening programmes where a single showing maps to multiple movies.&lt;/p&gt;

&lt;p&gt;And several small refinements: &lt;code&gt;duration&lt;/code&gt; is no longer required (because a quiz night doesn't have a runtime), &lt;code&gt;year&lt;/code&gt; was added to the overview, &lt;code&gt;classification&lt;/code&gt; replaced the awkwardly-named &lt;code&gt;age-restriction&lt;/code&gt;, and &lt;code&gt;additionalProperties: false&lt;/code&gt; was added throughout the schema to keep the data tight when validating.&lt;/p&gt;
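
&lt;p&gt;To make that concrete, here's a hand-written sketch of what a single showing might look like under the evolved schema (every value is invented; the authoritative definition lives in the repo):&lt;/p&gt;

```json
{
  "showingId": "hackney-picturehouse-labyrinth-relaxed",
  "category": "movie",
  "title": "Labyrinth (Relaxed Screening)",
  "url": "https://example.com/listings/labyrinth-relaxed",
  "overview": {
    "year": 1986,
    "classification": "U",
    "duration": 101
  },
  "themoviedb": 12345,
  "performances": [
    {
      "time": "2026-02-21T13:00:00Z",
      "accessibility": { "relaxedSession": true, "subtitled": false },
      "status": { "soldOut": false },
      "bookingUrl": "https://example.com/book/labyrinth-relaxed"
    }
  ]
}
```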

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr52o45anpk4k4u60qtv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr52o45anpk4k4u60qtv4.png" alt="Entity relationship style diagram of the final transform schema" width="800" height="964"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it gets interesting: combining venues
&lt;/h2&gt;

&lt;p&gt;The transform schema represents what comes out of a single venue's scrape. Each cinema produces its own array of showings. But the aggregation site needs to combine these into a unified view: one movie, with showings from multiple cinemas, each with their own performances.&lt;/p&gt;

&lt;p&gt;This is where the hierarchy really pays off. The Movie → Showings → Performances structure scales naturally from single-venue to multi-venue. You don't need to restructure anything — you just group showings under a shared movie identity.&lt;/p&gt;

&lt;p&gt;But combining also means deduplicating, and that's where things get nuanced. When the same movie appears at three different cinemas, you'll have overlapping metadata at different levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Director and cast info&lt;/strong&gt; might exist in the showing-level overview (scraped from the cinema's own listing) &lt;em&gt;and&lt;/em&gt; at the movie level (from TMDB). Which do you trust? Usually the external source is more reliable and complete, but not always — a cinema might list a special guest or a different cut.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility information&lt;/strong&gt; is firmly performance-level. No deduplication needed — it's inherently specific to that time slot at that venue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Categories and genres&lt;/strong&gt; can drift between sources. One cinema might tag something as "Drama", another as "Drama / Thriller", and TMDB might call it "Drama, Crime". You need a strategy for reconciling these.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deduplication isn't a single operation — it's a per-field decision about which source of truth wins at which level of the hierarchy. Having clean separation between movies, showings, and performances makes those decisions much more tractable than they'd be in a flat structure.&lt;/p&gt;
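
&lt;p&gt;A sketch of what per-field precedence looks like in practice (field and source names are illustrative, not the real pipeline code):&lt;/p&gt;

```javascript
// Illustrative per-field merge when the same movie appears at several venues.
// Each field decides which source of truth wins.
function mergeMovie(tmdbData, showings) {
  const fromShowings = (pick) => showings.flatMap((s) => pick(s.overview) ?? []);
  return {
    // Core metadata: prefer the external source, fall back to scraped data.
    directors: tmdbData.directors?.length
      ? tmdbData.directors
      : fromShowings((o) => o.directors),
    // Genres drift between sources, so union them instead of picking one.
    genres: [...new Set([...(tmdbData.genres ?? []), ...fromShowings((o) => o.genres)])],
    // Performance-level data (accessibility, sold-out status) is never
    // deduplicated: it stays attached to each showing.
    showings,
  };
}
```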

&lt;h2&gt;
  
  
  The payoff
&lt;/h2&gt;

&lt;p&gt;Spending time upfront on the data model meant that when complexity arrived — new venue types, accessibility requirements, external data enrichment, multi-venue aggregation — the schema absorbed it instead of fighting it. The hierarchy isn't clever for its own sake; it maps onto how cinemas actually programme their events, and that's what makes it hold up.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; &lt;del&gt;Site Performance: Loading 30,000+ Showings in a Browser&lt;/del&gt;&lt;br&gt;
Change in the schedule: &lt;a href="https://dev.to/alistairjcbrown/a-brief-detour-two-writing-challenges-and-what-came-out-of-them-4h8h"&gt;A Brief Detour: Two Writing Challenges and What Came Out of Them&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>json</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Scaling From 3 Cinemas to 240+ Venues: What Broke and What Evolved</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 18 Feb 2026 08:47:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk</link>
      <guid>https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk</guid>
      <description>&lt;p&gt;When I started scraping London cinema listings, I had three venues and a simple script. Fetch a page, parse it, done. Fast forward to today: 240+ venues, half a dozen different platform types, and a pipeline that runs daily across both GitHub's cloud runners and a cluster of 6 Raspberry Pis in my living room.&lt;/p&gt;

&lt;p&gt;Here's what I learned about building extraction systems that scale, and the architectural decisions that emerged from necessity rather than planning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Retrieve/Transform Split: How Purity Became Practical
&lt;/h2&gt;

&lt;p&gt;Early on, I had a simple mental model: &lt;code&gt;retrieve&lt;/code&gt; grabs the main page, &lt;code&gt;transform&lt;/code&gt; figures out what to do with it. If transform needed more data, it just... made more requests. Simple enough, right?&lt;/p&gt;

&lt;p&gt;Wrong 😅&lt;/p&gt;

&lt;p&gt;This made transform &lt;em&gt;impure&lt;/em&gt;. It was making network calls, which created a cascading set of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging was a nightmare&lt;/strong&gt; - request code wasn't all in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching became complicated&lt;/strong&gt; - you now have to cache in two different jobs. If you clear the cache of one job, what impact will that have on the other job?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing was fragile&lt;/strong&gt; - you couldn't test transform logic without network access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution wasn't about network topology or runner management. It was about simplicity and separation of concerns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The new contract is simple:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;retrieve&lt;/code&gt; does &lt;em&gt;all&lt;/em&gt; the fetching - even if it needs to parse HTML to find links to follow&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transform&lt;/code&gt; makes &lt;em&gt;zero&lt;/em&gt; network calls - it takes inputs and produces data that adheres to the schema; that's the guarantee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each function has a single responsibility. Retrieve handles the messy, stateful, network-dependent work. Transform does the pure, testable, repeatable work.&lt;/p&gt;

&lt;p&gt;In practice, this means retrieve might fetch a main page, parse it for film listing URLs, fetch all of those, and hand everything to transform as a bundle. Transform just processes what it's given.&lt;/p&gt;
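
&lt;p&gt;The shape of that contract, stripped right down (function names and data shapes are made up for illustration; the real retrieve parses HTML rather than tidy JSON):&lt;/p&gt;

```javascript
// Sketch of the retrieve/transform split, not the actual pipeline code.

// retrieve: all network access lives here, including following
// discovered links. The result is a plain data bundle.
async function retrieve(fetchJson) {
  const mainPage = await fetchJson("/whats-on.json");
  const listingPages = await Promise.all(
    mainPage.filmUrls.map((url) => fetchJson(url)),
  );
  return { mainPage, listingPages };
}

// transform: pure. Same bundle in, same showings out. No network calls,
// so it can be re-run against an old release bundle at any time.
function transform(bundle) {
  return bundle.listingPages.map((listing) => ({
    title: listing.title,
    performances: listing.times.map((time) => ({ time })),
  }));
}
```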

&lt;p&gt;This matters for more than just clean code. Once all retrieves complete, the pipeline creates a GitHub release with an immutable blob of all the raw data. Then transform jobs run against that release. If I change downstream code later, I can re-run transforms on old data without hitting anyone's servers again. That only works if transforms are pure functions.&lt;/p&gt;

&lt;p&gt;The retrieve workflow lives in one repository, transform in another. Each creates releases named by timestamp. Clean separation all the way down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Variety of Retrieval Strategies
&lt;/h2&gt;

&lt;p&gt;With 240 venues, you see every possible variation of how a cinema might publish its data. Here's what emerged:&lt;/p&gt;

&lt;h3&gt;
  
  
  Single Page: The Dream
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/prince-charles-cinema/" rel="noopener noreferrer"&gt;Prince Charles Cinema&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One big page with everything you need. Parse it once, you're done. These are vanishingly rare and I treasure them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Main Page + Listing Pages: The Common Pattern
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/the-castle-cinema/" rel="noopener noreferrer"&gt;The Castle Cinema&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is by far the most common pattern. You fetch the main "what's on" page to discover what films are showing, then fetch each film's individual listing page for the rich data you need for proper matching - full synopsis, runtime, cast, directors.&lt;/p&gt;

&lt;p&gt;It's two-stage, but predictable. Retrieve handles both stages, transform gets a complete dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON/API Endpoints: The Developer's Joy
&lt;/h3&gt;

&lt;p&gt;When a cinema exposes a proper API, everything gets easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal JSON:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/cineworld-leicester-square/" rel="noopener noreferrer"&gt;Cineworld&lt;/a&gt; has straightforward endpoints. Hit them, parse the response, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Big Standard (OCAPI):&lt;/strong&gt; This is where it gets interesting. Open Commerce API (OCAPI) is a standardised ticketing platform API used by both &lt;a href="https://clusterflick.com/venues/curzon-mayfair/" rel="noopener noreferrer"&gt;Curzon&lt;/a&gt; and &lt;a href="https://clusterflick.com/venues/odeon-luxe-leicester-square/" rel="noopener noreferrer"&gt;ODEON&lt;/a&gt;. One unified codebase handles two of the biggest cinema chains in London. When you discover a new cinema runs on OCAPI, it's trivial to add - just point the existing module at their endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weird JSON:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/metro-cinema/" rel="noopener noreferrer"&gt;Metro Cinema&lt;/a&gt; technically has a JSON API, but it requires signed requests using an API key hard-coded into the front-end. There's a bunch of hoop-jumping involved. Still better than parsing HTML, but barely.&lt;/p&gt;

&lt;h3&gt;
  
  
  GraphQL: Same Benefits, Different Query Language
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/act-one-cinema/" rel="noopener noreferrer"&gt;ActOne Cinema&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like JSON endpoints, but with GraphQL queries. You get structured data without HTML wrangling. The learning curve is steeper than REST, but the payoff is the same - no HTML parsing.&lt;/p&gt;
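&lt;p&gt;As a rough sketch - the endpoint and field names below are invented, not ActOne's real schema - the GraphQL flavour boils down to POSTing a query and reading structured data back:&lt;/p&gt;

```javascript
// A hedged sketch of the GraphQL variant. The query fields are illustrative;
// `doFetch` is injectable so the sketch can run without a network.
async function fetchShowtimes(endpoint, doFetch = fetch) {
  const query = `
    query {
      screenings {
        title
        startsAt
        runtimeMinutes
      }
    }`;
  const response = await doFetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { data } = await response.json();
  return data.screenings;
}
```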

&lt;h3&gt;
  
  
  The HTML Parsing Toolkit: Cheerio, Playwright, and date-fns
&lt;/h3&gt;

&lt;p&gt;When there's no API and you're parsing HTML, you need the right tools for the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;Cheerio&lt;/strong&gt;&lt;/a&gt; - For sites that let you just fetch their HTML. Cheerio is like jQuery but without an actual DOM. You can do CSS selectors and extraction without spinning up a browser. Fast and lightweight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Playwright&lt;/strong&gt;&lt;/a&gt; - For sites that won't let you just fetch HTML. Maybe they have bot detection, maybe they're heavily client-side rendered, maybe they need requests from residential IPs (hello, cluster of 6 Pis). You need a real browser to make it work.&lt;/p&gt;

&lt;p&gt;The BFI is the worst offender for needing this. Both &lt;a href="https://clusterflick.com/venues/bfi-southbank/" rel="noopener noreferrer"&gt;BFI Southbank&lt;/a&gt; and &lt;a href="https://clusterflick.com/venues/bfi-imax/" rel="noopener noreferrer"&gt;BFI IMAX&lt;/a&gt; run on the same slow, inconsistent site. Pages load in pieces asynchronously and often time out. It's the longest-running retrieve in the entire pipeline. There's no API. It's just a slog 😭&lt;/p&gt;

&lt;p&gt;&lt;a href="https://date-fns.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;date-fns&lt;/strong&gt;&lt;/a&gt; - Once you've extracted the data, you still have to parse it. Cinema websites output dates and times in wildly different formats. &lt;code&gt;date-fns&lt;/code&gt; handles converting these strings into date objects so we can generate the timestamps the schema requires. Anyone who's worked with dates knows how much of a headache they can be without a good library!&lt;/p&gt;

&lt;h3&gt;
  
  
  Complex Multi-Page: When Listings and Booking Are Separate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/science-museum/" rel="noopener noreferrer"&gt;Science Museum&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where it gets properly complicated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve "products" from their JSON API&lt;/li&gt;
&lt;li&gt;Filter for movies (because they sell all kinds of products)&lt;/li&gt;
&lt;li&gt;Now we've got the titles - but nothing else; there's no link to detail pages in this data&lt;/li&gt;
&lt;li&gt;Use their HTML search page to search for each title and scrape the first match (this only works because the Science Museum doesn't show many films and they have distinct titles)&lt;/li&gt;
&lt;li&gt;Fetch the listing page HTML for each match to get full movie details&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's a multi-stage dance between JSON and HTML, search and direct fetch, just to get a complete dataset. And Retrieve handles all of this. Transform just processes the final bundle.&lt;/p&gt;
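&lt;p&gt;As a sketch of that dance - every helper below is a stand-in for the real step, injected so the flow itself is visible:&lt;/p&gt;

```javascript
// Illustrative sketch of the Science Museum multi-stage retrieve.
// fetchProducts, isMovie, searchByTitle and fetchListingPage are all
// hypothetical stand-ins for the real implementations.
async function retrieveScienceMuseum({ fetchProducts, isMovie, searchByTitle, fetchListingPage }) {
  // Steps 1-2: the JSON API gives us "products"; keep only the movies
  const titles = (await fetchProducts()).filter(isMovie).map((p) => p.title);

  // Steps 3-5: no detail links in the JSON, so search the HTML site per
  // title, take the first match, then fetch its listing page for details
  const films = [];
  for (const title of titles) {
    const match = await searchByTitle(title);
    if (match) films.push(await fetchListingPage(match.url));
  }
  return films;
}
```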

&lt;h2&gt;
  
  
  Shared Cinema Platforms: When Adding Venues Becomes Trivial
&lt;/h2&gt;

&lt;p&gt;The absolute best moment in maintaining this pipeline is discovering a new cinema runs on a platform you already support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OCAPI&lt;/strong&gt; powers ODEON and Curzon. One codebase, two major chains, dozens of screens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savoy&lt;/strong&gt; is the big one for independent cinemas - when you find a new independent cinema and realize it's running Savoy's platform, you just configure a new venue to point at it. No new extraction code needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indy Cinema Group&lt;/strong&gt; and &lt;strong&gt;AdmitOne&lt;/strong&gt; both power multiple cinemas in the dataset. Same pattern - write the platform integration once, point it at new venues as you discover them.&lt;/p&gt;

&lt;p&gt;When a cinema migrates between platforms you already know, updating is a trivial config change. This is what makes scaling from a few venues to 200+ feasible - you're not writing 200 different scrapers, you're pointing a dozen implementations at different configurations.&lt;/p&gt;
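&lt;p&gt;A toy version of that configuration-over-code idea - the platform and venue details below are illustrative, not the real module layout:&lt;/p&gt;

```javascript
// Hypothetical sketch: retrieve logic written once per platform, with each
// venue reduced to a config entry pointing at a platform implementation.
const platforms = {
  ocapi: (cfg) => `OCAPI retrieve for ${cfg.siteId}`,
  savoy: (cfg) => `Savoy retrieve for ${cfg.venueSlug}`,
};

const venues = [
  { name: "ODEON Luxe Leicester Square", platform: "ocapi", config: { siteId: "odeon" } },
  { name: "Curzon Mayfair", platform: "ocapi", config: { siteId: "curzon" } },
];

// Adding a venue on a known platform means adding config, not code
function buildRetrievers(venueList) {
  return venueList.map((venue) => ({
    name: venue.name,
    run: () => platforms[venue.platform](venue.config),
  }));
}
```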

&lt;h2&gt;
  
  
  Event Platforms: When Venues Don't Have Their Own Sites
&lt;/h2&gt;

&lt;p&gt;Not every screening venue maintains its own website with listings. Some only publish events on platforms like Eventbrite, Dice, or OutSavvy (in the codebase we call them "sources").&lt;/p&gt;

&lt;p&gt;Here's how the pipeline handles this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once per retrieve run&lt;/strong&gt;, pull all London film-specific events from each source. How we get those varies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some sources let you filter directly on "Films"&lt;/li&gt;
&lt;li&gt;For others we search "Films" and "Theatre" (to catch theatre-on-film like NT Live)&lt;/li&gt;
&lt;li&gt;Some require keyword searches and a bit of post-processing once we have the data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the source, we now have a bunch of events for lots of different venues, some of which may not even be in London. This is where the setup for sources differs - sources don't transform, they "find". Using the venue attributes - name, address, coordinates, alternative names - they find matching events that the venue's transform function can then incorporate when outputting the final list of venue events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each source is responsible for matching&lt;/strong&gt; based on what data it has. Most compare against the venue name (and list of alternative names like "The Ritzy" vs "Ritzy Picturehouse") plus either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coordinate match within 100m, or&lt;/li&gt;
&lt;li&gt;Postcode match (some listings have wrong coordinates but correct addresses)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Name matching is fuzzy - basic normalization before comparing. I've never seen false positives because the matching is pretty specific, so we're more likely to miss events than mismatch them. There are analysis scripts for each source showing which events matched and which didn't, so we can manually review for missing events.&lt;/p&gt;
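&lt;p&gt;A simplified sketch of that matching logic - the normalisation here is deliberately basic, and the real implementation differs per source depending on what data it has:&lt;/p&gt;

```javascript
// Illustrative matching sketch: fuzzy-ish name comparison plus either a
// coordinate match within 100m or a postcode match.
const normalise = (name) =>
  name.toLowerCase().replace(/\bthe\b/g, "").replace(/[^a-z0-9]/g, "");

// Haversine distance in metres between two { lat, lng } points
function distanceMetres(a, b) {
  const R = 6371000; // Earth radius in metres
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

function eventMatchesVenue(event, venue) {
  const names = [venue.name, ...(venue.alternativeNames || [])].map(normalise);
  if (!names.includes(normalise(event.venueName))) return false;
  // Coordinate match within 100m...
  if (event.coordinates && distanceMetres(event.coordinates, venue.coordinates) <= 100) {
    return true;
  }
  // ...or fall back to postcode (some listings have wrong coordinates)
  return event.postcode === venue.postcode;
}
```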

&lt;p&gt;&lt;strong&gt;Event-source-only venues&lt;/strong&gt; don't have a website to retrieve from at all - their transform just returns whatever the sources found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/bfi-stephen-street/" rel="noopener noreferrer"&gt;BFI Stephen Street&lt;/a&gt; - a private hire screen that only appears on event platforms when someone books it for a public screening.&lt;/p&gt;

&lt;p&gt;The beauty of this pattern: when a new venue shows up on Eventbrite, adding it is minimal effort. The event data is already being pulled daily. You just register the venue metadata and let the matching happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like In Practice
&lt;/h2&gt;

&lt;p&gt;Here's the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve jobs run&lt;/strong&gt; - some on GitHub's cloud runners, some on the local cluster for sites that need residential IPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data gets aggregated&lt;/strong&gt; into &lt;a href="https://github.com/clusterflick/data-retrieved/releases/latest" rel="noopener noreferrer"&gt;a GitHub release in the retrieve repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform jobs pull that release&lt;/strong&gt; and run on GitHub's cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each transform&lt;/strong&gt; is pure - it processes the data it's given, optionally merging in matched events from the event sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt; is data conforming to a standardized schema, regardless of whether the source was a single HTML page, a GraphQL API, or an Eventbrite search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final transformed data&lt;/strong&gt; gets published as &lt;a href="https://github.com/clusterflick/data-transformed/releases/latest" rel="noopener noreferrer"&gt;a release in the transform repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
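&lt;p&gt;To make "standardized schema" concrete, here's an illustrative output shape - the field names below are invented for this post; the actual schema lives in the Clusterflick repos:&lt;/p&gt;

```javascript
// Hypothetical example of the standardized transform output. Whatever the
// source - an HTML page, a GraphQL API, an Eventbrite search - transform
// emits one shape, so everything downstream has a single format to handle.
const exampleTransformOutput = {
  venue: "the-castle-cinema",
  movies: [
    {
      title: "Paddington 2",
      url: "https://example.test/listing/paddington-2",
      performances: [
        { time: 1767225600000, bookingUrl: "https://example.test/book/123" },
      ],
    },
  ],
};
```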

&lt;p&gt;The system isn't elegant because I designed it to be. It's elegant because each constraint - rate limits, IP restrictions, venue variety, platform diversity - forced a clean separation of concerns.&lt;/p&gt;

&lt;p&gt;And somehow, it all runs daily, for 240+ venues, without falling over* 🍿&lt;/p&gt;

&lt;p&gt;* it sometimes falls over&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; &lt;a href="https://dev.to/alistairjcbrown/getting-the-data-model-right-movie-showings-performances-25pm"&gt;Getting the Data Model Right: Movie -&amp;gt; Showings -&amp;gt; Performances&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>architecture</category>
      <category>automation</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>Calendar Feeds: Where It All Started</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 11 Feb 2026 08:34:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/calendar-feeds-where-it-all-started-27o2</link>
      <guid>https://dev.to/alistairjcbrown/calendar-feeds-where-it-all-started-27o2</guid>
      <description>&lt;p&gt;When I lived in Belfast, I had one problem: I wanted to know what was showing at &lt;a href="https://strandartscentre.com/" rel="noopener noreferrer"&gt;the Strand Cinema&lt;/a&gt; without having to remember to check their website. I wanted to look at next Friday in my calendar and see if there was anything worth going to.&lt;/p&gt;

&lt;p&gt;So I built a scraper. Pull the listings, transform them into something structured, generate an ICS file. Done.&lt;/p&gt;

&lt;p&gt;That was June 2023. That workflow—retrieve, transform, output—is still the foundation of everything Clusterflick does today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Looks Like Now
&lt;/h2&gt;

&lt;p&gt;I currently have 14 cinema calendar feeds in my Google Calendar, for those venues I go to most often. When I want to see what's on, I toggle a few of them on—maybe the &lt;a href="https://clusterflick.com/venues/bfi-southbank/" rel="noopener noreferrer"&gt;BFI&lt;/a&gt;, &lt;a href="https://clusterflick.com/venues/the-castle-cinema/" rel="noopener noreferrer"&gt;The Castle Cinema&lt;/a&gt;, &lt;a href="https://clusterflick.com/venues/genesis-cinema/" rel="noopener noreferrer"&gt;Genesis Cinema&lt;/a&gt;, and &lt;a href="https://clusterflick.com/venues/hackney-picturehouse/" rel="noopener noreferrer"&gt;Hackney Picturehouse&lt;/a&gt; if I'm planning for the weekend. When I book tickets, I just copy that event over to my personal calendar.&lt;/p&gt;

&lt;p&gt;Adding a feed is as simple as pasting a URL into Google Calendar. If you want to try it yourself, &lt;a href="https://github.com/clusterflick/data-calendar/" rel="noopener noreferrer"&gt;the 📅 &lt;code&gt;data-calendar&lt;/code&gt; repo has instructions&lt;/a&gt; and feed URLs for every venue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; It's now even easier 🎉 - handy calendar links are included on venue pages in Clusterflick. You can add to Google Calendar, Outlook, or any calendar app that supports &lt;a href="https://en.wikipedia.org/wiki/Webcal" rel="noopener noreferrer"&gt;Webcal&lt;/a&gt; with one click!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5vyymsmlf5yvixpmho1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5vyymsmlf5yvixpmho1.png" alt="Screenshot of the Prince Charles Cinema venue page on Clusterflick, showing the logo, name, socials and newly added calendar buttons" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👆 Calendar buttons, now at the top of &lt;em&gt;every&lt;/em&gt; venue page. Super easy to get your favourite (&lt;a href="https://clusterflick.com/venues/prince-charles-cinema/" rel="noopener noreferrer"&gt;Prince Charles Cinema&lt;/a&gt;?) schedule right in your calendar 🎬&lt;/p&gt;

&lt;h2&gt;
  
  
  Rich Events, Not Just "7pm — Cinema"
&lt;/h2&gt;

&lt;p&gt;Each calendar event includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The venue name and location (so Google Maps knows where you're going)&lt;/li&gt;
&lt;li&gt;A link back to the original listing page&lt;/li&gt;
&lt;li&gt;The movie title as the cinema lists it&lt;/li&gt;
&lt;li&gt;Whatever metadata we managed to extract: directors, actors, a plot summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below all of that, we include our match with The Movie Database: so you also have the canonical title, the year, an overview, and a link back to TMDB if you want to look up more—but the event title itself stays as the cinema's original listing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcds529b0drhy7fq88kpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcds529b0drhy7fq88kpi.png" alt="Screenshot of Google calendar showing the Prince Charles Cinema schedule for next week, which was generated as part of the Clusterflick data pipeline" width="800" height="392"&gt;&lt;/a&gt;&lt;br&gt;
👆 &lt;em&gt;Prince Charles Cinema schedule for next week&lt;/em&gt; 📆&lt;/p&gt;

&lt;p&gt;This is different from the website, where everything gets unified under one canonical movie title. Calendar feeds are venue-specific—they're mirroring what's on that cinema's website, so using their original title makes sense. If the Prince Charles Cinema is showing "Troll 2 (aka Best Worst Movie)" and we've matched it to &lt;em&gt;Troll 2&lt;/em&gt; in TMDB, that's fine. The feed is telling you what's on at that venue, not trying to reconcile it with every other cinema's listing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Duration Problem
&lt;/h2&gt;

&lt;p&gt;Here's the annoying thing about cinema listings: they tell you when the film starts, but rarely how long it is. And if you're putting something in a calendar, you need an end time.&lt;/p&gt;

&lt;p&gt;Early on, I just defaulted everything to 90 minutes. Now there are better fallbacks: if the listing happens to include a runtime, we use it, and since we match more than 96% of films against TMDB, we can usually pull the actual runtime from there. So if it's a 2h20m film, you get a 2h20m calendar event.&lt;/p&gt;

&lt;p&gt;It's not perfect—it doesn't account for the 20 minutes of trailers most cinemas front-load. But it's close enough to be useful. A two-hour film showing up as a two-hour block in your calendar is good enough for planning your evening.&lt;/p&gt;
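&lt;p&gt;The fallback chain is small enough to show in full - a sketch with hypothetical function names, not the real calendar code:&lt;/p&gt;

```javascript
// Illustrative runtime fallback: listing runtime if present, otherwise the
// TMDB match's runtime, otherwise a 90-minute default.
const DEFAULT_RUNTIME_MINUTES = 90;

function eventDurationMinutes(listingRuntime, tmdbRuntime) {
  return listingRuntime ?? tmdbRuntime ?? DEFAULT_RUNTIME_MINUTES;
}

// Calendar events need an end time, so the duration gets added to the start
function eventEndMs(startMs, listingRuntime, tmdbRuntime) {
  return startMs + eventDurationMinutes(listingRuntime, tmdbRuntime) * 60 * 1000;
}
```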

&lt;h2&gt;
  
  
  It Branches Early
&lt;/h2&gt;

&lt;p&gt;One of the nice architectural wins here: calendar feeds come straight off the transform step. They don't need the combining logic, the caching layer, or the TMDB enrichment that the website requires.&lt;/p&gt;

&lt;p&gt;The website has to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Combine showings from multiple venues into canonical movies&lt;/li&gt;
&lt;li&gt;Cache TMDB lookups to avoid rate limits&lt;/li&gt;
&lt;li&gt;Fetch rich metadata (full cast, crew, posters, trailers)&lt;/li&gt;
&lt;li&gt;Generate static pages for every film&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The calendar feeds skip all of that. They're just: &lt;em&gt;here's what this venue says is showing, in a format your calendar app understands&lt;/em&gt;. We branch off right after transform and generate the ICS file. Simple.&lt;/p&gt;
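&lt;p&gt;The branch itself is little more than string building. A simplified sketch of turning one showing into an ICS event - the field choices mirror what's described above, but the helpers are stripped down compared to the real generator:&lt;/p&gt;

```javascript
// Format a millisecond timestamp as an ICS UTC date: YYYYMMDDTHHMMSSZ
function toIcsDate(ms) {
  return new Date(ms).toISOString().replace(/[-:]/g, "").replace(/\.\d{3}/, "");
}

// Build one VEVENT block from a showing (simplified: real ICS output also
// needs escaping, line folding, UID/DTSTAMP fields, and a VCALENDAR wrapper)
function toVEvent({ title, startMs, endMs, venueName, listingUrl, description }) {
  return [
    "BEGIN:VEVENT",
    `DTSTART:${toIcsDate(startMs)}`,
    `DTEND:${toIcsDate(endMs)}`,
    `SUMMARY:${title}`,
    `LOCATION:${venueName}`,
    `URL:${listingUrl}`,
    `DESCRIPTION:${description}`,
    "END:VEVENT",
  ].join("\r\n");
}
```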

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;This is still the simplest, most personally useful output of the whole project. Everything else—the website, the movie matching, the LLM-assisted disambiguation—grew from this.&lt;/p&gt;

&lt;p&gt;I just wanted to see what was on at the cinema without having to check their website. Two years later, I still use these feeds every week. The rest of Clusterflick exists because this one thing was useful enough to keep building on.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; Scaling From 3 Cinemas to 240 Venues: What Broke and What Evolved&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Building Clusterflick: A London Cinema Aggregator</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Fri, 06 Feb 2026 18:20:48 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/building-clusterflick-a-london-cinema-aggregator-kk3</link>
      <guid>https://dev.to/alistairjcbrown/building-clusterflick-a-london-cinema-aggregator-kk3</guid>
      <description>&lt;p&gt;I've been working on a personal project called &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; — a single source for every movie showing across London. Right now it's tracking &lt;strong&gt;240 venues&lt;/strong&gt; across &lt;strong&gt;5 event platforms&lt;/strong&gt;, currently pulling in &lt;strong&gt;1,398 events&lt;/strong&gt; and over &lt;strong&gt;30,000 showings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It started simply enough: I just wanted cinema times on my calendar. But it quickly spiralled into a full data pipeline running on GitHub Actions, a statically generated Next.js site, and a cluster of Raspberry Pis in my living room.&lt;/p&gt;

&lt;p&gt;Some of the most interesting challenges so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Movie matching is deceptively hard.&lt;/strong&gt; You'd think title + year would uniquely identify a film. It doesn't. Neither does title + director. Sometimes cinema listings don't even give you enough to identify a movie as a human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping at scale without a budget.&lt;/strong&gt; GitHub runner IPs get blocked, so now there's a Raspberry Pi cluster handling the tricky ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using LLMs for data quality.&lt;/strong&gt; When fuzzy matching falls short, LLMs have been surprisingly useful for resolving ambiguous movie lookups against The Movie DB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keeping it cheap.&lt;/strong&gt; The whole thing runs on near-zero infrastructure costs — GitHub Actions for orchestration, Releases as storage, static site generation to avoid hosting costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole project is open source on &lt;a href="https://github.com/clusterflick/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If any of this sounds interesting, I'd love to hear from others working on similar scraping/aggregation/data pipeline projects.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
