<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alistair</title>
    <description>The latest articles on DEV Community by Alistair (@alistairjcbrown).</description>
    <link>https://dev.to/alistairjcbrown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3756419%2F8b8f7758-c6f0-4fd7-8547-14c43909cd4e.png</url>
      <title>DEV Community: Alistair</title>
      <link>https://dev.to/alistairjcbrown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alistairjcbrown"/>
    <language>en</language>
    <item>
      <title>I Tried to Automate a Manual Review Task with Claude. It Wasn't Worth It.</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Sat, 04 Apr 2026 16:12:07 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/i-tried-to-automate-a-manual-review-task-with-claude-it-wasnt-worth-it-13m9</link>
      <guid>https://dev.to/alistairjcbrown/i-tried-to-automate-a-manual-review-task-with-claude-it-wasnt-worth-it-13m9</guid>
      <description>&lt;p&gt;Every day, a CI job adds new entries to &lt;a href="https://github.com/clusterflick/scripts/blob/main/common/tests/test-titles.json" rel="noopener noreferrer"&gt;&lt;code&gt;test-titles.json&lt;/code&gt;&lt;/a&gt; in my Clusterflick repo. When it finds a cinema listing title the normaliser hasn't seen before, it records the input and the current output, then opens a pull request. Someone — usually me — then has to review whether those outputs are actually correct, fix anything that isn't, and merge.&lt;/p&gt;

&lt;p&gt;It's not complicated work. Review the output and confirm the normaliser has done the correct job. If it hasn't, fix the output (so the test now fails ❌) and then fix the normaliser (until the test passes ✅). But it happens twice a day, and "not complicated" doesn't mean "no context switching".&lt;/p&gt;

&lt;p&gt;So I decided to try automating it with Claude. Several hours and $5 later, I don't think it was worth it — and I think the reasons why are worth writing up 💸&lt;/p&gt;

&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;The normaliser — &lt;a href="https://github.com/clusterflick/scripts/blob/ed3f84d25486b84703b3fd6e2d89fbbdae3a1bf3/common/normalize-title.js" rel="noopener noreferrer"&gt;&lt;code&gt;normalize-title.js&lt;/code&gt;&lt;/a&gt; — converts raw cinema listing titles into a consistent string. I've written about it more in depth in my previous post, &lt;a href="https://dev.to/alistairjcbrown/cleaning-cinema-titles-before-you-can-even-search-1463"&gt;Cleaning Cinema Titles Before You Can Even Search&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When the CI job adds new test entries, it records whatever the normaliser currently produces. The reviewer's job is to decide whether that output is &lt;em&gt;correct&lt;/em&gt;. There's a &lt;a href="https://github.com/clusterflick/scripts/blob/ed3f84d25486b84703b3fd6e2d89fbbdae3a1bf3/docs/reviewing-title-normalisation-test-cases.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/reviewing-title-normalisation-test-cases.md&lt;/code&gt;&lt;/a&gt; file with detailed guidance on how to classify and fix different types of issues.&lt;/p&gt;

&lt;p&gt;The automation task: look at the new entries, use the guide to decide if they look correct, fix anything that's wrong, commit. Automating it with Claude seemed like a reasonable fit, especially as I'd been doing this semi-automated locally using a very basic prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In this branch we've had some automated updates to `common/tests/test-titles.json`.
Confirm these changes are correct, or if they're not correct then fix them.
There's details on how this setup works in `docs/reviewing-title-normalisation-test-cases.md`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Approach
&lt;/h2&gt;

&lt;p&gt;I set up the Claude platform and added $5 of credit, then set up a GitHub Actions workflow triggered by a &lt;code&gt;@claude review titles&lt;/code&gt; comment on any PR. The &lt;a href="https://github.com/anthropics/claude-code-action" rel="noopener noreferrer"&gt;Claude Code GitHub Action&lt;/a&gt; handles the Claude integration — it checks out the PR branch, runs Claude Code against it, and can commit fixes back to the branch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cof1mkaprdk57pr8ri3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cof1mkaprdk57pr8ri3.png" alt="Screenshot of Claude platform" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workflow was straightforward in principle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;issue_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;claude-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
      &lt;span class="s"&gt;contains(github.event.comment.body, '@claude review titles') &amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="s"&gt;github.event.issue.pull_request != null&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z9kyaz1vy4bo0rfn9qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z9kyaz1vy4bo0rfn9qt.png" alt="Screenshot of Claude in actions output" width="800" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude gets the diff, reads the documentation, checks each new entry, and either accepts it as correct or fixes it. Should be straightforward, and a manual trigger to kick it off so no surprises.&lt;/p&gt;

&lt;p&gt;For this, I was also going to double down on Claude: Claude.ai to guide me through the setup, and the Claude API (via the GitHub Action) to do the actual review. But getting there took a few attempts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems
&lt;/h2&gt;

&lt;p&gt;Something worth noting upfront: every failed run here cost money, especially if Claude spirals and chews through tokens. There's not a lot of feedback (or too much, once I figured out how to stream it back), so it's much harder than it is locally to see what Claude's thinking, and there's no reprompt to bring it back on track. On top of that, each run takes several minutes before you find out what went wrong, so the feedback loop is slow and expensive. Debugging a GitHub Actions workflow normally costs you time. Debugging this one cost time &lt;em&gt;and&lt;/em&gt; cash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions.&lt;/strong&gt; The first run failed with OIDC token errors. The Claude Code Action uses OIDC to generate a GitHub App token, which requires &lt;code&gt;id-token: write&lt;/code&gt; in the workflow permissions. I'm not sure why Claude.ai didn't include that in the initial workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Branch checkout.&lt;/strong&gt; The PR branch wasn't checked out by default — the runner was on &lt;code&gt;main&lt;/code&gt;, so Claude found no diff (and chewed through tokens). I added an explicit checkout step with &lt;code&gt;ref: refs/pull/${{ github.event.issue.number }}/head&lt;/code&gt; and &lt;code&gt;fetch-depth: 0&lt;/code&gt; so &lt;code&gt;git diff&lt;/code&gt; had something to work with. Again, I'm not sure why Claude.ai didn't include that in the initial workflow.&lt;/p&gt;

&lt;p&gt;I probably should have caught this one myself. Checking out the PR branch is a well-known requirement when working with pull requests in Actions. I assumed a language model with broad knowledge of GitHub Actions would have it covered. The lesson there is the same as always with LLM output: trust but verify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt;.&lt;/strong&gt; Without this flag, Claude keeps pausing to ask permission before running bash commands or editing files. In a non-interactive GitHub Actions environment that means it loops forever waiting for input it'll never get. Required flag for any autonomous use. Again, I'm not sure why Claude.ai didn't include that in the initial workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8g22ttv10gnkknoj3h7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8g22ttv10gnkknoj3h7.png" alt="Screenshot of Claude.ai after being queried about dangerously-skip-permissions flag" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;--allowedTools&lt;/code&gt; has a bug.&lt;/strong&gt; I initially used &lt;code&gt;--allowedTools Bash,Read,Edit,Write&lt;/code&gt; to restrict Claude to just the tools it needs. But there's a known issue where the init message still reports all available tools, which can confuse Claude into thinking it can use them. Swapped to &lt;code&gt;--disallowedTools&lt;/code&gt; instead, which works correctly.&lt;/p&gt;

&lt;p&gt;By this point I'd spent half my budget just getting the plumbing right, without the PR being updated at all. For context, this PR added 11 new titles, so it wasn't a huge amount of data to review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-Turn Failure
&lt;/h2&gt;

&lt;p&gt;The first run that got past all the setup issues hit the 30-turn limit and stopped without committing anything. It cost $0.59 and took about five minutes.&lt;/p&gt;

&lt;p&gt;What happened was actually Claude doing the right thing. It ran all 11 inputs through the normaliser, saw that every output matched what was recorded, and then — correctly — kept going. Because matching the normaliser isn't the same as being correct. The documentation I'd pointed it at says it plainly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;output&lt;/code&gt; field in &lt;code&gt;test-titles.json&lt;/code&gt; is what the test &lt;strong&gt;expects&lt;/strong&gt;, not necessarily what is correct.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So Claude spent the next 25+ turns reading through &lt;code&gt;normalize-title.js&lt;/code&gt;, &lt;code&gt;known-removable-phrases.js&lt;/code&gt;, and the existing test data, reasoning about whether each output was actually right. That's exactly the job. The problem was that it ran out of turns before committing anything useful.&lt;/p&gt;

&lt;p&gt;I asked Claude.ai to help diagnose this, and it suggested adding an explicit stopping condition to the prompt — something like "if it matches, accept it, don't investigate further." I took that suggestion at face value without thinking through what it actually meant. It would stop the spiralling. It would also stop the reasoning. Those are the same thing 🤦&lt;/p&gt;

&lt;p&gt;I added the stopping condition, dropped &lt;code&gt;--max-turns&lt;/code&gt; to 15, and declared the cost problem fixed. It wasn't — I'd just hidden it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A "Successful" Run That Wasn't
&lt;/h2&gt;

&lt;p&gt;With the prompt fixed and tools switched to &lt;code&gt;--disallowedTools&lt;/code&gt;, the next run completed in 6 turns and 45 seconds. Cost: $0.19.&lt;/p&gt;

&lt;p&gt;The full sequence: check the git log, get the diff, read the docs, run all 11 inputs through the normaliser in a single batch, conclude &lt;em&gt;"All 11 new entries match the recorded output exactly. No fixes needed."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The problem is that conclusion is &lt;em&gt;always&lt;/em&gt; true, by construction. The CI job that creates these PRs records &lt;code&gt;normalizer(input)&lt;/code&gt; as the output — so of course it matches when you run the normaliser again. Confirming that match only confirms the CI job recorded its own output correctly, nothing more.&lt;/p&gt;
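&lt;p&gt;The "always true by construction" point can be sketched in a few lines of JavaScript (toy names here, not the real implementation):&lt;/p&gt;

```javascript
// Hypothetical sketch of how the CI job records a new test entry:
// it stores the normaliser's own output as the expected output.
const recordEntry = (normalize, input) => ({ input, output: normalize(input) });

// Toy stand-in for normalize-title.js, not the real implementation.
const normalize = (title) => title.replace(/ at .*$/, "").trim();

const entry = recordEntry(normalize, "THE ZODIAC KILLER (1971) at Beer Merchants Tap");

// Re-running the same normaliser can only ever agree with what it recorded.
console.log(normalize(entry.input) === entry.output); // true, by construction
```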

&lt;p&gt;What I actually needed was the second step: reasoning about whether those outputs are &lt;em&gt;correct&lt;/em&gt;, spotting event prefixes that should be stripped, recognising real film titles that are getting mangled, and updating &lt;code&gt;known-removable-phrases.js&lt;/code&gt; accordingly. That's the work. In solving the cost problem by narrowing the prompt, I'd removed the work entirely.&lt;/p&gt;

&lt;p&gt;When I went back through the PR manually, I found several entries that still needed fixing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Problem Underneath
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What kept nagging at me:&lt;/strong&gt; the task is reviewing 11 strings. There's a large corpus of existing examples, a detailed instructions document, and an LLM with a vast amount of general knowledge. It shouldn't require 30 turns and $0.59 to do this — and the fact that it did suggests something isn't well-suited here, not just misconfigured.&lt;/p&gt;

&lt;p&gt;Part of it is a problem with visibility. With each run costing real money and taking several minutes, debugging is expensive. You can't easily see why Claude went down a particular path until you're staring at a full JSON trace of every tool call. Every misconfiguration costs you money and ten minutes before you understand what went wrong. Several of those cycles add up quickly — the $5 I spent getting here was just debugging, not doing useful work.&lt;/p&gt;

&lt;p&gt;And even when the infrastructure is right, the cost curve for this type of task is awkward. Simple cases (all outputs correct) should be cheap, but you can't know in advance whether the run will be simple. If Claude starts investigating an ambiguous case, you're back to 20+ turns and $0.50+. The unpredictability makes it hard to budget.&lt;/p&gt;

&lt;p&gt;For a task this focused — a small number of strings, a clear pattern to match against, a fixed corpus to consult — perhaps a deterministic script would be more reliable (and much cheaper). The Claude Code GitHub Action is well-suited to open-ended tasks where you're not sure what tools you'll need... and where you've ideally got a healthy budget to back that up. A free, open-source, personal project trying to automate reviewing normaliser outputs against a known pattern isn't really any of that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;I wouldn't abandon the idea entirely. The local Claude Code workflow — where I can watch it reason, reprompt when it goes off track, and apply fixes interactively — has worked well and saved real time. The problem is trying to make that fully autonomous in a way that's cost-effective.&lt;/p&gt;

&lt;p&gt;If I came back to this, I'd probably try a direct API call with a tighter prompt and explicit output format rather than the full Claude Code agentic setup. Something that gets the diff, asks Claude to classify each entry as "looks correct" or "has issue: [reason]", and only triggers the expensive autonomous work when there's actually something to fix.&lt;/p&gt;
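&lt;p&gt;That classify-then-escalate shape might look roughly like this — all names here are hypothetical, and &lt;code&gt;askClaude&lt;/code&gt; stands in for a single direct API call:&lt;/p&gt;

```javascript
// Hypothetical sketch of the cheaper two-step flow.
// askClaude is a stand-in for one direct API call that returns JSON like:
//   [{ input: "...", verdict: "looks correct" },
//    { input: "...", verdict: "has issue", reason: "event prefix not stripped" }]
async function reviewEntries(entries, askClaude) {
  const verdicts = await askClaude(
    "Classify each entry as 'looks correct' or 'has issue: [reason]':\n" +
      JSON.stringify(entries, null, 2)
  );
  const flagged = verdicts.filter((v) => v.verdict !== "looks correct");
  // Only trigger the expensive autonomous run when there's real work to do.
  return { flagged, needsAutonomousFix: flagged.length > 0 };
}
```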

&lt;p&gt;But for now, some things are still faster and cheaper done by hand. 🍿&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>github</category>
      <category>showdev</category>
    </item>
    <item>
      <title>The Raspberry Pi Cluster in My Living Room</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 25 Mar 2026 08:55:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/the-raspberry-pi-cluster-in-my-living-room-6ik</link>
      <guid>https://dev.to/alistairjcbrown/the-raspberry-pi-cluster-in-my-living-room-6ik</guid>
      <description>&lt;p&gt;There are six Raspberry Pi 4s on a shelf in my living room. They run 24/7, they're all wired directly into the router, and they exist for one fairly specific reason: some cinema websites block GitHub's IP ranges.&lt;/p&gt;

&lt;p&gt;GitHub Actions runners share IP space with a lot of automated traffic, and a handful of venues had decided they didn't want to serve requests from that space. The failures were inconsistent — empty responses, timeouts, bot-detection pages — which made them annoying to diagnose. Once I'd worked out what was actually happening, the fix was straightforward: residential IP addresses. Requests that look like they're coming from someone's home connection, because they are.&lt;/p&gt;

&lt;p&gt;Hence the Pis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e803a71a6usck861uyd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e803a71a6usck861uyd.jpg" alt="Raspberry Pis in mounts" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Pis, Not Just a Cheap PC?
&lt;/h2&gt;

&lt;p&gt;It's a fair question. I set myself a target: &lt;em&gt;£50 or less per Pi&lt;/em&gt;, all-in. That means the Pi 4 itself, an SD card, a power cable, and an ethernet cable. No wiggle room for a fancy case or anything optional. But six Pis at £50 each is £300 — you could buy a reasonable secondhand desktop for that and run six runners on it without breaking a sweat.&lt;/p&gt;

&lt;p&gt;The honest answer is that it didn't start as a deliberate architecture decision. I had one Pi spare, so I set it up as a runner. That was enough at first. As I added more venues and the pipeline got busier, I added another, then another. By the time I had three or four, I was actively buying more rather than reconsidering the approach — partly because they're cheap and low-power (running a desktop 24/7 would cost noticeably more on the electricity bill), but also because I'd started to like the fault tolerance story.&lt;/p&gt;

&lt;p&gt;Each Pi is independent. If one plays up, it takes one runner offline, not all of them. Better yet, there's nothing precious about any individual machine — the setup steps are &lt;a href="https://github.com/clusterflick/self-hosted-workflows?tab=readme-ov-file#setting-up-a-new-runner" rel="noopener noreferrer"&gt;fully documented&lt;/a&gt;, so if a Pi goes wrong I can wipe the SD card and have it back as a runner in under an hour. Cattle, not pets. A single PC running six processes doesn't give you that.&lt;/p&gt;

&lt;p&gt;Pi 4s aren't particularly cheap if you buy them new and in a hurry, but there's a reasonable secondhand market if you're patient. I watched eBay listings and Facebook Marketplace, picked them up when they matched the budget, and that's how I ended up with six of them. A few came without accessories, which meant sourcing cables separately — but even then, it worked out.&lt;/p&gt;

&lt;p&gt;One thing I learned the hard way: &lt;em&gt;the power supply matters more than you'd think&lt;/em&gt;. The Pi 4 is particular about voltage, and one of mine was on an underpowered cable. All the Pis are headless, so there's no screen to hint at what's wrong — it just showed up as one runner that was less reliable than the others, dropping jobs intermittently. It took longer than I'd like to admit to narrow it down to the power supply; swapping it fixed things immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  SD Cards: The Unexpected Bottleneck
&lt;/h2&gt;

&lt;p&gt;The other thing that surprised me was how much the SD cards matter for this use case.&lt;/p&gt;

&lt;p&gt;Most Raspberry Pi guides will tell you any Class 10 card is fine, and for general use that's probably true. But GitHub Actions runners do a lot of I/O — constant git checkouts, caches being read and written, files being created and deleted across every job. Slow cards can appear fine at first, but become a bottleneck once they pick up a job, especially one with a lot of smaller steps. Jobs that should take ten seconds start taking ten times as long, and you can't figure out why until you look at where the time is actually going.&lt;/p&gt;

&lt;p&gt;Swapping to &lt;em&gt;SanDisk Extreme Pro cards&lt;/em&gt; made a noticeable difference — runners were now consistently faster on anything I/O-heavy, which in practice is most jobs. I ended up writing &lt;a href="https://github.com/clusterflick/self-hosted-workflows/blob/f8109243ca07a0b5c5c39cd0b874e81fbf25eb5c/.github/workflows/check-sd-card.yml" rel="noopener noreferrer"&gt;a workflow to test SD card speed&lt;/a&gt; which uses &lt;a href="https://github.com/raspberrypi-ui/agnostics/blob/d77d0e053c884048f6656ee079bc5f3ed834c3e2/data/sdtest.sh" rel="noopener noreferrer"&gt;Raspberry Pi's own speed test script&lt;/a&gt;. It checks whether read and write speeds are fast enough to provide adequate performance, which saves finding out the hard way mid-pipeline (and I'm hoping will let me quickly diagnose if an SD card is degrading).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndc4iziru8wx1h78kxih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndc4iziru8wx1h78kxih.png" alt="Screenshot of self hosted runner status workflow showing SD card speed test results for a specific runner" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other SD card lesson: &lt;em&gt;16GB is too small&lt;/em&gt;. The GitHub Actions runner cache fills up in less than a week of regular use. I have a &lt;a href="https://github.com/clusterflick/self-hosted-workflows/blob/f8109243ca07a0b5c5c39cd0b874e81fbf25eb5c/.github/workflows/free-space.yml" rel="noopener noreferrer"&gt;scheduled workflow to free up space&lt;/a&gt; — it clears the npm cache, removes all Playwright browsers, then reinstalls the latest dependencies and pre-warms everything. It works, but it's a bit of a workaround for a storage problem. I've since bumped everything to 64GB cards, I still run the workflow weekly, and so far everything's running smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Physical Setup
&lt;/h2&gt;

&lt;p&gt;Six Pis sitting loose on a shelf with cables going everywhere is exactly as annoying as it sounds, so I designed a mount to keep things tidy. It's a &lt;em&gt;3D-printed mount&lt;/em&gt; that holds each Pi in place, with enough spacing for airflow and clean cable routing (power cable is supported, SD card is accessible from the top, ethernet cable is hidden underneath).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6u8zinljuhava7uqq5e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6u8zinljuhava7uqq5e.jpg" alt="The cluster — six Pi 4s, all wired, all tidy" width="800" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to print one yourself, I've uploaded the STL files to &lt;a href="https://www.printables.com/model/1571451-raspberry-pi-4-frame-base-stand" rel="noopener noreferrer"&gt;Printables&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujtrc2joz6m2bhrt283b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujtrc2joz6m2bhrt283b.png" alt="Rendering of the 3D model for mounting the Raspberry Pis" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is connected &lt;em&gt;directly to the router via ethernet&lt;/em&gt;. No Wi-Fi. I briefly considered Wi-Fi for the tidiness of it, but I've had too many experiences with Wi-Fi dropouts causing mysterious CI failures, and the whole point of this thing is reliability. Ethernet cables aren't pretty, but they don't drop connections.&lt;/p&gt;

&lt;p&gt;The full cluster sits inside an &lt;a href="https://www.ikea.com/gb/en/p/smarra-box-with-lid-natural-90348063/" rel="noopener noreferrer"&gt;IKEA SMARRA box&lt;/a&gt;. It runs quietly, doesn't generate much heat, and sits in a corner where it's easy to ignore — which is exactly what you want from infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Haven't Covered
&lt;/h2&gt;

&lt;p&gt;Getting the Pis onto the network is the easy bit. Actually registering them as self-hosted GitHub Actions runners, keeping those runners healthy, and managing the runner environment across six machines is its own topic — one for another day.&lt;/p&gt;

&lt;p&gt;The short version for the curious: GitHub provides a script you run on each machine, it registers itself in your repo's settings, and from that point on it just sits there waiting to pick up jobs. The initial setup is straightforward enough. It's everything that comes after — keeping them healthy, diagnosing npm cache issues, hunting down slow runners — where things get more interesting. I do have &lt;a href="https://github.com/clusterflick/self-hosted-workflows/blob/f8109243ca07a0b5c5c39cd0b874e81fbf25eb5c/.github/workflows/runner-stats.yml" rel="noopener noreferrer"&gt;a workflow that reports stats across all runners&lt;/a&gt; — uptime, temperature, disk space remaining — which at least makes it easy to spot a machine that's quietly having a bad time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8rlpal551mthj6s036b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8rlpal551mthj6s036b.png" alt="Screenshot of self hosted runner status workflow showing stats for a specific runner" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; GitHub as Infrastructure — self-hosted runners, secrets management, and using GitHub Actions as the backbone of a daily data pipeline.&lt;/p&gt;

</description>
      <category>raspberrypi</category>
      <category>cicd</category>
      <category>githubactions</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Cleaning Cinema Titles Before You Can Even Search</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 18 Mar 2026 08:55:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/cleaning-cinema-titles-before-you-can-even-search-1463</link>
      <guid>https://dev.to/alistairjcbrown/cleaning-cinema-titles-before-you-can-even-search-1463</guid>
      <description>&lt;p&gt;When &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; first started pulling listings, I assumed the hard part would be the scraping. Getting the data off 250+ different cinema websites, each with their own structure and quirks — that's where the complexity lives, right?&lt;/p&gt;

&lt;p&gt;But before any of that work pays off, before a single TMDB search can happen, there's a problem sitting right at the start of the pipeline: cinema listings don't always give you a clean film title. They give you something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(IMAX) Princess Mononoke: 2025 Re-Release Subtited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or my personal favourite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP &amp;amp; SING-ALONG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of those are going to find anything useful in a TMDB search. So before matching can happen, there's a normalisation step — and it's grown into something with its own test suite of nearly 15,000 cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Obvious Stuff
&lt;/h2&gt;

&lt;p&gt;The easy wins are the patterns you see immediately once you start looking at real listings. Film clubs will attach their branding, and cinemas love adding their series names and event types to the front of a title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bar Trash:
DocHouse:
CLASSIC MATINEE:
Animation at War:
Family Film Club:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the end of titles is just as cluttered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;… + Q&amp;amp;A with Director
… on 35mm film
… (4K Remaster)
… Special Screening
… with Introduction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For all of these, there's a &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/known-removable-phrases.js" rel="noopener noreferrer"&gt;&lt;code&gt;known-removable-phrases.js&lt;/code&gt;&lt;/a&gt; file — a flat list of exact strings and patterns to strip. It currently has around 1,000 entries. The rule for adding to it is simple: if a phrase is a superfluous label added by a venue that isn't part of identifying the film, it goes here. Spelling corrections and encoding fixes are handled separately.&lt;/p&gt;

&lt;p&gt;The list isn't pretty, but it works. After stripping known phrases, &lt;code&gt;BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap&lt;/code&gt; becomes &lt;code&gt;THE ZODIAC KILLER (1971)&lt;/code&gt;. Progress.&lt;/p&gt;
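&lt;p&gt;In JavaScript terms, the stripping step is roughly this — sample phrases only, the real list lives in &lt;code&gt;known-removable-phrases.js&lt;/code&gt;:&lt;/p&gt;

```javascript
// Minimal sketch of the phrase-stripping idea with a few sample entries.
// The real list has around 1,000 of these.
const removablePhrases = [
  "BAR TRASH – ",
  " at Beer Merchants Tap",
  "Family Film Club: ",
];

function stripKnownPhrases(title) {
  // Remove every known superfluous phrase, then tidy the whitespace.
  return removablePhrases
    .reduce((result, phrase) => result.split(phrase).join(""), title)
    .trim();
}

console.log(
  stripKnownPhrases("BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap")
);
// "THE ZODIAC KILLER (1971)"
```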

&lt;h2&gt;
  
  
  The Plus Problem
&lt;/h2&gt;

&lt;p&gt;A lot of venues append extra information to titles using a &lt;code&gt;+&lt;/code&gt; separator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Slade in Flame + Q&amp;amp;A with Noddy Holder
TO A LAND UNKNOWN + PRE-RECORDED Q&amp;amp;A
Goodbye to the Past + pre-recorded intro by Annette Insdorf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The solution is obvious: split on &lt;code&gt;+&lt;/code&gt; and take whatever's before it. Except — and this is where it gets awkward — some legitimate film titles contain a &lt;code&gt;+&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Romeo + Juliet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the actual title of the Baz Luhrmann film. Split naively and you'd search for "Romeo" and find nothing useful. So there's a corrections list that pre-empts the split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Romeo + Juliet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Romeo+Juliet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Removing the spaces makes it invisible to the splitter, then it gets normalised back correctly downstream. It's a bit of a hack, but it does the job.&lt;/p&gt;

&lt;p&gt;The same logic applies to the &lt;code&gt;–&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; separators, which venues also use to attach event context. The pipeline strips what comes after the last separator — unless the result looks wrong, in which case there's probably a correction for it.&lt;/p&gt;
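&lt;p&gt;Sketched out (with assumed helper names, and only the &lt;code&gt;+&lt;/code&gt; case shown), the correction-then-split order looks like this:&lt;/p&gt;

```javascript
// Sketch of the separator split. Corrections run first, collapsing
// legitimate "+" titles so the split can't break them; the collapsed
// form is expanded back to the canonical title downstream.
const plusCorrections = [["Romeo + Juliet", "Romeo+Juliet"]];

function stripAfterSeparator(title) {
  let result = title;
  for (const [from, to] of plusCorrections) {
    result = result.split(from).join(to);
  }
  // Take whatever comes before the last " + " separator
  const index = result.lastIndexOf(" + ");
  return (index === -1 ? result : result.slice(0, index)).trim();
}
```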

&lt;h2&gt;
  
  
  "Presents" and Other Sneaky Prefixes
&lt;/h2&gt;

&lt;p&gt;Some patterns can't be handled with a fixed string list — there are too many variations. So instead we look for signal words to decide what information we can discard. If a title contains &lt;code&gt;presents:&lt;/code&gt;, for example, everything before &lt;code&gt;presents:&lt;/code&gt; is almost certainly not the film title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ghibliotheque presents... Spirited Away
VHS Late Tapes Takeover: LCVA presents POUT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These get handled with a regex match: if &lt;code&gt;presents?:?&lt;/code&gt; appears mid-title, take whatever follows it.&lt;/p&gt;

&lt;p&gt;The same approach works for &lt;code&gt;premiere of:&lt;/code&gt;, &lt;code&gt;screening of:&lt;/code&gt;, &lt;code&gt;retrospective screening of:&lt;/code&gt;, and a handful of others. Each one is a named match rather than a blindly applied strip, so the code can be explicit about what it's doing.&lt;/p&gt;
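&lt;p&gt;A sketch of that pass — the exact patterns here are assumptions, but the shape is a list of named signal phrases tried in order:&lt;/p&gt;

```javascript
// Sketch of the signal-word pass. The exact regexes are assumptions;
// the real code keeps a named match per signal phrase.
const signalPhrases = [
  /\bpresents?\b[.:]*\s*/i,
  /\b(?:retrospective )?screening of:?\s*/i,
  /\bpremiere of:?\s*/i,
];

function takeTitleAfterSignal(title) {
  for (const pattern of signalPhrases) {
    const match = title.match(pattern);
    if (match) {
      // Everything before the signal phrase is venue/event branding
      return title.slice(match.index + match[0].length).trim();
    }
  }
  return title;
}
```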

&lt;h2&gt;
  
  
  The Corrections List
&lt;/h2&gt;

&lt;p&gt;Even after removing known phrases and applying structural patterns, there are titles that are just wrong — or at least not in the form TMDB expects. That's where &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/normalize-title.js" rel="noopener noreferrer"&gt;&lt;code&gt;normalize-title.js&lt;/code&gt;&lt;/a&gt; comes in. It has a &lt;code&gt;corrections&lt;/code&gt; array with around 500 entries, covering everything from typos to venue-specific quirks to completely misnamed films.&lt;/p&gt;

&lt;p&gt;Some are straightforward spelling fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Carvaggio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Caravaggio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Seigfried&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Siegfried&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Labryinth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Labyrinth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some are encoding artefacts or odd formatting choices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;amp;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;½&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; 1/2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
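&lt;p&gt;That first entry fixes double-encoded ampersands. A hedged sketch of the same idea, applied repeatedly until the string is stable (the real list applies each correction as a plain replacement):&lt;/p&gt;

```javascript
// Collapse double-encoded ampersands until none remain. Each pass
// strictly shortens the string, so the loop always terminates.
function fixDoubleEncodedAmpersands(title) {
  let result = title;
  while (result.includes("&amp;")) {
    result = result.split("&amp;").join("&");
  }
  return result;
}
```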



&lt;p&gt;Some are venues getting the actual film title wrong. The BFI listed a film as "Battleground", translating from the original Italian, but its English title is "Battlefield":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Battleground + intro &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Battlefield + intro &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then there are the genuinely weird ones. &lt;code&gt;MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP &amp;amp; SING-ALONG&lt;/code&gt; — that's not a film, it's an event which includes a film.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MUPPET PUPPETS CHRISTMAS CAROL WORKSHOP &amp;amp; SING-ALONG&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Muppet Christmas Carol&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With hindsight, this is the kind of thing I try to avoid: a one-off correction for a singular event. This probably shouldn't have had a correction applied at all; instead it should have fallen through to the LLM for identification using matching hints.&lt;/p&gt;

&lt;p&gt;One entry I'm particularly fond of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;/^Dr&lt;/span&gt;&lt;span class="se"&gt;\.?&lt;/span&gt;&lt;span class="sr"&gt; Strangelove$/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cinemas almost never write the full title, but having the full title makes a TMDB search much more likely to return the right match.&lt;/p&gt;
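&lt;p&gt;Applying a mixed list like this is straightforward. The shape below is an assumption (mirroring the entries shown above): each correction is a pair whose match may be an exact string or a regex:&lt;/p&gt;

```javascript
// Assumed shape of the corrections list: [match, replacement] pairs,
// where the match is either an exact string or a regex.
const corrections = [
  ["Carvaggio", "Caravaggio"],
  [/^Dr\.? Strangelove$/i,
   "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"],
];

function applyCorrections(title) {
  let result = title;
  for (const [match, replacement] of corrections) {
    result = match instanceof RegExp
      ? result.replace(match, replacement)
      : result.split(match).join(replacement);
  }
  return result;
}
```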

&lt;h2&gt;
  
  
  What Gets Stripped Last
&lt;/h2&gt;

&lt;p&gt;After the corrections and phrase removal, there's a final cleanup pass: diacritics are normalised, smart quotes become straight quotes, soft hyphens are removed, trailing punctuation goes, and leading articles (&lt;code&gt;the&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt;) are stripped (in most cases, not all) so that &lt;code&gt;The Big Lebowski&lt;/code&gt; and &lt;code&gt;Big Lebowski&lt;/code&gt; match the same thing.&lt;/p&gt;
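&lt;p&gt;A hedged sketch of that cleanup pass, not the exact implementation (in particular, the real article-stripping has exceptions this version ignores):&lt;/p&gt;

```javascript
function finalCleanup(title) {
  return title
    .normalize("NFD").replace(/[\u0300-\u036f]/g, "") // fold diacritics away
    .replace(/[\u2018\u2019]/g, "'")                  // smart quotes to straight
    .replace(/\u00ad/g, "")                           // drop soft hyphens
    .replace(/[.,:;!]+$/, "")                         // drop trailing punctuation
    .replace(/^(?:the|a)\s+/i, "")                    // strip leading articles
    .trim();
}
```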

&lt;p&gt;Year suffixes in brackets like &lt;code&gt;(1971)&lt;/code&gt; are kept, because they're genuinely useful disambiguation — &lt;code&gt;Psycho (1960)&lt;/code&gt; is a different film from &lt;code&gt;Psycho (1998)&lt;/code&gt; (and you'll probably want to know which version you're about to watch 😉).&lt;/p&gt;

&lt;p&gt;There's also the theatre performance problem. Some venues list National Theatre Live and Royal Ballet screenings using the same listing format as regular films. &lt;code&gt;NT Live: Dr Strangelove&lt;/code&gt; isn't looking for a film called "Dr Strangelove" — it's looking for the NT Live broadcast of it. There's a whole separate setup for that which gets detected and normalised before this pipeline runs. But that's probably worth its own post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perfect Is the Enemy of Good
&lt;/h2&gt;

&lt;p&gt;The list of corrections is never going to be finished. New venues bring new branding. Films get re-released with different title formats. Cinemas just spell things wrong.&lt;/p&gt;

&lt;p&gt;What the normalisation step needs to do is get &lt;em&gt;most&lt;/em&gt; titles into a clean enough state that the TMDB search returns the right film. The cases it misses — titles that are too ambiguous or too corrupted — fall through to the LLM matching stage, which can handle a messier input. That's the right place for those anyway: the normalisation step is supposed to be fast and cheap, not exhaustive.&lt;/p&gt;

&lt;p&gt;The test suite in &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/tests/normalize-title.test.js" rel="noopener noreferrer"&gt;&lt;code&gt;normalize-title.test.js&lt;/code&gt;&lt;/a&gt; keeps the list honest. Every correction and removable phrase is supposed to have a corresponding test case in &lt;a href="https://github.com/clusterflick/scripts/blob/cc77f913c7c2db110362b4f532d076b794e09b03/common/tests/test-titles.json" rel="noopener noreferrer"&gt;&lt;code&gt;test-titles.json&lt;/code&gt;&lt;/a&gt;, so there's a record of what each entry is for and a way to verify it doesn't break anything when the list changes. And it gets updated every day as new data comes in.&lt;/p&gt;

&lt;p&gt;It's not elegant. But the alternative — sending &lt;code&gt;BAR TRASH – THE ZODIAC KILLER (1971) at Beer Merchants Tap&lt;/code&gt; to TMDB and hoping for the best — doesn't work. And now you know why 🍿&lt;/p&gt;

&lt;p&gt;P.S. Shout out to &lt;a href="https://clusterflick.com/film-clubs/bar-trash/" rel="noopener noreferrer"&gt;Bar Trash&lt;/a&gt; for having some of the most consistent and standardised titles ❤️&lt;br&gt;
Those titles make for a great example in this blog post, but they're far from being the most complex ones I need to deal with!&lt;/p&gt;

&lt;p&gt;🎬 A list of the movies mentioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/63959-the-zodiac-killer" rel="noopener noreferrer"&gt;The Zodiac Killer (1971)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/128" rel="noopener noreferrer"&gt;Princess Mononoke (1997)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/10437-the-muppet-christmas-carol" rel="noopener noreferrer"&gt;The Muppet Christmas Carol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/60808-flame" rel="noopener noreferrer"&gt;Flame (1975)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/1214052" rel="noopener noreferrer"&gt;To a Land Unknown (2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/242542-rozstanie" rel="noopener noreferrer"&gt;Goodbye to the Past (1961)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/454-romeo-juliet" rel="noopener noreferrer"&gt;Romeo + Juliet (1996)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/129" rel="noopener noreferrer"&gt;Spirited Away (2001)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/1174246-campo-di-battaglia" rel="noopener noreferrer"&gt;Battlefield (2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/935-dr-strangelove-or-how-i-learned-to-stop-worrying-and-love-the-bomb" rel="noopener noreferrer"&gt;Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/1401957-national-theatre-live-dr-strangelove" rel="noopener noreferrer"&gt;National Theatre Live: Dr. Strangelove (2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/539-psycho" rel="noopener noreferrer"&gt;Psycho (1960)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.themoviedb.org/movie/11252-psycho" rel="noopener noreferrer"&gt;Psycho (1998)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; &lt;del&gt;Testing Your Prompts Like You Test Your Code&lt;/del&gt;&lt;br&gt;
Unfortunately I haven't finished that work yet, so the next post will instead be &lt;em&gt;The Raspberry Pi Cluster in My Living Room&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>datascience</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Site Performance: Loading 30,000 Showings in a Browser</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 11 Mar 2026 08:47:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/site-performance-loading-30000-showings-in-a-browser-30go</link>
      <guid>https://dev.to/alistairjcbrown/site-performance-loading-30000-showings-in-a-browser-30go</guid>
      <description>&lt;p&gt;At least twice a day, the pipeline scrapes 250+ London cinemas and produces a dataset of 1,500+ films with 30,000+ showings. Then I need to get all of that into a browser.&lt;/p&gt;

&lt;p&gt;Getting the raw data from venues is its own challenge (&lt;a href="https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk"&gt;covered in an earlier post&lt;/a&gt;) but even once you've got it, making it available to users fast and in a useful way has its own set of problems to solve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clusterflick.com/" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; runs entirely as a static site, served from GitHub Pages with no live server. That's a deliberate constraint — the whole project runs on GitHub's free tier, and I'd like to keep it that way (more on that in a future post). But it means the browser has to do more of the work, and that puts performance decisions front and centre.&lt;/p&gt;

&lt;p&gt;By the time data reaches the frontend, it's already been through several pipeline stages — each one producing a GitHub release that the next stage picks up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/clusterflick/data-retrieved/" rel="noopener noreferrer"&gt;&lt;strong&gt;Retrieve:&lt;/strong&gt;&lt;/a&gt; raw HTML, JSON APIs, and scraped pages from all 252 venues

&lt;ul&gt;
&lt;li&gt;~800 MB total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="//github.com/clusterflick/data-transformed/"&gt;&lt;strong&gt;Transform:&lt;/strong&gt;&lt;/a&gt; extracts structured showings from the raw data, matches films against &lt;a href="https://www.themoviedb.org/" rel="noopener noreferrer"&gt;TMDB&lt;/a&gt; and saves the ID of matches

&lt;ul&gt;
&lt;li&gt; down to ~15 MB total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://github.com/clusterflick/data-combined/" rel="noopener noreferrer"&gt;&lt;strong&gt;Combine:&lt;/strong&gt;&lt;/a&gt; merges the films from all venues together and hydrates films that have a TMDB ID with rich metadata (cast, genres, poster images, ratings)

&lt;ul&gt;
&lt;li&gt;~18 MB total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://github.com/clusterflick/clusterflick.com/blob/28ada56182d96a253e218133bcc5edcdd304cc64/scripts/process-combined-data.js" rel="noopener noreferrer"&gt;&lt;strong&gt;Process:&lt;/strong&gt;&lt;/a&gt; strips redundant data, extracts URL prefixes, splits into chunks

&lt;ul&gt;
&lt;li&gt;~5 MB raw, ~1.5 MB gzipped over the wire&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This post is about the decisions in that last step (and one I unmade): getting from the combined JSON to something a browser can load and render quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sl0v05k3t3nwispga3g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sl0v05k3t3nwispga3g.jpg" alt="Clusterflick main page" width="800" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compression Detour
&lt;/h2&gt;

&lt;p&gt;Before building anything clever on the frontend, I wanted to be sure the raw data was as small as possible. I'd been running the JSON through &lt;a href="https://www.npmjs.com/package/compress-json" rel="noopener noreferrer"&gt;&lt;code&gt;compress-json&lt;/code&gt;&lt;/a&gt;, a library that structurally transforms JSON — deduplicating repeated values into lookup tables, encoding types differently. It made the raw file dramatically smaller. As an example, for one of the runs the full dataset without it is 10.97 MB; with it, 4.85 MB. That's a real reduction.&lt;/p&gt;

&lt;p&gt;So I ran a benchmark across every optimisation in the pipeline to see which ones were actually earning their place.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimisation&lt;/th&gt;
&lt;th&gt;Gzipped impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Removing showing overviews&lt;/td&gt;
&lt;td&gt;💪 -6.1% (saves 109 KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;URL prefix extraction&lt;/td&gt;
&lt;td&gt;💪 -5.0% (saves 90 KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Removing IDs&lt;/td&gt;
&lt;td&gt;💪 -2.4% (saves 43 KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Removing false a11y flags&lt;/td&gt;
&lt;td&gt;🤷 ~0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trimming RT data&lt;/td&gt;
&lt;td&gt;🤷 ~0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compress-json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;😱 &lt;strong&gt;+18.5% (hurts!)&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline finding: &lt;code&gt;compress-json&lt;/code&gt; makes the gzipped output &lt;em&gt;larger&lt;/em&gt;. Without it, the gzipped total is 1.43 MB. With it, 1.76 MB. That's 333 KB I was paying to make things worse.&lt;/p&gt;

&lt;p&gt;The reason makes sense once you think about it. Gzip excels at finding repeated byte sequences — exactly what &lt;code&gt;compress-json&lt;/code&gt; was doing first. The two approaches fight each other: &lt;code&gt;compress-json&lt;/code&gt;'s transformed structure is actually &lt;em&gt;harder&lt;/em&gt; for gzip to compress than plain repetitive JSON. Gzip decompression is built into every browser's network stack — native C++ code that runs before JavaScript even sees the response. &lt;code&gt;compress-json&lt;/code&gt; decompression, by contrast, runs on the main thread in JavaScript. So the current pipeline was paying three times: larger transfer size, extra JS bundle weight for the decompress library, and CPU time running &lt;code&gt;decompress()&lt;/code&gt; on every chunk.&lt;/p&gt;

&lt;p&gt;So I deleted it. The "no compress-json" variant still has all the other optimisations applied and lands at 1.43 MB — 19% smaller than before. 🎉&lt;/p&gt;

&lt;p&gt;The two optimisations that turned out to have zero impact — removing false accessibility flags and trimming Rotten Tomatoes fields — were easy to rationalise after the fact. Accessibility data is sparse; very few performances have those flags set at all, so deleting &lt;code&gt;false&lt;/code&gt; values removes almost nothing. The RT fields are a handful of small values per movie. Neither gives gzip much to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting the Data into Chunks
&lt;/h2&gt;

&lt;p&gt;Even at 1.43 MB gzipped, serving the full dataset as a single file would mean users wait for everything before seeing anything. Instead, as part of the data processing it's split into chunks, with a metadata file written alongside them.&lt;/p&gt;

&lt;p&gt;The chunking isn't by movie count — it's by &lt;strong&gt;serialised byte size&lt;/strong&gt;, with a target of ~400 KB per chunk. Chunking by movie count would produce wildly uneven file sizes; a blockbuster showing at 50+ venues generates far more data than a one-week indie run. Chunking by performance count was an earlier approach, but it still produced too much variance — chunk files ranged from 65 KB to 1.2 MB. Switching to byte size brought that down to 16 KB to 727 KB, with the bulk of chunks clustering tightly between 324 KB and 436 KB.&lt;/p&gt;
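&lt;p&gt;Greedy bucketing by serialised size is only a few lines — this is a sketch under assumed names, not the pipeline's actual code:&lt;/p&gt;

```javascript
// Greedy bucketing by serialised byte size. A single movie larger than
// the target necessarily gets a bucket to itself.
const TARGET_BYTES = 400 * 1024;

function chunkBySize(movies, targetBytes = TARGET_BYTES) {
  const chunks = [];
  let current = [];
  let currentBytes = 0;
  for (const movie of movies) {
    const size = Buffer.byteLength(JSON.stringify(movie), "utf8");
    // Flush the current bucket if adding this movie would overflow it
    if (current.length > 0 && currentBytes + size > targetBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(movie);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```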

&lt;p&gt;The remaining outliers are expected. The small tail chunks at the end of the alphabet simply don't have enough movies left to fill a full bucket. The large ones contain individual films whose serialised data alone exceeds the target — a blockbuster with 50+ venues and thousands of performances will do that — so they necessarily get a bucket to themselves.&lt;/p&gt;

&lt;p&gt;Movies are sorted alphabetically by &lt;em&gt;normalised&lt;/em&gt; title before being bucketed — mirroring the default sort order on the site. The idea is that chunk 0 downloads first and contains the movies at the top of the list, which are visible on screen when the page first loads. So the data the user actually sees is most likely to arrive first, and there's less chance of visible updates as subsequent chunks load in below the fold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.meta.a1b2c3d4e5.json
data.0.f6g7h8i9j0.json
data.1.k1l3m5n7o9.json
...
data.&amp;lt;index&amp;gt;.&amp;lt;fingerprint&amp;gt;.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28x0o2opfub7tg5b9frp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28x0o2opfub7tg5b9frp.jpg" alt="Screenshot of the network web development tools showing data chunks loading in" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The metadata file carries the full lookup tables for genres, people, and venues (shared across all movies), the URL prefix table used to reconstruct booking links, and the &lt;code&gt;mapping&lt;/code&gt; that tells the client which chunk contains which movie ID. It's the one file the browser always fetches first — and it's hashed like the chunks, so its filename is baked into &lt;code&gt;NEXT_PUBLIC_DATA_FILENAME&lt;/code&gt; at build time.&lt;/p&gt;

&lt;p&gt;There's one catch with GitHub Pages: it sets a 10-minute cache TTL on everything at the browser level, which means even a fingerprinted file that hasn't changed for weeks gets revalidated every 10 minutes. Cloudflare sits in front of the site and fixes this in two ways: it caches the files at the edge, and it overrides GitHub's cache-control headers so browsers are told to store all JSON files for a year. Since every file — chunks and metadata alike — is fingerprinted, a changed file always means a new URL and a cache miss by design. A first-time visitor fetches from Cloudflare's edge and caches locally for a year. A repeat visitor gets it straight from their browser cache. Either way, they're only ever making a network request for files that have actually changed.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc56350nj7nyjmzbrluvp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc56350nj7nyjmzbrluvp.jpg" alt="Cloudflare cache control rule for JSON files, storing them at an edge cache and setting the browser cache header to 1 year" width="800" height="780"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Once the client has the metadata, &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/state/cinema-data-context.tsx#L204" rel="noopener noreferrer"&gt;&lt;code&gt;CinemaDataProvider&lt;/code&gt; handles the rest&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Priority chunk&lt;/strong&gt; — on a movie detail page, the client looks up the movie's chunk in the mapping and &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/state/cinema-data-context.tsx#L272-L275" rel="noopener noreferrer"&gt;fetches it immediately&lt;/a&gt;. Showings appear before the rest of the dataset has loaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All other chunks in parallel&lt;/strong&gt; — &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/state/cinema-data-context.tsx#L279-L285" rel="noopener noreferrer"&gt;via &lt;code&gt;Promise.allSettled()&lt;/code&gt;&lt;/a&gt;, so a single failed chunk doesn't block everything else from loading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand and prune&lt;/strong&gt; — IDs stripped before serialisation are re-added via &lt;code&gt;expandData()&lt;/code&gt; (restoring the keys that were removed to save bytes), and past performances are stripped before chunks enter React state.&lt;/li&gt;
&lt;/ol&gt;
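&lt;p&gt;That load order can be sketched in a few lines. The metadata shape and helper names (&lt;code&gt;chunkUrls&lt;/code&gt;, &lt;code&gt;mapping&lt;/code&gt;, &lt;code&gt;fetchJson&lt;/code&gt;) are assumptions, not the real provider's API:&lt;/p&gt;

```javascript
// Move the chunk containing the priority movie to the front of the queue.
function prioritiseChunks(chunkUrls, mapping, movieId) {
  const order = [...chunkUrls];
  const index = movieId == null ? undefined : mapping[movieId];
  if (index !== undefined) {
    order.unshift(...order.splice(index, 1));
  }
  return order;
}

// Fetch everything in parallel; one failed chunk doesn't block the rest.
async function loadChunks(fetchJson, meta, movieId) {
  const order = prioritiseChunks(meta.chunkUrls, meta.mapping, movieId);
  const results = await Promise.allSettled(order.map((url) => fetchJson(url)));
  return results
    .filter((result) => result.status === "fulfilled")
    .map((result) => result.value);
}
```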

&lt;h2&gt;
  
  
  Static Export Changes Everything
&lt;/h2&gt;

&lt;p&gt;Clusterflick uses &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; with &lt;code&gt;output: "export"&lt;/code&gt;. There's no live server. Every page is pre-rendered to static HTML during &lt;code&gt;npm run build&lt;/code&gt;, then served from GitHub Pages.&lt;/p&gt;

&lt;p&gt;This shapes every rendering decision. When Next.js docs talk about &lt;a href="https://nextjs.org/docs/app/getting-started/server-and-client-components" rel="noopener noreferrer"&gt;Server Components&lt;/a&gt;, in this context that means "code that runs at build time on a Node process" — not a server handling live requests. Whatever I pre-render is fixed until the next build.&lt;/p&gt;
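&lt;p&gt;For reference, the static-export mode is a single setting:&lt;/p&gt;

```javascript
// next.config.js (or equivalent): pre-render everything at build time
// to static HTML, with no live server.
module.exports = {
  output: "export",
};
```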

&lt;h2&gt;
  
  
  Two Grids on the Home Page
&lt;/h2&gt;

&lt;p&gt;The home page has a slightly odd architecture, and it's worth explaining why.&lt;/p&gt;

&lt;p&gt;At build time, &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/page.tsx" rel="noopener noreferrer"&gt;&lt;code&gt;app/page.tsx&lt;/code&gt;&lt;/a&gt; (a Server Component) reads the chunk files from disk, merges them, applies the default filters — films and shorts, 7-day window — and takes the first 72 results sorted by normalized title. These 72 movies are rendered as a static HTML grid of poster images and links. No JavaScript required. This grid is wrapped in an &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/ssr-only.tsx" rel="noopener noreferrer"&gt;&lt;code&gt;SSROnly&lt;/code&gt; component&lt;/a&gt; that removes itself after hydration.&lt;/p&gt;

&lt;p&gt;So during the initial paint, and for any crawler, there's a real grid of films with real titles and links in the HTML. Once JavaScript loads and mounts, &lt;code&gt;SSROnly&lt;/code&gt; cleans up that static content and hands off to the interactive grid.&lt;/p&gt;

&lt;p&gt;The 72 limit is deliberate. It's enough for a meaningful SEO payload — film titles, poster images, links — without bloating the HTML with hundreds of entries. The real, interactive grid that users actually browse is built entirely client-side with the full dataset, applying any filters which may be in effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtualising 1,500+ Posters
&lt;/h2&gt;

&lt;p&gt;The filter UI is designed to give immediate visual feedback as you change options — in the current design the filter overlay is semi-transparent, so you can see the poster grid updating behind it as you adjust. That only works if rendering is fast. On an earlier design, where the filter controls sat directly above a flat list of results, the lag was obvious and painful: every filter change triggered a re-render of the entire list.&lt;/p&gt;

&lt;p&gt;The solution is &lt;a href="https://github.com/bvaughn/react-virtualized" rel="noopener noreferrer"&gt;&lt;code&gt;react-virtualized&lt;/code&gt;&lt;/a&gt; — specifically its &lt;code&gt;Grid&lt;/code&gt; component combined with &lt;code&gt;WindowScroller&lt;/code&gt;. Rather than rendering the full list, it calculates which cells are currently visible in the viewport and only renders those, plus a small buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WindowScroller&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isScrolling&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;registerChild&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onChildScroll&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;scrollTop&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;registerChild&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt;
        &lt;span class="na"&gt;autoHeight&lt;/span&gt;
        &lt;span class="na"&gt;cellRenderer&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;cellRenderer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;columnCount&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;columnCount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;columnWidth&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;POSTER_WIDTH&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;// 208px per column&lt;/span&gt;
        &lt;span class="na"&gt;rowHeight&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;POSTER_HEIGHT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;    &lt;span class="c1"&gt;// 308px per row&lt;/span&gt;
        &lt;span class="na"&gt;rowCount&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;rowCount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;overscanRowCount&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;               &lt;span class="c1"&gt;// pre-render 3 rows above/below viewport&lt;/span&gt;
        &lt;span class="na"&gt;scrollTop&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;scrollTop&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;isScrolling&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isScrolling&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;onScroll&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;onChildScroll&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="err"&gt;...&lt;/span&gt;
      &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;WindowScroller&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~ &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/page-content.tsx#L186-L213" rel="noopener noreferrer"&gt;&lt;code&gt;src/app/page-content.tsx&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WindowScroller&lt;/code&gt; ties the grid's scroll position to the page's native scroll rather than creating a separate scrollable container. That keeps the browser scrollbar, avoids scroll-jank on mobile, and means the address bar hides naturally on iOS.&lt;/p&gt;

&lt;p&gt;Fixed cell dimensions (always 200×300px with an 8px gap) let react-virtualized calculate row and column positions with simple arithmetic, avoiding expensive DOM measurement. Window width isn't available at build time, so the component initialises with a single-column placeholder and sets real dimensions in a &lt;code&gt;useEffect&lt;/code&gt; after mount.&lt;/p&gt;
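&lt;p&gt;Sketched out (with illustrative helper names, not the repo's actual code), that arithmetic is just division over the fixed cell size:&lt;/p&gt;

```javascript
// Illustrative sketch of the fixed-dimension arithmetic described above.
// POSTER_WIDTH and GAP mirror the 200x300px cells with an 8px gap; the
// helper names are assumptions, not the actual component code.
const POSTER_WIDTH = 200;
const GAP = 8;

// How many fixed-width columns fit in the current viewport.
function getColumnCount(viewportWidth) {
  return Math.max(1, Math.floor(viewportWidth / (POSTER_WIDTH + GAP)));
}

// How many rows are needed to show every movie at that column count.
function getRowCount(movieCount, columnCount) {
  return Math.ceil(movieCount / columnCount);
}
```

A `useEffect` after mount would read the real window width and feed these values into the `Grid` props, replacing the single-column placeholder.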

&lt;p&gt;The first two rows are above the fold on most screens, so &lt;code&gt;next/image&lt;/code&gt; is told to load those eagerly with &lt;code&gt;fetchpriority="high"&lt;/code&gt;. Everything below row 2 is lazy-loaded as the user scrolls.&lt;/p&gt;
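&lt;p&gt;The per-cell decision can be sketched as a tiny helper (a simplification: &lt;code&gt;next/image&lt;/code&gt; expresses eager loading through its &lt;code&gt;priority&lt;/code&gt; prop rather than a prop bag like this, and the function name is hypothetical):&lt;/p&gt;

```javascript
// Hypothetical sketch of the loading strategy per grid row described above.
function getImageLoadingProps(rowIndex) {
  const aboveTheFold = rowIndex < 2; // first two rows are likely visible on load
  return aboveTheFold
    ? { loading: "eager", fetchpriority: "high" }
    : { loading: "lazy", fetchpriority: "auto" };
}
```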

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdwxo08asovdqvcx5317.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdwxo08asovdqvcx5317.jpg" alt="Poster grid showing that only the visible posters are in the DOM" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One wrinkle: the intro section above the grid can be collapsed or expanded, which shifts the grid's offset on the page. &lt;code&gt;WindowScroller&lt;/code&gt; needs to know about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="nf"&gt;requestAnimationFrame&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatchEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;resize&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A synthetic &lt;code&gt;resize&lt;/code&gt; event prompts &lt;code&gt;WindowScroller&lt;/code&gt; to recalculate its position. Not elegant, but it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Movie Detail Pages: Stripping Performances Before They Cross the Wire
&lt;/h2&gt;

&lt;p&gt;Each film has its own pre-rendered page. &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page.tsx#L15-L22" rel="noopener noreferrer"&gt;&lt;code&gt;generateStaticParams()&lt;/code&gt;&lt;/a&gt; iterates every movie at build time and Next.js generates a static HTML file for each — typically 1,500+ pages per build.&lt;/p&gt;
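&lt;p&gt;In rough shape (with stand-in data and a stand-in &lt;code&gt;slugify&lt;/code&gt;, not the repo's actual code), that build-time enumeration looks like:&lt;/p&gt;

```javascript
// Hedged sketch of build-time param generation for a route of this shape;
// `movies` and `slugify` here are illustrative stand-ins.
const movies = [
  { id: 27205, title: "Inception" },
  { id: 157336, title: "Interstellar" },
];

function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Next.js calls this at build time; each returned object becomes one
// pre-rendered /movies/[id]/[slug] page.
async function generateStaticParams() {
  return movies.map((movie) => ({
    id: String(movie.id),
    slug: slugify(movie.title),
  }));
}
```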

&lt;p&gt;The &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page.tsx#L223-L244" rel="noopener noreferrer"&gt;&lt;code&gt;app/movies/[id]/[slug]/page.tsx&lt;/code&gt; Server Component&lt;/a&gt; does the structurally stable work: resolves genres, people, and venues for the film; generates JSON-LD structured data (&lt;code&gt;Movie&lt;/code&gt;, &lt;code&gt;BreadcrumbList&lt;/code&gt;, &lt;code&gt;ScreeningEvent&lt;/code&gt;) for search engine rich results. Then — critically — it strips &lt;code&gt;performances&lt;/code&gt; from the movie prop before passing it to the client component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;performances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;_performances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;movieWithoutPerformances&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means the pre-rendered HTML — and the inline JSON Next.js serialises into it for hydration — only contains movie metadata (title, poster, ratings, cast). The actual showtimes are fetched at runtime by the data context.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page-content.tsx#L47-L54" rel="noopener noreferrer"&gt;&lt;code&gt;app/movies/[id]/[slug]/page-content.tsx&lt;/code&gt; Client Component&lt;/a&gt; calls &lt;a href="https://github.com/clusterflick/clusterflick.com/blob/b90ac8737b4aa032e8be35bf0bf572d44b03e30a/src/app/movies/%5Bid%5D/%5Bslug%5D/page-content.tsx#L81" rel="noopener noreferrer"&gt;&lt;code&gt;getDataWithPriority(movie.id)&lt;/code&gt;&lt;/a&gt; on mount, which fetches the chunk containing &lt;em&gt;this&lt;/em&gt; film first before loading everything else in parallel. A &lt;code&gt;startTransition&lt;/code&gt; defers the showings computation until after the hero section has rendered — so the poster, title, and ratings appear immediately, with showtimes filling in shortly after.&lt;/p&gt;
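&lt;p&gt;The fetch ordering amounts to "await the chunk this page needs, then fan out for the rest". A minimal sketch (the signature and chunk layout here are assumptions, not the actual API):&lt;/p&gt;

```javascript
// Illustrative sketch of priority-first chunk loading described above.
async function getDataWithPriority(priorityChunk, allChunks, fetchChunk) {
  // Await the chunk containing this film before anything else...
  const first = await fetchChunk(priorityChunk);
  // ...then load the remaining chunks in parallel.
  const rest = await Promise.all(
    allChunks.filter((chunk) => chunk !== priorityChunk).map(fetchChunk),
  );
  return [first, ...rest];
}
```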

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr87rxvaeiwtbqbl5i0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsr87rxvaeiwtbqbl5i0.gif" alt="Animation showing performancing loading in after the main page content on the Project Hail Mary movie page" width="1240" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Stands
&lt;/h2&gt;

&lt;p&gt;With all of this in place, I ran Lighthouse against the site across cold and warm cache — averaged over three runs on desktop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Cold cache&lt;/th&gt;
&lt;th&gt;Warm cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lighthouse score&lt;/td&gt;
&lt;td&gt;74/100&lt;/td&gt;
&lt;td&gt;92/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First Contentful Paint&lt;/td&gt;
&lt;td&gt;459ms&lt;/td&gt;
&lt;td&gt;23ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest Contentful Paint&lt;/td&gt;
&lt;td&gt;2.5s&lt;/td&gt;
&lt;td&gt;281ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed Index&lt;/td&gt;
&lt;td&gt;2.5s&lt;/td&gt;
&lt;td&gt;42ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cumulative Layout Shift&lt;/td&gt;
&lt;td&gt;0.197&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transfer size&lt;/td&gt;
&lt;td&gt;5.5 MB&lt;/td&gt;
&lt;td&gt;20 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb17g0ij4td50ozmwq6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb17g0ij4td50ozmwq6g.png" alt="Screenshot of the CLI output which has the same information as the above table" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The warm cache numbers are the point of everything in this post — 308 of 336 network requests served from cache, 5.5 MB down to 20 KB (less than 1% of the data going across the wire), LCP dropping from 2.5s to 281ms (about 10% of the original time). That's what content-hashed files plus a year-long browser TTL actually buys you.&lt;/p&gt;

&lt;p&gt;Cold cache is where there's still work to do. A 74/100 and a 2.5s LCP on first visit aren't bad, but they're not where I'd like them to be. The LCP is the main thing to improve — 2.5s sits right on Google's boundary between "good" and "needs improvement", and it's what's dragging the cold cache score down. The CLS (0.197) is a known trade-off from the SSR grid handing off to the virtualised interactive grid, but given warm cache sits at 0.18 and still scores 92/100, it's clearly not the bottleneck.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; Cleaning Cinema Titles Before You Can Even Search&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>frontend</category>
      <category>performance</category>
      <category>webdev</category>
    </item>
    <item>
      <title>A Brief Detour: Two Writing Challenges and What Came Out of Them</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 04 Mar 2026 08:30:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/a-brief-detour-two-writing-challenges-and-what-came-out-of-them-4h8h</link>
      <guid>https://dev.to/alistairjcbrown/a-brief-detour-two-writing-challenges-and-what-came-out-of-them-4h8h</guid>
      <description>&lt;p&gt;Regular Clusterflick series readers: I got distracted. Twice 😅&lt;/p&gt;

&lt;p&gt;In the last week I entered a couple of dev.to writing challenges, and both turned out to be good excuses to write about things that were already on the series roadmap — just earlier and in a slightly different shape than I'd originally planned.&lt;/p&gt;

&lt;p&gt;The first was the 1️⃣ &lt;a href="https://dev.to/challenges/weekend-2026-02-28"&gt;DEV Weekend Challenge: Community&lt;/a&gt;, which I used to write about the &lt;a href="https://clusterflick.com/film-clubs/" rel="noopener noreferrer"&gt;film club discovery&lt;/a&gt; and &lt;a href="https://clusterflick.com/near-me/" rel="noopener noreferrer"&gt;"near me"&lt;/a&gt; features I'd finally taken the time to build. The second was the 2️⃣ &lt;a href="https://dev.to/challenges/mlh/built-with-google-gemini-02-25-26"&gt;Built with Google Gemini: Writing Challenge&lt;/a&gt;, which pulled forward what was going to be a later post about using LLMs in the data pipeline.&lt;/p&gt;

&lt;p&gt;Both are standalone submissions, but they're very much part of this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/alistairjcbrown/i-built-a-film-club-discovery-tool-for-londons-cinema-community-2md"&gt;Making London's hidden film clubs discoverable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/alistairjcbrown/three-things-i-learned-using-llms-in-a-data-pipeline-51c3"&gt;Three Things I Learned Using LLMs in a Data Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM post in particular covers things I'd have gotten to eventually in this series — the matching pipeline, the &lt;code&gt;reason&lt;/code&gt; key trick, defensive JSON parsing. Worth a read if you've been following along!&lt;/p&gt;

&lt;p&gt;Back to the regular schedule next week 🫡&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; Site Performance: Loading 30,000+ Showings in a Browser&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>ai</category>
      <category>opensource</category>
      <category>clusterflick</category>
    </item>
    <item>
      <title>Three Things I Learned Using LLMs in a Data Pipeline</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Mon, 02 Mar 2026 19:44:27 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/three-things-i-learned-using-llms-in-a-data-pipeline-51c3</link>
      <guid>https://dev.to/alistairjcbrown/three-things-i-learned-using-llms-in-a-data-pipeline-51c3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/mlh-built-with-google-gemini-02-25-26"&gt;Built with Google Gemini: Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built with Google Gemini
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"Ghibliotheque Presents: My Neighbor Totoro + Intro"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a real cinema listing title, but it's not a title you can just search for. And as titles go, it's one of the more straightforward ones. Things get even messier when we get into cinema listing pages. I've seen venues that don't include a year, don't include the director, or give you little more than a title and a one-line description. If you're building an aggregator that needs to identify what's actually showing, you spend a lot of time staring at strings like this.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt;, a cinema aggregator for London that pulls listings from 250+ venues daily. I thought scraping would be the hard part. But figuring out what a listing actually &lt;em&gt;is&lt;/em&gt; — which film, matched to which entry in &lt;a href="https://themoviedb.org" rel="noopener noreferrer"&gt;The Movie DB&lt;/a&gt; — is where a lot of complexity lies. And it's where I've been using Gemini.&lt;/p&gt;

&lt;p&gt;There's a whole layer of work involved in cleaning raw listing strings down to something searchable — that's worth a post of its own — but even with a clean title, the matching problem doesn't go away. Many venues don't include the necessary information to programmatically search using The Movie DB API — just a title and maybe a vague description. Even when they do have more data, e.g. title plus year or even title plus director, it doesn't necessarily uniquely identify a film. And legitimate films with short or common names can be difficult to surface in TMDB search results at all.&lt;/p&gt;

&lt;p&gt;I use Gemini to help at four stages in the identification pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Match against TMDB&lt;/strong&gt; — given the cinema listing and a list of search results from TMDB, Gemini picks the best match. This handles the majority of cases.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/ask-llm-to-review-results.js" rel="noopener noreferrer"&gt;&lt;code&gt;common/ask-llm-to-review-results.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct identification&lt;/strong&gt; — if TMDB search returns nothing useful, I ask Gemini if it recognises the film from the listing alone. Its training data often knows about films that don't surface well through search.

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/ask-llm.js" rel="noopener noreferrer"&gt;&lt;code&gt;common/ask-llm.js&lt;/code&gt;&lt;/a&gt; (The original use of Gemini in the project — everything else has grown from this first step)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify the listing&lt;/strong&gt; — if we still can't identify a film, I ask Gemini what the listing actually &lt;em&gt;is&lt;/em&gt;: a film, a short, a double bill, a quiz night, a live event, a comedy show. That classification feeds into filters on the website, and it determines what happens next in the pipeline — a listing classified as multiple films or shorts triggers its own follow-up steps.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/ask-llm-to-categorise.js" rel="noopener noreferrer"&gt;&lt;code&gt;common/ask-llm-to-categorise.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract multiple films or shorts&lt;/strong&gt; — if a listing is identified as containing multiple films or shorts (a double bill, a shorts programme, a marathon), I ask Gemini to pull out the individual titles so each can be matched separately.

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/scripts/transform/identify-multiple-movies.js" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/transform/identify-multiple-movies.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/scripts/transform/identify-shorts.js" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/transform/identify-shorts.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage only fires if the previous one didn't produce a result. That keeps costs down and means Gemini is only doing the hard work when simpler approaches have already failed.&lt;/p&gt;
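&lt;p&gt;That fall-through behaviour is a short-circuiting cascade. A minimal sketch with illustrative stage functions (not the pipeline's actual code):&lt;/p&gt;

```javascript
// Minimal sketch of the stage cascade described above: each stage runs only
// if the earlier stages produced nothing. Stages return null on no result.
async function identifyListing(listing, stages) {
  for (const stage of stages) {
    const result = await stage(listing);
    if (result !== null) return result; // first stage to succeed wins
  }
  return null; // nothing could identify the listing
}
```

Cheap deterministic checks sit early in the array, so the more expensive Gemini calls only run for the listings that genuinely need them.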

&lt;p&gt;The model I'm using is &lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt;. I'd been running on &lt;code&gt;gemini-2.0-flash&lt;/code&gt; for a while and recently upgraded — a one-line change in the code, and I saw no noticeable difference in the identification and categorisation output from the previous run. Free performance improvement!&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Clusterflick is live at &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;clusterflick.com&lt;/a&gt; — 250+ venues and thousands of films across London, updated daily.&lt;/p&gt;

&lt;p&gt;The pipeline code is open source (&lt;a href="https://github.com/clusterflick/scripts" rel="noopener noreferrer"&gt;github.com/clusterflick/scripts&lt;/a&gt;), and runs across GitHub's cloud runners and &lt;a href="https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk"&gt;a cluster of 6 Raspberry Pis in my living room&lt;/a&gt; — so if the judges are looking for a good home for that prize, I have a shelf ready! 🍿&lt;/p&gt;

&lt;p&gt;The parsing layer discussed below is in &lt;a href="https://github.com/clusterflick/scripts/blob/b0d0954749836c5ab4ad3c685811fbdf28410340/common/llm-client.js" rel="noopener noreferrer"&gt;llm-client.js&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Asking for a reason made the model more honest
&lt;/h3&gt;

&lt;p&gt;When I first started asking Gemini to match listings to TMDB results, I was asking it to return a match and a confidence score (I use 0–9). It worked, but I was getting too many confident wrong answers — the model would pick something and report high confidence even when it was clearly a stretch.&lt;/p&gt;

&lt;p&gt;The fix was adding a &lt;code&gt;reason&lt;/code&gt; key to the expected JSON response. Forcing the model to articulate &lt;em&gt;why&lt;/em&gt; it had chosen a match made it noticeably more cautious. It's like the difference between someone blurting out an answer and someone having to show their working. The false positives dropped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Listing matched description of magical forest spirits and animation style"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8392&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I now apply the same pattern wherever I need the model to make a judgement call. Structured output with a reason field is the single most effective prompt change I've made.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Gemini to improve my prompts
&lt;/h3&gt;

&lt;p&gt;At some point I realised I was spending more time tweaking prompts than writing actual pipeline code. So I started asking Gemini to critique and rewrite them for me.&lt;/p&gt;

&lt;p&gt;It sounds circular, but it works. The model is better than I am at structuring instructions for itself — clearer constraints, better edge case handling, more consistent output. Now when a prompt isn't giving me the results I want, my first step is to paste it into a fresh conversation and ask the model what's wrong with it and how it would rewrite it.&lt;/p&gt;

&lt;p&gt;The results are often prompts I wouldn't have written myself. More explicit about edge cases. Better at specifying output format. And because the model wrote them, they tend to produce more predictable responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defensive parsing is non-negotiable
&lt;/h3&gt;

&lt;p&gt;Even with well-crafted prompts, LLM output in production will occasionally be malformed. I found this out when the model truncated a film overview mid-sentence and left a trailing backslash — one bad character broke &lt;code&gt;JSON.parse&lt;/code&gt; and failed the entire job.&lt;/p&gt;

&lt;p&gt;The longer the pipeline ran, the more edge cases surfaced. The model occasionally hallucinates fields that aren't in the schema (&lt;code&gt;backdrop_path&lt;/code&gt; appearing uninvited was a fun one). It sometimes leaves unescaped quotes inside string values. Markdown code fences show up often enough that stripping them became standard. Each of these is now a line in the sanitisation layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chatSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Unwrap the string if it's been wrapped in a markdown block&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;correctedJsonString&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonString&lt;/span&gt;
  &lt;span class="c1"&gt;// Apply corrections for malformed escape characters (perhaps due to truncation)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\\(?![&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\/&lt;/span&gt;&lt;span class="sr"&gt;bfnrtu&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|u&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-9a-fA-F&lt;/span&gt;&lt;span class="se"&gt;]{4})&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Apply corrections for hallucinated invalid additions&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/"backdrop_path": "&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;,&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Fix unescaped quotes within the "reason" field value&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sr"&gt;/"reason"&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*:&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*"&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;.*&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;([&lt;/span&gt;&lt;span class="sr"&gt;,}&lt;/span&gt;&lt;span class="se"&gt;])&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reasonContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;terminator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fixed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;reasonContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(?&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;!&lt;/span&gt;&lt;span class="se"&gt;\\)&lt;/span&gt;&lt;span class="sr"&gt;"/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`"reason":"&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fixed&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;terminator&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;correctedJsonString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error parsing LLM answer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--- Original response: -----------------------&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--- Corrected response: ----------------------&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;correctedJsonString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line in there exists because of a real production issue. Treat LLM responses as untrusted input, sanitise before you parse, and log both the original and corrected response when things go wrong — you'll want that context when debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Gemini Feedback
&lt;/h2&gt;

&lt;p&gt;Flash-lite has been reliable and cheap, which matters when you're running a pipeline daily across hundreds of venues and thousands of films. Cost has stayed predictable as the number of venues has grown, which is exactly what I needed.&lt;/p&gt;

&lt;p&gt;One deliberate choice worth mentioning: I run with &lt;code&gt;temperature: 0&lt;/code&gt;. This is a data pipeline, not a creative writing tool — I want output that is as deterministic and consistent as possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;generationConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;topP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxOutputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The upgrade from 2.0 to 2.5 was painless — one line change, no prompt tuning needed. To confirm nothing had shifted, I ran the pipeline twice with each model version and compared the transformed output. No noticeable differences for any venues. That kind of stability is worth a lot in production.&lt;/p&gt;

&lt;p&gt;The main frustration I haven't fully solved is flip-flopping. The pipeline runs daily, and occasionally a listing that was confidently matched to film X on one run comes back as film Y the next. The confidence is right on the edge either way — only one can be right, or both can be wrong — and &lt;code&gt;temperature: 0&lt;/code&gt; helps but doesn't eliminate it. I'd love better signalling when the model is genuinely on the fence, rather than having to infer uncertainty from a confidence score that turns out not to be reliable enough to always act on.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>geminireflections</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Making London's hidden film clubs discoverable</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Sun, 01 Mar 2026 11:50:52 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/i-built-a-film-club-discovery-tool-for-londons-cinema-community-2md</link>
      <guid>https://dev.to/alistairjcbrown/i-built-a-film-club-discovery-tool-for-londons-cinema-community-2md</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/weekend-2026-02-28"&gt;DEV Weekend Challenge: Community&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Community
&lt;/h2&gt;

&lt;p&gt;I've spent the last year building &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; — a site that pulls together cinema listings from across London so you can see everything showing, everywhere, without jumping between a dozen different websites. It started as a personal itch: I just wanted to know what was on (for the backstory, &lt;a href="https://dev.to/alistairjcbrown/building-clusterflick-a-london-cinema-aggregator-kk3"&gt;see my intro post&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But the more I used it, the more I realised I was only solving half the problem. I could tell you &lt;em&gt;what&lt;/em&gt; was showing at &lt;em&gt;which venue&lt;/em&gt; — but I couldn't tell you if the screening was part of a &lt;strong&gt;film club&lt;/strong&gt;, whether the club screenings were accessible, or even that the club existed at all. London has a genuinely brilliant film club scene: community cinemas, genre nights, archive screenings, disability-led clubs. Most of them are invisible unless you already know to look for them.&lt;/p&gt;

&lt;p&gt;That felt wrong. These communities deserve better than a buried events page most people never find.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Two new features, both aimed at making London's film club community more discoverable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Film Club Pages
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://clusterflick.com/film-clubs" rel="noopener noreferrer"&gt;clusterflick.com/film-clubs&lt;/a&gt; gives each film club its own dedicated page. Each page shows their logo, a short description of who they are and what they programme, links back to their own site, and — crucially — pulls together their full upcoming lineup across &lt;em&gt;all&lt;/em&gt; the venues they screen at. A lot of clubs move around; they're not tied to a single cinema. Clusterflick now reflects that.&lt;/p&gt;

&lt;p&gt;To give a sense of the range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://clusterflick.com/film-clubs/bar-trash/" rel="noopener noreferrer"&gt;Bar Trash&lt;/a&gt; programmes cult and curiosity films for people who've exhausted the mainstream;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://clusterflick.com/film-clubs/pitchblack-playback/" rel="noopener noreferrer"&gt;Pitchblack Playback&lt;/a&gt; runs immersive listening sessions in the dark, using cinema sound systems the way most people never get to hear them;&lt;/li&gt;
&lt;li&gt;and &lt;a href="https://clusterflick.com/film-clubs/lost-reels/" rel="noopener noreferrer"&gt;Lost Reels&lt;/a&gt; specialises in bringing forgotten, lost, or otherwise unavailable films back to UK screens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three very different clubs, all doing something you won't find on a standard listings site, and all working across multiple venues.&lt;/p&gt;

&lt;p&gt;I also included accessibility information on each club page, surfaced directly from the screening data. If a club regularly programmes relaxed screenings or subtitled showings, that's highlighted. It shouldn't take three clicks to find out whether a club is somewhere you can actually go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Near Me
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://clusterflick.com/near-me" rel="noopener noreferrer"&gt;clusterflick.com/near-me&lt;/a&gt; uses the browser's location API to show you what's geographically closest to wherever you are right now — venues, films showing there, and the film clubs attached to those screenings. It's not trying to be Google Maps. The goal is simpler: give someone a starting point. "What's on near me tonight?" is one of the most natural questions in the world, and it's surprisingly hard to answer if you don't already know which cinemas are in your area. And alongside "what's on near me?", it now also answers "what film clubs are near me?" — surfacing the clubs connected to those local venues.&lt;/p&gt;

&lt;p&gt;Together, these two features turn Clusterflick from a listings aggregator into something closer to a community directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Both features are live now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎬 Film clubs: &lt;a href="https://clusterflick.com/film-clubs" rel="noopener noreferrer"&gt;clusterflick.com/film-clubs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📍 Near me: &lt;a href="https://clusterflick.com/near-me" rel="noopener noreferrer"&gt;clusterflick.com/near-me&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/9Kc8_OBBwic"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmsizp1bnltbiqvvx6o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmsizp1bnltbiqvvx6o7.png" alt="Bar Trash Film Club page on Clusterflick"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3uxpswcihzuikeqsexz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3uxpswcihzuikeqsexz.png" alt="Near You page on Clusterflick, showing Film Clubs in Hackney"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/clusterflick" rel="noopener noreferrer"&gt;
        clusterflick
      &lt;/a&gt; / &lt;a href="https://github.com/clusterflick/clusterflick.com" rel="noopener noreferrer"&gt;
        clusterflick.com
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Code for the clusterflick website
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Clusterflick&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://clusterflick.com" rel="nofollow noopener noreferrer"&gt;clusterflick.com&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://main--6984c607d80835bfe88c8309.chromatic.com" rel="nofollow noopener noreferrer"&gt;Storybook (Chromatic)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Every film, every cinema, one place.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Clusterflick is an open-source web app that aggregates film screenings from
across London cinemas into a single, searchable interface. Compare screenings,
find showtimes, and discover what's on — whether you're chasing new releases or
cult classics.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Cinema Listings&lt;/strong&gt; — Browse film screenings from 250+ London cinemas
in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Movie Data&lt;/strong&gt; — View ratings and reviews from IMDb, Letterboxd,
Metacritic, and Rotten Tomatoes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Event Types&lt;/strong&gt; — Find movies, TV screenings, comedy, music events,
talks, workshops, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Venues &amp;amp; Boroughs&lt;/strong&gt; — Browse all cinemas by venue or explore all 33 London
boroughs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Festival Pages&lt;/strong&gt; — Dedicated pages for London film festivals with full
programme listings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility Filters&lt;/strong&gt; — Filter by audio description, subtitles, hard of
hearing support, relaxed screenings, and baby-friendly showings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geolocation&lt;/strong&gt; — Sort venues by distance from your current location&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shareable Filters&lt;/strong&gt; —…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/clusterflick/clusterflick.com" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;And the data pipeline that feeds the cinema data the site relies on is here: &lt;a href="https://github.com/clusterflick/data-combined" rel="noopener noreferrer"&gt;github.com/clusterflick/data-combined&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;The site is built with &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; and TypeScript, hosted on GitHub Pages. The film club pages are statically generated — all the data is known ahead of time, so they can be fully built at deploy time (GitHub Pages only serves static files, so there's no server to render on). Near Me is the opposite: since it depends on the user's location, there's nothing to pre-render. The venue and screening data loads client-side, and the results appear once both that data and the user's location are available.&lt;/p&gt;
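
&lt;p&gt;As a rough sketch of the build-time half (the club list here is made up for illustration; the real data comes from the pipeline), a Next.js App Router page can export &lt;code&gt;generateStaticParams&lt;/code&gt; so every club page exists as static HTML at deploy:&lt;/p&gt;

```javascript
// Hypothetical sketch: in an App Router project this would live in
// something like app/film-clubs/[slug]/page.tsx. The club list is
// illustrative, not Clusterflick's actual data source.
const filmClubs = [
  { slug: "bar-trash", name: "Bar Trash" },
  { slug: "pitchblack-playback", name: "Pitchblack Playback" },
  { slug: "lost-reels", name: "Lost Reels" },
];

// Next.js calls this at build time, producing one static page per club.
function generateStaticParams() {
  return filmClubs.map(({ slug }) => ({ slug }));
}
```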

&lt;p&gt;The &lt;strong&gt;Near Me&lt;/strong&gt; logic is straightforward in principle: grab the user's coordinates from the browser's Geolocation API, load the cinema location data from the data pipeline, calculate distances, sort, render. The trickier part was deciding what "near" means when you're in London. After some trial and error, 2 miles turned out to be the sweet spot — enough to surface a decent set of options without stretching the definition of "nearby" too far.&lt;/p&gt;
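
&lt;p&gt;The distance step is small enough to sketch in full. Something like the following (names are illustrative, not the actual Clusterflick code) covers it, using the haversine formula:&lt;/p&gt;

```javascript
// Haversine distance plus the 2-mile "near" cutoff (illustrative sketch).
const EARTH_RADIUS_MILES = 3958.8;

function haversineMiles(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(h));
}

// Attach distances, keep anything within maxMiles, closest first.
function venuesNearby(userLocation, venues, maxMiles = 2) {
  return venues
    .map((venue) => ({ ...venue, distance: haversineMiles(userLocation, venue) }))
    .filter((venue) => maxMiles >= venue.distance)
    .sort((a, b) => a.distance - b.distance);
}
```

&lt;p&gt;In the browser, &lt;code&gt;navigator.geolocation.getCurrentPosition&lt;/code&gt; supplies the user's coordinates.&lt;/p&gt;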

&lt;p&gt;For the &lt;strong&gt;film club pages&lt;/strong&gt;, the main work was research and curation. I used Claude to help with the initial research pass — pulling together descriptions, verifying club details, and drafting copy — then reviewed and edited everything manually. The club-to-screening relationships come from the data pipeline, which already tags screenings with their organiser where that data is available. In the end I've added 22 clubs to the system, and over time I'll continue to add more.&lt;/p&gt;

&lt;p&gt;CI/CD runs via GitHub Actions. The data pipeline runs twice a day, and the site rebuilds automatically each time it finishes — so listings stay fresh without any manual intervention. I can also kick off a deployment manually when there are site updates to ship.&lt;/p&gt;
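
&lt;p&gt;The triggers are simple to express in a workflow file. A sketch (schedule times and names are invented, and this is not the real workflow; a &lt;code&gt;repository_dispatch&lt;/code&gt; event is the usual way to kick off a rebuild when a pipeline in another repository finishes):&lt;/p&gt;

```yaml
# Illustrative sketch of the site's rebuild triggers, not the actual workflow.
name: rebuild-site
on:
  schedule:
    - cron: "0 3,15 * * *"  # roughly matches a twice-daily pipeline
  workflow_dispatch: {}     # manual deploys when there are site updates
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build
```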

&lt;p&gt;This has been &lt;a href="https://github.com/orgs/clusterflick/projects/3/views/1" rel="noopener noreferrer"&gt;sitting in my GitHub issues&lt;/a&gt; for the last few months — five separate issues, all variations on the same ask: "what's nearby?" and "how do I find film clubs?". I kept kicking them down the road. This weekend challenge was the forcing function I needed to actually ship them. 🎉&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>weekendchallenge</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Getting the Data Model Right: Movie -&gt; Showings -&gt; Performances</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 25 Feb 2026 08:47:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/getting-the-data-model-right-movie-showings-performances-25pm</link>
      <guid>https://dev.to/alistairjcbrown/getting-the-data-model-right-movie-showings-performances-25pm</guid>
      <description>&lt;p&gt;When I started building cinema aggregation tooling — pulling listings from multiple independent cinemas — the first real decision was the data model. I've fought bad schemas before. So I sat with this one for a while before writing any code.&lt;/p&gt;

&lt;p&gt;The hierarchy I landed on is &lt;strong&gt;Movie → Showings → Performances&lt;/strong&gt;, and while it might sound over-engineered at first glance, every layer earns its place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just Movie → Performances?
&lt;/h2&gt;

&lt;p&gt;My first schema was essentially flat. A movie had a title, some overview metadata (directors, actors, duration), and an array of performances — times you could go and see it. Simple enough, and it worked fine when I was dealing with a single cinema's listings.&lt;/p&gt;

&lt;p&gt;But a cinema doesn't just &lt;em&gt;show a film&lt;/em&gt;. It shows &lt;strong&gt;variants&lt;/strong&gt; of a screening. Take &lt;a href="https://clusterflick.com/venues/hackney-picturehouse/" rel="noopener noreferrer"&gt;Hackney Picturehouse&lt;/a&gt;'s 40th anniversary run of &lt;em&gt;&lt;a href="https://letterboxd.com/film/labyrinth/" rel="noopener noreferrer"&gt;Labyrinth&lt;/a&gt;&lt;/em&gt;. They didn't just list it once with a bunch of times — they had regular showings, a "Kids' Club" baby-friendly screening, and a "Relaxed Screening" for folks needing additional support, including neurodivergent audiences and those living with dementia. These aren't just different times — they're fundamentally different experiences, each with their own listing page, their own description, and their own set of performance slots.&lt;/p&gt;

&lt;p&gt;That middle layer — the &lt;strong&gt;Showing&lt;/strong&gt; — captures this. A Showing represents one cinema's particular presentation of a movie. It carries the variant-specific context: the URL for that listing, any notes about what makes it different, and its own array of performances underneath. Hackney Picturehouse's &lt;em&gt;Labyrinth&lt;/em&gt; becomes three Showings, each with their own performances — rather than one flat list of times where you have to squint at freetext notes to figure out which screening is which.&lt;/p&gt;

&lt;h2&gt;
  
  
  The original schema
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/alistairjcbrown/hackney-cinema-calendar/blob/main/schema.json" rel="noopener noreferrer"&gt;The first version of my transform schema&lt;/a&gt; — the contract that each cinema's scraper had to produce — looked roughly like this: a flat array of objects, each with a &lt;code&gt;title&lt;/code&gt;, a &lt;code&gt;url&lt;/code&gt;, an &lt;code&gt;overview&lt;/code&gt; block of metadata, and an array of &lt;code&gt;performances&lt;/code&gt;. Each performance had a &lt;code&gt;time&lt;/code&gt;, optional &lt;code&gt;screen&lt;/code&gt;, freetext &lt;code&gt;notes&lt;/code&gt;, and a &lt;code&gt;bookingUrl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It got the job done for a single venue. But it was doing too much in too few layers. The "notes" field on each performance was carrying all the variant information as unstructured text. Categories lived in the overview, but there was no way to distinguish between a film, a live comedy night, and a quiz. Duration was required, which made sense &lt;a href="https://dev.to/alistairjcbrown/calendar-feeds-where-it-all-started-27o2"&gt;when we were only generating calendar events&lt;/a&gt;, but caused problems when the data was missing. And there was no hook for enriching the data with external sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The evolved schema introduces several things the original couldn't support cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;showingId&lt;/code&gt;&lt;/strong&gt; gives each showing a stable identity. This matters when you're deduplicating across sources or tracking what's changed between scrapes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;category&lt;/code&gt; enum&lt;/strong&gt; (&lt;code&gt;movie&lt;/code&gt;, &lt;code&gt;tv&lt;/code&gt;, &lt;code&gt;quiz&lt;/code&gt;, &lt;code&gt;comedy&lt;/code&gt;, &lt;code&gt;music&lt;/code&gt;, &lt;code&gt;talk&lt;/code&gt;, &lt;code&gt;workshop&lt;/code&gt;, &lt;code&gt;shorts&lt;/code&gt;, &lt;code&gt;event&lt;/code&gt;) acknowledges that modern independent cinemas are not just cinemas. They host all kinds of events, and your data model needs to represent that without shoehorning everything into a film-shaped hole. It also set the scene for going beyond cinemas to any venue that screens films and might have other interesting events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured accessibility data&lt;/strong&gt; at the performance level replaces freetext notes for things like audio description, baby-friendly screenings, hard-of-hearing support, relaxed sessions, and subtitles. This is crucial — accessibility isn't a property of the movie, or even the showing. It's a property of &lt;em&gt;that specific screening at that specific time&lt;/em&gt;. A Tuesday afternoon showing might be relaxed; the Saturday evening one isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;status&lt;/code&gt; object&lt;/strong&gt; on each performance captures things like whether it's sold out. Again, this is inherently performance-level data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External enrichment fields&lt;/strong&gt; — &lt;code&gt;themoviedb&lt;/code&gt; and &lt;code&gt;themoviedbs&lt;/code&gt; (plural) — provide the hook for hydrating listings with data from TMDB. The singular version covers standard films; the plural handles double bills or curated screening programmes where a single showing maps to multiple movies.&lt;/p&gt;

&lt;p&gt;And several small refinements: &lt;code&gt;duration&lt;/code&gt; is no longer required (because a quiz night doesn't have a runtime), &lt;code&gt;year&lt;/code&gt; was added to the overview, &lt;code&gt;classification&lt;/code&gt; replaced the awkwardly-named &lt;code&gt;age-restriction&lt;/code&gt;, and &lt;code&gt;additionalProperties: false&lt;/code&gt; was added throughout the schema to keep the data tight when validating.&lt;/p&gt;
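
&lt;p&gt;To make that concrete, here's a hand-written sketch of what a single showing might look like under the evolved schema (every value is invented; the authoritative definition lives in the repo):&lt;/p&gt;

```json
{
  "showingId": "hackney-picturehouse-labyrinth-relaxed",
  "category": "movie",
  "title": "Labyrinth (Relaxed Screening)",
  "url": "https://example.com/listings/labyrinth-relaxed",
  "overview": {
    "year": 1986,
    "classification": "U",
    "duration": 101
  },
  "themoviedb": 12345,
  "performances": [
    {
      "time": "2026-02-21T13:00:00Z",
      "accessibility": { "relaxedSession": true, "subtitled": false },
      "status": { "soldOut": false },
      "bookingUrl": "https://example.com/book/labyrinth-relaxed"
    }
  ]
}
```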

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr52o45anpk4k4u60qtv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr52o45anpk4k4u60qtv4.png" alt="Entity relationship style diagram of the final transform schema" width="800" height="964"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it gets interesting: combining venues
&lt;/h2&gt;

&lt;p&gt;The transform schema represents what comes out of a single venue's scrape. Each cinema produces its own array of showings. But the aggregation site needs to combine these into a unified view: one movie, with showings from multiple cinemas, each with their own performances.&lt;/p&gt;

&lt;p&gt;This is where the hierarchy really pays off. The Movie → Showings → Performances structure scales naturally from single-venue to multi-venue. You don't need to restructure anything — you just group showings under a shared movie identity.&lt;/p&gt;

&lt;p&gt;But combining also means deduplicating, and that's where things get nuanced. When the same movie appears at three different cinemas, you'll have overlapping metadata at different levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Director and cast info&lt;/strong&gt; might exist in the showing-level overview (scraped from the cinema's own listing) &lt;em&gt;and&lt;/em&gt; at the movie level (from TMDB). Which do you trust? Usually the external source is more reliable and complete, but not always — a cinema might list a special guest or a different cut.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility information&lt;/strong&gt; is firmly performance-level. No deduplication needed — it's inherently specific to that time slot at that venue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Categories and genres&lt;/strong&gt; can drift between sources. One cinema might tag something as "Drama", another as "Drama / Thriller", and TMDB might call it "Drama, Crime". You need a strategy for reconciling these.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deduplication isn't a single operation — it's a per-field decision about which source of truth wins at which level of the hierarchy. Having clean separation between movies, showings, and performances makes those decisions much more tractable than they'd be in a flat structure.&lt;/p&gt;
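
&lt;p&gt;A sketch of what per-field precedence looks like in practice (field and source names are illustrative, not the real pipeline code):&lt;/p&gt;

```javascript
// Illustrative per-field merge when the same movie appears at several venues.
// Each field decides which source of truth wins.
function mergeMovie(tmdbData, showings) {
  const fromShowings = (pick) => showings.flatMap((s) => pick(s.overview) ?? []);
  return {
    // Core metadata: prefer the external source, fall back to scraped data.
    directors: tmdbData.directors?.length
      ? tmdbData.directors
      : fromShowings((o) => o.directors),
    // Genres drift between sources, so union them instead of picking one.
    genres: [...new Set([...(tmdbData.genres ?? []), ...fromShowings((o) => o.genres)])],
    // Performance-level data (accessibility, sold-out status) is never
    // deduplicated: it stays attached to each showing.
    showings,
  };
}
```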

&lt;h2&gt;
  
  
  The payoff
&lt;/h2&gt;

&lt;p&gt;Spending time upfront on the data model meant that when complexity arrived — new venue types, accessibility requirements, external data enrichment, multi-venue aggregation — the schema absorbed it instead of fighting it. The hierarchy isn't clever for its own sake; it maps onto how cinemas actually programme their events, and that's what makes it hold up.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; &lt;del&gt;Site Performance: Loading 30,000+ Showings in a Browser&lt;/del&gt;&lt;br&gt;
Change in the schedule: &lt;a href="https://dev.to/alistairjcbrown/a-brief-detour-two-writing-challenges-and-what-came-out-of-them-4h8h"&gt;A Brief Detour: Two Writing Challenges and What Came Out of Them&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>json</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Scaling From 3 Cinemas to 240+ Venues: What Broke and What Evolved</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 18 Feb 2026 08:47:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk</link>
      <guid>https://dev.to/alistairjcbrown/scaling-from-3-cinemas-to-240-venues-what-broke-and-what-evolved-2jkk</guid>
      <description>&lt;p&gt;When I started scraping London cinema listings, I had three venues and a simple script. Fetch a page, parse it, done. Fast forward to today: 240+ venues, half a dozen different platform types, and a pipeline that runs daily across both GitHub's cloud runners and a cluster of 6 Raspberry Pis in my living room.&lt;/p&gt;

&lt;p&gt;Here's what I learned about building extraction systems that scale, and the architectural decisions that emerged from necessity rather than planning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Retrieve/Transform Split: How Purity Became Practical
&lt;/h2&gt;

&lt;p&gt;Early on, I had a simple mental model: &lt;code&gt;retrieve&lt;/code&gt; grabs the main page, &lt;code&gt;transform&lt;/code&gt; figures out what to do with it. If transform needed more data, it just... made more requests. Simple enough, right?&lt;/p&gt;

&lt;p&gt;Wrong 😅&lt;/p&gt;

&lt;p&gt;This made transform &lt;em&gt;impure&lt;/em&gt;. It was making network calls, which created a cascading set of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging was a nightmare&lt;/strong&gt; - request code wasn't all in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching became complicated&lt;/strong&gt; - you now have to cache in two different jobs. If you clear the cache of one job, what impact will that have on the other job?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing was fragile&lt;/strong&gt; - you couldn't test transform logic without network access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution wasn't about network topology or runner management. It was about simplicity and separation of concerns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The new contract is simple:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;retrieve&lt;/code&gt; does &lt;em&gt;all&lt;/em&gt; the fetching - even if it needs to parse HTML to find links to follow&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transform&lt;/code&gt; makes &lt;em&gt;zero&lt;/em&gt; network calls - it takes inputs and produces data that adheres to the schema; that's the guarantee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each function has a single responsibility. Retrieve handles the messy, stateful, network-dependent work. Transform does the pure, testable, repeatable work.&lt;/p&gt;

&lt;p&gt;In practice, this means retrieve might fetch a main page, parse it for film listing URLs, fetch all of those, and hand everything to transform as a bundle. Transform just processes what it's given.&lt;/p&gt;
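
&lt;p&gt;The shape of that contract, stripped right down (function names and data shapes are made up for illustration; the real retrieve parses HTML rather than tidy JSON):&lt;/p&gt;

```javascript
// Sketch of the retrieve/transform split, not the actual pipeline code.

// retrieve: all network access lives here, including following
// discovered links. The result is a plain data bundle.
async function retrieve(fetchJson) {
  const mainPage = await fetchJson("/whats-on.json");
  const listingPages = await Promise.all(
    mainPage.filmUrls.map((url) => fetchJson(url)),
  );
  return { mainPage, listingPages };
}

// transform: pure. Same bundle in, same showings out. No network calls,
// so it can be re-run against an old release bundle at any time.
function transform(bundle) {
  return bundle.listingPages.map((listing) => ({
    title: listing.title,
    performances: listing.times.map((time) => ({ time })),
  }));
}
```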

&lt;p&gt;This matters for more than just clean code. Once all retrieves complete, the pipeline creates a GitHub release with an immutable blob of all the raw data. Then transform jobs run against that release. If I change downstream code later, I can re-run transforms on old data without hitting anyone's servers again. That only works if transforms are pure functions.&lt;/p&gt;

&lt;p&gt;The retrieve workflow lives in one repository, transform in another. Each creates releases named by timestamp. Clean separation all the way down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Variety of Retrieval Strategies
&lt;/h2&gt;

&lt;p&gt;With 240 venues, you see every possible variation of how a cinema might publish its data. Here's what emerged:&lt;/p&gt;

&lt;h3&gt;
  
  
  Single Page: The Dream
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/prince-charles-cinema/" rel="noopener noreferrer"&gt;Prince Charles Cinema&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One big page with everything you need. Parse it once, you're done. These are vanishingly rare and I treasure them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Main Page + Listing Pages: The Common Pattern
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/the-castle-cinema/" rel="noopener noreferrer"&gt;The Castle Cinema&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is by far the most common pattern. You fetch the main "what's on" page to discover what films are showing, then fetch each film's individual listing page for the rich data you need for proper matching - full synopsis, runtime, cast, directors.&lt;/p&gt;

&lt;p&gt;It's two-stage, but predictable. Retrieve handles both stages, transform gets a complete dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON/API Endpoints: The Developer's Joy
&lt;/h3&gt;

&lt;p&gt;When a cinema exposes a proper API, everything gets easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal JSON:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/cineworld-leicester-square/" rel="noopener noreferrer"&gt;Cineworld&lt;/a&gt; has straightforward endpoints. Hit them, parse the response, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Big Standard (OCAPI):&lt;/strong&gt; This is where it gets interesting. Open Commerce API (OCAPI) is a standardised ticketing platform API used by both &lt;a href="https://clusterflick.com/venues/curzon-mayfair/" rel="noopener noreferrer"&gt;Curzon&lt;/a&gt; and &lt;a href="https://clusterflick.com/venues/odeon-luxe-leicester-square/" rel="noopener noreferrer"&gt;ODEON&lt;/a&gt;. One unified codebase handles two of the biggest cinema chains in London. When you discover a new cinema runs on OCAPI, it's trivial to add - just point the existing module at their endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weird JSON:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/metro-cinema/" rel="noopener noreferrer"&gt;Metro Cinema&lt;/a&gt; technically has a JSON API, but it requires signed requests using an API key hard-coded into the front-end. There's a bunch of hoop-jumping involved. Still better than parsing HTML, but barely.&lt;/p&gt;

&lt;h3&gt;
  
  
  GraphQL: Same Benefits, Different Query Language
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/act-one-cinema/" rel="noopener noreferrer"&gt;ActOne Cinema&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like JSON endpoints, but with GraphQL queries. You get structured data without HTML wrangling. The learning curve is steeper than REST, but the payoff is the same - no HTML parsing.&lt;/p&gt;
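&lt;p&gt;As a rough sketch - the endpoint and field names below are invented, not ActOne's real schema - the GraphQL flavour boils down to POSTing a query and reading structured data back:&lt;/p&gt;

```javascript
// A hedged sketch of the GraphQL variant. The query fields are illustrative;
// `doFetch` is injectable so the sketch can run without a network.
async function fetchShowtimes(endpoint, doFetch = fetch) {
  const query = `
    query {
      screenings {
        title
        startsAt
        runtimeMinutes
      }
    }`;
  const response = await doFetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { data } = await response.json();
  return data.screenings;
}
```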

&lt;h3&gt;
  
  
  The HTML Parsing Toolkit: Cheerio, Playwright, and date-fns
&lt;/h3&gt;

&lt;p&gt;When there's no API and you're parsing HTML, you need the right tools for the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;Cheerio&lt;/strong&gt;&lt;/a&gt; - For sites that let you just fetch their HTML. Cheerio is like jQuery but without an actual DOM. You can do CSS selectors and extraction without spinning up a browser. Fast and lightweight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Playwright&lt;/strong&gt;&lt;/a&gt; - For sites that won't let you just fetch HTML. Maybe they have bot detection, maybe they're heavily client-side rendered, maybe they need requests from residential IPs (hello, cluster of 6 Pis). You need a real browser to make it work.&lt;/p&gt;

&lt;p&gt;The BFI is the worst offender for needing this. Both &lt;a href="https://clusterflick.com/venues/bfi-southbank/" rel="noopener noreferrer"&gt;BFI Southbank&lt;/a&gt; and &lt;a href="https://clusterflick.com/venues/bfi-imax/" rel="noopener noreferrer"&gt;BFI IMAX&lt;/a&gt; run on the same slow, inconsistent site. Pages load in pieces asynchronously and often time out. It's the longest-running retrieve in the entire pipeline. There's no API. It's just a slog 😭&lt;/p&gt;

&lt;p&gt;&lt;a href="https://date-fns.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;date-fns&lt;/strong&gt;&lt;/a&gt; - Once you've extracted the data, you still have to parse it. Cinema websites output dates and times in wildly different formats. &lt;code&gt;date-fns&lt;/code&gt; handles converting these strings into date objects so we can generate the timestamps the schema requires. Anyone who's worked with dates knows how much of a headache they can be without a good library!&lt;/p&gt;

&lt;h3&gt;
  
  
  Complex Multi-Page: When Listings and Booking Are Separate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/science-museum/" rel="noopener noreferrer"&gt;Science Museum&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where it gets properly complicated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve "products" from their JSON API&lt;/li&gt;
&lt;li&gt;Filter for movies (because they sell all kinds of products)&lt;/li&gt;
&lt;li&gt;Now we've got the titles - but nothing else; there's no link to detail pages in this data&lt;/li&gt;
&lt;li&gt;Use their HTML search page to search for each title and scrape the first match (this only works because the Science Museum doesn't show many films and they have distinct titles)&lt;/li&gt;
&lt;li&gt;Fetch the listing page HTML for each match to get full movie details&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's a multi-stage dance between JSON and HTML, search and direct fetch, just to get a complete dataset. And Retrieve handles all of this. Transform just processes the final bundle.&lt;/p&gt;
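&lt;p&gt;As a sketch of that dance - every helper below is a stand-in for the real step, injected so the flow itself is visible:&lt;/p&gt;

```javascript
// Illustrative sketch of the Science Museum multi-stage retrieve.
// fetchProducts, isMovie, searchByTitle and fetchListingPage are all
// hypothetical stand-ins for the real implementations.
async function retrieveScienceMuseum({ fetchProducts, isMovie, searchByTitle, fetchListingPage }) {
  // Steps 1-2: the JSON API gives us "products"; keep only the movies
  const titles = (await fetchProducts()).filter(isMovie).map((p) => p.title);

  // Steps 3-5: no detail links in the JSON, so search the HTML site per
  // title, take the first match, then fetch its listing page for details
  const films = [];
  for (const title of titles) {
    const match = await searchByTitle(title);
    if (match) films.push(await fetchListingPage(match.url));
  }
  return films;
}
```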

&lt;h2&gt;
  
  
  Shared Cinema Platforms: When Adding Venues Becomes Trivial
&lt;/h2&gt;

&lt;p&gt;The absolute best moment in maintaining this pipeline is discovering a new cinema runs on a platform you already support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OCAPI&lt;/strong&gt; powers ODEON and Curzon. One codebase, two major chains, dozens of screens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savoy&lt;/strong&gt; is the big one for independent cinemas - when you find a new independent cinema and realize it's running Savoy's platform, you just configure a new venue to point at it. No new extraction code needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indy Cinema Group&lt;/strong&gt; and &lt;strong&gt;AdmitOne&lt;/strong&gt; both power multiple cinemas in the dataset. Same pattern - write the platform integration once, point it at new venues as you discover them.&lt;/p&gt;

&lt;p&gt;When a cinema migrates between platforms you already know, updating is a trivial config change. This is what makes scaling from a few venues to 200+ feasible - you're not writing 200 different scrapers, you're pointing a dozen implementations at different configurations.&lt;/p&gt;
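&lt;p&gt;A toy version of that configuration-over-code idea - the platform and venue details below are illustrative, not the real module layout:&lt;/p&gt;

```javascript
// Hypothetical sketch: retrieve logic written once per platform, with each
// venue reduced to a config entry pointing at a platform implementation.
const platforms = {
  ocapi: (cfg) => `OCAPI retrieve for ${cfg.siteId}`,
  savoy: (cfg) => `Savoy retrieve for ${cfg.venueSlug}`,
};

const venues = [
  { name: "ODEON Luxe Leicester Square", platform: "ocapi", config: { siteId: "odeon" } },
  { name: "Curzon Mayfair", platform: "ocapi", config: { siteId: "curzon" } },
];

// Adding a venue on a known platform means adding config, not code
function buildRetrievers(venueList) {
  return venueList.map((venue) => ({
    name: venue.name,
    run: () => platforms[venue.platform](venue.config),
  }));
}
```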

&lt;h2&gt;
  
  
  Event Platforms: When Venues Don't Have Their Own Sites
&lt;/h2&gt;

&lt;p&gt;Not every screening venue maintains its own website with listings. Some only publish events on platforms like Eventbrite, Dice, or OutSavvy (in the codebase we call them "sources").&lt;/p&gt;

&lt;p&gt;Here's how the pipeline handles this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once per retrieve run&lt;/strong&gt;, pull all London film-specific events from each source. How we get those varies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some sources let you filter directly on "Films"&lt;/li&gt;
&lt;li&gt;For others we search "Films" and "Theatre" (to catch theatre-on-film like NT Live)&lt;/li&gt;
&lt;li&gt;Some require keyword searches and a bit of post-processing once we have the data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the source, we now have a bunch of events for lots of different venues, some of which may not even be in London. This is where the setup for sources differs - sources don't transform, they "find". Using the venue attributes - name, address, coordinates, alternative names - they find matching events that the venue's transform function can then incorporate when outputting the final list of venue events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each source is responsible for matching&lt;/strong&gt; based on what data it has. Most compare against the venue name (and list of alternative names like "The Ritzy" vs "Ritzy Picturehouse") plus either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coordinate match within 100m, or&lt;/li&gt;
&lt;li&gt;Postcode match (some listings have wrong coordinates but correct addresses)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Name matching is fuzzy - basic normalization before comparing. I've never seen false positives because the matching is pretty specific, so we're more likely to miss events than mismatch them. There are analysis scripts for each source showing which events matched and which didn't, so we can manually review for missing events.&lt;/p&gt;
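&lt;p&gt;A simplified sketch of that matching logic - the normalisation here is deliberately basic, and the real implementation differs per source depending on what data it has:&lt;/p&gt;

```javascript
// Illustrative matching sketch: fuzzy-ish name comparison plus either a
// coordinate match within 100m or a postcode match.
const normalise = (name) =>
  name.toLowerCase().replace(/\bthe\b/g, "").replace(/[^a-z0-9]/g, "");

// Haversine distance in metres between two { lat, lng } points
function distanceMetres(a, b) {
  const R = 6371000; // Earth radius in metres
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

function eventMatchesVenue(event, venue) {
  const names = [venue.name, ...(venue.alternativeNames || [])].map(normalise);
  if (!names.includes(normalise(event.venueName))) return false;
  // Coordinate match within 100m...
  if (event.coordinates && distanceMetres(event.coordinates, venue.coordinates) <= 100) {
    return true;
  }
  // ...or fall back to postcode (some listings have wrong coordinates)
  return event.postcode === venue.postcode;
}
```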

&lt;p&gt;&lt;strong&gt;Event-source-only venues&lt;/strong&gt; don't have a website to retrieve from at all - their transform just returns whatever the sources found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;a href="https://clusterflick.com/venues/bfi-stephen-street/" rel="noopener noreferrer"&gt;BFI Stephen Street&lt;/a&gt; - a private hire screen that only appears on event platforms when someone books it for a public screening.&lt;/p&gt;

&lt;p&gt;The beauty of this pattern: when a new venue shows up on Eventbrite, adding it is minimal effort. The event data is already being pulled daily. You just register the venue metadata and let the matching happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like In Practice
&lt;/h2&gt;

&lt;p&gt;Here's the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve jobs run&lt;/strong&gt; - some on GitHub's cloud runners, some on the local cluster for sites that need residential IPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data gets aggregated&lt;/strong&gt; into &lt;a href="https://github.com/clusterflick/data-retrieved/releases/latest" rel="noopener noreferrer"&gt;a GitHub release in the retrieve repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform jobs pull that release&lt;/strong&gt; and run on GitHub's cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each transform&lt;/strong&gt; is pure - it processes the data it's given, optionally merging in matched events from the event sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt; is data conforming to a standardized schema, regardless of whether the source was a single HTML page, a GraphQL API, or an Eventbrite search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final transformed data&lt;/strong&gt; gets published as &lt;a href="https://github.com/clusterflick/data-transformed/releases/latest" rel="noopener noreferrer"&gt;a release in the transform repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
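&lt;p&gt;To make "standardized schema" concrete, here's an illustrative output shape - the field names below are invented for this post; the actual schema lives in the Clusterflick repos:&lt;/p&gt;

```javascript
// Hypothetical example of the standardized transform output. Whatever the
// source - an HTML page, a GraphQL API, an Eventbrite search - transform
// emits one shape, so everything downstream has a single format to handle.
const exampleTransformOutput = {
  venue: "the-castle-cinema",
  movies: [
    {
      title: "Paddington 2",
      url: "https://example.test/listing/paddington-2",
      performances: [
        { time: 1767225600000, bookingUrl: "https://example.test/book/123" },
      ],
    },
  ],
};
```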

&lt;p&gt;The system isn't elegant because I designed it to be. It's elegant because each constraint - rate limits, IP restrictions, venue variety, platform diversity - forced a clean separation of concerns.&lt;/p&gt;

&lt;p&gt;And somehow, it all runs daily, for 240+ venues, without falling over* 🍿&lt;/p&gt;

&lt;p&gt;* it sometimes falls over&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; &lt;a href="https://dev.to/alistairjcbrown/getting-the-data-model-right-movie-showings-performances-25pm"&gt;Getting the Data Model Right: Movie -&amp;gt; Showings -&amp;gt; Performances&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>architecture</category>
      <category>automation</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>Calendar Feeds: Where It All Started</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Wed, 11 Feb 2026 08:34:00 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/calendar-feeds-where-it-all-started-27o2</link>
      <guid>https://dev.to/alistairjcbrown/calendar-feeds-where-it-all-started-27o2</guid>
      <description>&lt;p&gt;When I lived in Belfast, I had one problem: I wanted to know what was showing at &lt;a href="https://strandartscentre.com/" rel="noopener noreferrer"&gt;the Strand Cinema&lt;/a&gt; without having to remember to check their website. I wanted to look at next Friday in my calendar and see if there was anything worth going to.&lt;/p&gt;

&lt;p&gt;So I built a scraper. Pull the listings, transform them into something structured, generate an ICS file. Done.&lt;/p&gt;

&lt;p&gt;That was June 2023. That workflow—retrieve, transform, output—is still the foundation of everything Clusterflick does today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Looks Like Now
&lt;/h2&gt;

&lt;p&gt;I currently have 14 cinema calendar feeds in my Google Calendar, for those venues I go to most often. When I want to see what's on, I toggle a few of them on—maybe the &lt;a href="https://clusterflick.com/venues/bfi-southbank/" rel="noopener noreferrer"&gt;BFI&lt;/a&gt;, &lt;a href="https://clusterflick.com/venues/the-castle-cinema/" rel="noopener noreferrer"&gt;The Castle Cinema&lt;/a&gt;, &lt;a href="https://clusterflick.com/venues/genesis-cinema/" rel="noopener noreferrer"&gt;Genesis Cinema&lt;/a&gt;, and &lt;a href="https://clusterflick.com/venues/hackney-picturehouse/" rel="noopener noreferrer"&gt;Hackney Picturehouse&lt;/a&gt; if I'm planning for the weekend. When I book tickets, I just copy that event over to my personal calendar.&lt;/p&gt;

&lt;p&gt;Adding a feed is as simple as pasting a URL into Google Calendar. If you want to try it yourself, &lt;a href="https://github.com/clusterflick/data-calendar/" rel="noopener noreferrer"&gt;the 📅 &lt;code&gt;data-calendar&lt;/code&gt; repo has instructions&lt;/a&gt; and feed URLs for every venue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; It's now even easier 🎉 - handy calendar links are included on venue pages in Clusterflick. You can add to Google Calendar, Outlook, or any calendar app that supports &lt;a href="https://en.wikipedia.org/wiki/Webcal" rel="noopener noreferrer"&gt;Webcal&lt;/a&gt; with one click!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5vyymsmlf5yvixpmho1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5vyymsmlf5yvixpmho1.png" alt="Screenshot of the Prince Charles Cinema venue page on Clusterflick, showing the logo, name, socials and newly added calendar buttons" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👆 Calendar buttons, now at the top of &lt;em&gt;every&lt;/em&gt; venue page. Super easy to get your favourite (&lt;a href="https://clusterflick.com/venues/prince-charles-cinema/" rel="noopener noreferrer"&gt;Prince Charles Cinema&lt;/a&gt;?) schedule right in your calendar 🎬&lt;/p&gt;

&lt;h2&gt;
  
  
  Rich Events, Not Just "7pm — Cinema"
&lt;/h2&gt;

&lt;p&gt;Each calendar event includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The venue name and location (so Google Maps knows where you're going)&lt;/li&gt;
&lt;li&gt;A link back to the original listing page&lt;/li&gt;
&lt;li&gt;The movie title as the cinema lists it&lt;/li&gt;
&lt;li&gt;Whatever metadata we managed to extract: directors, actors, a plot summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below all of that, we include our match with The Movie Database: so you also have the canonical title, the year, an overview, and a link back to TMDB if you want to look up more—but the event title itself stays as the cinema's original listing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcds529b0drhy7fq88kpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcds529b0drhy7fq88kpi.png" alt="Screenshot of Google calendar showing the Prince Charles Cinema schedule for next week, which was generated as part of the Clusterflick data pipeline" width="800" height="392"&gt;&lt;/a&gt;&lt;br&gt;
👆 &lt;em&gt;Prince Charles Cinema schedule for next week&lt;/em&gt; 📆&lt;/p&gt;

&lt;p&gt;This is different from the website, where everything gets unified under one canonical movie title. Calendar feeds are venue-specific—they're mirroring what's on that cinema's website, so using their original title makes sense. If the Prince Charles Cinema is showing "Troll 2 (aka Best Worst Movie)" and we've matched it to &lt;em&gt;Troll 2&lt;/em&gt; in TMDB, that's fine. The feed is telling you what's on at that venue, not trying to reconcile it with every other cinema's listing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Duration Problem
&lt;/h2&gt;

&lt;p&gt;Here's the annoying thing about cinema listings: they tell you when the film starts, but rarely how long it is. And if you're putting something in a calendar, you need an end time.&lt;/p&gt;

&lt;p&gt;Early on, I just defaulted everything to 90 minutes. Now there are better fallbacks: if the listing happens to include a runtime, we use it, and since we match more than 96% of films against TMDB, we can usually pull the actual runtime from there. So if it's a 2h20m film, you get a 2h20m calendar event.&lt;/p&gt;

&lt;p&gt;It's not perfect—it doesn't account for the 20 minutes of trailers most cinemas front-load. But it's close enough to be useful. A two-hour film showing up as a two-hour block in your calendar is good enough for planning your evening.&lt;/p&gt;
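&lt;p&gt;The fallback chain is small enough to show in full - a sketch with hypothetical function names, not the real calendar code:&lt;/p&gt;

```javascript
// Illustrative runtime fallback: listing runtime if present, otherwise the
// TMDB match's runtime, otherwise a 90-minute default.
const DEFAULT_RUNTIME_MINUTES = 90;

function eventDurationMinutes(listingRuntime, tmdbRuntime) {
  return listingRuntime ?? tmdbRuntime ?? DEFAULT_RUNTIME_MINUTES;
}

// Calendar events need an end time, so the duration gets added to the start
function eventEndMs(startMs, listingRuntime, tmdbRuntime) {
  return startMs + eventDurationMinutes(listingRuntime, tmdbRuntime) * 60 * 1000;
}
```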

&lt;h2&gt;
  
  
  It Branches Early
&lt;/h2&gt;

&lt;p&gt;One of the nice architectural wins here: calendar feeds come straight off the transform step. They don't need the combining logic, the caching layer, or the TMDB enrichment that the website requires.&lt;/p&gt;

&lt;p&gt;The website has to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Combine showings from multiple venues into canonical movies&lt;/li&gt;
&lt;li&gt;Cache TMDB lookups to avoid rate limits&lt;/li&gt;
&lt;li&gt;Fetch rich metadata (full cast, crew, posters, trailers)&lt;/li&gt;
&lt;li&gt;Generate static pages for every film&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The calendar feeds skip all of that. They're just: &lt;em&gt;here's what this venue says is showing, in a format your calendar app understands&lt;/em&gt;. We branch off right after transform and generate the ICS file. Simple.&lt;/p&gt;
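&lt;p&gt;The branch itself is little more than string building. A simplified sketch of turning one showing into an ICS event - the field choices mirror what's described above, but the helpers are stripped down compared to the real generator:&lt;/p&gt;

```javascript
// Format a millisecond timestamp as an ICS UTC date: YYYYMMDDTHHMMSSZ
function toIcsDate(ms) {
  return new Date(ms).toISOString().replace(/[-:]/g, "").replace(/\.\d{3}/, "");
}

// Build one VEVENT block from a showing (simplified: real ICS output also
// needs escaping, line folding, UID/DTSTAMP fields, and a VCALENDAR wrapper)
function toVEvent({ title, startMs, endMs, venueName, listingUrl, description }) {
  return [
    "BEGIN:VEVENT",
    `DTSTART:${toIcsDate(startMs)}`,
    `DTEND:${toIcsDate(endMs)}`,
    `SUMMARY:${title}`,
    `LOCATION:${venueName}`,
    `URL:${listingUrl}`,
    `DESCRIPTION:${description}`,
    "END:VEVENT",
  ].join("\r\n");
}
```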

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;This is still the simplest, most personally useful output of the whole project. Everything else—the website, the movie matching, the LLM-assisted disambiguation—grew from this.&lt;/p&gt;

&lt;p&gt;I just wanted to see what was on at the cinema without having to check their website. Two years later, I still use these feeds every week. The rest of Clusterflick exists because this one thing was useful enough to keep building on.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next post:&lt;/strong&gt; Scaling From 3 Cinemas to 240 Venues: What Broke and What Evolved&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Building Clusterflick: A London Cinema Aggregator</title>
      <dc:creator>Alistair</dc:creator>
      <pubDate>Fri, 06 Feb 2026 18:20:48 +0000</pubDate>
      <link>https://dev.to/alistairjcbrown/building-clusterflick-a-london-cinema-aggregator-kk3</link>
      <guid>https://dev.to/alistairjcbrown/building-clusterflick-a-london-cinema-aggregator-kk3</guid>
      <description>&lt;p&gt;I've been working on a personal project called &lt;a href="https://clusterflick.com" rel="noopener noreferrer"&gt;Clusterflick&lt;/a&gt; — a single source for every movie showing across London. Right now it's tracking &lt;strong&gt;240 venues&lt;/strong&gt; across &lt;strong&gt;5 event platforms&lt;/strong&gt;, currently pulling in &lt;strong&gt;1,398 events&lt;/strong&gt; and over &lt;strong&gt;30,000 showings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It started simply enough: I just wanted cinema times on my calendar. But it quickly spiralled into a full data pipeline running on GitHub Actions, a statically generated Next.js site, and a cluster of Raspberry Pis in my living room.&lt;/p&gt;

&lt;p&gt;Some of the most interesting challenges so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Movie matching is deceptively hard.&lt;/strong&gt; You'd think title + year would uniquely identify a film. It doesn't. Neither does title + director. Sometimes cinema listings don't even give you enough to identify a movie as a human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping at scale without a budget.&lt;/strong&gt; GitHub runner IPs get blocked, so now there's a Raspberry Pi cluster handling the tricky ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using LLMs for data quality.&lt;/strong&gt; When fuzzy matching falls short, LLMs have been surprisingly useful for resolving ambiguous movie lookups against The Movie DB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keeping it cheap.&lt;/strong&gt; The whole thing runs on near-zero infrastructure costs — GitHub Actions for orchestration, Releases as storage, static site generation to avoid hosting costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole project is open source on &lt;a href="https://github.com/clusterflick/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If any of this sounds interesting, I'd love to hear from others working on similar scraping/aggregation/data pipeline projects.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
