DEV Community

Alistair

Posted on

GitHub as Infrastructure

Clusterflick has always been a personal project, which means keeping costs down has always been a goal. I already had a GitHub account for the code, so the question was how far I could push that. The answer turned out to be: further than I expected.

This post is about using GitHub not just as a place to store code, but as the actual infrastructure the project runs on. Some of it is straightforward. Some of it is a bit unconventional. All of it comes back to the same goal: keep it cheap, keep it open.

Actions as the Pipeline Engine

The core of Clusterflick is a data pipeline: retrieve cinema listings, transform them, enrich them, combine them, generate outputs. That pipeline runs on GitHub Actions.

Every morning, a workflow kicks off, spinning up dozens of jobs to retrieve raw cinema data. As those finish, downstream workflows spin up more jobs — transforming, enriching, combining, and finally generating the website. Each step is a separate workflow in a separate repo, with outputs feeding into the next. The project readme lays out the full flow if you want the detail.
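A minimal sketch of what that morning kick-off might look like — the workflow name, cron time, and venue list are all illustrative, not the actual config:

```yaml
# Illustrative sketch of the daily retrieval workflow
name: retrieve-listings
on:
  schedule:
    - cron: "0 6 * * *" # every morning
jobs:
  retrieve:
    strategy:
      matrix:
        venue: [venue-a, venue-b] # dozens of venues in practice
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run retrieve -- ${{ matrix.venue }}
```

A matrix job is a natural fit here: one workflow file fans out into a job per venue, and a failure in one venue doesn't block the others.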

Flow diagram of the Clusterflick data pipeline

One thing that makes this work in practice is GitHub's secrets management. API keys for TMDB, the LLM provider, and the tokens needed for cross-repo dispatch events are all stored as secrets and injected into workflows at runtime — none of it sitting in the codebase.
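The injection itself is just step-level environment variables — something like this, where the secret name is illustrative:

```yaml
# Secrets are injected at runtime, never committed to the codebase
steps:
  - name: Enrich with TMDB
    run: npm run enrich
    env:
      TMDB_API_KEY: ${{ secrets.TMDB_API_KEY }}
```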

The free tier for public repos covers all of this comfortably. The open source decision and the infrastructure decision are linked — without public repos, the free Actions minutes disappear.

Releases as a Database

This is the part people tend to raise an eyebrow at 😁

Instead of a database, or an S3 bucket, or any paid storage layer, the output of all the scrapes is a bunch of JSON files uploaded as assets of a GitHub Release on a public repo. A release is just a named snapshot and you can attach arbitrary files to it. The latest run uses the "latest" tag.
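One way a job might publish its output, sketched with the gh CLI (the file paths and release title are assumptions; replacing the "latest" release by deleting and recreating it is one approach, not necessarily the project's exact mechanics):

```yaml
  - name: Publish outputs as a release
    run: |
      gh release delete latest --yes || true
      gh release create latest output/*.json --title "Latest retrieval"
    env:
      GH_TOKEN: ${{ github.token }}
```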

Downstream workflows then download the latest release, and the pipeline continues. Each stage produces its own release in its own repo, which makes getting "the latest data" trivial — there's no pattern matching on release names, no querying a database. Each repo has one job: produce a release. Grab the latest release from that repo to get the data.
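The "grab the latest" step maps cleanly onto the GitHub REST API's `GET /repos/{owner}/{repo}/releases/latest` endpoint, whose response lists each asset with a download URL. A small sketch of that resolution — the asset name in the sample payload is hypothetical:

```javascript
// Build the API URL that always resolves to a stage repo's newest release
function latestReleaseUrl(owner, repo) {
  return `https://api.github.com/repos/${owner}/${repo}/releases/latest`;
}

// Given a release payload, map asset names to their download URLs
function assetUrls(release) {
  const urls = {};
  for (const asset of release.assets ?? []) {
    urls[asset.name] = asset.browser_download_url;
  }
  return urls;
}

// Trimmed example payload in the shape the API returns
const sample = {
  tag_name: "latest",
  assets: [
    {
      name: "odeon.json", // hypothetical asset name
      browser_download_url:
        "https://github.com/clusterflick/data-retrieved/releases/download/latest/odeon.json",
    },
  ],
};

console.log(assetUrls(sample));
```

Because every stage repo exposes exactly one "latest" release, this is the whole discovery mechanism — no listing, filtering, or name matching.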

It's not a pattern you'd reach for if you were building something traditional, but for a project where the data is public anyway it works well — the data is versioned, publicly accessible, queryable via the GitHub API, and costs nothing. There's also a nice side effect: every daily run produces an immutable, timestamped snapshot of cinema data. If you ever wanted to analyse how London's cinema landscape changes over time, that archive is just sitting there.

GitHub releases page showing a recent data retrieval

A Multi-Repo Architecture With Cross-Repo Triggers

The pipeline spans multiple repos — data-retrieved, data-transformed, data-cached, data-combined, and others — with workflows in one repo triggering workflows in the next via repository_dispatch events.
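The two halves of a cross-repo trigger might look like this — the event name is illustrative, and DISPATCH_TOKEN stands in for a token with access to the target repo:

```yaml
# Upstream repo: a final step fires the cross-repo event
steps:
  - name: Trigger next stage
    run: gh api repos/clusterflick/data-transformed/dispatches -f event_type=data-retrieved
    env:
      GH_TOKEN: ${{ secrets.DISPATCH_TOKEN }}
---
# Downstream repo: the workflow wakes on that event type
on:
  repository_dispatch:
    types: [data-retrieved]
```

The default `GITHUB_TOKEN` is scoped to the repo the workflow runs in, which is why cross-repo dispatches need a separately provisioned token stored as a secret.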

The reason for the split is the Releases-as-storage pattern above. Each repo holds one stage's data as its latest release, and downstream repos pull from it. That boundary — one repo, one release, one stage — is what makes the "grab the latest" approach work cleanly. Some of these repos contain little more than a workflow file, a package.json, and a readme.

The trade-off is that tracing a failure across the chain means navigating between repos. The practical fix is a readme with status badges for each workflow — a glance tells you which stage broke, rather than having to click through each repo's Actions tab to find out.
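Each badge is just an image served by GitHub per workflow file — the repo and workflow filename here are illustrative:

```markdown
![data-retrieved](https://github.com/clusterflick/data-retrieved/actions/workflows/retrieve.yml/badge.svg)
```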

GitHub status badges showing the status of each job

Self-Hosted Runners: When Cloud Runners Get Blocked

Some cinema venues block requests from GitHub's cloud runner IP ranges — they're well-known and easy to identify as automated traffic. To handle those, I run a cluster of Raspberry Pi 4s at home as self-hosted runners. They use a residential IP address, so requests look like regular browser traffic.

The previous post in this series covers the hardware side of that setup. From GitHub's perspective: register each Pi as a runner in the org, add it to a runner group, target jobs at that group with runs-on: self-hosted. The combination of cloud runners for most venues and self-hosted for the tricky ones means the pipeline rarely hits a wall.
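Routing a job to the Pis is a one-line change in the workflow — the job name, extra label, and venue argument here are illustrative:

```yaml
jobs:
  retrieve-blocked-venue:
    runs-on: [self-hosted, ARM64] # lands on a Raspberry Pi runner, not a cloud VM
    steps:
      - uses: actions/checkout@v4
      - run: npm run retrieve -- venue-that-blocks-cloud-ips
```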

Pages for the Site

The site itself is a statically generated Next.js app, built in CI and deployed to GitHub Pages via actions/deploy-pages. No server to manage, no hosting bill.
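A sketch of what the deploy job looks like with the official Pages actions — the build command and output directory are assumptions about the Next.js setup:

```yaml
permissions:
  pages: write    # required by deploy-pages
  id-token: write # required for OIDC-based deployment
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build # assumes a static export into ./out
      - uses: actions/upload-pages-artifact@v3
        with:
          path: ./out
      - id: deployment
        uses: actions/deploy-pages@v4
```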

Pages works well, but the default caching headers are conservative — not ideal when you're serving a lot of static assets. I've got Cloudflare sitting in front to handle that properly. The site performance post goes into that in more detail.

Screenshot of the Clusterflick website homepage

Actions Beyond the Pipeline

The pipeline is the obvious use of Actions, but it's not the only one. A few workflows that exist purely for maintenance or operational visibility:

Weekly dependency cache cleanup. GitHub-hosted runners cache node_modules and Playwright browser installs between runs, but those caches can get stale or bloated. A scheduled weekly workflow clears and rebuilds them, which keeps job times consistent.

Automated PRs for title normalisation. When new cinema titles come in, a workflow records each title alongside its normalised output and opens a PR for manual review. This serves two purposes: it builds up a set of real titles to confirm that changes to the normaliser work as expected, and the PR diff makes it easy to spot cases where the normaliser isn't behaving correctly. I tried automating the review step itself with an LLM — that post covers how that went.

Comparison against Accessible Screenings UK. Accessible Screenings UK maintains their own dataset of accessible cinema showings. A workflow runs a comparison between their data and Clusterflick's and surfaces anything that's in their data but missing from ours — a useful cross-check for coverage gaps that would otherwise be invisible.

Summary of a previous accessible screenings comparison

Having these as scheduled or triggered workflows means they happen consistently rather than being something that gets done when someone remembers.
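As a concrete example of the first of those, a weekly cache-clearing workflow can lean on the gh CLI's cache commands — the schedule and repo reference here are a sketch, not the actual config:

```yaml
name: weekly-cache-cleanup
on:
  schedule:
    - cron: "0 3 * * 0" # Sunday mornings
jobs:
  clear-caches:
    runs-on: ubuntu-latest
    permissions:
      actions: write # needed to delete Actions caches
    steps:
      - run: gh cache delete --all --repo ${{ github.repository }}
        env:
          GH_TOKEN: ${{ github.token }}
```

The next scheduled pipeline run then repopulates the caches from scratch.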

The Scripts Repo as an npm Package

One pattern that's worked well is treating the shared scripts repo as an npm package installed directly from GitHub, rather than publishing it to the npm registry.

In each repo's package.json, the dependency looks like:

"scripts": "github:clusterflick/scripts"
Enter fullscreen mode Exit fullscreen mode

Running npm install pulls the latest from the default branch. No versioning ceremony, no publishing step — every repo that depends on the scripts always gets the current version.

There's no pinning by default, so a breaking change in the scripts repo will affect anything that reinstalls. The flip side is that a fix in the scripts repo propagates automatically the next time a job runs npm install — so you can fix a broken job without re-running the entire workflow.
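If pinning ever became necessary, npm's GitHub-dependency syntax already supports it — either a commit-ish after the `#`, or a semver range if the repo tags releases (the range here is illustrative):

```json
"scripts": "github:clusterflick/scripts#semver:^1.0.0"
```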

GitHub Projects for Task Management

The last piece is GitHub Projects for tracking work. Issues live on the repos they relate to, and the project board pulls them all together.

Keeping task management in GitHub means everything — code, data, CI, tasks — lives in one place. It's great for a project worked on in spare moments rather than full days.

New cinemas GitHub project

The Bigger Picture

What I've ended up with is a project where GitHub does a lot more than host the code. It runs the pipeline, stores the data, hosts the site, handles the automation, and tracks the work. That wasn't the plan from the start — it accumulated decision by decision, each one driven by the same question: what's the cheapest way to do this that doesn't create more problems than it solves?

The Releases-as-storage pattern still feels a bit odd to explain, but it works. The multi-repo cross-trigger setup adds some operational complexity, but it keeps each stage understandable on its own. None of it is architecture for architecture's sake — it's just what emerged from trying to keep the whole thing running for as little as possible.
