I've been working on a personal project called Clusterflick: a single source for every movie showing across London. Right now it tracks 240 venues across 5 event platforms, pulling in 1,398 events and over 30,000 showings.
It started simply enough: I just wanted cinema times on my calendar. But it quickly spiralled into a full data pipeline running on GitHub Actions, a statically generated Next.js site, and a cluster of Raspberry Pis in my living room.
Some of the most interesting challenges so far:
- Movie matching is deceptively hard. You'd think title + year would uniquely identify a film. It doesn't. Neither does title + director. Sometimes a cinema listing doesn't give even a human enough information to identify the movie.
- Scraping at scale without a budget. GitHub runner IPs get blocked, so now there's a Raspberry Pi cluster handling the tricky ones.
- Using LLMs for data quality. When fuzzy matching falls short, LLMs have been surprisingly useful for resolving ambiguous movie lookups against The Movie DB.
- Keeping it cheap. The whole thing runs on near-zero infrastructure costs — GitHub Actions for orchestration, Releases as storage, static site generation to avoid hosting costs.
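To make the matching problem a bit more concrete, here is a minimal sketch of fuzzy title matching in Python. It is an illustration under assumptions, not Clusterflick's actual code: the `Candidate` type, the `best_match` function, the 0.85 threshold, the 0.05 margin, and the one-year tolerance are all hypothetical choices made for the example. The idea is to accept a match only when one candidate clearly wins, and otherwise return nothing so the listing can be passed to a slower fallback such as an LLM lookup or manual review.

```python
import re
import unicodedata
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical catalogue entry, e.g. one search result from a movie database."""
    movie_id: int
    title: str
    year: int


def normalise(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def score(listing_title: str, listing_year: Optional[int], candidate: Candidate) -> float:
    """Title similarity in [0, 1], penalised when the years disagree by more than one."""
    similarity = SequenceMatcher(
        None, normalise(listing_title), normalise(candidate.title)
    ).ratio()
    # Assumption: a one-year gap is tolerated (festival vs. general release dates).
    if listing_year is not None and abs(listing_year - candidate.year) > 1:
        similarity -= 0.3
    return similarity


def best_match(
    listing_title: str,
    listing_year: Optional[int],
    candidates: Sequence[Candidate],
    threshold: float = 0.85,  # assumed cut-off, would need tuning on real data
    margin: float = 0.05,     # assumed minimum lead over the runner-up
) -> Optional[Candidate]:
    """Return the clear winner, or None when the match is weak or ambiguous."""
    scored = sorted(
        ((score(listing_title, listing_year, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] < threshold:
        return None  # nothing similar enough
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        return None  # two candidates too close to call: defer to a fallback
    return scored[0][1]
```

With two films called "Nosferatu" (1922 and 2024) in the candidate list, a listing that includes the year 2024 resolves cleanly, while a listing with no year returns `None`. That second case is the kind of ambiguity where title similarity alone runs out and some other signal is needed.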
The whole project is open source on GitHub. If any of this sounds interesting, I'd love to hear from others working on similar scraping/aggregation/data pipeline projects.