At least twice a day, the pipeline scrapes 250+ London cinemas and produces a dataset of 1,500+ films with 30,000+ showings. Then I need to get all of that into a browser.
Getting the raw data from venues is its own challenge (covered in an earlier post) but even once you've got it, making it available to users fast and in a useful way has its own set of problems to solve.
Clusterflick runs entirely as a static site, served from GitHub Pages with no live server. That's a deliberate constraint — the whole project runs on GitHub's free tier, and I'd like to keep it that way (more on that in a future post). But it means the browser has to do more of the work, and that puts performance decisions front and centre.
By the time data reaches the frontend, it's already been through several pipeline stages — each one producing a GitHub release that the next stage picks up:
- **Retrieve**: raw HTML, JSON APIs, and scraped pages from all 252 venues (~800 MB total)
- **Transform**: extracts structured showings from the raw data, matches films against TMDB and saves the ID of matches (down to ~15 MB total)
- **Combine**: merges the films from all venues together and hydrates films that have a TMDB ID with rich metadata (cast, genres, poster images, ratings) (~18 MB total)
- **Process**: strips redundant data, extracts URL prefixes, splits into chunks (~5 MB raw, ~1.5 MB gzipped over the wire)
This post is about the decisions in that last step (and one I unmade): getting from the combined JSON to something a browser can load and render quickly.
The Compression Detour
Before building anything clever on the frontend, I wanted to be sure the raw data was as small as possible. I'd been running the JSON through compress-json, a library that structurally transforms JSON — deduplicating repeated values into lookup tables, encoding types differently. It made the raw file dramatically smaller. As an example, for one of the runs the full dataset without it is 10.97 MB; with it, 4.85 MB. That's a real reduction.
So I ran a benchmark across every optimisation in the pipeline to see which ones were actually earning their place.
| Optimisation | Gzipped impact |
|---|---|
| Removing showing overviews | 💪 -6.1% (saves 109 KB) |
| URL prefix extraction | 💪 -5.0% (saves 90 KB) |
| Removing IDs | 💪 -2.4% (saves 43 KB) |
| Removing false a11y flags | 🤷 ~0% |
| Trimming RT data | 🤷 ~0% |
| compress-json | 😱 +18.5% (hurts!) |
The headline finding: compress-json makes the gzipped output larger. Without it, the gzipped total is 1.43 MB. With it, 1.76 MB. That's 333 KB I was paying to make things worse.
The reason makes sense once you think about it. Gzip excels at finding repeated byte sequences — exactly what compress-json was doing first. The two approaches fight each other: compress-json's transformed structure is actually harder for gzip to compress than plain repetitive JSON.

Gzip decompression is built into every browser's network stack — native C++ code that runs before JavaScript even sees the response. compress-json decompression, by contrast, runs on the main thread in JavaScript. So the current pipeline was paying three times: larger transfer size, extra JS bundle weight for the decompress library, and CPU time running decompress() on every chunk.
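To see how well gzip handles plain repetitive JSON on its own, here's a toy illustration using Node's built-in `zlib` — not the actual pipeline or dataset, just simulated showing data with the kind of repeated keys and values gzip loves:

```typescript
// Toy illustration (not the real pipeline): gzip thrives on repeated
// byte sequences, so verbose-but-repetitive JSON compresses well
// without any structural pre-transformation.
import { gzipSync } from "node:zlib";

// Simulate repetitive showing data: the same venue/film fields repeated.
const showings = Array.from({ length: 1000 }, (_, i) => ({
  venue: "Prince Charles Cinema",
  film: "The Third Man",
  bookingUrl: `https://example.com/book/${i}`,
}));

const raw = Buffer.from(JSON.stringify(showings));
const gzipped = gzipSync(raw);

// The repeated keys and values nearly vanish under gzip.
console.log(`raw: ${raw.length} B, gzipped: ${gzipped.length} B`);
```

On data like this, gzip alone routinely achieves a 10x+ reduction — there's very little left for a JavaScript-level dedup pass to claw back.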
So I deleted it. The "no compress-json" variant still has all the other optimisations applied and lands at 1.43 MB — 19% smaller than before. 🎉
The two optimisations that turned out to have zero impact — removing false accessibility flags and trimming Rotten Tomatoes fields — were easy to rationalise after the fact. Accessibility data is sparse; very few performances have those flags set at all, so deleting false values removes almost nothing. The RT fields are a handful of small values per movie. Neither gives gzip much to work with.
Splitting the Data into Chunks
Even at 1.43 MB gzipped, serving the full dataset as a single file would mean users wait for everything before seeing anything. Instead, as part of the data processing, the dataset is split into chunks, with a metadata file written alongside them.
The chunking isn't by movie count — it's by serialised byte size, with a target of ~400 KB per chunk. Chunking by movie count would produce wildly uneven file sizes; a blockbuster showing at 50+ venues generates far more data than a one-week indie run. Performance count was an earlier approach, but it still produced too much variance — chunk files ranged from 65 KB to 1.2 MB. Switching to byte size narrowed that to 16 KB to 727 KB, with the bulk of chunks clustering tightly between 324 KB and 436 KB.
The remaining outliers are expected. The small tail chunks at the end of the alphabet simply don't have enough movies left to fill a full bucket. The large ones contain individual films whose serialised data alone exceeds the target — a blockbuster with 50+ venues and thousands of performances will do that — so they necessarily get a bucket to themselves.
Movies are sorted alphabetically by normalised title before being bucketed — mirroring the default sort order on the site. The idea is that chunk 0 downloads first and contains the movies at the top of the list, which are visible on screen when the page first loads. So the data the user actually sees is most likely to arrive first, and there's less chance of visible updates as subsequent chunks load in below the fold.
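The sort-then-bucket logic can be sketched as a greedy pass: sort by normalised title, then fill buckets up to the byte target. This is a simplification of the real code — `TARGET_BYTES` and `normaliseTitle` are illustrative names, not the pipeline's actual identifiers:

```typescript
// Sketch of size-based chunking: sort by normalised title, then
// greedily fill buckets up to a serialised-byte target.
const TARGET_BYTES = 400 * 1024; // ~400 KB per chunk

interface Movie { title: string; [key: string]: unknown }

// Simplified normalisation: lowercase and drop a leading article.
const normaliseTitle = (t: string) =>
  t.toLowerCase().replace(/^(the|a|an)\s+/, "");

function chunkMovies(movies: Movie[]): Movie[][] {
  const sorted = [...movies].sort((a, b) =>
    normaliseTitle(a.title).localeCompare(normaliseTitle(b.title)),
  );
  const chunks: Movie[][] = [];
  let current: Movie[] = [];
  let currentBytes = 0;
  for (const movie of sorted) {
    const size = Buffer.byteLength(JSON.stringify(movie));
    // An oversized movie (a blockbuster with thousands of performances)
    // simply gets a bucket to itself rather than splitting mid-movie.
    if (current.length > 0 && currentBytes + size > TARGET_BYTES) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(movie);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

This is also where the outliers described above come from: the last alphabetical bucket is whatever's left over, and a single movie bigger than the target becomes its own chunk.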
```
data.meta.a1b2c3d4e5.json
data.0.f6g7h8i9j0.json
data.1.k1l3m5n7o9.json
...
data.<index>.<fingerprint>.json
```
The metadata file carries the full lookup tables for genres, people, and venues (shared across all movies), the URL prefix table used to reconstruct booking links, and the mapping that tells the client which chunk contains which movie ID. It's the one file the browser always fetches first — and it's hashed like the chunks, so its filename is baked into NEXT_PUBLIC_DATA_FILENAME at build time.
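The URL prefix table works because booking links from the same venue share long common prefixes. A minimal sketch of the idea (the function names and the "split at the last slash" heuristic are assumptions, not the pipeline's actual implementation):

```typescript
// Sketch of URL prefix extraction: store each shared prefix once in a
// table, and keep only a (prefixIndex, suffix) pair per booking link.
function compressUrls(urls: string[]): { prefixes: string[]; refs: [number, string][] } {
  const prefixes: string[] = [];
  const refs: [number, string][] = urls.map((url) => {
    // Treat everything up to the last "/" as the shared prefix.
    const cut = url.lastIndexOf("/") + 1;
    const prefix = url.slice(0, cut);
    let idx = prefixes.indexOf(prefix);
    if (idx === -1) idx = prefixes.push(prefix) - 1;
    return [idx, url.slice(cut)];
  });
  return { prefixes, refs };
}

// The client reconstructs full booking links from the metadata's table.
function expandUrl(prefixes: string[], ref: [number, string]): string {
  return prefixes[ref[0]] + ref[1];
}
```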
There's one catch with GitHub Pages: it sets a 10-minute cache TTL on everything at the browser level, which means even a fingerprinted file that hasn't changed for weeks gets revalidated every 10 minutes. Cloudflare sits in front of the site and fixes this in two ways: it caches the files at the edge, and it overrides GitHub's cache-control headers so browsers are told to store all JSON files for a year. Since every file — chunks and metadata alike — is fingerprinted, a changed file always means a new URL and a cache miss by design. A first-time visitor fetches from Cloudflare's edge and caches locally for a year. A repeat visitor gets it straight from their browser cache. Either way, they're only ever making a network request for files that have actually changed.
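Content fingerprinting itself is straightforward: hash the serialised file and bake the digest into the filename, so changed data always means a new URL. A sketch using Node's `crypto` — the 10-character truncation mirrors the filenames above but is an assumption about the real scheme:

```typescript
// Sketch of content fingerprinting: any change to the chunk's bytes
// produces a different digest, hence a different filename, hence a
// guaranteed cache miss despite the year-long browser TTL.
import { createHash } from "node:crypto";

function fingerprintedName(index: number, json: string): string {
  const hash = createHash("sha256").update(json).digest("hex").slice(0, 10);
  return `data.${index}.${hash}.json`;
}
```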
Once the client has the metadata, CinemaDataProvider handles the rest:
- Priority chunk — on a movie detail page, the client looks up the movie's chunk in the mapping and fetches it immediately. Showings appear before the rest of the dataset has loaded.
- All other chunks in parallel — via `Promise.allSettled()`, so a single failed chunk doesn't block everything else from loading.
- Expand and prune — IDs stripped before serialisation are re-added via `expandData()` (restoring the keys that were removed to save bytes), and past performances are stripped before chunks enter React state.
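One plausible shape for that ID restoration step (the real `expandData()` may well differ): if the metadata records, in order, which movie IDs live in each chunk, the `id` key can be dropped from every serialised movie and re-attached on the client:

```typescript
// Hypothetical sketch of re-expanding stripped IDs. Assumes the
// metadata file lists the movie IDs for each chunk in the same order
// as the movies appear in that chunk.
interface MovieData { title: string; [key: string]: unknown }

function expandChunk(
  chunkMovies: MovieData[],
  chunkIds: string[], // from the metadata file, same order as the chunk
): (MovieData & { id: string })[] {
  return chunkMovies.map((movie, i) => ({ ...movie, id: chunkIds[i] }));
}
```

The byte saving is small per movie, but as the benchmark table showed, removing IDs was worth 43 KB gzipped across the whole dataset.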
Static Export Changes Everything
Clusterflick uses Next.js with output: "export". There's no live server. Every page is pre-rendered to static HTML during npm run build, then served from GitHub Pages.
This shapes every rendering decision. When Next.js docs talk about Server Components, in this context that means "code that runs at build time on a Node process" — not a server handling live requests. Whatever I pre-render is fixed until the next build.
Two Grids on the Home Page
The home page has a slightly odd architecture, and it's worth explaining why.
At build time, app/page.tsx (a Server Component) reads the chunk files from disk, merges them, applies the default filters — films and shorts, 7-day window — and takes the first 72 results sorted by normalised title. These 72 movies are rendered as a static HTML grid of poster images and links. No JavaScript required. This grid is wrapped in an SSROnly component that removes itself after hydration.
So during the initial paint, and for any crawler, there's a real grid of films with real titles and links in the HTML. Once JavaScript loads and mounts, SSROnly cleans up that static content and hands off to the interactive grid.
The 72 limit is deliberate. It's enough for a meaningful SEO payload — film titles, poster images, links — without bloating the HTML with hundreds of entries. The real, interactive grid that users actually browse is built entirely client-side with the full dataset, applying any filters which may be in effect.
Virtualising 1,500+ Posters
The filter UI is designed to give immediate visual feedback as you change options — in the current design the filter overlay is semi-transparent, so you can see the poster grid updating behind it as you adjust. That only works if rendering is fast. On an earlier design, where the filter controls sat directly above a flat list of results, the lag was obvious and painful: every filter change triggered a re-render of the entire list.
The solution is react-virtualized — specifically its Grid component combined with WindowScroller. Rather than rendering the full list, it calculates which cells are currently visible in the viewport and only renders those, plus a small buffer:
```tsx
<WindowScroller>
  {({ height, isScrolling, registerChild, onChildScroll, scrollTop }) => (
    <div ref={registerChild}>
      <Grid
        autoHeight
        cellRenderer={cellRenderer}
        columnCount={columnCount}
        columnWidth={POSTER_WIDTH + GAP} // 208px per column
        rowHeight={POSTER_HEIGHT + GAP} // 308px per row
        rowCount={rowCount}
        overscanRowCount={3} // pre-render 3 rows above/below viewport
        scrollTop={scrollTop}
        isScrolling={isScrolling}
        onScroll={onChildScroll}
        ...
      />
    </div>
  )}
</WindowScroller>
```
WindowScroller ties the grid's scroll position to the page's native scroll rather than creating a separate scrollable container. That keeps the browser scrollbar, avoids scroll-jank on mobile, and means the address bar hides naturally on iOS.
Fixed cell dimensions (always 200×300px with an 8px gap) let react-virtualized calculate row and column positions with simple arithmetic, avoiding expensive DOM measurement. Window width isn't available at build time, so the component initialises with a single-column placeholder and sets real dimensions in a useEffect after mount.
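The arithmetic involved really is that simple. A sketch of the dimension calculation (`gridDimensions` is an illustrative name; the constants mirror the snippet above):

```typescript
// With fixed cell dimensions, column and row counts fall out of plain
// division — no DOM measurement needed per cell.
const POSTER_WIDTH = 200;
const POSTER_HEIGHT = 300;
const GAP = 8;

function gridDimensions(containerWidth: number, movieCount: number) {
  // At least one column, even on the narrowest viewport (and for the
  // single-column placeholder used before the real width is known).
  const columnCount = Math.max(1, Math.floor(containerWidth / (POSTER_WIDTH + GAP)));
  const rowCount = Math.ceil(movieCount / columnCount);
  return { columnCount, rowCount, rowHeight: POSTER_HEIGHT + GAP };
}
```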
The first two rows are above the fold on most screens, so next/image is told to load those eagerly with fetchpriority="high". Everything below row 2 is lazy-loaded as the user scrolls.
One wrinkle: the intro section above the grid can be collapsed or expanded, which shifts the grid's offset on the page. WindowScroller needs to know about this:
```tsx
requestAnimationFrame(() => {
  window.dispatchEvent(new Event("resize"));
});
```
A synthetic resize event prompts WindowScroller to recalculate its position. Not elegant, but it works.
Movie Detail Pages: Stripping Performances Before They Cross the Wire
Each film has its own pre-rendered page. generateStaticParams() iterates every movie at build time and Next.js generates a static HTML file for each — typically 1,500+ pages per build.
The app/movies/[id]/[slug]/page.tsx Server Component does the structurally stable work: resolves genres, people, and venues for the film; generates JSON-LD structured data (Movie, BreadcrumbList, ScreeningEvent) for search engine rich results. Then — critically — it strips performances from the movie prop before passing it to the client component:
```tsx
const { performances: _performances, ...movieWithoutPerformances } = movie;
```
That means the pre-rendered HTML — and the inline JSON Next.js serialises into it for hydration — only contains movie metadata (title, poster, ratings, cast). The actual showtimes are fetched at runtime by the data context.
The app/movies/[id]/[slug]/page-content.tsx Client Component calls getDataWithPriority(movie.id) on mount, which fetches the chunk containing this film first before loading everything else in parallel. A startTransition defers the showings computation until after the hero section has rendered — so the poster, title, and ratings appear immediately, with showtimes filling in shortly after.
Where It Stands
With all of this in place, I ran Lighthouse against the site across cold and warm cache — averaged over three runs on desktop.
| Metric | Cold cache | Warm cache |
|---|---|---|
| Lighthouse score | 74/100 | 92/100 |
| First Contentful Paint | 459ms | 23ms |
| Largest Contentful Paint | 2.5s | 281ms |
| Speed Index | 2.5s | 42ms |
| Cumulative Layout Shift | 0.197 | 0.18 |
| Transfer size | 5.5 MB | 20 KB |
The warm cache numbers are the point of everything in this post — 308 of 336 network requests served from cache, 5.5 MB down to 20 KB (less than 1% of the data going across the wire), LCP dropping from 2.5s to 281ms (about 10% of the original time). That's what content-hashed files plus a year-long browser TTL actually buys you.
Cold cache is where there's still work to do. A 74/100 and a 2.5s LCP on first visit isn't bad, but it's not where I'd like it to be. The LCP is the main thing to improve — 2.5s is right at the edge of Google's "needs improvement" threshold, and it's what's dragging the cold cache score down. The CLS (0.197) is a known trade-off from the SSR grid handing off to the virtualised interactive grid, but given warm cache sits at 0.18 and still scores 92/100, it's clearly not the bottleneck.
Next post: Cleaning Cinema Titles Before You Can Even Search