Getting the Data Model Right: Movie -> Showings -> Performances

#webdev #json #javascript #architecture

When I started building cinema aggregation tooling — pulling listings from multiple independent cinemas — the first real decision was the data model. I've fought bad schemas before. So I sat with this one for a while before writing any code.

The hierarchy I landed on is Movie → Showings → Performances, and while it might sound over-engineered at first glance, every layer earns its place.

Why not just Movie → Performances?

My first schema was essentially flat. A movie had a title, some overview metadata (directors, actors, duration), and an array of performances — times you could go and see it. Simple enough, and it worked fine when I was dealing with a single cinema's listings.

But a cinema doesn't just show a film. It shows variants of a screening. Take Hackney Picturehouse's 40th anniversary run of Labyrinth. They didn't just list it once with a bunch of times — they had regular showings, a "Kids' Club" baby-friendly screening, and a "Relaxed Screening" for folks needing additional support, including neurodivergent audiences and those living with dementia. These aren't just different times — they're fundamentally different experiences, each with their own listing page, their own description, and their own set of performance slots.

That middle layer — the Showing — captures this. A Showing represents one cinema's particular presentation of a movie. It carries the variant-specific context: the URL for that listing, any notes about what makes it different, and its own array of performances underneath. Hackney Picturehouse's Labyrinth becomes three Showings, each with their own performances — rather than one flat list of times where you have to squint at freetext notes to figure out which screening is which.
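To make the three layers concrete, here is a minimal sketch of how the Labyrinth run might be represented. The URLs and timestamps are placeholders, and the venue field and the exact property names are assumptions for illustration rather than the real schema.

```javascript
// Illustrative only: URLs and times are placeholder values, and field names
// other than url, notes and performances are assumptions.
const labyrinth = {
  title: "Labyrinth",
  showings: [
    {
      venue: "Hackney Picturehouse",
      url: "https://example.com/labyrinth",
      notes: null, // regular run, nothing special to flag
      performances: [{ time: "2026-01-10T18:30" }, { time: "2026-01-11T20:45" }],
    },
    {
      venue: "Hackney Picturehouse",
      url: "https://example.com/labyrinth-kids-club",
      notes: "Kids' Club",
      performances: [{ time: "2026-01-11T10:30" }],
    },
    {
      venue: "Hackney Picturehouse",
      url: "https://example.com/labyrinth-relaxed",
      notes: "Relaxed Screening",
      performances: [{ time: "2026-01-13T14:00" }],
    },
  ],
};

// Each performance inherits its variant from the showing it sits under,
// so no per-performance freetext is needed to tell the screenings apart.
const slots = labyrinth.showings.flatMap((showing) =>
  showing.performances.map((performance) => ({
    variant: showing.notes ?? "Regular",
    time: performance.time,
  }))
);

console.log(slots);
```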

The original schema

The first version of my transform schema — the contract that each cinema's scraper had to produce — looked roughly like this: a flat array of objects, each with a title, a url, an overview block of metadata, and an array of performances. Each performance had a time, optional screen, freetext notes, and a bookingUrl.
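For comparison, a single entry in that flat shape might have looked something like the sketch below. The property names follow the description above; the values are placeholders, and the duration unit (minutes) is an assumption. The grouping loop shows what a consumer has to do to recover the variants from the notes text.

```javascript
// One entry in the flat, single-venue shape. Values are placeholders.
const flatEntry = {
  title: "Labyrinth",
  url: "https://example.com/labyrinth",
  overview: {
    directors: ["Jim Henson"],
    actors: ["David Bowie", "Jennifer Connelly"],
    duration: 101, // assumed to be minutes
  },
  performances: [
    { time: "2026-01-10T18:30", screen: "1", notes: "", bookingUrl: "https://example.com/book/1" },
    { time: "2026-01-11T10:30", notes: "Kids' Club", bookingUrl: "https://example.com/book/2" },
    { time: "2026-01-11T20:45", screen: "2", notes: "", bookingUrl: "https://example.com/book/3" },
    { time: "2026-01-13T14:00", notes: "Relaxed Screening", bookingUrl: "https://example.com/book/4" },
  ],
};

// The only way to recover the variants is to bucket performances by their
// freetext notes, which breaks as soon as two cinemas word things differently.
const timesByNote = {};
for (const performance of flatEntry.performances) {
  const label = performance.notes || "(no notes)";
  timesByNote[label] = timesByNote[label] || [];
  timesByNote[label].push(performance.time);
}

console.log(timesByNote);
```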

It got the job done for a single venue, but it was doing too much in too few layers. The notes field on each performance was carrying all the variant information as unstructured text. Categories lived in the overview, but there was no way to distinguish between a film, a live comedy night, and a quiz. Duration was required, which made sense when I was only generating calendar events, but caused problems whenever a listing had no runtime. And there was no hook for enriching the data with external sources.

What changed

The evolved schema introduces several things the original couldn't support cleanly.

A showingId gives each showing a stable identity. This matters when you're deduplicating across sources or tracking what's changed between scrapes.
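As a rough illustration of the "what's changed between scrapes" case, the sketch below compares two scrapes by showingId. The diffScrapes helper and the ID values are hypothetical; only the showingId property comes from the schema described here.

```javascript
// Hypothetical helper: report which showings appeared or disappeared
// between two scrapes of the same venue, keyed on showingId.
function diffScrapes(previous, current) {
  const previousIds = new Set(previous.map((showing) => showing.showingId));
  const currentIds = new Set(current.map((showing) => showing.showingId));
  return {
    added: current
      .filter((showing) => !previousIds.has(showing.showingId))
      .map((showing) => showing.showingId),
    removed: previous
      .filter((showing) => !currentIds.has(showing.showingId))
      .map((showing) => showing.showingId),
  };
}

// Placeholder IDs for illustration.
const yesterday = [{ showingId: "labyrinth" }, { showingId: "labyrinth-relaxed" }];
const today = [{ showingId: "labyrinth" }, { showingId: "labyrinth-kids-club" }];

const changes = diffScrapes(yesterday, today);
console.log(changes);
```

Without a stable identifier, the same comparison would have to fall back on titles or URLs, both of which a cinema can change at any time.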

A category enum (movie, tv, quiz, comedy, music, talk, workshop, shorts, event) acknowledges that modern independent cinemas are not just cinemas. They host all kinds of events, and your data model needs to represent that without shoehorning everything into a film-shaped hole. It also set the scene for going beyond cinemas to any venue that screens films and might have other interesting events.
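In code, the enum can be mirrored as a frozen list with a small guard. The category values are the ones listed above; the assertCategory helper is a hypothetical name used only for this sketch.

```javascript
// The category values named in the schema.
const CATEGORIES = Object.freeze([
  "movie", "tv", "quiz", "comedy", "music", "talk", "workshop", "shorts", "event",
]);

// Hypothetical guard: returns the value unchanged or throws on anything unknown.
function assertCategory(value) {
  if (!CATEGORIES.includes(value)) {
    throw new TypeError(`Unknown category: ${value}`);
  }
  return value;
}

console.log(assertCategory("quiz"));
```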

Structured accessibility data at the performance level replaces freetext notes for things like audio description, baby-friendly screenings, hard-of-hearing support, relaxed sessions, and subtitles. This is crucial — accessibility isn't a property of the movie, or even the showing. It's a property of that specific screening at that specific time. A Tuesday afternoon showing might be relaxed; the Saturday evening one isn't.
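A small sketch of why this is easier to work with than freetext: once accessibility is a structured object on each performance, filtering is a one-liner. The flag names used here (relaxed, subtitled, audioDescription) are assumptions, not the real property names.

```javascript
// Hypothetical flag names; the real schema may spell these differently.
const showing = {
  title: "Labyrinth",
  performances: [
    { time: "2026-01-13T14:00", accessibility: { relaxed: true, subtitled: true } }, // Tuesday afternoon
    { time: "2026-01-17T19:30", accessibility: { audioDescription: true } },         // Saturday evening
    { time: "2026-01-18T16:00" },                                                    // no accessibility data
  ],
};

// Return only the performances where the given flag is explicitly true.
function performancesWith(showing, flag) {
  return showing.performances.filter(
    (performance) => performance.accessibility?.[flag] === true
  );
}

const relaxed = performancesWith(showing, "relaxed");
console.log(relaxed.map((performance) => performance.time));
```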

A status object on each performance captures things like whether it's sold out. Again, this is inherently performance-level data.
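The same idea applies to status. In the sketch below, the soldOut property name is an assumption about how the status object might be shaped.

```javascript
// Hypothetical shape: status.soldOut is an assumed property name.
const performances = [
  { time: "2026-01-10T18:30", status: { soldOut: true } },
  { time: "2026-01-11T20:45", status: { soldOut: false } },
  { time: "2026-01-13T14:00" }, // no status reported by the venue
];

// Treat a performance as bookable unless it is explicitly marked sold out.
const bookable = performances.filter(
  (performance) => performance.status?.soldOut !== true
);

console.log(bookable.map((performance) => performance.time));
```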

External enrichment fields — themoviedb and themoviedbs (plural) — provide the hook for hydrating listings with data from TMDB. The singular version covers standard films; the plural handles double bills or curated screening programmes where a single showing maps to multiple movies.
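Downstream code usually wants one shape to loop over, so a consumer might normalise the two fields into a single array. This is a sketch under the assumption that each entry is an object with an id; the tmdbEntries helper and the ID values are hypothetical.

```javascript
// Hypothetical helper: always return an array of TMDB references,
// whether the showing uses the singular or the plural field.
function tmdbEntries(showing) {
  if (Array.isArray(showing.themoviedbs)) return showing.themoviedbs;
  return showing.themoviedb ? [showing.themoviedb] : [];
}

// Placeholder IDs, not real TMDB identifiers.
const singleFilm = { title: "Labyrinth", themoviedb: { id: 1 } };
const doubleBill = { title: "Double Bill", themoviedbs: [{ id: 2 }, { id: 3 }] };
const quizNight = { title: "Film Quiz" };

console.log(tmdbEntries(singleFilm).length, tmdbEntries(doubleBill).length, tmdbEntries(quizNight).length);
```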

There were also several small refinements: duration is no longer required (because a quiz night doesn't have a runtime), year was added to the overview, classification replaced the awkwardly named age-restriction, and additionalProperties: false was added throughout the schema so that validation rejects any property the schema doesn't define.
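To show what additionalProperties: false buys you, here is a cut-down schema fragment and a hand-rolled check that mimics its effect for one level of an object. The fragment is an assumption about the schema's shape, and unknownKeys is a hypothetical helper; in practice a full JSON Schema validator would do this job.

```javascript
// A cut-down, assumed fragment of a performance schema.
const performanceSchema = {
  type: "object",
  required: ["time"],
  additionalProperties: false,
  properties: {
    time: { type: "string" },
    screen: { type: "string" },
    bookingUrl: { type: "string" },
  },
};

// Hypothetical helper: list the keys a strict schema would reject.
// This only inspects one level and ignores types, unlike a real validator.
function unknownKeys(schema, value) {
  if (schema.additionalProperties !== false) return [];
  const allowed = new Set(Object.keys(schema.properties));
  return Object.keys(value).filter((key) => !allowed.has(key));
}

const scraped = { time: "2026-01-10T18:30", screen: "1", bookingURL: "https://example.com/book/1" };
console.log(unknownKeys(performanceSchema, scraped));
```

The misspelt bookingURL key is exactly the kind of scraper typo that would otherwise slip through silently.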

Where it gets interesting: combining venues

The transform schema represents what comes out of a single venue's scrape. Each cinema produces its own array of showings. But the aggregation site needs to combine these into a unified view: one movie, with showings from multiple cinemas, each with their own performances.

This is where the hierarchy really pays off. The Movie → Showings → Performances structure scales naturally from single-venue to multi-venue. You don't need to restructure anything — you just group showings under a shared movie identity.
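A minimal sketch of that grouping step follows. It assumes each venue's scrape is an object with a venue name and a showings array, and that the shared movie identity is the TMDB id when present, falling back to a normalised title. Both movieKey and groupByMovie are hypothetical names, and the IDs are placeholders.

```javascript
// Hypothetical identity rule: prefer the TMDB id, else a normalised title.
function movieKey(showing) {
  if (showing.themoviedb?.id != null) return `tmdb:${showing.themoviedb.id}`;
  return `title:${showing.title.trim().toLowerCase()}`;
}

// Group showings from several venues under one movie entry each.
function groupByMovie(venueScrapes) {
  const movies = new Map();
  for (const { venue, showings } of venueScrapes) {
    for (const showing of showings) {
      const key = movieKey(showing);
      if (!movies.has(key)) {
        movies.set(key, { key, title: showing.title, showings: [] });
      }
      movies.get(key).showings.push({ venue, ...showing });
    }
  }
  return [...movies.values()];
}

const movies = groupByMovie([
  { venue: "Cinema A", showings: [{ title: "Labyrinth", themoviedb: { id: 1 }, performances: [] }] },
  {
    venue: "Cinema B",
    showings: [
      { title: "Labyrinth (40th Anniversary)", themoviedb: { id: 1 }, performances: [] },
      { title: "Film Quiz", performances: [] },
    ],
  },
]);

console.log(movies.map((movie) => `${movie.title}: ${movie.showings.length}`));
```

The showings and their performances pass through untouched; only the top layer is new.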

But combining also means deduplicating, and that's where things get nuanced. When the same movie appears at three different cinemas, you'll have overlapping metadata at different levels:

- Director and cast info might exist in the showing-level overview (scraped from the cinema's own listing) and at the movie level (from TMDB). Which do you trust? Usually the external source is more reliable and complete, but not always: a cinema might list a special guest or a different cut.
- Accessibility information is firmly performance-level. No deduplication is needed, because it's inherently specific to that time slot at that venue.
- Categories and genres can drift between sources. One cinema might tag something as "Drama", another as "Drama / Thriller", and TMDB might call it "Drama, Crime". You need a strategy for reconciling these.

Deduplication isn't a single operation — it's a per-field decision about which source of truth wins at which level of the hierarchy. Having clean separation between movies, showings, and performances makes those decisions much more tractable than they'd be in a flat structure.
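One way to express "a per-field decision" is a table of strategies, one function per field. The sketch below is an assumption about how that could look, not the actual implementation: directors prefer the external source and fall back to the first scraped value, while genres are split on "/" and "," and merged into a de-duplicated list. The names fieldStrategies, splitGenres and mergeMovie are hypothetical.

```javascript
// Split a genre label such as "Drama / Thriller" or "Drama, Crime" into parts.
function splitGenres(label) {
  return label.split(/[\/,]/).map((genre) => genre.trim()).filter(Boolean);
}

// One strategy per field: (externalValue, scrapedValues[]) => mergedValue.
const fieldStrategies = {
  // Prefer the external source; fall back to the first cinema that has a value.
  directors: (external, scraped) =>
    external ?? scraped.find((value) => value != null) ?? [],
  // Union of every source, external first, with duplicates removed.
  genres: (external, scraped) => {
    const all = [external, ...scraped].filter(Boolean).flatMap(splitGenres);
    return [...new Set(all)];
  },
};

function mergeMovie(external, showingOverviews) {
  const merged = {};
  for (const [field, pick] of Object.entries(fieldStrategies)) {
    merged[field] = pick(external[field], showingOverviews.map((overview) => overview[field]));
  }
  return merged;
}

const merged = mergeMovie(
  { genres: "Drama, Crime" }, // external source has no directors in this example
  [{ genres: "Drama" }, { genres: "Drama / Thriller", directors: ["A. Director"] }]
);

console.log(merged);
```

Adding a new field then means adding one entry to the table, rather than threading another special case through a single merge function.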

The payoff

Spending time upfront on the data model meant that when complexity arrived — new venue types, accessibility requirements, external data enrichment, multi-venue aggregation — the schema absorbed it instead of fighting it. The hierarchy isn't clever for its own sake; it maps onto how cinemas actually programme their events, and that's what makes it hold up.

Next post: ~~Site Performance: Loading 30,000+ Showings in a Browser~~
Change in the schedule: A Brief Detour: Two Writing Challenges and What Came Out of Them