Alistair

Three Things I Learned Using LLMs in a Data Pipeline

This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

"Ghibliotheque Presents: My Neighbor Totoro + Intro"

That's a real cinema listing title, but it's not a title you can just search for. And as titles go, it's one of the more straightforward ones. Things get even messier when we get into cinema listing pages. I've seen venues that don't include a year, don't include the director, or give you little more than a title and a one-line description. If you're building an aggregator that needs to identify what's actually showing, you spend a lot of time staring at strings like this.

I've been building Clusterflick, a cinema aggregator for London that pulls listings from 250+ venues daily. I thought scraping would be the hard part. But figuring out what a listing actually is — which film, matched to which entry in The Movie DB — is where a lot of complexity lies. And it's where I've been using Gemini.

There's a whole layer of work involved in cleaning raw listing strings down to something searchable — that's worth a post of its own — but even with a clean title, the matching problem doesn't go away. Many venues don't include the information needed to programmatically search The Movie DB API — just a title and maybe a vague description. Even when they do have more data, e.g. title plus year or even title plus director, it doesn't necessarily uniquely identify a film. And legitimate films with short or common names can be difficult to surface in TMDB search results at all.

I use Gemini to help at four stages in the identification pipeline:

  1. Match against TMDB — given the cinema listing and a list of search results from TMDB, Gemini picks the best match. This handles the majority of cases.
  2. Direct identification — if TMDB search returns nothing useful, I ask Gemini if it recognises the film from the listing alone. Its training data often knows about films that don't surface well through search.
    • common/ask-llm.js (The original use of Gemini in the project — everything else has grown from this first step)
  3. Classify the listing — if we still can't identify a film, I ask Gemini what the listing actually is: a film, a short, a double bill, a quiz night, a live event, a comedy show. That classification feeds into filters on the website, and it determines what happens next in the pipeline — a listing classified as multiple films or shorts triggers its own follow-up steps.
  4. Extract multiple films or shorts — if a listing is identified as containing multiple films or shorts (a double bill, a shorts programme, a marathon), I ask Gemini to pull out the individual titles so each can be matched separately.

Each stage only fires if the previous one didn't produce a result. That keeps costs down and means Gemini is only doing the hard work when simpler approaches have already failed.
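The cascade above can be sketched as a simple fall-through, where each stage is an async function that returns a result or null. `identifyListing` and the stage names here are hypothetical, not the actual pipeline code:

```javascript
// Sketch of the stage cascade: try each identifier in order, and only
// move to the next (more expensive) stage when the previous one fails.
async function identifyListing(listing, stages) {
  for (const stage of stages) {
    const result = await stage(listing);
    if (result) return result; // first stage to produce a match wins
  }
  return null; // nothing could identify the listing
}
```

The point is that the ordering encodes cost: cheap TMDB matching runs on every listing, while classification and multi-film extraction only run on the leftovers.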

The model I'm using is gemini-2.5-flash-lite. I'd been running on gemini-2.0-flash for a while and recently upgraded — one line change in the code, and I saw no noticeable difference in the identification and categorisation output from the previous run. Free performance improvement!

Demo

Clusterflick is live at clusterflick.com — 250+ venues and thousands of films across London, updated daily.

The pipeline code is open source (github.com/clusterflick/scripts), and runs across GitHub's cloud runners and a cluster of 6 Raspberry Pis in my living room — so if the judges are looking for a good home for that prize, I have a shelf ready! 🍿

The parsing layer discussed below is in llm-client.js.

What I Learned

Asking for a reason made the model more honest

When I first started asking Gemini to match listings to TMDB results, I was asking it to return a match and a confidence score (I use 0–9). It worked, but I was getting too many confident wrong answers — the model would pick something and report high confidence even when it was clearly a stretch.

The fix was adding a reason key to the expected JSON response. Forcing the model to articulate why it had chosen a match made it noticeably more cautious. It's like the difference between someone blurting out an answer and someone having to show their working. The false positives dropped.

{
  "reason": "Listing matched description of magical forest spirits and animation style",
  "confidence": 8,
  "match": {
    "id": 8392
  }
}

I now apply the same pattern wherever I need the model to make a judgement call. Structured output with a reason field is the single most effective prompt change I've made.
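As a sketch of how the pattern can be enforced, a small validator can reject responses that skip the reason before the pipeline acts on a match. The field names mirror the JSON example above; `isValidMatchResponse` is an illustrative helper, not part of the real codebase:

```javascript
// Reason-first structured output: the schema asks for "reason" before
// the verdict, so the model must articulate its justification before
// committing to a match and a confidence score (0-9).
function isValidMatchResponse(parsed) {
  return (
    typeof parsed.reason === "string" &&
    parsed.reason.length > 0 &&
    Number.isInteger(parsed.confidence) &&
    parsed.confidence >= 0 &&
    parsed.confidence <= 9 &&
    parsed.match !== undefined
  );
}
```

Validating the shape also gives you a natural place to discard answers that came back without any reasoning at all.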

Using Gemini to improve my prompts

At some point I realised I was spending more time tweaking prompts than writing actual pipeline code. So I started asking Gemini to critique and rewrite them for me.

It sounds circular, but it works. The model is better than I am at structuring instructions for itself — clearer constraints, better edge case handling, more consistent output. Now when a prompt isn't giving me the results I want, my first step is to paste it into a fresh conversation and ask the model what's wrong with it and how it would rewrite it.

The results are often prompts I wouldn't have written myself. More explicit about edge cases. Better at specifying output format. And because the model wrote them, they tend to produce more predictable responses.

Defensive parsing is non-negotiable

Even with well-crafted prompts, LLM output in production will occasionally be malformed. I found this out when the model truncated a film overview mid-sentence and left a trailing backslash — one bad character broke JSON.parse and failed the entire job.

The longer the pipeline ran, the more edge cases surfaced. The model occasionally hallucinates fields that aren't in the schema (backdrop_path appearing uninvited was a fun one). It sometimes leaves unescaped quotes inside string values. Markdown code fences show up often enough that stripping them became standard. Each of these is now a line in the sanitisation layer:

const result = await chatSession.sendMessage(prompt);
const response = result.response.text();

// Unwrap the string if it's been wrapped in a markdown block
const jsonString = response.replace("```json", "").replace("```", "");

const correctedJsonString = jsonString
  // Apply corrections for malformed escape characters (perhaps due to truncation)
  .replace(/\\(?!["\\/bfnrtu]|u[0-9a-fA-F]{4})/g, "")
  // Apply corrections for hallucinated invalid additions
  .replace(/"backdrop_path": "[^,]+,\n/i, "")
  // Fix unescaped quotes within the "reason" field value
  .replace(
    /"reason"\s*:\s*"(.*)"\s*([,}])/s,
    (_match, reasonContent, terminator) => {
      const fixed = reasonContent.replace(/(?<!\\)"/g, '\\"');
      return `"reason":"${fixed}"${terminator}`;
    },
  );

try {
  return JSON.parse(correctedJsonString);
} catch (e) {
  console.log("Error parsing LLM answer");
  console.log("--- Original response: -----------------------");
  console.log(response);
  console.log("--- Corrected response: ----------------------");
  console.log(correctedJsonString);
  throw e;
}

Every line in there is a real production issue. Treat LLM responses as untrusted input, sanitise before you parse, and log both the original and corrected response when things go wrong — you'll want that context when debugging.

Google Gemini Feedback

Flash-lite has been reliable and cheap, which matters when you're running a pipeline daily across hundreds of venues and thousands of films. Cost has stayed predictable as the number of venues has grown, which is exactly what I needed.

One deliberate choice worth mentioning: I run with temperature: 0. This is a data pipeline, not a creative writing tool — I want output as close to deterministic and consistent as possible.

const generationConfig = {
  temperature: 0,
  topP: 0.95,
  topK: 40,
  maxOutputTokens: 8192,
};

The upgrade from 2.0 to 2.5 was painless — one line change, no prompt tuning needed. To confirm nothing had shifted, I ran the pipeline twice with each model version and compared the transformed output. No noticeable differences for any venues. That kind of stability is worth a lot in production.
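That comparison can be sketched as a diff over the two runs' outputs, keyed by listing. `diffRuns` and the output shape are illustrative assumptions, not the pipeline's actual code:

```javascript
// Regression check between two pipeline runs (e.g. old vs new model
// version): report every listing whose match changed or disappeared.
function diffRuns(runA, runB) {
  const changed = [];
  for (const [listingId, matchA] of Object.entries(runA)) {
    const matchB = runB[listingId];
    if (!matchB || matchA.id !== matchB.id) {
      changed.push({ listingId, before: matchA, after: matchB ?? null });
    }
  }
  return changed;
}
```

An empty diff is what gave me the confidence to ship the model upgrade without re-tuning any prompts.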

The main frustration I haven't fully solved is flip-flopping. The pipeline runs daily, and occasionally a listing that was confidently matched to film X on one run comes back as film Y the next. The confidence is right on the edge either way — only one can be right, or both can be wrong — and temperature: 0 helps but doesn't eliminate it. I'd love better signalling when the model is genuinely on the fence, rather than having to infer uncertainty from a confidence score that turns out not to be reliable enough to always act on.
