DEV Community

Samir Yuja
Samir Yuja

Posted on • Originally published at samiryuja.dev

Futbol Report — building a multi-model LLM comparison on AWS Lambda

A few months ago I set up a soccer-digest bot that sends me a Telegram message every few days with fixtures, results, transfer news, and manager changes. It started as a tmux session running Claude Code on a small Linux server, fired by a cron job. It worked. It also went down occasionally, and I had no good way to inspect what it was producing.

I wanted to do two things at once: make it more reliable, and turn it into something more interesting than "one bot sending one report." The result is Futbol Report — a scheduled job on AWS that sends the same prompt and search context to four different language models (Claude, Kimi, Qwen, Gemma) every three days, stores all four reports in Redis, and renders them side-by-side on this site with live voting.

This post is about how it works, what I learned from running it, and the deployment war stories — the bits that are usually edited out of "here's my side project" writeups.

What the comparison shows

Same input, four models, no editing. The page has a dropdown to switch between past runs and a vote button under each report. Vote tallies persist.

The point isn't to crown a "best" model. It's to make differences visible on a real, repeated task. Placed next to each other, the models diverge in ways that are easy to see — how faithfully they follow the requested format, what they choose to include or filter out, and how long their reports run.

How the pipeline works

EventBridge Scheduler  (every 3 days)
        ↓
   AWS Lambda  (Python 3.12)
        ├── Brave Search    (~13 queries: fixtures, results, transfers, manager changes)
        └── OpenRouter      (same context → 4 models)
        ↓
   Redis  (Vercel)
        ↓
   Next.js page  (server-rendered comparison + voting)
Enter fullscreen mode Exit fullscreen mode

Every three days EventBridge fires a Lambda function. The Lambda calls Brave Search with about a dozen queries, then sends the same compiled context to four models through OpenRouter. Each model's report goes into Redis under a timestamped key. The Next.js site, deployed on Vercel, reads from the same Redis and renders the comparison page.

A few decisions worth surfacing:

OpenRouter as the inference layer. One API instead of four, and adding or swapping a model is a one-line change.

Server-rendered comparison page. The data only changes every few days, so there's no point fetching it from the browser. The server reads Redis and sends back the finished page. Only the vote button runs in the browser.

Redis with a 90-day TTL on report keys. Redis fit the access pattern — small payloads (a few KB per report) and pure key lookups by timestamp, no queries. The TTL means old reports expire automatically; votes and the run index have no TTL, so voting history is never evicted even if memory fills.

What I learned from running it

1. Search results aren't deterministic. Running the same query thirty minutes apart returns different result sets — that's just how live ranking works. So context has to be held fixed within a run for the comparison to be fair (one Brave call feeds all four models).

2. A simple anti-hallucination clause worked across all four models. After the first run hallucinated fixtures, I added "use ONLY facts present in the provided search results" to the prompt. None of the four models invented data after that — same clause, same effect, across four different labs.

3. Models filter context differently. One run, Brave's results included an out-of-scope Indian Super League match. Three models filtered it out; one led its report with it. Same prompt, same data, different prioritization.

4. Output length varies wildly with model size. Claude and Kimi used most of the available context. Gemma — by far the cheapest model — collapsed the same input into one-line summaries. Cost and level of detail are correlated.

5. Format adherence varies too. Claude followed the prompt's structure most faithfully. Gemma dropped most of it. Qwen and Kimi were in between.

6. The pipeline survived the off-season without code changes. When Serie A ended, fixture queries returned nothing useful. I added two new search categories (transfers, manager changes) and one line to the prompt — "if fixtures are sparse, lead with transfer news." Reports stayed substantive: Allegri sacking at Milan, Guardiola leaving City, World Cup buildup.

The deployment war stories

The interesting part of moving the generator to Lambda was the several hours of debugging in the middle.

Python runtime mismatch. AWS defaulted the new Lambda to Python 3.14. My deployment package was built for 3.12. The error didn't say "wrong Python version" — it said No module named 'pydantic_core._pydantic_core', because the compiled C extension is a cpython-312 .so that won't load under 3.14. Fix: match the runtime to the build target.

Mac vs Linux binaries. Even after pinning the Python version, pydantic-core kept loading the macOS binary into my zip. I was using uv for packaging — uv pip install --only-binary is supposed to fetch a Linux wheel but reliably didn't here. Switching to vanilla python3 -m pip install --platform manylinux2014_x86_64 --only-binary=:all: finally pulled the right artifact. The newer tool I trusted was the problem; the older boring tool worked.

Optional imports for dual environments. python-dotenv is great locally — reads .env, populates os.environ. On Lambda, environment variables come from AWS directly, and python-dotenv is just dead weight that doesn't ship in the runtime. Wrap the import:

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # dotenv not needed on Lambda — env vars are set directly
Enter fullscreen mode Exit fullscreen mode

Same code works in both environments.

Default timeout. Lambda's default timeout is 3 seconds. My pipeline needs about 4 minutes — 13 sequential Brave calls plus four sequential model generations. Bump it to 900 seconds (Lambda's max).

Reaching for the CLI over the browser. I couldn't get the AWS console's "Upload .zip file" button to reliably refresh the deployed code — the SHA-256 hash kept matching the previous upload even after I selected a new file. It probably works fine most of the time; in my case aws lambda update-function-code from the CLI was faster and easier, and it's the path I use now.

Secrets are easy to leak when copy-pasting console output. I rotated three API keys during this project — twice because I pasted command output that included environment variables. Painful in proportion to how preventable it is. Wiring gitleaks into a GitHub Actions workflow is next on the list.

What's next

In priority order:

  • Telegram delivery from the Lambda — restore the original use case so the digest pings me when a new run finishes.
  • CI security scanninggitleaks for secrets and osv-scanner for dependency CVEs, both wired into a GitHub Actions workflow on each repo.
  • Parallelize the model calls — sequential right now (~4 minutes); concurrent would cut runtime by ~75%.
  • Full page content for grounding — Brave returns 1-2 sentence snippets, so thin context yields thin reports. Firecrawl or a readability extractor would fix this.
  • Server-side vote dedup — votes are deduped per-browser via localStorage; clearing storage or going incognito gets around it.

Links

Thanks

To Ryan — for letting me run the original bot on his machine over Tailscale, and for the steady stream of articles and ideas that shaped a lot of the thinking behind this project.

Top comments (0)