Willian Pinho

I made Claude, GPT and Gemini predict the entire 2026 World Cup. Here's the experiment design.

The 2026 World Cup kicks off today: 48 teams, 104 matches, five weeks. I'm using it as a benchmark.

Three frontier models (Claude Opus 4.8, GPT-5.2 and Gemini 3.1 Pro) predicted every group match with scorelines and win/draw/loss probabilities, then a complete knockout bracket down to the champion and Golden Boot. Every prediction was locked before kickoff and committed to a public repo. As real results arrive, a live site scores them automatically.

Building it surfaced some genuinely weird model behavior. GPT-5.2, for instance, kept inventing an impossible football rule until the prompt explicitly forbade it. More on that below.

But the core of the project is the experiment design: the picks matter less than how the question was asked.

The confound nobody controls for

Ask an LLM with web access to predict a match and you have no idea what you measured. You can't tell whether it reasoned from internal knowledge, scraped a reliable source, or hallucinated a plausible-sounding stat. Two models citing different injury reports aren't comparable. A friend reviewing the project put it bluntly: without a standardized source, the models can just make that information up.

So each model runs under three conditions:

| Arm | Setup | What it isolates |
| --- | --- | --- |
| web | Chat/CLI with live web access | Model + free-form sourcing (uncontrolled) |
| baseline | API, no tools, no extra context | Pure parametric knowledge |
| enriched | API, no tools, + identical data snapshot | Reasoning over controlled inputs |

The enriched snapshot is the same for all three models: the official FIFA ranking (April 2026 release, pulled from FIFA's own API) and World Football Elo Ratings for all 48 teams, versioned in the repo with sources and retrieval dates. No model gets an information advantage.

If enriched beats baseline, the value was in the data. If baseline holds its own, the knowledge was already in the weights. The web arm tells us whether free browsing helps or just adds noise.
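One way to picture the three arms is as a small lookup table the runner consults when it assembles a request. This is only a sketch: the `Arm` type, the `ARMS` record, the `TeamRating` shape and the `buildPrompt` helper are hypothetical names for illustration, not the repo's actual code.

```typescript
// Hypothetical sketch of the three experimental arms; all names are
// illustrative assumptions, not taken from the project's runner.
type Arm = "web" | "baseline" | "enriched";

interface ArmConfig {
  toolsAllowed: boolean; // may the model browse or call tools?
  includeSnapshot: boolean; // is the shared ranking/Elo snapshot appended?
}

const ARMS: Record<Arm, ArmConfig> = {
  web: { toolsAllowed: true, includeSnapshot: false },
  baseline: { toolsAllowed: false, includeSnapshot: false },
  enriched: { toolsAllowed: false, includeSnapshot: true },
};

// Assumed shape of one row in the enriched snapshot.
interface TeamRating {
  team: string;
  fifaRank: number;
  elo: number;
}

// Builds the prompt text for one arm. Only the enriched arm sees the snapshot,
// and every model receives the same rows in the same order.
function buildPrompt(arm: Arm, task: string, snapshot: TeamRating[]): string {
  if (!ARMS[arm].includeSnapshot) return task;
  const rows = snapshot
    .map((r) => `${r.team}: FIFA rank ${r.fifaRank}, Elo ${r.elo}`)
    .join("\n");
  return `${task}\n\nReference data (identical for all models):\n${rows}`;
}
```

Keeping the arm definition in one place means the only difference between baseline and enriched is a single flag, which is the property the comparison depends on.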

Keeping the no-tools arms honest

"No tools" is an instruction until you verify it. The API arms run through a LiteLLM gateway with no tool definitions. Gemini's runs go through its CLI, which reports per-call tool stats, and the runner rejects any response where the tool-call counter isn't zero (same pattern for Claude's bracket runs, with all tools disallowed). It's verified per request, not assumed:

const { content, totalTokens, toolCalls } = await send(prompt);
if (opts.requireNoTools && (toolCalls ?? 0) > 0) {
  // Arm violation, not a model error: the no-tools condition must hold. Retry raw.
  if (attempt === MAX_RETRIES) {
    fail(
      `Group ${gf.group}: transport used ${toolCalls} tool call(s) — no-tools arm violated`,
    );
  }
  continue;
}

Outputs are strict JSON, validated with Zod against the official fixture list — every response must contain exactly the six expected team pairs for its group:

const PredictionItem = z.object({
  teamA: z.string().min(1),
  teamB: z.string().min(1),
  scoreA: z.number().int().min(0).max(30),
  scoreB: z.number().int().min(0).max(30),
  probWinA: z.number().min(0).max(100).optional(),
  probDraw: z.number().min(0).max(100).optional(),
  probWinB: z.number().min(0).max(100).optional(),
});
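The schema alone only checks the shape of each item. The "exactly the six expected team pairs" rule needs a separate comparison against the fixture list, which could look roughly like the sketch below. The `Pair` type and `checkFixturePairs` helper are hypothetical, and treating a swapped pair as a mismatch is an assumption, made so that `scoreA`/`scoreB` stay aligned with the fixture order.

```typescript
// Hypothetical helper: checks that a group's predictions cover exactly the
// expected fixtures. Names and behavior are illustrative assumptions.
interface Pair {
  teamA: string;
  teamB: string;
}

// Order-sensitive key: "X vs Y" and "Y vs X" are treated as different,
// so scoreA/scoreB always refer to the same side as in the fixture list.
const pairKey = (p: Pair): string => `${p.teamA} vs ${p.teamB}`;

function checkFixturePairs(expected: Pair[], predicted: Pair[]): string[] {
  const errors: string[] = [];
  const expectedKeys = new Set(expected.map(pairKey));
  const seen = new Set<string>();

  for (const p of predicted) {
    const key = pairKey(p);
    if (!expectedKeys.has(key)) errors.push(`unexpected fixture: ${key}`);
    else if (seen.has(key)) errors.push(`duplicate fixture: ${key}`);
    seen.add(key);
  }
  for (const e of expected) {
    const key = pairKey(e);
    if (!seen.has(key)) errors.push(`missing fixture: ${key}`);
  }
  return errors; // an empty array means the response matches the fixture list
}
```

Returning human-readable error strings rather than a boolean makes it straightforward to feed the failures back to the model on the next attempt.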

Invalid responses get the validation errors fed back, up to three attempts.

Now, the weird behavior I promised. GPT-5.2 consistently labeled knockout ties that were level after extra time as "decided in extra time", which is impossible under the rules, and validation feedback alone didn't fix it. It only stopped when the prompt spelled out: a level score after 120 minutes means penalties. Claude had its own quirk: the schema documented the winner field as "winner": "<teamA|teamB>", and it returned the literal string "teamA". Prompt precision beats prompt length.
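Both quirks are also mechanically detectable. The sketch below shows one possible check, assuming a hypothetical `KnockoutPick` shape with a `decidedIn` field and a scoreline that excludes the shootout; none of these names come from the repo.

```typescript
// Hypothetical shape for one knockout prediction; field names are assumptions.
type DecidedIn = "regular" | "extra_time" | "penalties";

interface KnockoutPick {
  teamA: string;
  teamB: string;
  scoreA: number; // goals after the last period played, shootout excluded
  scoreB: number;
  decidedIn: DecidedIn;
  winner: string;
}

function checkKnockoutPick(p: KnockoutPick): string[] {
  const errors: string[] = [];
  const level = p.scoreA === p.scoreB;

  // A level score can only be settled by a shootout.
  if (level && p.decidedIn !== "penalties") {
    errors.push(
      `level score cannot be decided in ${p.decidedIn}; level after 120 minutes means penalties`,
    );
  }
  // A shootout only happens when the score is level.
  if (!level && p.decidedIn === "penalties") {
    errors.push("penalties require a level score after 120 minutes");
  }
  // The winner must be a real team name, not the placeholder "teamA"/"teamB".
  if (p.winner !== p.teamA && p.winner !== p.teamB) {
    errors.push(`winner "${p.winner}" is not ${p.teamA} or ${p.teamB}`);
  }
  // Without a shootout, the winner must be the side with more goals.
  if (!level) {
    const leader = p.scoreA > p.scoreB ? p.teamA : p.teamB;
    if (p.winner !== leader) errors.push(`winner should be ${leader}`);
  }
  return errors;
}
```

A check like this flags the problem reliably, but as the GPT-5.2 case shows, flagging it is not the same as getting the model to stop; that still took an explicit rule in the prompt.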

What they predicted

The headline disagreement:

  • Claude picks Spain as champion in all three arms (the only model that's consistent with itself).
  • Gemini says Brazil (web and baseline) but switches to France when given the standardized data.
  • GPT-5.2 says Brazil on the web arm, France on both API arms.

All three models picked Mbappé for the Golden Boot: he is the choice in 7 of the 9 brackets (three models, three arms each).

That inconsistency across arms is itself a result: the same model gives a different answer depending on what information it was handed. The tournament will tell us which configuration was actually calibrated.

Scoring

Group stage, per match: 5 points for the exact score, 3 for the correct result plus one exact side, 2 for the result only. On top of that, a multiclass Brier score over the win/draw/loss probabilities, which heavily penalizes overconfident wrong predictions. That matters more to me than raw hit rate.
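As a rough illustration of the two group-stage metrics, here is a minimal sketch. The function names are hypothetical, and it assumes the probabilities arrive as percentages that sum to 100 and that the Brier score is the unnormalized sum over the three outcomes (range 0 to 2); the repo's scoring code may normalize differently.

```typescript
type Outcome = "A" | "draw" | "B";

// Result implied by a scoreline, from team A's point of view.
const outcomeOf = (a: number, b: number): Outcome =>
  a > b ? "A" : a < b ? "B" : "draw";

// 5 = exact score, 3 = correct result plus one side's goals exact,
// 2 = correct result only, 0 = wrong result.
function matchPoints(pred: [number, number], real: [number, number]): number {
  const [pa, pb] = pred;
  const [ra, rb] = real;
  if (pa === ra && pb === rb) return 5;
  if (outcomeOf(pa, pb) !== outcomeOf(ra, rb)) return 0;
  return pa === ra || pb === rb ? 3 : 2;
}

// Multiclass Brier score: sum of squared gaps between each forecast
// probability and the one-hot actual outcome. Lower is better.
function brierScore(
  probWinA: number,
  probDraw: number,
  probWinB: number,
  real: Outcome,
): number {
  const dA = probWinA / 100 - (real === "A" ? 1 : 0);
  const dD = probDraw / 100 - (real === "draw" ? 1 : 0);
  const dB = probWinB / 100 - (real === "B" ? 1 : 0);
  return dA * dA + dD * dD + dB * dB;
}
```

Under these assumptions, a 100% call on the wrong side scores the worst possible 2, while a 50/30/20 forecast that gets the favorite right scores about 0.38. That gap is the overconfidence penalty.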

Brackets are scored pool-style: 1/2/4/8/16 points for each team correctly placed in the real round of 32, round of 16, quarter-finals, semi-finals and final, plus 32 for the champion. Each of the six tiers is worth 32 points in total, so the maximum is 192.
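The bracket arithmetic can be sketched as follows. The `Bracket` shape and the helper name are hypothetical, and "correctly placed" is interpreted here as "predicted to reach a round the team really reached", which may be looser than the slot-level check the repo performs.

```typescript
// Points per correctly placed team, by round (values from the rules above).
const ROUND_POINTS = { r32: 1, r16: 2, qf: 4, sf: 8, final: 16 } as const;
type Round = keyof typeof ROUND_POINTS;
const CHAMPION_POINTS = 32;

// Assumed shape: the list of teams reaching each round, plus the champion.
interface Bracket {
  rounds: Record<Round, string[]>;
  champion: string;
}

function bracketScore(pred: Bracket, real: Bracket): number {
  let total = 0;
  for (const round of Object.keys(ROUND_POINTS) as Round[]) {
    const actual = new Set(real.rounds[round]);
    // Deduplicate so a team listed twice in a prediction is not double counted.
    const hits = new Set(pred.rounds[round].filter((t) => actual.has(t)));
    total += hits.size * ROUND_POINTS[round];
  }
  if (pred.champion === real.champion) total += CHAMPION_POINTS;
  return total;
}
```

A perfect bracket earns 32×1 + 16×2 + 8×4 + 4×8 + 2×16 + 32 = 192, matching the stated maximum.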

Honest limitations

  • The API arms receive the fixture list (the web arm had to recall the official draw), so they test judgement on outcomes, not memory of the schedule.
  • Group tiebreakers are simplified; fair-play points can't be reproduced from scorelines.
  • Player-level predictions (scorers, cards) are out of scope: adding them after kickoff would break the locked-before-the-tournament guarantee.
  • One tournament is one sample. This measures calibration on a single event, not "which model is smarter."

Stack, briefly

Next.js 16 (App Router) + Prisma 7 + SQLite. Results sync from openfootball every 30 minutes, and 56 unit tests cover the pure scoring/standings/validation logic. The real bracket renders with official placeholder slots ("Group A runner-up", "Winner of match 73") and fills itself in as the tournament progresses.

Everything — prompts, raw model JSON, dataset, runner scripts, scoring code — is in the repo. If you think the methodology is flawed, the receipts are right there to prove it.

Live leaderboard: https://worldcup2026.willianpinho.com
Repo: https://github.com/willianpinho/worldcup-predictor-2026

I'll publish the group-stage verdict on 28 June and a full post-mortem after the final on 19 July. Place your (intellectual) bets now: does the model that knows the most football win, or the one that's best calibrated about what it doesn't know?


Independent project — not affiliated with Anthropic, Google or OpenAI. Educational experiment, not betting advice.
