De' Clerke

Posted on Jun 13

I Built a World Cup Prediction Model That Retrains Itself Daily and Can't Cheat Its Own Results ⚽

#machinelearning #dataengineering #python #sportsdata

Most sports prediction models have perfect hindsight. Mine is committed to git before kickoff.

That's the constraint I kept coming back to when building CupCast 2026, a machine learning system that forecasts every remaining World Cup fixture, refits on fresh data every morning, and records its predictions in an append-only log that cannot be edited after the match starts. When the result comes in, the model grades its frozen prediction. No retroactive edits. No cherry-picked accuracy claims.

This article is about two engineering decisions that make it actually honest: daily automated retraining and prediction freezing.

The Pipeline in Plain English

Here's what runs at 09:00 UTC every day:

1. GitHub Actions spins up a fresh ubuntu-latest runner
2. Fetches the latest fixtures and results from football-data.org
3. Recomputes World Football Elo over 49,410 historical matches (every international match since 1872)
4. Refits XGBoost using hyperparameters committed in best_params.json
5. Runs 10,000 Monte Carlo simulations of the remaining bracket
6. Validates output against 7 JSON schemas; if anything's malformed, the build fails and the previous data stays live
7. Appends frozen predictions for any fixture kicking off within 72 hours
8. Commits everything and pushes; Vercel picks it up and auto-deploys the frontend

The whole run takes under 2 minutes on a free GitHub Actions runner. Zero infrastructure cost.

on:
  schedule:
    - cron: "0 9 * * *"   # 09:00 UTC, after North-American overnight kickoffs settle

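jobs:
  forecast:
    runs-on: ubuntu-latest
    permissions:
      contents: write   # the job pushes a commit back to the repo
    steps:
      - uses: actions/checkout@v4

      - uses: astral-sh/setup-uv@v5   # installs uv, in case the runner image lacks it

      - name: Run pipeline (refit on fresh data + 10k simulations)
        env:
          FOOTBALL_DATA_TOKEN: ${{ secrets.FOOTBALL_DATA_TOKEN }}
        run: uv run python run_pipeline.py --refit --sims 10000

      - name: Commit updated forecast
        run: |
          # git refuses to commit without an identity on a fresh runner
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add web/public/data pipeline/data/frozen pipeline/best_params.json
          if git diff --staged --quiet; then
            echo "No changes to commit."
          else
            git commit -m "Daily forecast update $(date -u +%Y-%m-%d)"
            git push
          fi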
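
The --sims 10000 flag in the workflow drives the Monte Carlo step. As a rough illustration of how champion odds fall out of repeated simulation, here is a toy sketch for a single-elimination bracket. It is not the CupCast implementation: the team names and ratings are hypothetical, the bracket size must be a power of two, and an Elo-only win probability stands in for the XGBoost match probabilities.

```python
import random
from collections import Counter


def win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that A beats B, using the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def simulate_knockout(teams, ratings, rng):
    """Play one bracket: adjacent teams meet, winners advance until one remains."""
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[::2], alive[1::2]):
            a_wins = rng.random() < win_prob(ratings[a], ratings[b])
            next_round.append(a if a_wins else b)
        alive = next_round
    return alive[0]


def champion_odds(teams, ratings, n_sims=10_000, seed=0):
    """Fraction of simulated tournaments won by each team."""
    rng = random.Random(seed)  # fixed seed keeps a run reproducible
    wins = Counter(simulate_knockout(teams, ratings, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in teams}


# Hypothetical four-team bracket: A plays B, C plays D, winners meet in the final.
teams = ["A", "B", "C", "D"]
ratings = {"A": 2000, "B": 1700, "C": 1700, "D": 1700}
print(champion_odds(teams, ratings))
```

Seeding the generator is a design choice worth keeping in a daily pipeline: two runs on identical data then produce identical odds, so any day-over-day delta comes from new results rather than sampling noise.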

tune-once, refit-daily: Why These Are Different Things

My first instinct was to run Optuna on every CI run. Bad idea.

Optuna with 75 trials across 3 CV folds takes 5-10 minutes. It burns free-tier minutes fast. Worse, running it daily introduces search noise: you're not finding better parameters, you're finding parameters that overfit to the most recent week of results.

The right separation:

Tuning (Optuna): run on-demand when there's a structural reason to, such as new features, significant data drift, or an algorithm change. Output is best_params.json, committed to the repo.
Refitting (daily CI): load those committed params, fit on all historical data up to today. Takes seconds. Knowledge updates; architecture stays locked.

Here's how the code makes that explicit:

# train.py
import json

import optuna


def train_all(art: dict, n_trials: int = 75) -> dict:
    """Full tune: run Optuna, write the best params to best_params.json, fit production model."""
    dev = art["train"]  # data the tuner sees (assumed: the same split refit() uses)
    study = optuna.create_study(direction="minimize",
                                sampler=optuna.samplers.TPESampler(seed=C.SEED))
    study.optimize(lambda t: _objective(t, dev), n_trials=n_trials)
    best = study.best_params
    BEST_PARAMS_PATH.write_text(json.dumps(best, indent=2))  # committed for daily refit
    return _fit_production(art["train"], best)


def refit(art: dict) -> dict:
    """Daily CI path: load committed params, refit on fresh data. No Optuna."""
    best = json.loads(BEST_PARAMS_PATH.read_text())
    return _fit_production(art["train"], best)

best_params.json is versioned in git. When I add a feature I re-run train_all locally (Optuna included) and commit the updated params. The daily CI only ever calls refit. Clean separation between architecture decisions and knowledge updates.

32 Features, Zero Leakage

The classifier is XGBoost with objective="multi:softprob" for W/D/L as three classes. For scorelines, I pair it with two XGBoost count:poisson regressors (home goals and away goals separately), then combine the predicted goal rates into a scoreline probability matrix via the Poisson distribution.
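
To make the scoreline step concrete, here is a minimal sketch of turning two predicted goal rates into a scoreline matrix. It assumes home and away goals are independent Poisson variables and truncates at 8 goals per side, then renormalises; both choices are assumptions for illustration, and the function names are hypothetical rather than taken from the repo.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def scoreline_matrix(lam_home: float, lam_away: float, max_goals: int = 8):
    """matrix[i][j] = P(home scores i, away scores j), assuming independence."""
    home = [poisson_pmf(i, lam_home) for i in range(max_goals + 1)]
    away = [poisson_pmf(j, lam_away) for j in range(max_goals + 1)]
    matrix = [[h * a for a in away] for h in home]
    total = sum(map(sum, matrix))  # slightly below 1 because the tail is truncated
    return [[p / total for p in row] for row in matrix]


def outcome_probs(matrix):
    """Collapse the matrix into (home win, draw, away win) probabilities."""
    n = len(matrix)
    p_home = sum(matrix[i][j] for i in range(n) for j in range(n) if i > j)
    p_draw = sum(matrix[i][i] for i in range(n))
    p_away = sum(matrix[i][j] for i in range(n) for j in range(n) if i < j)
    return p_home, p_draw, p_away


# Hypothetical goal rates, as the two count:poisson regressors might output.
matrix = scoreline_matrix(1.6, 0.9)
print(outcome_probs(matrix))
```

The W/D/L probabilities implied by the matrix will not exactly match the classifier's, since the two are fitted separately; which one a given output uses is a modelling choice.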

32 features:

FEATURE_COLUMNS = [
    "elo_home", "elo_away", "elo_diff",
    "neutral", "home_is_host",
    "form5_win_h", "form5_draw_h", "form5_gf_h", "form5_ga_h",
    "form5_win_a", "form5_draw_a", "form5_gf_a", "form5_ga_a",
    "form10_win_h", "form10_draw_h", "form10_gf_h", "form10_ga_h",
    "form10_win_a", "form10_draw_a", "form10_gf_a", "form10_ga_a",
    "form10_oppelo_h", "form10_oppelo_a",
    "rest_h", "rest_a",
    "h2h_home_winrate", "h2h_mean_gd",
    "importance",
    "elo_trend_h", "elo_trend_a",
    "alt_gap_home", "alt_gap_away",
]

The last two (alt_gap_home and alt_gap_away) are the altitude feature. Each team has a baseline elevation derived from the stadiums where they typically play. These features capture how much each team ascends relative to that baseline to reach the match venue. Of WC 2026's 16 host cities, only Mexico City (2,240 m) and Guadalajara (1,566 m) are materially elevated. For fixtures at those two venues the delta is real signal. For every other match it's effectively zero.
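
One way the altitude feature could be computed is sketched below. The two venue elevations come from the paragraph above; everything else is an assumption for illustration: the baseline is taken as the mean elevation of a team's usual stadiums, descents are clamped to zero because the feature is described as how much a team ascends, and venues missing from the lookup default to 0 m as a simplification.

```python
from statistics import mean

# Elevations in metres for the two materially elevated host cities.
VENUE_ELEVATION_M = {"Mexico City": 2240.0, "Guadalajara": 1566.0}


def baseline_elevation(home_stadium_elevations_m):
    """Assumed baseline: mean elevation of the stadiums a team usually plays in."""
    return mean(home_stadium_elevations_m)


def altitude_gap(venue_elevation_m: float, baseline_m: float) -> float:
    """Metres climbed above the team's baseline; descending counts as zero."""
    return max(0.0, venue_elevation_m - baseline_m)


def altitude_features(venue: str, home_baseline_m: float, away_baseline_m: float) -> dict:
    """Build the two altitude columns for one fixture."""
    venue_m = VENUE_ELEVATION_M.get(venue, 0.0)  # simplification: unlisted venues = 0 m
    return {
        "alt_gap_home": altitude_gap(venue_m, home_baseline_m),
        "alt_gap_away": altitude_gap(venue_m, away_baseline_m),
    }


# Hypothetical baselines: a side based at altitude hosting a sea-level visitor.
print(altitude_features("Mexico City", 2240.0, 40.0))
```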

The leakage problem in sports ML is subtle. A random train/test split means your training set will contain matches that happened after some test matches. The model's form features will implicitly encode future trajectory. This is fine in most ML tasks where samples are i.i.d. In temporal sports data it makes your backtests look better than they are.

The fix is walk-forward expanding-window CV:

CV_FOLDS = [  # (train_end, val_start, val_end)
    ("2017-12-31", "2018-01-01", "2019-12-31"),
    ("2019-12-31", "2020-01-01", "2021-12-31"),
    ("2021-12-31", "2022-01-01", "2023-12-31"),
]

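def _objective(trial, train: pd.DataFrame) -> float:
    # Search space shown for illustration only; the ranges are assumptions.
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
    }
    losses = []
    for tr_end, va_start, va_end in CV_FOLDS:
        tr = train[train["date"] <= tr_end]
        va = train[(train["date"] >= va_start) & (train["date"] <= va_end)]
        # "outcome" (0/1/2) and "weight" are assumed column names
        Xtr, ytr, w = tr[FEATURE_COLUMNS], tr["outcome"], tr["weight"]
        Xva, yva = va[FEATURE_COLUMNS], va["outcome"]
        model = XGBClassifier(objective="multi:softprob", num_class=3, **params)
        model.fit(Xtr, ytr, sample_weight=w)
        losses.append(log_loss(yva, model.predict_proba(Xva), labels=[0, 1, 2]))
    return float(np.mean(losses))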

Each fold trains on everything before tr_end and validates on the following two years. Validation always starts after training ends. I also hold out 2024-01-01 through 2026-06-10 as an untouched test set; the tuner never sees it.
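
Walk-forward folds only help if the features themselves are built without peeking. As a hedged illustration, here is one way a form feature such as form5_win_h could be computed so that each row sees only earlier matches. The definition (win rate over the previous five matches, 0.0 with no history) and the input field names are assumptions, not the repo's code.

```python
from collections import defaultdict, deque


def _win_rate(results) -> float:
    """Mean of a team's recent win flags; 0.0 when there is no history yet."""
    return sum(results) / len(results) if results else 0.0


def add_form_features(matches):
    """Attach form5_win_h / form5_win_a using only matches played earlier.

    matches: dicts with date (ISO string), home, away, home_goals, away_goals.
    """
    history = defaultdict(lambda: deque(maxlen=5))  # last five win flags per team
    rows = []
    for m in sorted(matches, key=lambda m: m["date"]):
        home, away = m["home"], m["away"]
        # Read the features first, so the current result cannot leak into them.
        rows.append({
            **m,
            "form5_win_h": _win_rate(history[home]),
            "form5_win_a": _win_rate(history[away]),
        })
        # Only after the row is built does the result enter the history.
        history[home].append(1.0 if m["home_goals"] > m["away_goals"] else 0.0)
        history[away].append(1.0 if m["away_goals"] > m["home_goals"] else 0.0)
    return rows
```

The ordering inside the loop is the whole point: swapping the two halves would let each match's own result feed its features, which is exactly the kind of leak that inflates backtests.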

Test-set results:

Log loss: 0.8583
Favourite accuracy: 60.0%
Brier score: 0.5043

Baselines on the same holdout, by log loss: an Elo-only logistic regression scores 0.9201 and historical base rates score 0.9864. The model beats both.
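
For readers who want to reproduce these metrics, here is a small stdlib sketch of both. It assumes the Brier score is the multiclass form that sums squared errors over the three outcomes (range 0 to 2), which is consistent with a reported value of 0.5043 but is not confirmed by the post; the function names are hypothetical.

```python
import math


def log_loss_multiclass(probs, labels, eps: float = 1e-15) -> float:
    """Mean negative log of the probability given to the outcome that happened."""
    return sum(-math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def brier_multiclass(probs, labels) -> float:
    """Mean over matches of the squared error summed across all classes."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


# A know-nothing forecast of 1/3 each gives a floor to compare against.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4
outcomes = [0, 1, 2, 0]  # 0 = home win, 1 = draw, 2 = away win (assumed encoding)
print(log_loss_multiclass(uniform, outcomes), brier_multiclass(uniform, outcomes))
```

Under these definitions the uniform forecast scores ln 3 ≈ 1.0986 on log loss and 2/3 ≈ 0.667 on Brier, which gives the numbers above a reference point.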

Prediction Freezing: The Part Most Sports Models Skip

Here's the problem with sports model accuracy claims: they're almost always backtested. The model already saw the outcome distribution when you built it. A "62% accuracy" headline is meaningless unless you can show it was generated before kickoff against predictions the model couldn't retroactively update.

My approach: write predictions to an append-only CSV before the match starts, then score them as results arrive. This is the same principle as event sourcing or an audit log: the log is immutable, and the current state (accuracy metrics) is derived from it.

The freezing function runs as part of every daily pipeline. It checks every upcoming fixture, sees if a prediction has already been logged for that match, and if not, appends one:

import csv
import os
from datetime import datetime, timedelta, timezone


def freeze_due(fixtures, predictions, model_version, now=None):
    now = now or datetime.now(timezone.utc)
    horizon = now + timedelta(hours=72)
    existing = {int(r["match_id"]) for r in _read_csv(LOG_PATH)}

    new_rows = []
    for _, m in fixtures.iterrows():
        mid = int(m["match_id"])
        if mid in existing or mid not in predictions:
            continue   # already frozen, don't overwrite
        kickoff = datetime.fromisoformat(m["utc_date"].replace("Z", "+00:00"))
        if not (now <= kickoff <= horizon):
            continue   # already kicked off, or more than 72 hours away
        p = predictions[mid]
        new_rows.append({
            "frozen_at_utc": now.isoformat(),
            "match_id": mid,
            "p_home": round(p["p_home"], 4),
            "p_draw": round(p["p_draw"], 4),
            "p_away": round(p["p_away"], 4),
            "model_version": model_version,
        })

    # append-only: open in 'a' mode, never truncate existing rows
    if new_rows:
        # write the header only when the log is new or empty
        needs_header = not os.path.exists(LOG_PATH) or os.path.getsize(LOG_PATH) == 0
        with open(LOG_PATH, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
            if needs_header:
                writer.writeheader()
            writer.writerows(new_rows)

Once a row exists for a match_id, the lookup against existing skips that match on every later daily run. The prediction is locked. When a result comes in, a separate score_resolved function reads the frozen log, computes Brier score and log loss for that row, and appends to scores.csv. Those scores accumulate live on the Model page.
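
Opening the file in append mode stops the pipeline from truncating the log, but it does not by itself detect a manual edit. One possible guard, sketched here as a hypothetical addition rather than something taken from the repo, compares the log text from the previous commit with the current text and fails unless the old lines survive unchanged as a prefix.

```python
def is_append_only(previous: str, current: str) -> bool:
    """True if every old line is still present, unchanged and in the same order."""
    prev_lines = previous.splitlines()
    cur_lines = current.splitlines()
    return cur_lines[:len(prev_lines)] == prev_lines


def appended_rows(previous: str, current: str) -> list:
    """Return only the newly added lines, or raise if history was rewritten."""
    if not is_append_only(previous, current):
        raise ValueError("frozen log was modified, not just appended to")
    return current.splitlines()[len(previous.splitlines()):]


# Hypothetical log contents before and after a daily run.
before = "frozen_at_utc,match_id,p_home\n2026-06-12T09:00:00+00:00,1,0.52\n"
after = before + "2026-06-13T09:00:00+00:00,2,0.41\n"
print(appended_rows(before, after))
```

Because every version of the log lives in git history, a check like this could run in CI against the previous commit's copy of the file, turning "append-only" from a convention into something the build enforces.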

7 JSON Contracts (Making It a Product, Not a Script)

A model that only lives in Python is a script. A product needs a data contract.

I publish 7 validated JSON files on every daily run:

File	Contents
`meta.json`	Run timestamp, model version, simulation count
`matches.json`	All 104 fixtures with W/D/L probabilities
`champion_odds.json`	Per-team tournament win probability + daily delta
`groups.json`	Group standings with qualification probabilities
`bracket.json`	Full knockout bracket with per-match probabilities
`accuracy.json`	Frozen prediction scores as they accumulate
`match_detail/{id}.json`	SHAP values, Elo trend, top scorelines, form, H2H

Each is validated against a JSON schema before the commit step. If the check fails (a required field is null, probabilities don't sum to 1, an ID is missing), the pipeline errors out and the previous good data stays live. The frontend never receives a partial update.
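
JSON Schema can express required fields and types, but a cross-field rule such as "three probabilities sum to 1" generally needs a plain code check alongside it. Here is a stdlib-only sketch of that kind of check. The field names are assumed to mirror the frozen log (match_id, p_home, p_draw, p_away), and the tolerance is an assumption chosen to absorb rounding to four decimals.

```python
import math

REQUIRED_FIELDS = ("match_id", "p_home", "p_draw", "p_away")
PROB_FIELDS = ("p_home", "p_draw", "p_away")


def validate_match(entry: dict, tol: float = 1e-3) -> list:
    """Return a list of problems for one match entry; empty means valid."""
    missing = [k for k in REQUIRED_FIELDS if entry.get(k) is None]
    if missing:
        return [f"missing or null field: {k}" for k in missing]
    probs = [entry[k] for k in PROB_FIELDS]
    if any(isinstance(p, bool) or not isinstance(p, (int, float)) for p in probs):
        return ["probabilities must be numbers"]
    errors = []
    if any(not 0.0 <= p <= 1.0 for p in probs):
        errors.append("probability outside [0, 1]")
    if not math.isclose(sum(probs), 1.0, abs_tol=tol):
        errors.append(f"probabilities sum to {sum(probs):.4f}, expected 1")
    return errors


def validate_all(entries) -> None:
    """Raise on the first sign of bad data so nothing partial gets committed."""
    problems = {i: errs for i, e in enumerate(entries) if (errs := validate_match(e))}
    if problems:
        raise ValueError(f"validation failed: {problems}")
```

Raising before the commit step is what keeps the previous good data live: a failed run simply never reaches git push.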

The React frontend (Vite + Tailwind v4 + Recharts + Framer Motion) is a pure read-only consumer of these contracts. No backend, no runtime server. The data contract is the interface, and the interface is versioned in git.

One Vercel SPA Gotcha Worth Noting

For direct URL access to match pages (/match/537330), I had this in vercel.json:

{
  "cleanUrls": true,
  "rewrites": [{ "source": "/:path*", "destination": "/index.html" }]
}

Direct navigation returned 404. The issue: cleanUrls: true interferes with how Vercel resolves the catch-all rewrite; it strips extensions first, then can't match a file. Fix: remove cleanUrls entirely. The rewrite handles all routing. Small thing, 20 minutes of my life I won't get back.

What I'd Do Differently

Add betting market odds as a feature. Markets are the single most efficient signal in football: they aggregate injury news, team selection, and weather that the model doesn't have. I avoided them to keep the pipeline free to run, but for a production system they'd be the first addition.

Red card and injury data. The model has no idea a team's starting goalkeeper is injured. Elo absorbs this over time, but pre-match it's a real blind spot in individual fixture predictions.

Form-weighted Elo. Self-computed Elo is reliable for ranking relative team strength across years. It's less reliable at capturing rapid momentum shifts. A team on an 8-game winning streak against weak opposition and a team grinding out draws against strong opposition can end up with nearly identical Elo trajectories. A recency-weighted variant would be worth testing.
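
For context on where a recency weight would plug in, here is a sketch of a single rating update in the style of the published World Football Elo formula: an expected score from the rating gap (plus 100 points of home advantage), a goal-difference multiplier, and a K-factor that depends on match importance. The constants follow that public formula as commonly documented; CupCast's self-computed Elo may differ, so treat this as an approximation.

```python
def expected_score(rating_diff: float) -> float:
    """Expected result (0 to 1) for the side that is rating_diff points ahead."""
    return 1.0 / (10 ** (-rating_diff / 400.0) + 1.0)


def goal_multiplier(goal_diff: int) -> float:
    """Bigger wins move ratings more: 1, 1.5, then (11 + n) / 8 from three goals up."""
    n = abs(goal_diff)
    if n <= 1:
        return 1.0
    if n == 2:
        return 1.5
    return (11 + n) / 8.0


def update_elo(home: float, away: float, home_goals: int, away_goals: int,
               k: float = 20.0, home_advantage: float = 100.0, neutral: bool = False):
    """Return the (home, away) ratings after one match. k=20 is the friendly weight."""
    diff = home - away + (0.0 if neutral else home_advantage)
    if home_goals > away_goals:
        actual = 1.0
    elif home_goals == away_goals:
        actual = 0.5
    else:
        actual = 0.0
    delta = k * goal_multiplier(home_goals - away_goals) * (actual - expected_score(diff))
    return home + delta, away - delta  # zero-sum: what one side gains, the other loses


# Two equally rated sides on neutral ground, with a World Cup weight of k=60.
print(update_elo(1500.0, 1500.0, 1, 0, k=60.0, neutral=True))
```

A recency-weighted variant could, for example, scale k by how recent a match is; that is a hypothetical direction, not something the current model does.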

The Result

CupCast 2026 is live at world-cup-2026-forecast.vercel.app. The model updates daily as the tournament progresses. Current champion odds: Spain 26.3%, Argentina 18.9%, France 10.4%.

The code is open: github.com/declerke/World-Cup-2026-Forecast.

If you've built a similar system, or have opinions on the altitude feature or on why betting markets are hard to beat, drop a comment below. Follow me on dev.to for more data engineering from production.
