DEV Community

Furiosa Studio
Furiosa Studio

Posted on • Originally published at top-scorers.com

How we model the Golden Boot race across 30+ leagues

"Top scorer of a league" sounds like one line of SQL: SELECT player, COUNT(*) FROM goals GROUP BY player ORDER BY 2 DESC. Build a live product around that query and you find it's wrong in at least four ways. Here's what actually goes into a multi-league golden-boot table that stays correct.

1. The naive query, and why it lies

Three things break the one-liner immediately:

  • Mid-season transfers. A striker who scores 9 for one club and 6 for another is one player with 15, not two rows. Key your goals on (player_id, club_id) and he splits in two and drops off the chart.
  • Penalties and own goals. Some boards count penalties, some publish a separate non-penalty tally; own goals must never be credited to the scorer. "A goal" is not one thing.
  • What "the race" even means. Goals only? Goals + assists? Per-90? Each is a different leaderboard, and people want to switch between them.

2. The data model

Scope every event to a (league, season) and resolve players to one canonical entity across clubs:

players(id, canonical_name, ...)
appearances(player_id, club_id, league_season_id, minutes)
goal_events(id, player_id, league_season_id, minute, type)  -- open_play | penalty | own_goal
Enter fullscreen mode Exit fullscreen mode

Aggregate over player_id scoped to league_season_id (not club), so transfers within a league sum correctly, and the type column lets you include or exclude penalties on the fly.

The hard part isn't the schema — it's entity resolution. 30+ feeds spell names differently ("J. Bellingham", "Jude Bellingham", "Bellingham, J."), reuse IDs, and disagree on accents. Normalize the name, match on (normalized_name, club, birth_year), and keep a manual override table for the genuinely ambiguous cases. Reconciling messy third-party feeds is the unglamorous 80% of any sports-data product.

3. Goals vs expected goals (xG)

A raw count says what happened; xG says how sustainable it is. A striker on 12 goals from 6.0 xG is overperforming and likely to regress; one on 8 from 11.0 xG is unlucky and probably about to score. Store xG per shot, sum per player-season, and show the delta next to the raw count — it's the most useful column for anyone predicting the rest of the race.

4. Keeping it live

Goals arrive continuously and stats get retroactively corrected (a goal reassigned, a penalty reclassified days later). So ingestion is idempotent: every event has a stable source key, re-ingesting updates in place, and a nightly reconciliation pass catches corrections. Without idempotent upserts you double-count; without reconciliation you drift from the official record.

5. Serving it

A leaderboard is read-heavy and identical for everyone, so it's cached hard and exposed through a small public REST API (/players, /standings, compare endpoints). Caching is easy; invalidating the moment a goal lands without hammering the origin is the interesting part — key the cache on (league_season, stat, updated_at) and let new events bump the tag.

Takeaway

"Aggregate a leaderboard from several unreliable sources" is a pattern far beyond football — pricing, reviews, any vendor feed. The shape is always the same: resolve entities, define the metric precisely, make ingestion idempotent, cache the read path.

This is what powers top-scorers.com — live top scorers across 30+ leagues, goals vs xG, and a free public API. Built by Furiosa Studio.

Top comments (0)