"Top scorer of a league" sounds like one line of SQL: SELECT player, COUNT(*) FROM goals GROUP BY player ORDER BY 2 DESC. Build a live product around that query and you find it's wrong in at least four ways. Here's what actually goes into a multi-league golden-boot table that stays correct.
1. The naive query, and why it lies
Three things break the one-liner immediately:
-
Mid-season transfers. A striker who scores 9 for one club and 6 for another is one player with 15, not two rows. Key your goals on
(player_id, club_id)and he splits in two and drops off the chart. - Penalties and own goals. Some boards count penalties, some publish a separate non-penalty tally; own goals must never be credited to the scorer. "A goal" is not one thing.
- What "the race" even means. Goals only? Goals + assists? Per-90? Each is a different leaderboard, and people want to switch between them.
2. The data model
Scope every event to a (league, season) and resolve players to one canonical entity across clubs:
players(id, canonical_name, ...)
appearances(player_id, club_id, league_season_id, minutes)
goal_events(id, player_id, league_season_id, minute, type) -- open_play | penalty | own_goal
Aggregate over player_id scoped to league_season_id (not club), so transfers within a league sum correctly, and the type column lets you include or exclude penalties on the fly.
The hard part isn't the schema — it's entity resolution. 30+ feeds spell names differently ("J. Bellingham", "Jude Bellingham", "Bellingham, J."), reuse IDs, and disagree on accents. Normalize the name, match on (normalized_name, club, birth_year), and keep a manual override table for the genuinely ambiguous cases. Reconciling messy third-party feeds is the unglamorous 80% of any sports-data product.
3. Goals vs expected goals (xG)
A raw count says what happened; xG says how sustainable it is. A striker on 12 goals from 6.0 xG is overperforming and likely to regress; one on 8 from 11.0 xG is unlucky and probably about to score. Store xG per shot, sum per player-season, and show the delta next to the raw count — it's the most useful column for anyone predicting the rest of the race.
4. Keeping it live
Goals arrive continuously and stats get retroactively corrected (a goal reassigned, a penalty reclassified days later). So ingestion is idempotent: every event has a stable source key, re-ingesting updates in place, and a nightly reconciliation pass catches corrections. Without idempotent upserts you double-count; without reconciliation you drift from the official record.
5. Serving it
A leaderboard is read-heavy and identical for everyone, so it's cached hard and exposed through a small public REST API (/players, /standings, compare endpoints). Caching is easy; invalidating the moment a goal lands without hammering the origin is the interesting part — key the cache on (league_season, stat, updated_at) and let new events bump the tag.
Takeaway
"Aggregate a leaderboard from several unreliable sources" is a pattern far beyond football — pricing, reviews, any vendor feed. The shape is always the same: resolve entities, define the metric precisely, make ingestion idempotent, cache the read path.
This is what powers top-scorers.com — live top scorers across 30+ leagues, goals vs xG, and a free public API. Built by Furiosa Studio.
Top comments (0)