Building a Cricket Trivia Game Was Easy. Normalising 7,000+ Players Was Hard.

#datascience #webdev #javascript #nextjs

When I started building Stumped!, a cricketer guessing game, I thought the hard part would be coming up with clever clues.

I was wrong.

The real hard part was turning thousands of raw, ball-by-ball cricket scorecards into clean, human-readable player profiles.

Here is what a simple trivia game taught me about normalising data.

The Dream vs. The Reality

I wanted to generate rich, dynamic clues for players, like:

"This batter scored 573 runs in the death overs at a strike rate of 135.8."

To do that, I turned to the amazing open datasets at Cricsheet. They provide incredible ball-by-ball archives. But there's a catch: raw match data and game-ready player profiles speak entirely different languages.

Cricsheet tells you what happened on delivery 4.2 of Match X. It does not tell you a player's career stats. Everything had to be derived from scratch.

1. Turning Matches into Careers

Step one was flipping the data architecture from match-centric to player-centric. I built a pipeline that ingests every single delivery and progressively updates a player’s lifetime accumulator object:

players[name] = {
    "matches": 0,
    "bat_runs": 0,
    "bat_balls": 0,
    "bowl_runs": 0,
    "bowl_wickets": 0,
    // ...you get the idea
}

Instead of querying a massive database of matches every time a user plays, we build the careers once beforehand.
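As a rough sketch of that accumulation step (not the exact pipeline), assume each delivery has already been reduced to a small record. The field names `batter`, `bowler`, `batterRuns`, `bowlerExtras`, `isWide` and `wicketKind` are hypothetical and simpler than the real Cricsheet schema, and the player names are placeholders:

```javascript
// Fresh accumulator for a player we have not seen before.
function emptyCareer() {
  return { matches: 0, bat_runs: 0, bat_balls: 0, bowl_runs: 0, bowl_wickets: 0 };
}

// Dismissals not credited to the bowler (assumption: a short, incomplete list).
const NOT_BOWLER_WICKETS = new Set(["run out", "retired hurt", "obstructing the field"]);

// Fold one match's deliveries into the lifetime accumulators.
// Call once per match so "matches" is incremented once per player per match.
function accumulateMatch(players, deliveries) {
  const seenThisMatch = new Set();
  const career = (name) => {
    if (!players[name]) players[name] = emptyCareer();
    if (!seenThisMatch.has(name)) {
      seenThisMatch.add(name);
      players[name].matches += 1; // simplification: only players who batted or bowled
    }
    return players[name];
  };

  for (const d of deliveries) {
    const bat = career(d.batter);
    const bowl = career(d.bowler);
    bat.bat_runs += d.batterRuns;
    if (!d.isWide) bat.bat_balls += 1; // a wide does not count as a ball faced
    bowl.bowl_runs += d.batterRuns + d.bowlerExtras; // wides and no-balls only
    if (d.wicketKind && !NOT_BOWLER_WICKETS.has(d.wicketKind)) bowl.bowl_wickets += 1;
  }
  return players;
}

const players = accumulateMatch({}, [
  { batter: "A Batter", bowler: "B Bowler", batterRuns: 4, bowlerExtras: 0, isWide: false, wicketKind: null },
  { batter: "A Batter", bowler: "B Bowler", batterRuns: 0, bowlerExtras: 1, isWide: true, wicketKind: null },
  { batter: "A Batter", bowler: "B Bowler", batterRuns: 0, bowlerExtras: 0, isWide: false, wicketKind: "bowled" },
]);
console.log(players["A Batter"], players["B Bowler"]);
```

Because the accumulators are plain objects keyed by name, the whole result can be written out once and served statically.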

2. The "Who TF is V Kohli?" Problem

Then came the initials. Cricket scorecards love abbreviations.

EJG Morgan

RG Sharma

V Kohli

Humans see "V Kohli" and know it's Virat. Computers see it and shrug. Worse, multiple players can share the same initials and surname.

I tried scraping external sports sites to map these messy strings to unique humans, but between rate limits, anti-bot shields, and wildly inconsistent formatting, the scrapers failed hard. Right now, I am back to the drawing board, trying to figure out how to reliably enrich these player profiles without losing my sanity.

(Pro tip: This is exactly why Stumped! asks users to guess surnames. Surnames are far more reliable than ambiguous initials.)
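To make the surname idea concrete, here is a minimal sketch of how a guess could be checked against a scorecard-style name. Both function names are hypothetical, and the "leading all-caps tokens are initials" rule is a heuristic assumption that will not hold for every name:

```javascript
// Split "EJG Morgan" into initials ("EJG") and surname ("Morgan").
// Heuristic: leading all-uppercase tokens are treated as initials,
// and at least one token is always kept as the surname.
function splitScorecardName(name) {
  const tokens = name.trim().split(/\s+/);
  const initials = [];
  while (tokens.length > 1 && /^[A-Z]+$/.test(tokens[0])) {
    initials.push(tokens.shift());
  }
  return { initials: initials.join(""), surname: tokens.join(" ") };
}

// A guess is accepted if it matches the surname, ignoring case and outer spaces.
function isCorrectGuess(guess, scorecardName) {
  const { surname } = splitScorecardName(scorecardName);
  return guess.trim().toLowerCase() === surname.toLowerCase();
}

console.log(splitScorecardName("EJG Morgan"));
console.log(isCorrectGuess("kohli", "V Kohli"));
```

Comparing only the surname sidesteps the initials entirely, at the cost of accepting the same guess for any two players who share a surname.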

3. Extracting the Spicy Stats

Basic career averages are boring trivia. To make the game fun, I needed to slice and dice the data into highly specific cricket archetypes:

Batting Phases: Grouping deliveries into Powerplay (overs 0–5), Middle (6–15), and Death (16–19), counting overs from zero, to find the clutch finishers.
The Psychology of the Chase: Tracking performance when setting a target vs. chasing one.
Nemesis vs Favourite Bowler: Who dismisses this batter the most? Who do they absolutely smash for fun? (Rule #1 of the pipeline: Your nemesis cannot also be your favourite victim. The logic got messy here, but it made the clues feel remarkably human.)
The Weird Stuff: Tracking golden ducks, diamond ducks, maiden overs, and dot-ball percentages.
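Two of these slices are easy to sketch. The version below assumes zero-indexed overs (so a 20-over innings runs from over 0 to over 19) and a hypothetical per-batter `matchups` array of `{ bowler, dismissals, runs }` records. The tie-breaking and the way the nemesis rule is enforced are my simplifications for illustration, not necessarily what the real pipeline does:

```javascript
// Bucket a zero-indexed over number into a batting phase.
function phaseOf(over) {
  if (over <= 5) return "powerplay";
  if (over <= 15) return "middle";
  return "death";
}

// Strike rate = runs per 100 balls faced, rounded to one decimal place.
function strikeRate(runs, balls) {
  return balls === 0 ? 0 : Math.round((runs / balls) * 1000) / 10;
}

// Pick the bowler who dismisses this batter most (nemesis) and the bowler
// they score the most runs off (favourite), never the same bowler twice.
function pickNemesisAndFavourite(matchups) {
  // Most dismissals first; ties go to the bowler who conceded fewer runs.
  const byDismissals = [...matchups].sort(
    (a, b) => b.dismissals - a.dismissals || a.runs - b.runs
  );
  const nemesis = byDismissals.find((m) => m.dismissals > 0) ?? null;

  // Rule #1: exclude the nemesis before ranking by runs scored.
  const favourite =
    matchups.filter((m) => m !== nemesis).sort((a, b) => b.runs - a.runs)[0] ?? null;

  return { nemesis: nemesis?.bowler ?? null, favourite: favourite?.bowler ?? null };
}

console.log(phaseOf(17), strikeRate(150, 100));
console.log(
  pickNemesisAndFavourite([
    { bowler: "Bowler X", dismissals: 3, runs: 80 },
    { bowler: "Bowler Y", dismissals: 1, runs: 20 },
    { bowler: "Bowler Z", dismissals: 0, runs: 50 },
  ])
);
```

In the example, Bowler X has conceded the most runs but is also the top wicket-taker, so the favourite falls through to Bowler Z.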

4. Flattening the Monster

Deeply nested JSON objects are a pain to consume on the frontend. The final step of the pipeline takes all those complex, deep career structures and flattens them into a clean, single-level profile:

{
  "bat_runs": 2443,
  "bat_average": 29.08,
  "bat_strike_rate": 136.7,
  "fielding_catches": 47,
  "nemesis_bowler_name": "Hardik Pandya",
  "favorite_bowler_name": "Umar Gul"
}

Now, generating a clue is as simple as reading a single key-value pair.
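For illustration, the flattening step and a clue lookup might look roughly like this. The nested input shape, the `flatten` and `clueFor` helpers, and the clue wording are all assumptions; only the flattened key names and values come from the profile above:

```javascript
// Recursively flatten nested objects into one level, joining keys with "_".
function flatten(obj, prefix = "") {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    const flatKey = prefix ? `${prefix}_${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      Object.assign(out, flatten(value, flatKey));
    } else {
      out[flatKey] = value;
    }
  }
  return out;
}

// Hypothetical nested career structure before flattening.
const nested = {
  bat: { runs: 2443, average: 29.08, strike_rate: 136.7 },
  fielding: { catches: 47 },
};
const profile = flatten(nested);

// One template per flat key; returns null when no clue can be built.
function clueFor(flatProfile, key) {
  const templates = {
    bat_strike_rate: (v) => `This batter has a career strike rate of ${v}.`,
    fielding_catches: (v) => `This player has taken ${v} catches.`,
  };
  const template = templates[key];
  return template && key in flatProfile ? template(flatProfile[key]) : null;
}

console.log(profile);
console.log(clueFor(profile, "fielding_catches"));
```

Keeping the profile flat means the frontend never has to know how the stats were nested or derived.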

Lesson Learned

Building the game took a fraction of the time it took to clean records and handle weird edge cases. Under the hood, a single clue represents thousands of rows of raw match data processed into something actually readable.
If I started this project again, I'd invest in the name normalisation layer first, before writing or generating a single line of stat-aggregation code.

We just launched the game! If you want to check out this and other fun side projects we’ve been hacking on, take a look at The Almanac Project or test your cricket trivia knowledge directly at Stumped!.

And as always, happy coding!