This is Part 1 of a 5-part series on building a complete ML system.
The Cricket App Nobody Asked For
Last year, I was watching IPL matches and thinking about the perfect fantasy cricket assistant.
Not just another prediction model.
Not just a stats dashboard.
But something that could:
- Answer any factual question instantly ("Who was player of the match on March 27, 2024?")
- Predict match winners with real confidence scores
- Route questions intelligently (knowing when to use each engine)
I'd see people on Twitter asking:
- "Has CSK beaten MI more times than the other way around?"
- "Does winning the toss actually help?"
- "Who will win if MI bat against CSK tomorrow?"
The problem: The first two need retrieval. The third needs ML. Nobody was solving both in one system.
At least, not in a simple, production-ready way.
So I built it.
What This System Actually Does
Example 1: Match Prediction
User: "Who will win if Mumbai Indians bat against Chennai Super Kings?"
System: Runs Gradient Boosting model with pre-match features
- H2H record: 54%
- Overall win rate: 55%
- Venue history: 60%
- Recent form: 52%
Response: 🏆 Chennai Super Kings (64% confidence)
Example 2: Factual Lookup
User: "Who was player of the match in match 335982?"
System: TF-IDF cosine similarity against 42,000 Q&A pairs
Response: 🌟 Suryakumar Yadav
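The Q&A engine is classic TF-IDF retrieval: vectorize the stored questions once, then answer a query with the nearest neighbor by cosine similarity. Here's a minimal sketch of the idea with a three-pair toy index (the real one holds ~42,000 pairs); the answers come from the examples in this post, and the variable names are illustrative, not the actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy Q&A index; the production index holds ~42,000 pairs.
qa_pairs = [
    ("Who was player of the match in match 335982?", "Suryakumar Yadav"),
    ("How many runs did MI score in match 335982?", "Mumbai Indians scored 222 runs"),
    ("Head to head KKR vs RCB?", "KKR: 9 wins, RCB: 8 wins"),
]

vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform([q for q, _ in qa_pairs])

def answer(query: str) -> str:
    """Return the answer paired with the most similar stored question."""
    scores = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]
    return qa_pairs[scores.argmax()][1]

print(answer("player of the match in match 335982"))  # → Suryakumar Yadav
```

The nice property: the index is just two pickled objects (vectorizer and matrix), which fits the .joblib-only runtime described below.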
Example 3: Historical Analysis
User: "Head to head KKR vs RCB?"
System: Retrieves team record from Q&A index
Response:
- KKR: 9 wins
- RCB: 8 wins
- (Total: 17 matches)
The magic? It automatically decides which engine to use based on the question.
🎯 How It Looks in Action
Here's what a real conversation looks like:
User: "Who will win if MI bat against CSK?"
System: Runs the ML model → 🏆 CSK (64% confidence)
User: "How many runs did MI score in match 335982?"
System: Queries Q&A index → 🏏 Mumbai Indians scored 222 runs
User: "Does toss matter?"
System: Retrieves aggregate stats → Only 51.3% of toss winners win the match
Two different engines, one seamless experience.
🎨 Quick Visual Overview
User Input (Natural Language)
↓
Intent Detection
├─ Prediction keywords? → ML Engine
├─ Match ID? → Q&A Lookup
└─ Otherwise → Q&A Retrieval
↓
Intelligent Response
└─ Routed through FastAPI backend
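The routing itself doesn't need anything fancy. Here's a minimal sketch of the decision tree above, using keyword checks and a match-ID regex; the keyword list and the returned labels are illustrative, not the actual implementation:

```python
import re

# Illustrative trigger phrases for prediction-style questions
PREDICTION_KEYWORDS = ("who will win", "predict", "winner", "chances")

def route(query: str) -> str:
    """Pick an engine for a query, mirroring the diagram above."""
    q = query.lower()
    if any(kw in q for kw in PREDICTION_KEYWORDS):
        return "ml_engine"       # prediction-style question → ML model
    if re.search(r"\bmatch\s+\d+\b", q):
        return "qa_lookup"       # explicit match ID → direct Q&A lookup
    return "qa_retrieval"        # everything else → TF-IDF retrieval

print(route("Who will win if MI bat against CSK?"))          # → ml_engine
print(route("How many runs did MI score in match 335982?"))  # → qa_lookup
print(route("Does toss matter?"))                            # → qa_retrieval
```

Keyword routing is brittle for open-ended chat, but for a domain this narrow it's fast, debuggable, and has no model to load.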
The Architecture (Simple Version)
CSV Data (2,217 rows)
↓
Normalization (fix team names, dates)
↓
Feature Engineering (pre-match only)
├─→ ML Model (Gradient Boosting, 61.8%)
└─→ Q&A Index (42,000 pairs)
↓
FastAPI Backend (/chat, /predict, /health)
↓
Streamlit Frontend (3 tabs: Chat, Predict, Metrics)
Critical design choice: only .joblib artifacts are loaded at runtime, with no CSV dependency. That makes the service:
✅ Fast (no CSV parsing)
✅ Portable (single deploy)
✅ Docker-friendly
The Dataset Problem Nobody Talks About
The IPL dataset spans 2008–2024 (2,217 rows). Looks clean, right?
It's not.
Problem 1: Team Name Chaos
Delhi Daredevils became Delhi Capitals in 2019. Kings XI Punjab became Punjab Kings in 2021.
If you don't normalize:
# ❌ WRONG
Delhi Daredevils: 5 wins (2015-2018)
Delhi Capitals: 2 wins (2020-2024)
Total: 7 wins (WRONG! It's one team)
The fix:
TEAM_NAME_MAP = {
    "Royal Challengers Bengaluru": "Royal Challengers Bangalore",
    "Delhi Daredevils": "Delhi Capitals",
    "Kings XI Punjab": "Punjab Kings",
}

# team_cols = every column holding a team name, e.g.
# ["batting_team", "bowling_team", "toss_winner", "winner"]
df[team_cols] = df[team_cols].replace(TEAM_NAME_MAP)
One function. Called once. All team names normalized.
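To see the effect, here's a runnable toy version of the Delhi example above (the counts are made up for illustration):

```python
import pandas as pd

# Seven toy match winners split across the franchise's two names
df = pd.DataFrame({"winner": ["Delhi Daredevils"] * 5 + ["Delhi Capitals"] * 2})

TEAM_NAME_MAP = {"Delhi Daredevils": "Delhi Capitals"}
df["winner"] = df["winner"].replace(TEAM_NAME_MAP)

print(df["winner"].value_counts().to_dict())  # → {'Delhi Capitals': 7}
```

Without the map, every aggregate (win rates, H2H records, venue history) silently splits one franchise into two weaker signals.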
Problem 2: Date Type Matters
# ❌ FRAGILE (string comparison)
df[df["date"] < "2023-01-01"]
# Only safe if every date is a zero-padded ISO string;
# a single "5/4/2017"-style value silently breaks the ordering

# ✅ RIGHT (datetime comparison)
df["date"] = pd.to_datetime(df["date"])
df[df["date"] < pd.Timestamp("2023-01-01")]
The Real Problem: Data Leakage
Here's where most sports prediction projects fail silently.
Imagine building a predictor with:
features = ["total_runs", "wickets", "run_rate", "boundaries"]
Your model trains on 500 matches and gets 95% accuracy.
In production? 45% accuracy.
Why? Because total_runs is only known after the match. Your model is using information that doesn't exist at prediction time.
This is data leakage, and it's invisible.
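One cheap defence: an explicit deny-list check before training, so a post-match column can never sneak into the feature set. This guard is my own sketch, not part of the original pipeline; the column names come from the example above:

```python
# Columns whose values only exist after a match finishes
POST_MATCH_COLUMNS = {"total_runs", "wickets", "run_rate", "boundaries"}

def assert_no_leakage(feature_names):
    """Fail fast if any post-match column is about to be used for training."""
    leaked = POST_MATCH_COLUMNS & set(feature_names)
    if leaked:
        raise ValueError(f"post-match features in training set: {sorted(leaked)}")

assert_no_leakage(["h2h_win_rate", "venue_win_rate"])  # fine: pre-match only
```

A one-line call at the top of the training script turns an invisible bug into a loud one.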
The Guard: Time-Based Filtering
For Match X (played Jan 15, 2024), I only use matches before Jan 15:
def _h2h_win_rate(matches, team1, team2, before_date):
    past = matches[
        (matches["date"] < before_date)  # ← THE GUARD
        & (
            ((matches["batting_team"] == team1) & (matches["bowling_team"] == team2))
            | ((matches["batting_team"] == team2) & (matches["bowling_team"] == team1))
        )
    ]
    if past.empty:
        return 0.5  # no history → neutral prior
    return (past["winner"] == team1).sum() / len(past)
That one line ensures the model only learns from past data.
Four historical metrics:
- H2H win rate — Past match record
- Overall win rate — All-time record
- Venue win rate — Ground-specific history
- Rolling win rate — Recent form (last 5 matches)
13 features total. All pre-match. Zero leakage.
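The rolling-form feature follows the same pattern as _h2h_win_rate. Here's a sketch of how such a helper might look, assuming the same column schema (date, batting_team, bowling_team, winner); the function name and the window default are illustrative:

```python
import pandas as pd

def rolling_win_rate(matches, team, before_date, window=5):
    """Win rate over the team's last `window` matches before `before_date`."""
    past = matches[
        (matches["date"] < before_date)  # same guard as _h2h_win_rate
        & ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ].sort_values("date").tail(window)
    if past.empty:
        return 0.5  # neutral prior when there is no history
    return (past["winner"] == team).mean()

matches = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"]),
    "batting_team": ["MI", "CSK", "MI"],
    "bowling_team": ["CSK", "MI", "CSK"],
    "winner": ["MI", "MI", "CSK"],
})
print(rolling_win_rate(matches, "MI", pd.Timestamp("2024-01-15")))  # → 2/3
```

Every feature in the set is built this way: filter to strictly-earlier rows first, aggregate second.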
Why This Matters
The difference between 46% (baseline) and 61.8% (Gradient Boosting) came from:
- ✅ Proper feature engineering (H2H, venue, momentum)
- ✅ No data leakage (pre-match features only)
- ✅ Time-based validation (train on past, test on future)
That last one is critical. If I'd used a random 80/20 split:
❌ WRONG:
- Training: 2008-2024 mixed
- Testing: 2008-2024 mixed
- Model trains on 2024 data, tests on 2021 data (backwards!)
✅ RIGHT:
- Training: 932 matches (2008-2022)
- Testing: 144 matches (2023-2024)
- Simulates real deployment
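In code, the split is just a date cutoff. A sketch assuming a 2023-01-01 boundary, which is consistent with training through 2022 (the exact cutoff the project uses is not shown here):

```python
import pandas as pd

def time_based_split(df, cutoff="2023-01-01"):
    """Train on everything before the cutoff, test on everything after."""
    cutoff = pd.Timestamp(cutoff)
    return df[df["date"] < cutoff], df[df["date"] >= cutoff]

# Toy frame: two past-season dates, two future ones
df = pd.DataFrame({"date": pd.to_datetime(
    ["2022-04-01", "2022-05-29", "2023-04-01", "2024-05-26"])})
train, test = time_based_split(df)
print(len(train), len(test))  # → 2 2
```

One function call, and the evaluation now answers the only question that matters: how does the model do on matches it has never seen, played after everything it trained on?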
The Normalized Dataset
After all fixes:
| Metric | Value |
|---|---|
| Total rows | 2,217 |
| Unique matches | 1,076 |
| Seasons | 17 (2008-2024) |
| Teams | 8 (normalized) |
| Pre-match features | 13 — all derived strictly from past data |
| Training set | 932 matches |
| Test set | 144 matches |
Ready for feature engineering.
What's in Part 2 (Prediction Engine)
In the next post, I'll show you:
✅ How to assemble 13 features cleanly
✅ Why Gradient Boosting won (surprise: size matters)
✅ Feature importances — what actually matters?
✅ Model comparison (61.8% explained)
✅ Confidence scoring
Sneak preview: Most people think toss is important. The model learned toss is almost irrelevant (5.3% importance). Head-to-head records? That's the signal (12.8%).
The Moment It Clicked
At one point, my model showed ~95% accuracy on the training set — and I was genuinely excited.
Then I realized it was using future data. total_runs, wickets, boundaries: all known only after the match ended.
That was the moment I understood how easy it is to fool yourself in machine learning. The model wasn't learning cricket patterns. It was memorizing match outcomes.
I rebuilt everything with strictly pre-match features. Accuracy dropped to 61.8%. But suddenly, it actually worked in the real world.
🏁 Final Thought
Most ML systems don't fail because of models.
They fail because of:
- Bad data (unclean, inconsistent teams/dates)
- Leakage (using information that shouldn't exist at prediction time)
- Wrong validation (random splits on time-series data)
Fix those three things — and even simple models become powerful.
Part 2 dives into how I built the prediction engine with these principles in mind.
This is Part 1 of 5. Ready for the deep dive? Part 2: Prediction Engine →