This is Part 1 of a 5-part series on building a complete ML system.
The Cricket App Nobody Asked For
Last year, I was watching IPL matches and thinking about the perfect fantasy cricket assistant.
Not just another prediction model.
Not just a stats dashboard.
But something that could:
- Answer any factual question instantly ("Who was player of the match on March 27, 2024?")
- Predict match winners with real confidence scores
- Route questions intelligently (knowing when to use each engine)
I'd see people on Twitter asking:
- "Has CSK beaten MI more times than the other way around?"
- "Does winning the toss actually help?"
- "Who will win if MI bat against CSK tomorrow?"
The problem: The first two need retrieval. The third needs ML. Nobody was solving both in one system.
At least, not in a simple, production-ready way.
So I built it.
What This System Actually Does
Example 1: Match Prediction
User: "Who will win if Mumbai Indians bat against Chennai Super Kings?"
System: Runs Gradient Boosting model with pre-match features
- H2H record: 54%
- Overall win rate: 55%
- Venue history: 60%
- Recent form: 52%
Response: 🏆 Chennai Super Kings (64% confidence)
Example 2: Factual Lookup
User: "Who was player of the match in match 335982?"
System: TF-IDF cosine similarity against 42,000 Q&A pairs
Response: 🌟 Suryakumar Yadav
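The Q&A engine is classic TF-IDF retrieval: vectorize the stored questions once, then answer a query with the nearest neighbor by cosine similarity. Here's a minimal sketch of the idea with a three-pair toy index (the real one holds ~42,000 pairs); the answers come from the examples in this post, and the variable names are illustrative, not the actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy Q&A index; the production index holds ~42,000 pairs.
qa_pairs = [
    ("Who was player of the match in match 335982?", "Suryakumar Yadav"),
    ("How many runs did MI score in match 335982?", "Mumbai Indians scored 222 runs"),
    ("Head to head KKR vs RCB?", "KKR: 9 wins, RCB: 8 wins"),
]

vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform([q for q, _ in qa_pairs])

def answer(query: str) -> str:
    """Return the answer paired with the most similar stored question."""
    scores = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]
    return qa_pairs[scores.argmax()][1]

print(answer("player of the match in match 335982"))  # → Suryakumar Yadav
```

The nice property: the index is just two pickled objects (vectorizer and matrix), which fits the .joblib-only runtime described below.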
Example 3: Historical Analysis
User: "Head to head KKR vs RCB?"
System: Retrieves team record from Q&A index
Response:
- KKR: 9 wins
- RCB: 8 wins
- (Total: 17 matches)
The magic? It automatically decides which engine to use based on the question.
🎯 How It Looks in Action
Here's what a real conversation looks like:
User: "Who will win if MI bat against CSK?"
System: Runs the ML model → 🏆 CSK (64% confidence)
User: "How many runs did MI score in match 335982?"
System: Queries Q&A index → 🏏 Mumbai Indians scored 222 runs
User: "Does toss matter?"
System: Retrieves aggregate stats → Only 51.3% of toss winners win the match
Two different engines, one seamless experience.
🎨 Quick Visual Overview
User Input (Natural Language)
↓
Intent Detection
├─ Prediction keywords? → ML Engine
├─ Match ID? → Q&A Lookup
└─ Otherwise → Q&A Retrieval
↓
Intelligent Response
└─ Routed through FastAPI backend
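The routing itself doesn't need anything fancy. Here's a minimal sketch of the decision tree above, using keyword checks and a match-ID regex; the keyword list and the returned labels are illustrative, not the actual implementation:

```python
import re

# Illustrative trigger phrases for prediction-style questions
PREDICTION_KEYWORDS = ("who will win", "predict", "winner", "chances")

def route(query: str) -> str:
    """Pick an engine for a query, mirroring the diagram above."""
    q = query.lower()
    if any(kw in q for kw in PREDICTION_KEYWORDS):
        return "ml_engine"       # prediction-style question → ML model
    if re.search(r"\bmatch\s+\d+\b", q):
        return "qa_lookup"       # explicit match ID → direct Q&A lookup
    return "qa_retrieval"        # everything else → TF-IDF retrieval

print(route("Who will win if MI bat against CSK?"))          # → ml_engine
print(route("How many runs did MI score in match 335982?"))  # → qa_lookup
print(route("Does toss matter?"))                            # → qa_retrieval
```

Keyword routing is brittle for open-ended chat, but for a domain this narrow it's fast, debuggable, and has no model to load.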
The Architecture (Simple Version)
CSV Data (2,217 rows)
↓
Normalization (fix team names, dates)
↓
Feature Engineering (pre-match only)
├─→ ML Model (Gradient Boosting, 61.8%)
└─→ Q&A Index (42,000 pairs)
↓
FastAPI Backend (/chat, /predict, /health)
↓
Streamlit Frontend (3 tabs: Chat, Predict, Metrics)
Critical design choice: only .joblib artifacts are loaded at runtime, with no CSV dependency. That makes the service:
✅ Fast (no CSV parsing)
✅ Portable (single deploy)
✅ Docker-friendly
The Dataset Problem Nobody Talks About
The IPL dataset spans 2008–2024 (2,217 rows). Looks clean, right?
It's not.
Problem 1: Team Name Chaos
Delhi Daredevils became Delhi Capitals in 2019. Kings XI Punjab became Punjab Kings in 2021.
If you don't normalize:
# ❌ WRONG
Delhi Daredevils: 5 wins (2015-2018)
Delhi Capitals: 2 wins (2020-2024)
Total: 7 wins (WRONG! It's one team)
The fix:
TEAM_NAME_MAP = {
    "Royal Challengers Bengaluru": "Royal Challengers Bangalore",
    "Delhi Daredevils": "Delhi Capitals",
    "Kings XI Punjab": "Punjab Kings",
}

# team_cols = every column holding a team name, e.g.
# ["batting_team", "bowling_team", "toss_winner", "winner"]
df[team_cols] = df[team_cols].replace(TEAM_NAME_MAP)
One function. Called once. All team names normalized.
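To see the effect, here's a runnable toy version of the Delhi example above (the counts are made up for illustration):

```python
import pandas as pd

# Seven toy match winners split across the franchise's two names
df = pd.DataFrame({"winner": ["Delhi Daredevils"] * 5 + ["Delhi Capitals"] * 2})

TEAM_NAME_MAP = {"Delhi Daredevils": "Delhi Capitals"}
df["winner"] = df["winner"].replace(TEAM_NAME_MAP)

print(df["winner"].value_counts().to_dict())  # → {'Delhi Capitals': 7}
```

Without the map, every aggregate (win rates, H2H records, venue history) silently splits one franchise into two weaker signals.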
Problem 2: Date Type Matters
# ❌ FRAGILE (string comparison)
df[df["date"] < "2023-01-01"]
# Only safe if every date is a zero-padded ISO string;
# a single "5/4/2017"-style value silently breaks the ordering

# ✅ RIGHT (datetime comparison)
df["date"] = pd.to_datetime(df["date"])
df[df["date"] < pd.Timestamp("2023-01-01")]
The Real Problem: Data Leakage
Here's where most sports prediction projects fail silently.
Imagine building a predictor with:
features = ["total_runs", "wickets", "run_rate", "boundaries"]
Your model trains on 500 matches and gets 95% accuracy.
In production? 45% accuracy.
Why? Because total_runs is only known after the match. Your model is using information that doesn't exist at prediction time.
This is data leakage, and it's invisible.
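One cheap defence: an explicit deny-list check before training, so a post-match column can never sneak into the feature set. This guard is my own sketch, not part of the original pipeline; the column names come from the example above:

```python
# Columns whose values only exist after a match finishes
POST_MATCH_COLUMNS = {"total_runs", "wickets", "run_rate", "boundaries"}

def assert_no_leakage(feature_names):
    """Fail fast if any post-match column is about to be used for training."""
    leaked = POST_MATCH_COLUMNS & set(feature_names)
    if leaked:
        raise ValueError(f"post-match features in training set: {sorted(leaked)}")

assert_no_leakage(["h2h_win_rate", "venue_win_rate"])  # fine: pre-match only
```

A one-line call at the top of the training script turns an invisible bug into a loud one.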
The Guard: Time-Based Filtering
For Match X (played Jan 15, 2024), I only use matches before Jan 15:
def _h2h_win_rate(matches, team1, team2, before_date):
    past = matches[
        (matches["date"] < before_date)  # ← THE GUARD
        & (
            ((matches["batting_team"] == team1) & (matches["bowling_team"] == team2))
            | ((matches["batting_team"] == team2) & (matches["bowling_team"] == team1))
        )
    ]
    if past.empty:
        return 0.5  # no history → neutral prior
    return (past["winner"] == team1).sum() / len(past)
That one line ensures the model only learns from past data.
Four historical metrics:
- H2H win rate — Past match record
- Overall win rate — All-time record
- Venue win rate — Ground-specific history
- Rolling win rate — Recent form (last 5 matches)
13 features total. All pre-match. Zero leakage.
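The rolling-form feature follows the same pattern as _h2h_win_rate. Here's a sketch of how such a helper might look, assuming the same column schema (date, batting_team, bowling_team, winner); the function name and the window default are illustrative:

```python
import pandas as pd

def rolling_win_rate(matches, team, before_date, window=5):
    """Win rate over the team's last `window` matches before `before_date`."""
    past = matches[
        (matches["date"] < before_date)  # same guard as _h2h_win_rate
        & ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ].sort_values("date").tail(window)
    if past.empty:
        return 0.5  # neutral prior when there is no history
    return (past["winner"] == team).mean()

matches = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"]),
    "batting_team": ["MI", "CSK", "MI"],
    "bowling_team": ["CSK", "MI", "CSK"],
    "winner": ["MI", "MI", "CSK"],
})
print(rolling_win_rate(matches, "MI", pd.Timestamp("2024-01-15")))  # → 2/3
```

Every feature in the set is built this way: filter to strictly-earlier rows first, aggregate second.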
Why This Matters
The difference between 46% (baseline) and 61.8% (Gradient Boosting) came from:
- ✅ Proper feature engineering (H2H, venue, momentum)
- ✅ No data leakage (pre-match features only)
- ✅ Time-based validation (train on past, test on future)
That last one is critical. If I'd used a random 80/20 split:
❌ WRONG:
- Training: 2008-2024 mixed
- Testing: 2008-2024 mixed
- Model trains on 2024 data, tests on 2021 data (backwards!)
✅ RIGHT:
- Training: 932 matches (2008-2022)
- Testing: 144 matches (2023-2024)
- Simulates real deployment
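In code, the split is just a date cutoff. A sketch assuming a 2023-01-01 boundary, which is consistent with training through 2022 (the exact cutoff the project uses is not shown here):

```python
import pandas as pd

def time_based_split(df, cutoff="2023-01-01"):
    """Train on everything before the cutoff, test on everything after."""
    cutoff = pd.Timestamp(cutoff)
    return df[df["date"] < cutoff], df[df["date"] >= cutoff]

# Toy frame: two past-season dates, two future ones
df = pd.DataFrame({"date": pd.to_datetime(
    ["2022-04-01", "2022-05-29", "2023-04-01", "2024-05-26"])})
train, test = time_based_split(df)
print(len(train), len(test))  # → 2 2
```

One function call, and the evaluation now answers the only question that matters: how does the model do on matches it has never seen, played after everything it trained on?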
The Normalized Dataset
After all fixes:
| Metric | Value |
|---|---|
| Total rows | 2,217 |
| Unique matches | 1,076 |
| Seasons | 17 (2008-2024) |
| Teams | 8 (normalized) |
| Pre-match features | 13 — all derived strictly from past data |
| Training set | 932 matches |
| Test set | 144 matches |
Ready for feature engineering.
What's in Part 2 (Prediction Engine)
In the next post, I'll show you:
✅ How to assemble 13 features cleanly
✅ Why Gradient Boosting won (surprise: size matters)
✅ Feature importances — what actually matters?
✅ Model comparison (61.8% explained)
✅ Confidence scoring
Sneak preview: Most people think toss is important. The model learned toss is almost irrelevant (5.3% importance). Head-to-head records? That's the signal (12.8%).
The Moment It Clicked
At one point, my model showed ~95% accuracy on the training set — and I was genuinely excited.
Then I realized it was using future data. total_runs, wickets, boundaries: all known only after the match ended.
That was the moment I understood how easy it is to fool yourself in machine learning. The model wasn't learning cricket patterns. It was memorizing match outcomes.
I rebuilt everything with strictly pre-match features. Accuracy dropped to 61.8%. But suddenly, it actually worked in the real world.
🏁 Final Thought
Most ML systems don't fail because of models.
They fail because of:
- Bad data (unclean, inconsistent teams/dates)
- Leakage (using information that shouldn't exist at prediction time)
- Wrong validation (random splits on time-series data)
Fix those three things — and even simple models become powerful.
Part 2 dives into how I built the prediction engine with these principles in mind.
This is Part 1 of 5. Ready for the deep dive? Part 2: Prediction Engine →