Building the Perfect Prediction Engine — Feature Engineering Without Cheating
The Engineering Challenge
I had 13 pre-match features to compute for 1,076 matches, using only historical data.
The naive approach: Loop through each match, compute features. Takes 45 minutes.
The question: How do you scale feature engineering?
The answer: Vectorized pandas operations + smart grouping.
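Here's the core vectorized trick on a toy per-team log (data and column names are made up for illustration): group once, then use `shift(1)` plus an expanding window so each row's feature sees only earlier matches.

```python
import pandas as pd

# Toy per-team match log (hypothetical rows), one row per (match, team)
log = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "team": ["MI", "MI", "MI", "MI"],
    "won":  [1, 0, 1, 1],
}).sort_values("date")

# Expanding win rate BEFORE each match: shift(1) drops the current row,
# so no present or future result leaks into the feature
log["win_rate_before"] = (
    log.groupby("team")["won"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
print(log["win_rate_before"].tolist())  # [nan, 1.0, 0.5, 0.666...]
```

One `groupby` replaces thousands of per-match filter passes, which is where the loop version loses its 45 minutes.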
Four Historical Rate Functions
1️⃣ Head-to-Head Win Rate
How many times has Team A beaten Team B historically?
```python
def _h2h_win_rate(matches, team1, team2, before_date):
    """Win rate of team1 vs team2 in all past matches."""
    past = matches[
        (matches["date"] < before_date) &  # ← No future data
        (
            ((matches["batting_team"] == team1) & (matches["bowling_team"] == team2)) |
            ((matches["batting_team"] == team2) & (matches["bowling_team"] == team1))
        )
    ]
    if past.empty:
        return 0.5  # Neutral if no history
    wins_by_team1 = (past["winner"] == team1).sum()
    return wins_by_team1 / len(past)
```
Real example: MI vs CSK
38 past matches (filtered by before_date)
MI won: 18
CSK won: 20
→ h2h_rate_MI = 18/38 = 0.474 (47.4%)
Key insight: The line matches["date"] < before_date is the data leakage guard.
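A quick sanity check of that guard on a toy history (hypothetical results): applying the same mask by hand shows the 2024 match being excluded from the rate.

```python
import pandas as pd

# Toy head-to-head history (hypothetical results)
matches = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-01", "2023-05-01", "2024-04-01"]),
    "batting_team": ["MI", "CSK", "MI"],
    "bowling_team": ["CSK", "MI", "CSK"],
    "winner": ["MI", "CSK", "MI"],
})

# Same filter as _h2h_win_rate, for a 2024-01-01 cutoff:
before = pd.Timestamp("2024-01-01")
past = matches[
    (matches["date"] < before)
    & (matches["batting_team"].isin(["MI", "CSK"]))
    & (matches["bowling_team"].isin(["MI", "CSK"]))
]
rate = (past["winner"] == "MI").sum() / len(past)
print(rate)  # the 2024 match is excluded: 1 win in 2 games -> 0.5
```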
2️⃣ Overall Win Rate
What percentage of matches does this team win (all-time)?
```python
def _overall_win_rate(matches, team, before_date):
    """Win percentage across all past matches."""
    past = matches[
        (matches["date"] < before_date) &
        ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ]
    if past.empty:
        return 0.5  # Neutral if no history (also avoids division by zero)
    return (past["winner"] == team).sum() / len(past)
```
Real example:
Team: Mumbai Indians
Matches before 2024-01-01: 200
Wins: 90
→ overall_rate = 45% (structurally strong team)
Team: Kings XI Punjab
Matches: 200
Wins: 70
→ overall_rate = 35% (historically weaker)
Why it matters: The model learns team quality. MI at 45% vs KXIP at 35% = significant signal.
3️⃣ Venue Win Rate
How does this team perform at THIS specific ground?
```python
def _venue_win_rate(matches, team, venue, before_date):
    """Win rate at a specific venue."""
    past = matches[
        (matches["date"] < before_date) &
        (matches["venue"] == venue) &  # ← Venue-specific
        ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ]
    return (past["winner"] == team).sum() / len(past) if not past.empty else 0.5
```
Real example:
| Ground | MI Record | Team | Rate |
|---|---|---|---|
| Wankhede (Mumbai) | 18-10 | MI | 64% |
| Eden Gardens (Kolkata) | 4-11 | MI | 27% |
| Narendra Modi (Ahmedabad) | 8-7 | MI | 53% |
Key insight: Home field advantage is HUGE. MI dominates at their home ground, struggles at KKR's fortress.
4️⃣ Rolling Win Rate (Momentum)
How has this team performed in their LAST 5 matches?
```python
def _rolling_win_rate(matches, team, before_date, n=5):
    """Recent form: last n matches."""
    past = matches[
        (matches["date"] < before_date) &
        ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ].tail(n)  # ← LAST n rows only (relies on matches being date-sorted)
    return (past["winner"] == team).sum() / len(past) if not past.empty else 0.5
```
Why momentum matters:
- Winning streak (5-0): Confidence high, execution sharp → +15% win probability
- Losing streak (0-5): Confidence shot, making mistakes → -15% win probability
Real example:
Match: MI vs CSK on 2024-01-20
MI's last 5 matches:
2024-01-12: Loss
2024-01-14: Loss
2024-01-16: Win ← Starting to recover
2024-01-17: Win ← Confidence building
2024-01-19: Win ← On fire
→ rolling_rate_MI = 3/5 = 60%
CSK's last 5 matches:
2024-01-10: Loss
2024-01-11: Loss
2024-01-13: Loss
2024-01-15: Loss
2024-01-18: Loss
→ rolling_rate_CSK = 0/5 = 0% (complete collapse)
Model says: MI is 60% (form) vs CSK's 0% (form)
→ Prediction favors MI heavily
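The 60% above falls out of the same `.tail(n)` trick; here's a minimal reproduction (the opponents in the two losses are hypothetical):

```python
import pandas as pd

# MI's last five results from the example above (opponents are made up)
results = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-12", "2024-01-14", "2024-01-16",
                            "2024-01-17", "2024-01-19"]),
    "winner": ["CSK", "RCB", "MI", "MI", "MI"],  # two losses, three wins
})
cutoff = pd.Timestamp("2024-01-20")
last5 = results[results["date"] < cutoff].tail(5)  # same .tail(n) trick
print((last5["winner"] == "MI").sum() / len(last5))  # 0.6
```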
Assembling Features: engineer_features()
```python
import pandas as pd

def engineer_features(df):
    """
    Main feature engineering function.
    Input:  Raw DataFrame (2,217 rows)
    Output: ML-ready DataFrame (932 rows, 13 features)
    """
    # Step 1: One row per match
    matches = df[df["innings"] == 1].copy().reset_index(drop=True)
    # Step 2: Chronological order (CRITICAL)
    matches = matches.sort_values("date").reset_index(drop=True)
    # Step 3: Compute features for each match
    rows = []
    for _, row in matches.iterrows():
        t1, t2 = row["batting_team"], row["bowling_team"]
        date = row["date"]
        venue = row["venue"]
        winner = row["winner"]
        if pd.isna(winner):
            continue  # Skip abandoned matches
        rows.append({
            "team1": t1,
            "team2": t2,
            "venue": venue,
            "toss_winner_is_team1": 1 if row["toss_winner"] == t1 else 0,
            "toss_decision": row.get("toss_decision", "unknown"),
            # Historical rates (all filtered by date < before_date)
            "h2h_win_rate_t1": _h2h_win_rate(matches, t1, t2, date),
            "overall_win_rate_t1": _overall_win_rate(matches, t1, date),
            "overall_win_rate_t2": _overall_win_rate(matches, t2, date),
            "venue_win_rate_t1": _venue_win_rate(matches, t1, venue, date),
            "venue_win_rate_t2": _venue_win_rate(matches, t2, venue, date),
            "rolling_win_rate_t1": _rolling_win_rate(matches, t1, date),
            "rolling_win_rate_t2": _rolling_win_rate(matches, t2, date),
            "winner": winner,
            "date": date,
        })
    return pd.DataFrame(rows)
```
Output:
Shape: (932, 15)
- 932 matches (2008-2024, cleaned)
- 15 columns (4 metadata + 2 toss + 7 rates + 1 target + 1 date)
The Critical Split: Time-Based vs Random
❌ Wrong Way (Random Split)
```python
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
```
Problem: Mixes past and future randomly.
Training set might have: 2024, 2008, 2021, 2010, ...
Test set might have: 2015, 2022, 2009, 2019, ...
Model trains on some 2024 data → tests on some 2008 data
Completely backwards!
Result: 95% test accuracy (fake), 30% production accuracy (real)
✅ Right Way (Temporal Split)
```python
split_date = pd.Timestamp("2023-01-01")
train_df = engineered_df[engineered_df["date"] < split_date]
test_df = engineered_df[engineered_df["date"] >= split_date]
```
Timeline:

```
2008 ──────────────────────────────── 2023 ─────────────── 2024
├── TRAIN (15 years, 932 matches) ───┤├── TEST (2 years, 144 matches) ──┤
    learned from history                  never seen before
        ↓ train model                         ↓ evaluate model
```
Result: 61.8% test accuracy (real), because model only sees truly new data.
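The same no-future rule applies to cross-validation (the post doesn't say which CV scheme produced the CV-mean numbers below; a shuffled KFold would reintroduce the leakage). scikit-learn's `TimeSeriesSplit` yields folds that always train on the past and validate on the future, sketched here on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for date-sorted features
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    # every training index precedes every validation index
    assert train_idx.max() < val_idx.min()
    print(len(train_idx), len(val_idx))
```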
Model Comparison: 4 Algorithms
I tested four algorithms on the same data:
```python
candidates = {
    "Logistic Regression": LogisticRegression(...),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=8),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, max_depth=4),
    "XGBoost": XGBClassifier(n_estimators=200, max_depth=4),
}
```
Results
| Model | Test Accuracy | CV Mean | Interpretation |
|---|---|---|---|
| Logistic Regression | 46.5% | 48.9% | Linear assumption too simple |
| Random Forest | 50.7% | 53.4% | Better, handles non-linearity |
| Gradient Boosting | 61.8% | 48.3% | ✅ BEST — learns patterns |
| XGBoost | 58.3% | 47.1% | Close 2nd, slight overfitting |
61.8% accuracy is roughly 4.9x better than random guessing (12.5% with 8 possible winners).
Why Gradient Boosting won:
- Logistic Regression: linear model → can't capture complex feature interactions
- Random Forest: decent → but its trees are grown independently on random subsets, so they can't correct each other
- Gradient Boosting: each tree fits the errors of the previous ones → learns complex patterns
- XGBoost: similar to GB, with slight overfitting on this dataset
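That "each tree corrects previous errors" claim is easy to demystify with a hand-rolled boosting loop on a toy regression target (this shows the idea, not scikit-learn's actual implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0])

pred = np.full_like(y, y.mean())  # stage 0: constant guess
for _ in range(50):
    residual = y - pred                              # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += 0.1 * stump.predict(X)                   # small step toward the residual

print(np.mean((y - pred) ** 2))  # far below the stage-0 error
```

Each stump is weak on its own; the sequence of corrections is what learns the non-linear shape.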
Feature Importance: What Actually Matters
```python
# From the trained Gradient Boosting model
feature_importance = {
    "overall_win_rate_t1": 0.152,   # 15.2% — Most important
    "h2h_win_rate_t1": 0.128,       # 12.8%
    "toss_winner_is_team1": 0.053,  # 5.3% — Much less than you'd think!
    "rolling_win_rate_t1": 0.049,   # 4.9%
    "venue_win_rate_t1": 0.042,     # 4.2%
}
```
Key insights:
- Overall strength matters most (15.2%): a fundamentally strong team (high overall_win_rate) beats almost everything
- H2H record matters (12.8%): historical matchups carry real signal
- Toss hardly matters (5.3%): the myth is that winning the toss decides the match; in reality it swings maybe 5% of outcomes
- Momentum is a weak signal (4.9%): surprising! Recent form is less predictive than we think
- Venue advantage is real but modest (4.2%): home field helps, but not as much as team quality
The Training Pipeline
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Column groups from engineer_features() output
numeric_cols = [
    "toss_winner_is_team1",
    "h2h_win_rate_t1",
    "overall_win_rate_t1", "overall_win_rate_t2",
    "venue_win_rate_t1", "venue_win_rate_t2",
    "rolling_win_rate_t1", "rolling_win_rate_t2",
]
categorical_cols = ["team1", "team2", "venue", "toss_decision"]

# Numeric features: standardize (mean 0, std 1)
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: one-hot encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    # scikit-learn >= 1.2 renamed sparse= to sparse_output=
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Full pipeline: preprocess → model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=4))
])

# Train on 2008-2022, evaluate on 2023-2024
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)  # 0.618 ✓
```
Model Serialization
```python
import joblib

bundle = {
    "pipeline": pipeline,            # Preprocessor + model
    "features": feature_names,       # For documentation
    "label_encoder": label_encoder,  # Team names → indices
    "metrics": {
        "test_accuracy": 0.618,
        "cv_mean": 0.483,
        "cv_std": 0.028,
    },
    "feature_importance": importance_dict,
}
joblib.dump(bundle, "models/model.joblib")
# Result: ~30MB file with everything needed for predictions
```
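A round-trip sketch of the load side, using a stand-in model and a temp file so it runs on its own (the bundle contents mirror the real one above):

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Stand-in for the real pipeline: a toy model fit on two points
model = LogisticRegression().fit([[0.4], [0.6]], [0, 1])
bundle = {"pipeline": model, "metrics": {"test_accuracy": 0.618}}

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(bundle, path)

# At serving time: load once, then predict from the bundle
loaded = joblib.load(path)
print(loaded["metrics"]["test_accuracy"])   # 0.618
print(loaded["pipeline"].predict([[0.7]]))  # [1]
```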
What Happens at Prediction Time
A user asks: "Will MI beat CSK tomorrow?"
```python
# Step 1: Compute features from historical data
# (column names must match the training columns exactly)
features = {
    "h2h_win_rate_t1": 0.54,      # MI vs CSK history
    "overall_win_rate_t1": 0.55,  # MI all-time
    "venue_win_rate_t1": 0.60,    # MI at this ground
    "rolling_win_rate_t1": 0.52,  # MI last 5 matches
    "toss_winner_is_team1": 1,    # MI won toss
    # ...plus the t2 rates, team1/team2/venue, and toss_decision
}

# Step 2: Pass to pipeline
X = pd.DataFrame([features])
prediction = pipeline.predict_proba(X)[0]
# Result: [0.38, 0.62] (38% CSK, 62% MI)

# Step 3: Return to user
{
    "winner": "Mumbai Indians",
    "confidence": 0.62,
}
```
Common Mistakes I Almost Made
Mistake 1: Hardcoding Historical Rates
```python
# ❌ WRONG
def predict(team1, team2):
    h2h = 0.54  # Hardcoded!
    return model.predict(...h2h...)

# Problem: when the underlying data changes, the hardcoded value silently goes stale
```
Mistake 2: Not Normalizing Team Names
```python
# ❌ WRONG
h2h_mi_vs_csk = matches[
    (matches["team1"] == "Mumbai Indians") &  # What if the data has "MI"?
    (matches["team2"] == "Chennai Super Kings")
]
# Result: the filter matches nothing and the query fails silently
```
✅ Right Way
```python
# Normalize once at load time
teams = normalize_teams(df)

# Then all queries work
h2h = _h2h_win_rate(teams, "Mumbai Indians", "Chennai Super Kings", date)
```
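The post calls `normalize_teams` without showing it; a minimal sketch of what it might look like (the alias table and column list are hypothetical):

```python
import pandas as pd

# Hypothetical alias table: map every variant to one canonical name
ALIASES = {
    "mi": "Mumbai Indians",
    "mumbai indians": "Mumbai Indians",
    "csk": "Chennai Super Kings",
    "chennai super kings": "Chennai Super Kings",
}

def normalize_teams(df, cols=("batting_team", "bowling_team", "winner")):
    df = df.copy()
    for col in cols:
        key = df[col].str.strip().str.lower()   # kill whitespace/case variants
        df[col] = key.map(ALIASES).fillna(df[col])  # leave unknowns untouched
    return df

raw = pd.DataFrame({"batting_team": ["MI "], "bowling_team": ["csk"],
                    "winner": ["Mumbai Indians"]})
print(normalize_teams(raw).iloc[0].tolist())
# ['Mumbai Indians', 'Chennai Super Kings', 'Mumbai Indians']
```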
The One Insight
The difference between 46% (baseline) and 61.8% (Gradient Boosting) came from proper feature engineering, not fancy algorithms.
Four carefully crafted feature families (H2H, Overall, Venue, Rolling) lifted accuracy to roughly 4.9x the 12.5% random baseline.
What's in Part 3 (Q&A Engine)
Next post: I'll show you how to build a 42,000-answer Q&A system without any APIs:
✅ Generate Q&A pairs from CSV
✅ TF-IDF vectorization explained
✅ <5ms query retrieval
✅ Zero hallucinations
✅ Why not LLMs?
Sneak preview: TF-IDF can't match LLMs at understanding meaning, but it's 100% faithful to its own data and costs $0 per query.
This is Part 2 of 5. Subscribe to follow the series! 🏏