Building the Perfect Prediction Engine — Feature Engineering Without Cheating

Part 2 of 5 | ← Part 1 | Part 3 → | View Series

The Engineering Challenge

I had 13 pre-match features to compute for 1,076 matches, using only historical data.

The naive approach: Loop through each match, compute features. Takes 45 minutes.

The question: How do you scale feature engineering?

The answer: Vectorized pandas operations + smart grouping.
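The helper functions below are written as readable per-match filters; the vectorized version of the same idea looks like this (toy data and hypothetical column names, not the post's real frame): a shifted expanding mean computes every team's leak-free pre-match win rate in one pass, with no Python loop.

```python
import pandas as pd

# Toy long-format frame: one row per (match, team), hypothetical columns.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "team": ["MI", "MI", "CSK", "MI"],
    "won":  [1, 0, 1, 1],
}).sort_values("date")

# shift(1) drops the current match, so each rate uses strictly earlier
# results — the same guard as `date < before_date`, computed for all rows at once.
df["overall_rate_pre"] = (
    df.groupby("team")["won"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df["overall_rate_pre"].tolist())  # [nan, 1.0, nan, 0.5]
```

The `groupby` supplies the "smart grouping"; `shift(1)` is what keeps the feature honest.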


Four Historical Rate Functions

1️⃣ Head-to-Head Win Rate

How many times has Team A beaten Team B historically?

def _h2h_win_rate(matches, team1, team2, before_date):
    """Win rate of team1 vs team2 in all past matches."""
    past = matches[
        (matches["date"] < before_date) &  # ← No future data
        (
            ((matches["batting_team"] == team1) & (matches["bowling_team"] == team2)) |
            ((matches["batting_team"] == team2) & (matches["bowling_team"] == team1))
        )
    ]
    if past.empty:
        return 0.5  # Neutral if no history

    wins_by_team1 = (past["winner"] == team1).sum()
    return wins_by_team1 / len(past)

Real example: MI vs CSK

38 past matches (filtered by before_date)
MI won: 18
CSK won: 20
→ h2h_rate_MI = 18/38 = 0.474 (47.4%)

Key insight: The line matches["date"] < before_date is the data leakage guard.
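To watch the guard doing its job, here's a standalone sanity check (the helper is repeated from above so the snippet runs on its own; the fixture dates are invented): the 2024-02-01 match is in the frame but correctly excluded.

```python
import pandas as pd

def _h2h_win_rate(matches, team1, team2, before_date):
    """Win rate of team1 vs team2 in all past matches (as defined above)."""
    past = matches[
        (matches["date"] < before_date) &
        (
            ((matches["batting_team"] == team1) & (matches["bowling_team"] == team2)) |
            ((matches["batting_team"] == team2) & (matches["bowling_team"] == team1))
        )
    ]
    if past.empty:
        return 0.5
    return (past["winner"] == team1).sum() / len(past)

# Three past MI-CSK meetings plus one FUTURE match the date guard must drop.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-01", "2023-05-01", "2023-06-01", "2024-02-01"]),
    "batting_team": ["MI", "CSK", "MI", "MI"],
    "bowling_team": ["CSK", "MI", "CSK", "CSK"],
    "winner": ["MI", "MI", "CSK", "MI"],
})

print(round(_h2h_win_rate(toy, "MI", "CSK", pd.Timestamp("2024-01-20")), 3))  # 0.667
```

Two MI wins in three past meetings; the future MI win never enters the count.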


2️⃣ Overall Win Rate

What percentage of matches does this team win (all-time)?

def _overall_win_rate(matches, team, before_date):
    """Win percentage across all past matches."""
    past = matches[
        (matches["date"] < before_date) &
        ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ]
    if past.empty:
        return 0.5  # Neutral if no history (avoids division by zero)
    return (past["winner"] == team).sum() / len(past)

Real example:

Team: Mumbai Indians
Matches before 2024-01-01: 200
Wins: 90
→ overall_rate = 45% (structurally strong team)

Team: Kings XI Punjab  
Matches: 200
Wins: 70
→ overall_rate = 35% (historically weaker)

Why it matters: The model learns team quality. MI at 45% vs KXIP at 35% = significant signal.


3️⃣ Venue Win Rate

How does this team perform at THIS specific ground?

def _venue_win_rate(matches, team, venue, before_date):
    """Win rate at a specific venue."""
    past = matches[
        (matches["date"] < before_date) &
        (matches["venue"] == venue) &  # ← Venue-specific
        ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ]
    return (past["winner"] == team).sum() / len(past) if not past.empty else 0.5

Real example:

Ground                      MI Record   Win Rate
Wankhede (Mumbai)           18-10       64%
Eden Gardens (Kolkata)      4-11        27%
Narendra Modi (Ahmedabad)   8-7         53%

Key insight: Home field advantage is HUGE. MI dominates at their home ground, struggles at KKR's fortress.


4️⃣ Rolling Win Rate (Momentum)

How has this team performed in their LAST 5 matches?

def _rolling_win_rate(matches, team, before_date, n=5):
    """Recent form: last n matches."""
    past = matches[
        (matches["date"] < before_date) &
        ((matches["batting_team"] == team) | (matches["bowling_team"] == team))
    ].tail(n)  # ← Last n rows only (assumes matches are sorted by date)

    return (past["winner"] == team).sum() / len(past) if not past.empty else 0.5

Why momentum matters:

  • Winning streak (5-0): Confidence high, execution sharp → +15% win probability
  • Losing streak (0-5): Confidence shot, making mistakes → -15% win probability

Real example:

Match: MI vs CSK on 2024-01-20

MI's last 5 matches:
2024-01-12: Loss
2024-01-14: Loss
2024-01-16: Win ← Starting to recover
2024-01-17: Win ← Confidence building
2024-01-19: Win ← On fire
→ rolling_rate_MI = 3/5 = 60%

CSK's last 5 matches:
2024-01-10: Loss
2024-01-11: Loss
2024-01-13: Loss
2024-01-15: Loss
2024-01-18: Loss
→ rolling_rate_CSK = 0/5 = 0% (complete collapse)

Model says: MI is 60% (form) vs CSK's 0% (form)
→ Prediction favors MI heavily
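The same form number falls out of a vectorized rolling mean (a sketch on the example's win/loss sequence, not the post's pipeline):

```python
import pandas as pd

# MI's last five results from the example, oldest first (0 = loss, 1 = win).
results = pd.Series([0, 0, 1, 1, 1])  # L, L, W, W, W

# Form going INTO each match: shift(1) keeps a game out of its own feature.
pre_match_form = results.shift(1).rolling(5, min_periods=1).mean()

# Form entering the NEXT match (vs CSK) is the mean of the last five results.
print(results.rolling(5, min_periods=1).mean().iloc[-1])  # 0.6
```

`shift(1)` plays the same role here as the `date < before_date` filter in the helper.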

Assembling Features: engineer_features()

def engineer_features(df):
    """
    Main feature engineering function.
    Input: Raw DataFrame (2,217 rows)
    Output: ML-ready DataFrame (932 rows, 13 features)
    """
    # Step 1: One row per match
    matches = df[df["innings"] == 1].copy().reset_index(drop=True)

    # Step 2: Chronological order (CRITICAL)
    matches = matches.sort_values("date").reset_index(drop=True)

    # Step 3: Compute features for each match
    rows = []
    for _, row in matches.iterrows():
        t1, t2 = row["batting_team"], row["bowling_team"]
        date = row["date"]
        venue = row["venue"]
        winner = row["winner"]

        if pd.isna(winner):
            continue  # Skip abandoned matches

        rows.append({
            "team1": t1,
            "team2": t2,
            "venue": venue,
            "toss_winner_is_team1": 1 if row["toss_winner"] == t1 else 0,
            "toss_decision": row.get("toss_decision", "unknown"),

            # Historical rates (all filtered by date < before_date)
            "h2h_win_rate_t1": _h2h_win_rate(matches, t1, t2, date),
            "overall_win_rate_t1": _overall_win_rate(matches, t1, date),
            "overall_win_rate_t2": _overall_win_rate(matches, t2, date),
            "venue_win_rate_t1": _venue_win_rate(matches, t1, venue, date),
            "venue_win_rate_t2": _venue_win_rate(matches, t2, venue, date),
            "rolling_win_rate_t1": _rolling_win_rate(matches, t1, date),
            "rolling_win_rate_t2": _rolling_win_rate(matches, t2, date),

            "winner": winner,
            "date": date,
        })

    return pd.DataFrame(rows)

Output:

Shape: (932, 14)
- 932 matches (2008-2024, cleaned)
- 14 columns (3 metadata + 2 toss + 7 rates + 1 target + 1 date)

The Critical Split: Time-Based vs Random

❌ Wrong Way (Random Split)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

Problem: Mixes past and future randomly.

Training set might have: 2024, 2008, 2021, 2010, ...
Test set might have: 2015, 2022, 2009, 2019, ...

Model trains on some 2024 data → tests on some 2008 data
Completely backwards!

Result: 95% test accuracy (fake), 30% production accuracy (real)

✅ Right Way (Temporal Split)

split_date = pd.Timestamp("2023-01-01")
train_df = engineered_df[engineered_df["date"] < split_date]
test_df = engineered_df[engineered_df["date"] >= split_date]

Timeline:

2008 ────────────────────────────── 2023 ──────── 2024
├─── TRAIN (15 years, 788 matches) ─┤├─ TEST (2 years, 144 matches) ─┤
      Learned from history              Never seen before
      ↓ Train model
                                        ↓ Evaluate model

Result: 61.8% test accuracy (real), because model only sees truly new data.
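The same rule extends to cross-validation: a shuffled KFold leaks future matches into training folds. For honest CV numbers on temporal data, scikit-learn's `TimeSeriesSplit` is one option (toy indices below; the post doesn't say which splitter produced its CV means):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 matches in chronological order

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Every validation fold sits strictly after the matches it trained on.
    print(f"train ends at {train_idx.max()}, test starts at {test_idx.min()}")
```

Each fold mimics the real deployment setting: predict matches that happen after everything you learned from.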


Model Comparison: 4 Algorithms

I tested four algorithms on the same data:

candidates = {
    "Logistic Regression": LogisticRegression(...),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=8),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, max_depth=4),
    "XGBoost": XGBClassifier(n_estimators=200, max_depth=4),
}

Results

Model                 Test Accuracy   CV Mean   Interpretation
Logistic Regression   46.5%           48.9%     Linear assumption too simple
Random Forest         50.7%           53.4%     Better, handles non-linearity
Gradient Boosting     61.8%           48.3%     BEST — learns patterns
XGBoost               58.3%           47.1%     Close 2nd, slight overfitting

61.8% accuracy is about 4.9x better than random guessing (12.5% with 8 teams).

Why Gradient Boosting won:

  • Logistic: Linear model → Can't capture complex interactions
  • Random Forest: Decent → But bags trees randomly
  • Gradient Boosting: Each tree corrects previous errors → learns complex patterns
  • XGBoost: Similar to GB, but slight overfitting

Feature Importance: What Actually Matters

# From the trained Gradient Boosting model
feature_importance = {
    "overall_win_rate_t1": 0.152,     # 15.2% — Most important
    "h2h_win_rate_t1": 0.128,         # 12.8%
    "toss_winner_is_team1": 0.053,    # 5.3% — Much less than you'd think!
    "rolling_win_rate_t1": 0.049,     # 4.9%
    "venue_win_rate_t1": 0.042,       # 4.2%
}
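Numbers like these come from the fitted model's `feature_importances_` attribute. A toy run shows the mechanics (synthetic data, not the post's model): the informative feature soaks up nearly all of the importance mass.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 2))            # feature 0 drives the label, feature 1 is noise
y = (X[:, 0] > 0.5).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=2).fit(X, y)
importance = dict(zip(["signal", "noise"], model.feature_importances_))
print(importance)  # "signal" dominates
```

Importances always sum to 1.0, so each value reads directly as a share of the model's attention.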

Key insights:

  1. Overall strength matters most (15.2%)

    • A fundamentally strong team (high overall_win_rate) beats almost everything
  2. H2H record matters (12.8%)

    • Historical matchups have real signal
  3. Toss hardly matters (5.3%)

    • Myth: Winning toss determines match
    • Reality: Toss decides maybe 5% of outcomes
  4. Momentum weak signal (4.9%)

    • Surprising! Recent form less predictive than we think
  5. Venue advantage real but modest (4.2%)

    • Home field helps, but not as much as team quality

The Training Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric features: standardize (mean 0, std 1)
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: one-hot encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))  # sparse_output in sklearn >= 1.2
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Full pipeline: preprocess → model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=4))
])

# Train on 2008-2022, evaluate on 2023-2024
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)  # 0.618 ✓

Model Serialization

bundle = {
    "pipeline": pipeline,           # Preprocessor + model
    "features": feature_names,      # For documentation
    "label_encoder": label_encoder, # Team names → indices
    "metrics": {
        "test_accuracy": 0.618,
        "cv_mean": 0.483,
        "cv_std": 0.028,
    },
    "feature_importance": importance_dict,
}

joblib.dump(bundle, "models/model.joblib")
# Result: ~30MB file with everything needed for predictions
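Loading is the mirror image. A round-trip with a toy bundle (the real bundle also carries the fitted pipeline and label encoder):

```python
import os
import tempfile

import joblib

# Toy bundle serialized the same way the post serializes its model.
bundle = {"metrics": {"test_accuracy": 0.618, "cv_mean": 0.483}}
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(bundle, path)

restored = joblib.load(path)
print(restored["metrics"]["test_accuracy"])  # 0.618
```

Bundling metrics and feature names next to the pipeline means the serving code never has to guess what the artifact contains.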

What Happens at Prediction Time

A user asks: "Will MI beat CSK tomorrow?"

# Step 1: Compute features from historical data
features = {
    "h2h_rate": 0.54,        # MI vs CSK history
    "overall_rate": 0.55,    # MI all-time
    "venue_rate": 0.60,      # MI at this ground
    "rolling_rate": 0.52,    # MI last 5 matches
    "toss_win": 1,           # MI won toss
}

# Step 2: Pass to pipeline (columns must match the training feature names)
X = pd.DataFrame([features])
prediction = pipeline.predict_proba(X)[0]
# Result: [0.38, 0.62] (38% CSK, 62% MI)

# Step 3: Return to user
{
    "winner": "Mumbai Indians",
    "confidence": 0.62,
}

Common Mistakes I Almost Made

Mistake 1: Hardcoding Historical Rates

# ❌ WRONG
def predict(team1, team2):
    h2h = 0.54  # Hardcoded!
    return model.predict(...h2h...)

# Problem: What if data changes? You hardcode forever

Mistake 2: Not Normalizing Team Names

# ❌ WRONG
h2h_mi_vs_csk = matches[
    (matches["team1"] == "Mumbai Indians") &  # What if data has "MI"?
    (matches["team2"] == "Chennai Super Kings")
]

# Result: Query fails silently

✅ Right Way

# Normalize once at load time
teams = normalize_teams(df)

# Then all queries work
h2h = _h2h_win_rate(teams, "Mumbai Indians", "Chennai Super Kings", date)
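The post doesn't show `normalize_teams` itself; one plausible sketch is an alias map applied to every team column (the alias table and column names here are illustrative):

```python
import pandas as pd

# Hypothetical alias map — the real normalize_teams isn't shown in the post.
ALIASES = {"MI": "Mumbai Indians", "CSK": "Chennai Super Kings"}

def normalize_teams(df):
    out = df.copy()
    for col in ("batting_team", "bowling_team", "winner", "toss_winner"):
        if col in out.columns:
            out[col] = out[col].replace(ALIASES)
    return out

raw = pd.DataFrame({"batting_team": ["MI"], "bowling_team": ["CSK"], "winner": ["MI"]})
print(normalize_teams(raw)["winner"].iloc[0])  # Mumbai Indians
```

Normalizing once at load time means every downstream filter compares canonical names, so mismatches can't fail silently.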

The One Insight

The difference between 46% (baseline) and 61.8% (Gradient Boosting) came from proper feature engineering, not fancy algorithms.

Four carefully crafted feature families (H2H, Overall, Venue, Rolling) beat random guessing (12.5%) by roughly 5x.


What's in Part 3 (Q&A Engine)

Next post: I'll show you how to build a 42,000-answer Q&A system without any APIs:

✅ Generate Q&A pairs from CSV

✅ TF-IDF vectorization explained

✅ <5ms query retrieval

✅ Zero hallucinations

✅ Why not LLMs?

Sneak preview: TF-IDF is weaker than LLMs at understanding meaning, but it's 100% accurate and costs $0 per query.


This is Part 2 of 5. Subscribe to follow the series! 🏏

← Part 1: The Problem | Part 3: Q&A Engine →
