Previously in This Series
Day 2 sorted machine learning problems into four types by what's given to the algorithm. Today we zoom into the "given" part itself — the actual columns and answers sitting in your dataset — because most real ML failures trace back to something wrong there, not to the algorithm you picked.
Learning Objectives
Define what actually makes a feature useful, beyond "a column in a spreadsheet."
Spot feature leakage — a feature that secretly contains the answer — before it reaches production, not after.
See, with real numbers, why label quality can matter more than which algorithm you choose.
Build the habit of asking "would I actually have this information at prediction time?" before trusting any feature.
Why This Matters
Feature leakage is one of the most common, most expensive, and most embarrassing mistakes in real ML systems — models that look great in testing and then quietly fail, or get pulled from production entirely, because the feature that made them "great" was never actually available at decision time.
It's also a favorite senior-level interview question: "you're shown a model with suspiciously good performance — what's the first thing you check?" The expected answer is almost always leakage, not algorithm tuning.
And it sets up two things directly ahead in this series: Phase 3's Feature Engineering is the constructive half of what you're learning today, and Phase 5's production monitoring exists largely because leakage and label problems are so easy to miss until real money or real users are involved.
Mental Model
Think of training a model like building a court case.
Features are the evidence a detective would actually have access to at the time — fingerprints, timestamps, statements taken before the verdict was known.
A leaked feature is evidence that could only exist after the verdict — like a confession transcript that got mixed into the evidence pile by mistake. It makes the case look airtight in the file room, and falls apart the instant a real case lands on the desk without it.
Labels are witness testimony. Useful, but only as good as the witness. A model trained on confidently wrong testimony doesn't get suspicious about it — it just learns to be confidently wrong in exactly the same way.
History in 60 Seconds
1936 — Ronald Fisher uses linear discriminant analysis on the iris flower measurements (collected by botanist Edgar Anderson, and the same dataset from Day 2) — among the earliest formal uses of measured "features" to separate categories statistically.
1960s–1980s — early statistical and expert systems rely almost entirely on hand-picked features; the practitioner's judgment in choosing them matters as much as the model itself.
1997 — Tom Mitchell's Machine Learning textbook (referenced on Day 1) formalizes "feature" and "label" as the standard vocabulary still used today.
2011 — Kaufman, Rosset, and Perlich present "Leakage in Data Mining" at KDD, formalizing this lesson's central problem after watching it sink one data mining competition entry after another. An expanded journal version, adding co-author Ori Stitelman, followed in 2012 — the version cited in References below.
2010s — Kaggle competitions cement "feature engineering wins competitions" as conventional wisdom in classical ML, often more decisive than algorithm choice.
2012 onward — deep learning starts automating feature extraction for unstructured data like images and audio. For structured, tabular data — most of what businesses actually have — hand-built features remain decisive even now, which is exactly why this lesson isn't going away.
Key Terminology Table
| Term | Meaning |
|---|---|
| Feature | A measurable input variable used to make a prediction |
| Label | The known correct answer attached to a training example |
| Feature leakage | A feature that, directly or indirectly, contains information about the label that wouldn't actually be available at prediction time |
| Label noise | Incorrect or inconsistent labels present in your training data |
| Ground truth | The real-world correct answer a label is supposed to represent |
| One-hot encoding | Converting a categorical feature into separate binary columns a model can use |
| Multicollinearity | When two or more features carry largely the same information, confusing how a model attributes importance between them |
| Feature engineering | Creating, transforming, or selecting features to improve what a model can learn |
Core Concepts
What Actually Makes a Feature "Good"
A useful feature needs to clear four bars, and most feature mistakes are a failure on exactly one of them:
It has a genuine relationship with the target. If a column doesn't actually correlate with what you're predicting, it's just noise the model has to learn to ignore.
It's available at prediction time, in the same form. This is the bar feature leakage fails — more on that in a moment.
It actually varies. A column that's the same value for every row carries zero information, no matter how relevant it sounds conceptually.
It's not a near-duplicate of another feature. Two features that carry almost the same information create multicollinearity — a linear model can't cleanly tell which one deserves the credit, so it often splits weight between them in a way that hurts interpretability even when it doesn't hurt raw accuracy.
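Two of these bars, "actually varies" and "not a near-duplicate", are cheap to check mechanically. The sketch below is one possible screen, not a standard tool: the function name screen_features, the 0.95 correlation threshold, and the toy columns are all assumptions made for illustration.

```python
import pandas as pd

def screen_features(X: pd.DataFrame, corr_threshold: float = 0.95) -> dict:
    """Flag constant columns and highly correlated numeric column pairs."""
    # A column with a single distinct value carries no information.
    constant = [col for col in X.columns if X[col].nunique(dropna=False) <= 1]

    # Absolute pairwise Pearson correlation between the remaining numeric columns.
    numeric = X.select_dtypes(include="number").drop(columns=constant, errors="ignore")
    corr = numeric.corr().abs()

    near_duplicates = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= corr_threshold:
                near_duplicates.append((a, b))
    return {"constant": constant, "near_duplicates": near_duplicates}

# Hypothetical toy data: birth_year mirrors age, and country never varies.
toy = pd.DataFrame({
    "age": [22, 35, 47, 58, 63],
    "birth_year": [2002, 1989, 1977, 1966, 1961],
    "income": [30, 52, 41, 75, 38],
    "country": ["US"] * 5,
})
report = screen_features(toy)
print(report)
```

A screen like this only catches linear near-duplicates between numeric columns, and it says nothing about the first two bars, which still need domain knowledge.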
Feature Leakage: The Mistake That Hides Inside Good Numbers
Feature leakage happens when a feature contains information about the label that wouldn't actually exist yet at the moment you need to make a prediction. The test is simple to state and easy to forget under deadline pressure: would I have this exact value, in this exact form, at the moment I actually need to predict?
Real examples beyond today's hands-on demo:
Predicting loan default using a days_past_due field, which by definition only has a meaningful value once a payment is already late.
Predicting hospital readmission using a discharge_disposition field that gets recorded based on how the patient's case actually concluded.
Predicting customer churn using a cancellation_date column that's only populated after the customer has already churned.
In every case, the feature isn't "wrong" in the database — it's wrong for this prediction task, because it's only known after the answer is already settled.
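If your data records when each field was written, the "known only after the answer" problem can be checked directly. This is a minimal sketch under that assumption; the table, the column names (decision_time, income_recorded_at, days_past_due_recorded_at), and the helper late_features are hypothetical.

```python
import pandas as pd

# Hypothetical loan records: when the decision is made, and when each field was written.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "decision_time": pd.to_datetime(["2024-01-10", "2024-02-01", "2024-03-15"]),
    "income_recorded_at": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-03-01"]),
    "days_past_due_recorded_at": pd.to_datetime(["2024-04-10", "2024-05-02", "2024-06-20"]),
})

def late_features(df, decision_col, timestamp_cols):
    """Return, per field, the share of rows recorded after the decision time."""
    return {col: float((df[col] > df[decision_col]).mean()) for col in timestamp_cols}

shares = late_features(
    loans, "decision_time", ["income_recorded_at", "days_past_due_recorded_at"]
)
print(shares)  # {'income_recorded_at': 0.0, 'days_past_due_recorded_at': 1.0}
```

Any field with a non-zero share deserves a closer look before it goes anywhere near a training set. Many real tables don't store write timestamps at all, in which case the same question has to be answered by asking whoever owns the pipeline.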
Labels Are Data Too — and They're Often Wrong
Labels feel like ground truth because someone wrote them down with confidence. In practice, labels come from human annotators who disagree with each other, automated heuristics that get edge cases wrong, or definitions that quietly drift over time. All of that shows up as label noise: training examples where the recorded answer doesn't match reality.
Here's the part that's worth sitting with mathematically, briefly: if you flip a fraction p of your labels completely at random, the model is still learning from a signal that's correlated with the truth, just a weaker one, for any p below 0.5. The signal shrinks steadily as p grows, but the direction it points in stays correct, so a model with enough data can still find roughly the right decision boundary. At p = 0.5 the correlation is exactly zero, because a coin flip is equally likely to match the truth or not, regardless of what the truth actually is. In terms of accuracy, that's not a smooth decline. It's a cliff, and you'll see roughly where it sits in today's experiment.
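For readers who want the one-line derivation, here is a sketch under a simplifying assumption: each label is flipped independently with probability p, regardless of the features. The symbols are introduced here for illustration and are not from the original text: η(x) is the true probability that an example with features x is positive, and η̃(x) is the same probability under the noisy labels.

```latex
\tilde{\eta}(x) = (1-p)\,\eta(x) + p\,\bigl(1-\eta(x)\bigr) = p + (1-2p)\,\eta(x)
\quad\Longrightarrow\quad
\tilde{\eta}(x) - \tfrac{1}{2} = (1-2p)\,\bigl(\eta(x) - \tfrac{1}{2}\bigr)
```

For any p below 0.5 the factor (1 - 2p) is positive, so the noisy labels still favor the correct class at every x and the ideal decision boundary doesn't move. The margin around 0.5 just gets thinner, which makes the boundary harder to estimate from a finite sample. At p = 0.5 the factor is zero and nothing is left to learn. Today's script flips an exact count of labels rather than flipping each one independently, so this is an approximation of what it does, not an exact description.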
Visual Explanations
A quick litmus test for any feature you're about to add:
flowchart TD
F[Candidate feature] --> Q1{Would this value be known,<br/>in this exact form,<br/>at prediction time?}
Q1 -->|No| Leak[Leakage — fix the pipeline<br/>or drop the feature]
Q1 -->|Yes| Q2{Does it just re-encode<br/>the label itself?}
Q2 -->|Yes| Leak
Q2 -->|No| Q3{Does it actually vary,<br/>and relate to the outcome?}
Q3 -->|No| Weak[Low-value — consider dropping]
Q3 -->|Yes| Good[Legitimate, useful feature]
And a few real leakage patterns, side by side:
| Domain | Leaky feature | Why it leaks |
|---|---|---|
| Titanic survival (today's demo) | alive | Literally a re-encoding of the label |
| Loan default | days_past_due | Only exists once the loan is already in trouble |
| Hospital readmission | discharge_disposition | Recorded based on how the case actually concluded |
| Customer churn | cancellation_date | Only populated after the customer has already churned |
Hands-On Example
We'll use the same Titanic passenger dataset many of you have seen before, for a specific reason: it contains a real, built-in leakage trap, not a contrived one.
Train a model on legitimate features — class, sex, age, family size, fare, and embarkation port. All of these are genuinely known before the ship sank.
"Accidentally" add one extra column: alive. This field is just a "yes"/"no" version of the label we're trying to predict. We'll watch what that does to accuracy, and talk about why that result should worry you, not impress you.
Deliberately corrupt a percentage of the training labels and watch how accuracy degrades as that percentage climbs, using the legitimate features only, so we isolate label quality as the one variable changing.
Environment Setup
python -m venv venv
# macOS/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
pip install numpy pandas seaborn scikit-learn
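Note: seaborn is used here purely as a convenient, reliable loader for the Titanic dataset; we won't be making any plots with it. sns.load_dataset downloads the data from seaborn's online data repository on first use, so the script needs network access the first time it runs.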
Complete Working Code
Every number in the breakdown below came from actually running this script.
"""
day_03_features_and_labels.py
A real feature-leakage demonstration and a real label-noise experiment,
both run on the Titanic passenger dataset.
"""
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
df = sns.load_dataset("titanic")
# --- Clean a legitimate, leakage-free feature set ---
df_clean = df.copy()
df_clean["age"] = df_clean["age"].fillna(df_clean["age"].median())
df_clean["embarked"] = df_clean["embarked"].fillna(df_clean["embarked"].mode()[0])
df_clean["sex"] = df_clean["sex"].map({"male": 0, "female": 1})
df_clean["embarked"] = df_clean["embarked"].map({"S": 0, "C": 1, "Q": 2})
legit_features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
X_legit = df_clean[legit_features]
y = df_clean["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X_legit, y, test_size=0.3, random_state=42, stratify=y
)
# --- 1. Model trained on legitimate features only ---
model_legit = LogisticRegression(max_iter=500)
model_legit.fit(X_train, y_train)
legit_acc = accuracy_score(y_test, model_legit.predict(X_test))
print("=== LEGITIMATE FEATURES ===")
print(f"Test accuracy: {legit_acc:.3f}")
# --- 2. Same setup, but with a leaked feature added: 'alive' is 'survived' in disguise ---
df_leaky = df_clean.copy()
df_leaky["alive_encoded"] = df["alive"].map({"no": 0, "yes": 1})
leaky_features = legit_features + ["alive_encoded"]
X_leaky = df_leaky[leaky_features]
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(
    X_leaky, y, test_size=0.3, random_state=42, stratify=y
)
model_leaky = LogisticRegression(max_iter=500)
model_leaky.fit(X_train_l, y_train_l)
leaky_acc = accuracy_score(y_test_l, model_leaky.predict(X_test_l))
print("\n=== WITH LEAKED FEATURE ('alive') ===")
print(f"Test accuracy: {leaky_acc:.3f}")
# --- 3. Label noise experiment: corrupt a % of TRAINING labels, test labels stay clean ---
print("\n=== LABEL NOISE EXPERIMENT (legitimate features only) ===")
rng = np.random.RandomState(42)
for noise_level in [0.0, 0.1, 0.3, 0.5]:
    # Flip a fixed share of the training labels; the test labels stay clean.
    y_train_noisy = np.array(y_train, dtype=int).copy()
    n_flip = int(len(y_train_noisy) * noise_level)
    flip_idx = rng.choice(len(y_train_noisy), size=n_flip, replace=False)
    y_train_noisy[flip_idx] = 1 - y_train_noisy[flip_idx]
    m = LogisticRegression(max_iter=500)
    m.fit(X_train, y_train_noisy)
    acc = accuracy_score(y_test, m.predict(X_test))
    print(f"Training label noise: {int(noise_level * 100):>3}% -> Test accuracy: {acc:.3f}")
majority_baseline = max(y.mean(), 1 - y.mean())
print(f"\nFor reference, majority-class baseline accuracy: {majority_baseline:.3f}")
Running this prints:
=== LEGITIMATE FEATURES ===
Test accuracy: 0.799
=== WITH LEAKED FEATURE ('alive') ===
Test accuracy: 1.000
=== LABEL NOISE EXPERIMENT (legitimate features only) ===
Training label noise: 0% -> Test accuracy: 0.799
Training label noise: 10% -> Test accuracy: 0.791
Training label noise: 30% -> Test accuracy: 0.787
Training label noise: 50% -> Test accuracy: 0.369
For reference, majority-class baseline accuracy: 0.616
Code Breakdown
The legitimate model. Using only information genuinely available before the Titanic sank — class, sex, age, family counts, fare, embarkation port — logistic regression reaches 79.9% test accuracy. That's a real, respectable, very typical number for this dataset, and it's the honest baseline everything else gets compared against.
The leaked model. Adding one column, alive, pushes test accuracy to a perfect 1.000. Look back at the code: alive_encoded is built by mapping alive's "yes"/"no" values directly onto the same 1/0 values as the survived label. We didn't add a powerful new feature; we added the answer key, lightly disguised as a column. The model didn't get smarter. It just stopped having to predict anything.
This is the entire point of the leakage litmus test. If you saw 1.000 accuracy on a real project without knowing in advance which column was the trap, your honest reaction should be suspicion, not celebration.
The label noise experiment. Notice the shape of the degradation: from 0% to 30% noise, accuracy barely moves — 0.799 down to 0.787, a rounding error in practical terms. Then, at 50% noise, accuracy collapses to 0.369 — worse than the 0.616 you'd get by just guessing the majority class every time. That's the cliff predicted by the math in Core Concepts: at exactly 50% random label corruption, the training labels carry essentially zero usable signal, and the specific corrupted pattern the model latched onto here happened to actively point it in the wrong direction on the clean test set. The lesson isn't "noise is fine up to 30% and then it's not" — it's that label noise doesn't degrade performance smoothly, and you can't assume you're safely far from the cliff just because moderate noise hasn't hurt you yet.
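When you don't know in advance which column is the trap, one way to hunt for it is a drop-one-feature ablation: retrain without each feature in turn and see which removal costs the most accuracy. The sketch below uses synthetic stand-in data so it runs without a download. The helper drop_one_ablation and the column names signal, noise, and leak are hypothetical, and this is one possible approach rather than the only way to find a leak.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def drop_one_ablation(X, y, random_state=42):
    """Return full-model test accuracy and the accuracy lost by dropping each feature."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=random_state, stratify=y
    )

    def fit_score(cols):
        model = LogisticRegression(max_iter=500)
        model.fit(X_tr[cols], y_tr)
        return model.score(X_te[cols], y_te)

    all_cols = list(X.columns)
    full = fit_score(all_cols)
    drops = {
        col: full - fit_score([c for c in all_cols if c != col]) for col in all_cols
    }
    return full, pd.Series(drops).sort_values(ascending=False)

# Synthetic stand-in data: one weakly predictive column, one pure-noise column,
# and one column that is an exact copy of the label.
rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)
y = (signal + rng.normal(scale=1.5, size=n) > 0).astype(int)
X = pd.DataFrame({"signal": signal, "noise": rng.normal(size=n), "leak": y})

full_acc, drops = drop_one_ablation(X, y)
print(f"Full-model accuracy: {full_acc:.3f}")
print(drops)
```

A single feature whose removal wipes out most of the model's performance isn't proof of leakage, but it tells you exactly which column to run through the prediction-time question first.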
Common Mistakes
Treating high accuracy as automatically good news. Suspiciously perfect performance is a bug report waiting to be read, not a result to ship.
Including post-outcome fields by accident. Status fields, timestamps, and flags that get updated after the event you're predicting are leakage's favorite hiding spot.
Trusting labels as ground truth without checking their source. A label generated by an inconsistent heuristic or a disagreeing pair of human annotators is not the same thing as reality.
Testing on labels that are just as noisy as your training labels. If you can't trust your test set, you can't tell real model degradation from measurement noise in your evaluation itself.
Dropping a "weak" feature too early. A feature with a weak standalone correlation can still meaningfully help a model in combination with others — check feature value in context, not in isolation, before discarding it.
Forgetting that one-hot encoded categories need to handle values the model has never seen. A category that shows up at serving time but never appeared in training will break a naively encoded pipeline.
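The unseen-category problem in the last item has a direct fix in scikit-learn. As a minimal sketch, OneHotEncoder with handle_unknown="ignore" encodes a category it never saw during fitting as an all-zero row instead of raising an error. The port value "X" below is a made-up example of such a category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen at training time.
train_ports = np.array([["S"], ["C"], ["Q"], ["S"]])
# Serving data containing "X", a hypothetical value never seen in training.
serving_ports = np.array([["C"], ["X"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_ports)

encoded = encoder.transform(serving_ports).toarray()
print(encoder.categories_[0])  # ['C' 'Q' 'S']
print(encoded)                 # [[1. 0. 0.], [0. 0. 0.]]
```

Silently encoding unknowns as zeros keeps the pipeline alive, but it also hides the fact that new categories are arriving, so it's worth counting how often it happens.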
Best Practices
For every candidate feature, ask explicitly: would I have this value, in this exact form, at the moment I actually need to predict? Make this a literal checklist item in code review, not just a mental gut check.
Treat suspiciously perfect performance as something to investigate before you celebrate it — and investigate by removing features one at a time, not by assuming the model "got lucky."
Treat label collection as a first-class engineering problem. If humans are labeling, measure inter-annotator agreement; if a heuristic is labeling, audit its edge cases the same way you'd audit code.
When you suspect your main labels are noisy, hold out a small, especially carefully verified "gold" validation set, so you have at least one trustworthy yardstick.
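Inter-annotator agreement is straightforward to measure once two people have labeled the same examples. The sketch below compares raw agreement with Cohen's kappa, which corrects for the agreement you'd expect by chance. The two label lists are invented for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten examples.
annotator_a = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
annotator_b = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1, 1])

raw_agreement = float((annotator_a == annotator_b).mean())
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"Raw agreement: {raw_agreement:.2f}")  # 0.80
print(f"Cohen's kappa: {kappa:.2f}")          # 0.58
```

Here 80% raw agreement shrinks to a kappa of about 0.58, because two annotators who each say "1" 60% of the time would agree on 52% of examples by chance alone.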
Production Perspective
Feature leakage in production most often comes from offline/online skew: a feature computed correctly in a batch training pipeline — with the benefit of full hindsight — simply isn't available, or is computed differently, in the live serving pipeline. This is one of the single most common real-world causes of "the model worked perfectly in testing and then failed in production."
Label quality in production has its own twist: labels frequently come from delayed business outcomes. Did the customer actually churn? You might not know for 90 days. That creates a permanent tension between training on the freshest data available and training on the most reliably labeled data available — and most production teams end up explicitly trading one off against the other.
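One common way to handle delayed labels is to train only on examples old enough for the outcome to be settled. This is a minimal sketch assuming a 90-day outcome window as in the churn example above. The helper mature_examples, the column names, and the dates are hypothetical.

```python
import pandas as pd

def mature_examples(df, date_col, as_of, maturity_days=90):
    """Keep rows whose outcome window has fully elapsed by `as_of`."""
    cutoff = pd.Timestamp(as_of) - pd.Timedelta(days=maturity_days)
    return df[df[date_col] <= cutoff]

# Hypothetical customer snapshots; recent rows show churned=0 only because
# their 90-day window hasn't finished yet.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "snapshot_date": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-05-15", "2024-06-20"]
    ),
    "churned": [0, 1, 0, 0],
})

trainable = mature_examples(customers, "snapshot_date", as_of="2024-07-01")
print(trainable["customer_id"].tolist())  # [1, 2]
```

The filter drops the freshest rows, which is the trade-off described above made explicit: customers 3 and 4 are excluded because their labels can't be trusted yet.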
Companies build specific monitoring around exactly these failure modes: training-serving skew detection, feature drift monitoring, and "this performance metric looks too good" alerts exist because the problems in this article are common enough to deserve their own dashboards.
Cost-wise, this is one of the most expensive mistake categories in ML. A leakage bug caught after deployment can cost far more than the entire original model build — in re-engineering, in lost trust, and in regulated domains like lending, sometimes in compliance exposure as well.
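A very basic form of training-serving skew detection compares a feature's summary statistics across the two pipelines. The sketch below measures how far each feature's serving mean has moved from its training mean, in units of the training standard deviation. The helper skew_report, the 0.5 threshold, and the example columns are assumptions for illustration; production monitoring typically uses richer distribution comparisons.

```python
import pandas as pd

def skew_report(train, serving, threshold=0.5):
    """Standardized mean shift per shared column, with a flag for large shifts."""
    rows = []
    for col in train.columns.intersection(serving.columns):
        std = train[col].std()
        diff = abs(serving[col].mean() - train[col].mean())
        if std > 0:
            shift = diff / std
        else:
            # A constant training column: any difference at all is a large shift.
            shift = 0.0 if diff == 0 else float("inf")
        rows.append({"feature": col, "shift": shift, "flagged": shift > threshold})
    return pd.DataFrame(rows).set_index("feature")

# Hypothetical data: the serving pipeline silently defaults account_age to 0.
train = pd.DataFrame({
    "fare": [10, 20, 30, 40, 50],
    "account_age": [100, 200, 300, 400, 500],
})
serving = pd.DataFrame({
    "fare": [12, 18, 33, 41, 48],
    "account_age": [0, 0, 0, 0, 0],
})

report = skew_report(train, serving)
print(report)
```

In this toy case fare barely moves while account_age is flagged, which is the signature of a feature that exists in the batch pipeline but isn't really being computed at serving time.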
Real-World Applications
Credit and loan underwriting, where a model can only legitimately use information that was actually available at the moment of the lending decision, and where the choice of inputs can carry compliance consequences.
Healthcare readmission and risk prediction, where outcome-adjacent fields are everywhere in the raw data and easy to include by mistake.
Customer churn prediction, where churn-adjacent timestamps are some of the most common leakage sources in practice.
Fraud detection, where the exact sequence and timing of events — not just their presence — determines whether a feature is legitimate or leaked.
Ad click and ranking systems, where the constant engineering battle is making sure a feature available during offline training is actually computable, in the same form, in real-time serving.
Interview Questions
1. What is feature leakage, and how would you detect it in a model that's performing suspiciously well? Feature leakage is a feature that contains information about the label which wouldn't actually be available at prediction time. To detect it, investigate unusually high performance by removing features one at a time and checking which one is responsible for the jump, and explicitly audit any feature for whether it could only exist after the outcome occurred.
2. Why might a feature that's legitimate during training become invalid at serving time? Training data is often built with full hindsight from a batch pipeline, while serving happens in real time with only the information available at that exact moment — a mismatch known as training-serving skew, and a major real-world source of leakage that isn't even visible in the training data itself.
3. How does label noise differ from feature noise, and which one usually hurts more? Feature noise affects the inputs; label noise affects the answers the model is trying to match. Label noise is generally more damaging, because a model has no way to distinguish a wrong answer from a right one — it will confidently learn whatever pattern, including a wrong one, that the noisy labels suggest.
4. If you're told a model gets 99.9% accuracy, what's the first thing you'd check? Whether any feature could be leaking the label, directly or indirectly — followed by checking whether the dataset has severe class imbalance that makes a trivial majority-class prediction look deceptively strong.
5. Give an example of multicollinearity, and explain why it's a problem even when it doesn't hurt accuracy. Two features like "age in years" and "birth year" carry almost the same information. A linear model may split importance between them somewhat arbitrarily, which doesn't necessarily hurt raw accuracy but does hurt interpretability — you can no longer trust the model's reported feature importances at face value.
6. How would you design a labeling process to minimize label noise from human annotators? Have multiple annotators label a sample of the same examples, measure their agreement rate, write clearer labeling guidelines for the cases where they disagree most, and treat persistent low-agreement categories as a signal that the task itself may need to be redefined.
7. What is training-serving skew, and how does it relate to feature leakage? It's a mismatch between how a feature is computed during training versus during live serving. It's closely related to leakage because the training-time version of a feature can accidentally include information — like later timestamps or aggregated future data — that the serving-time version structurally can't have.
8. Why might you intentionally keep a "weak" feature in a model rather than dropping it? Because standalone correlation with the target isn't the only way a feature contributes — it might interact with other features to reveal a pattern that neither feature shows on its own, something a quick univariate check would miss entirely.
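The class-imbalance check from question 4 takes only a few lines. As a sketch, scikit-learn's DummyClassifier with strategy="most_frequent" gives the accuracy of always predicting the majority class, which is the number any headline accuracy should be compared against. The 999-to-1 label split is invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical, heavily imbalanced labels: 1 positive in 1,000 examples.
y = np.array([0] * 999 + [1])
X = np.zeros((len(y), 1))  # the dummy model ignores the feature values

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")  # 0.999
```

A model reporting 99.9% accuracy on data like this has learned nothing that a constant prediction doesn't already achieve.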
Self-Assessment
Can you explain to someone non-technical why a model that's 100% accurate should worry you more than reassure you?
Could you turn the "would I have this at prediction time?" test into an actual checklist for a project you're working on?
Can you explain, in your own words, why 10% label noise and 50% label noise don't degrade a model by proportional amounts?
Can you name one feature, in any dataset you've worked with, that might secretly be leaking the answer?
Portfolio Challenges
Beginner Challenge
Using the Breast Cancer dataset (sklearn.datasets.load_breast_cancer), deliberately construct an obviously leaky derived feature from the diagnosis label, train a model with and without it, and report both accuracies — mirroring today's alive demonstration on a different dataset.
Intermediate Challenge
Re-run today's label-noise experiment at finer granularity (0%, 5%, 10%, 15%, ..., 50%) and plot or tabulate the full accuracy curve. Identify, as precisely as your data allows, where the "cliff" actually begins for this dataset, and write two sentences on whether it matches the 0.5 threshold predicted in Core Concepts.
Advanced Challenge
Engineer two new, genuinely non-leaky features from the raw Titanic columns — for example, a family_size feature from sibsp + parch, or a title feature extracted from the passenger name field — and measure whether either one improves test accuracy over today's legitimate baseline of 0.799. Report your result honestly even if it doesn't improve things; a feature that doesn't help is still a useful, valid finding.
Advanced Insights
Today's lesson is the defensive half of a pair. Phase 3's Feature Engineering is the constructive half — deliberately building better signal instead of just avoiding broken signal. Both rest on the same foundation: understanding exactly what a feature represents and exactly when it becomes available.
It's also worth knowing this problem doesn't disappear once you move to deep learning. Representation learning automates a lot of feature creation, but a leaked signal can hide inside a learned embedding just as easily as inside a hand-built column — it's just harder to spot, because you can no longer point at a single suspicious column name. The thinking from this lesson applies just as much in Phase 4 and Phase 6 as it does here; only the hiding place changes.
Key Takeaways
A feature is only useful if it's predictive and genuinely available, in the same form, at prediction time — leakage is what happens when that second condition silently fails.
Suspiciously perfect model performance is a signal to investigate, not a result to celebrate.
Label noise doesn't degrade performance smoothly. In today's experiment, accuracy held up through 30% random corruption, then fell below even the majority-class baseline at 50%.
Feature leakage and label quality usually matter more for real-world model performance than which algorithm you choose.
What's Next
Day 4 zooms out one more level: Data Quality — missing values, outliers, class imbalance, and the dataset-wide checks worth running before you train anything at all, building directly on today's column-by-column thinking.
References
Beginner
Google, Machine Learning Crash Course: includes a dedicated module on data quality and feature engineering fundamentals.
scikit-learn, User Guide: official documentation covering the preprocessing and encoding tools used in this article's code.
Intermediate
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning: free at statlearning.com, with solid coverage of variable selection and the statistical risks of correlated predictors.
Professional
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly): a thorough, practical treatment of feature engineering and data preparation pipelines. Available through O'Reilly and major booksellers.
Advanced
Shachar Kaufman, Saharon Rosset, Claudia Perlich, and Ori Stitelman, "Leakage in Data Mining: Formulation, Detection, and Avoidance," ACM Transactions on Knowledge Discovery from Data, 2012: the paper that formalized the exact problem at the center of today's lesson, based on real data mining competition failures. Official record: doi.org/10.1145/2382577.2382579.
Originally published on ZyVOP