Harsh Bhardwaj

Posted on May 29

How Not to Make Machines Learn

#programming #ai

Machine learning is fun, at least when the model is doing fine and the accuracy is through the roof. But how do you debug when the model comes back with a low score?

How do you identify the bottleneck?

The beginner instinct is usually:

“Maybe I need a better algorithm.”

Sometimes, yes.

But a lot of ML failures are not algorithm failures.
They are data failures.
Sampling failures.
Feature failures.
Evaluation failures.
Assumption failures.
This article is a practical map of the common places where ML systems break.

Also, I have kept this super short because no one needs a long story at the start. These are quick notes. Enjoy.

1. First check: do you even have enough useful data?

A well-known example is the natural language disambiguation experiment by Banko and Brill (2001).

They compared different ML algorithms on a language problem. The surprising part was that as the amount of training data increased, very different algorithms ended up performing almost equally well.

The lesson was simple but brutal:

For many complex problems, more good data can matter more than obsessing over the algorithm.

Not always. But often.

Practical situation

Suppose you are building a sentiment classifier.

You try:

Naive Bayes
Logistic Regression
SVM
Random Forest
Neural Network

But your dataset has only 300 examples.

At that point, the problem may not be “which algorithm is best?”

The real problem may be:

“The model has not seen enough language patterns to generalize.”

What to do

Get more representative data.
Use data augmentation carefully.
Use transfer learning if the domain allows it.
Start with a simple baseline before jumping to complex models.
Check whether performance improves as data size increases.
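
The last check in that list can be sketched in a few lines. This is a minimal illustration, not a real sentiment pipeline: the one-feature threshold "model", the synthetic data, and the names learning_curve, train_threshold, and accuracy are all made up for the example. The idea is to train on growing subsets and score every model on the same held-out set.

```python
import random

def learning_curve(examples, train_fn, score_fn, sizes, holdout_fraction=0.2, seed=0):
    """Train on growing subsets and score each model on one fixed holdout set."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout, pool = shuffled[:n_holdout], shuffled[n_holdout:]
    curve = []
    for size in sizes:
        subset = pool[:min(size, len(pool))]
        model = train_fn(subset)
        curve.append((len(subset), score_fn(model, holdout)))
    return curve

def train_threshold(rows):
    """Toy model: threshold halfway between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:
        return 0.5  # assumed fallback when a tiny subset has only one class
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

# Synthetic data: the label depends on x plus some noise.
rng = random.Random(42)
data = []
for _ in range(500):
    x = rng.random()
    noisy = x + rng.gauss(0, 0.1)
    data.append((x, 1 if noisy > 0.5 else 0))

curve = learning_curve(data, train_threshold, accuracy, sizes=[10, 50, 200, 400])
for n, score in curve:
    print(f"{n:4d} training examples -> holdout accuracy {score:.2f}")
```

If the score is still climbing at the largest size, more data is likely to help. If it went flat early, look at features or the model instead.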

What not to do

Do not keep switching algorithms blindly.
Do not assume deep learning will save a tiny dataset.
Do not collect more random data if it does not match the actual problem.

2. Sampling bias: your dataset may be confidently wrong

Sampling bias is one of the most dangerous ML problems because the model may look correct during training but fail in the real world.

A famous case is the 1936 Literary Digest poll.

They surveyed millions of people and predicted that Landon would defeat Roosevelt. The sample size was huge, but the prediction was badly wrong.

Why?

Because the people they reached were not representative of the full voting population. Their source lists were biased toward wealthier people, and the people who replied introduced another bias: nonresponse bias.

So the issue was not quantity.

The issue was representation.

ML version

Imagine training a placement prediction model using data mostly from students who already attend coding clubs, hackathons, and internships.

The model may conclude:

“Students with high GitHub activity are highly placeable.”

That may be partly true.

But it may ignore students who are good at core subjects, aptitude, communication, or offline projects because they were underrepresented in the data.

Your model is not learning “placement potential.”

It is learning “patterns from the type of students your dataset contains.”

What to do

Ask: who is missing from the dataset?
Compare dataset distribution with real-world distribution.
Use stratified sampling when groups matter.
Track important segments separately.
Test the model on examples outside the original collection source.
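
One rough way to compare the dataset distribution with the real-world one is to look at segment shares side by side. The sketch below assumes you already have a trusted reference distribution. The segment names and all the numbers are hypothetical, and share_gaps is just a helper name chosen for this example.

```python
from collections import Counter

def share_gaps(sample_labels, reference_shares):
    """Return {segment: sample_share - reference_share} for every known segment."""
    counts = Counter(sample_labels)
    total = len(sample_labels)
    segments = set(counts) | set(reference_shares)
    return {
        seg: counts.get(seg, 0) / total - reference_shares.get(seg, 0.0)
        for seg in segments
    }

# Hypothetical numbers, for illustration only.
sample = ["coding_club"] * 70 + ["core_subjects"] * 20 + ["offline_projects"] * 10
reference = {"coding_club": 0.30, "core_subjects": 0.40, "offline_projects": 0.30}

gaps = share_gaps(sample, reference)
for seg, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{seg:18s} {gap:+.2f}")
```

A large positive gap means the segment is overrepresented in your data. A large negative gap tells you who is missing.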

What not to do

Do not trust a dataset just because it is large.
Do not train only on convenient data.
Do not assume online data represents the real world.
Do not ignore minority cases if those cases matter in production.

3. Data quality: garbage in, garbage out is not a joke

Bad data does not become intelligent because you pass it through a model.

If the dataset has noise, wrong labels, missing values, duplicates, or weird outliers, the model will try to learn from that mess.

Practical situation

You are building a house price prediction model.

Your data contains:

missing location values
incorrect prices
duplicate listings
luxury villas mixed with normal apartments
square feet and square meters in the same column
outlier entries where price is accidentally typed with one extra zero

Now your model gives poor predictions.

The problem may not be regression.

The problem may be that your data is broken.

What to do

Check missing values.
Check duplicate rows.
Check impossible values.
Check outliers.
Standardize units.
Fix or remove corrupted entries.
Document every cleaning step.
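
A few of these checks fit in a small audit function. This is only a sketch. The column names (location, area, unit, price) and the sample rows are invented for the example, and it assumes each row carries a unit tag. If your units are mixed with no tag at all, you have to recover the unit from the data source first.

```python
SQFT_PER_SQM = 10.7639  # square feet in one square meter

def audit_listings(rows):
    """Count common data problems in a list of listing dicts."""
    report = {"missing_location": 0, "duplicates": 0, "impossible_values": 0}
    seen = set()
    for row in rows:
        if not row.get("location"):
            report["missing_location"] += 1
        key = (row.get("location"), row.get("area"), row.get("unit"), row.get("price"))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        price, area = row.get("price"), row.get("area")
        if price is None or area is None or price <= 0 or area <= 0:
            report["impossible_values"] += 1
    return report

def area_in_sqft(row):
    """Standardize area to square feet using the row's unit tag."""
    if row["unit"] == "sqm":
        return row["area"] * SQFT_PER_SQM
    return row["area"]

# Hypothetical listings, for illustration only.
listings = [
    {"location": "Pune", "area": 900, "unit": "sqft", "price": 7500000},
    {"location": "Pune", "area": 900, "unit": "sqft", "price": 7500000},  # duplicate
    {"location": None, "area": 85, "unit": "sqm", "price": 6800000},      # missing location
    {"location": "Delhi", "area": 1200, "unit": "sqft", "price": -1},     # impossible price
]

report = audit_listings(listings)
print(report)
```

The audit only counts problems. Deciding whether to fix, impute, or drop each row is still a judgment call per column.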

What not to do

Do not delete rows blindly.
Do not fill missing values without understanding the column.
Do not ignore outliers just because the model still runs.
Do not treat preprocessing as boring. It is where a lot of real ML work happens.

4. Features: the model only sees what you give it

A model cannot magically use information that is not present in the features.

This is why feature engineering matters.

Feature engineering mainly means:

selecting useful features
removing useless features
combining existing features into better ones
creating new features by collecting better data

Practical situation

You want to predict student exam performance.

Useful features may include:

attendance
previous marks
assignment completion
number of mock tests attempted
average practice score
study consistency

Weak or useless features may include:

roll number
phone brand
favorite color
random ID
hostel room number

If you feed too many irrelevant columns, the model may start finding fake patterns.

What to do

Ask whether each feature has a logical relation to the target.
Remove random identifiers unless they carry real meaning.
Combine features where useful.
Create domain-specific features.
Compare model performance before and after feature changes.
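
A cheap first pass on identifier-like columns is to count distinct values. The sketch below flags columns that are constant or unique in every row. The column names and rows are hypothetical, and the check is only a prompt for a human look, since a genuine continuous feature can also be unique per row.

```python
def suspicious_columns(rows, target):
    """Flag columns that are constant or unique per row (ID-like)."""
    flagged = {}
    n = len(rows)
    for col in rows[0]:
        if col == target:
            continue
        distinct = len({row[col] for row in rows})
        if distinct == 1:
            flagged[col] = "constant"
        elif distinct == n:
            flagged[col] = "unique per row (possible identifier)"
    return flagged

# Hypothetical student records, for illustration only.
students = [
    {"roll_number": 101, "attendance": 90, "batch": 2024, "passed": 1},
    {"roll_number": 102, "attendance": 75, "batch": 2024, "passed": 1},
    {"roll_number": 103, "attendance": 90, "batch": 2024, "passed": 0},
    {"roll_number": 104, "attendance": 60, "batch": 2024, "passed": 0},
]

flagged = suspicious_columns(students, target="passed")
print(flagged)
```

Here roll_number looks like an identifier and batch carries no information, while attendance passes the check.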

What not to do

Do not dump every column into the model.
Do not assume the model will automatically ignore useless features.
Do not confuse correlation with actual usefulness.
Do not use features that will not be available in production.

5. Overfitting: when your model memorizes the training set

Overfitting happens when the model performs very well on training data but badly on new data.

It has not learned the actual pattern.

It has memorized the training examples.

Think of it like preparing for an exam by memorizing last year’s exact questions. You may score well if the same questions repeat. But if the paper changes, you are finished.

Practical situation

You train a very complex model.

Results:

Training accuracy: 99%
Validation accuracy: 72%

That gap is a warning.

The model is probably learning noise, shortcuts, or accidental patterns in the training data.

Real-life style example

Suppose a fraud detection model learns that transactions from one specific city are risky because the training dataset had many fraud cases from that city.

But in reality, that city was just overrepresented in the fraud reports.

Now the model starts flagging normal users from that city.

That is not intelligence.

That is overfitting plus biased data.

What to do

Use a validation set.
Use cross-validation.
Reduce model complexity.
Add regularization.
Get more training data.
Remove noisy features.
Compare training score and validation score.
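
Comparing the two scores can be turned into a tiny helper. The thresholds below (good_enough and max_gap) are arbitrary assumptions for the sketch, not standard values. Pick them based on your own metric and problem.

```python
def diagnose_fit(train_score, val_score, good_enough=0.85, max_gap=0.10):
    """Rough label for a (train, validation) score pair. Thresholds are assumptions."""
    if train_score < good_enough:
        return "possible underfitting"
    if train_score - val_score > max_gap:
        return "possible overfitting"
    return "no obvious fit problem"

print(diagnose_fit(0.99, 0.72))  # the example scores from above
print(diagnose_fit(0.60, 0.58))
print(diagnose_fit(0.91, 0.89))
```

The labels say "possible" on purpose. A big gap can also come from a validation set that does not match the training data.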

What not to do

Do not celebrate training accuracy alone.
Do not keep making the model bigger without checking validation performance.
Do not tune repeatedly on the test set.
Do not trust a model just because it looks perfect on old data.

6. Underfitting: when the model is too weak

Underfitting is the opposite problem.

The model is too simple to capture the actual pattern.

Practical situation

You are trying to predict house prices.

But the relationship is not simple.

Price depends on:

location
size
income level of area
nearby facilities
age of property
number of rooms
market trends

If you use a very simple linear model with weak features, the model may perform badly even on the training data.

That is underfitting.

Symptom

Training score is poor.
Validation score is also poor.
Adding more data does not help much.
The model is not capturing the structure.

What to do

Use a more powerful model.
Add better features.
Reduce excessive regularization.
Try polynomial or non-linear features.
Check whether the target pattern is too complex for the current model.
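
Here is a toy illustration of the "try polynomial or non-linear features" point. The data is synthetic (y = x^2) and the helper names are made up. A straight line fitted on x explains nothing, even on the training data, while the same simple model fitted on the squared feature fits perfectly.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the line a*x + b."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]  # the true pattern is quadratic

# Linear model on the raw feature: underfits.
a, b = fit_line(xs, ys)
r2_raw = r_squared(xs, ys, a, b)

# Same linear model on a non-linear feature (x squared).
squared = [x * x for x in xs]
a2, b2 = fit_line(squared, ys)
r2_squared = r_squared(squared, ys, a2, b2)

print(f"R^2 on x:   {r2_raw:.2f}")
print(f"R^2 on x^2: {r2_squared:.2f}")
```

The model did not get more complicated. It just got a feature that matches the shape of the problem.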

What not to do

Do not assume simple models are always safer.
Do not keep cleaning data forever if the model itself is too weak.
Do not ignore bad training performance.
Do not blame validation if training performance is already poor.

7. Data mismatch: notebook performance is not production performance

This one is very practical.

Suppose you want to build a flower recognition app.

You train it using clean flower images downloaded from the web.

Those images may have:

good lighting
clear background
centered flowers
high resolution
professional photography

But users of your app may upload:

blurry photos
tilted photos
bad lighting
half-visible flowers
messy backgrounds
low-resolution mobile shots

Now your model works well during training but fails when real users use it.

This is data mismatch.

The training data and production data are not from the same world.

What to do

Make validation and test data look like production data.
Collect real user-like examples early.
Keep a train-dev set if your training data source is different from production.
Preprocess training data to look closer to production data.
Test on messy, real-world examples before trusting the model.
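
The train-dev idea can be sketched as a four-way split. This assumes you have two pools: plentiful web examples and a smaller set of production-like examples. The function name and the fractions are choices made for this example only.

```python
import random

def split_with_train_dev(web_examples, production_examples,
                         train_dev_fraction=0.1, seed=0):
    """Hold out part of the web data as train-dev; split production data into dev/test."""
    rng = random.Random(seed)
    web = web_examples[:]
    prod = production_examples[:]
    rng.shuffle(web)
    rng.shuffle(prod)
    n_train_dev = max(1, int(len(web) * train_dev_fraction))
    half = len(prod) // 2
    return {
        "train": web[n_train_dev:],      # web data the model learns from
        "train_dev": web[:n_train_dev],  # web data the model never trains on
        "dev": prod[:half],              # production-like data for tuning
        "test": prod[half:],             # production-like data for the final check
    }

# Integers stand in for images here.
splits = split_with_train_dev(list(range(100)), list(range(1000, 1020)))
print({name: len(items) for name, items in splits.items()})
```

How to read the results: good on train but bad on train_dev points to overfitting. Good on train_dev but bad on dev points to data mismatch.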

What not to do

Do not validate only on clean data.
Do not assume web images represent mobile images.
Do not deploy just because notebook accuracy is high.
Do not mix near-duplicate images across train and test sets.

8. Bad validation: accidentally cheating without knowing it

A model should be judged on data it has not seen.

The common split is:

training set: used to train the model
validation set: used to compare and tune models
test set: used only at the end

The test set should be treated like the final exam.

If you keep checking the test score again and again, then changing the model again and again, you are indirectly training on the test set.

That gives fake confidence.

Practical situation

You try 20 models.

Every time, you check test accuracy.

Then you choose the one with the best test result.

This sounds normal, but it is risky.

You may have selected the model that performs best on that specific test set by chance, not the one that generalizes best.
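
A small simulation makes this concrete. In the sketch below, all 20 "models" guess at random, so none of them is actually better than the others. The test set size, the seed, and the names are arbitrary choices for the illustration.

```python
import random

rng = random.Random(0)
n_test, n_models = 100, 20
labels = [rng.randint(0, 1) for _ in range(n_test)]

def random_model_accuracy():
    """Accuracy of a 'model' that guesses each label with a coin flip."""
    preds = [rng.randint(0, 1) for _ in range(n_test)]
    return sum(p == y for p, y in zip(preds, labels)) / n_test

scores = [random_model_accuracy() for _ in range(n_models)]
best, average = max(scores), sum(scores) / n_models
print(f"average accuracy: {average:.2f}, best of {n_models}: {best:.2f}")
```

The best score will sit above the average even though every model is equally useless. Picking the winner on the test set reports that luck as if it were skill.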

What to do

Use validation data for model selection.
Keep the test set untouched until the end.
Use cross-validation when data is limited.
Report the evaluation setup honestly.
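
For the cross-validation point, here is a minimal k-fold sketch in plain Python. It assumes the examples are already shuffled, and train_fn and score_fn are placeholders for whatever model and metric you use. In a real project you would normally reach for a library implementation instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cross_val_mean(examples, train_fn, score_fn, k=5):
    """Average validation score over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train_idx])
        scores.append(score_fn(model, [examples[i] for i in val_idx]))
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("validate on", val_idx)
```

Use the cross-validated score to choose between models, then touch the test set once, at the end.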

What not to do

Do not tune hyperparameters on the test set.
Do not keep changing the model after seeing test results.
Do not report only the best score without explaining how you got it.

9. No Free Lunch: there is no universal best model

This is one of the most important ideas.

There is no model that is best for every problem.

Every model makes assumptions.

A linear model assumes the data can be explained with a mostly linear relationship.

A decision tree assumes decisions can be made through feature-based splits.

A neural network may learn very complex patterns, but it may also need more data, more compute, and more tuning.

So when someone says:

“Just use Random Forest.”

“Neural networks are always better.”

That is not ML thinking.

That is model worship.

Practical meaning

The best model depends on:

the dataset
the size of data
the quality of data
the features
the metric
the cost of wrong predictions
the production environment
the assumptions you are willing to make

What to do

Start with a simple baseline.
Try multiple models.
Compare fairly using validation.
Understand why a model works.
Choose based on evidence, not hype.
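
The simplest possible baseline is "always predict the most common class". The sketch below uses made-up labels and a made-up function name. The point is to know the score you get for free before judging any real model.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, val_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in val_labels) / len(val_labels)

# Hypothetical imbalanced labels, for illustration only.
train_labels = ["legit"] * 90 + ["fraud"] * 10
val_labels = ["legit"] * 18 + ["fraud"] * 2

baseline = majority_baseline_accuracy(train_labels, val_labels)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

With labels this imbalanced, a model that reports 91% accuracy has barely beaten doing nothing. That is also a hint that accuracy may be the wrong metric here.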

What not to do

Do not marry one algorithm.
Do not pick models because they sound advanced.
Do not ignore simple models.
Do not skip baselines.

A practical debugging checklist

When your model score is low, ask these questions in order:

Problem Area	Debug Question
Data quantity	Do I have enough examples?
Data representation	Does my data represent the real world?
Sampling bias	Which groups are missing or overrepresented?
Data quality	Are there missing values, outliers, wrong labels, or duplicates?
Features	Are my features actually useful?
Overfitting	Is training score high but validation score low?
Underfitting	Are both training and validation scores low?
Data mismatch	Is production data different from training data?
Validation	Am I accidentally tuning on the test set?
Model choice	Have I compared multiple models fairly?

Final takeaway

When a machine learning model fails, do not immediately ask:

“Which algorithm should I use?”

Ask:

“Where is the bottleneck?”

Maybe the data is not enough.

Maybe the sample is biased.

Maybe the features are weak.

Maybe the model is memorizing.

Maybe the model is too simple.

Maybe the test setup is lying to you.

Maybe the training data and real-world data are completely different.

Good ML is not just about training models.

It is about building systems that can survive contact with real data.