Machine learning is fun, at least when the model is behaving and the accuracy is through the roof. But how do you debug when the model comes back with a low score?
How do you identify the bottleneck?
The beginner instinct is usually:
“Maybe I need a better algorithm.”
Sometimes, yes.
But a lot of ML failures are not algorithm failures.
They are data failures.
Sampling failures.
Feature failures.
Evaluation failures.
Assumption failures.
This article is a practical map of the common places where ML systems break.
Also, I have kept this super short because no one needs a story up front. This is meant for quick notes. Enjoy!
1. First check: do you even have enough useful data?
A well-known example comes from a natural language disambiguation experiment by Banko and Brill (2001).
They compared different ML algorithms on a language problem. The surprising part was that as the amount of training data increased, many different algorithms started performing almost equally well.
The lesson was simple but brutal:
For many complex problems, more good data can matter more than obsessing over the algorithm.
Not always. But often.
Practical situation
Suppose you are building a sentiment classifier.
You try:
- Naive Bayes
- Logistic Regression
- SVM
- Random Forest
- Neural Network
But your dataset has only 300 examples.
At that point, the problem may not be “which algorithm is best?”
The real problem may be:
“The model has not seen enough language patterns to generalize.”
What to do
- Get more representative data.
- Use data augmentation carefully.
- Use transfer learning if the domain allows it.
- Start with a simple baseline before jumping to complex models.
- Check whether performance improves as data size increases.
What not to do
- Do not keep switching algorithms blindly.
- Do not assume deep learning will save a tiny dataset.
- Do not collect more random data if it does not match the actual problem.
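One way to check whether performance improves with data size is a learning curve: train on growing subsets and watch the validation score. The sketch below is a toy illustration that uses only the standard library. The synthetic 1-D data, the nearest-centroid "model", and the function names are all assumptions made for the example, not part of any particular library.

```python
import random
from statistics import mean

def make_data(n, rng):
    """Synthetic 1-D, two-class data: class 0 centered at 0.0, class 1 at 2.0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data

def fit_centroids(train):
    """Nearest-centroid 'model': the mean feature value of each class seen."""
    classes = {y for _, y in train}
    return {c: mean(x for x, y in train if y == c) for c in classes}

def accuracy(centroids, rows):
    """Fraction of rows whose nearest class centroid matches the true label."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in rows
    )
    return hits / len(rows)

def learning_curve(train, val, sizes):
    """Validation accuracy after training on the first `size` rows, per size."""
    return [(size, accuracy(fit_centroids(train[:size]), val)) for size in sizes]

rng = random.Random(0)
train, val = make_data(2000, rng), make_data(500, rng)
curve = learning_curve(train, val, sizes=[10, 50, 200, 1000, 2000])
for size, acc in curve:
    print(f"{size:>5} training rows -> validation accuracy {acc:.3f}")
```

If the curve is still climbing at the largest size, more representative data is likely to help. If it has flattened, the bottleneck is probably somewhere else, such as the features or the model.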
2. Sampling bias: your dataset may be confidently wrong
Sampling bias is one of the most dangerous ML problems because the model may look correct during training but fail in the real world.
A famous case is the 1936 Literary Digest poll.
They surveyed millions of people and predicted that Landon would defeat Roosevelt. The sample size was huge, but the prediction was badly wrong.
Why?
Because the people they reached were not representative of the full voting population. Their source lists were biased toward wealthier people, and only a fraction of those polled replied, which added a second problem: nonresponse bias.
So the issue was not quantity.
The issue was representation.
ML version
Imagine training a placement prediction model using data mostly from students who already attend coding clubs, hackathons, and internships.
The model may conclude:
“Students with high GitHub activity are highly placeable.”
That may be partly true.
But it may ignore students who are good at core subjects, aptitude, communication, or offline projects because they were underrepresented in the data.
Your model is not learning “placement potential.”
It is learning “patterns from the type of students your dataset contains.”
What to do
- Ask: who is missing from the dataset?
- Compare dataset distribution with real-world distribution.
- Use stratified sampling when groups matter.
- Track important segments separately.
- Test the model on examples outside the original collection source.
What not to do
- Do not trust a dataset just because it is large.
- Do not train only on convenient data.
- Do not assume online data represents the real world.
- Do not ignore minority cases if those cases matter in production.
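Comparing the dataset distribution with the real-world distribution can be as simple as comparing group shares. Here is a minimal sketch, assuming you can label each row with a group and that you have reference shares from some trusted source. The group names, the numbers, and the `tolerance` default are all hypothetical.

```python
from collections import Counter

def group_shares(labels):
    """Share of each group in a list of group labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def representation_gaps(sample_labels, reference_shares, tolerance=0.05):
    """Groups whose sample share differs from the reference by more than `tolerance`.

    Returns {group: sample_share - reference_share}.
    Negative values mean the group is underrepresented in the sample.
    """
    sample = group_shares(sample_labels)
    gaps = {}
    for group, expected in reference_shares.items():
        diff = sample.get(group, 0.0) - expected
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps

# Hypothetical numbers: who is in the training data vs. the real student body.
dataset_groups = (
    ["coding_club"] * 70 + ["core_subjects"] * 20 + ["offline_projects"] * 10
)
real_world = {"coding_club": 0.30, "core_subjects": 0.45, "offline_projects": 0.25}

gaps = representation_gaps(dataset_groups, real_world)
print(gaps)
```

A large positive gap for one group and negative gaps for the others is the placement-model situation described above: the model mostly sees one type of student.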
3. Data quality: garbage in, garbage out is not a joke
Bad data does not become intelligent because you pass it through a model.
If the dataset has noise, wrong labels, missing values, duplicates, or weird outliers, the model will try to learn from that mess.
Practical situation
You are building a house price prediction model.
Your data contains:
- missing location values
- incorrect prices
- duplicate listings
- luxury villas mixed with normal apartments
- square feet and square meters in the same column
- outlier entries where price is accidentally typed with one extra zero
Now your model gives poor predictions.
The problem may not be regression.
The problem may be that your data is broken.
What to do
- Check missing values.
- Check duplicate rows.
- Check impossible values.
- Check outliers.
- Standardize units.
- Fix or remove corrupted entries.
- Document every cleaning step.
What not to do
- Do not delete rows blindly.
- Do not fill missing values without understanding the column.
- Do not ignore outliers just because the model still runs.
- Do not treat preprocessing as boring. It is where a lot of real ML work happens.
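The checks above can be turned into a small audit that counts problems before anything gets deleted. This is a rough sketch for the house price example. The column names (`location`, `price`, `area`, `area_unit`) and the `max_price` cutoff are assumptions for illustration, so adapt them to your own schema.

```python
def audit_listings(rows, max_price=50_000_000):
    """Count common data problems in a list of house-listing dicts.

    Nothing is modified or removed; the report only tells you where to look.
    """
    report = {
        "missing_location": 0,
        "duplicates": 0,
        "impossible_values": 0,
        "mixed_units": False,
    }
    seen = set()
    units = set()
    for row in rows:
        if not row.get("location"):
            report["missing_location"] += 1
        # A row is a duplicate if every column matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        price, area = row.get("price"), row.get("area")
        if price is None or area is None or price <= 0 or area <= 0 or price > max_price:
            report["impossible_values"] += 1
        units.add(row.get("area_unit"))
    # More than one unit in the same column means areas are not comparable yet.
    report["mixed_units"] = len(units) > 1
    return report

listings = [
    {"location": "Pune", "price": 7_500_000, "area": 900, "area_unit": "sqft"},
    {"location": "Pune", "price": 7_500_000, "area": 900, "area_unit": "sqft"},
    {"location": None, "price": 6_000_000, "area": 80, "area_unit": "sqm"},
    {"location": "Mumbai", "price": 950_000_000, "area": 1100, "area_unit": "sqft"},
]
report = audit_listings(listings)
print(report)
```

The last listing is the "one extra zero" case: the audit flags it, but deciding whether to fix or drop it is still a human call.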
4. Features: the model only sees what you give it
A model cannot magically use information that is not present in the features.
This is why feature engineering matters.
Feature engineering mainly means:
- selecting useful features
- removing useless features
- combining existing features into better ones
- creating new features by collecting better data
Practical situation
You want to predict student exam performance.
Useful features may include:
- attendance
- previous marks
- assignment completion
- number of mock tests attempted
- average practice score
- study consistency
Weak or useless features may include:
- roll number
- phone brand
- favorite color
- random ID
- hostel room number
If you feed the model too many irrelevant columns, it may start finding fake patterns.
What to do
- Ask whether each feature has a logical relation to the target.
- Remove random identifiers unless they carry real meaning.
- Combine features where useful.
- Create domain-specific features.
- Compare model performance before and after feature changes.
What not to do
- Do not dump every column into the model.
- Do not assume the model will automatically ignore useless features.
- Do not confuse correlation with actual usefulness.
- Do not use features that will not be available in production.
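A quick way to spot random identifiers is to look at how many distinct values each column has. The sketch below is a heuristic only, with hypothetical column names and a hypothetical `unique_ratio` threshold. Continuous numeric features can also look "id-like" under this rule, so treat the output as a review list, not a delete list.

```python
def suspicious_columns(rows, unique_ratio=0.95):
    """Flag columns that look like identifiers or constants.

    A column whose values are (almost) all distinct behaves like a row ID.
    A column with a single value carries no signal at all.
    """
    flagged = {}
    n = len(rows)
    for col in rows[0]:
        distinct = len({row[col] for row in rows})
        if distinct == 1:
            flagged[col] = "constant"
        elif distinct / n >= unique_ratio:
            flagged[col] = "id-like"
    return flagged

# Hypothetical student records: 30 rows, 3 columns.
students = [
    {
        "roll_number": 101 + i,
        "attendance_band": ["low", "mid", "high"][i % 3],
        "campus": "main",
    }
    for i in range(30)
]
flags = suspicious_columns(students)
print(flags)
```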
5. Overfitting: when your model memorizes the training set
Overfitting happens when the model performs very well on training data but badly on new data.
It has not learned the actual pattern.
It has memorized the training examples.
Think of it like preparing for an exam by memorizing last year’s exact questions. You may score well if the same questions repeat. But if the paper changes, you are finished.
Practical situation
You train a very complex model.
Results:
- Training accuracy: 99%
- Validation accuracy: 72%
That gap is a warning.
The model is probably learning noise, shortcuts, or accidental patterns in the training data.
A realistic example
Suppose a fraud detection model learns that transactions from one specific city are risky because the training dataset had many fraud cases from that city.
But in reality, that city was just overrepresented in the fraud reports.
Now the model starts flagging normal users from that city.
That is not intelligence.
That is overfitting plus biased data.
What to do
- Use a validation set.
- Use cross-validation.
- Reduce model complexity.
- Add regularization.
- Get more training data.
- Remove noisy features.
- Compare training score and validation score.
What not to do
- Do not celebrate training accuracy alone.
- Do not keep making the model bigger without checking validation performance.
- Do not tune repeatedly on the test set.
- Do not trust a model just because it looks perfect on old data.
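Cross-validation from the list above works by rotating which slice of the data is held out. Here is a minimal sketch of the index bookkeeping, written with the standard library only. The function name and defaults are my own, and in practice a library such as scikit-learn provides this for you (`KFold`, `cross_val_score`).

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs so every row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, val_idx in splits:
    print(f"train on {len(train_idx)} rows, validate on rows {sorted(val_idx)}")
```

You train once per fold on `train_idx`, score on `val_idx`, and average the k scores. A large spread between folds is itself a warning sign.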
6. Underfitting: when the model is too weak
Underfitting is the opposite problem.
The model is too simple to capture the actual pattern.
Practical situation
You are trying to predict house prices.
But the relationship is not simple.
Price depends on:
- location
- size
- income level of area
- nearby facilities
- age of property
- number of rooms
- market trends
If you use a very simple linear model with weak features, the model may perform badly even on the training data.
That is underfitting.
Symptom
- Training score is poor.
- Validation score is also poor.
- Adding more data does not help much.
- The model is not capturing the structure.
What to do
- Use a more powerful model.
- Add better features.
- Reduce excessive regularization.
- Try polynomial or non-linear features.
- Check whether the target pattern is too complex for the current model.
What not to do
- Do not assume simple models are always safer.
- Do not keep cleaning data forever if the model itself is too weak.
- Do not ignore bad training performance.
- Do not blame validation if training performance is already poor.
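The symptoms of underfitting and overfitting can be told apart from just two numbers. A rough sketch follows, assuming higher scores are better. The `target` and `max_gap` defaults are hypothetical and depend entirely on your problem.

```python
def diagnose_fit(train_score, val_score, target=0.90, max_gap=0.05):
    """Rough label for a (train, validation) score pair; higher is better.

    target:  the score you would consider good enough on training data.
    max_gap: the train/validation gap you are willing to tolerate.
    """
    if train_score < target:
        return "underfitting"  # the model cannot even fit the data it saw
    if train_score - val_score > max_gap:
        return "overfitting"   # fits the training data, fails to generalize
    return "ok"

print(diagnose_fit(0.99, 0.72))  # the 99% / 72% case from the overfitting section
print(diagnose_fit(0.61, 0.59))  # both scores poor
print(diagnose_fit(0.93, 0.91))  # both scores good and close together
```

The check for underfitting comes first on purpose: if training performance is already poor, the validation gap is not the thing to fix.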
7. Data mismatch: notebook performance is not production performance
This one is very practical.
Suppose you want to build a flower recognition app.
You train it using clean flower images downloaded from the web.
Those images may have:
- good lighting
- clear background
- centered flowers
- high resolution
- professional photography
But users of your app may upload:
- blurry photos
- tilted photos
- bad lighting
- half-visible flowers
- messy backgrounds
- low-resolution mobile shots
Now your model works well during training but fails when real users use it.
This is data mismatch.
The training data and production data are not from the same world.
What to do
- Make validation and test data look like production data.
- Collect real user-like examples early.
- Keep a train-dev set (a held-out slice of the training source) if your training data comes from a different source than production.
- Preprocess training data to look closer to production data.
- Test on messy, real-world examples before trusting the model.
What not to do
- Do not validate only on clean data.
- Do not assume web images represent mobile images.
- Do not deploy just because notebook accuracy is high.
- Do not mix near-duplicate images across train and test sets.
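A train-dev set helps separate "the model overfit" from "the data is from a different world". The sketch below shows the reasoning, assuming you have scores on three sets and that higher is better. The function name, the `max_gap` threshold, and the example scores are hypothetical.

```python
def locate_gap(train_score, train_dev_score, dev_score, max_gap=0.05):
    """Say where performance drops, given scores on three sets.

    train_dev: held-out rows from the SAME source as the training data
               (for example, web images the model never trained on).
    dev:       rows that look like production data
               (for example, real mobile uploads).
    """
    if train_score - train_dev_score > max_gap:
        return "overfitting"    # drops even on unseen data from the same source
    if train_dev_score - dev_score > max_gap:
        return "data mismatch"  # fine on the training source, worse on production-like data
    return "no large gap"

# Flower app example: web images for training, mobile photos in production.
print(locate_gap(0.97, 0.95, 0.78))
print(locate_gap(0.97, 0.80, 0.78))
```

In the first call the model holds up on unseen web images but falls apart on mobile photos, which points at the data source. In the second call it already struggles on unseen web images, so the mismatch is not the first thing to fix.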
8. Bad validation: accidentally cheating without knowing it
A model should be judged on data it has not seen.
The common split is:
- training set: used to train the model
- validation set: used to compare and tune models
- test set: used only at the end
The test set should be treated like the final exam.
If you keep checking the test score again and again, then changing the model again and again, you are indirectly training on the test set.
That gives fake confidence.
Practical situation
You try 20 models.
Every time, you check test accuracy.
Then you choose the one with the best test result.
This sounds normal, but it is risky.
You may have selected the model that performs best on that specific test set by chance, not the one that generalizes best.
What to do
- Use validation data for model selection.
- Keep the test set untouched until the end.
- Use cross-validation when data is limited.
- Report the evaluation setup honestly.
What not to do
- Do not tune hyperparameters on the test set.
- Do not keep changing the model after seeing test results.
- Do not report only the best score without explaining how you got it.
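The three-way split and the "select on validation, test once" rule can be sketched in a few lines. This is a minimal standard-library version. The fractions, the seed, and the candidate scores are placeholders, not recommendations.

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then cut into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

def select_model(val_scores):
    """Pick the candidate with the best validation score.

    The test set is deliberately not an input here.
    """
    return max(val_scores, key=val_scores.get)

train, val, test = three_way_split(range(100))

# Hypothetical validation scores for three candidates.
best = select_model(
    {"logistic_regression": 0.81, "random_forest": 0.84, "svm": 0.83}
)
print(best, len(train), len(val), len(test))
```

Only after `best` is chosen would you score it on `test`, once, and report that number.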
9. No Free Lunch: there is no universal best model
This is one of the most important ideas.
There is no model that is best for every problem.
Every model makes assumptions.
A linear model assumes the data can be explained with a mostly linear relationship.
A decision tree assumes decisions can be made through feature-based splits.
A neural network may learn very complex patterns, but it may also need more data, more compute, and more tuning.
So when someone says:
“Just use Random Forest.”
or
“Neural networks are always better.”
That is not ML thinking.
That is model worship.
Practical meaning
The best model depends on:
- the dataset
- the size of data
- the quality of data
- the features
- the metric
- the cost of wrong predictions
- the production environment
- the assumptions you are willing to make
What to do
- Start with a simple baseline.
- Try multiple models.
- Compare fairly using validation.
- Understand why a model works.
- Choose based on evidence, not hype.
What not to do
- Do not marry one algorithm.
- Do not pick models because they sound advanced.
- Do not ignore simple models.
- Do not skip baselines.
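The simplest baseline is a model that always predicts the most common class. Any real model has to beat it to justify its complexity. Here is a minimal sketch with made-up labels.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, val_labels):
    """Accuracy of always predicting the most common training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return sum(label == majority for label in val_labels) / len(val_labels)

# Hypothetical sentiment labels with an 80/20 class imbalance.
train_labels = ["positive"] * 80 + ["negative"] * 20
val_labels = ["positive"] * 40 + ["negative"] * 10

baseline = majority_baseline_accuracy(train_labels, val_labels)
print(f"baseline accuracy: {baseline:.2f}")
```

On data this imbalanced, a fancy model scoring 82% is barely doing better than guessing "positive" every time, which is why the baseline comes before the model comparison.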
A practical debugging checklist
When your model score is low, ask these questions in order:
| Problem Area | Debug Question |
|---|---|
| Data quantity | Do I have enough examples? |
| Data representation | Does my data represent the real world? |
| Sampling bias | Which groups are missing or overrepresented? |
| Data quality | Are there missing values, outliers, wrong labels, or duplicates? |
| Features | Are my features actually useful? |
| Overfitting | Is training score high but validation score low? |
| Underfitting | Are both training and validation scores low? |
| Data mismatch | Is production data different from training data? |
| Validation | Am I accidentally tuning on the test set? |
| Model choice | Have I compared multiple models fairly? |
Final takeaway
When a machine learning model fails, do not immediately ask:
“Which algorithm should I use?”
Ask:
“Where is the bottleneck?”
Maybe the data is not enough.
Maybe the sample is biased.
Maybe the features are weak.
Maybe the model is memorizing.
Maybe the model is too simple.
Maybe the test setup is lying to you.
Maybe the training data and real-world data are completely different.
Good ML is not just about training models.
It is about building systems that can survive contact with real data.