
Vineet Chauhan


Better Data Beats Better Algorithms: Before Changing the Model, Change the Data

How Feature Engineering Taught Me That Better Data Often Beats Better Algorithms

When I first started learning Machine Learning, I believed what many beginners believe:

If my model is not performing well, I need a better algorithm.

So I kept switching models.

I moved from Logistic Regression to Decision Trees, then Random Forest, and later even started reading about XGBoost and Neural Networks.

The results improved slightly, but never dramatically.

What surprised me was that the biggest improvement didn't come from changing the algorithm.

It came from changing the data.


The Problem

I was working on a dataset containing missing values, outliers, and categorical variables.

Like many beginners, my first instinct was simple:

model.fit(X_train, y_train)
pred = model.predict(X_test)

The model trained successfully.

The accuracy looked acceptable.

But something felt wrong.

The data itself was messy.

Some columns contained missing values.

Some numerical features had extreme outliers.

Several categorical columns were represented as text.

Yet I expected the model to magically learn everything.
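Before fitting anything, a quick audit makes this kind of mess visible. The sketch below is only an illustration: the toy DataFrame and its values are hypothetical stand-ins for the real dataset, with column names borrowed from the examples later in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "experience": [1.0, 3.0, np.nan, 4.0, 40.0],
    "gender": ["Male", "Female", "Female", None, "Male"],
    "company_type": ["Private", "Public", "Private", "Private", "Public"],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Columns that are not numeric and will need encoding before modelling.
text_columns = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(text_columns)
```

Two lines of pandas are enough to show which columns need imputation and which need encoding.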


My First Experiment

I trained a Logistic Regression model on the raw dataset.

Results:

Accuracy : 72%

Not terrible.

Not impressive either.

Instead of changing the model, I decided to investigate the data.

This turned out to be the most important decision of the entire project.


Step 1: Handling Missing Values

The dataset contained several missing values.

At first I considered simply deleting rows.

df.dropna(inplace=True)

The problem?

I lost a significant portion of the data.
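The cost of dropping rows can be measured before committing to it. This is a minimal sketch on a hypothetical two-column frame (the `salary` column is invented for illustration); it calls `dropna()` without `inplace=True`, so the original data stays intact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each row has at most one gap, but the gaps are spread out.
df = pd.DataFrame({
    "experience": [1.0, np.nan, 3.0, 4.0],
    "salary": [10.0, 20.0, np.nan, 40.0],
})

rows_before = len(df)
rows_after = len(df.dropna())          # non-destructive: df itself is unchanged
fraction_lost = 1 - rows_after / rows_before

print(f"dropna() would discard {fraction_lost:.0%} of the rows")
```

Even when each column has only a few gaps, the rows that contain at least one gap can add up quickly.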

So I experimented with multiple approaches:

Mean Imputation

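from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)  # learn column means from training data only
X_test = imputer.transform(X_test)        # reuse those means to avoid leakage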

Median Imputation

imputer = SimpleImputer(strategy='median')

KNN Imputation

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_train = imputer.fit_transform(X_train)  # neighbours come from training rows only
X_test = imputer.transform(X_test)

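KNN imputation fills each gap from the five most similar rows instead of a single column-wide average, so it preserved relationships between features much better than mean or median imputation.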

This alone improved performance.
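One way to choose between the three strategies is to score each one with cross-validation. The sketch below is not the original experiment: it uses synthetic data from `make_classification` with about 10% of the values removed at random, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with roughly 10% of values knocked out at random.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

scores = {}
for name, imputer in imputers.items():
    # The pipeline refits the imputer inside each fold, so no statistics
    # from the held-out fold leak into training.
    pipeline = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipeline, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Putting the imputer inside the pipeline keeps the comparison fair, because every strategy is fitted only on the training folds.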


Step 2: Fighting Outliers

I then visualized the numerical columns.

The boxplots looked terrible.

A few extreme values were stretching entire distributions.

import seaborn as sns
sns.boxplot(x=df["experience"])

The model was spending too much effort trying to fit a handful of unusual observations.

I used IQR-based treatment.

Q1 = df["experience"].quantile(0.25)
Q3 = df["experience"].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df["experience"] >= lower) &
        (df["experience"] <= upper)]

After removing outliers, the data distribution became much cleaner.

More importantly, the model began learning actual patterns instead of noise.
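Dropping rows is not the only option. As an alternative to the filtering shown earlier, the same IQR fences can be used to cap extreme values, which keeps every row. This is a swapped-in technique, not what I did in the project; the helper name `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values; 40 plays the role of an extreme observation.
experience = pd.Series([1, 2, 3, 4, 5, 40], name="experience")
lower, upper = iqr_bounds(experience)

# Capping keeps every row but pulls extremes back to the fences.
capped = experience.clip(lower=lower, upper=upper)
n_capped = int((experience != capped).sum())

print(f"fences: [{lower}, {upper}], values capped: {n_capped}")
```

Capping is worth considering when the extreme rows might be valuable rare cases rather than errors, since no observation is thrown away.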


Step 3: Encoding Categorical Features

Most Machine Learning algorithms cannot work with raw text.

They expect numeric input.

So columns like:

Male
Female

Private
Public

Graduate
Masters

needed transformation.

I applied One-Hot Encoding.

# get_dummies returns a new DataFrame, so keep the result
df = pd.get_dummies(df,
                    columns=["gender",
                             "company_type"])

and Ordinal Encoding where order mattered.

education_level

High School
Graduate
Masters
PhD

This converted human-readable categories into machine-readable information.
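For the ordered column, the order has to be stated explicitly, otherwise an encoder falls back to alphabetical order. A minimal sketch using scikit-learn's `OrdinalEncoder`, assuming the four education levels listed above are the only values present (the sample rows are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit order, lowest to highest.
education_order = ["High School", "Graduate", "Masters", "PhD"]

# Hypothetical sample rows.
df = pd.DataFrame({"education_level": ["Masters", "High School", "PhD", "Graduate"]})

encoder = OrdinalEncoder(categories=[education_order])
df["education_level_encoded"] = encoder.fit_transform(df[["education_level"]]).ravel()

print(df)
```

Each level is mapped to its position in `education_order`, so a higher number always means a higher level of education.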


Step 4: Feature Scaling

Some columns ranged between:

0 – 5

while others ranged between:

0 – 100000

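Distance-based algorithms become biased toward features with larger ranges, and regularized models such as Logistic Regression are sensitive to feature scale too.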

I applied MinMax Scaling.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now every feature contributed fairly.
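To see what the scaler actually computes, here is a small sketch on a single hypothetical feature. `MinMaxScaler` applies (x - min) / (max - min), with the minimum and maximum taken from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature on a 0-100000 scale.
X_train = np.array([[0.0], [25000.0], [100000.0]])
X_test = np.array([[50000.0], [120000.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns min=0 and max=100000
X_test_scaled = scaler.transform(X_test)         # reuses the training min and max

# Same computation by hand: (x - min) / (max - min)
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
manual = (X_test - train_min) / (train_max - train_min)

print(X_test_scaled.ravel(), manual.ravel())
```

Because the test set is scaled with the training range, a test value larger than anything seen in training lands above 1. That is expected and is the price of not peeking at the test data.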


What Happened Next?

I trained the exact same Logistic Regression model again.

Nothing changed except the data.

Results:

Before Feature Engineering : 72%

After Feature Engineering  : 86%

A gain of 14 percentage points.

Without changing the algorithm.

Without using deep learning.

Without adding complexity.

Just by improving the data.


The Most Important Lesson

This project changed the way I think about Machine Learning.

Earlier I believed:

Better Algorithm
       ↓
Better Results

Now I believe:

Better Data
       ↓
Better Features
       ↓
Better Results

In my experience, most real-world machine learning problems are not algorithm problems.

They are data problems.

A powerful model trained on poor-quality data will still struggle.

A simple model trained on clean, meaningful data can often outperform much more complex alternatives.


Challenges I Faced

The hardest part was not training the model.

The hardest part was preparing the data.

Some difficulties included:

  • Losing rows during Complete Case Analysis
  • Choosing between Mean, Median, and KNN Imputation
  • Combining transformed datasets
  • Handling dimensionality after One-Hot Encoding
  • Identifying genuine outliers versus valuable rare cases

These challenges taught me more than model training ever did.
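The "combining transformed datasets" problem in particular has a standard answer in scikit-learn: `ColumnTransformer` inside a `Pipeline`. The sketch below is not the code from my project. The toy data, the `target` column, and the choice of median imputation (picked for brevity instead of KNN) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data; column names follow the examples in this post.
df = pd.DataFrame({
    "experience": [1.0, 3.0, np.nan, 4.0, 10.0, 2.0, 7.0, 5.0],
    "gender": ["Male", "Female", "Female", "Male",
               "Male", "Female", "Male", "Female"],
    "company_type": ["Private", "Public", "Private", "Private",
                     "Public", "Public", "Private", "Public"],
    "target": [0, 0, 0, 1, 1, 0, 1, 1],
})
X, y = df.drop(columns="target"), df["target"]

# Numeric columns: impute, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen in training.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["experience"]),
    ("cat", categorical, ["gender", "company_type"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
predictions = model.predict(X)
```

With this layout every preprocessing step is fitted together with the model, so the same transformations are applied consistently at training and prediction time, and there is no manual stitching of separately transformed arrays.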


Final Thoughts

Feature Engineering is not the most glamorous part of Machine Learning.

Nobody posts screenshots of missing value treatment on social media.

Nobody celebrates scaling features.

Yet this is where much of the real improvement happens.

After this project, I stopped asking:

Which model should I use?

and started asking:

What is my data trying to tell me?

That single change in mindset improved my machine learning skills more than learning any new algorithm.
