Vineet Chauhan

Posted on Jun 6

Better Data Beats Better Algorithms: Before Changing the Model, Change the Data

#deeplearning #datascience #ai #machinelearning

How Feature Engineering Taught Me That Better Data Often Beats Better Algorithms

When I first started learning Machine Learning, I believed what many beginners believe:

If my model is not performing well, I need a better algorithm.

So I kept switching models.

I moved from Logistic Regression to Decision Trees, then Random Forest, and later even started reading about XGBoost and Neural Networks.

The results improved slightly, but never dramatically.

What surprised me was that the biggest improvement didn't come from changing the algorithm.

It came from changing the data.

The Problem

I was working on a dataset containing missing values, outliers, and categorical variables.

Like many beginners, my first instinct was simple:

model.fit(X_train, y_train)
pred = model.predict(X_test)

The model trained successfully.

The accuracy looked acceptable.

But something felt wrong.

The data itself was messy.

Some columns contained missing values.

Some numerical features had extreme outliers.

Several categorical columns were represented as text.

Yet I expected the model to magically learn everything.
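A quick audit makes this kind of mess visible before any model is trained. The sketch below is only an illustration: the small DataFrame, its `salary` column, and the `audit` helper are made up for the example and are not from the original project.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset (hypothetical values).
df = pd.DataFrame({
    "experience": [1.0, 3.0, np.nan, 5.0, 40.0],
    "gender": ["Male", "Female", "Female", None, "Male"],
    "salary": [30000, 42000, 39000, 51000, np.nan],
})

def audit(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and missing share for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean().mul(100).round(1),
    })

report = audit(df)
print(report)
```

Text columns show up with a non-numeric dtype, and the missing counts show which columns will need imputation.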

My First Experiment

I trained a Logistic Regression model on the raw dataset.

Results:

Accuracy : 72%

Not terrible.

Not impressive either.

Instead of changing the model, I decided to investigate the data.

This turned out to be the most important decision of the entire project.

Step 1: Handling Missing Values

The dataset contained several missing values.

At first I considered simply deleting rows.

df.dropna(inplace=True)

The problem?

I lost a significant portion of the data.
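It helps to measure that loss before committing to it. A minimal sketch, using a made-up four-row frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four rows contain at least one missing value.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
})

rows_before = len(df)
rows_after = len(df.dropna())  # no inplace=True, so df itself is untouched
lost_share = 1 - rows_after / rows_before

print(f"dropna() would discard {lost_share:.0%} of the rows")
```

Because `dropna()` is called without `inplace=True`, the original frame survives and other strategies can still be tried on it.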

So I experimented with multiple approaches:

Mean Imputation

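from sklearn.impute import SimpleImputer

# Learn the column means from the training split only, then reuse them on the test split
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)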

Median Imputation

imputer = SimpleImputer(strategy='median')

KNN Imputation

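from sklearn.impute import KNNImputer

# Neighbours are looked up in the training split only, to avoid leaking test data
imputer = KNNImputer(n_neighbors=5)
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)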

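In my case, KNN imputation preserved relationships between features better than simple averaging, because it fills each gap from the most similar rows instead of using one column-wide value.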

This alone improved performance.
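One way to choose between the three strategies is to cross-validate each one inside a pipeline. The sketch below uses synthetic data from `make_classification` with values blanked out at random, so its scores say nothing about the original dataset. It only shows the comparison pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; roughly 10% of the cells are blanked out at random.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

scores = {}
for name, imputer in imputers.items():
    # The imputer lives inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name:>6}: {score:.3f}")
```

Keeping the imputer inside the pipeline means each fold's validation rows never influence the fill values, which keeps the comparison honest.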

Step 2: Fighting Outliers

I then visualized the numerical columns.

The boxplots looked terrible.

A few extreme values were stretching entire distributions.

import seaborn as sns
sns.boxplot(x=df["experience"])

The model was spending too much effort trying to fit a handful of unusual observations.

I used IQR-based treatment.

Q1 = df["experience"].quantile(0.25)
Q3 = df["experience"].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df["experience"] >= lower) &
        (df["experience"] <= upper)]

After removing outliers, the data distribution became much cleaner.

More importantly, the model began learning actual patterns instead of noise.
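A variation worth considering is to learn the IQR bounds from the training rows only and to clip unseen rows instead of dropping them, since a test set should not shrink. This is a swapped-in alternative, not what I did above. The `iqr_bounds` helper and the numbers are made up for illustration.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical training and test values for the "experience" column.
train = pd.DataFrame({"experience": [1, 2, 3, 4, 5, 6, 7, 8, 40]})
test = pd.DataFrame({"experience": [2, 5, 60]})

# Bounds come from the training data only.
lower, upper = iqr_bounds(train["experience"])

# Training rows outside the fences are dropped, as in the filter above.
train_filtered = train[train["experience"].between(lower, upper)]

# Test rows are kept, but extreme values are pulled back to the fences.
test_clipped = test.assign(experience=test["experience"].clip(lower, upper))

print(lower, upper)
print(test_clipped)
```

Whether to drop or clip depends on whether the extreme values are errors or rare but genuine cases.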

Step 3: Encoding Categorical Features

Most Machine Learning algorithms cannot work with raw text.

They expect numeric input.

So columns like:

Male
Female

Private
Public

Graduate
Masters

needed transformation.

I applied One-Hot Encoding.

# get_dummies returns a new DataFrame, so assign the result back
df = pd.get_dummies(df,
                    columns=["gender",
                             "company_type"])

and Ordinal Encoding where order mattered.

education_level

High School
Graduate
Masters
PhD

This converted human-readable categories into machine-readable information.
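For the ordered column, the category order has to be stated explicitly, otherwise the encoder sorts the labels alphabetically. Here is a minimal sketch with scikit-learn's `OrdinalEncoder`. The four sample rows and the `education_level_encoded` column name are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows.
df = pd.DataFrame(
    {"education_level": ["Graduate", "PhD", "High School", "Masters"]}
)

# Lowest to highest; the position in this list becomes the encoded value.
order = ["High School", "Graduate", "Masters", "PhD"]

encoder = OrdinalEncoder(categories=[order])
df["education_level_encoded"] = encoder.fit_transform(
    df[["education_level"]]
).ravel()

print(df)
```

With this order, High School maps to 0 and PhD maps to 3, so the numbers follow the real ranking of the categories.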

Step 4: Feature Scaling

Some columns ranged between:

0 – 5

while others ranged between:

0 – 100000

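Distance-based algorithms become biased toward features with larger values, and regularized or gradient-based models such as Logistic Regression are sensitive to feature scale too.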

I applied MinMax Scaling.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now every feature sat on the same 0 to 1 scale, so none dominated simply because of its units.
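To see what the scaler does to two such ranges, here is a small sketch with made-up numbers. It also shows a detail that is easy to miss: because the scaler is fitted on the training split only, a test value beyond the training maximum lands outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features: one in the 0-5 range, one in the 0-100000 range.
X_train = np.array([[0.0, 0.0],
                    [2.5, 50000.0],
                    [5.0, 100000.0]])
X_test = np.array([[1.0, 120000.0]])  # second value exceeds the training max

scaler = MinMaxScaler().fit(X_train)  # min and max come from the training split
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Both training columns end up between 0 and 1, while the out-of-range test value scales to roughly 1.2.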

What Happened Next?

I trained the exact same Logistic Regression model again.

Nothing changed except the data.

Results:

Before Feature Engineering : 72%

After Feature Engineering  : 86%

A gain of 14 percentage points.

Without changing the algorithm.

Without using deep learning.

Without adding complexity.

Just by improving the data.

The Most Important Lesson

This project changed the way I think about Machine Learning.

Earlier I believed:

Better Algorithm
       ↓
Better Results

Now I believe:

Better Data
       ↓
Better Features
       ↓
Better Results

Most real-world machine learning problems are not algorithm problems.

They are data problems.

A powerful model trained on poor-quality data will still struggle.

A simple model trained on clean, meaningful data can often outperform much more complex alternatives.

Challenges I Faced

The hardest part was not training the model.

The hardest part was preparing the data.

Some difficulties included:

Losing rows during Complete Case Analysis
Choosing between Mean, Median, and KNN Imputation
Combining transformed datasets
Handling dimensionality after One-Hot Encoding
Identifying genuine outliers versus valuable rare cases

These challenges taught me more than model training ever did.
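The "combining transformed datasets" difficulty is what scikit-learn's `ColumnTransformer` is meant to ease. The sketch below is one possible arrangement, not the exact code from my project: the column names follow the examples in this post, while the six rows, the labels, and the choice of median imputation are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

numeric = ["experience"]
nominal = ["gender", "company_type"]
ordinal = ["education_level"]
education_order = ["High School", "Graduate", "Masters", "PhD"]

# Each group of columns gets its own treatment; outputs are joined side by side.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), numeric),
    ("nom", OneHotEncoder(handle_unknown="ignore"), nominal),
    ("ord", OrdinalEncoder(categories=[education_order]), ordinal),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Made-up rows and labels, just enough to run the pipeline end to end.
X = pd.DataFrame({
    "experience": [1.0, np.nan, 5.0, 12.0, 3.0, 8.0],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "company_type": ["Private", "Public", "Private", "Public", "Private", "Public"],
    "education_level": ["Graduate", "Masters", "PhD", "High School", "Graduate", "Masters"],
})
y = [0, 1, 1, 0, 0, 1]

model.fit(X, y)
features = model.named_steps["preprocess"].transform(X)
print(features.shape)
```

Because every step is fitted inside one pipeline, the same transformations are applied to training and test data without manually stitching arrays back together.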

Final Thoughts

Feature Engineering is not the most glamorous part of Machine Learning.

Nobody posts screenshots of missing value treatment on social media.

Nobody celebrates scaling features.

Yet this is where much of the real improvement happens.

After this project, I stopped asking:

Which model should I use?

and started asking:

What is my data trying to tell me?

That single change in mindset improved my machine learning skills more than learning any new algorithm.

DEV Community