likhitha manikonda
Feature Engineering

🛠️ Feature Engineering Made Simple
Machine learning models are only as good as the data we feed them. That’s where feature engineering comes in. Let’s break it down in plain language.

🌱 What is Feature Engineering?
Feature engineering is the process of transforming raw data into useful inputs for machine learning models. Think of it like preparing ingredients before cooking — you clean, cut, and season them so the dish turns out delicious. In machine learning, we clean and transform data so the model learns better.

  • A feature is just a column of data (like age, salary, or number of purchases).
  • Feature engineering means creating, modifying, or selecting the right features so that your model learns better.

⚙️ Why Do We Need It?

  • Raw data is often messy, incomplete, or not in the right format.
  • Good features help algorithms see patterns more clearly.
  • Better features = better predictions, faster training, and more accurate results.

Let’s keep things simple and beginner‑friendly: short definitions, tiny examples, and then a walk through the main techniques.


🧩 Simple Examples

  • Raw data: "Red", "Blue", "Green"
  • Engineered feature: Convert to numbers → Red=0, Blue=1, Green=2

  • Raw data: "2025-12-20"
  • Engineered feature: Extract day, month, year → Day=20, Month=12, Year=2025

  • Raw data: "Salary = 2000000" (too high compared to others)
  • Engineered feature: Detect it as an outlier and cap it at a reasonable maximum.
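
Here is a minimal pandas sketch of all three examples (the DataFrame and its columns are made up for illustration; note that pandas assigns category codes alphabetically, so the exact numbers can differ from the mapping shown above):

import pandas as pd

# Toy data for the three examples above
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green'],
    'Date': ['2025-12-20', '2025-01-05', '2025-07-14'],
    'Salary': [50000, 60000, 2000000],
})

# 1. Categories to numbers (codes are assigned alphabetically: Blue=0, Green=1, Red=2)
df['Color_code'] = df['Color'].astype('category').cat.codes

# 2. Split a date into day, month, and year
df['Date'] = pd.to_datetime(df['Date'])
df['Day'], df['Month'], df['Year'] = df['Date'].dt.day, df['Date'].dt.month, df['Date'].dt.year

# 3. Cap extreme salaries at the 95th percentile
df['Salary_capped'] = df['Salary'].clip(upper=df['Salary'].quantile(0.95))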


🔧 Techniques in Feature Engineering

1. Handling Missing Values

When data has blanks or NaN, models get confused.

  • Delete rows/columns: If too many missing values.
  • Fill with mean/median/mode: Replace with average or most common value.
  • Fill with constant: Use "Unknown" or 0.
  • Predictive fill: Use another model to guess missing values.

👉 Example:

df['Age'] = df['Age'].fillna(df['Age'].median())  # fill blanks with the median age
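
The other strategies look much the same. A rough sketch (the City, Age, and Salary columns are placeholders, and KNNImputer is just one way to do a predictive fill):

from sklearn.impute import KNNImputer

# Fill a categorical column with a constant placeholder
df['City'] = df['City'].fillna('Unknown')

# Predictive fill: estimate each missing number from the 5 most similar rows
num_cols = ['Age', 'Salary']
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])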

2. Handling Imbalanced Datasets

Imagine you’re teaching a model to detect fraud in bank transactions.

  • Out of 1000 transactions, 990 are normal and only 10 are fraud.
  • If the model always predicts “normal,” it will be 99% accurate — but it will miss all fraud cases. That’s the problem of imbalanced datasets: one class (normal) has way more samples than the other (fraud).

🟢 Oversampling

We add more samples of the minority class (fraud) so the model sees them more often.

  • Simple way: duplicate existing fraud cases.
  • Better way: use a technique like SMOTE (Synthetic Minority Oversampling Technique) to create new synthetic fraud cases.

👉 Example:

If you have 10 fraud cases, you duplicate them until you have 990 fraud cases to match the normal ones.
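
In pandas, naive duplication could look like the sketch below (df and the is_fraud column are hypothetical, and SMOTE, covered in the next section, is usually the better option):

import pandas as pd

fraud = df[df['is_fraud'] == 1]    # 10 rows (minority class)
normal = df[df['is_fraud'] == 0]   # 990 rows (majority class)

# Duplicate fraud rows (sampling with replacement) until the classes match
fraud_upsampled = fraud.sample(n=len(normal), replace=True, random_state=42)
df_balanced = pd.concat([normal, fraud_upsampled])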


🔴 Undersampling

We remove some samples of the majority class (normal) so the dataset becomes balanced.

  • Example: Instead of keeping 990 normal cases, we randomly keep only 10 normal cases to match the 10 fraud cases.
  • Downside: we lose a lot of useful data.
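
The mirror image in code, using the same hypothetical df:

import pandas as pd

fraud = df[df['is_fraud'] == 1]
normal = df[df['is_fraud'] == 0]

# Randomly keep only as many normal rows as there are fraud rows
normal_downsampled = normal.sample(n=len(fraud), random_state=42)
df_balanced = pd.concat([normal_downsampled, fraud])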

⚖️ Class Weights

We tell the model: “Pay more attention to the minority class.”

  • Mistakes on fraud cases are penalized more heavily than mistakes on normal cases.
  • This way, the model tries harder to detect fraud even if fraud cases are rare.

👉 Example in scikit-learn:

from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency, so fraud errors cost more
model = LogisticRegression(class_weight='balanced')

📊 Better Metrics

Accuracy alone is misleading. Instead, we use metrics that focus on minority class performance:

  • Precision: Of all predicted fraud cases, how many were actually fraud?
  • Recall: Of all actual fraud cases, how many did we correctly detect?
  • F1-Score: A balance between Precision and Recall.
  • ROC-AUC: Short for Receiver Operating Characteristic, Area Under the Curve. It measures how well the model separates fraud from normal across all classification thresholds; values closer to 1 mean better separation.

👉 Example:

If the model predicts 8 fraud cases, but only 5 are correct:

  • Precision = 5/8 = 62.5%
  • Recall = 5/10 = 50%

Accuracy might look high, but Recall shows we missed half the fraud cases.
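
In code, each of these metrics is a single call (y_test, y_pred, model, and X_test are placeholders for your own test labels, predictions, fitted model, and test features):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-score: ", f1_score(y_test, y_pred))

# ROC-AUC is computed from predicted probabilities, not hard 0/1 predictions
print("ROC-AUC:  ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))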

🎯 Summary

  • Imbalance = one class dominates the dataset.
  • Oversampling = add more minority samples.
  • Undersampling = remove majority samples.
  • Class weights = make the model care more about minority class.
  • Better metrics = judge the model using Precision, Recall, F1, not just Accuracy.

3. Handling Imbalanced Datasets Using SMOTE

SMOTE = Synthetic Minority Oversampling Technique.

Instead of duplicating rows, it creates new synthetic samples by interpolating between existing minority samples and their nearest neighbors.

👉 Example:

from imblearn.over_sampling import SMOTE

# X = feature matrix, y = class labels (e.g., 0 = normal, 1 = fraud)
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

4. Handling Outliers

Outliers = extreme values (like age = 200). They can distort results.

  • Detect with IQR or Z‑Score.
  • Remove if they are errors.
  • Transform (log, sqrt).
  • Cap values at thresholds.

👉 Example:

import numpy as np

z_scores = np.abs((df['Salary'] - df['Salary'].mean()) / df['Salary'].std())
df_no_outliers = df[z_scores < 3]  # keep rows within 3 standard deviations
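
The IQR (interquartile range) approach caps values instead of dropping rows. A small sketch on the same Salary column:

# Cap salaries outside 1.5 * IQR of the middle 50% of the data
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['Salary'] = df['Salary'].clip(lower=lower, upper=upper)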

5. Data Encoding – Nominal / One‑Hot Encoding

Step 1: What is Nominal Data?

Nominal data means categories that don’t have any order.

Example:

  • Colors: Red, Blue, Green
  • Cities: London, Paris, Tokyo

Here, Red is not bigger than Blue, and Paris is not smaller than Tokyo. They’re just names.


Step 2: Why Do We Need Encoding?

Machine learning models only understand numbers, not text.

So we must convert categories into numbers.


Step 3: One‑Hot Encoding (OHE)

One‑Hot Encoding creates new columns for each category.

Each column has 1 if the row belongs to that category, otherwise 0.

👉 Example Dataset:

Color
Red
Blue
Green
Red

👉 After One‑Hot Encoding:

Color_Blue   Color_Green   Color_Red
0            0             1
1            0             0
0            1             0
0            0             1
  • First row is Red → Color_Red=1
  • Second row is Blue → Color_Blue=1
  • Third row is Green → Color_Green=1

Step 4: Python Example

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
# prefix='Color' gives Color_* column names; dtype=int prints 0/1 instead of True/False
encoded = pd.get_dummies(df['Color'], prefix='Color', drop_first=True, dtype=int)
print(encoded)

👉 Output:

   Color_Blue  Color_Green
0           0            0
1           1            0
2           0            1
3           0            0

Here we used drop_first=True to avoid the dummy variable trap (one column is dropped because the remaining columns already carry the same information).


  • Nominal data = categories with no order (like colors, cities).
  • One‑Hot Encoding = turn each category into a new column with 0/1 values.
  • This helps models understand categorical data without assuming any order.

6. Label and Ordinal Encoding

  • Label Encoding: Assign numbers to categories (Red=0, Blue=1).
  • Ordinal Encoding: Use when categories have order (Small=1, Medium=2, Large=3).

👉 Example:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes alphabetically: Blue=0, Green=1, Red=2
df['Color'] = LabelEncoder().fit_transform(df['Color'])
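
For features with a real order, scikit-learn's OrdinalEncoder lets you spell the order out yourself (the Size column here is a made-up example):

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small < Medium < Large  ->  codes 0, 1, 2
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_encoded'] = encoder.fit_transform(df[['Size']]).ravel()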

7. 🎯 Target Guided Ordinal Encoding

  • Problem: Categories are just names (like neighborhoods, brands, cities). Models don’t understand them directly.
  • Normal encoding: If we just assign numbers randomly (Neighborhood A=1, B=2, C=3), the numbers don’t mean anything.
  • Target Guided Encoding: Instead of random numbers, we use the target variable (the thing we’re predicting) to guide the encoding.

🧩 How it works

  1. Take each category (e.g., each neighborhood).
  2. Calculate the average of the target variable for that category (e.g., average house price in that neighborhood).
  3. Replace the category with that average value.

👉 Example:

  • Neighborhood A → Average Price = 200k
  • Neighborhood B → Average Price = 500k
  • Neighborhood C → Average Price = 300k

Now the model sees numbers that carry meaning about the target.


🐍 Python Example

mean_prices = df.groupby('Neighborhood')['Price'].mean()
df['Neighborhood_encoded'] = df['Neighborhood'].map(mean_prices)

So if a house is in Neighborhood B, it gets encoded as 500000 (the average price there).


✅ Why it’s useful

  • Categories are converted into numbers that reflect their relationship with the target.
  • The model learns patterns more easily.
  • Works well when categories strongly influence the target (like location affecting house price).

📝 Summary

  • Target Guided Ordinal Encoding = replace categories with numbers based on target averages.
  • It makes encoding meaningful, not random.
  • Example: Encode neighborhoods by their average house price.

🎯 Final Takeaway

Feature engineering is about making data useful:

  • Fill or fix missing values.
  • Balance datasets so models don’t cheat.
  • Handle outliers to avoid distortion.
  • Encode categories into numbers.
  • Use target‑guided encoding for smarter features.

With these techniques, you’ll move from raw messy data to clean, powerful features that help your models shine ✨.
