
🛠️ Feature Engineering Made Simple

Machine learning models are only as good as the data we feed them. That’s where feature engineering comes in. Let’s break it down in plain language.


🌱 What is Feature Engineering?

  • A feature is just a column of data (like age, salary, or number of purchases).
  • Feature engineering means creating, modifying, or selecting the right features so that your model learns better.
  • Think of it as preparing ingredients before cooking—you want them clean, chopped, and ready to make the dish tasty.

⚙️ Why Do We Need It?

  • Raw data is often messy, incomplete, or not in the right format.
  • Good features help algorithms see patterns more clearly.
  • Better features = better predictions, faster training, and more accurate results.

🔧 Common Techniques in Feature Engineering

| Technique | What It Means | Simple Example |
|-----------|---------------|----------------|
| Handling Missing Values | Fill in blanks or remove incomplete data | Replace missing ages with the average age |
| Encoding Categorical Data | Convert text labels into numbers | “Red, Blue, Green” → 0, 1, 2 |
| Scaling/Normalization | Put numbers on similar ranges | Salary (₹10,000–₹1,00,000) scaled to 0–1 |
| Feature Creation | Combine or transform existing data into new features | From “Date of Birth” → create “Age” |
| Feature Selection | Keep only the most useful features | Drop irrelevant columns like “User ID” |
| Binning | Group continuous values into categories | Age 0–12 = Child, 13–19 = Teen, 20+ = Adult (sketched below) |
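
Most of these techniques appear in the Python examples later in this post, but binning doesn’t, so here’s a minimal sketch using pandas’ `pd.cut`. The ages and bin edges are just illustrative:

```
import pandas as pd

ages = pd.DataFrame({'Age': [5, 16, 30, 45]})

# Group continuous ages into the categories from the table above
ages['Age_Group'] = pd.cut(
    ages['Age'],
    bins=[0, 12, 19, 120],            # edges: 0-12, 13-19, 20+
    labels=['Child', 'Teen', 'Adult']
)
print(ages)
```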

🧩 Simple Example

Imagine you have this dataset:

| Name | Date of Birth | Salary | City |
|------|---------------|--------|--------|
| Alice | 1995-06-12 | 50,000 | Delhi |
| Bob | 1988-03-05 | 80,000 | Mumbai |

After feature engineering:

  • Age is calculated from Date of Birth.
  • City is encoded as numbers (Delhi=0, Mumbai=1).
  • Salary is scaled between 0 and 1.

Now the data is cleaner and easier for the model to understand.


🚀 Key Takeaways

  • Feature engineering = preparing and improving data features.
  • It makes models smarter and predictions more accurate.
  • Techniques include handling missing values, encoding, scaling, creating new features, and selecting the best ones.

🛠️ Feature Engineering in Python

Make sure you have pandas and scikit-learn installed:

```
pip install pandas scikit-learn
```

```
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Example dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Date_of_Birth': ['1995-06-12', '1988-03-05', '2000-12-20'],
    'Salary': [50000, 80000, None],  # Notice the missing value
    'City': ['Delhi', 'Mumbai', 'Delhi']
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# 🔹 Handling Missing Values: fill the blank salary with the column mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# 🔹 Feature Creation: derive Age from Date of Birth
df['Date_of_Birth'] = pd.to_datetime(df['Date_of_Birth'])
df['Age'] = pd.Timestamp.now().year - df['Date_of_Birth'].dt.year

# 🔹 Encoding Categorical Data: City -> numbers (Delhi=0, Mumbai=1)
label_encoder = LabelEncoder()
df['City_encoded'] = label_encoder.fit_transform(df['City'])

# 🔹 Scaling Numerical Data: squeeze Salary into the 0-1 range
scaler = MinMaxScaler()
df['Salary_scaled'] = scaler.fit_transform(df[['Salary']])

print("\nAfter Feature Engineering:\n", df)
```



🧑‍💻 What this code does:
- Handles missing values by filling in the average salary.
- Creates a new feature (Age) from Date_of_Birth.
- Encodes categorical data (City) into numbers.
- Scales numerical data (Salary) between 0 and 1.

---

## 🧩 Examples for Beginners

### 1. Handling Missing Values
- **Example:**  
  Dataset:

  | Name | Age | Salary |
  |------|-----|--------|
  | Alice | 25 | 50000 |
  | Bob | NaN | 60000 |

  - Fill Bob’s missing age with the average age (25).  
  - Now the dataset has no blanks.

---

### 2. Handling Imbalanced Dataset
- **Example:**  
  Predicting if an email is spam:  
  - 950 emails = “Not Spam”  
  - 50 emails = “Spam”  
  - If you train directly, the model may always predict “Not Spam” and still score 95% accuracy.  
  - Solution: Oversample spam emails or undersample non‑spam emails, as sketched below.
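
Here’s a minimal sketch of random oversampling with `sklearn.utils.resample`; the toy DataFrame is made up to mirror the 950-vs-50 split above:

```
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 950 "not spam" (0) vs 50 "spam" (1)
emails = pd.DataFrame({
    'length': range(1000),
    'label': [0] * 950 + [1] * 50
})

majority = emails[emails['label'] == 0]
minority = emails[emails['label'] == 1]

# Oversample the minority class (with replacement) until it matches the majority
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced['label'].value_counts())  # 950 of each class
```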

---

### 3. Handling Imbalanced Dataset Using SMOTE
- **Example:**  
  Minority class = 50 spam emails.  
  SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic new spam emails by interpolating between existing ones.  
  - Instead of duplicating, it generates realistic variations.  
  - Now you might have 200 spam emails vs 950 non‑spam emails → more balanced.  
  - SMOTE is demonstrated in the full pipeline example at the end of this post.

---

### 4. Handling Outliers Using Python
- **Example:**  
  Salaries: [50,000, 55,000, 60,000, 1,000,000]  
  - The 1,000,000 is an outlier.  
  - You can detect it with a boxplot, a Z‑score, or the IQR rule (sketched below).  
  - Decide: remove it (if it’s an error) or cap it (if valid but extreme).
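
Boxplots are built on the IQR rule, so here’s a minimal sketch that detects and caps the outlier from the salary list above (the 1.5 multiplier is the conventional default):

```
import numpy as np

salaries = np.array([50000, 55000, 60000, 1000000])

# IQR rule: anything beyond Q3 + 1.5 * IQR is treated as an outlier
q1, q3 = np.percentile(salaries, [25, 75])
upper = q3 + 1.5 * (q3 - q1)

print(salaries[salaries > upper])                     # [1000000] -> detected
capped = np.where(salaries > upper, upper, salaries)  # cap instead of remove
print(capped)
```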

---

### 5. Data Encoding – Nominal / One‑Hot Encoding (OHE)
- **Example:**  
  Color column: [Red, Blue, Green]  
  - OHE →  
    - Red = [1,0,0]  
    - Blue = [0,1,0]  
    - Green = [0,0,1]
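
A minimal sketch of the Color example with `pd.get_dummies` (the `dtype=int` argument just makes the output 0/1 instead of True/False):

```
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

# One-hot encode: each category becomes its own 0/1 column
encoded = pd.get_dummies(colors['Color'], prefix='Color', dtype=int)
print(encoded)
```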

---

## 6. Label & Ordinal Encoding
- **Example:**  
  Size column: [Small, Medium, Large]  
  - Label Encoding → Small=0, Medium=1, Large=2  
  - Since there’s a natural order, this works fine.  
  - But if categories were [Red, Blue, Green], label encoding could mislead the model (thinking Green > Blue > Red).
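
Note that sklearn’s `LabelEncoder` assigns numbers alphabetically (here it would give Large=0, Medium=1, Small=2), so for ordered data `OrdinalEncoder` with an explicit category order is the safer tool; a minimal sketch:

```
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# Pass the order explicitly so Small=0, Medium=1, Large=2
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes['Size_encoded'] = encoder.fit_transform(sizes[['Size']])
print(sizes)
```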

---

### 7. Target Guided Ordinal Encoding
- **Example:**  
  Predicting house prices with “City” column:  
  - Delhi average price = ₹50 lakh → encode as 2  
  - Mumbai average price = ₹80 lakh → encode as 3  
  - Pune average price = ₹30 lakh → encode as 1  
  - Encoding reflects the relationship with the target (price).
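
This takes only a few lines of pandas. In the sketch below, the individual prices are made up so that the city averages match the example:

```
import pandas as pd

houses = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Pune', 'Delhi', 'Mumbai', 'Pune'],
    'Price_Lakh': [48, 82, 28, 52, 78, 32]
})

# Rank cities by their average price, then map the rank back onto the column
mean_price = houses.groupby('City')['Price_Lakh'].mean()  # Delhi=50, Mumbai=80, Pune=30
city_rank = mean_price.rank().astype(int)                 # Pune=1, Delhi=2, Mumbai=3
houses['City_encoded'] = houses['City'].map(city_rank)
print(houses)
```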

---
### 🧑‍💻 Full Example: Putting It All Together

Install the required libraries if not already installed:

```
pip install pandas scikit-learn imbalanced-learn matplotlib seaborn
```

```
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

# -----------------------------
# 1. Create a sample dataset
# -----------------------------
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Age': [25, np.nan, 35, 40, 28, 32],   # Missing value
    'Salary': [50000, 60000, 1000000, 55000, 58000, 62000],  # Outlier in Salary
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Pune'],
    'Spam': [0, 0, 0, 0, 1, 1]  # Imbalanced target (only 2 spam)
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# -----------------------------
# 2. Handling Missing Values
# -----------------------------
df['Age'] = df['Age'].fillna(df['Age'].mean())

# -----------------------------
# 3. Handling Outliers
# -----------------------------
# Using IQR method to cap outliers
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
upper_limit = Q3 + 1.5 * IQR
df['Salary'] = np.where(df['Salary'] > upper_limit, upper_limit, df['Salary'])

# -----------------------------
# 4. Encoding Categorical Data
# -----------------------------
# Label Encoding
label_encoder = LabelEncoder()
df['City_Label'] = label_encoder.fit_transform(df['City'])

# One-Hot Encoding
ohe = pd.get_dummies(df['City'], prefix='City')
df = pd.concat([df, ohe], axis=1)

# -----------------------------
# 5. Scaling Numerical Data
# -----------------------------
scaler = MinMaxScaler()
df['Salary_Scaled'] = scaler.fit_transform(df[['Salary']])

# -----------------------------
# 6. Handling Imbalanced Dataset with SMOTE
# -----------------------------
X = df[['Age', 'Salary_Scaled', 'City_Label']]
y = df['Spam']

# k_neighbors must be smaller than the number of minority samples (here: 2 spam rows)
smote = SMOTE(random_state=42, k_neighbors=1)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("\nAfter SMOTE Resampling:")
print("Class distribution:", pd.Series(y_resampled).value_counts())

# -----------------------------
# 7. Visualization Example
# -----------------------------
sns.countplot(x=y_resampled)
plt.title("Balanced Dataset After SMOTE")
plt.show()

print("\nFinal Processed Data:\n", df)
```

---
Feature engineering is the **art of turning raw data into useful features**.  
- Handle **missing values** so models don’t break.  
- Balance **imbalanced datasets** with resampling or SMOTE.  
- Detect and treat **outliers** to avoid skewed results.  
- Use the right **encoding** for categorical data:  
  - OHE for unordered categories.  
  - Label/Ordinal for ordered categories.  
  - Target Guided for smarter encoding using target info.  

👉 In short: **Clean, transform, and enrich your data.** Better features = better models.  

---
## ✨ Final Note
Think of feature engineering like polishing a diamond. The raw stone (data) is valuable, but shaping and refining it (features) unlocks its true brilliance.
