Sachin Kr. Rajput

Feature Engineering: The Dark Art of Teaching Your Model to See What You See

The One-Line Summary: Feature engineering transforms raw data into meaningful signals that help models learn. It's often the difference between a mediocre model and a competition-winning one — and it's more art than science.


The Detective Who Could See Everything

Inspector Chen walked into the crime scene.

The rookie officer handed her the report:

Evidence Collected:
- Timestamp: 1705333620
- Location: 40.7128, -74.0060
- Temperature: 28°F
- Item found: Brown leather shoe, size 11
- Substance on shoe: Mixed powder

The rookie shrugged. "Just numbers. Not much to go on."

Inspector Chen smiled. She saw EVERYTHING.


From timestamp 1705333620, she extracted:

  • "January 15, 2024, 3:47 PM"
  • "Monday — a workday"
  • "Mid-afternoon — offices are open"
  • "Winter — explains the cold"

From coordinates 40.7128, -74.0060, she derived:

  • "Manhattan, Financial District"
  • "0.3 miles from the nearest subway"
  • "High foot traffic area"
  • "Near 3 coffee shops, 2 banks, 1 gym"

From temperature 28°F and brown leather shoe, she inferred:

  • "Below freezing — unusual to wear leather shoes in snow"
  • "Size 11 — statistically likely male, 5'10" to 6'2""
  • "Leather in winter = probably came by car or indoor work"

From mixed powder on shoe, she identified:

  • "Concrete dust + coffee grounds + chalk"
  • "Concrete = construction site nearby"
  • "Coffee grounds = barista or café regular"
  • "Chalk = gym, school, or rock climbing"

She announced: "Look for a tall man who works in an office in the Financial District, visits the gym at lunch, and gets coffee from the construction-adjacent café on Pine Street."

They caught him in 2 hours.


Inspector Chen didn't have better data than the rookie.

She had better FEATURES.

She transformed raw numbers into meaningful signals. She combined information to create new insights. She extracted hidden patterns that told a story.

This is feature engineering.

Your model is the rookie. It sees raw numbers. Your job is to be Inspector Chen — to transform those numbers into features that reveal the truth.


What Is Feature Engineering?

Feature engineering is the process of using domain knowledge to create new input variables that make machine learning algorithms work better.

RAW DATA                           ENGINEERED FEATURES
──────────────────────────────────────────────────────────────
"2024-01-15 15:47:00"      →      hour=15, day_of_week=1, 
                                   is_weekend=0, is_business_hours=1,
                                   month=1, quarter=1, is_winter=1

40.7128, -74.0060          →      city="NYC", borough="Manhattan",
                                   nearest_subway_dist=0.3,
                                   population_density=27000,
                                   median_income=95000

"John Smith"               →      name_length=10, has_middle_name=0,
                                   first_name_popularity=0.95,
                                   likely_gender="male"

purchase_amount=150,       →      amount_vs_avg=1.5,
user_avg_purchase=100             is_above_average=1,
                                   pct_of_monthly_budget=0.15

The raw data stays the same. The features multiply.


Why Feature Engineering Matters

Let me prove it with a dramatic example.

The House Price Prediction Challenge

Raw features only:

# Raw data
features = ['square_feet', 'num_bedrooms', 'num_bathrooms', 'year_built', 'lot_size']

# Train model
model = RandomForestRegressor()
model.fit(X[features], y)

# Result
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# R² Score: 0.756

With engineered features:

# Create new features
df['age'] = 2024 - df['year_built']
df['bed_bath_ratio'] = df['num_bedrooms'] / df['num_bathrooms']
df['sqft_per_bedroom'] = df['square_feet'] / df['num_bedrooms']
df['lot_utilization'] = df['square_feet'] / df['lot_size']
df['is_new_construction'] = (df['age'] <= 5).astype(int)
df['is_historic'] = (df['age'] >= 50).astype(int)
df['total_rooms'] = df['num_bedrooms'] + df['num_bathrooms']
df['price_tier_area'] = df['zip_code'].map(zip_price_tiers)  # Domain knowledge

# Train same model
features_engineered = features + ['age', 'bed_bath_ratio', 'sqft_per_bedroom', 
                                   'lot_utilization', 'is_new_construction',
                                   'is_historic', 'total_rooms', 'price_tier_area']

model = RandomForestRegressor()
model.fit(X[features_engineered], y)

# Result
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# R² Score: 0.891

Same model. Same raw data. But R² jumped from 0.756 to 0.891!

Feature engineering added 13.5 percentage points of explained variance. That's often the difference between "interesting prototype" and "production-ready model."


The Feature Engineering Cookbook

Let me show you every technique in the feature engineer's arsenal.


Category 1: DateTime Features

Timestamps are goldmines hiding in plain sight.

import pandas as pd

# Raw timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# === TIME COMPONENTS ===
df['hour'] = df['timestamp'].dt.hour
df['day'] = df['timestamp'].dt.day
df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0=Monday
df['day_name'] = df['timestamp'].dt.day_name()
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
df['year'] = df['timestamp'].dt.year
df['week_of_year'] = df['timestamp'].dt.isocalendar().week

# === BINARY FLAGS ===
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_start'] = df['timestamp'].dt.is_month_start.astype(int)
df['is_month_end'] = df['timestamp'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['timestamp'].dt.is_quarter_end.astype(int)

# === BUSINESS LOGIC ===
df['is_business_hours'] = df['hour'].between(9, 17).astype(int)
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 17, 18, 19]).astype(int)
df['is_night'] = df['hour'].isin([22, 23, 0, 1, 2, 3, 4, 5]).astype(int)

# === CYCLICAL ENCODING (for algorithms that need it) ===
import numpy as np
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# === DOMAIN-SPECIFIC ===
df['is_holiday'] = df['timestamp'].isin(holiday_list).astype(int)
df['days_until_christmas'] = (christmas_date - df['timestamp']).dt.days
df['is_payday'] = df['day'].isin([1, 15]).astype(int)  # Common paydays
df['season'] = df['month'].map({12: 'winter', 1: 'winter', 2: 'winter',
                                 3: 'spring', 4: 'spring', 5: 'spring',
                                 6: 'summer', 7: 'summer', 8: 'summer',
                                 9: 'fall', 10: 'fall', 11: 'fall'})

Why it matters: Your raw timestamp 1705333620 means nothing to a model. But "Monday afternoon in January" tells a story. Retail sales spike on weekends. Restaurant orders peak at 6 PM. Gym visits drop in December. DateTime features capture these patterns.
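
A quick sketch of why the cyclical encoding above matters: as a plain number, hour 23 looks maximally far from hour 0, even though 11 PM and midnight are a minute apart. Mapped onto the sin/cos circle, they become near neighbors. A minimal, self-contained illustration:

import numpy as np

hours = np.array([0, 23])

# Raw representation: 23:00 and 00:00 appear maximally far apart
print(abs(hours[0] - hours[1]))  # 23

# Cyclical representation: map each hour onto a point on the unit circle
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distance between the two encoded points is now small (~0.26)
print(np.linalg.norm(points[0] - points[1]))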


Category 2: Numerical Transformations

Raw numbers often hide their true signal.

import numpy as np

# === MATHEMATICAL TRANSFORMS ===
df['income_log'] = np.log1p(df['income'])  # Compress skewed distributions
df['income_sqrt'] = np.sqrt(df['income'])
df['income_squared'] = df['income'] ** 2   # Capture non-linear effects

# === BINNING / DISCRETIZATION ===
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 65, 100],
                          labels=['child', 'young_adult', 'adult', 'middle_age', 'senior'])

df['income_bracket'] = pd.qcut(df['income'], q=5, 
                                labels=['very_low', 'low', 'medium', 'high', 'very_high'])

# === NORMALIZATION RELATIVE TO GROUPS ===
df['income_vs_city_median'] = df['income'] / df.groupby('city')['income'].transform('median')
df['income_percentile_in_city'] = df.groupby('city')['income'].rank(pct=True)

# === RATIOS ===
df['debt_to_income'] = df['debt'] / df['income']
df['savings_rate'] = df['savings'] / df['income']
df['price_per_sqft'] = df['price'] / df['square_feet']

# === DIFFERENCES ===
df['price_vs_avg'] = df['price'] - df['price'].mean()
df['age_vs_median'] = df['age'] - df['age'].median()

# === INTERACTIONS ===
df['income_x_education'] = df['income'] * df['education_years']
df['age_x_experience'] = df['age'] * df['years_experience']

# === POLYNOMIAL FEATURES (use sparingly!) ===
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'income']])

Example: Why log transform matters

# Income distribution is heavily skewed
incomes = [30000, 35000, 40000, 45000, 50000, 55000, 60000, 5000000]

# Mean is pulled by the outlier
print(f"Mean: ${np.mean(incomes):,.0f}")        # $660,625
print(f"Median: ${np.median(incomes):,.0f}")    # $47,500

# After log transform, the $5M doesn't dominate
log_incomes = np.log1p(incomes)
print(f"Log mean: {np.mean(log_incomes):.2f}")  # 10.95
print(f"Log of $5M: {np.log1p(5000000):.2f}")   # 15.42 (not that extreme anymore!)

Category 3: Categorical Feature Engineering

Categories are more than labels — they're information carriers.

# === FREQUENCY ENCODING ===
# How common is this category?
df['brand_frequency'] = df.groupby('brand')['brand'].transform('count') / len(df)

# === TARGET ENCODING (careful with leakage!) ===
# What's the average target for this category?
brand_means = df.groupby('brand')['purchased'].mean()
df['brand_purchase_rate'] = df['brand'].map(brand_means)

# === COUNT ENCODING ===
df['brand_count'] = df['brand'].map(df['brand'].value_counts())

# === RARE CATEGORY GROUPING ===
# Group infrequent categories into "Other"
threshold = 0.01  # Categories with < 1% frequency
freq = df['brand'].value_counts(normalize=True)
rare_brands = freq[freq < threshold].index
df['brand_grouped'] = df['brand'].replace(rare_brands, 'Other')

# === BINARY FLAGS ===
df['is_premium_brand'] = df['brand'].isin(['Apple', 'Samsung', 'Sony']).astype(int)
df['is_domestic'] = df['country'].isin(['USA', 'Canada']).astype(int)

# === CATEGORY COMBINATIONS ===
df['brand_category'] = df['brand'] + '_' + df['product_category']
df['location_type'] = df['city'] + '_' + df['store_type']

# === HIERARCHICAL EXTRACTION ===
# From "Electronics > Phones > Smartphones"
df['category_level_1'] = df['category_path'].str.split(' > ').str[0]
df['category_level_2'] = df['category_path'].str.split(' > ').str[1]
df['category_depth'] = df['category_path'].str.count(' > ') + 1

Category 4: Text Features

Text is unstructured gold waiting to be mined.

import re

# === BASIC METRICS ===
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].apply(lambda x: np.mean([len(w) for w in x.split()]))
df['sentence_count'] = df['text'].str.count(r'[.!?]+')

# === CHARACTER PATTERNS ===
df['exclamation_count'] = df['text'].str.count('!')
df['question_count'] = df['text'].str.count(r'\?')
df['uppercase_ratio'] = df['text'].apply(lambda x: sum(1 for c in x if c.isupper()) / len(x))
df['digit_count'] = df['text'].str.count(r'\d')
df['special_char_count'] = df['text'].str.count(r'[^a-zA-Z0-9\s]')

# === KEYWORD PRESENCE ===
df['contains_urgent'] = df['text'].str.lower().str.contains('urgent|asap|immediately').astype(int)
df['contains_money'] = df['text'].str.lower().str.contains(r'\$|\bdollar|\bprice').astype(int)
df['contains_negation'] = df['text'].str.lower().str.contains(r'\bnot\b|\bno\b|\bnever\b').astype(int)

# === SENTIMENT (simple) ===
positive_words = ['good', 'great', 'excellent', 'amazing', 'love', 'best']
negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst', 'horrible']

df['positive_word_count'] = df['text'].apply(
    lambda x: sum(1 for w in x.lower().split() if w in positive_words)
)
df['negative_word_count'] = df['text'].apply(
    lambda x: sum(1 for w in x.lower().split() if w in negative_words)
)
df['sentiment_ratio'] = (df['positive_word_count'] + 1) / (df['negative_word_count'] + 1)

# === EMAIL SPECIFIC ===
df['email_domain'] = df['email'].str.split('@').str[1]
df['is_business_email'] = (~df['email_domain'].isin(['gmail.com', 'yahoo.com', 'hotmail.com'])).astype(int)

# === NAME FEATURES ===
df['name_length'] = df['name'].str.len()
df['name_word_count'] = df['name'].str.split().str.len()
df['has_middle_name'] = (df['name'].str.split().str.len() > 2).astype(int)
df['first_name'] = df['name'].str.split().str[0]
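
The handcrafted features above go a long way, but the cheat sheet later also mentions TF-IDF. Here is a minimal sketch of adding TF-IDF columns with scikit-learn, assuming the same df['text'] column used above (max_features and the n-gram range are arbitrary choices to keep the feature count manageable):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Weight each term by how distinctive it is across documents
vectorizer = TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df['text'])

# Convert the sparse matrix into named columns and join back onto the frame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=[f'tfidf_{t}' for t in vectorizer.get_feature_names_out()],
    index=df.index
)
df = df.join(tfidf_df)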

Category 5: Geographic Features

Coordinates are just the beginning.

from math import radians, sin, cos, sqrt, atan2

# === DISTANCE CALCULATIONS ===
def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate distance between two points in km."""
    R = 6371  # Earth's radius in km

    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))

    return R * c

# Distance to key locations
df['dist_to_city_center'] = df.apply(
    lambda row: haversine_distance(row['lat'], row['lon'], city_center_lat, city_center_lon),
    axis=1
)
df['dist_to_nearest_station'] = df.apply(
    lambda row: min(haversine_distance(row['lat'], row['lon'], s['lat'], s['lon']) 
                    for s in stations),
    axis=1
)

# === GEOHASHING (grouping nearby points) ===
import geohash2
df['geohash_5'] = df.apply(lambda row: geohash2.encode(row['lat'], row['lon'], precision=5), axis=1)

# === ZONE/REGION ASSIGNMENT ===
df['zip_code'] = df.apply(lambda row: get_zipcode(row['lat'], row['lon']), axis=1)
df['neighborhood'] = df['zip_code'].map(zip_to_neighborhood)
df['is_urban'] = df['population_density'] > 1000

# === DERIVED METRICS ===
df['restaurants_within_1km'] = df.apply(count_pois_within_radius, axis=1, poi_type='restaurant', radius=1)
df['avg_income_in_area'] = df['zip_code'].map(zip_income_data)
df['crime_rate_area'] = df['zip_code'].map(zip_crime_data)
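
The row-by-row apply above is fine for small frames. For larger ones, the same haversine formula can be written with NumPy so it runs on whole columns at once. A sketch, assuming the same lat/lon columns and city-center constants as above:

import numpy as np

def haversine_vectorized(lat1, lon1, lat2, lon2):
    """Same formula as haversine_distance above, applied to whole arrays/Series at once."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# One vectorized call for the whole column instead of one Python call per row
df['dist_to_city_center'] = haversine_vectorized(df['lat'], df['lon'],
                                                 city_center_lat, city_center_lon)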

Category 6: Aggregation Features

Individual rows gain meaning from their groups.

# === USER-LEVEL AGGREGATIONS ===
user_stats = df.groupby('user_id').agg({
    'purchase_amount': ['mean', 'sum', 'std', 'min', 'max', 'count'],
    'timestamp': ['min', 'max'],
    'product_category': 'nunique'
}).reset_index()

user_stats.columns = ['user_id', 'user_avg_purchase', 'user_total_spend', 
                       'user_purchase_std', 'user_min_purchase', 'user_max_purchase',
                       'user_purchase_count', 'user_first_purchase', 'user_last_purchase',
                       'user_unique_categories']

# Merge back
df = df.merge(user_stats, on='user_id', how='left')

# === RELATIVE TO USER ===
df['purchase_vs_user_avg'] = df['purchase_amount'] / df['user_avg_purchase']
df['is_above_user_avg'] = (df['purchase_amount'] > df['user_avg_purchase']).astype(int)

# === TIME-BASED AGGREGATIONS ===
# (time-based rolling windows require rows sorted by timestamp within each user)
df = df.sort_values(['user_id', 'timestamp'])

df['user_purchases_last_7d'] = df.groupby('user_id').apply(
    lambda x: x.rolling('7D', on='timestamp')['purchase_amount'].count()
).reset_index(level=0, drop=True)

df['user_spend_last_30d'] = df.groupby('user_id').apply(
    lambda x: x.rolling('30D', on='timestamp')['purchase_amount'].sum()
).reset_index(level=0, drop=True)

# === PRODUCT-LEVEL AGGREGATIONS ===
product_stats = df.groupby('product_id').agg({
    'purchase_amount': 'mean',
    'user_id': 'nunique',
    'rating': 'mean'
}).reset_index()
product_stats.columns = ['product_id', 'product_avg_price', 'product_unique_buyers', 'product_avg_rating']

df = df.merge(product_stats, on='product_id', how='left')

# === CROSS-ENTITY STATS ===
df['user_vs_product_buyers'] = df['user_purchase_count'] / df['product_unique_buyers']
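
One caution before feeding these aggregations to a model: compute the group statistics on training rows only and merge them into both splits, otherwise information from the test set leaks into the features (the same leakage theme covered in the mistakes section below). A sketch, assuming train_df / test_df splits of the data above:

# User-level stats computed from the training split only
user_stats_train = train_df.groupby('user_id')['purchase_amount'].agg(
    user_avg_purchase='mean',
    user_purchase_count='count'
).reset_index()

# Attach the same (train-derived) stats to both splits
train_df = train_df.merge(user_stats_train, on='user_id', how='left')
test_df = test_df.merge(user_stats_train, on='user_id', how='left')

# Users never seen in training get the global training average as a fallback
test_df['user_avg_purchase'] = test_df['user_avg_purchase'].fillna(
    train_df['purchase_amount'].mean()
)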

Category 7: Lag and Window Features (Time Series)

The past predicts the future.

# === LAG FEATURES ===
df = df.sort_values(['user_id', 'timestamp'])

df['prev_purchase_amount'] = df.groupby('user_id')['purchase_amount'].shift(1)
df['prev_2_purchase_amount'] = df.groupby('user_id')['purchase_amount'].shift(2)
df['prev_3_purchase_amount'] = df.groupby('user_id')['purchase_amount'].shift(3)

# === DIFFERENCE FROM PREVIOUS ===
df['purchase_change'] = df['purchase_amount'] - df['prev_purchase_amount']
df['purchase_pct_change'] = df['purchase_amount'] / df['prev_purchase_amount']

# === ROLLING STATISTICS ===
df['rolling_mean_3'] = df.groupby('user_id')['purchase_amount'].transform(
    lambda x: x.rolling(3, min_periods=1).mean()
)
df['rolling_std_3'] = df.groupby('user_id')['purchase_amount'].transform(
    lambda x: x.rolling(3, min_periods=1).std()
)
df['rolling_max_7'] = df.groupby('user_id')['purchase_amount'].transform(
    lambda x: x.rolling(7, min_periods=1).max()
)

# === EXPANDING STATISTICS ===
df['cumulative_purchases'] = df.groupby('user_id')['purchase_amount'].cumsum()
df['cumulative_count'] = df.groupby('user_id').cumcount() + 1
df['expanding_mean'] = df['cumulative_purchases'] / df['cumulative_count']

# === TIME SINCE EVENTS ===
df['days_since_last_purchase'] = df.groupby('user_id')['timestamp'].diff().dt.days
df['days_since_first_purchase'] = (df['timestamp'] - df['user_first_purchase']).dt.days

# === TREND INDICATORS ===
df['is_increasing'] = (df['purchase_amount'] > df['prev_purchase_amount']).astype(int)
df['consecutive_increases'] = df.groupby('user_id')['is_increasing'].transform(
    lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
)
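
Because all of these features only look backwards, the honest way to validate a model built on them is a time-based split: train on earlier rows, evaluate on later ones, never shuffle. A minimal sketch (the 80% cutoff is an arbitrary choice):

# Pick a timestamp cutoff so validation rows are strictly later than training rows
cutoff = df['timestamp'].sort_values().iloc[int(len(df) * 0.8)]

train_df = df[df['timestamp'] <= cutoff]
valid_df = df[df['timestamp'] > cutoff]

print(f"Train: {len(train_df)} rows up to {cutoff}")
print(f"Valid: {len(valid_df)} rows after {cutoff}")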

Category 8: Domain-Specific Features

This is where domain expertise shines.

E-commerce Features

# === SHOPPING BEHAVIOR ===
df['cart_abandonment_rate'] = df['carts_abandoned'] / df['carts_created']
df['wishlist_to_purchase_rate'] = df['wishlist_purchases'] / df['wishlist_adds']
df['avg_time_to_purchase'] = df['total_time_to_purchase'] / df['purchase_count']

# === PRODUCT FEATURES ===
df['is_on_sale'] = (df['sale_price'] < df['original_price']).astype(int)
df['discount_pct'] = (df['original_price'] - df['sale_price']) / df['original_price']
df['price_tier'] = pd.qcut(df['price'], q=5, labels=['budget', 'low', 'mid', 'high', 'premium'])

# === SEASONALITY ===
df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)
df['is_back_to_school'] = df['month'].isin([8, 9]).astype(int)

Healthcare Features

# === PATIENT METRICS ===
df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)
df['bmi_category'] = pd.cut(df['bmi'], bins=[0, 18.5, 25, 30, 100],
                             labels=['underweight', 'normal', 'overweight', 'obese'])

# === VITAL SIGN FEATURES ===
df['heart_rate_zone'] = pd.cut(df['heart_rate'], bins=[0, 60, 100, 120, 200],
                                labels=['low', 'normal', 'elevated', 'high'])
df['blood_pressure_category'] = df.apply(classify_bp, axis=1)

# === TEMPORAL HEALTH ===
df['days_since_last_visit'] = (df['current_date'] - df['last_visit_date']).dt.days
df['visits_per_year'] = df['total_visits'] / df['years_as_patient']
df['medication_adherence'] = df['doses_taken'] / df['doses_prescribed']
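
The classify_bp helper above is left undefined. One possible implementation, using AHA-style cutoffs and assuming hypothetical systolic_bp / diastolic_bp columns (adjust names and thresholds to your own data and clinical guidelines):

def classify_bp(row):
    """Sketch of a blood-pressure category helper (AHA-style cutoffs; assumed column names)."""
    systolic, diastolic = row['systolic_bp'], row['diastolic_bp']
    if systolic >= 140 or diastolic >= 90:
        return 'stage_2_hypertension'
    if systolic >= 130 or diastolic >= 80:
        return 'stage_1_hypertension'
    if systolic >= 120:
        return 'elevated'
    return 'normal'

df['blood_pressure_category'] = df.apply(classify_bp, axis=1)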

Finance Features

# === CREDIT RISK ===
df['debt_to_income_ratio'] = df['total_debt'] / df['annual_income']
df['credit_utilization'] = df['credit_used'] / df['credit_limit']
df['payment_to_income'] = df['monthly_payment'] / (df['annual_income'] / 12)

# === ACCOUNT BEHAVIOR ===
df['avg_daily_balance'] = df['total_balance'] / df['days_in_period']
df['balance_volatility'] = df['balance_std'] / df['balance_mean']
df['overdraft_frequency'] = df['overdraft_count'] / df['total_transactions']

# === FRAUD INDICATORS ===
df['transaction_velocity'] = df['transactions_last_hour']
df['amount_vs_avg'] = df['transaction_amount'] / df['user_avg_transaction']
df['is_new_merchant'] = (df['times_at_merchant'] == 1).astype(int)
df['distance_from_home'] = df.apply(
    lambda r: haversine_distance(r['transaction_lat'], r['transaction_lon'],
                                 r['home_lat'], r['home_lon']),
    axis=1
)

The Feature Engineering Process

Step 1: Understand the Problem

Questions to ask:
├── What am I trying to predict?
├── What would a human expert look at?
├── What patterns exist in the data?
├── What external knowledge is relevant?
└── What's the timeline? (When do I need to predict?)

Step 2: Explore the Data

# Profile your data
import pandas as pd

def feature_profile(df):
    """Generate feature engineering opportunities."""
    print("=== DATA TYPES ===")
    print(df.dtypes.value_counts())

    print("\n=== DATETIME COLUMNS ===")
    datetime_cols = df.select_dtypes(include=['datetime64']).columns.tolist()
    print(f"Found: {datetime_cols}")
    print("Opportunities: hour, day, month, is_weekend, is_holiday, cyclical encoding")

    print("\n=== NUMERIC COLUMNS ===")
    numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
    print(f"Found: {numeric_cols}")
    print("Opportunities: log transform, binning, ratios, interactions, normalization")

    print("\n=== CATEGORICAL COLUMNS ===")
    cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    for col in cat_cols:
        nunique = df[col].nunique()
        print(f"  {col}: {nunique} unique values")
    print("Opportunities: frequency encoding, target encoding, grouping rare, combinations")

    print("\n=== POTENTIAL TEXT COLUMNS ===")
    for col in cat_cols:
        avg_len = df[col].str.len().mean()
        if avg_len > 50:
            print(f"  {col}: avg length {avg_len:.0f} chars")
    print("Opportunities: length, word count, sentiment, keywords, TF-IDF")

feature_profile(df)

Step 3: Generate Features

Start broad, then narrow down.

# Generate MANY features first
df = create_datetime_features(df)
df = create_numeric_features(df)
df = create_categorical_features(df)
df = create_aggregation_features(df)
df = create_domain_features(df)

print(f"Features: {len(df.columns)}")
# Features: 150+
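
The create_* helpers above are placeholders for your own pipeline steps. As a sketch of what one of them might look like, here is a minimal create_datetime_features that wraps a few of the Category 1 transformations:

import pandas as pd

def create_datetime_features(df, col='timestamp'):
    """Sketch of one pipeline step: derive calendar features from a timestamp column."""
    df = df.copy()
    df[col] = pd.to_datetime(df[col])
    df['hour'] = df[col].dt.hour
    df['day_of_week'] = df[col].dt.dayofweek
    df['month'] = df[col].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['is_business_hours'] = df['hour'].between(9, 17).astype(int)
    return df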

Step 4: Select Features

Not all features are useful. Some are noise. Some are redundant.

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Method 1: Correlation-based
corr_with_target = df.corr(numeric_only=True)['target'].abs().sort_values(ascending=False)
top_features = corr_with_target.head(20).index.tolist()

# Method 2: Mutual Information
selector = SelectKBest(mutual_info_classif, k=20)
selector.fit(X, y)
selected_features = X.columns[selector.get_support()].tolist()

# Method 3: Feature Importance from Model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
top_features = importance.nlargest(20).index.tolist()

# Method 4: Recursive Feature Elimination
from sklearn.feature_selection import RFE

rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
rfe.fit(X, y)
selected_features = X.columns[rfe.support_].tolist()
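
The four methods above score features against the target. It also pays to drop redundant features directly: if two columns are almost perfectly correlated with each other, one of them is dead weight. A common sketch (the 0.95 threshold is a judgment call):

import numpy as np

# Absolute pairwise correlations, upper triangle only (so each pair is checked once)
corr_matrix = X.corr(numeric_only=True).abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Drop one column from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} redundant features: {to_drop}")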

Common Mistakes

Mistake 1: Feature Leakage

# ❌ WRONG: Using future information
df['avg_next_7_days'] = df.groupby('user')['sales'].transform(
    lambda x: x.shift(-7).rolling(7).mean()  # LOOKING INTO THE FUTURE!
)

# ✅ RIGHT: Only use past information
df['avg_last_7_days'] = df.groupby('user')['sales'].transform(
    lambda x: x.shift(1).rolling(7).mean()  # Only past data
)

Mistake 2: Target Leakage

# ❌ WRONG: Feature derived from target
df['category_avg_target'] = df.groupby('category')['target'].transform('mean')
# This uses the target to create a feature!

# ✅ RIGHT: Fit the encoder on training data only (ideally out-of-fold / with CV)
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['category'])
encoder.fit(X_train, y_train)
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
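
What "out-of-fold / with CV" means concretely: each row's encoding should come from folds that do not contain that row, so its own target never leaks into its feature. A hand-rolled sketch, assuming X_train has a 'category' column and shares its index with y_train:

import pandas as pd
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
global_mean = y_train.mean()
encoded = pd.Series(index=X_train.index, dtype=float)

for fit_idx, enc_idx in kf.split(X_train):
    # Category means computed only from the other folds
    fold_means = y_train.iloc[fit_idx].groupby(X_train['category'].iloc[fit_idx]).mean()
    encoded.iloc[enc_idx] = (
        X_train['category'].iloc[enc_idx].map(fold_means).fillna(global_mean).to_numpy()
    )

X_train['category_target_enc'] = encoded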

Mistake 3: Not Handling Missing Values Created by Features

# ❌ WRONG: Lag features create NaN for first rows
df['prev_purchase'] = df.groupby('user')['purchase'].shift(1)
# First purchase per user has NaN!

# ✅ RIGHT: Handle the NaN deliberately
df['prev_purchase'] = df.groupby('user')['purchase'].shift(1)
# Flag first purchases BEFORE filling...
df['is_first_purchase'] = df['prev_purchase'].isna().astype(int)
# ...then impute with 0 or the overall median
df['prev_purchase'] = df['prev_purchase'].fillna(df['purchase'].median())

Mistake 4: Over-Engineering

# ❌ WRONG: Creating 1000 features for 1000 rows
# More features than samples = overfitting disaster!

# ✅ RIGHT: Keep features << samples
# Rule of thumb: Start with features < samples / 10
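
A tiny sanity check that operationalizes the rule of thumb above (the /10 divisor is just this article's heuristic, not a law):

# Compare the feature count against the rule-of-thumb limit before training
n_samples, n_features = X.shape
limit = n_samples // 10

if n_features > limit:
    print(f"Warning: {n_features} features for {n_samples} rows "
          f"(rule of thumb suggests at most ~{limit}); consider pruning.")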

The Feature Engineering Cheat Sheet

Data Type      Common Features
──────────────────────────────────────────────────────────────
DateTime       hour, day, month, is_weekend, is_holiday, cyclical encoding, time since event
Numeric        log, sqrt, bins, ratios, interactions, percentiles, z-scores
Categorical    frequency, target encoding, combinations, rare grouping, binary flags
Text           length, word count, sentiment, keywords, TF-IDF, embeddings
Geographic     distance to POI, geohash, region, population density, nearby counts
Time Series    lag, rolling mean/std, cumulative, trend, seasonality
User/Entity    aggregations (mean, sum, count, std), lifetime value, recency

Key Takeaways

  1. Feature engineering is often more important than model selection — A good feature can be worth 10 model tuning iterations

  2. Domain knowledge is your superpower — Know what matters in your problem space

  3. DateTime is a goldmine — Extract every signal from timestamps

  4. Create ratios and interactions — Features combined often reveal hidden patterns

  5. Aggregate at multiple levels — User stats, product stats, time-window stats

  6. Watch for leakage — Never use future information or the target itself

  7. Generate many, select few — Create broadly, then prune ruthlessly

  8. Document everything — Future you needs to know why you created each feature


The One-Sentence Summary

Feature engineering is teaching your model to see what Inspector Chen sees — transforming raw timestamps into "Monday afternoon in winter" and GPS coordinates into "0.3 miles from the subway in a high-income neighborhood."


What's Next?

Now that you understand feature engineering, you're ready for:

  • Feature Selection — Choosing the best features from many
  • Automated Feature Engineering — Tools like Featuretools and AutoML
  • Feature Stores — Production-grade feature management
  • Embeddings — Deep learning's answer to feature engineering

Follow me for the next article in this series!


Let's Connect!

If this expanded your feature engineering toolkit, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your favorite feature engineering trick? Share your secret weapons!


The difference between a model that sees "timestamp: 1705333620" and one that sees "Monday afternoon in January, 3 days before payday, during business hours"? Feature engineering. Be Inspector Chen, not the rookie.


Share this with someone whose model is struggling. The answer might not be a better algorithm — it might be better features.

Happy engineering! 🔧
