Sachin Kr. Rajput

Feature Engineering: The Dark Art of Teaching Your Model to See What You See

The One-Line Summary: Feature engineering transforms raw data into meaningful signals that help models learn. It's often the difference between a mediocre model and a competition-winning one — and it's more art than science.


The Detective Who Could See Everything

Inspector Chen walked into the crime scene.

The rookie officer handed her the report:

Evidence Collected:
- Timestamp: 1705333620
- Location: 40.7128, -74.0060
- Temperature: 28°F
- Item found: Brown leather shoe, size 11
- Substance on shoe: Mixed powder

The rookie shrugged. "Just numbers. Not much to go on."

Inspector Chen smiled. She saw EVERYTHING.


From timestamp 1705333620, she extracted:

  • "January 15, 2024, 3:47 PM"
  • "Monday — a workday"
  • "Mid-afternoon — offices are open"
  • "Winter — explains the cold"

From coordinates 40.7128, -74.0060, she derived:

  • "Manhattan, Financial District"
  • "0.3 miles from the nearest subway"
  • "High foot traffic area"
  • "Near 3 coffee shops, 2 banks, 1 gym"

From temperature 28°F and brown leather shoe, she inferred:

  • "Below freezing — unusual to wear leather shoes in snow"
  • "Size 11 — statistically likely male, 5'10" to 6'2""
  • "Leather in winter = probably came by car or indoor work"

From mixed powder on shoe, she identified:

  • "Concrete dust + coffee grounds + chalk"
  • "Concrete = construction site nearby"
  • "Coffee grounds = barista or café regular"
  • "Chalk = gym, school, or rock climbing"

She announced: "Look for a tall man who works in an office in the Financial District, visits the gym at lunch, and gets coffee from the construction-adjacent café on Pine Street."

They caught him in 2 hours.


Inspector Chen didn't have better data than the rookie.

She had better FEATURES.

She transformed raw numbers into meaningful signals. She combined information to create new insights. She extracted hidden patterns that told a story.

This is feature engineering.

Your model is the rookie. It sees raw numbers. Your job is to be Inspector Chen — to transform those numbers into features that reveal the truth.


What Is Feature Engineering?

Feature engineering is the process of using domain knowledge to create new input variables that make machine learning algorithms work better.

RAW DATA                           ENGINEERED FEATURES
──────────────────────────────────────────────────────────────
"2024-01-15 15:47:00"      →      hour=15, day_of_week=1, 
                                   is_weekend=0, is_business_hours=1,
                                   month=1, quarter=1, is_winter=1

40.7128, -74.0060          →      city="NYC", borough="Manhattan",
                                   nearest_subway_dist=0.3,
                                   population_density=27000,
                                   median_income=95000

"John Smith"               →      name_length=10, has_middle_name=0,
                                   first_name_popularity=0.95,
                                   likely_gender="male"

purchase_amount=150,       →      amount_vs_avg=1.5,
user_avg_purchase=100             is_above_average=1,
                                   pct_of_monthly_budget=0.15

The raw data stays the same. The features multiply.


Why Feature Engineering Matters

Let me prove it with a dramatic example.

The House Price Prediction Challenge

Raw features only:

# Raw data
features = ['square_feet', 'num_bedrooms', 'num_bathrooms', 'year_built', 'lot_size']

# Train model
model = RandomForestRegressor()
model.fit(X[features], y)

# Result
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# R² Score: 0.756

With engineered features:

# Create new features
df['age'] = 2024 - df['year_built']
df['bed_bath_ratio'] = df['num_bedrooms'] / df['num_bathrooms']
df['sqft_per_bedroom'] = df['square_feet'] / df['num_bedrooms']
df['lot_utilization'] = df['square_feet'] / df['lot_size']
df['is_new_construction'] = (df['age'] <= 5).astype(int)
df['is_historic'] = (df['age'] >= 50).astype(int)
df['total_rooms'] = df['num_bedrooms'] + df['num_bathrooms']
df['price_tier_area'] = df['zip_code'].map(zip_price_tiers)  # Domain knowledge

# Train same model
features_engineered = features + ['age', 'bed_bath_ratio', 'sqft_per_bedroom', 
                                   'lot_utilization', 'is_new_construction',
                                   'is_historic', 'total_rooms', 'price_tier_area']

model = RandomForestRegressor()
model.fit(X[features_engineered], y)

# Result
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# R² Score: 0.891

Same model. Same raw data. But R² jumped from 0.756 to 0.891!

Feature engineering added 13.5 percentage points of explained variance. That's often the difference between "interesting prototype" and "production-ready model."


The Feature Engineering Cookbook

Let me show you every technique in the feature engineer's arsenal.


Category 1: DateTime Features

Timestamps are goldmines hiding in plain sight.

import pandas as pd

# Raw timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# === TIME COMPONENTS ===
df['hour'] = df['timestamp'].dt.hour
df['day'] = df['timestamp'].dt.day
df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0=Monday
df['day_name'] = df['timestamp'].dt.day_name()
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
df['year'] = df['timestamp'].dt.year
df['week_of_year'] = df['timestamp'].dt.isocalendar().week

# === BINARY FLAGS ===
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_start'] = df['timestamp'].dt.is_month_start.astype(int)
df['is_month_end'] = df['timestamp'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['timestamp'].dt.is_quarter_end.astype(int)

# === BUSINESS LOGIC ===
df['is_business_hours'] = df['hour'].between(9, 17).astype(int)
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 17, 18, 19]).astype(int)
df['is_night'] = df['hour'].isin([22, 23, 0, 1, 2, 3, 4, 5]).astype(int)

# === CYCLICAL ENCODING (for algorithms that need it) ===
import numpy as np
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# === DOMAIN-SPECIFIC ===
df['is_holiday'] = df['timestamp'].isin(holiday_list).astype(int)
df['days_until_christmas'] = (christmas_date - df['timestamp']).dt.days
df['is_payday'] = df['day'].isin([1, 15]).astype(int)  # Common paydays
df['season'] = df['month'].map({12: 'winter', 1: 'winter', 2: 'winter',
                                 3: 'spring', 4: 'spring', 5: 'spring',
                                 6: 'summer', 7: 'summer', 8: 'summer',
                                 9: 'fall', 10: 'fall', 11: 'fall'})

Why it matters: Your raw timestamp 1705333620 means nothing to a model. But "Monday afternoon in January" tells a story. Retail sales spike on weekends. Restaurant orders peak at 6 PM. Gym visits drop in December. DateTime features capture these patterns.
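
A quick sketch of why the cyclical encoding above matters: as a plain number, hour 23 looks maximally far from hour 0, even though 11 PM and midnight are a minute apart. Mapped onto the sin/cos circle, they become near neighbors. A minimal, self-contained illustration:

import numpy as np

hours = np.array([0, 23])

# Raw representation: 23:00 and 00:00 appear maximally far apart
print(abs(hours[0] - hours[1]))  # 23

# Cyclical representation: map each hour onto a point on the unit circle
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distance between the two encoded points is now small (~0.26)
print(np.linalg.norm(points[0] - points[1]))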


Category 2: Numerical Transformations

Raw numbers often hide their true signal.

import numpy as np

# === MATHEMATICAL TRANSFORMS ===
df['income_log'] = np.log1p(df['income'])  # Compress skewed distributions
df['income_sqrt'] = np.sqrt(df['income'])
df['income_squared'] = df['income'] ** 2   # Capture non-linear effects

# === BINNING / DISCRETIZATION ===
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 65, 100],
                          labels=['child', 'young_adult', 'adult', 'middle_age', 'senior'])

df['income_bracket'] = pd.qcut(df['income'], q=5, 
                                labels=['very_low', 'low', 'medium', 'high', 'very_high'])

# === NORMALIZATION RELATIVE TO GROUPS ===
df['income_vs_city_median'] = df['income'] / df.groupby('city')['income'].transform('median')
df['income_percentile_in_city'] = df.groupby('city')['income'].rank(pct=True)

# === RATIOS ===
df['debt_to_income'] = df['debt'] / df['income']
df['savings_rate'] = df['savings'] / df['income']
df['price_per_sqft'] = df['price'] / df['square_feet']

# === DIFFERENCES ===
df['price_vs_avg'] = df['price'] - df['price'].mean()
df['age_vs_median'] = df['age'] - df['age'].median()

# === INTERACTIONS ===
df['income_x_education'] = df['income'] * df['education_years']
df['age_x_experience'] = df['age'] * df['years_experience']

# === POLYNOMIAL FEATURES (use sparingly!) ===
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'income']])

Example: Why log transform matters

# Income distribution is heavily skewed
incomes = [30000, 35000, 40000, 45000, 50000, 55000, 60000, 5000000]

# Mean is pulled by the outlier
print(f"Mean: ${np.mean(incomes):,.0f}")        # $660,625
print(f"Median: ${np.median(incomes):,.0f}")    # $47,500

# After log transform, the $5M doesn't dominate
log_incomes = np.log1p(incomes)
print(f"Log mean: {np.mean(log_incomes):.2f}")  # 10.95
print(f"Log of $5M: {np.log1p(5000000):.2f}")   # 15.42 (not that extreme anymore!)

Category 3: Categorical Feature Engineering

Categories are more than labels — they're information carriers.

# === FREQUENCY ENCODING ===
# How common is this category?
df['brand_frequency'] = df.groupby('brand')['brand'].transform('count') / len(df)

# === TARGET ENCODING (careful with leakage!) ===
# What's the average target for this category?
brand_means = df.groupby('brand')['purchased'].mean()
df['brand_purchase_rate'] = df['brand'].map(brand_means)

# === COUNT ENCODING ===
df['brand_count'] = df['brand'].map(df['brand'].value_counts())

# === RARE CATEGORY GROUPING ===
# Group infrequent categories into "Other"
threshold = 0.01  # Categories with < 1% frequency
freq = df['brand'].value_counts(normalize=True)
rare_brands = freq[freq < threshold].index
df['brand_grouped'] = df['brand'].replace(rare_brands, 'Other')

# === BINARY FLAGS ===
df['is_premium_brand'] = df['brand'].isin(['Apple', 'Samsung', 'Sony']).astype(int)
df['is_domestic'] = df['country'].isin(['USA', 'Canada']).astype(int)

# === CATEGORY COMBINATIONS ===
df['brand_category'] = df['brand'] + '_' + df['product_category']
df['location_type'] = df['city'] + '_' + df['store_type']

# === HIERARCHICAL EXTRACTION ===
# From "Electronics > Phones > Smartphones"
df['category_level_1'] = df['category_path'].str.split(' > ').str[0]
df['category_level_2'] = df['category_path'].str.split(' > ').str[1]
df['category_depth'] = df['category_path'].str.count(' > ') + 1

Category 4: Text Features

Text is unstructured gold waiting to be mined.

import re

# === BASIC METRICS ===
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].apply(lambda x: np.mean([len(w) for w in x.split()]))
df['sentence_count'] = df['text'].str.count(r'[.!?]+')

# === CHARACTER PATTERNS ===
df['exclamation_count'] = df['text'].str.count('!')
df['question_count'] = df['text'].str.count(r'\?')
df['uppercase_ratio'] = df['text'].apply(lambda x: sum(1 for c in x if c.isupper()) / len(x))
df['digit_count'] = df['text'].str.count(r'\d')
df['special_char_count'] = df['text'].str.count(r'[^a-zA-Z0-9\s]')

# === KEYWORD PRESENCE ===
df['contains_urgent'] = df['text'].str.lower().str.contains('urgent|asap|immediately').astype(int)
df['contains_money'] = df['text'].str.lower().str.contains(r'\$|\bdollar|\bprice').astype(int)
df['contains_negation'] = df['text'].str.lower().str.contains(r'\bnot\b|\bno\b|\bnever\b').astype(int)

# === SENTIMENT (simple) ===
positive_words = ['good', 'great', 'excellent', 'amazing', 'love', 'best']
negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst', 'horrible']

df['positive_word_count'] = df['text'].apply(
    lambda x: sum(1 for w in x.lower().split() if w in positive_words)
)
df['negative_word_count'] = df['text'].apply(
    lambda x: sum(1 for w in x.lower().split() if w in negative_words)
)
df['sentiment_ratio'] = (df['positive_word_count'] + 1) / (df['negative_word_count'] + 1)

# === EMAIL SPECIFIC ===
df['email_domain'] = df['email'].str.split('@').str[1]
df['is_business_email'] = (~df['email_domain'].isin(['gmail.com', 'yahoo.com', 'hotmail.com'])).astype(int)

# === NAME FEATURES ===
df['name_length'] = df['name'].str.len()
df['name_word_count'] = df['name'].str.split().str.len()
df['has_middle_name'] = (df['name'].str.split().str.len() > 2).astype(int)
df['first_name'] = df['name'].str.split().str[0]
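
The handcrafted features above go a long way, but the cheat sheet later also mentions TF-IDF. Here is a minimal sketch of adding TF-IDF columns with scikit-learn, assuming the same df['text'] column used above (max_features and the n-gram range are arbitrary choices to keep the feature count manageable):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Weight each term by how distinctive it is across documents
vectorizer = TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df['text'])

# Convert the sparse matrix into named columns and join back onto the frame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=[f'tfidf_{t}' for t in vectorizer.get_feature_names_out()],
    index=df.index
)
df = df.join(tfidf_df)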

Category 5: Geographic Features

Coordinates are just the beginning.

from math import radians, sin, cos, sqrt, atan2

# === DISTANCE CALCULATIONS ===
def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate distance between two points in km."""
    R = 6371  # Earth's radius in km

    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))

    return R * c

# Distance to key locations
df['dist_to_city_center'] = df.apply(
    lambda row: haversine_distance(row['lat'], row['lon'], city_center_lat, city_center_lon),
    axis=1
)
df['dist_to_nearest_station'] = df.apply(
    lambda row: min(haversine_distance(row['lat'], row['lon'], s['lat'], s['lon']) 
                    for s in stations),
    axis=1
)

# === GEOHASHING (grouping nearby points) ===
import geohash2
df['geohash_5'] = df.apply(lambda row: geohash2.encode(row['lat'], row['lon'], precision=5), axis=1)

# === ZONE/REGION ASSIGNMENT ===
df['zip_code'] = df.apply(lambda row: get_zipcode(row['lat'], row['lon']), axis=1)
df['neighborhood'] = df['zip_code'].map(zip_to_neighborhood)
df['is_urban'] = df['population_density'] > 1000

# === DERIVED METRICS ===
df['restaurants_within_1km'] = df.apply(count_pois_within_radius, axis=1, poi_type='restaurant', radius=1)
df['avg_income_in_area'] = df['zip_code'].map(zip_income_data)
df['crime_rate_area'] = df['zip_code'].map(zip_crime_data)
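
The row-by-row apply above is fine for small frames. For larger ones, the same haversine formula can be written with NumPy so it runs on whole columns at once. A sketch, assuming the same lat/lon columns and city-center constants as above:

import numpy as np

def haversine_vectorized(lat1, lon1, lat2, lon2):
    """Same formula as haversine_distance above, applied to whole arrays/Series at once."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# One vectorized call for the whole column instead of one Python call per row
df['dist_to_city_center'] = haversine_vectorized(df['lat'], df['lon'],
                                                 city_center_lat, city_center_lon)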

Category 6: Aggregation Features

Individual rows gain meaning from their groups.

# === USER-LEVEL AGGREGATIONS ===
user_stats = df.groupby('user_id').agg({
    'purchase_amount': ['mean', 'sum', 'std', 'min', 'max', 'count'],
    'timestamp': ['min', 'max'],
    'product_category': 'nunique'
}).reset_index()

user_stats.columns = ['user_id', 'user_avg_purchase', 'user_total_spend', 
                       'user_purchase_std', 'user_min_purchase', 'user_max_purchase',
                       'user_purchase_count', 'user_first_purchase', 'user_last_purchase',
                       'user_unique_categories']

# Merge back
df = df.merge(user_stats, on='user_id', how='left')

# === RELATIVE TO USER ===
df['purchase_vs_user_avg'] = df['purchase_amount'] / df['user_avg_purchase']
df['is_above_user_avg'] = (df['purchase_amount'] > df['user_avg_purchase']).astype(int)

# === TIME-BASED AGGREGATIONS ===
# (time-based rolling windows require rows sorted by timestamp within each user)
df = df.sort_values(['user_id', 'timestamp'])

df['user_purchases_last_7d'] = df.groupby('user_id').apply(
    lambda x: x.rolling('7D', on='timestamp')['purchase_amount'].count()
).reset_index(level=0, drop=True)

df['user_spend_last_30d'] = df.groupby('user_id').apply(
    lambda x: x.rolling('30D', on='timestamp')['purchase_amount'].sum()
).reset_index(level=0, drop=True)

# === PRODUCT-LEVEL AGGREGATIONS ===
product_stats = df.groupby('product_id').agg({
    'purchase_amount': 'mean',
    'user_id': 'nunique',
    'rating': 'mean'
}).reset_index()
product_stats.columns = ['product_id', 'product_avg_price', 'product_unique_buyers', 'product_avg_rating']

df = df.merge(product_stats, on='product_id', how='left')

# === CROSS-ENTITY STATS ===
df['user_vs_product_buyers'] = df['user_purchase_count'] / df['product_unique_buyers']
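
One caution before feeding these aggregations to a model: compute the group statistics on training rows only and merge them into both splits, otherwise information from the test set leaks into the features (the same leakage theme covered in the mistakes section below). A sketch, assuming train_df / test_df splits of the data above:

# User-level stats computed from the training split only
user_stats_train = train_df.groupby('user_id')['purchase_amount'].agg(
    user_avg_purchase='mean',
    user_purchase_count='count'
).reset_index()

# Attach the same (train-derived) stats to both splits
train_df = train_df.merge(user_stats_train, on='user_id', how='left')
test_df = test_df.merge(user_stats_train, on='user_id', how='left')

# Users never seen in training get the global training average as a fallback
test_df['user_avg_purchase'] = test_df['user_avg_purchase'].fillna(
    train_df['purchase_amount'].mean()
)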

Category 7: Lag and Window Features (Time Series)

The past predicts the future.

# === LAG FEATURES ===
df = df.sort_values(['user_id', 'timestamp'])

df['prev_purchase_amount'] = df.groupby('user_id')['purchase_amount'].shift(1)
df['prev_2_purchase_amount'] = df.groupby('user_id')['purchase_amount'].shift(2)
df['prev_3_purchase_amount'] = df.groupby('user_id')['purchase_amount'].shift(3)

# === DIFFERENCE FROM PREVIOUS ===
df['purchase_change'] = df['purchase_amount'] - df['prev_purchase_amount']
df['purchase_pct_change'] = df['purchase_amount'] / df['prev_purchase_amount']

# === ROLLING STATISTICS ===
df['rolling_mean_3'] = df.groupby('user_id')['purchase_amount'].transform(
    lambda x: x.rolling(3, min_periods=1).mean()
)
df['rolling_std_3'] = df.groupby('user_id')['purchase_amount'].transform(
    lambda x: x.rolling(3, min_periods=1).std()
)
df['rolling_max_7'] = df.groupby('user_id')['purchase_amount'].transform(
    lambda x: x.rolling(7, min_periods=1).max()
)

# === EXPANDING STATISTICS ===
df['cumulative_purchases'] = df.groupby('user_id')['purchase_amount'].cumsum()
df['cumulative_count'] = df.groupby('user_id').cumcount() + 1
df['expanding_mean'] = df['cumulative_purchases'] / df['cumulative_count']

# === TIME SINCE EVENTS ===
df['days_since_last_purchase'] = df.groupby('user_id')['timestamp'].diff().dt.days
df['days_since_first_purchase'] = (df['timestamp'] - df['user_first_purchase']).dt.days

# === TREND INDICATORS ===
df['is_increasing'] = (df['purchase_amount'] > df['prev_purchase_amount']).astype(int)
df['consecutive_increases'] = df.groupby('user_id')['is_increasing'].transform(
    lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
)
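
Because all of these features only look backwards, the honest way to validate a model built on them is a time-based split: train on earlier rows, evaluate on later ones, never shuffle. A minimal sketch (the 80% cutoff is an arbitrary choice):

# Pick a timestamp cutoff so validation rows are strictly later than training rows
cutoff = df['timestamp'].sort_values().iloc[int(len(df) * 0.8)]

train_df = df[df['timestamp'] <= cutoff]
valid_df = df[df['timestamp'] > cutoff]

print(f"Train: {len(train_df)} rows up to {cutoff}")
print(f"Valid: {len(valid_df)} rows after {cutoff}")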

Category 8: Domain-Specific Features

This is where domain expertise shines.

E-commerce Features

# === SHOPPING BEHAVIOR ===
df['cart_abandonment_rate'] = df['carts_abandoned'] / df['carts_created']
df['wishlist_to_purchase_rate'] = df['wishlist_purchases'] / df['wishlist_adds']
df['avg_time_to_purchase'] = df['total_time_to_purchase'] / df['purchase_count']

# === PRODUCT FEATURES ===
df['is_on_sale'] = (df['sale_price'] < df['original_price']).astype(int)
df['discount_pct'] = (df['original_price'] - df['sale_price']) / df['original_price']
df['price_tier'] = pd.qcut(df['price'], q=5, labels=['budget', 'low', 'mid', 'high', 'premium'])

# === SEASONALITY ===
df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)
df['is_back_to_school'] = df['month'].isin([8, 9]).astype(int)

Healthcare Features

# === PATIENT METRICS ===
df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)
df['bmi_category'] = pd.cut(df['bmi'], bins=[0, 18.5, 25, 30, 100],
                             labels=['underweight', 'normal', 'overweight', 'obese'])

# === VITAL SIGN FEATURES ===
df['heart_rate_zone'] = pd.cut(df['heart_rate'], bins=[0, 60, 100, 120, 200],
                                labels=['low', 'normal', 'elevated', 'high'])
df['blood_pressure_category'] = df.apply(classify_bp, axis=1)

# === TEMPORAL HEALTH ===
df['days_since_last_visit'] = (df['current_date'] - df['last_visit_date']).dt.days
df['visits_per_year'] = df['total_visits'] / df['years_as_patient']
df['medication_adherence'] = df['doses_taken'] / df['doses_prescribed']
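
The classify_bp helper above is left undefined. One possible implementation, using AHA-style cutoffs and assuming hypothetical systolic_bp / diastolic_bp columns (adjust names and thresholds to your own data and clinical guidelines):

def classify_bp(row):
    """Sketch of a blood-pressure category helper (AHA-style cutoffs; assumed column names)."""
    systolic, diastolic = row['systolic_bp'], row['diastolic_bp']
    if systolic >= 140 or diastolic >= 90:
        return 'stage_2_hypertension'
    if systolic >= 130 or diastolic >= 80:
        return 'stage_1_hypertension'
    if systolic >= 120:
        return 'elevated'
    return 'normal'

df['blood_pressure_category'] = df.apply(classify_bp, axis=1)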

Finance Features

# === CREDIT RISK ===
df['debt_to_income_ratio'] = df['total_debt'] / df['annual_income']
df['credit_utilization'] = df['credit_used'] / df['credit_limit']
df['payment_to_income'] = df['monthly_payment'] / (df['annual_income'] / 12)

# === ACCOUNT BEHAVIOR ===
df['avg_daily_balance'] = df['total_balance'] / df['days_in_period']
df['balance_volatility'] = df['balance_std'] / df['balance_mean']
df['overdraft_frequency'] = df['overdraft_count'] / df['total_transactions']

# === FRAUD INDICATORS ===
df['transaction_velocity'] = df['transactions_last_hour']
df['amount_vs_avg'] = df['transaction_amount'] / df['user_avg_transaction']
df['is_new_merchant'] = (df['times_at_merchant'] == 1).astype(int)
df['distance_from_home'] = df.apply(
    lambda r: haversine_distance(r['transaction_lat'], r['transaction_lon'],
                                 r['home_lat'], r['home_lon']),
    axis=1
)

The Feature Engineering Process

Step 1: Understand the Problem

Questions to ask:
├── What am I trying to predict?
├── What would a human expert look at?
├── What patterns exist in the data?
├── What external knowledge is relevant?
└── What's the timeline? (When do I need to predict?)

Step 2: Explore the Data

# Profile your data
import pandas as pd

def feature_profile(df):
    """Generate feature engineering opportunities."""
    print("=== DATA TYPES ===")
    print(df.dtypes.value_counts())

    print("\n=== DATETIME COLUMNS ===")
    datetime_cols = df.select_dtypes(include=['datetime64']).columns.tolist()
    print(f"Found: {datetime_cols}")
    print("Opportunities: hour, day, month, is_weekend, is_holiday, cyclical encoding")

    print("\n=== NUMERIC COLUMNS ===")
    numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
    print(f"Found: {numeric_cols}")
    print("Opportunities: log transform, binning, ratios, interactions, normalization")

    print("\n=== CATEGORICAL COLUMNS ===")
    cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    for col in cat_cols:
        nunique = df[col].nunique()
        print(f"  {col}: {nunique} unique values")
    print("Opportunities: frequency encoding, target encoding, grouping rare, combinations")

    print("\n=== POTENTIAL TEXT COLUMNS ===")
    for col in cat_cols:
        avg_len = df[col].str.len().mean()
        if avg_len > 50:
            print(f"  {col}: avg length {avg_len:.0f} chars")
    print("Opportunities: length, word count, sentiment, keywords, TF-IDF")

feature_profile(df)

Step 3: Generate Features

Start broad, then narrow down.

# Generate MANY features first
df = create_datetime_features(df)
df = create_numeric_features(df)
df = create_categorical_features(df)
df = create_aggregation_features(df)
df = create_domain_features(df)

print(f"Features: {len(df.columns)}")
# Features: 150+
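
The create_* helpers above are placeholders for your own pipeline steps. As a sketch of what one of them might look like, here is a minimal create_datetime_features that wraps a few of the Category 1 transformations:

import pandas as pd

def create_datetime_features(df, col='timestamp'):
    """Sketch of one pipeline step: derive calendar features from a timestamp column."""
    df = df.copy()
    df[col] = pd.to_datetime(df[col])
    df['hour'] = df[col].dt.hour
    df['day_of_week'] = df[col].dt.dayofweek
    df['month'] = df[col].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['is_business_hours'] = df['hour'].between(9, 17).astype(int)
    return df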

Step 4: Select Features

Not all features are useful. Some are noise. Some are redundant.

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Method 1: Correlation-based
corr_with_target = df.corr(numeric_only=True)['target'].abs().sort_values(ascending=False)
top_features = corr_with_target.head(20).index.tolist()

# Method 2: Mutual Information
selector = SelectKBest(mutual_info_classif, k=20)
selector.fit(X, y)
selected_features = X.columns[selector.get_support()].tolist()

# Method 3: Feature Importance from Model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
top_features = importance.nlargest(20).index.tolist()

# Method 4: Recursive Feature Elimination
from sklearn.feature_selection import RFE

rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
rfe.fit(X, y)
selected_features = X.columns[rfe.support_].tolist()
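
The four methods above score features against the target. It also pays to drop redundant features directly: if two columns are almost perfectly correlated with each other, one of them is dead weight. A common sketch (the 0.95 threshold is a judgment call):

import numpy as np

# Absolute pairwise correlations, upper triangle only (so each pair is checked once)
corr_matrix = X.corr(numeric_only=True).abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Drop one column from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} redundant features: {to_drop}")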

Common Mistakes

Mistake 1: Feature Leakage

# ❌ WRONG: Using future information
df['avg_next_7_days'] = df.groupby('user')['sales'].transform(
    lambda x: x.shift(-7).rolling(7).mean()  # LOOKING INTO THE FUTURE!
)

# ✅ RIGHT: Only use past information
df['avg_last_7_days'] = df.groupby('user')['sales'].transform(
    lambda x: x.shift(1).rolling(7).mean()  # Only past data
)

Mistake 2: Target Leakage

# ❌ WRONG: Feature derived from target
df['category_avg_target'] = df.groupby('category')['target'].transform('mean')
# This uses the target to create a feature!

# ✅ RIGHT: Fit the encoder on training data only (ideally out-of-fold / with CV)
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['category'])
encoder.fit(X_train, y_train)
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
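
What "out-of-fold / with CV" means concretely: each row's encoding should come from folds that do not contain that row, so its own target never leaks into its feature. A hand-rolled sketch, assuming X_train has a 'category' column and shares its index with y_train:

import pandas as pd
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
global_mean = y_train.mean()
encoded = pd.Series(index=X_train.index, dtype=float)

for fit_idx, enc_idx in kf.split(X_train):
    # Category means computed only from the other folds
    fold_means = y_train.iloc[fit_idx].groupby(X_train['category'].iloc[fit_idx]).mean()
    encoded.iloc[enc_idx] = (
        X_train['category'].iloc[enc_idx].map(fold_means).fillna(global_mean).to_numpy()
    )

X_train['category_target_enc'] = encoded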

Mistake 3: Not Handling Missing Values Created by Features

# ❌ WRONG: Lag features create NaN for first rows
df['prev_purchase'] = df.groupby('user')['purchase'].shift(1)
# First purchase per user has NaN!

# ✅ RIGHT: Handle the NaN deliberately
df['prev_purchase'] = df.groupby('user')['purchase'].shift(1)
# Flag first purchases BEFORE filling...
df['is_first_purchase'] = df['prev_purchase'].isna().astype(int)
# ...then impute with 0 or the overall median
df['prev_purchase'] = df['prev_purchase'].fillna(df['purchase'].median())

Mistake 4: Over-Engineering

# ❌ WRONG: Creating 1000 features for 1000 rows
# More features than samples = overfitting disaster!

# ✅ RIGHT: Keep features << samples
# Rule of thumb: Start with features < samples / 10
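
A tiny sanity check that operationalizes the rule of thumb above (the /10 divisor is just this article's heuristic, not a law):

# Compare the feature count against the rule-of-thumb limit before training
n_samples, n_features = X.shape
limit = n_samples // 10

if n_features > limit:
    print(f"Warning: {n_features} features for {n_samples} rows "
          f"(rule of thumb suggests at most ~{limit}); consider pruning.")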

The Feature Engineering Cheat Sheet

Data Type      Common Features
──────────────────────────────────────────────────────────────
DateTime       hour, day, month, is_weekend, is_holiday, cyclical encoding, time since event
Numeric        log, sqrt, bins, ratios, interactions, percentiles, z-scores
Categorical    frequency, target encoding, combinations, rare grouping, binary flags
Text           length, word count, sentiment, keywords, TF-IDF, embeddings
Geographic     distance to POI, geohash, region, population density, nearby counts
Time Series    lag, rolling mean/std, cumulative, trend, seasonality
User/Entity    aggregations (mean, sum, count, std), lifetime value, recency

Key Takeaways

  1. Feature engineering is often more important than model selection — A good feature can be worth 10 model tuning iterations

  2. Domain knowledge is your superpower — Know what matters in your problem space

  3. DateTime is a goldmine — Extract every signal from timestamps

  4. Create ratios and interactions — Features combined often reveal hidden patterns

  5. Aggregate at multiple levels — User stats, product stats, time-window stats

  6. Watch for leakage — Never use future information or the target itself

  7. Generate many, select few — Create broadly, then prune ruthlessly

  8. Document everything — Future you needs to know why you created each feature


The One-Sentence Summary

Feature engineering is teaching your model to see what Inspector Chen sees — transforming raw timestamps into "Monday afternoon in winter" and GPS coordinates into "0.3 miles from the subway in a high-income neighborhood."


What's Next?

Now that you understand feature engineering, you're ready for:

  • Feature Selection — Choosing the best features from many
  • Automated Feature Engineering — Tools like Featuretools and AutoML
  • Feature Stores — Production-grade feature management
  • Embeddings — Deep learning's answer to feature engineering

Follow me for the next article in this series!


Let's Connect!

If this expanded your feature engineering toolkit, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your favorite feature engineering trick? Share your secret weapons!


The difference between a model that sees "timestamp: 1705333620" and one that sees "Monday afternoon in January, 3 days before payday, during business hours"? Feature engineering. Be Inspector Chen, not the rookie.


Share this with someone whose model is struggling. The answer might not be a better algorithm — it might be better features.

Happy engineering! 🔧
