Feature engineering is the process of transforming raw data into meaningful input features that help a machine learning model learn better and make more accurate predictions. It’s often said that “garbage in, garbage out” — no matter how fancy the algorithm, a model can only perform as well as the quality of its features.
In this blog, we’ll explain what feature engineering is, why it matters, and walk through a real‑world example using the well‑known Titanic survival dataset.
What is a “feature”?
In machine learning, a feature is an individual measurable property or characteristic of the data that the model uses as input. For example:
- In a house price prediction model, features might be: bedrooms, area_sqft, age_of_house, location.
- In a customer churn model, features could be: monthly_spend, days_since_last_login, number_of_support_tickets.
The target (or label) is what we want to predict, like price or churned (yes/no).
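For instance, a tiny house‑price table (with made‑up numbers, purely for illustration) separates features from the target like this:
import pandas as pd

# Toy dataset: every column except 'price' is a feature; 'price' is the target
houses = pd.DataFrame({
    'bedrooms':     [3, 2, 4],
    'area_sqft':    [1400, 900, 2100],
    'age_of_house': [10, 35, 2],
    'price':        [320000, 210000, 540000],
})
X = houses.drop(columns=['price'])  # features (model input)
y = houses['price']                 # target (what we want to predict)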
What is feature engineering?
Feature engineering is the art and science of:
- Selecting which raw variables to use as features.
- Transforming them (scaling, encoding, binning, etc.).
- Creating new features from existing ones (e.g., ratios, aggregations, time‑based features).
The goal is to make the patterns in the data more obvious to the model, so it can learn faster and generalize better to unseen data.
Why is feature engineering important?
Even with powerful algorithms like XGBoost or deep learning, good feature engineering often has a bigger impact on model performance than hyperparameter tuning. Here’s why:
- Better patterns: Raw data (like dates, text, or categories) may not be in a form that models can easily understand; engineering converts it into numerical signals.
- Handles noise and missing data: Techniques like imputation, outlier handling, and normalization make the data more robust.
- Reduces dimensionality: Removing irrelevant features or combining correlated ones can speed up training and reduce overfitting.
- Improves interpretability: Well‑engineered features (like age_group instead of raw age) are easier for humans to understand.
In real‑time ML systems (like fraud detection or recommendations), feature engineering also affects latency and scalability, since features must be computed quickly on streaming data.
Real‑world example: Titanic survival prediction
Let’s walk through feature engineering on the classic Titanic dataset, where the goal is to predict whether a passenger survived the disaster.
Raw dataset (first few rows)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
Target: Survived (0 = No, 1 = Yes)
Features: everything else.
Step 1: Handle missing values
Real data is messy; many passengers have missing Age or Cabin values.
- Age: Instead of dropping rows, impute missing ages with the median age of passengers in the same Pclass (ticket class).
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
- Cabin: Since many values are missing, create a binary feature HasCabin (1 if the cabin is known, 0 otherwise).
df['HasCabin'] = df['Cabin'].notna().astype(int)
- Embarked: Only 2 values are missing; fill them with the most frequent port (‘S’ for Southampton).
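With pandas, this could be a one‑liner (assuming df is the Titanic DataFrame used above):
df['Embarked'] = df['Embarked'].fillna('S')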
Step 2: Encode categorical variables
Models like logistic regression or tree‑based models need numbers, not text.
- Sex: Convert male/female to 0/1.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
- Embarked: Use one‑hot encoding to create three binary columns: Embarked_S, Embarked_C, Embarked_Q.
df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
Step 3: Create new features (feature construction)
This is where domain knowledge shines: we create features that capture meaningful patterns.
- Family size: Combine SibSp (siblings/spouses) and Parch (parents/children) to get the total number of family members on board.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1 # +1 for the passenger
- IsAlone: Flag passengers traveling alone (FamilySize = 1).
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
- Title from Name: Extract titles like “Mr.”, “Mrs.”, “Miss”, “Master” from the Name field, then group rare titles into a single category.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
df = pd.get_dummies(df, columns=['Title'], prefix='Title')
- Age group: Bin Age into categories like child, teenager, young adult, adult, senior.
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], labels=[0,1,2,3,4])
- Fare per person: Divide Fare by FamilySize to get the fare per person, which may be more meaningful than the total fare.
df['FarePerPerson'] = df['Fare'] / df['FamilySize']
Step 4: Scale numerical features (optional)
Some algorithms (like SVM, logistic regression with regularization) perform better when features are on a similar scale.
- Use standardization (mean = 0, std = 1) or min‑max scaling on continuous features like Age, Fare, FarePerPerson.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Fare', 'FarePerPerson']] = scaler.fit_transform(df[['Age', 'Fare', 'FarePerPerson']])
Tree‑based models (Random Forest, XGBoost) are less sensitive to scaling, so this step is optional depending on the model choice.
Step 5: Drop irrelevant columns
Remove columns that won’t be used as features:
- PassengerId, Name, Ticket, Cabin (we already derived HasCabin from it).
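A one‑liner for this step might look like (assuming these columns are still present in df):
df = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])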
Final feature set might look like:
Pclass, Sex, Age, SibSp, Parch, Fare, HasCabin,
FamilySize, IsAlone, AgeGroup, FarePerPerson,
Embarked_S, Embarked_C, Embarked_Q,
Title_Master, Title_Miss, Title_Mr, Title_Mrs, Title_Rare
How this helps the model
After feature engineering, the model sees:
- More signal: FamilySize and IsAlone capture social context; Title captures social status, which historically influenced survival.
- Cleaner input: Missing values are handled, categories are encoded, and numerical features are scaled.
- Better generalization: Using these engineered features, the model can now learn that, for example, women, children, and higher‑class passengers had higher survival rates.
In practice, a simple model (like Logistic Regression) trained on well‑engineered features often outperforms a complex model on raw data.
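As a rough sketch of that idea (the feature subset below is just an illustrative choice, assuming df now holds the engineered Titanic features and the Survived target):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

features = ['Pclass', 'Sex', 'Age', 'Fare', 'HasCabin', 'FamilySize', 'IsAlone', 'FarePerPerson']
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Survived'], test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)  # simple baseline model
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")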
Feature engineering in real‑time ML
In production systems (e.g., real‑time fraud detection), feature engineering must be:
- Fast: Features computed on the fly from streaming events (e.g., “number of transactions in the last 5 minutes”; a small sketch of this appears below).
- Consistent: The same transformations used in training must be applied at inference time.
- Maintainable: Features are often stored in a feature store so they can be reused across models and pipelines.
For example, in a real‑time recommendation system, features like:
- user_avg_session_duration_last_7d
- item_popularity_score_last_hour
- time_since_last_purchase
are computed continuously from event streams and served to the model with low latency.
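As a rough offline prototype of one such trailing‑window feature, here is the “number of transactions in the last 5 minutes” computed with pandas (column names are illustrative; a real streaming system would compute this incrementally on events, not in a batch DataFrame):
import pandas as pd

# Illustrative event log: one row per transaction
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2024-01-01 10:00', '2024-01-01 10:02',
                                 '2024-01-01 10:09', '2024-01-01 10:01',
                                 '2024-01-01 10:03']),
    'amount': [20.0, 35.0, 12.5, 80.0, 15.0],
}).sort_values('timestamp').set_index('timestamp')

# Trailing 5-minute transaction count per user (includes the current event)
events['txns_last_5min'] = (
    events.groupby('user_id')['amount']
          .transform(lambda s: s.rolling('5min').count())
)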
Best practices
- Start simple: Begin with basic transformations (handle missing values, encode categories) before adding complex features.
- Use domain knowledge: Talk to business experts to understand what features might matter (e.g., “VIP customers” in churn prediction).
- Iterate and validate: Use cross‑validation to measure whether a new feature actually improves performance (see the sketch after this list).
- Avoid data leakage: Never use future information (e.g., “average future spend”) in features; only use data available at prediction time.
- Document features: Maintain a feature catalog so everyone knows what each feature means and how it’s computed.
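For example, a quick cross‑validation check of whether a single new feature earns its place might look like this (the baseline column list is just an illustrative assumption, reusing the Titanic features from above):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

baseline = ['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize', 'IsAlone']
model = LogisticRegression(max_iter=1000)

# Mean CV accuracy with and without the candidate feature
without_new = cross_val_score(model, df[baseline], df['Survived'], cv=5).mean()
with_new = cross_val_score(model, df[baseline + ['FarePerPerson']], df['Survived'], cv=5).mean()
print(f"without FarePerPerson: {without_new:.3f}, with: {with_new:.3f}")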
Summary
Feature engineering turns messy, raw data into clean, meaningful inputs that help ML models learn effectively. In the Titanic example, simple steps like:
- Imputing missing Age,
- Creating FamilySize and IsAlone,
- Extracting Title from names,
can significantly boost model performance.
In real‑time ML, feature engineering becomes even more critical: it directly impacts prediction accuracy, latency, and system scalability. Investing time in thoughtful feature engineering is one of the most effective ways to build robust, high‑performing ML systems.