
Giri Dharan

MLOps: Feature Engineering Techniques and Process, Where Most of the Time Is Spent Extracting Solutions

Feature engineering is the process of transforming raw data into meaningful input features that help a machine learning model learn better and make more accurate predictions. It’s often said that “garbage in, garbage out” — no matter how fancy the algorithm, a model can only perform as well as the quality of its features.

In this blog, we’ll explain what feature engineering is and why it matters, then walk through a real‑world example using the well‑known Titanic survival dataset.


What is a “feature”?

In machine learning, a feature is an individual measurable property or characteristic of the data that the model uses as input. For example:

  • In a house price prediction model, features might be: bedrooms, area_sqft, age_of_house, location.
  • In a customer churn model, features could be: monthly_spend, days_since_last_login, number_of_support_tickets.

The target (or label) is what we want to predict, like price or churned (yes/no).


What is feature engineering?

Feature engineering is the art and science of:

  1. Selecting which raw variables to use as features.
  2. Transforming them (scaling, encoding, binning, etc.).
  3. Creating new features from existing ones (e.g., ratios, aggregations, time‑based features).

The goal is to make the patterns in the data more obvious to the model, so it can learn faster and generalize better to unseen data.


Why is feature engineering important?

Even with powerful algorithms like XGBoost or deep learning, good feature engineering often has a bigger impact on model performance than hyperparameter tuning. Here’s why:

  • Better patterns: Raw data (like dates, text, or categories) may not be in a form that models can easily understand; engineering converts them into numerical signals.
  • Handles noise and missing data: Techniques like imputation, outlier handling, and normalization make the data more robust.
  • Reduces dimensionality: Removing irrelevant features or combining correlated ones can speed up training and reduce overfitting.
  • Improves interpretability: Well‑engineered features (like age_group instead of raw age) are easier for humans to understand.

In real‑time ML systems (like fraud detection or recommendations), feature engineering also affects latency and scalability, since features must be computed quickly on streaming data.


Real‑world example: Titanic survival prediction

Let’s walk through feature engineering on the classic Titanic dataset, where the goal is to predict whether a passenger survived the disaster.

Raw dataset (first few rows)

PassengerId | Survived | Pclass | Name                   | Sex    | Age | SibSp | Parch | Ticket           | Fare  | Cabin | Embarked
1           | 0        | 3      | Braund, Mr. Owen       | male   | 22  | 1     | 0     | A/5 21171        | 7.25  | NaN   | S
2           | 1        | 1      | Cumings, Mrs. John     | female | 38  | 1     | 0     | PC 17599         | 71.28 | C85   | C
3           | 1        | 3      | Heikkinen, Miss. Laina | female | 26  | 0     | 0     | STON/O2. 3101282 | 7.925 | NaN   | S

Target: Survived (0 = No, 1 = Yes)

Features: everything else.


Step 1: Handle missing values

Real data is messy; many passengers have missing Age or Cabin values.

  • Age: Instead of dropping rows, impute missing ages with the median age of passengers in the same Pclass (ticket class).
  df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.median()))
  • Cabin: Since many are missing, create a binary feature HasCabin (1 if cabin is known, 0 otherwise).
  df['HasCabin'] = df['Cabin'].notna().astype(int)
  • Embarked: Only two values are missing; fill them with the most frequent port ('S' for Southampton).
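  A one‑line sketch of that fill, in the same pandas style as the snippets above:
  df['Embarked'] = df['Embarked'].fillna('S')  # 'S' (Southampton) is the most frequent port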

Step 2: Encode categorical variables

Models such as logistic regression and tree‑based learners need numbers as input, not text.

  • Sex: Convert male/female to 0/1.
  df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
  • Embarked: Use one‑hot encoding to create three binary columns: Embarked_S, Embarked_C, Embarked_Q.
  df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')

Step 3: Create new features (feature construction)

This is where domain knowledge shines: we create features that capture meaningful patterns.

  • Family size: Combine SibSp (siblings/spouses) and Parch (parents/children) to get the total number of family members on board.
  df['FamilySize'] = df['SibSp'] + df['Parch'] + 1  # +1 for the passenger
  • IsAlone: Flag passengers traveling alone (FamilySize = 1).
  df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
  • Title from Name: Extract titles like "Mr.", "Mrs.", "Miss", "Master" from the Name field, then group rare titles into a single category.
  df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
  df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
  df['Title'] = df['Title'].replace('Mlle', 'Miss')
  df['Title'] = df['Title'].replace('Ms', 'Miss')
  df['Title'] = df['Title'].replace('Mme', 'Mrs')
  df = pd.get_dummies(df, columns=['Title'], prefix='Title')
  • Age group: Bin Age into categories like child, young adult, adult, senior.
  df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], labels=[0,1,2,3,4])
  • Fare per person: Divide Fare by FamilySize to get fare per person, which may be more meaningful than total fare.
  df['FarePerPerson'] = df['Fare'] / df['FamilySize']

Step 4: Scale numerical features (optional)

Some algorithms (like SVM or logistic regression with regularization) perform better when features are on a similar scale.

  • Use standardization (mean=0, std=1) or min‑max scaling on continuous features like Age, Fare, FarePerPerson.
  from sklearn.preprocessing import StandardScaler
  scaler = StandardScaler()
  df[['Age', 'Fare', 'FarePerPerson']] = scaler.fit_transform(df[['Age', 'Fare', 'FarePerPerson']])

Tree‑based models (Random Forest, XGBoost) are less sensitive to scaling, so this step is optional depending on the model choice.


Step 5: Drop irrelevant columns

Remove columns that won’t be used as features:

  • PassengerId, Name, Ticket, and Cabin (Cabin's information is already captured in HasCabin, and Name's in the Title features).
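
A minimal sketch of this cleanup, assuming the same pandas DataFrame df used throughout:

  # Drop identifier and raw text columns; their useful signal now lives in HasCabin and the Title features
  df = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])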

Final feature set might look like:

Pclass, Sex, Age, SibSp, Parch, Fare, HasCabin,
FamilySize, IsAlone, AgeGroup, FarePerPerson,
Embarked_S, Embarked_C, Embarked_Q,
Title_Master, Title_Miss, Title_Mr, Title_Mrs, Title_Rare

How this helps the model

After feature engineering, the model sees:

  • More signal: FamilySize and IsAlone capture social context; Title captures social status, which historically influenced survival.
  • Cleaner input: Missing values are handled, categories are encoded, and numerical features are scaled.
  • Better generalization: The model can now learn that, for example, women, children, and higher‑class passengers had higher survival rates, using these engineered features.

In practice, a simple model (like Logistic Regression) trained on well‑engineered features often outperforms a complex model on raw data.
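
To make that concrete, here is a minimal sketch (an illustration, not part of the original walkthrough) that trains a plain logistic regression on the engineered DataFrame df from the steps above and scores it with cross‑validation:

  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  # Everything except the target is an engineered feature at this point
  feature_cols = [c for c in df.columns if c != 'Survived']
  X = df[feature_cols].astype(float)  # AgeGroup is a pandas Categorical; cast everything to float
  y = df['Survived']

  # A simple linear model often performs well once the features carry the signal
  model = LogisticRegression(max_iter=1000)
  scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
  print(f"Mean CV accuracy: {scores.mean():.3f}")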


Feature engineering in real‑time ML

In production systems (e.g., real‑time fraud detection), feature engineering must be:

  • Fast: Features must be computed on the fly from streaming events (e.g., "number of transactions in the last 5 minutes").
  • Consistent: The same transformations used in training must be applied at inference time (one common way to do this is sketched right after this list).
  • Maintainable: Features are often stored in a feature store so they can be reused across models and pipelines.
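
One common way to enforce that consistency, sketched here under the assumption that scikit‑learn is used as in Step 4 (X_train, y_train, and X_new are placeholder names), is to wrap the transformations and the model in a single Pipeline and ship the fitted object to the serving side:

  import joblib
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import LogisticRegression

  # Training side: scaling and the model are fitted together as one object
  pipe = Pipeline([
      ('scale', StandardScaler()),
      ('clf', LogisticRegression(max_iter=1000)),
  ])
  pipe.fit(X_train, y_train)
  joblib.dump(pipe, 'titanic_pipeline.joblib')

  # Serving side: loading the pipeline guarantees the identical transformations at inference
  pipe = joblib.load('titanic_pipeline.joblib')
  predictions = pipe.predict(X_new)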

For example, in a real‑time recommendation system, features like:

  • user_avg_session_duration_last_7d
  • item_popularity_score_last_hour
  • time_since_last_purchase

are computed continuously from event streams and served to the model with low latency.
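
As a rough illustration of the "transactions in the last few minutes" style of feature, here is a small batch‑style pandas sketch; the events DataFrame and its columns are made up for the example, and a real system would compute this on a stream or in a feature store:

  import pandas as pd

  # Hypothetical event log: one row per transaction, indexed by event time
  events = pd.DataFrame(
      {'user_id': ['u1', 'u1', 'u2', 'u1'], 'amount': [20.0, 35.5, 12.0, 5.0]},
      index=pd.to_datetime(['2024-01-01 10:00', '2024-01-01 10:03',
                            '2024-01-01 10:04', '2024-01-01 10:06']),
  )

  # Per-user count of transactions in the trailing 5-minute window (current event included)
  events = events.sort_index()
  events['txn_count_5min'] = (
      events.groupby('user_id')['amount']
            .transform(lambda s: s.rolling('5min').count())
  )
  print(events)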


Best practices

  1. Start simple: Begin with basic transformations (handle missing values, encode categories) before adding complex features.
  2. Use domain knowledge: Talk to business experts to understand what features might matter (e.g., “VIP customers” in churn prediction).
  3. Iterate and validate: Use cross‑validation to measure whether a new feature actually improves performance (a small sketch follows this list).
  4. Avoid data leakage: Never use future information (e.g., "average future spend") in features; only use data available at prediction time.
  5. Document features: Maintain a feature catalog so everyone knows what each feature means and how it’s computed.
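
Here is the small sketch referenced in point 3: one way, reusing the Titanic DataFrame df from earlier purely as an illustration, to check whether a candidate feature such as FamilySize actually improves cross‑validated accuracy:

  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def cv_accuracy(frame, cols):
      """Mean 5-fold cross-validated accuracy of a logistic regression on the given columns."""
      model = LogisticRegression(max_iter=1000)
      return cross_val_score(model, frame[cols].astype(float), frame['Survived'], cv=5).mean()

  baseline_cols = ['Pclass', 'Sex', 'Age', 'Fare']
  candidate_cols = baseline_cols + ['FamilySize']

  print('baseline accuracy:      ', cv_accuracy(df, baseline_cols))
  print('with FamilySize feature:', cv_accuracy(df, candidate_cols))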

Summary

Feature engineering turns messy, raw data into clean, meaningful inputs that help ML models learn effectively. In the Titanic example, simple steps like:

  • Imputing missing Age,
  • Creating FamilySize and IsAlone,
  • Extracting Title from names,

can significantly boost model performance.

In real‑time ML, feature engineering becomes even more critical: it directly impacts prediction accuracy, latency, and system scalability. Investing time in thoughtful feature engineering is one of the most effective ways to build robust, high‑performing ML systems.
