Introduction
Feature engineering is one of the most essential steps in the data science pipeline. It is the process of transforming raw data into meaningful features that improve the performance of machine learning models.
In this article, we will dive into the key techniques for effective feature engineering along with hands-on examples to assist you in getting started.
Roles of Features in Machine Learning
In feature engineering, features are the measurable properties that machine learning models use to make predictions or decisions. They are obtained from the raw data and transformed into formats that algorithms can use efficiently.
Some of these features include:
1. Raw Features
These features come straight from the original dataset without any modification. They include subject, grade, and class in a student dataset.
2. Derived Features
These are features generated by combining existing ones, for instance a density feature computed from mass and volume (see the sketch after this list).
3. Categorical Features
These features represent discrete values or classifications, such as brands or types. Most machine learning algorithms require them to be converted to numerical values.
4. Numerical Features
They represent continuous or discrete quantities, such as age, income, or weight.
5. Aggregated Features
These features summarize information over groups of data, such as the average income per city.
6. Spatial Features
These features represent geographical or spatial information, such as the distance between two locations.
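As a minimal sketch of how derived, aggregated, and spatial features can be built with pandas (all column names and values here are made up for the example):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mass': [10.0, 4.0, 9.0, 3.0],
    'volume': [2.0, 1.0, 3.0, 1.5],
    'city': ['Nairobi', 'Nairobi', 'Mombasa', 'Mombasa'],
    'income': [500, 700, 400, 600],
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 1.0, 4.0, 3.0]
})
# Derived feature: combine two existing columns
df['density'] = df['mass'] / df['volume']
# Aggregated feature: summarize a column over groups
df['avg_income_by_city'] = df.groupby('city')['income'].transform('mean')
# Spatial feature: straight-line distance from a reference point at the origin
df['dist_from_origin'] = np.sqrt(df['x'] ** 2 + df['y'] ** 2)
print(df)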
Techniques for Feature Engineering
1. Handling Missing Data
Imputation
This method replaces missing values in the dataset with a summary statistic such as the mean, median, or mode. Example in Python code:
import pandas as pd
from sklearn.impute import SimpleImputer
# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8]
})
# Initialize the imputer to replace missing values with the column mean
imputer = SimpleImputer(strategy="mean")
# Impute missing values across all columns
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
Flagging Missing Values
This technique creates a new binary feature that indicates where values are missing, so the model can learn from the pattern of missingness itself. Example in Python code:
df['A_missing'] = df['A'].isnull().astype(int)
df['B_missing'] = df['B'].isnull().astype(int)
print(df)
2. Encoding Categorical Variables
One-Hot Encoding
This method converts each categorical variable into a set of binary indicator columns, one per category. Example in Python code:
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue']
})
df_encoded = pd.get_dummies(df, columns=['color'])
print(df_encoded)
Label Encoding
This method gives a unique integer to each category in the data. Example in Python code:
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue']
})
# Map each category to a unique integer
encoder = LabelEncoder()
df['color_encoded'] = encoder.fit_transform(df['color'])
print(df)
3. Creating Interaction Features
Polynomial Features
This technique generates new features from the products and powers of existing ones, up to a chosen degree. Example in Python code:
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(poly.fit_transform(df), columns=poly.get_feature_names_out())
print(df_poly)
4. Binning and Discretization
Binning
This method groups values into bins with explicit boundaries. Example in Python code:
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})
df['A_binned'] = pd.cut(df['A'], bins=[0, 2, 4, 6], labels=['low', 'medium', 'high'])
print(df)
Discretization
This technique converts continuous variables into discrete categories; quantile-based discretization with pd.qcut places roughly the same number of observations in each bin. Example in Python code:
df['A_discretized'] = pd.qcut(df['A'], q=3, labels=['low', 'medium', 'high'])
print(df)
5. Feature Extraction
Principal Component Analysis (PCA)
This technique reduces the dimensionality of the data by projecting it onto the components that capture the most variance. Example in Python code:
from sklearn.decomposition import PCA
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8]
})
pca = PCA(n_components=1)
df_pca = pd.DataFrame(pca.fit_transform(df), columns=['PC1'])
print(df_pca)
t-SNE
This technique (t-distributed Stochastic Neighbor Embedding) reduces high-dimensional data to two or three dimensions, primarily for visualization. Example in Python code:
from sklearn.manifold import TSNE
import numpy as np
df = pd.DataFrame({
    'A': np.random.rand(100),
    'B': np.random.rand(100)
})
tsne = TSNE(n_components=2)
df_tsne = pd.DataFrame(tsne.fit_transform(df), columns=['Dim1', 'Dim2'])
print(df_tsne.head())
6. Feature Selection
Filter Methods
These methods select features using statistical measures of their relationship with the target, independently of any model; here f_classif scores each feature with an ANOVA F-test. Example in Python code:
from sklearn.feature_selection import SelectKBest, f_classif
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'target': [1, 0, 1, 0]
})
X = df[['A', 'B']]
y = df['target']
selector = SelectKBest(score_func=f_classif, k=1)
X_new = selector.fit_transform(X, y)
print(X_new)
Wrapper Methods
These methods use a model to evaluate candidate feature subsets; Recursive Feature Elimination (RFE) repeatedly fits the model and removes the weakest feature. Example in Python code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'target': [1, 0, 1, 0]
})
X = df[['A', 'B']]
y = df['target']
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=1)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe)
Challenges in Feature Engineering
While feature engineering remains an important part of leveraging large datasets, it also comes with its challenges.
1. Time-Consuming
The manual feature engineering process involves the data scientist thoroughly examining all available data. The goal is to identify potential combinations of columns and predictors that could yield valuable insights to address the business problem at hand. This ends up requiring a significant amount of time and effort to complete all these steps.
2. Field Expertise
Having a deep understanding of the industry related to a machine learning project is crucial for identifying which features are pertinent and valuable. This knowledge also helps in visualizing how data points may interconnect in meaningful and predictive ways.
3. Advanced Technical Skillset
Feature engineering necessitates advanced technical skills and a comprehensive understanding of data science as well as machine learning algorithms. It requires a specific skill set that includes programming abilities and familiarity with database management. Most feature engineering techniques rely heavily on Python coding skills. Additionally, evaluating the effectiveness of newly created features involves a process of repetitive trial and error.
4. Overfitting
Generating an excessive number of features or overly complex features can result in overfitting. This occurs when the model excels on the training data but struggles to perform effectively on new, unseen data.
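As a toy illustration (the synthetic data and degree below are made up for the example), a degree-15 polynomial model fitted to 30 noisy points typically scores near-perfectly on its training split while doing noticeably worse on held-out data:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(30, 1))
y = X.ravel() + rng.normal(scale=0.1, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Far too many engineered features for this little data
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))
print("Test R^2:", model.score(X_test, y_test))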
Tools for Feature Engineering
Pandas
This is a Python library for data manipulation and analysis; most of the feature creation examples above are written with it.
Scikit-learn
This is an open-source Python library whose preprocessing utilities (imputers, encoders, scalers, feature selectors) power many of the feature engineering techniques shown above.
Feature-Engine
This is a Python library with multiple transformers to engineer and select features for machine learning models.
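As a minimal sketch of its transformer-style API (assuming a 1.x release of Feature-engine, where imputers live in feature_engine.imputation):
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

df = pd.DataFrame({'A': [1.0, 2.0, None, 4.0], 'B': [5.0, None, 7.0, 8.0]})
# Replace missing values in the listed columns with their medians
imputer = MeanMedianImputer(imputation_method='median', variables=['A', 'B'])
df_imputed = imputer.fit_transform(df)
print(df_imputed)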
Featuretools
This is an automated feature engineering library that can create new features from relational data.
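As a minimal sketch of automated feature synthesis (assuming the Featuretools 1.x API with add_dataframe, normalize_dataframe, and target_dataframe_name; older releases use entity-based names, and the shop data below is made up for the example), deep feature synthesis can aggregate order rows into customer-level features:
import featuretools as ft
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [1, 1, 2, 2],
    'amount': [20.0, 35.0, 10.0, 50.0]
})
es = ft.EntitySet(id='shop')
es = es.add_dataframe(dataframe_name='orders', dataframe=orders, index='order_id')
# Split out a customers dataframe related to orders
es = es.normalize_dataframe(base_dataframe_name='orders',
                            new_dataframe_name='customers',
                            index='customer_id')
# Automatically build aggregate features such as SUM(orders.amount)
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='customers')
print(feature_matrix)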
Conclusion
Feature engineering is an essential step in data science. Through feature engineering, you can process your data to uncover hidden patterns and boost the performance of your machine learning models.
By mastering feature engineering, you enhance your models while also gaining a deeper insight into the underlying data and the specific problem you are addressing.