MustafaLSailor

Dimensionality reduction

Dimensionality reduction is a technique used to reduce the complexity of data and shorten processing time. It is often used on large data sets. Dimensionality reduction attempts to preserve the underlying structure and information of the data while reducing the size (number of features) of the data.

For example, a data set may contain thousands of features, but not all of them are equally informative, and some may be strongly correlated with each other. In this case, dimensionality reduction techniques can transform these features into a much smaller feature set.
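
As a quick illustration, one way to spot redundant features is to look at pairwise correlations. The small sketch below uses pandas and the Iris data set (which also appears later in this post) to print the correlation matrix; on Iris, petal length and petal width are almost perfectly correlated, so keeping both adds little new information.

from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris data as a DataFrame so we can inspect feature correlations
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Pairwise Pearson correlations between the four features
print(df.corr().round(2))
# petal length (cm) and petal width (cm) correlate at about 0.96,
# so one of them carries little extra information.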

Dimensionality reduction falls into two main categories:

  1. Feature Selection: This method tries to select the most informative features from the original data set. This can reduce the complexity of the model, shorten training time, and prevent overfitting. Feature selection techniques are generally divided into three main categories: filter methods, wrapper methods, and embedded methods.

  2. Feature Extraction: This method aims to create new features by combining or transforming the original features. This provides a lower-dimensional representation of the data and is often used to visualize data or simplify its complex structure. Feature extraction techniques include methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE).

Dimensionality reduction can both make data more easily understandable (e.g., for visualization) and improve the performance of some machine learning algorithms. Especially with high-dimensional data (the "Curse of Dimensionality" problem), dimensionality reduction techniques can be very valuable.

PCA and LDA

Two popular dimensionality reduction techniques are PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis).

PCA (Principal Component Analysis): PCA is a technique that creates new variables from the correlations between the variables in the data set. These new variables are linear combinations of the original variables and are called "principal components". The principal components capture most of the variance in the data set, so the original data can usually be represented with far fewer components.
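
As a minimal sketch (using the same Iris data set as the examples below), the explained_variance_ratio_ attribute of a fitted PCA model shows how much of the total variance each principal component captures:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

# Fit PCA with all components and inspect the variance captured by each one
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)
# On Iris, the first two components together explain well over 90% of the
# variance, which is why reducing to two dimensions loses very little.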

LDA (Linear Discriminant Analysis): LDA is a technique used in classification problems. LDA tries to minimize the differences within each class while maximizing the differences between classes. In this way, it helps maintain classification performance while reducing the dimensionality of the data.

Both techniques are widely used in the fields of machine learning and data analysis. Which technique to use depends on the specific application and data set.

Feature selection

Feature selection is a technique used in machine learning and data analysis. It helps identify and remove unnecessary features (or variables) in order to improve the performance of the model, prevent overfitting, increase the understandability and interpretability of the model, and reduce training time.

Feature selection falls into three main categories: filter methods, wrapper methods, and embedded methods.

Filter Methods: These methods rely on statistical relationships between the features and the target variable. Each feature is evaluated independently of any model, and the most important features are selected. For example, metrics such as Pearson correlation or the chi-square test can be used.
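
As a small sketch of a filter method, scikit-learn's SelectKBest can score each feature with the chi-square test and keep only the top-scoring ones (shown here on the Iris data set):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score each feature against the target with the chi-square test
# and keep the two highest-scoring features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score per feature
print(X_selected.shape)   # (150, 2)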

Wrapper Methods: These methods work in conjunction with a specific machine learning algorithm and iteratively adjust the feature set to optimize the performance of the model. Examples include backward elimination and recursive feature elimination (RFE).
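
For example, recursive feature elimination is a wrapper method: it repeatedly fits a model and removes the weakest feature until the desired number remains. A minimal sketch with a logistic regression on Iris:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature until two remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; higher numbers were eliminated earlier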

Embedded Methods: These methods integrate feature selection into the training process of the model itself. Examples include regularization techniques (Lasso, Ridge) and tree-based algorithms (Random Forest, Gradient Boosting).
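
As a sketch of embedded selection (since Iris is a classification problem, an L1-penalized logistic regression stands in for Lasso here), the model's own coefficients and a random forest's feature importances indicate which features matter:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# L1 regularization pushes the coefficients of unhelpful features towards zero,
# so feature selection happens as part of training
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
l1_model.fit(X, y)
print(l1_model.coef_)   # coefficients at zero mean the feature is effectively dropped

# Tree-based models expose feature importances learned during training
forest = RandomForestClassifier(random_state=0).fit(X, y)
print(forest.feature_importances_)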

Feature selection is a critical step to improve the overall performance and efficiency of the model.

Python

Below you can find sample code showing how to use PCA and LDA in Python.

First of all, remember that the necessary libraries must be installed for these examples to work. They are numpy, matplotlib, pandas, and scikit-learn (imported as sklearn).

Python Code for PCA:


from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load iris dataset
data = load_iris()
X = data.data
y = data.target

# Build the PCA model
pca = PCA(n_components=2) 

# Transform data with PCA
X_pca = pca.fit_transform(X)

# Plot the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Python Code for LDA:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load iris dataset
data = load_iris()
X = data.data
y = data.target

# Create the LDA model
lda = LDA(n_components=2)

# Convert data with LDA
X_lda = lda.fit_transform(X, y)

# Plot the results
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y)
plt.xlabel('First Linear Discriminant')
plt.ylabel('Second Linear Discriminant')
plt.show()

These examples show how to apply PCA and LDA to the Iris data set. In both cases, the data is reduced to two dimensions and the results are visualized with a scatter plot.

What is n_components=2?

The n_components=2 parameter specifies how many components (or dimensions) PCA or LDA will create.

For example, setting n_components=2 in PCA means that the data will be reduced to two principal components. This is especially useful for visualizing high-dimensional data, because the result can easily be plotted in a two-dimensional graph.

Similarly, setting n_components=2 in LDA means that the data will be projected onto two linear discriminant axes. Note that LDA can produce at most (number of classes - 1) components, so two is the maximum for the three-class Iris data set.

This parameter is usually chosen based on the size of the data set and how much information needs to be preserved for the analysis. Keeping more components preserves more of the original information, but it can also lead to more complexity and less interpretability.
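
A common way to choose n_components for PCA, shown here as a small sketch, is to pass a fraction between 0 and 1 instead of an integer; PCA then keeps just enough components to retain that share of the total variance:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X = load_iris().data

# Keep as many components as needed to retain 95% of the total variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained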
