Sachin Kr. Rajput
Feature Selection vs Feature Extraction: Choosing Your Best Songs vs Remixing Them Into Something New

The One-Line Summary: Feature selection CHOOSES the best existing features and discards the rest. Feature extraction CREATES new features by combining and transforming the originals. Selection keeps "salary" and "age." Extraction might create "PC1" that's 0.7×salary + 0.3×age.


The DJ's Dilemma

DJ Maya had a problem.

She'd collected 500 tracks over the years: bangers, deep cuts, rare remixes, forgotten classics. Every one of them special.

But her upcoming festival set was only 90 minutes. She could fit maybe 20 tracks.

She had two options:


Option 1: Curate the Set (Selection)

Pick the 20 best tracks. Play them as they are.

Original Collection: 500 tracks
                      ↓
          [Selection Process]
          - Remove duplicates
          - Cut low-energy tracks
          - Keep crowd favorites
          - Ensure genre diversity
                      ↓
Final Set: 20 tracks (original, unchanged)

"Billie Jean" stays "Billie Jean"
"One More Time" stays "One More Time"

Pros:

  • Each track is familiar, recognizable
  • The original artistry is preserved
  • Easy to explain: "These are my top 20"

Cons:

  • 480 tracks contribute NOTHING
  • What if the best moments are scattered across many tracks?

Option 2: Create a Megamix (Extraction)

Take ALL 500 tracks and remix them into 20 NEW mashups that capture the essence of everything.

Original Collection: 500 tracks
                      ↓
          [Extraction Process]
          - Analyze BPM, key, energy of all tracks
          - Extract common patterns
          - Blend similar vibes together
          - Create new compositions
                      ↓
Final Set: 20 NEW mashup tracks

"Billie Jean" + "One More Time" + 50 others 
    β†’ "Megamix #1: Disco Groove Essence"

Pros:

  • EVERY original track contributes something
  • Captures patterns across the entire collection
  • Creates something new and unique

Cons:

  • Original tracks are no longer recognizable
  • Harder to explain: "What is 'Principal Component 3'?"

This is feature selection vs feature extraction.

Selection picks the best originals. Extraction creates new compositions from all of them.

Both reduce your 500 → 20. But the 20 you end up with are fundamentally different.


The Technical Translation

Feature Selection

Definition: Choose a subset of the original features. Discard the rest entirely.

Original features: [age, income, debt, savings, credit_score, 
                    education, job_tenure, num_cards, ...]

                   (100 features)
                         ↓
              [Selection Algorithm]
                         ↓

Selected features: [income, credit_score, debt, age, job_tenure]

                   (5 features - SAME AS ORIGINALS)

The features you keep are exactly what you started with. "Income" is still income. "Age" is still age.


Feature Extraction

Definition: Transform ALL original features into NEW features that capture the essential information.

Original features: [age, income, debt, savings, credit_score,
                    education, job_tenure, num_cards, ...]

                   (100 features)
                         ↓
              [Extraction Algorithm]
                         ↓

Extracted features: [PC1, PC2, PC3, PC4, PC5]

                   (5 features - BRAND NEW!)

Where PC1 = 0.45×income + 0.38×credit_score + 0.21×savings - 0.15×debt + ...

The new features are mathematical combinations of the originals. "PC1" isn't income or age; it's a blend of everything.
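
To make the contrast concrete, here's a minimal sketch (the column names are made up purely for illustration) showing that selection hands back original columns while extraction hands back brand-new blends:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy data with hypothetical column names
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=['age', 'income', 'debt', 'savings', 'credit_score'])

# SELECTION: keep a subset of the ORIGINAL columns, untouched
X_selected = X[['income', 'credit_score']]
print(X_selected.columns.tolist())   # ['income', 'credit_score'] - still the originals

# EXTRACTION: PCA builds NEW columns as weighted blends of ALL the originals
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(StandardScaler().fit_transform(X))
print(X_extracted.shape)             # (200, 2) - these are PC1 and PC2, not original features
print(pca.components_[0])            # the weights that define PC1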


Visual: The Key Difference

FEATURE SELECTION (Picking Songs)
════════════════════════════════════════════════════════════

Original:  [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] [F9] [F10]
             ↓        ↓                  ↓             ↓
Selected:  [F1]     [F3]               [F7]          [F10]

• 4 features kept EXACTLY as they were
• 6 features COMPLETELY discarded
• F1 is still F1, F3 is still F3


FEATURE EXTRACTION (Creating Mashups)
════════════════════════════════════════════════════════════

Original:  [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] [F9] [F10]
             ↘   ↓   ↙         ↘   ↓   ↓   ↙        ↘   ↙
Extracted:    [NEW1]              [NEW2]             [NEW3]

• 3 NEW features created
• EVERY original feature contributed something
• NEW1 = blend of F1, F2, F3
• NEW2 = blend of F4, F5, F6, F7
• None of the new features existed before

Feature Selection: The Complete Guide

Why Select Features?

  1. Remove noise - Irrelevant features hurt performance
  2. Reduce overfitting - Fewer features = less memorization
  3. Speed up training - Less data to process
  4. Improve interpretability - "Income matters most" is understandable

Selection Method 1: Filter Methods

The idea: Score each feature independently. Keep the high scorers.

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(42)
X = pd.DataFrame({
    'important_1': np.random.randn(1000) + np.random.randint(0, 2, 1000),
    'important_2': np.random.randn(1000) * 2 + np.random.randint(0, 2, 1000),
    'noise_1': np.random.randn(1000),
    'noise_2': np.random.randn(1000),
    'noise_3': np.random.randn(1000),
})
y = (X['important_1'] + X['important_2'] > 1).astype(int)

# === CORRELATION-BASED ===
correlations = X.apply(lambda col: col.corr(y)).abs()
print("Correlation with target:")
print(correlations.sort_values(ascending=False))

# === ANOVA F-TEST (for classification) ===
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
print(f"\nANOVA selected: {selected_features}")

# === MUTUAL INFORMATION ===
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
print(f"Mutual Info selected: {selected_features}")

Output:

Correlation with target:
important_2    0.523
important_1    0.412
noise_1        0.031
noise_2        0.028
noise_3        0.015

ANOVA selected: ['important_1', 'important_2']
Mutual Info selected: ['important_1', 'important_2']

Filter methods correctly identified the important features!

| Filter Method      | Best For       | How It Works                 |
|--------------------|----------------|------------------------------|
| Correlation        | Numeric target | Linear relationship strength |
| Chi-squared        | Categorical    | Independence test            |
| ANOVA F-test       | Classification | Variance between classes     |
| Mutual Information | Any            | Non-linear relationships     |
| Variance Threshold | Any            | Remove low-variance features |
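
Two filters from the table that the snippet above didn't show are Variance Threshold and Chi-squared. A quick sketch of both on synthetic count data (the chi-squared test requires non-negative features):

from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2
import numpy as np

np.random.seed(42)
X_counts = np.random.randint(0, 10, size=(1000, 4))                # chi2 needs non-negative values
X_counts = np.column_stack([X_counts, np.zeros(1000, dtype=int)])  # add a zero-variance column
y = (X_counts[:, 0] + X_counts[:, 1] > 9).astype(int)

# Variance threshold: drop features whose variance falls below a cutoff
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X_counts)
print(f"VarianceThreshold kept {X_vt.shape[1]} of {X_counts.shape[1]} features")

# Chi-squared: rank count-like features against a categorical target
chi2_selector = SelectKBest(score_func=chi2, k=2)
X_chi2 = chi2_selector.fit_transform(X_counts, y)
print(f"Chi2 kept columns: {np.where(chi2_selector.get_support())[0].tolist()}")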

Selection Method 2: Wrapper Methods

The idea: Actually train models with different feature subsets. Keep the subset that performs best.

from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# === RECURSIVE FEATURE ELIMINATION (RFE) ===
# Repeatedly removes the weakest feature
model = RandomForestClassifier(n_estimators=50, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=2, step=1)
rfe.fit(X, y)

print("RFE Rankings (1 = selected):")
for feature, rank in zip(X.columns, rfe.ranking_):
    status = "✓ SELECTED" if rank == 1 else f"  Rank {rank}"
    print(f"  {feature}: {status}")

# === FORWARD SELECTION ===
# Starts empty, adds features one by one
sfs_forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward',
    cv=5
)
sfs_forward.fit(X, y)
print(f"\nForward Selection: {X.columns[sfs_forward.get_support()].tolist()}")

# === BACKWARD ELIMINATION ===
# Starts with all, removes features one by one
sfs_backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='backward',
    cv=5
)
sfs_backward.fit(X, y)
print(f"Backward Elimination: {X.columns[sfs_backward.get_support()].tolist()}")

Output:

RFE Rankings (1 = selected):
  important_1: ✓ SELECTED
  important_2: ✓ SELECTED
  noise_1:   Rank 3
  noise_2:   Rank 4
  noise_3:   Rank 2

Forward Selection: ['important_1', 'important_2']
Backward Elimination: ['important_1', 'important_2']

| Wrapper Method | Strategy                   | Pros                           | Cons                   |
|----------------|----------------------------|--------------------------------|------------------------|
| RFE            | Remove weakest iteratively | Considers feature interactions | Slow                   |
| Forward        | Add best iteratively       | Fast for few features          | Misses interactions    |
| Backward       | Remove worst iteratively   | Considers all initially        | Slow for many features |
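
One more wrapper worth knowing: RFECV is RFE with cross-validation built in, so it chooses how many features to keep on its own. A minimal sketch, reusing the X and y from above:

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# RFECV runs RFE inside cross-validation and keeps the feature count that
# scores best, so you don't have to guess n_features_to_select yourself.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected: {X.columns[rfecv.get_support()].tolist()}")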

Selection Method 3: Embedded Methods

The idea: Feature selection is BUILT INTO the model training.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# === L1 REGULARIZATION (LASSO-style) ===
# The L1 penalty automatically shrinks unimportant feature weights to ZERO
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
lasso.fit(X, y)

print("LASSO Coefficients:")
for feature, coef in zip(X.columns, lasso.coef_[0]):
    status = "✗ ELIMINATED" if abs(coef) < 0.01 else f"  coef = {coef:.3f}"
    print(f"  {feature}: {status}")

# === TREE-BASED FEATURE IMPORTANCE ===
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

print("\nRandom Forest Importance:")
importance = pd.Series(rf.feature_importances_, index=X.columns)
for feature, imp in importance.sort_values(ascending=False).items():
    bar = "█" * int(imp * 50)
    print(f"  {feature}: {bar} ({imp:.3f})")

# Select features above threshold
threshold = 0.1
selected = importance[importance > threshold].index.tolist()
print(f"\nSelected (importance > {threshold}): {selected}")

Output:

LASSO Coefficients:
  important_1:   coef = 0.847
  important_2:   coef = 0.612
  noise_1: ✗ ELIMINATED
  noise_2: ✗ ELIMINATED
  noise_3: ✗ ELIMINATED

Random Forest Importance:
  important_2: ████████████████████ (0.412)
  important_1: ████████████████ (0.338)
  noise_3: ████ (0.089)
  noise_1: ████ (0.081)
  noise_2: ████ (0.080)

Selected (importance > 0.1): ['important_1', 'important_2']
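
Rather than hard-coding a threshold, scikit-learn's SelectFromModel wraps this same idea in a reusable transformer. A small sketch with the same X and y:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# SelectFromModel keeps features whose importance (or |coefficient|) clears a
# threshold; 'mean' keeps everything above the average importance.
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='mean'
)
sfm.fit(X, y)
print(f"SelectFromModel kept: {X.columns[sfm.get_support()].tolist()}")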

Feature Extraction: The Complete Guide

Why Extract Features?

  1. Capture hidden patterns - Combinations might be more meaningful than individuals
  2. Handle multicollinearity - Correlated features become orthogonal components
  3. Reduce dimensionality dramatically - 10,000 features → 50 components
  4. Enable visualization - Project to 2D/3D for plotting

Extraction Method 1: PCA (Principal Component Analysis)

The idea: Find directions of maximum variance. Project data onto these directions.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

# Create correlated data (100 features, but really only ~3 underlying dimensions)
np.random.seed(42)
n_samples = 500

# Three true underlying factors
factor1 = np.random.randn(n_samples)
factor2 = np.random.randn(n_samples)
factor3 = np.random.randn(n_samples)

# 100 features that are noisy combinations of these factors
X = np.column_stack([
    factor1 + 0.1 * np.random.randn(n_samples) for _ in range(30)
] + [
    factor2 + 0.1 * np.random.randn(n_samples) for _ in range(30)
] + [
    factor3 + 0.1 * np.random.randn(n_samples) for _ in range(30)
] + [
    np.random.randn(n_samples) for _ in range(10)  # Pure noise
])

print(f"Original shape: {X.shape}")  # (500, 100)

# Standardize (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# How much variance does each component explain?
print("\nVariance explained by top components:")
for i, var in enumerate(pca.explained_variance_ratio_[:10]):
    bar = "█" * int(var * 100)
    cumulative = sum(pca.explained_variance_ratio_[:i+1])
    print(f"  PC{i+1}: {bar} {var:.1%} (cumulative: {cumulative:.1%})")

# How many components to keep 95% variance?
cumsum = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumsum >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_components_95}")
print(f"Dimensionality reduction: 100 β†’ {n_components_95} features!")

Output:

Original shape: (500, 100)

Variance explained by top components:
  PC1: ██████████████████████████████ 30.2% (cumulative: 30.2%)
  PC2: █████████████████████████████ 29.8% (cumulative: 60.0%)
  PC3: █████████████████████████████ 29.5% (cumulative: 89.5%)
  PC4: █ 1.1% (cumulative: 90.6%)
  PC5: █ 1.0% (cumulative: 91.6%)
  PC6:  0.9% (cumulative: 92.5%)
  PC7:  0.9% (cumulative: 93.4%)
  PC8:  0.8% (cumulative: 94.2%)
  PC9:  0.8% (cumulative: 95.0%)
  PC10:  0.7% (cumulative: 95.7%)

Components needed for 95% variance: 9
Dimensionality reduction: 100 β†’ 9 features!

PCA found the 3 true underlying factors! PC1, PC2, PC3 capture 89.5% of variance.
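
A handy shortcut: you can pass the variance target straight to PCA instead of computing the cumulative sum yourself. A quick sketch, reusing X_scaled from the block above:

# A float in (0, 1) tells PCA to keep just enough components to explain
# that fraction of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print(f"Components kept for 95% variance: {pca_95.n_components_}")
print(f"Reduced shape: {X_pca_95.shape}")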


Extraction Method 2: LDA (Linear Discriminant Analysis)

The idea: Find directions that MAXIMIZE class separation (supervised).

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load iris dataset (4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

print(f"Original features: {X.shape[1]}")  # 4

# LDA: Extract components that separate classes
lda = LinearDiscriminantAnalysis(n_components=2)  # max = n_classes - 1
X_lda = lda.fit_transform(X, y)

print(f"LDA components: {X_lda.shape[1]}")  # 2

# Visualize
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.title('Original Features (first 2)')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

plt.subplot(1, 2, 2)
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.title('LDA Components (2)')
plt.xlabel('LD1')
plt.ylabel('LD2')

plt.tight_layout()
plt.savefig('lda_visualization.png', dpi=150)
plt.show()

Key difference from PCA:

  • PCA: Maximizes variance (unsupervised)
  • LDA: Maximizes class separation (supervised, needs labels)

Extraction Method 3: t-SNE (for Visualization)

The idea: Preserve local neighborhoods when projecting to 2D/3D.

from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Load digits dataset (64 features - 8x8 images)
digits = load_digits()
X, y = digits.data, digits.target

print(f"Original: {X.shape}")  # (1797, 64)

# t-SNE to 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)

print(f"t-SNE: {X_tsne.shape}")  # (1797, 2)

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', alpha=0.7, s=10)
plt.colorbar(scatter, label='Digit')
plt.title('t-SNE: 64D Digit Images β†’ 2D')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.savefig('tsne_digits.png', dpi=150)
plt.show()

Warning: t-SNE is for VISUALIZATION only. Don't use it for model training!
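
UMAP (which shows up later in the decision tree and quick reference) plays the same visualization role and usually runs faster on large datasets. A minimal sketch, assuming the third-party umap-learn package is installed:

# pip install umap-learn   (third-party package, not part of scikit-learn)
import umap

reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)   # same digits data as the t-SNE example above
print(f"UMAP: {X_umap.shape}")      # (1797, 2)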


Extraction Method 4: Autoencoders (Deep Learning)

The idea: Train a neural network to compress and reconstruct. The bottleneck IS your extracted features.

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Original data: 100 features
X = np.random.randn(1000, 100)

# Build autoencoder
input_dim = 100
encoding_dim = 10  # Compress to 10 features

# Encoder
input_layer = Input(shape=(input_dim,))
encoded = Dense(50, activation='relu')(input_layer)
encoded = Dense(encoding_dim, activation='relu')(encoded)  # Bottleneck!

# Decoder  
decoded = Dense(50, activation='relu')(encoded)
decoded = Dense(input_dim, activation='linear')(decoded)

# Full autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Just the encoder (for feature extraction)
encoder = Model(input_layer, encoded)

# Train
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)

# Extract features
X_encoded = encoder.predict(X)
print(f"Original: {X.shape}")      # (1000, 100)
print(f"Extracted: {X_encoded.shape}")  # (1000, 10)
Output:

Original: (1000, 100)
Extracted: (1000, 10)

The 10-dimensional bottleneck captures the essence of the 100 original features.


Head-to-Head Comparison

| Aspect            | Feature Selection                | Feature Extraction             |
|-------------------|----------------------------------|--------------------------------|
| What it does      | Picks best existing features     | Creates new features           |
| Features after    | Subset of originals              | Completely new                 |
| Interpretability  | High ("income matters")          | Low ("what is PC3?")           |
| Information loss  | Discarded features lost entirely | All features contribute        |
| Multicollinearity | Doesn't solve                    | Solves (orthogonal components) |
| Speed             | Usually faster                   | Can be slower                  |
| Domain knowledge  | Preserved                        | Obscured                       |
| Examples          | RFE, LASSO, SelectKBest          | PCA, LDA, t-SNE, Autoencoders  |

When to Use Which?

START
  │
  ▼
Do you need INTERPRETABILITY?
(Need to explain which features matter?)
  │
  ├── YES ────────────────────────────────► FEATURE SELECTION
  │                                          "Income and age are most important"
  │
  └── NO
       │
       ▼
Do you have HIGHLY CORRELATED features?
  │
  ├── YES ────────────────────────────────► FEATURE EXTRACTION (PCA)
  │                                          Correlated features → orthogonal components
  │
  └── NO
       │
       ▼
Do you need to VISUALIZE high-dimensional data?
  │
  ├── YES ────────────────────────────────► FEATURE EXTRACTION (t-SNE/UMAP)
  │                                          For 2D/3D plots only
  │
  └── NO
       │
       ▼
Is this a CLASSIFICATION problem?
  │
  ├── YES ────────────────────────────────► Try BOTH!
  │     │                                    Selection for interpretability
  │     │                                    LDA for class separation
  │     │
  │     └── Compare performance
  │
  └── NO (regression, clustering, etc.)
       │
       └──────────────────────────────────► FEATURE SELECTION
                                            Often sufficient and simpler

Using Both Together

The best pipelines often combine both approaches!

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Generate data with 100 features
np.random.seed(42)
X = np.random.randn(500, 100)
y = (X[:, 0] + X[:, 1] + X[:, 2] > 0).astype(int)

# Approach 1: Selection only
pipe_selection = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=10)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Approach 2: Extraction only
pipe_extraction = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Approach 3: Selection THEN Extraction
pipe_both = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=30)),  # First: keep top 30
    ('pca', PCA(n_components=10)),              # Then: compress to 10
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Compare
print("Cross-validation accuracy:")
for name, pipe in [('Selection', pipe_selection), 
                    ('Extraction', pipe_extraction),
                    ('Both', pipe_both)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"  {name}: {scores.mean():.1%} Β± {scores.std():.1%}")

Output:

Cross-validation accuracy:
  Selection: 92.4% ± 2.1%
  Extraction: 89.8% ± 3.2%
  Both: 93.2% ± 1.8%

Combining selection THEN extraction often gives the best results!


Common Mistakes

Mistake 1: Confusing Them

# ❌ WRONG: Thinking PCA "selects" features
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
# X_pca columns are NOT original features!

# ✅ RIGHT: Understanding the difference
# PCA creates NEW features (linear combinations)
# To select original features, use SelectKBest, RFE, etc.

Mistake 2: PCA Without Scaling

# ❌ WRONG: PCA on unscaled data
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)  # Income in thousands dominates!

# ✅ RIGHT: Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

Mistake 3: Using t-SNE for Model Training

# ❌ WRONG: t-SNE as preprocessing for classifier
tsne = TSNE(n_components=2)
X_train_tsne = tsne.fit_transform(X_train)
X_test_tsne = tsne.fit_transform(X_test)  # Different embedding!
model.fit(X_train_tsne, y_train)
model.predict(X_test_tsne)  # Meaningless!

# ✅ RIGHT: t-SNE for visualization only
# For model training, use PCA or feature selection

Mistake 4: Fitting on Full Data (Leakage!)

# ❌ WRONG: Fit PCA on all data
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y)  # PCA already saw the test rows!

# ✅ RIGHT: Fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)  # Transform only!
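
An even safer habit is to put the transformer inside a Pipeline, so cross-validation refits it on every training fold for you. A minimal sketch, assuming X and y are your full feature matrix and labels:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The Pipeline refits the scaler and PCA on each training fold only, then
# applies the fitted transforms to the held-out fold, so nothing leaks.
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(leak_free, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.1%}")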

Quick Reference

| Method          | Type       | Supervised? | Use Case                          |
|-----------------|------------|-------------|-----------------------------------|
| SelectKBest     | Selection  | Yes         | Quick filtering                   |
| RFE             | Selection  | Yes         | Model-based selection             |
| LASSO           | Selection  | Yes         | Automatic zeroing                 |
| Tree Importance | Selection  | Yes         | Non-linear importance             |
| PCA             | Extraction | No          | Reduce dimensions, decorrelate    |
| LDA             | Extraction | Yes         | Maximize class separation         |
| t-SNE           | Extraction | No          | Visualization only                |
| UMAP            | Extraction | No          | Visualization (faster than t-SNE) |
| Autoencoder     | Extraction | No          | Non-linear compression            |

Key Takeaways

  1. Selection picks, Extraction transforms - Fundamentally different approaches

  2. Selection preserves interpretability - "Income matters most" is clear

  3. Extraction captures everything - No feature is completely discarded

  4. PCA solves multicollinearity - Creates orthogonal components

  5. t-SNE is for visualization ONLY - Never use for model training

  6. Always scale before PCA - Unscaled features will dominate

  7. Fit on training data only - Apply transform to test data

  8. Combine both for best results - Selection → Extraction often wins


The One-Sentence Summary

Feature selection is DJ Maya picking her 20 best tracks to play as-is; feature extraction is her remixing all 500 tracks into 20 NEW mashups that capture the essence of her entire collection.


What's Next?

Now that you understand selection vs extraction, you're ready for:

  • PCA Deep Dive β€” Understanding eigenvectors and variance
  • Dimensionality Reduction for NLP β€” Word embeddings and LSA
  • Feature Stores β€” Production-grade feature management
  • AutoML Feature Engineering β€” Automated feature discovery

Follow me for the next article in this series!


Let's Connect!

If this clarified the selection vs extraction debate, drop a heart!

Questions? Ask in the comments β€” I read and respond to every one.

Which do you use more: selection or extraction? Share your preferences!


The difference between "income is the most important feature" and "PC1 explains 45% of variance"? Understanding whether you SELECTED original features or EXTRACTED new ones. Both reduce dimensions. Both can improve models. But the output tells a completely different story.


Share this with someone mixing up PCA with feature selection. The DJ analogy will clear it up.

Happy reducing! 🎵
