My SVM classifier drew a perfect decision boundary in testing. In production, it misclassified 40% of samples. The only difference: I forgot to standardize one new feature. Here's why that completely changed where the boundary was drawn.
## The Visual Intuition
Imagine classifying customers as "will churn" or "won't churn" based on two features: age (20-60) and income (20,000-200,000). Without standardization, income varies on a scale thousands of times larger than age, so the decision boundary depends almost entirely on income: plotted with age on the x-axis, it is nearly horizontal, a constant income threshold.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Generate sample data: [age, income]
np.random.seed(42)
X_class0 = np.random.randn(50, 2) * [5, 20000] + [30, 50000]    # Won't churn
X_class1 = np.random.randn(50, 2) * [5, 20000] + [45, 120000]   # Will churn
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 50 + [1] * 50)

# Train SVM WITHOUT standardization
svm_no_scale = SVC(kernel='linear')
svm_no_scale.fit(X, y)

# Train SVM WITH standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svm_with_scale = SVC(kernel='linear')
svm_with_scale.fit(X_scaled, y)

print(f"Without scaling - accuracy: {svm_no_scale.score(X, y):.3f}")
print(f"With scaling - accuracy: {svm_with_scale.score(X_scaled, y):.3f}")
```
**What happens:** The unscaled SVM ignores age almost entirely because income dominates the distance calculation. The scaled SVM treats both features equally.
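One way to check this claim directly (a quick sketch that re-creates the same synthetic data; the "effective contribution" metric below, |w_i| × std_i, is my own diagnostic, not a scikit-learn API):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Same synthetic [age, income] data as above
np.random.seed(42)
X = np.vstack([
    np.random.randn(50, 2) * [5, 20000] + [30, 50000],
    np.random.randn(50, 2) * [5, 20000] + [45, 120000],
])
y = np.array([0] * 50 + [1] * 50)

raw = SVC(kernel='linear').fit(X, y)
std = SVC(kernel='linear').fit(StandardScaler().fit_transform(X), y)

# Effective pull of feature i on the decision function: |w_i| * std_i
eff_raw = np.abs(raw.coef_[0]) * X.std(axis=0)
eff_std = np.abs(std.coef_[0])   # after scaling, every feature has std = 1

print("Unscaled, income/age contribution ratio:", eff_raw[1] / eff_raw[0])
print("Scaled,   income/age contribution ratio:", eff_std[1] / eff_std[0])
```

Without scaling, income's effective contribution dwarfs age's; after scaling the two land in the same ballpark.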
In my exploration of how standardization affects distance-based algorithms, I found that the decision boundary isn't just shifted — it's rotated and reshaped when you standardize features.
## The Math: Why Boundaries Change
SVM finds the hyperplane that maximizes the margin between classes. The margin is measured using distance, and distance depends on feature scales.
**Without standardization:** if age differs by 10 years and income differs by 10,000, the squared differences are 100 versus 100,000,000, so the age difference contributes about 0.0001% of the squared distance — effectively ignored.

**With standardization** (mean = 0, std = 1 for both features): both features contribute on the same scale, and the decision boundary considers both.
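The arithmetic is easy to verify in plain NumPy (a minimal sketch; the two customer vectors are made up for illustration):

```python
import numpy as np

a = np.array([30.0, 50_000.0])   # customer 1: [age, income]
b = np.array([40.0, 60_000.0])   # customer 2: age differs by 10, income by 10,000

diff_sq = (a - b) ** 2           # squared per-feature differences: [100, 1e8]
age_share = diff_sq[0] / diff_sq.sum()
print(f"Age's share of the squared distance: {age_share:.8f}")  # ~0.000001
```

One part in a million: any algorithm that ranks points by this distance is, for all practical purposes, ranking them by income alone.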
## Visualizing the Impact
Here's code to see the decision boundary before and after scaling:
```python
def plot_decision_boundary(X, y, model, title):
    """Plot the decision boundary of a fitted classifier on 2D data."""
    # Use a fixed number of points per axis so the mesh stays small even
    # when one feature (like raw income) spans a huge range.
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))

    # Predict on every mesh point
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot class regions and the data points
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
    plt.title(title)
```
```python
# Plot both boundaries side by side
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plot_decision_boundary(X, y, svm_no_scale, 'Without Standardization')
plt.xlabel('Age')
plt.ylabel('Income')

plt.subplot(1, 2, 2)
plot_decision_boundary(X_scaled, y, svm_with_scale, 'With Standardization')
plt.xlabel('Age (scaled)')
plt.ylabel('Income (scaled)')

plt.tight_layout()
plt.show()
```
**What you'll see:** The unscaled boundary is nearly horizontal: a constant income threshold that ignores age. The scaled boundary runs diagonally, using both features.
## The Three Ways Standardization Changes Boundaries
### 1. Rotation
The decision boundary rotates to align with the actual data structure, not the arbitrary scales:
```python
# Calculate the orientation of the decision boundary's normal vector
def boundary_angle(model):
    """Angle (in degrees) of the weight vector w, i.e. the boundary's normal."""
    w = model.coef_[0]
    return np.arctan2(w[1], w[0]) * 180 / np.pi

angle_no_scale = boundary_angle(svm_no_scale)
angle_with_scale = boundary_angle(svm_with_scale)
print(f"Normal-vector angle without scaling: {angle_no_scale:.1f}°")
print(f"Normal-vector angle with scaling: {angle_with_scale:.1f}°")
```
### 2. Margin Width
The margin (distance from boundary to nearest points) changes because distance is measured differently:
```python
# Calculate margin width
def margin_width(model):
    """SVM margin width: 2 / ||w|| for a linear kernel."""
    w = model.coef_[0]
    return 2 / np.linalg.norm(w)

margin_no_scale = margin_width(svm_no_scale)
margin_with_scale = margin_width(svm_with_scale)

# Note: the two margins are measured in different units (raw vs standardized),
# so compare how each model behaves, not the absolute numbers.
print(f"Margin without scaling: {margin_no_scale:.2f}")
print(f"Margin with scaling: {margin_with_scale:.2f}")
```
### 3. Support Vectors
Different points become support vectors (the critical points that define the boundary):
```python
# Compare support vectors
print(f"Support vectors without scaling: {len(svm_no_scale.support_vectors_)}")
print(f"Support vectors with scaling: {len(svm_with_scale.support_vectors_)}")
# Often different points are selected as support vectors
```
## What Most Tutorials Miss
The biggest mistake I made was thinking standardization just "improves performance". It doesn't merely improve performance — it changes what the model learns.

**Without standardization:** the model learns "income is the only thing that matters" (because income dominates the distance calculation).

**With standardization:** the model learns "age and income matter equally" (because both contribute equally to distance).

Neither is "better" in absolute terms — it depends on whether you want features weighted by their natural scales or weighted equally.
| Scenario | Standardize? | Why |
|---|---|---|
| Features have meaningful scales (e.g., temperature in Celsius) | Maybe not | Natural scales might be important |
| Features have arbitrary scales (e.g., survey responses 1-5 vs 1-100) | Yes | Arbitrary scales shouldn't affect importance |
| One feature is much more important | Maybe not | Let it dominate naturally |
| All features should contribute equally | Yes | Force equal contribution |
## Example: When NOT to Standardize
```python
# Medical data: [blood_pressure, age]
# Blood pressure range: 80-200 (clinically meaningful)
# Age range: 0-100 (clinically meaningful)
X_medical = np.array([
    [120, 30],   # Normal BP, young
    [180, 70],   # High BP, old
    [110, 25],   # Normal BP, young
    [190, 75],   # High BP, old
])
y_medical = np.array([0, 1, 0, 1])  # 0 = healthy, 1 = at risk

# Without standardization: BP naturally more important (correct!)
svm_medical_no_scale = SVC(kernel='linear')
svm_medical_no_scale.fit(X_medical, y_medical)

# With standardization: age and BP weighted equally (maybe wrong!)
scaler_medical = StandardScaler()
X_medical_scaled = scaler_medical.fit_transform(X_medical)
svm_medical_scaled = SVC(kernel='linear')
svm_medical_scaled.fit(X_medical_scaled, y_medical)

# Check feature importance (coefficient magnitude)
print("Without scaling - feature importance:", np.abs(svm_medical_no_scale.coef_[0]))
print("With scaling - feature importance:", np.abs(svm_medical_scaled.coef_[0]))
```
If blood pressure is clinically more important than age, standardization might hurt by forcing equal weights.
## The Production Decision Framework
Here's my decision tree for whether to standardize:
```python
def should_standardize(X, feature_names, domain_knowledge):
    """Heuristic: decide whether to standardize features."""
    # Check 1: Are scales arbitrary or meaningful?
    if domain_knowledge['scales_meaningful']:
        print("Scales are meaningful - consider NOT standardizing")
        return False

    # Check 2: Do features have very different ranges?
    ranges = X.max(axis=0) - X.min(axis=0)
    scale_ratio = ranges.max() / ranges.min()
    if scale_ratio < 10:
        print(f"Scale ratio {scale_ratio:.1f}× is small - standardization optional")
        return False

    # Check 3: Scale-sensitive algorithm (distance- or gradient-based)?
    if domain_knowledge['algorithm'] in ['knn', 'svm', 'neural_network', 'pca']:
        print("Scale-sensitive algorithm - standardize")
        return True

    # Check 4: Tree-based algorithm?
    if domain_knowledge['algorithm'] in ['random_forest', 'xgboost', 'lightgbm']:
        print("Tree-based algorithm - standardization not needed")
        return False

    # Default: standardize
    return True

# Example usage
domain_knowledge = {
    'scales_meaningful': False,
    'algorithm': 'svm',
}
should_std = should_standardize(X, ['age', 'income'], domain_knowledge)
```
## The Debugging Checklist
When your model performs differently in production:
```python
def debug_standardization_issue(X_train, X_test, model):
    """Check for standardization-related bugs."""
    # Check 1: Are train and test scaled the same way?
    train_ranges = X_train.max(axis=0) - X_train.min(axis=0)
    test_ranges = X_test.max(axis=0) - X_test.min(axis=0)
    print("Train feature ranges:", train_ranges)
    print("Test feature ranges:", test_ranges)
    if not np.allclose(train_ranges, test_ranges, rtol=0.5):
        print("⚠️ WARNING: Train and test have different scales")

    # Check 2: Are all features scaled?
    train_means = X_train.mean(axis=0)
    train_stds = X_train.std(axis=0)
    print("\nTrain feature means:", train_means)
    print("Train feature stds:", train_stds)
    if not np.allclose(train_means, 0, atol=0.1) or not np.allclose(train_stds, 1, atol=0.1):
        print("⚠️ WARNING: Features don't appear to be standardized")

    # Check 3: Does one feature dominate the learned weights?
    if hasattr(model, 'coef_'):
        feature_importance = np.abs(model.coef_[0])
        print("\nFeature importance:", feature_importance)
        if feature_importance.max() / feature_importance.min() > 100:
            print("⚠️ WARNING: One feature dominates - check scaling")

# Example usage (assumes you have train/test splits, e.g. from train_test_split)
debug_standardization_issue(X_train, X_test, svm_with_scale)
```
## Key Takeaways for Developers
- Standardization doesn't just improve performance — it changes what the model learns
- Decision boundaries rotate, reshape, and use different support vectors after standardization
- Scale-sensitive algorithms (SVM, kNN, neural networks, PCA) require standardization unless feature scales are meaningful
- Tree-based algorithms don't need standardization — they split on thresholds, not distances
- Always fit scaler on training data only, then transform train, validation, test, and production data
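That last bullet is exactly where my production bug came from, so here is the pattern as a minimal sketch (synthetic data; in a real pipeline you'd persist the fitted scaler alongside the model):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal([40, 80_000], [8, 30_000], size=(200, 2))  # [age, income]
X_prod = rng.normal([40, 80_000], [8, 30_000], size=(50, 2))    # new production batch

scaler = StandardScaler().fit(X_train)   # fit on training data ONLY
X_train_s = scaler.transform(X_train)
X_prod_s = scaler.transform(X_prod)      # reuse the SAME training statistics

# Training features come out exactly standardized; production features land
# close to mean 0 / std 1 but not exactly. That is correct behavior, not a bug.
print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
print(X_prod_s.mean(axis=0).round(3), X_prod_s.std(axis=0).round(3))
```

Calling `fit_transform` again on production data (or skipping the transform for one feature, as I did) silently moves every point relative to the learned boundary.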
The decision boundary that looked perfect in testing but failed in production taught me that preprocessing isn't a minor detail — it fundamentally changes what patterns the model can learn. If you want to see how standardization affects decision boundaries interactively, check out the standardization visualizer — it shows exactly how boundaries change as you scale features.
For more on feature scaling and decision boundaries, see the scikit-learn preprocessing guide and this visual guide to SVM.