Apoorv Tripathi
The Class Imbalance Problem: How I Achieved 89% Accuracy on Customer Churn Prediction

Class imbalance is the silent killer of ML models. In customer churn prediction, you typically have 10-15% churners vs 85-90% loyal customers. My project faced exactly this challenge, and here's how I solved it with a counterintuitive approach.

The Problem: Severe Class Imbalance

Looking at my original dataset, the imbalance was stark:

# split the training rows by class label
zeros = db_train[db_train['is_churn'] == 0]
ones = db_train[db_train['is_churn'] == 1]
print(zeros.shape)
print(ones.shape)

Output:

(9354, 2) # Non-churners
(646, 2) # Churners

That's a 14.5:1 ratio - for every churner, I had 14.5 loyal customers. This kind of imbalance would make any model biased toward predicting "no churn" simply because it's the majority class.
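As a quick sanity check, the imbalance ratio can be computed directly from `value_counts()`. A minimal sketch; the synthetic counts below just mirror the shapes printed above:

```python
import pandas as pd

# Synthetic stand-in matching the class counts printed above (9354 vs 646)
db_train = pd.DataFrame({"is_churn": [0] * 9354 + [1] * 646})

counts = db_train["is_churn"].value_counts()
ratio = counts[0] / counts[1]
print(counts.to_dict())                    # {0: 9354, 1: 646}
print(f"imbalance ratio: {ratio:.1f}:1")   # imbalance ratio: 14.5:1
```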

The Solution: Strategic Undersampling

Instead of oversampling the minority class (which can introduce synthetic data artifacts), I chose to undersample the majority class:

# undersampling 0's to match the number of 1's
from sklearn.utils import resample

zeros_undersampled = resample(zeros, replace=False, n_samples=len(ones), random_state=42)
db_train = pd.concat([zeros_undersampled, ones])

# shuffling the result
db_train = db_train.sample(frac=1, random_state=42).reset_index(drop=True)

print(len(ones))
print(len(zeros_undersampled))
print(db_train.shape)

Output:

646
646
(1292, 2)

Perfect balance: 646 churners vs 646 non-churners.
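One detail worth making explicit: after balancing, the train/test split should still be stratified so both classes stay at roughly 50/50 in the hold-out set. A minimal sketch (the `x`/`y` names and the toy `duration` feature are assumptions, not the project's actual pipeline):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced frame standing in for the 1292-row db_train above
db_train = pd.DataFrame({
    "duration": range(1292),
    "is_churn": [0] * 646 + [1] * 646,
})

x = db_train[["duration"]]
y = db_train["is_churn"]

# stratify=y keeps the 50/50 class mix in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
print(y_test.value_counts().to_dict())
```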

Why Undersampling Worked Here

  1. Preserved data quality: no synthetic data artifacts that could mislead the model. Every data point represents a real customer.
  2. Honest performance metrics: with balanced classes, accuracy scores reflect real model capability rather than bias toward the majority class.
  3. Focused learning: the model learns from representative examples of both classes, leading to better generalization.
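For comparison, a common alternative that avoids throwing away majority-class rows is cost-sensitive learning: many scikit-learn estimators accept `class_weight='balanced'`, which reweights errors by inverse class frequency. A minimal sketch on synthetic data (not this project's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: ~6.5% minority class, like the churn ratio above
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (rng.random(2000) < 0.065).astype(int)

# class_weight='balanced' upweights minority-class errors instead of resampling
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
clf.fit(X, y)
print(clf.predict(X[:5]).shape)  # (5,)
```

Whether this beats undersampling depends on the dataset; with only 646 positives, keeping every real example on both sides, as the undersampling approach does, is a defensible call.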

The Results: Stellar Performance

After building the data pipeline with feature engineering (duration calculation, one-hot encoding for gender), I trained several classifiers. The best performer was AdaBoost with a Random Forest base estimator:

# AdaBoost with a Random Forest base estimator (rf defined earlier)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

adaboost = AdaBoostClassifier(
    rf, n_estimators=50, learning_rate=0.10, random_state=45
)
adaboost.fit(x_train, y_train)
y_pred = adaboost.predict(x_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy for AdaBoost: " + str(round(score * 100, 2)) + "%")

Final Results:

AdaBoost: 89.08% accuracy ⭐
Random Forest: 87.39% accuracy
Decision Tree: 86.97% accuracy
K-Nearest Neighbors: 86.55% accuracy
Voting Classifier: 82.77% accuracy
SVM: 74.79% accuracy
Logistic Regression: 73.53% accuracy
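One caveat on the leaderboard above: on a rebalanced test set accuracy is meaningful, but it's still worth reporting per-class precision and recall, since churn recall is usually the business-critical number. A sketch with illustrative labels and predictions (not the project's actual outputs):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative ground truth and predictions only
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]

print(confusion_matrix(y_test, y_pred))
# [[3 1]
#  [1 3]]
print(classification_report(y_test, y_pred, target_names=["loyal", "churn"]))
```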

The Bottom Line

Class imbalance doesn't have to be a death sentence for your ML models. Sometimes the best solution is the simplest: carefully balance your data and let the algorithms do what they do best. In my case, this approach led to 89% accuracy on the balanced dataset, a result that would have been out of reach with the original imbalanced data.

What's your go-to strategy for handling class imbalance? SMOTE? Undersampling? Or do you prefer other techniques?

Project GitHub Link
