Class imbalance is the silent killer of ML models. In customer churn prediction, you typically have 10-15% churners vs 85-90% loyal customers. My project faced exactly this challenge, and here's how I solved it with a counterintuitive approach.
The Problem: Severe Class Imbalance
Looking at my original dataset, the imbalance was stark:
# split the training table by label
zeros = db_train[db_train['is_churn'] == 0]  # non-churners
ones = db_train[db_train['is_churn'] == 1]   # churners
print(zeros.shape)
print(ones.shape)
Output:
(9354, 2) # Non-churners
(646, 2) # Churners
That's a 14.5:1 ratio - for every churner, I had 14.5 loyal customers. This kind of imbalance would make any model biased toward predicting "no churn" simply because it's the majority class.
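To see concretely why this skew misleads accuracy, consider the majority-class baseline: a "model" that always predicts "no churn" scores about 93.5% accuracy on this dataset while catching exactly zero churners. A minimal sketch using the class counts above:

```python
# Majority-class baseline: always predict "no churn" (class 0).
n_zeros, n_ones = 9354, 646  # class counts from the dataset above

# Accuracy of the trivial "predict 0 for everyone" model
baseline_accuracy = n_zeros / (n_zeros + n_ones)

# Recall on churners: this baseline never predicts churn, so it catches none
churner_recall = 0 / n_ones

print(f"Baseline accuracy: {baseline_accuracy:.1%}")  # ~93.5%
print(f"Churner recall:    {churner_recall:.0%}")     # 0%
```

In other words, a naive classifier can look impressive on paper while being useless for the one class we actually care about.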
The Solution: Strategic Undersampling
Instead of oversampling the minority class (which can introduce synthetic data artifacts), I chose to undersample the majority class:
# undersampling 0's to match the number of 1's
# (resample comes from sklearn.utils; pandas is imported as pd)
zeros_undersampled = resample(zeros, replace=False, n_samples=len(ones), random_state=42)
db_train = pd.concat([zeros_undersampled, ones])
# shuffling the result so the classes are interleaved
db_train = db_train.sample(frac=1, random_state=42).reset_index(drop=True)
print(len(ones))
print(len(zeros_undersampled))
print(db_train.shape)
Output:
646
646
(1292, 2)
Perfect balance: 646 churners vs 646 non-churners.
Why Undersampling Worked Here
- Preserved data quality: no synthetic artifacts to mislead the model; every data point represents a real customer.
- True performance metrics: with balanced classes, accuracy reflects genuine model capability rather than bias toward the majority class.
- Focused learning: the model sees representative examples of both classes, which leads to better generalization.
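One refinement worth noting: if you undersample before the train/test split, the held-out set is also artificially balanced, so reported accuracy describes a 50/50 world rather than the original 14.5:1 one. A minimal sketch (synthetic data, not the project's actual pipeline) of splitting first and resampling only the training fold:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the churn table: 9354 zeros, 646 ones
df = pd.DataFrame({
    "feature": np.random.default_rng(0).normal(size=10000),
    "is_churn": np.r_[np.zeros(9354, dtype=int), np.ones(646, dtype=int)],
})

# Split first, preserving the real-world class ratio in the test fold
train, test = train_test_split(df, test_size=0.2, stratify=df["is_churn"],
                               random_state=42)

# Undersample the majority class within the training fold only
zeros = train[train["is_churn"] == 0]
ones = train[train["is_churn"] == 1]
zeros_down = resample(zeros, replace=False, n_samples=len(ones),
                      random_state=42)
train_balanced = pd.concat([zeros_down, ones]).sample(frac=1, random_state=42)

print(train_balanced["is_churn"].value_counts())  # balanced training fold
print(test["is_churn"].value_counts())            # still ~14.5:1
```

The model still trains on balanced data, but the evaluation reflects the class distribution it will face in production.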
The Results: Stellar Performance
After building the data pipeline with feature engineering (duration calculation, one-hot encoding for gender), I trained several classifiers. The best performer was AdaBoost with a random forest base estimator:
# AdaBoost with a random forest base estimator (rf defined earlier)
adaboost = AdaBoostClassifier(
    rf, n_estimators=50, learning_rate=0.10, random_state=45
)
adaboost.fit(x_train, y_train)
y_pred = adaboost.predict(x_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy for adaboost: " + str(round(score * 100, 2)) + "%")
Final Results:
AdaBoost: 89.08% accuracy ⭐
Random Forest: 87.39% accuracy
Decision Tree: 86.97% accuracy
K-Nearest Neighbors: 86.55% accuracy
Voting Classifier: 82.77% accuracy
SVM: 74.79% accuracy
Logistic Regression: 73.53% accuracy
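The shape of this leaderboard (though not the exact numbers, which depend on the project's data) can be reproduced with a simple loop over scikit-learn classifiers. A hedged sketch on synthetic data, with hyperparameters mirroring the AdaBoost snippet above (the voting classifier is omitted for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced dataset standing in for the 1292-row undersampled table
X, y = make_classification(n_samples=1292, n_features=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "AdaBoost (RF base)": AdaBoostClassifier(
        RandomForestClassifier(n_estimators=10, random_state=45),
        n_estimators=50, learning_rate=0.10, random_state=45),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: {accuracies[name] * 100:.2f}%")
```

Comparing several models on the same balanced split is what makes the final ranking meaningful.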
The Bottom Line
Class imbalance doesn't have to be a death sentence for your ML models. Sometimes the best solution is the simplest: carefully balance your data and let the algorithms do what they do best. In my case, this approach led to an 89% accuracy rate that would have been impossible with the original imbalanced dataset.
What's your go-to strategy for handling class imbalance? SMOTE? Undersampling? Or do you prefer other techniques?