Kaggle Playground Series S6E3 — Predict Customer Churn | ROC-AUC 0.91685 | Rank 286 / 3,718
The Problem
Customer churn prediction sounds straightforward — given a telecom customer's usage history and contract details, predict whether they'll leave. But the Kaggle Playground S6E3 dataset had 594,000 rows of heavily categorical data where the signal was buried inside combinations of features, not individual columns. Standard approaches plateau quickly here.
My starting point was a LightGBM single model. It was decent — but decent doesn't crack the top 10%. Getting there required rethinking how the model saw the categorical features entirely.
The Core Insight: Treat Categories Like Text
The breakthrough came from an unconventional direction — NLP. In text classification, n-grams capture phrase-level patterns that individual words miss. The same logic applies to categorical feature combinations.
A customer with Contract: Month-to-month is one signal. A customer with Contract: Month-to-month + InternetService: Fiber optic + PaymentMethod: Electronic check is a completely different risk profile — and that combination is what predicts churn.
So I treated categorical columns like tokens and generated bigrams and trigrams across high-impact features: Contract, InternetService, and PaymentMethod. Each unique combination became a new feature, capturing interaction patterns that a standard feature matrix would miss entirely.
Feature Engineering Pipeline
Step 1 — N-gram Categorical Interactions
For each high-signal categorical column, I generated pairwise (bigram) and three-way (trigram) combinations across the feature space. This produced a set of composite interaction features that encoded relationship patterns directly into the model input.
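This step can be sketched as follows. The helper name `add_ngram_features` and the `|` separator are my own for illustration; the repo's actual implementation may differ:

```python
import pandas as pd
from itertools import combinations

def add_ngram_features(df, cols, max_n=3):
    """Concatenate categorical columns pairwise (bigrams) and three-way
    (trigrams) into new composite string features."""
    out = df.copy()
    for n in range(2, max_n + 1):
        for combo in combinations(cols, n):
            name = "_x_".join(combo)
            out[name] = out[list(combo)].astype(str).agg("|".join, axis=1)
    return out

df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year"],
    "InternetService": ["Fiber optic", "DSL"],
    "PaymentMethod": ["Electronic check", "Mailed check"],
})
fe = add_ngram_features(df, ["Contract", "InternetService", "PaymentMethod"])
# fe now also contains e.g. "Contract_x_InternetService" with values like
# "Month-to-month|Fiber optic"
```

Each composite column is then treated like any other categorical feature downstream, which is what lets the target encoder in the next step learn statistics for whole combinations.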
Step 2 — Nested Target Encoding
Raw target encoding leaks — if you encode a categorical feature using the target mean, the model sees information from the row it's predicting. The fix is nested k-fold encoding: encode each fold using only the other folds' target statistics. I used a nested 5-fold stratified scheme applied to both the original categorical features and the new n-gram features.
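A minimal sketch of the leak-free scheme, using a toy 20-row frame; the function name and fill strategy for unseen categories (fall back to the training-fold mean) are my own choices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def target_encode_oof(df, col, y, n_splits=5, seed=42):
    """Leak-free target encoding: each row's value is the target mean of
    its category computed only on the other folds."""
    enc = np.full(len(df), y.mean(), dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in skf.split(df, y):
        fold_mean = y.iloc[tr_idx].mean()
        means = y.iloc[tr_idx].groupby(df[col].iloc[tr_idx]).mean()
        enc[val_idx] = (df[col].iloc[val_idx]
                        .map(means)
                        .fillna(fold_mean)  # unseen category -> fold prior
                        .values)
    return enc

df = pd.DataFrame({"Contract": ["A"] * 10 + ["B"] * 10})
y = pd.Series([1] * 8 + [0] * 2 + [0] * 8 + [1] * 2)
enc = target_encode_oof(df, "Contract", y)
```

Applying this to the n-gram columns as well as the originals is what makes the interaction features usable by the linear Stage 1 model.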
Step 3 — Service Stack Analysis
Beyond the n-grams, I engineered explicit service combination counts — how many internet services, how many phone add-ons — and their intersections. Customers with more bundled services behave differently at churn time, and these counts captured that pattern numerically.
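A sketch of the count features, assuming column names from the classic telco churn schema (the actual Playground columns may differ):

```python
import pandas as pd

# Hypothetical add-on service columns, modeled on the telco churn schema.
SERVICE_COLS = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                "TechSupport", "StreamingTV", "StreamingMovies"]

def add_service_counts(df, cols=SERVICE_COLS):
    """Count how many optional services each customer subscribes to."""
    out = df.copy()
    out["n_services"] = (out[cols] == "Yes").sum(axis=1)
    return out

df = pd.DataFrame({
    "OnlineSecurity":   ["Yes", "No"],
    "OnlineBackup":     ["Yes", "No"],
    "DeviceProtection": ["No", "No"],
    "TechSupport":      ["Yes", "No internet service"],
    "StreamingTV":      ["No", "Yes"],
    "StreamingMovies":  ["Yes", "Yes"],
})
out = add_service_counts(df)
# out["n_services"] -> 4 for the heavily bundled customer, 2 for the other
```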
Step 4 — Digit Features
For continuous columns like tenure and MonthlyCharges, I extracted distributional digit features — essentially encoding the numerical range and pattern of each value. This gave the model a richer representation of where each customer sat within the tenure and charge distributions.
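The write-up doesn't pin down exactly what "digit features" means here; one plausible reading is splitting each value into its decimal digits so the model sees magnitude and pattern separately. A sketch under that assumption:

```python
import pandas as pd

def add_digit_features(df, col, width=6):
    """One interpretation of 'digit features': scale to an integer
    (two decimal places preserved), zero-pad, and emit one column
    per decimal digit."""
    out = df.copy()
    s = (out[col] * 100).round().astype(int).astype(str).str.zfill(width)
    for i in range(width):
        out[f"{col}_digit{i}"] = s.str[i].astype(int)
    return out

df = pd.DataFrame({"MonthlyCharges": [70.35, 9.50]})
out = add_digit_features(df, "MonthlyCharges")
# 70.35 -> "007035", 9.50 -> "000950", one feature per digit position
```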
The Two-Stage Ensemble
With the feature matrix built, I used a two-stage stacking approach rather than a single model.
Stage 1 — Ridge Regression
A heavily regularised Ridge classifier served as the first-stage learner. Ridge is simple and interpretable — it captures broad linear trends across the feature space and generalises cleanly. Critically, I ran this with 10-fold stratified cross-validation and collected the out-of-fold (OOF) predictions. These OOF predictions became meta-features for Stage 2.
The reason for Ridge first: it acts as a stabilising baseline. Its predictions encode a smooth, low-variance signal that helps XGBoost in Stage 2 avoid overfitting to noisy feature interactions.
Stage 2 — XGBoost on Original + OOF Features
The second-stage XGBoost classifier was trained on the full engineered feature matrix plus the Ridge OOF predictions as an additional input. This gave XGBoost a pre-computed linear summary of the data to work with alongside the raw features — effectively letting it model residuals and non-linear interactions on top of the Ridge baseline.
Cross-validation remained 10-fold stratified with a fixed seed throughout, ensuring consistent and reproducible evaluation across both stages.
Results
| Metric | Value |
| --- | --- |
| Public Leaderboard AUC | 0.91685 |
| Global Rank | 286 / 3,718 |
| Percentile | Top 8% |
The top 20 features by importance were dominated by the n-gram interaction features and nested-encoded categorical combinations — validating the core hypothesis that combination patterns outperform individual categorical signals on this dataset.
What I'd Do Differently
The LightGBM baseline I started with was actually competitive on its own — the jump came almost entirely from the n-gram feature engineering, not from model complexity. In hindsight, I would have invested more time earlier in feature interaction design and less time tuning hyperparameters on simpler models.
A second improvement would be experimenting with higher-order n-grams (four-way combinations) on the service stack features — the signal was clearly present in three-way combinations, and there may have been further lift available.
Code & Reproducibility
The full pipeline is open source. The main winning script is src/train_ridge_xgb_ngram.py — run with:
```bash
python src/train_ridge_xgb_ngram.py --folds 10 --inner-folds 5 --seed 42 --output-prefix ridge_xgb_ngram10
```
Full repository: github.com/faissssss/predict-customer-churn
Key Takeaways
Three things that actually moved the needle on this competition:
- N-gram thinking for categorical data. If your features are categorical and interactions matter, treat them like text tokens. The combination is the signal.
- Nested target encoding, not naive encoding. Leaky encoding hurts generalisation silently — you won't see it in training metrics until the leaderboard disagrees with your CV score.
- Stack for stability, not complexity. Ridge + XGBoost worked not because XGBoost needed help, but because Ridge's OOF predictions gave it a cleaner starting point. Stacking should reduce variance, not add layers for its own sake.
Sources:
github.com/faissssss/predict-customer-churn
kaggle.com/competitions/playground-series-s6e3