Tebogo Tseka

Predicting Telecom Customer Churn with scikit-learn, Keras, and Amazon SageMaker

Every month, a telecom operator quietly loses thousands of customers to a competitor. They call it churn — and in an industry where acquiring a new customer costs 5–10x more than retaining an existing one, predicting who is about to leave is one of the most valuable problems machine learning can solve.

In this tutorial, I'll walk you through a complete churn prediction pipeline I built for a telecom use case. We'll generate a realistic synthetic dataset, train three models (Decision Tree, Random Forest, and a Keras neural network), compare their performance, and deploy the best one to an Amazon SageMaker real-time endpoint.

By the end, you'll have a production-ready pipeline you can adapt for any telecoms operator.

Full source code: github.com/tsekatm/ml-churn-predictor


Why Telecom Churn Is a Hard ML Problem

Telecom churn has a few properties that make it interesting:

  • Class imbalance: Typically 20–40% of customers churn. The model must not simply predict "no churn" for everyone and still score 60–80% accuracy.
  • Behavioural signals are subtle: A customer moving from a two-year contract to month-to-month is a strong signal — but it manifests quietly in billing data.
  • High-value interventions: If you identify a high-risk customer 30 days early, a targeted retention offer (discounted upgrade, free month) can prevent the loss of 24+ months of revenue.

This makes recall — catching as many true churners as possible — more important than raw accuracy. We'll reflect that in our model design.
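To see why accuracy misleads here, consider a trivial model that predicts "no churn" for everyone on a dataset with a 37% churn rate, matching the one used below:

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 customers, 37 of whom actually churn (label 1)
y_true = [1] * 37 + [0] * 63
y_pred = [0] * 100  # trivial model: predict "no churn" for everyone

print(accuracy_score(y_true, y_pred))  # 0.63, looks almost respectable
print(recall_score(y_true, y_pred))    # 0.0, catches zero churners
```

A 63%-accurate model that never flags a single at-risk customer is useless for a retention campaign, which is exactly why recall is the metric to watch.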


The Dataset

No real customer data? No problem. I generated a synthetic dataset of 10,000 telecom customers with realistic churn patterns calibrated to industry benchmarks.

```shell
python data/generate_data.py
# → data/raw/churn.csv (10,000 rows, 37% churn rate)
```
| Feature | Type | Churn Signal Strength |
| --- | --- | --- |
| tenure_months | Numeric | Strong — long-tenured customers rarely leave |
| contract_type | Categorical | Month-to-month: ~42% churn vs 3% for two-year |
| monthly_charges | Numeric | Higher bills correlate with higher churn |
| internet_service | Categorical | Fibre optic: ~41% churn |
| payment_method | Categorical | Electronic check: highest churn payment method |
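The full generator lives in the repo; as a rough sketch of the idea, you can drive churn labels from conditional probabilities per segment. The contract-type mix below is my assumption, chosen so the overall rate lands near 37%; the repo version calibrates many more features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Segment mix (assumed) and per-segment churn rates (from the table above)
contract = rng.choice(
    ["month-to-month", "one-year", "two-year"], size=n, p=[0.85, 0.10, 0.05]
)
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n).round(2)

base = np.where(contract == "month-to-month", 0.42,
                np.where(contract == "one-year", 0.15, 0.03))
# Very new customers are slightly riskier; clip to keep valid probabilities
p_churn = np.clip(base + 0.10 * (tenure <= 3), 0.01, 0.95)
churn = (rng.random(n) < p_churn).astype(int)

df = pd.DataFrame({
    "contract_type": contract,
    "tenure_months": tenure,
    "monthly_charges": monthly,
    "churn": churn,
})
```

Sampling labels from segment-conditional probabilities is what makes the synthetic data behave like real churn data rather than random noise.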

Pipeline Architecture

```
generate_data.py → preprocess() → Decision Tree
                                → Random Forest
                                → Keras Neural Network
                                → evaluate() → save_model()
                                → deploy.py → SageMaker Endpoint
```

Step 1: Data Preprocessing

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

CATEGORICAL_COLS = [
    "contract_type", "internet_service", "phone_service",
    "multiple_lines", "online_security", "tech_support",
    "payment_method", "paperless_billing",
]
NUMERIC_COLS = ["tenure_months", "monthly_charges", "total_charges", "senior_citizen"]

def preprocess(df):
    df = df.drop(columns=["customer_id"], errors="ignore").copy()
    df = df.dropna(subset=["churn"])

    encoders = {}
    for col in CATEGORICAL_COLS:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        encoders[col] = le

    X = df[NUMERIC_COLS + CATEGORICAL_COLS].values
    y = df["churn"].values.astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # fit only on train — never leak test stats

    return X_train, X_test, y_train, y_test, scaler, encoders
```

Two decisions worth noting:

  1. Stratified split — preserves churn ratio in train and test sets.
  2. Fit scaler on train only — fitting on the full dataset leaks test distribution into training.
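The stratification guarantee is easy to verify on a toy example with the same 37% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 370 + [0] * 630)   # 37% positives, mirroring the dataset
X = np.arange(len(y)).reshape(-1, 1)  # dummy features

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())  # both 0.37, the ratio survives the split
```

Without `stratify=y`, an unlucky split can leave the test set with noticeably fewer churners, which skews every metric you compute on it.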

Step 2: Training Three Models

Decision Tree — The Baseline

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=10,
    class_weight="balanced",  # compensates for the ~37% minority class
    random_state=42,
)
model.fit(X_train, y_train)
```

Random Forest — The Workhorse

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, y_train)
```

Keras Neural Network — The Contender

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)
```

Batch normalisation stabilises training on mixed-scale features, Dropout prevents overfitting, and EarlyStopping on val_auc restores the best weights.
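The EarlyStopping setup isn't shown above; here is a self-contained sketch of the training call on tiny synthetic data (the batch size, patience, and epoch count are my assumptions, not the repo's):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Tiny stand-in for the preprocessed features, just to exercise the call
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 12)).astype("float32")
y_train = (rng.random(500) < 0.37).astype("float32")

model = keras.Sequential([
    layers.Input(shape=(12,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_auc",          # track validation AUC, not validation loss
    mode="max",                 # AUC is better when higher
    patience=5,
    restore_best_weights=True,  # roll back to the best epoch on stop
)
history = model.fit(X_train, y_train, validation_split=0.2, epochs=20,
                    batch_size=64, callbacks=[early_stop], verbose=0)
```

Note `mode="max"`: because the metric was registered as `auc`, Keras exposes it as `val_auc`, and without an explicit mode it may guess wrongly which direction is "better".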


Step 3: Model Evaluation & Comparison

```
MODEL COMPARISON SUMMARY
Model                Accuracy   Precision  Recall     ROC-AUC
------------------------------------------------------------
decision_tree        0.7430     0.6189     0.7900     0.8218
random_forest        0.7660     0.6592     0.7575     0.8453
keras_nn             0.7565     0.6298     0.8252     0.8454
------------------------------------------------------------
Best model by ROC-AUC: keras_nn (0.8454)
```
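Under the hood, `evaluate()` boils down to four standard scikit-learn metrics. A minimal sketch (the function shape is my paraphrase, not the repo's exact code):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(name, y_test, y_prob, threshold=0.5):
    """Compute the four comparison metrics from predicted churn probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),  # scored on probabilities
    }
```

For the scikit-learn models, `y_prob` comes from `model.predict_proba(X_test)[:, 1]`; for the Keras network, from `model.predict(X_test).ravel()`. Note that ROC-AUC is threshold-free, while the other three depend on the 0.5 cut-off, which you can lower to trade precision for recall.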

Key observations:

  • Random Forest and Keras NN are neck-and-neck on ROC-AUC — both excellent.
  • Keras NN wins on recall (0.8252 vs 0.7575) — catches more churners, which matters most for retention campaigns.
  • Decision Tree at ROC-AUC 0.822 is weaker but fully interpretable — valuable for business presentations.

For deployment, I chose Random Forest over Keras NN despite the marginal ROC-AUC difference (0.8453 vs 0.8454). Random Forest offers simpler packaging (joblib serialisation vs TensorFlow SavedModel), faster cold-start inference on SageMaker, and better interpretability for business stakeholders — a practical trade-off in production.
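The "simpler packaging" point is concrete: persisting a fitted scikit-learn model with joblib is a one-liner each way, with no TensorFlow runtime needed at inference time. A small self-contained sketch (the toy data and file path are illustrative):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in for the trained Random Forest from Step 2
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]] * 5)
y = np.array([0, 1, 1, 0] * 5)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

joblib.dump(model, "random_forest.pkl")      # single-file serialisation
restored = joblib.load("random_forest.pkl")  # loads without any TF dependency
assert (restored.predict(X) == model.predict(X)).all()
```

One caveat worth knowing: joblib pickles are tied to the scikit-learn version, so pin the same version in the SageMaker container that you trained with.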


Step 4: Deploying to SageMaker

```python
# Package, upload, and deploy
archive = package_model("models/random_forest.pkl")
model_s3_uri = upload_to_s3(archive, S3_BUCKET, "models/random_forest/model.tar.gz")
create_model(model_name, role_arn, image_uri, model_s3_uri)
create_endpoint_config(config_name, model_name, instance_type="ml.m5.large")
deploy_endpoint(ENDPOINT_NAME, config_name)

# Test inference
result = invoke_endpoint(ENDPOINT_NAME, "12,0,65.5,786.0,1,1,1,0,0,0,1,0")
print(f"Churn probability: {result}")
```
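The comma-separated payload is the encoded feature vector. A hypothetical helper to build it from a feature dict (the helper name is mine, and I'm assuming the `NUMERIC_COLS + CATEGORICAL_COLS` ordering from `preprocess()`; the repo may order columns differently):

```python
NUMERIC_COLS = ["tenure_months", "monthly_charges", "total_charges", "senior_citizen"]
CATEGORICAL_COLS = [
    "contract_type", "internet_service", "phone_service", "multiple_lines",
    "online_security", "tech_support", "payment_method", "paperless_billing",
]

def build_csv_payload(features):
    """Serialise a feature dict into the column order the endpoint expects.

    Categorical values must already be the integer codes produced by the
    LabelEncoders fitted during preprocessing.
    """
    return ",".join(str(features[col]) for col in NUMERIC_COLS + CATEGORICAL_COLS)
```

Keeping one helper as the single source of truth for column order avoids the classic deployment bug where training and serving silently disagree on feature positions.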

Dry run (package + upload only):

```shell
python src/deploy.py --model-path models/random_forest.pkl --dry-run
```

Key Takeaways

1. Recall beats accuracy for churn — align your metric to the business objective, not the leaderboard.

2. Class weighting is non-negotiable — without it, your model silently optimises for the majority class.

3. The Decision Tree earns its place — explainability is not optional in regulated industries.

4. SageMaker packaging is straightforward — joblib model + model.tar.gz + role ARN is all you need.
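Takeaway 4 in code: the core of `package_model()` is just building the `model.tar.gz` that SageMaker's SKLearn container expects. A minimal sketch (the repo version may also bundle an inference script; the paths are illustrative):

```python
import tarfile

def package_model(model_path="random_forest.pkl", archive_path="model.tar.gz"):
    """Bundle a joblib-serialised model into the tarball SageMaker expects."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the archive layout flat, regardless of the local path
        tar.add(model_path, arcname="model.pkl")
    return archive_path
```

SageMaker extracts this archive into `/opt/ml/model` inside the container, so the filename you choose here is the one your model-loading code must open at inference time.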


What's Next

  • Add SHAP values for individual prediction explanations
  • Build a SageMaker Pipeline for automated retraining on new monthly data
  • Wire up a retention campaign API triggered at probability > 0.7
  • Add Model Monitor to detect data drift
  • Integrate with CDR (Call Detail Records) for real-time churn scoring at the network edge
  • Connect predictions to CRM retention workflows for automated intervention triggers


Tebogo Tseka — Cloud Solutions Architect & ML Engineer
GitHub: @tsekatm | Blog: tebogosacloud.blog
