Tebogo Tseka

Predicting Telecom Customer Churn with scikit-learn, Keras, and Amazon SageMaker

Every month, a telecom operator quietly loses thousands of customers to a competitor. They call it churn — and in an industry where acquiring a new customer costs 5–10x more than retaining an existing one, predicting who is about to leave is one of the most valuable problems machine learning can solve.

In this tutorial, I'll walk you through a complete churn prediction pipeline I built for a telecom use case. We'll generate a realistic synthetic dataset, train three models (Decision Tree, Random Forest, and a Keras neural network), compare their performance, and deploy the best one to an Amazon SageMaker real-time endpoint.

By the end, you'll have a production-ready pipeline you can adapt for any telecoms operator.

Full source code: github.com/tsekatm/ml-churn-predictor


Why Telecom Churn Is a Hard ML Problem

Telecom churn has a few properties that make it interesting:

  • Class imbalance: Typically 20–40% of customers churn. The model must not simply predict "no churn" for everyone and still score 60–80% accuracy.
  • Behavioural signals are subtle: A customer moving from a two-year contract to month-to-month is a strong signal — but it manifests quietly in billing data.
  • High-value interventions: If you identify a high-risk customer 30 days early, a targeted retention offer (discounted upgrade, free month) can prevent the loss of 24+ months of revenue.

This makes recall — catching as many true churners as possible — more important than raw accuracy. We'll reflect that in our model design.
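To see why accuracy misleads here, consider a trivial model that predicts "no churn" for everyone on a dataset with a 37% churn rate, matching the one used below:

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 customers, 37 of whom actually churn (label 1)
y_true = [1] * 37 + [0] * 63
y_pred = [0] * 100  # trivial model: predict "no churn" for everyone

print(accuracy_score(y_true, y_pred))  # 0.63, looks almost respectable
print(recall_score(y_true, y_pred))    # 0.0, catches zero churners
```

A 63%-accurate model that never flags a single at-risk customer is useless for a retention campaign, which is exactly why recall is the metric to watch.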


The Dataset

No real customer data? No problem. I generated a synthetic dataset of 10,000 telecom customers with realistic churn patterns calibrated to industry benchmarks.

```shell
python data/generate_data.py
# → data/raw/churn.csv (10,000 rows, 37% churn rate)
```
| Feature | Type | Churn Signal Strength |
| --- | --- | --- |
| tenure_months | Numeric | Strong — long-tenured customers rarely leave |
| contract_type | Categorical | Month-to-month: ~42% churn vs 3% for two-year |
| monthly_charges | Numeric | Higher bills correlate with higher churn |
| internet_service | Categorical | Fibre optic: ~41% churn |
| payment_method | Categorical | Electronic check: highest churn payment method |
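The full generator lives in the repo; as a rough sketch of the idea, you can drive churn labels from conditional probabilities per segment. The contract-type mix below is my assumption, chosen so the overall rate lands near 37%; the repo version calibrates many more features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Segment mix (assumed) and per-segment churn rates (from the table above)
contract = rng.choice(
    ["month-to-month", "one-year", "two-year"], size=n, p=[0.85, 0.10, 0.05]
)
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n).round(2)

base = np.where(contract == "month-to-month", 0.42,
                np.where(contract == "one-year", 0.15, 0.03))
# Very new customers are slightly riskier; clip to keep valid probabilities
p_churn = np.clip(base + 0.10 * (tenure <= 3), 0.01, 0.95)
churn = (rng.random(n) < p_churn).astype(int)

df = pd.DataFrame({
    "contract_type": contract,
    "tenure_months": tenure,
    "monthly_charges": monthly,
    "churn": churn,
})
```

Sampling labels from segment-conditional probabilities is what makes the synthetic data behave like real churn data rather than random noise.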

Pipeline Architecture

```
generate_data.py → preprocess() → Decision Tree
                                → Random Forest
                                → Keras Neural Network
                                → evaluate() → save_model()
                                → deploy.py → SageMaker Endpoint
```

Step 1: Data Preprocessing

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

CATEGORICAL_COLS = [
    "contract_type", "internet_service", "phone_service",
    "multiple_lines", "online_security", "tech_support",
    "payment_method", "paperless_billing",
]
NUMERIC_COLS = ["tenure_months", "monthly_charges", "total_charges", "senior_citizen"]

def preprocess(df):
    df = df.drop(columns=["customer_id"], errors="ignore").copy()
    df = df.dropna(subset=["churn"])

    encoders = {}
    for col in CATEGORICAL_COLS:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        encoders[col] = le

    X = df[NUMERIC_COLS + CATEGORICAL_COLS].values
    y = df["churn"].values.astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # fit only on train — never leak test stats

    return X_train, X_test, y_train, y_test, scaler, encoders
```

Two decisions worth noting:

  1. Stratified split — preserves churn ratio in train and test sets.
  2. Fit scaler on train only — fitting on the full dataset leaks test distribution into training.
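The stratification guarantee is easy to verify on a toy example with the same 37% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 370 + [0] * 630)   # 37% positives, mirroring the dataset
X = np.arange(len(y)).reshape(-1, 1)  # dummy features

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())  # both 0.37, the ratio survives the split
```

Without `stratify=y`, an unlucky split can leave the test set with noticeably fewer churners, which skews every metric you compute on it.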

Step 2: Training Three Models

Decision Tree — The Baseline

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=10,
    class_weight="balanced",  # compensates for the ~37% minority class
    random_state=42,
)
model.fit(X_train, y_train)
```

Random Forest — The Workhorse

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, y_train)
```

Keras Neural Network — The Contender

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)
```

Batch normalisation stabilises training on mixed-scale features, Dropout prevents overfitting, and EarlyStopping on val_auc restores the best weights.
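The EarlyStopping setup isn't shown above; here is a self-contained sketch of the training call on tiny synthetic data (the batch size, patience, and epoch count are my assumptions, not the repo's):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Tiny stand-in for the preprocessed features, just to exercise the call
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 12)).astype("float32")
y_train = (rng.random(500) < 0.37).astype("float32")

model = keras.Sequential([
    layers.Input(shape=(12,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_auc",          # track validation AUC, not validation loss
    mode="max",                 # AUC is better when higher
    patience=5,
    restore_best_weights=True,  # roll back to the best epoch on stop
)
history = model.fit(X_train, y_train, validation_split=0.2, epochs=20,
                    batch_size=64, callbacks=[early_stop], verbose=0)
```

Note `mode="max"`: because the metric was registered as `auc`, Keras exposes it as `val_auc`, and without an explicit mode it may guess wrongly which direction is "better".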


Step 3: Model Evaluation & Comparison

```
MODEL COMPARISON SUMMARY
Model                Accuracy   Precision  Recall     ROC-AUC
------------------------------------------------------------
decision_tree        0.7430     0.6189     0.7900     0.8218
random_forest        0.7660     0.6592     0.7575     0.8453
keras_nn             0.7565     0.6298     0.8252     0.8454
------------------------------------------------------------
Best model by ROC-AUC: keras_nn (0.8454)
```
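Under the hood, `evaluate()` boils down to four standard scikit-learn metrics. A minimal sketch (the function shape is my paraphrase, not the repo's exact code):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(name, y_test, y_prob, threshold=0.5):
    """Compute the four comparison metrics from predicted churn probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),  # scored on probabilities
    }
```

For the scikit-learn models, `y_prob` comes from `model.predict_proba(X_test)[:, 1]`; for the Keras network, from `model.predict(X_test).ravel()`. Note that ROC-AUC is threshold-free, while the other three depend on the 0.5 cut-off, which you can lower to trade precision for recall.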

Key observations:

  • Random Forest and Keras NN are neck-and-neck on ROC-AUC — both excellent.
  • Keras NN wins on recall (0.8252 vs 0.7575) — catches more churners, which matters most for retention campaigns.
  • Decision Tree at ROC-AUC 0.822 is weaker but fully interpretable — valuable for business presentations.

For deployment, I chose Random Forest over Keras NN despite the marginal ROC-AUC difference (0.8453 vs 0.8454). Random Forest offers simpler packaging (joblib serialisation vs TensorFlow SavedModel), faster cold-start inference on SageMaker, and better interpretability for business stakeholders — a practical trade-off in production.
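The "simpler packaging" point is concrete: persisting a fitted scikit-learn model with joblib is a one-liner each way, with no TensorFlow runtime needed at inference time. A small self-contained sketch (the toy data and file path are illustrative):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in for the trained Random Forest from Step 2
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]] * 5)
y = np.array([0, 1, 1, 0] * 5)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

joblib.dump(model, "random_forest.pkl")      # single-file serialisation
restored = joblib.load("random_forest.pkl")  # loads without any TF dependency
assert (restored.predict(X) == model.predict(X)).all()
```

One caveat worth knowing: joblib pickles are tied to the scikit-learn version, so pin the same version in the SageMaker container that you trained with.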


Step 4: Deploying to SageMaker

```python
# Package, upload, and deploy
archive = package_model("models/random_forest.pkl")
model_s3_uri = upload_to_s3(archive, S3_BUCKET, "models/random_forest/model.tar.gz")
create_model(model_name, role_arn, image_uri, model_s3_uri)
create_endpoint_config(config_name, model_name, instance_type="ml.m5.large")
deploy_endpoint(ENDPOINT_NAME, config_name)

# Test inference
result = invoke_endpoint(ENDPOINT_NAME, "12,0,65.5,786.0,1,1,1,0,0,0,1,0")
print(f"Churn probability: {result}")
```
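The comma-separated payload is the encoded feature vector. A hypothetical helper to build it from a feature dict (the helper name is mine, and I'm assuming the `NUMERIC_COLS + CATEGORICAL_COLS` ordering from `preprocess()`; the repo may order columns differently):

```python
NUMERIC_COLS = ["tenure_months", "monthly_charges", "total_charges", "senior_citizen"]
CATEGORICAL_COLS = [
    "contract_type", "internet_service", "phone_service", "multiple_lines",
    "online_security", "tech_support", "payment_method", "paperless_billing",
]

def build_csv_payload(features):
    """Serialise a feature dict into the column order the endpoint expects.

    Categorical values must already be the integer codes produced by the
    LabelEncoders fitted during preprocessing.
    """
    return ",".join(str(features[col]) for col in NUMERIC_COLS + CATEGORICAL_COLS)
```

Keeping one helper as the single source of truth for column order avoids the classic deployment bug where training and serving silently disagree on feature positions.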

Dry run (package + upload only):

```shell
python src/deploy.py --model-path models/random_forest.pkl --dry-run
```

Key Takeaways

1. Recall beats accuracy for churn — align your metric to the business objective, not the leaderboard.

2. Class weighting is non-negotiable — without it, your model silently optimises for the majority class.

3. The Decision Tree earns its place — explainability is not optional in regulated industries.

4. SageMaker packaging is straightforward — joblib model + model.tar.gz + role ARN is all you need.
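Takeaway 4 in code: the core of `package_model()` is just building the `model.tar.gz` that SageMaker's SKLearn container expects. A minimal sketch (the repo version may also bundle an inference script; the paths are illustrative):

```python
import tarfile

def package_model(model_path="random_forest.pkl", archive_path="model.tar.gz"):
    """Bundle a joblib-serialised model into the tarball SageMaker expects."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the archive layout flat, regardless of the local path
        tar.add(model_path, arcname="model.pkl")
    return archive_path
```

SageMaker extracts this archive into `/opt/ml/model` inside the container, so the filename you choose here is the one your model-loading code must open at inference time.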


What's Next

  • Add SHAP values for individual prediction explanations
  • Build a SageMaker Pipeline for automated retraining on new monthly data
  • Wire up a retention campaign API triggered at probability > 0.7
  • Add Model Monitor to detect data drift
  • Integrate with CDR (Call Detail Records) for real-time churn scoring at the network edge
  • Connect predictions to CRM retention workflows for automated intervention triggers


Tebogo Tseka — Cloud Solutions Architect & ML Engineer
GitHub: @tsekatm | Blog: tebogosacloud.blog
