You open the Playground Series S6E3 competition, see 250k+ rows of customer data, and think: “Where do I even start?”
I’ve been there. This post is exactly the first notebook I wish I had when I jumped in: a dead-simple, copy-paste-ready pipeline that takes you from raw CSV to a solid submission. No theory overload, just the steps that actually work (and why they matter). Let’s go!
1. Grab the Tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore")
These are my go-to imports for every tabular comp. LightGBM will be your hero later.
2. Load & Quick Look
df = pd.read_csv("/kaggle/input/competitions/playground-series-s6e3/train.csv")
X = df.drop(columns=["Churn", "id"])
y = df["Churn"]
Run df.shape, df.head(), df.info(). Clean data, zero missing values — we’re lucky today!
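Don’t just take the “zero missing values” claim on faith; check it. Here’s the quick sketch on a toy frame (swap in your real df):

```python
import pandas as pd

# Toy stand-in for the competition frame; swap in your real df
df = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": ["Month-to-month", "Two year", "One year"],
})

print(df.shape)         # (rows, columns)
print(df.isna().sum())  # per-column missing counts; all zeros means clean
print(df.dtypes)        # spot object columns hiding numbers
```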
3. Tiny Cleanup (Just in Case)
X["TotalCharges"] = pd.to_numeric(X["TotalCharges"], errors="coerce")
X["TotalCharges"] = X["TotalCharges"].fillna(0)
Always make sure numbers are actually numbers. In the original Telco data, TotalCharges hides blank strings; errors="coerce" turns anything unparseable into NaN, and the fillna matters because RandomForest can’t handle NaN.
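Here’s what errors="coerce" actually does, on a toy Series (the blank-string values are the classic TotalCharges gotcha):

```python
import pandas as pd

s = pd.Series(["29.85", " ", "108.15"])   # blank strings sneak into TotalCharges
nums = pd.to_numeric(s, errors="coerce")  # unparseable entries become NaN
print(nums.isna().sum())                  # -> 1
nums = nums.fillna(0)                     # fill so RandomForest doesn't choke on NaN
```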
4. Know Your Columns
- Numbers: tenure, MonthlyCharges, TotalCharges, SeniorCitizen
- Categories: gender, Contract, PaymentMethod, streaming stuff, etc.
Models only understand numbers, so categories need love.
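You can let pandas sort the columns into those two buckets for you. A sketch on a toy frame (column names follow the list above):

```python
import pandas as pd

X = pd.DataFrame({
    "tenure": [1, 24],
    "MonthlyCharges": [29.85, 56.95],
    "gender": ["Female", "Male"],
    "Contract": ["Month-to-month", "Two year"],
})

# dtype-based split: numeric columns go straight to the model,
# object columns are the ones that need encoding
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(include="object").columns.tolist()
print(num_cols)  # ['tenure', 'MonthlyCharges']
print(cat_cols)  # ['gender', 'Contract']
```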
5. My Secret Weapon: Merge Columns
This one trick makes everything faster and cleaner:
X['StreamingAny'] = ((X['StreamingTV'] == 'Yes') | (X['StreamingMovies'] == 'Yes')).astype(int)
X = X.drop(columns=['StreamingTV', 'StreamingMovies'])
Why I do this every time:
- Fewer columns → noticeably faster training (and the savings add up when you merge several groups)
- Saves RAM (huge on big datasets)
- Removes confusing duplicate signals
- Model learns real customer habits instead of memorizing noise
Feels like decluttering your code: suddenly everything runs smoother.
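The same trick generalizes to any group of Yes/No columns. Here’s a small helper sketch (merge_any is my name for it, not something from a library):

```python
import pandas as pd

def merge_any(df, cols, new_col):
    """Collapse several Yes/No columns into one 0/1 'any of them' flag."""
    df[new_col] = (df[cols] == "Yes").any(axis=1).astype(int)
    return df.drop(columns=cols)

X = pd.DataFrame({
    "StreamingTV":     ["Yes", "No", "No"],
    "StreamingMovies": ["No", "No", "Yes"],
})
X = merge_any(X, ["StreamingTV", "StreamingMovies"], "StreamingAny")
print(X["StreamingAny"].tolist())  # [1, 0, 1]
```

Same helper works for the add-on group (OnlineSecurity, OnlineBackup, TechSupport, …) if you want to merge more.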
6. Turn Words into Numbers
Easy Yes/No first:
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
for col in binary_cols:
    X[col] = X[col].map({'Yes': 1, 'No': 0})
Then the rest:
X = pd.get_dummies(X, drop_first=True)
All numeric now. Boom.
7. Split Smart
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
Stratify keeps the churn ratio the same in both splits, which is critical for this competition.
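You can see stratify doing its job on toy labels: build an imbalanced vector and check that both splits keep the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 20% positives, like a churn column
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.mean(), y_val.mean())  # both 0.2: the churn ratio survives the split
```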
8. Train Two Models (Quick Check + Real Deal)
Baseline (Random Forest):
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("RF ROC-AUC:", roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))
The one that actually scores well (LightGBM):
lgb = LGBMClassifier(random_state=42)
lgb.fit(X_train, y_train)
print("LGB ROC-AUC:", roc_auc_score(y_val, lgb.predict_proba(X_val)[:, 1]))
LightGBM usually jumps ahead — this is your starting leaderboard model.
9. Test Set (Same Steps, No Leaks!)
test = pd.read_csv("/kaggle/input/competitions/playground-series-s6e3/test.csv")
test_X = test.drop(columns=['id'])
# Same cleanup
test_X['TotalCharges'] = pd.to_numeric(test_X['TotalCharges'], errors='coerce').fillna(0)
# Same merge
test_X['StreamingAny'] = ((test_X['StreamingTV'] == 'Yes') | (test_X['StreamingMovies'] == 'Yes')).astype(int)
test_X = test_X.drop(columns=['StreamingTV', 'StreamingMovies'])
# Same binary mapping
for col in binary_cols:
    test_X[col] = test_X[col].map({'Yes': 1, 'No': 0})
# Same encoding
test_X = pd.get_dummies(test_X, drop_first=True)
test_X = test_X.reindex(columns=X.columns, fill_value=0)
preds = lgb.predict_proba(test_X)[:, 1]
submission = pd.DataFrame({"id": test["id"], "Churn": preds})
submission.to_csv("submission.csv", index=False)
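The reindex line is quietly doing the heavy lifting: it forces the test frame into the exact column layout the model was trained on, adding any dummy column the test split happened to miss. A toy illustration:

```python
import pandas as pd

# Columns the model saw at training time (hypothetical dummy names)
train_cols = ["tenure", "Contract_Two year", "Contract_One year"]

# Imagine the test split never contained a 'Two year' contract
test_X = pd.DataFrame({"tenure": [5], "Contract_One year": [1]})

test_X = test_X.reindex(columns=train_cols, fill_value=0)
print(test_X.columns.tolist())  # matches the training layout exactly
print(test_X.iloc[0].tolist())  # [5, 0, 1] -- the missing dummy filled with 0
```

Without it, LightGBM would refuse the prediction (or worse, silently misalign features), so never skip this step.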
Want to Level Up Later?
- Add cross-validation
- Merge more groups (add-ons, contract type)
- Tune LightGBM with Optuna
- Try CatBoost (zero encoding needed)
One-Sentence Recap
Start with clean loading → merge redundant columns → encode → split → train LGB → apply exact same steps to test → submit.
That’s the real starting point every Kaggler needs.
Copy this notebook, run it, and you’re already ahead.
Got a score? Hit a bug? Drop it in the comments or tag me; I reply to every one.
Happy starting!
Girma Wakeyo
Kaggle → https://www.kaggle.com/girmawakeyo
GitHub → https://github.com/Girma35
X → https://x.com/Girma880731631
Follow for more quick-start notebooks and competition tips. Let’s climb those leaderboards together!