<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shivam Chhuneja</title>
    <description>The latest articles on DEV Community by Shivam Chhuneja (@shivamchhuneja).</description>
    <link>https://dev.to/shivamchhuneja</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1491093%2F60828d66-cda2-4177-ad28-03ebdc60a463.png</url>
      <title>DEV Community: Shivam Chhuneja</title>
      <link>https://dev.to/shivamchhuneja</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shivamchhuneja"/>
    <language>en</language>
    <item>
      <title>ARIMA vs. SARIMA: A Practical Guide to Choosing the Right Time Series Model</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Fri, 20 Jun 2025 09:13:34 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/arima-vs-sarima-a-practical-guide-to-choosing-the-right-time-series-model-26cb</link>
      <guid>https://dev.to/shivamchhuneja/arima-vs-sarima-a-practical-guide-to-choosing-the-right-time-series-model-26cb</guid>
      <description>&lt;p&gt;When you're working with time series data, naturally things are going to point towards forecasting.&lt;/p&gt;

&lt;p&gt;And two of the most reliable classical tools for this are the ARIMA and SARIMA models.&lt;/p&gt;

&lt;p&gt;To be honest, I was quite confused when I first learnt about these a year ago, but they make a ton of intuitive sense once you understand them at a deeper level.&lt;/p&gt;

&lt;p&gt;The names are similar, and so is their underlying logic, but there is a key difference that more or less decides which one you will pick in what scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7v0vljozmfqghn88y6h3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7v0vljozmfqghn88y6h3.jpg" alt="ARIMA SARIMA Data Science Meme" width="512" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, what separates them, and how do you decide which model fits your forecasting needs?&lt;/p&gt;

&lt;p&gt;That's what I want to take you through in this post, are you ready? Are you ready??????&lt;/p&gt;

&lt;p&gt;Okay, seems like you are, so let's do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, Let's Look at ARIMA
&lt;/h2&gt;

&lt;p&gt;ARIMA, or &lt;em&gt;AutoRegressive Integrated Moving Average&lt;/em&gt;, is a model designed to understand and predict future values in a time series.&lt;/p&gt;

&lt;p&gt;Bet you didn't know that, and this wasn't redundant at all.&lt;/p&gt;

&lt;p&gt;Anyways, it breaks down a series into three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;AR (AutoRegressive):&lt;/strong&gt; This part of the model assumes that future values depend on past values. It's the "look-back" component. For example, this month's sales might be influenced by last month's sales.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;I (Integrated):&lt;/strong&gt; This is the differencing step used to make the data stationary. In normal human language, many time series models work best when the data's statistical properties (like its mean and variance) are stable over time. Differencing, which means subtracting the previous value from the current value, is a common way to achieve this &lt;em&gt;stability&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MA (Moving Average):&lt;/strong&gt; This component looks at past forecast errors to improve the current prediction. It helps the model adjust for shocks or randomness that weren't captured by the autoregressive term.&lt;/li&gt;
&lt;/ul&gt;
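&lt;p&gt;To make the "I" part concrete, here's a tiny sketch (synthetic data, not from any real project) of what first-order differencing does to a trending series in pandas:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# A synthetic series with a clear upward trend (illustrative data only)
rng = np.random.default_rng(42)
series = pd.Series(np.arange(100) * 2.0 + rng.normal(0, 1, 100))

# First-order differencing: subtract the previous value from the current one
diff = series.diff().dropna()

# The raw series has a growing mean; the differenced one hovers around
# a stable value (roughly the slope of the trend, ~2 here)
print(round(series.mean(), 1))
print(round(diff.mean(), 1))
```

&lt;p&gt;That stable, flat-mean behaviour after differencing is the stationarity the AR and MA terms want to work with.&lt;/p&gt;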

&lt;p&gt;ARIMA is at its best when your data has a clear trend but lacks any repeating, cyclical patterns. (Although most business metrics, especially sales numbers, do tend to have at least some seasonality.)&lt;/p&gt;

&lt;h3&gt;
  
  
  A good use case for ARIMA?
&lt;/h3&gt;

&lt;p&gt;Forecasting the monthly electricity consumption for a manufacturing plant.&lt;/p&gt;

&lt;p&gt;The usage might be steadily increasing as the company grows, but it doesn't necessarily spike in the same month each year (at least that's the assumption). It might, of course: AC usage spikes every summer and heater usage every winter, but a manufacturing facility might run consistently year-round to hold temperatures steady either way.&lt;/p&gt;

&lt;p&gt;So, cut me some slack, it's difficult to find good examples!&lt;/p&gt;

&lt;p&gt;Anyways, this is a classic trend-based pattern, making it a perfect time series data set for an ARIMA model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens When There is Seasonality in Our Time Series Data?
&lt;/h2&gt;

&lt;p&gt;This is where SARIMA enters the picture.&lt;/p&gt;

&lt;p&gt;If your data has predictable, repeating patterns over a fixed period, you're dealing with seasonality, and a standard ARIMA model might think of these cycles as noise.&lt;/p&gt;

&lt;p&gt;SARIMA, or &lt;strong&gt;Seasonal ARIMA&lt;/strong&gt;, extends the base model by adding a seasonal component.&lt;/p&gt;

&lt;p&gt;AR, I, and MA remain exactly the same, but SARIMA applies them to the seasonal patterns as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SARIMA = ARIMA (for the trend) + A Seasonal Layer (for the cycles)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model doesn't just look back at the last few data points (like &lt;code&gt;t-1&lt;/code&gt;, &lt;code&gt;t-2&lt;/code&gt;); it also looks back at seasonal intervals (like &lt;code&gt;t-12&lt;/code&gt; for yearly data or &lt;code&gt;t-7&lt;/code&gt; for weekly data).&lt;/p&gt;
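&lt;p&gt;A quick sketch of what that seasonal look-back means in practice: on a purely seasonal toy series (synthetic data, for illustration only), the autocorrelation at the seasonal lag dwarfs what you see at other lags:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a strong 12-month cycle (illustrative only)
months = np.arange(120)
sales = pd.Series(50 + 10 * np.sin(2 * np.pi * months / 12))

# Correlation with the same month one year back: same phase, so close to 1
print(round(sales.autocorr(lag=12), 2))

# Correlation half a cycle back: opposite phase, so close to -1
print(round(sales.autocorr(lag=6), 2))
```

&lt;p&gt;Strong autocorrelation at lags like 12 (monthly data) or 7 (daily data) is exactly the signal that tells you a seasonal layer is worth adding.&lt;/p&gt;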

&lt;p&gt;&lt;strong&gt;When would you use it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, think of a business forecasting monthly wine sales (a project I did for &lt;a href="https://codebynight.dev/posts/why-time-series-deserves-comeback-forecasting-with-arima-sarima" rel="noopener noreferrer"&gt;time series forecasting&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;You'd almost certainly see a massive spike in sales during the summer months and another smaller peak around the holidays.&lt;/p&gt;

&lt;p&gt;An ARIMA model might capture the general upward or downward trend, but it would likely miss these seasonal peaks. SARIMA is built to pick up on them and forecast them properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Quick Example for SARIMA
&lt;/h3&gt;

&lt;p&gt;Let's say your data looks something like this, showing a clear yearly pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Month&lt;/th&gt;
&lt;th&gt;Sales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jan&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feb&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mar&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dec&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feb&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An ARIMA model would process this as a general trend with some random fluctuations.&lt;/p&gt;

&lt;p&gt;A SARIMA model will identify the repeating 12-month cycle as a primary signal for making predictions.&lt;/p&gt;
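&lt;p&gt;Here's a toy reconstruction of that table in pandas (numbers invented to match the shape of the example), plus the seasonal-differencing trick SARIMA's seasonal "I" term uses: subtract the value from the same month last year and the cycle mostly disappears:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Two years of monthly sales: a repeating yearly shape plus slight growth (toy data)
seasonal_shape = np.array([30, 28, 40, 45, 55, 70, 75, 72, 60, 50, 65, 80], dtype=float)
sales = pd.Series(np.concatenate([seasonal_shape, seasonal_shape + 2]))

# Seasonal differencing at lag 12: compare each month to the same month last year
seasonal_diff = sales.diff(12).dropna()

print(seasonal_diff.tolist())
```

&lt;p&gt;After the lag-12 difference, all that's left is the small year-over-year growth, exactly the kind of residual signal the non-seasonal part of the model can then handle.&lt;/p&gt;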

&lt;h2&gt;
  
  
  Real-World Scenarios for SARIMA vs ARIMA
&lt;/h2&gt;

&lt;p&gt;Choosing between them often comes down to the nature of your data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Fit&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Forecasting monthly website visits for a B2B SaaS product&lt;/td&gt;
&lt;td&gt;ARIMA&lt;/td&gt;
&lt;td&gt;Data is mainly driven by a clean, non-cyclical trend.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predicting daily ice cream sales&lt;/td&gt;
&lt;td&gt;SARIMA&lt;/td&gt;
&lt;td&gt;Sales will most probably have strong seasonal peaks in summer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimating weekly foot traffic for a grocery store&lt;/td&gt;
&lt;td&gt;SARIMA&lt;/td&gt;
&lt;td&gt;Captures both weekly cycles (weekends) and holiday spikes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Projecting the annual depreciation of a piece of equipment&lt;/td&gt;
&lt;td&gt;ARIMA&lt;/td&gt;
&lt;td&gt;Follows a slow, steady trend without seasonal cycles.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Okay, Let's Look at the Code
&lt;/h2&gt;

&lt;p&gt;The difference is also clear in the code.&lt;/p&gt;

&lt;p&gt;In Python's &lt;code&gt;statsmodels&lt;/code&gt; library, you'll see a separate argument for the seasonal component in SARIMA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ARIMA Model:&lt;/strong&gt; The &lt;code&gt;order&lt;/code&gt; parameter &lt;code&gt;(p,d,q)&lt;/code&gt; is for AR, I, and MA terms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;statsmodels.tsa.arima.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ARIMA&lt;/span&gt;

&lt;span class="c1"&gt;# For non-seasonal data
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ARIMA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forecast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SARIMA Model:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here, we add the &lt;code&gt;seasonal_order&lt;/code&gt; parameter &lt;code&gt;(P,D,Q,m)&lt;/code&gt;, where &lt;code&gt;m&lt;/code&gt; is the length of the seasonal cycle (e.g., 12 for yearly seasonality on monthly data).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;statsmodels.tsa.statespace.sarimax&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SARIMAX&lt;/span&gt;

&lt;span class="c1"&gt;# For seasonal data
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SARIMAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;seasonal_order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forecast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When to Use Which?
&lt;/h2&gt;

&lt;p&gt;The decision is quite straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use ARIMA&lt;/strong&gt; when your time series data shows a trend but has no obvious repeating, seasonal patterns.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use SARIMA&lt;/strong&gt; when the data exhibits clear and predictable seasonal cycles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bonus Tip: Let the Model Do the Work
&lt;/h2&gt;

&lt;p&gt;If you're unsure whether a seasonal component would improve your forecast, you don't have to guess.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pmdarima&lt;/code&gt; library includes an &lt;code&gt;auto_arima&lt;/code&gt; function that can test various model configurations for you.&lt;/p&gt;

&lt;p&gt;This is what I did for my project as well, on top of looking at the diagnostics myself.&lt;/p&gt;

&lt;p&gt;By setting &lt;code&gt;seasonal=True&lt;/code&gt; and specifying the seasonal period &lt;code&gt;m&lt;/code&gt;, you can let the function figure out whether a SARIMA model provides a better fit.&lt;/p&gt;

&lt;p&gt;Think grid search for ML hyperparameters, but for time series models in this case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pmdarima&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;

&lt;span class="c1"&gt;# Let auto_arima find the best parameters, including seasonal ones
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;auto_arima&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seasonal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# You can then check the model summary and compare performance metrics like AIC or RMSE
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let's Wrap This Up
&lt;/h2&gt;

&lt;p&gt;The machine learning landscape is filled with complex models like LSTMs and Transformers, but ARIMA and SARIMA imo are still quite useful.&lt;/p&gt;

&lt;p&gt;They are workhorses for a reason: they are fast to train, their results are highly interpretable, and they perform exceptionally well on datasets that aren't massive enough to require deep learning.&lt;/p&gt;

&lt;p&gt;And as data scientists, I believe our job is not to build the best model, but the best model that solves the business problem within resource constraints &lt;em&gt;(time, effort &amp;amp; money)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For setting up a solid baseline forecast in a business context, these models are going to be the perfect place to start.&lt;/p&gt;

&lt;p&gt;Let the data make the choice for you, and you'll have a strong and sensible model.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>startup</category>
    </item>
    <item>
      <title>Full Code Walkthrough - Reducing Churn in E-Commerce with Predictive Modelling</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Thu, 19 Jun 2025 12:54:20 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/full-code-walkthrough-reducing-churn-in-e-commerce-with-predictive-modelling-npd</link>
      <guid>https://dev.to/shivamchhuneja/full-code-walkthrough-reducing-churn-in-e-commerce-with-predictive-modelling-npd</guid>
      <description>&lt;p&gt;If you read part 1 of this series, ala &lt;a href="https://codebynight.dev/posts/churn-analysis-capstone-ecommerce-customer-retention" rel="noopener noreferrer"&gt;Churn Prediction for E-Commerce with Predictive Modelling&lt;/a&gt;, you know I recently wrapped up a full end-to-end churn prediction project as part of my postgrad program. That article was the 30,000-foot view -- the business problem, the segmentation insights, the high-level model results.&lt;/p&gt;

&lt;p&gt;With this one I simply walk you through the code.&lt;/p&gt;

&lt;p&gt;But instead of just dumping code snippets for you to copy pasta, I want to walk you through what I actually did and, more importantly, why it matters.&lt;/p&gt;

&lt;p&gt;This is especially important if you're learning data science like me and want to move past simply "running a notebook" to actually "building useful solutions."&lt;/p&gt;

&lt;p&gt;With me?&lt;/p&gt;

&lt;p&gt;Alright, let's go.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 1. The Barebones Setup: Getting Started with Libraries
&lt;/h2&gt;

&lt;p&gt;First things first, importing your go-to libraries. Everyone does this, and these are more or less the main ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what each one does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;pandas&lt;/code&gt; is your table wrangler. It's how you handle, clean, and manipulate tabular data like spreadsheets or database tables.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;numpy&lt;/code&gt; is your math person. It's for numerical operations, especially with arrays and matrices. (highly beneficial with deep learning too)&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;matplotlib.pyplot&lt;/code&gt; and &lt;code&gt;seaborn&lt;/code&gt; are your default visualization libs. Patterns, distributions, and relationships etc. You can't understand what's going on without visualizing it.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📥 2. Loading and Peeking at the Data: The First Handshake
&lt;/h2&gt;

&lt;p&gt;Right after importing, the very next step is always to load your data and take a quick look.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_excel("/content/Customer Churn Data.xlsx", sheet_name="data for churn")
print(df.head())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   AccountID  Churn Tenure  City_Tier  CC_Contacted_LY      Payment  Gender
0      20000      1      4        3.0              6.0   Debit Card  Female
1      20001      1      0        1.0              8.0          UPI    Male
2      20002      1      0        1.0             30.0   Debit Card    Male
3      20003      1      0        3.0             15.0   Debit Card    Male
4      20004      1      0        1.0             12.0  Credit Card    Male
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to note:&lt;/p&gt;

&lt;p&gt;You're not just checking if the file loaded correctly here.&lt;/p&gt;

&lt;p&gt;You're starting to build a mental model of the dataset.&lt;/p&gt;

&lt;p&gt;What are the column names?&lt;/p&gt;

&lt;p&gt;What kind of values are in the first few rows?&lt;/p&gt;

&lt;p&gt;Are there any immediate red flags?&lt;/p&gt;

&lt;p&gt;This quick check sets the stage for the whole project more or less.&lt;/p&gt;




&lt;h2&gt;
  
  
  📏 3. How Big Is This? Getting the Lay of the Land
&lt;/h2&gt;

&lt;p&gt;Once you've had a quick look, the next question is scale.&lt;/p&gt;

&lt;p&gt;How many rows?&lt;/p&gt;

&lt;p&gt;How many columns?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Dataset Shape:", df.shape)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dataset Shape: (11260, 19)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to note:&lt;/p&gt;

&lt;p&gt;Okay, over 11,000 accounts with 19 different pieces of information each.&lt;/p&gt;

&lt;p&gt;That's a decent size.&lt;/p&gt;

&lt;p&gt;Small enough that you can probably work with it comfortably on a standard laptop, but large enough to feel like a realistic business dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔎 4. What's Missing? And What's Weird?: The Data Audit
&lt;/h2&gt;

&lt;p&gt;Now, let's get serious about inspecting the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Missing Values and Data Types:")
print(df.info())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (truncated here; you'd see the full thing in your notebook):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tenure                   11158 non-null  object
City_Tier                11148 non-null  float64
Payment                  11151 non-null  object
...
Login_device             11039 non-null  object

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is usually where you hit your first real "uh oh" moment.&lt;/p&gt;

&lt;p&gt;Tenure is showing up as &lt;code&gt;object&lt;/code&gt; (meaning Python thinks it's text) when it should be numeric.&lt;/p&gt;

&lt;p&gt;Same for Account_user_count.&lt;/p&gt;

&lt;p&gt;That means they're likely strings with some hidden characters or mixed types.&lt;/p&gt;

&lt;p&gt;We'll definitely need to convert those.&lt;/p&gt;
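&lt;p&gt;The fix I'd usually reach for is pandas' &lt;code&gt;to_numeric&lt;/code&gt; with &lt;code&gt;errors="coerce"&lt;/code&gt;, which parses what it can and turns the junk into NaN for the imputation step to deal with. A minimal sketch on a made-up column (the stray "#" value is hypothetical, just to stand in for hidden bad characters):&lt;/p&gt;

```python
import pandas as pd

# A column that pandas reads as object because of stray text values (made-up example)
df = pd.DataFrame({"Tenure": ["4", "0", "12", "#", None]})

# Coerce to numeric: real numbers parse, junk becomes NaN for the imputer to fill later
df["Tenure"] = pd.to_numeric(df["Tenure"], errors="coerce")

print(df["Tenure"].dtype)        # float64
print(df["Tenure"].isna().sum()) # 2 missing values: the "#" and the None
```

&lt;p&gt;The same one-liner works for &lt;code&gt;Account_user_count&lt;/code&gt; or any other column stuck as &lt;code&gt;object&lt;/code&gt;.&lt;/p&gt;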

&lt;p&gt;Plus, a bunch of missing values across several columns.&lt;/p&gt;

&lt;p&gt;This output is very close to what real datasets look like.&lt;/p&gt;

&lt;p&gt;Dirty, inconsistent, and full of little quirks you have to handle one by one.&lt;/p&gt;

&lt;p&gt;If your data looks perfect after &lt;code&gt;df.info()&lt;/code&gt;, you're probably working with a toy dataset, or someone else has already done the hard work for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧹 5. Light Data Cleaning Before Modelling
&lt;/h2&gt;

&lt;p&gt;Cleaning and preparing your data is one of the best habits to build if you want good outcomes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.impute import KNNImputer
knn_features = ['rev_per_month', 'cashback', 'Tenure', 'City_Tier',
                'Service_Score', 'Account_user_count', 'CC_Agent_Score',
                'Complain_ly', 'rev_growth_yoy', 'coupon_used_for_payment', 'Day_Since_CC_connect']

df_knn = df.copy()

knn_imputer = KNNImputer(n_neighbors=5)

df_knn[knn_features] = knn_imputer.fit_transform(df_knn[knn_features])

print(df_knn[knn_features].isnull().sum())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to note:&lt;/p&gt;

&lt;p&gt;In this specific dataset, I found no duplicates, but there were a few missing values, which I filled with the KNN imputer before plotting churn.&lt;/p&gt;

&lt;p&gt;Is it optimal? Not really, but slightly-off imputed values wouldn't cause major business issues here, so I chose KNN imputation over discarding the rows with missing values.&lt;/p&gt;

&lt;p&gt;If you're building data pipelines, including this step becomes even more important for data integrity.&lt;/p&gt;
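&lt;p&gt;If you do go the pipeline route, the same imputation can live inside an sklearn &lt;code&gt;Pipeline&lt;/code&gt; so it's applied identically at training and inference time. A minimal sketch, with a toy feature matrix standing in for the real churn features:&lt;/p&gt;

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy feature matrix with one missing value (stand-in for the real churn features)
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Imputation is a pipeline step, so it's fit on training data and reused consistently
pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X))
```

&lt;p&gt;The upside is data integrity: you can't accidentally impute your training and serving data in different ways.&lt;/p&gt;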




&lt;h2&gt;
  
  
  📊 6. Let's Talk Class Imbalance
&lt;/h2&gt;

&lt;p&gt;Before jumping into features, it's important to understand your target variable.&lt;/p&gt;

&lt;p&gt;In churn prediction, this means looking at the distribution of "churned" vs. "not churned" customers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.countplot(x='Churn', data=df)
plt.title("Churn Distribution")
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnna4jjt4u163kroqp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnna4jjt4u163kroqp6.png" alt="Churn Distribution" width="549" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Observation:&lt;/p&gt;

&lt;p&gt;Only about 17% of the users in this dataset had actually churned. The vast majority were still active customers.&lt;/p&gt;

&lt;p&gt;Teaching moment:&lt;/p&gt;

&lt;p&gt;This is a classic case of an imbalanced classification problem.&lt;/p&gt;

&lt;p&gt;This means you need to be extremely careful using basic accuracy as your main metric.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because a model that simply predicts "not churned" for every single customer would still be about 83% accurate!&lt;/p&gt;

&lt;p&gt;That sounds good on paper, but it would be completely useless for a business trying to identify and retain at-risk customers.&lt;/p&gt;

&lt;p&gt;This is why we'll need to focus on other metrics later.&lt;/p&gt;
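&lt;p&gt;You can sanity-check that "always predict not churned" baseline in a few lines (the 17/83 split below is hard-coded to mirror this dataset's imbalance; with real data you'd use the actual &lt;code&gt;Churn&lt;/code&gt; column):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Mimic an imbalanced target: about 17% churners, 83% retained
churn = pd.Series([1] * 17 + [0] * 83)

# A "model" that predicts 0 (not churned) for everyone
always_retained = np.zeros(len(churn), dtype=int)

# Accuracy looks great despite the model being useless...
accuracy = (always_retained == churn).mean()
print(accuracy)  # 0.83

# ...while recall on the churn class, the thing the business cares about, is zero
churners = churn == 1
recall = (always_retained[churners] == 1).mean()
print(recall)  # 0.0
```

&lt;p&gt;That zero recall is the whole argument for leaning on precision, recall, and F1 instead of raw accuracy on imbalanced targets.&lt;/p&gt;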




&lt;h2&gt;
  
  
  📈 7. Exploratory Data Analysis But Beyond Just Plotting
&lt;/h2&gt;

&lt;p&gt;EDA is more than just making pretty charts.&lt;/p&gt;

&lt;p&gt;It's about finding relationships and insights that directly inform your modeling choices and business recommendations.&lt;/p&gt;

&lt;p&gt;Now I did distribution, heatmaps, boxplots, histograms - you name it.&lt;/p&gt;

&lt;p&gt;Visualization is one of those things where doing a bit more than strictly needed rarely hurts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Distribution across numerical columns
numerical_cols = df_knn.select_dtypes(include=['int64', 'float64']).columns

plt.figure(figsize=(15, 12))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(4, 4, i)
    sns.histplot(df_knn[col], kde=True, bins=30)
    plt.title(f"Distribution of {col}")

plt.tight_layout()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now try and understand the correlation across our key columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Heatmap to understand the correlation if any

numerical_df = df_knn.select_dtypes(include=['number'])

plt.figure(figsize=(12, 8))
sns.heatmap(numerical_df.corr(), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we plot churn against other numerical columns in a boxplot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Box plot of churn vs other items
numerical_cols = ['rev_per_month', 'cashback', 'Tenure', 'City_Tier', 'Service_Score',
                  'Account_user_count', 'CC_Agent_Score', 'Complain_ly',
                  'rev_growth_yoy', 'coupon_used_for_payment', 'Day_Since_CC_connect']

plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_cols[:6], 1):
    plt.subplot(2, 3, i)
    sns.boxplot(x='Churn', y=col, data=df_knn, palette='coolwarm')
    plt.title(f"Churn vs {col}")
plt.tight_layout()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally bar charts of churn against categorical columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bar charts of churn against categorical columns
categorical_cols = ['Payment', 'Gender', 'account_segment', 'Marital_Status', 'Login_device']

plt.figure(figsize=(15, 8))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(2, 3, i)
    sns.countplot(x=col, hue='Churn', data=df_knn, palette='coolwarm')
    plt.xticks(rotation=45)
    plt.title(f"Churn by {col}")
plt.tight_layout()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd5d61f9bjzoj6t6pth7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd5d61f9bjzoj6t6pth7.png" alt="Churn distribution histograms" width="800" height="633"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoa6owqdlh5u2k6jbncv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoa6owqdlh5u2k6jbncv.png" alt="Heatmap for churn correlation" width="800" height="632"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz6rv5kh5gd3fe1e8x95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz6rv5kh5gd3fe1e8x95.png" alt="Churn against numerical columns boxplot" width="800" height="531"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz43erf3ux5qprxwwnrqb.png" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz43erf3ux5qprxwwnrqb.png" alt="Churn against categorical columns bar charts" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, there's a ton of work on the visualization side alone, and by the way, I skipped a couple more images since this point was already getting too long.&lt;/p&gt;

&lt;p&gt;Now, what to note:&lt;/p&gt;

&lt;p&gt;Customers with lower revenue per month seem to be more likely to churn.&lt;/p&gt;

&lt;p&gt;This makes intuitive business sense.&lt;/p&gt;

&lt;p&gt;But I think this is also a good moment to talk about why feature engineering needs domain context and critical thinking.&lt;/p&gt;

&lt;p&gt;Revenue might look predictive, but what if it's highly skewed, or volatile, or heavily influenced by seasonality?&lt;/p&gt;

&lt;p&gt;Simply using raw &lt;code&gt;rev_per_month&lt;/code&gt; might not capture the full story.&lt;/p&gt;

&lt;p&gt;This is why I also explored more nuanced behavioral markers like &lt;strong&gt;year-over-year revenue growth&lt;/strong&gt;, the &lt;strong&gt;number of complaints filed&lt;/strong&gt;, and &lt;strong&gt;cashback usage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Metrics like these tell you more about a customer's opinion of the brand, their overall sentiment, than a raw revenue figure does.&lt;/p&gt;

&lt;p&gt;It's about finding features that capture &lt;em&gt;behavior&lt;/em&gt; and &lt;em&gt;trends&lt;/em&gt;, not just static values.&lt;/p&gt;




&lt;h2&gt;
  
  
  📲 8. Devices Matter When it Comes to User Behaviour
&lt;/h2&gt;

&lt;p&gt;Sometimes, EDA throws you a curveball.&lt;/p&gt;

&lt;p&gt;Maybe an insight you didn't expect, but that opens up new lenses for understanding the problem.&lt;/p&gt;

&lt;p&gt;I'm talking about plotting churn by device type: check the images above and you'll notice that the last bar chart shows churn by login device.&lt;/p&gt;

&lt;p&gt;Surprise Insight:&lt;/p&gt;

&lt;p&gt;Desktop users in this dataset turned out to be more likely to churn than mobile users.&lt;/p&gt;

&lt;p&gt;At first glance, this might seem odd.&lt;/p&gt;

&lt;p&gt;But it doesn't hurt to then ask questions like:&lt;/p&gt;

&lt;p&gt;Could it be related to the user experience on desktop vs. mobile?&lt;/p&gt;

&lt;p&gt;Are desktop users a different demographic: maybe older users who aren't as "sticky" with the platform?&lt;/p&gt;

&lt;p&gt;This is why EDA isn't just about crunching stats.&lt;/p&gt;

&lt;p&gt;It's where a data scientist learns to think at the intersection of product management, data analysis, and design.&lt;/p&gt;

&lt;p&gt;An insight like this isn't just another data point.&lt;/p&gt;

&lt;p&gt;It's potential feedback, a prompt you can share with other teams to look into.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧪 9. KNN Imputer for Missing Values
&lt;/h2&gt;

&lt;p&gt;I know I already talked about the imputer, but it deserves its own section.&lt;/p&gt;

&lt;p&gt;And I'm sure you might have noticed the wonky order of things in this project, but you do you boo.&lt;/p&gt;

&lt;p&gt;Why KNN?&lt;/p&gt;

&lt;p&gt;Traditional methods like simply using the mean or median to fill missing values (&lt;code&gt;df.fillna(df.mean())&lt;/code&gt;) are fast, but they can distort the relationships within your data.&lt;/p&gt;

&lt;p&gt;KNN (K-Nearest Neighbors) imputation is a bit smarter. It imputes missing values based on the values from the &lt;code&gt;n_neighbors&lt;/code&gt; most similar rows.&lt;/p&gt;

&lt;p&gt;This helps preserve the underlying data relationships much better, which is especially important in datasets that track customer behavior.&lt;/p&gt;

&lt;p&gt;It assumes that customers who are "alike" in their known features will likely have similar values for their missing features.&lt;/p&gt;

&lt;p&gt;It's a more nuanced way to fill in the blanks.&lt;/p&gt;
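&lt;p&gt;For reference, here's a minimal sketch of how that looks with scikit-learn's &lt;code&gt;KNNImputer&lt;/code&gt; (the column names below are made up for illustration, not the real dataset's):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with gaps; column names are hypothetical
df = pd.DataFrame({
    "tenure": [1.0, 2.0, np.nan, 10.0, 12.0],
    "rev_per_month": [120.0, np.nan, 95.0, 300.0, 310.0],
})

# Each missing value is filled in from the 2 most similar rows,
# instead of one global mean that ignores row-level context
imputer = KNNImputer(n_neighbors=2)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_filled.isna().sum().sum())  # 0: no blanks left
```

&lt;p&gt;Under the hood it uses a NaN-aware Euclidean distance, so rows with partially missing data can still find their nearest neighbors.&lt;/p&gt;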




&lt;h2&gt;
  
  
  ⚖️ 10. Handling Class Imbalance with SMOTE (and When Not To)
&lt;/h2&gt;

&lt;p&gt;You saw how imbalanced our churn data was. If I just train a model on this skewed data, it might get "lazy" and just predict the majority class (non-churn).&lt;/p&gt;

&lt;p&gt;To work through this, we often use techniques like SMOTE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from imblearn.over_sampling import SMOTE

X = df_knn.drop(columns=["Churn"])
y = df_knn["Churn"]

categorical_cols = X.select_dtypes(include=["object"]).columns

label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

from collections import Counter
print(f"Class distribution before SMOTE: {Counter(y_train)}")
print(f"Class distribution after SMOTE: {Counter(y_train_smote)}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another teaching moment:&lt;/p&gt;

&lt;p&gt;SMOTE isn't a silver bullet, btw, and it's definitely not a "free win."&lt;/p&gt;

&lt;p&gt;It works by creating synthetic examples of the minority class (churned customers) to balance the dataset.&lt;/p&gt;

&lt;p&gt;It can work well, especially with simpler, more linear models.&lt;/p&gt;

&lt;p&gt;However, you need to be cautious.&lt;/p&gt;

&lt;p&gt;For tree-based models like Random Forests or XGBoost, SMOTE can sometimes introduce noisy, unrealistic data points that those models latch onto, leading to overfitting or worse performance than if you hadn't used it at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always compare your model's performance with and without resampling techniques.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's no one-size-fits-all solution for imbalanced data.&lt;/p&gt;

&lt;p&gt;Sometimes, different evaluation metrics or cost-sensitive learning (where you tell the model that misclassifying a churner is "more expensive" than misclassifying a non-churner) are more effective.&lt;/p&gt;
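&lt;p&gt;As a sketch of that cost-sensitive idea, most scikit-learn classifiers accept a &lt;code&gt;class_weight&lt;/code&gt; argument, so you can make missed churners "more expensive" without generating any synthetic rows. The 5x weight below is an arbitrary illustration, not a tuned value, and the data is synthetic:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~10% positives standing in for churners
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Tell the model that missing a churner (class 1) costs 5x
# as much as a false alarm on a loyal customer (class 0)
clf = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=42)
clf.fit(X_train, y_train)

print(f"Churner recall: {recall_score(y_test, clf.predict(X_test)):.3f}")
```

&lt;p&gt;Compare this against SMOTE on the same split; whichever gives the better precision/recall trade-off for your business wins.&lt;/p&gt;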




&lt;h2&gt;
  
  
  🌲 11. Model Training: Building the Predictor
&lt;/h2&gt;

&lt;p&gt;After all that data wrangling and prep work, it's finally time to train the model.&lt;/p&gt;

&lt;p&gt;For this project, a &lt;code&gt;Random Forest Classifier&lt;/code&gt; gave me a good output compared to the other models.&lt;/p&gt;

&lt;p&gt;Check the other blog post to see the results table against all other models.&lt;/p&gt;

&lt;p&gt;Here is a quick overview still:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
results_list_smote = []

for name, model in models.items():
    model.fit(X_train_smote, y_train_smote)

    y_train_pred = model.predict(X_train_smote)
    y_test_pred = model.predict(X_test)

    y_train_prob = model.predict_proba(X_train_smote)[:, 1]
    y_test_prob = model.predict_proba(X_test)[:, 1]

    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    test_auc_roc = roc_auc_score(y_test, y_test_prob)

    train_accuracy = accuracy_score(y_train_smote, y_train_pred)
    train_precision = precision_score(y_train_smote, y_train_pred)
    train_recall = recall_score(y_train_smote, y_train_pred)
    train_f1 = f1_score(y_train_smote, y_train_pred)
    train_auc_roc = roc_auc_score(y_train_smote, y_train_prob)

    overfitting = "Yes" if (train_auc_roc - test_auc_roc) &amp;gt; 0.05 else "No"

    results_list_smote.append({
        "Model": name,
        "Train Accuracy": round(train_accuracy, 4),
        "Test Accuracy": round(test_accuracy, 4),
        "Train Precision": round(train_precision, 4),
        "Test Precision": round(test_precision, 4),
        "Train Recall": round(train_recall, 4),
        "Test Recall": round(test_recall, 4),
        "Train F1-score": round(train_f1, 4),
        "Test F1-score": round(test_f1, 4),
        "Train AUC-ROC": round(train_auc_roc, 4),
        "Test AUC-ROC": round(test_auc_roc, 4),
        "Overfitting": overfitting
    })

results_smote = pd.DataFrame(results_list_smote)

print(results_smote)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Model  Train Accuracy  Test Accuracy  Train Precision\
0  Decision Tree          1.0000         0.9312           1.0000
1  Random Forest          1.0000         0.9742           1.0000
2        XGBoost          0.9993         0.9694           0.9997
3       AdaBoost          0.8875         0.8708           0.8965
4    Naïve Bayes          0.7568         0.7127           0.7441
5            SVM          0.8020         0.7744           0.7916

   Test Precision  Train Recall  Test Recall  Train F1-score  Test F1-score\
0          0.7732        1.0000       0.8364          1.0000         0.8035
1          0.9397        1.0000       0.9050          1.0000         0.9220
2          0.9189        0.9988       0.8971          0.9993         0.9079
3          0.5957        0.8761       0.7230          0.8862         0.6532
4          0.3386        0.7828       0.7414          0.7629         0.4648
5          0.4122        0.8199       0.7995          0.8055         0.5440

   Train AUC-ROC  Test AUC-ROC Overfitting
0         1.0000        0.8934         Yes
1         1.0000        0.9936          No
2         1.0000        0.9908          No
3         0.9574        0.9025         Yes
4         0.8188        0.7770          No
5         0.8697        0.8526          No

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔥 12. What Actually Drives Churn?
&lt;/h2&gt;

&lt;p&gt;Getting good predictions is one thing.&lt;/p&gt;

&lt;p&gt;But understanding &lt;em&gt;why&lt;/em&gt; the model makes those predictions is where machine learning makes actual business sense.&lt;/p&gt;

&lt;p&gt;Feature importance scores are the direct link to actionable insights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;importances = model.feature_importances_
feature_names = X.columns

feature_importances_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
feature_importances_df = feature_importances_df.sort_values(by='importance', ascending=False)

print("Top 10 Feature Importances:")
print(feature_importances_df.head(10))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (Example Top Features):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhowaoqmct3oyzbpxgapr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhowaoqmct3oyzbpxgapr.png" alt="Features Influencing Churn" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feature  importance
Tenure    0.254321
Complaints    0.187654
Service_score    0.123456
Cashback_usage    0.098765

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Takeaways (and where ML meets business):&lt;/p&gt;

&lt;p&gt;For this model, the top features were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Tenure:&lt;/strong&gt; Customers with shorter tenure (new customers) were much more likely to churn. This is a common pattern.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Complaints:&lt;/strong&gt; Customers who had filed complaints had a higher probability of churning.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Service Score:&lt;/strong&gt; Lower service scores correlated with higher churn.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cashback Usage:&lt;/strong&gt; Interestingly, customers who used less cashback were more prone to churn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where you move beyond just saying "the model performed well."&lt;/p&gt;

&lt;p&gt;You translate these model insights into concrete retention actions.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Focus on onboarding:&lt;/strong&gt; If new users (low tenure) churn quickly, improve the first 30-60 days of their experience.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Proactive issue resolution:&lt;/strong&gt; If complaints are a big driver, how can we identify and resolve issues &lt;em&gt;before&lt;/em&gt; a customer even complains?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Incentivize engagement:&lt;/strong&gt; If cashback usage is a factor, can we encourage more use of incentives or loyalty programs?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📉 13. Evaluating Machine Learning Models the Right Way
&lt;/h2&gt;

&lt;p&gt;Basic accuracy can be a deceptive metric, especially with imbalanced datasets.&lt;/p&gt;

&lt;p&gt;So, for a churn model, what you really care about is catching churners (Recall) without excessively bugging loyal customers (Precision).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train_smote_df = pd.DataFrame(X_train_smote, columns=X_train.columns)

X_train = X_train[X_train_smote_df.columns]
X_test = X_test[X_train_smote_df.columns]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_train.columns)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for name, model in models.items():
    try:
        y_test_prob = model.predict_proba(X_test_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_test_prob)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, y_test_prob):.3f})")
    except Exception as e:
        print(f"Skipping {name} due to error: {e}")

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison SMOTE")
plt.legend()

plt.subplot(1, 2, 2)
for name, model in models.items():
    try:
        y_test_prob = model.predict_proba(X_test_scaled)[:, 1]
        precision, recall, _ = precision_recall_curve(y_test, y_test_prob)
        plt.plot(recall, precision, label=name)
    except Exception as e:
        print(f"Skipping {name} due to error: {e}")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve Comparison SMOTE")
plt.legend()

plt.tight_layout()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (Example Classification Report):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw49y8qa9cerdcdviy3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw49y8qa9cerdcdviy3z.png" alt="AUC ROC Curves with SMOTE" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Recall:&lt;/strong&gt; In this example, 0.8760. This means the model caught about 88% of the actual churners. Good for a business trying to identify at-risk customers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Precision:&lt;/strong&gt; In this example, 0.98. This means that of all the customers the model &lt;em&gt;predicted&lt;/em&gt; would churn, 98% actually did. The other 2% were false alarms.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AUC-ROC:&lt;/strong&gt; This gives you an overall measure of how well the model distinguishes between churners and non-churners across all possible thresholds. An AUC of 0.9960 is quite good.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I know this feels borderline overfit, but we'll take it as an example for now. I tested 6 models and then tuned the winner: Random Forest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip: Always include the confusion matrix.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's incredibly intuitive and helps explain model behavior to non-technical stakeholders.&lt;/p&gt;

&lt;p&gt;Things like who we're catching (true positives, true negatives) and who we're missing (false negatives, false positives).&lt;/p&gt;

&lt;p&gt;It literally shows the trade-offs.&lt;/p&gt;
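&lt;p&gt;A minimal sketch of reading one with scikit-learn (the labels below are toy values, not this project's predictions):&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = churned, 0 = stayed (illustrative values only)
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# ravel() unpacks the 2x2 matrix: rows are actual, columns are predicted
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Caught churners (TP): {tp}, missed churners (FN): {fn}")
print(f"False alarms (FP): {fp}, correctly left alone (TN): {tn}")
```

&lt;p&gt;Those four numbers are often all a stakeholder needs to understand what the model is actually doing.&lt;/p&gt;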




&lt;h2&gt;
  
  
  Key Lessons From This Machine Learning Adventure
&lt;/h2&gt;

&lt;p&gt;This project, and others like it, really taught me some lessons that go beyond just the syntax or algorithm theory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Churn is Complex:&lt;/strong&gt; You rarely get one single "magic" predictor. It's almost always a combination of behavioral factors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Modeling is Only a Slice of the Pie:&lt;/strong&gt; The actual coding and model training might be 30% of the work. Cleaning, exploring, feature engineering, and especially &lt;em&gt;explaining&lt;/em&gt; your results - that's often the harder, more time-consuming part.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;EDA is Massively Underrated:&lt;/strong&gt; My best business insights didn't come from the final model's predictions. They came during the exploratory data analysis phase, long before I even picked an algorithm. Spend time here!&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Don't Overfit to Metrics:&lt;/strong&gt; Understand &lt;em&gt;why&lt;/em&gt; you're using a particular metric. High recall isn't always better if your false alarms are very costly (e.g., flagging healthy patients as sick). Align your metrics with real world consequences.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Your Model is a Tool, Not the Product:&lt;/strong&gt; The ultimate goal isn't a perfect model score. What truly matters is how your model helps people (product managers, marketing teams, executives) make better, data-informed decisions.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If this walkthrough helped you think more clearly about data science projects (or if you just love seeing DS in action), feel free to share it with a friend or drop a comment with your thoughts.&lt;/p&gt;

&lt;p&gt;See you later.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
      <category>coding</category>
    </item>
    <item>
      <title>A Primer to Framing Business Problems for Machine Learning</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Tue, 17 Jun 2025 08:41:27 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/a-primer-to-framing-business-problems-for-machine-learning-o34</link>
      <guid>https://dev.to/shivamchhuneja/a-primer-to-framing-business-problems-for-machine-learning-o34</guid>
      <description>&lt;p&gt;A stakeholder comes to your desk. They're excited. "We need to use AI," they say, "to improve customer retention."&lt;/p&gt;

&lt;p&gt;You nod, open your editor, and you start thinking.&lt;/p&gt;

&lt;p&gt;Should I use XGBoost? Or maybe a neural network? How will I set up the pipeline?&lt;/p&gt;

&lt;p&gt;Stop. Right there.&lt;/p&gt;

&lt;p&gt;This is the single biggest mistake many of us make when we're starting out: we jump straight to thinking about solutions and algorithms.&lt;/p&gt;

&lt;p&gt;We're so excited to play with the cool, technical thing that we forget to ask the most important question first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What problem are we actually trying to solve?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most valuable skill for a data scientist isn't knowing every algorithm under the sun.&lt;/p&gt;

&lt;p&gt;It's easy: Google has a ton of code to help with that, and LLMs can help out too. So what makes a data scientist more valuable than a ready-made code example found anywhere on Google?&lt;/p&gt;

&lt;p&gt;Well, it's the ability to translate a vague business goal like "improve retention" into a specific, solvable machine learning problem.&lt;/p&gt;

&lt;p&gt;This is called problem framing.&lt;/p&gt;

&lt;p&gt;Here is what I'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The three main "lenses" for framing almost any ML problem: Classification, Regression, and Ranking.&lt;/li&gt;
&lt;li&gt;  Real-world examples and use-case templates for each.&lt;/li&gt;
&lt;li&gt;  A practical, 5-step checklist you can use on your next project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why "Framing" is 90% of the Machine Learning Engineering Battle
&lt;/h2&gt;

&lt;p&gt;Let's be clear: getting the framing wrong is a recipe for disaster.&lt;/p&gt;

&lt;p&gt;You can spend weeks building a model with 99% accuracy, only to find out that it doesn't actually help anyone make a better decision.&lt;/p&gt;

&lt;p&gt;It's a model that's technically correct but practically useless. It answers a question nobody was asking.&lt;/p&gt;

&lt;p&gt;Proper framing changes a high-level objective like "increase revenue" into a precise, machine-solvable question like, "What is the predicted purchase value of this specific user in the next 30 days?"&lt;/p&gt;

&lt;p&gt;When you get the frame right, you know that the model you build will directly support a business decision.&lt;/p&gt;

&lt;p&gt;It provides a clear target for you to aim at and a clear definition of success.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Core Lenses: Classification, Regression, and Ranking
&lt;/h2&gt;

&lt;p&gt;Think of these as the primary tools in your problem-framing toolkit. Almost every business problem can be viewed through one of these three lenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification: Which Category Does This Belong To?
&lt;/h3&gt;

&lt;p&gt;The Core Question: Classification is about predicting a discrete category or a label.&lt;/p&gt;

&lt;p&gt;The fundamental question you're asking is: "Is this A or B (or C, or D...)?"&lt;/p&gt;

&lt;p&gt;Think of the Sorting Hat from Harry Potter. It takes an input (a new student) and assigns it to a specific, predefined bucket (Gryffindor, Hufflepuff, etc.).&lt;/p&gt;

&lt;p&gt;Here are a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Business Problem:&lt;/strong&gt; "We're losing too many customers and don't know who to focus our retention efforts on."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ML Framing (Binary Classification):&lt;/strong&gt; For each active customer, will they churn in the next 30 days? &lt;strong&gt;(Yes/No)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;What the Model Outputs:&lt;/strong&gt; A probability (e.g., 85% chance of churn) and a final class label ("Yes").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are 3 more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Spam Detection:&lt;/strong&gt; Is this email spam or not spam?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Medical Diagnosis:&lt;/strong&gt; Does this medical image show a malignant tumor or a benign one?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lead Scoring:&lt;/strong&gt; Based on user actions, is this new sales lead "hot" or "cold"?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For problems like these, you might use algorithms like Logistic Regression, Decision Trees, or Support Vector Machines.&lt;/p&gt;

&lt;p&gt;You'd measure success with metrics like Accuracy, Precision, Recall, and F1-Score.&lt;/p&gt;
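&lt;p&gt;Here's a tiny, self-contained sketch of that framing: a synthetic churn-style dataset, a Logistic Regression, and both outputs described above, a probability and a hard Yes/No label. Nothing here is from a real dataset:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in for "will this customer churn in the next 30 days?"
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The two classification outputs: a churn probability and a class label
churn_probability = model.predict_proba(X_test)[:, 1]
churn_label = model.predict(X_test)

print(classification_report(y_test, churn_label))
```

&lt;p&gt;The probability is often the more useful output in practice, because the business can choose its own threshold for "at risk."&lt;/p&gt;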




&lt;h3&gt;
  
  
  Regression: How Much or How Many?
&lt;/h3&gt;

&lt;p&gt;The Core Question: Regression is all about predicting a continuous numerical value. The question you're asking is: "What is the specific number?"&lt;/p&gt;

&lt;p&gt;If classification is a sorting hat, regression is a crystal ball that gives you a precise number, not a category.&lt;/p&gt;

&lt;p&gt;Now towards our examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Business Problem:&lt;/strong&gt; "We need to set our budget accurately and manage inventory for the upcoming quarter."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ML Framing (Regression):&lt;/strong&gt; How much revenue, in dollars, will we generate next month?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;What the Model Outputs:&lt;/strong&gt; A continuous value (e.g., $1,254,300, or 4,521 units sold).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few more yet again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Real Estate:&lt;/strong&gt; What is the predicted market price of this house?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Demand Forecasting:&lt;/strong&gt; How many units of this product will we sell next week?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Customer Value:&lt;/strong&gt; What is the estimated lifetime value (LTV) of this new customer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, you'd work with algorithms like Linear Regression or tree-based models like XGBoost and Random Forests.&lt;/p&gt;

&lt;p&gt;You'll measure success with metrics like Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE).&lt;/p&gt;
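&lt;p&gt;A quick sketch of those two metrics on made-up revenue numbers, just to show what they measure (MAE is the average miss; RMSE punishes large misses harder):&lt;/p&gt;

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative monthly revenue actuals vs. forecasts, in dollars
actual = [1_200_000, 1_150_000, 1_300_000]
predicted = [1_250_000, 1_100_000, 1_280_000]

mae = mean_absolute_error(actual, predicted)         # average absolute miss
rmse = mean_squared_error(actual, predicted) ** 0.5  # square root of the MSE

print(f"MAE: ${mae:,.0f}  RMSE: ${rmse:,.0f}")
```

&lt;p&gt;RMSE comes out higher than MAE here because one forecast missed by $50,000; squaring makes that big miss dominate.&lt;/p&gt;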




&lt;h3&gt;
  
  
  Ranking: What's Most Relevant or Important?
&lt;/h3&gt;

&lt;p&gt;The Core Question: Ranking is about predicting an order or a sequence. The question is: "What should I show the user first, second, third...?"&lt;/p&gt;

&lt;p&gt;This one is more or less like a hyper-personalized "Top 10" list generator for every single user.&lt;/p&gt;

&lt;p&gt;On to our examples now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Business Problem:&lt;/strong&gt; "Users are searching on our e-commerce site but aren't finding relevant products and are leaving."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ML Framing (Ranking):&lt;/strong&gt; Given a user's search query, what is the optimal order to display the available products to maximize the chance of a click or purchase?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;What the Model Outputs:&lt;/strong&gt; An ordered list of items (e.g., &lt;code&gt;[Product_ID_5, Product_ID_12, Product_ID_2, ...]&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few more, here we go:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Search Engines:&lt;/strong&gt; The order of results on a Google search page.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Social Media:&lt;/strong&gt; The personalized posts in your Instagram, X, or TikTok feed.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Recommendations:&lt;/strong&gt; The "Top Picks for You" section on Netflix or Spotify.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a more specialized field often called "learning-to-rank," and depending on the use case, it can use metrics like nDCG, Precision@k, or Hit Rate - each helping measure how well the ordering aligns with what the user actually wants.&lt;/p&gt;
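&lt;p&gt;As a small sketch, scikit-learn ships an &lt;code&gt;ndcg_score&lt;/code&gt; helper; the relevance grades and ranker scores below are entirely hypothetical:&lt;/p&gt;

```python
from sklearn.metrics import ndcg_score

# Hypothetical graded relevance of 4 products for one query (3 = best match)
true_relevance = [[3, 2, 0, 1]]
# The scores our (imaginary) ranker assigned to those same 4 products
ranker_scores = [[2.5, 0.1, 0.2, 2.0]]

# nDCG = 1.0 would mean the predicted ordering matches the ideal ordering
score = ndcg_score(true_relevance, ranker_scores)
print(round(score, 3))
```

&lt;p&gt;The discounting in nDCG means mistakes near the top of the list hurt far more than mistakes near the bottom, which matches how users actually scan results.&lt;/p&gt;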




&lt;h3&gt;
  
  
  Other Important Machine Learning Problem Frames
&lt;/h3&gt;

&lt;p&gt;Not every problem fits perfectly into the big three.&lt;/p&gt;

&lt;p&gt;Here are a few more frames to keep an eye out for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clustering: Finding Groups&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Question:&lt;/strong&gt; "What are the natural segments or groups within our data?" This is an &lt;em&gt;unsupervised&lt;/em&gt; problem, meaning you don't have a pre-defined "right answer" to predict.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Example:&lt;/strong&gt; "Can we group our customers into different personas like 'deal-hunters,' 'loyals,' and 'window-shoppers' based on their behavior?"&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Forecasting: Predicting a Future Sequence&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Question:&lt;/strong&gt; "What will this value be over the next N time periods?"&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Example:&lt;/strong&gt; "What will our daily website traffic be for the next 30 days?" This is usually a series of regression predictions, but forecasting is more often treated as a specialized discipline with its own set of tools (like ARIMA, which I wrote about in the article on &lt;a href="https://codebynight.dev/posts/why-time-series-deserves-comeback-forecasting-with-arima-sarima" rel="noopener noreferrer"&gt;Time Series Analysis&lt;/a&gt;).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Anomaly Detection: Finding the Weird Stuff&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Question:&lt;/strong&gt; "Does this data point look unusual compared to everything else?"&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Example:&lt;/strong&gt; "Is this credit card transaction fraudulent?" or "Is this server reading indicating an impending system failure?"&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
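&lt;p&gt;For the anomaly detection frame in particular, here's a minimal sketch using scikit-learn's &lt;code&gt;IsolationForest&lt;/code&gt; on made-up transaction amounts:&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 "normal" transactions around $50, plus two wildly unusual ones
amounts = np.concatenate([rng.normal(50, 5, 200), [500.0, 750.0]]).reshape(-1, 1)

# Points that are unusually easy to isolate get labeled -1 (anomaly)
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(amounts)
flagged = amounts[labels == -1].ravel()
print("Flagged as anomalies:", flagged)
```

&lt;p&gt;Note that this is unsupervised: no one ever labeled a transaction "fraud," the model just learned what "normal" looks like.&lt;/p&gt;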




&lt;h2&gt;
  
  
  Your Practical 5-Step Machine Learning Problem Framing Checklist
&lt;/h2&gt;

&lt;p&gt;Next time you get a vague request, walk through these five steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with the Business Decision.&lt;/strong&gt;&lt;br&gt;
Ask: "What specific action will someone take based on this model's prediction?" If there's no clear action, the model might not really be needed at all.&lt;br&gt;
&lt;em&gt;Example: We will send a 10% discount to customers predicted to churn.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define the Ideal Output.&lt;/strong&gt;&lt;br&gt;
Ask: "What, specifically, does the decision-maker need to see?" A Yes/No answer? A dollar amount? An ordered list of the top 5 items?&lt;br&gt;
&lt;em&gt;Example: We need a "Yes" or "No" for each customer.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Choose Your ML Lens.&lt;/strong&gt;&lt;br&gt;
Based on the output from Step 2, map it directly to a frame.&lt;br&gt;
&lt;em&gt;Example: A "Yes/No" output means this is a Classification problem.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Identify the Unit of Analysis.&lt;/strong&gt;&lt;br&gt;
Ask: "What, exactly, are we making a prediction &lt;em&gt;for&lt;/em&gt;?" Is it for a single customer? A product? A transaction? A single day? This defines the rows in your dataset.&lt;br&gt;
&lt;em&gt;Example: We are making one prediction per customer.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Determine the Success Metrics (Business &amp;amp; Technical).&lt;/strong&gt;&lt;br&gt;
Ask: "How will we know if the model is good?" Define both the technical goal (e.g., achieve an F1-score &amp;gt; 0.8) and the business goal (e.g., successfully reduce overall churn by 5%).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Framing in Machine Learning is Where the Real Value is Created
&lt;/h2&gt;

&lt;p&gt;The best algorithm in the world can't save a poorly framed problem.&lt;/p&gt;

&lt;p&gt;Our value as data scientists doesn't just come from our ability to build complex models; it comes from our ability to ask the right questions and structure a problem in a way that leads to a truly useful solution. (LLMs can build these models in seconds these days, but they do fail to ask these business-specific questions!)&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Why ARIMA and SARIMA Still Matter: A Technical Guide to Time Series Forecasting</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Mon, 16 Jun 2025 12:35:48 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/why-arima-and-sarima-still-matter-a-technical-guide-to-time-series-forecasting-2lk</link>
      <guid>https://dev.to/shivamchhuneja/why-arima-and-sarima-still-matter-a-technical-guide-to-time-series-forecasting-2lk</guid>
      <description>&lt;h2&gt;
  
  
  Deep Learning Gets the Spotlight, But Time Series Still Solves Real Problems
&lt;/h2&gt;

&lt;p&gt;In the machine learning landscape today, deep learning models - transformers, LSTMs, and other neural networks - steal the show.&lt;/p&gt;

&lt;p&gt;They're impressive, powerful, and celebrated - and they make you feel smart when you use them.&lt;/p&gt;

&lt;p&gt;However, when it comes to forecasting business metrics like sales, demand, or inventory, deep learning isn't always the answer.&lt;/p&gt;

&lt;p&gt;Traditional time series models, especially &lt;strong&gt;ARIMA&lt;/strong&gt; (AutoRegressive Integrated Moving Average) and its seasonal extension &lt;strong&gt;SARIMA&lt;/strong&gt;, are some of the most effective and interpretable methods for forecasting structured temporal data.&lt;/p&gt;

&lt;p&gt;These models are not flashy, but they are robust, straightforward to explain, and often outperform complex black-box models on &lt;em&gt;smaller datasets&lt;/em&gt; with clear seasonality and trends.&lt;/p&gt;

&lt;p&gt;IMHO, ARIMA and SARIMA deserve a comeback in your ML toolkit.&lt;/p&gt;

&lt;p&gt;Alright, let's touch on these, and then I'll walk you through a practical example forecasting monthly Rosé wine sales.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Case for ARIMA and SARIMA
&lt;/h2&gt;

&lt;p&gt;Why should you care about ARIMA and SARIMA in 2025?&lt;/p&gt;

&lt;p&gt;Here are a few reasons, in my opinion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Interpretability:&lt;/strong&gt; Unlike many deep learning models, ARIMA and SARIMA provide explicit parameters that correspond to real-world phenomena - trends, seasonality, and noise - making it easier to explain forecasts to stakeholders. And this is a really important point. Stakeholders more often than not prefer things they understand.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Effectiveness on Limited Data:&lt;/strong&gt; Many business contexts offer only a few years of monthly or weekly data. Deep learning models often require huge datasets and ARIMA/SARIMA can perform well on smaller samples.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Explicit Seasonality Handling:&lt;/strong&gt; SARIMA extends ARIMA by explicitly modeling seasonal patterns, which are common in sales, demand, and other business metrics.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Built-in Confidence Intervals:&lt;/strong&gt; These models naturally provide prediction intervals, giving a sense of uncertainty - which is important for risk-aware business decision-making.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Computational Efficiency:&lt;/strong&gt; They are lightweight and fast to train hence highly practical for quick iterations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, ARIMA/SARIMA are not the be-all and end-all.&lt;/p&gt;

&lt;p&gt;They assume linear relationships and stationarity (more on that later), so if your data is extremely noisy, non-linear, or high-dimensional, other approaches might be better.&lt;/p&gt;

&lt;p&gt;But for many real-world forecasting problems, they definitely are a solid choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Concepts Behind ARIMA and SARIMA
&lt;/h2&gt;

&lt;p&gt;Before we dive into code, let's try and understand the building blocks for these models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stationarity
&lt;/h3&gt;

&lt;p&gt;A stationary time series has statistical properties (mean, variance) that do not change over time.&lt;/p&gt;

&lt;p&gt;Stationarity is crucial because ARIMA models assume the underlying process is stable.&lt;/p&gt;

&lt;p&gt;If your data trends upward or has changing variance, you often need to &lt;strong&gt;difference&lt;/strong&gt; it (subtract previous values) to get stationarity.&lt;/p&gt;
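&lt;p&gt;Here is a tiny pandas-only sketch of that idea (the sales numbers are made up): one round of differencing removes most of the trend, and a second round leaves a flat, stationary series.&lt;/p&gt;

```python
import pandas as pd

# Made-up sales series with an accelerating upward trend (non-stationary mean)
sales = pd.Series([20, 23, 27, 32, 38, 45, 53, 62], dtype=float)

# First difference (d=1): subtract each previous value - still trending
diff1 = sales.diff().dropna()
print(diff1.tolist())  # [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

# Second difference (d=2): the mean is now constant, i.e. stationary
diff2 = diff1.diff().dropna()
print(diff2.tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```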

&lt;h3&gt;
  
  
  Autoregression (AR)
&lt;/h3&gt;

&lt;p&gt;The autoregressive part models the current value of the series as a linear combination of its previous values. The parameter &lt;code&gt;p&lt;/code&gt; in ARIMA controls how many past observations are used.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moving Average (MA)
&lt;/h3&gt;

&lt;p&gt;The moving average component models the current value as a linear combination of past forecast errors (noise). The parameter &lt;code&gt;q&lt;/code&gt; controls how many lagged errors are included.&lt;/p&gt;
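&lt;p&gt;To make the AR and MA pieces concrete, here is a toy ARMA(1, 1) simulation (the coefficients and shocks are made up for illustration): each new value combines one past value and one past error.&lt;/p&gt;

```python
# Toy ARMA(1, 1): next value = AR term (past value) + current shock + MA term (past shock)
phi, theta = 0.8, 0.4           # made-up AR and MA coefficients
shocks = [1.0, -0.5, 0.3, 0.0]  # the "forecast errors" / noise terms
series = [10.0]

for t in range(1, len(shocks)):
    new_value = phi * series[-1] + shocks[t] + theta * shocks[t - 1]
    series.append(new_value)

print([round(v, 3) for v in series])  # [10.0, 7.9, 6.42, 5.256]
```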

&lt;h3&gt;
  
  
  Integration (I)
&lt;/h3&gt;

&lt;p&gt;Integration is simply differencing the series to make it stationary. The parameter &lt;code&gt;d&lt;/code&gt; denotes how many times differencing is applied.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seasonality in SARIMA
&lt;/h3&gt;

&lt;p&gt;SARIMA adds seasonal components &lt;code&gt;(P, D, Q, m)&lt;/code&gt; to capture repeating patterns at regular intervals (e.g., monthly seasonality with &lt;code&gt;m=12&lt;/code&gt;). This allows the model to handle complex seasonal dynamics beyond simple trend and noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-on Example: Forecasting Rosé Wine Sales with SARIMA
&lt;/h2&gt;

&lt;p&gt;Now, let's have a quick look at an example. &lt;em&gt;(snippets are from my project during the first year of my master's in data science)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The goal was to forecast the next 12 months with confidence intervals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load and Visualize the Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Load the data
rose_data = pd.read_csv('rose.csv', parse_dates=['YearMonth'], index_col='YearMonth')

# Plot the time series
plt.figure(figsize=(14, 6))
plt.plot(rose_data.index, rose_data['Rose'], label='Rosé Wine Sales')
plt.title('Rosé Wine Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales Volume')
plt.legend()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsriufb80kwrede7btw2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsriufb80kwrede7btw2a.png" alt="Rose Wine Time Series Sales" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see clear seasonality with peaks recurring about every 12 months, and an overall downward trend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Selecting SARIMA Parameters with Auto-ARIMA
&lt;/h3&gt;

&lt;p&gt;Choosing the right &lt;code&gt;(p, d, q)(P, D, Q, m)&lt;/code&gt; parameters can be tricky when you start.&lt;/p&gt;

&lt;p&gt;But the &lt;code&gt;pmdarima&lt;/code&gt; library's &lt;code&gt;auto_arima&lt;/code&gt; automates this by searching for the best combination based on information criteria like AIC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Auto ARIMA and Auto SARIMA for Rose Wine")

# Auto ARIMA for Rose
auto_arima_rose = auto_arima(
    train_rose['Rose'],
    seasonal=False,
    trace=True,
    error_action='ignore',
    suppress_warnings=True,
    stepwise=True
)
print("Best Auto ARIMA Model for Rose Wine:")
print(auto_arima_rose.summary())

# Auto SARIMA for Rose
auto_sarima_rose = auto_arima(
    train_rose['Rose'],
    seasonal=True,
    m=12,
    trace=True,
    error_action='ignore',
    suppress_warnings=True,
    stepwise=True
)
print("Best Auto SARIMA Model for Rose Wine:")
print(auto_sarima_rose.summary())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we get are estimated model parameters and diagnostics, helping us understand the model structure and fit quality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                      SARIMAX Results
============================================================================================
Dep. Variable:                                    y   No. Observations:                  149
Model:             SARIMAX(2, 1, 2)x(1, 0, [1], 12)   Log Likelihood                -663.811
Date:                              Sat, 16 Nov 2024   AIC                           1341.622
Time:                                      14:25:44   BIC                           1362.602
Sample:                                  01-01-1980   HQIC                          1350.146
                                       - 05-01-1992
Covariance Type:                                opg
==============================================================================
                 coef    std err          z      P&amp;gt;|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4717      0.235     -2.006      0.045      -0.933      -0.011
ar.L2         -0.0623      0.102     -0.610      0.542      -0.263       0.138
ma.L1         -0.2522      0.236     -1.071      0.284      -0.714       0.209
ma.L2         -0.6059      0.238     -2.551      0.011      -1.071      -0.140
ar.S.L12       0.9854      0.017     58.956      0.000       0.953       1.018
ma.S.L12      -0.7863      0.118     -6.652      0.000      -1.018      -0.555
sigma2       402.8422     45.772      8.801      0.000     313.131     492.553
===================================================================================
Ljung-Box (L1) (Q):                   0.05   Jarque-Bera (JB):                77.80
Prob(Q):                              0.83   Prob(JB):                         0.00
Heteroskedasticity (H):               0.20   Skew:                             0.80
Prob(H) (two-sided):                  0.00   Kurtosis:                         6.17
===================================================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Forecasting with Confidence Intervals
&lt;/h3&gt;

&lt;p&gt;A lot of DS and ML projects are going to require you to try a ton of things and pick the winners, so tons of trial and error.&lt;/p&gt;

&lt;p&gt;This is what I had for RMSE across different params:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Model            | Order     | Seasonal Order   | RMSE       |
|------------------|-----------|------------------|------------|
| AR               | (2, 0, 0) | None             | 50.691553  |
| ARMA             | (2, 0, 2) | None             | 26.107300  |
| ARIMA            | (1, 1, 2) | None             | 20.889614  |
| SARIMA           | (1, 1, 2) | (0, 1, 1, 12)     | 11.258172  |
| Best Auto ARIMA  | (2, 1, 3) | None             | 19.700298  |
| Best Auto SARIMA | (2, 1, 2) | (1, 0, 1, 12)     | 10.474901  |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
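&lt;p&gt;For context, RMSE (root mean squared error) is just the square root of the average squared gap between held-out actuals and predictions - a minimal sketch with made-up numbers:&lt;/p&gt;

```python
import math

# RMSE: square root of the mean squared difference between actuals and predictions
def rmse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up hold-out actuals vs. model forecasts
print(round(rmse([50, 60, 70], [48, 63, 69]), 3))  # 2.16
```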



&lt;p&gt;After fitting, forecasting is straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;best_auto_sarima_model_full = SARIMAX(rose_data['Rose'], order=(2, 1, 2), seasonal_order=(1, 0, 1, 12)).fit(disp=False)

# 12 month forecast
forecast = best_auto_sarima_model_full.get_forecast(steps=12)
forecast_index = pd.date_range(rose_data.index[-1] + pd.DateOffset(months=1), periods=12, freq='M')
forecast_mean = forecast.predicted_mean

# 95% and 99% confidence intervals
forecast_conf_int_95 = forecast.conf_int(alpha=0.05)
forecast_conf_int_99 = forecast.conf_int(alpha=0.01)

forecast_mean.index = forecast_index
forecast_conf_int_95.index = forecast_index
forecast_conf_int_99.index = forecast_index

# Forecast table
forecast_table = pd.DataFrame({
    'Forecast': forecast_mean,
    'Lower 95%': forecast_conf_int_95.iloc[:, 0],
    'Upper 95%': forecast_conf_int_95.iloc[:, 1],
    'Lower 99%': forecast_conf_int_99.iloc[:, 0],
    'Upper 99%': forecast_conf_int_99.iloc[:, 1]
})

forecast_table.reset_index(inplace=True)
forecast_table.rename(columns={'index': 'Date'}, inplace=True)
print("12-Month Forecast with 95% and 99% Confidence Intervals for Rose Wine:")
print(forecast_table)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the actual forecast with the confidence intervals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Date       | Forecast   | Lower 95% | Upper 95% | Lower 99% | Upper 99% |
|------------|------------|-----------|-----------|-----------|-----------|
| 1995-08-31 | 43.32      | 7.38      | 79.27      | -3.92     | 90.57     |
| 1995-09-30 | 42.48      | 5.11      | 79.84      | -6.63     | 91.58     |
| 1995-10-31 | 45.28      | 7.88      | 82.67      | -3.87     | 94.43     |
| 1995-11-30 | 54.49      | 16.67     | 92.31      | 4.79      | 104.19    |
| 1995-12-31 | 81.95      | 44.02     | 119.88     | 32.10     | 131.80    |
| 1996-01-31 | 21.22      | -16.90    | 59.34      | -28.88    | 71.31     |
| 1996-02-29 | 29.91      | -8.37     | 68.19      | -20.40    | 80.22     |
| 1996-03-31 | 36.13      | -2.31     | 74.58      | -14.39    | 86.66     |
| 1996-04-30 | 36.91      | -1.70     | 75.53      | -13.83    | 87.66     |
| 1996-05-31 | 30.52      | -8.26     | 69.30      | -20.44    | 81.49     |
| 1996-06-30 | 36.76      | -2.18     | 75.70      | -14.42    | 87.94     |
| 1996-07-31 | 46.87      | 7.77      | 85.98      | -4.52     | 98.27     |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the following code gives us our forecast chart with the confidence intervals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(12, 6))
plt.plot(rose_data.index, rose_data['Rose'], label='Historical Sales', color='blue')
plt.plot(forecast_index, forecast_mean, label='12-Month Forecast (Auto SARIMA)', color='purple')

# 95% Confidence Interval
plt.fill_between(forecast_index, forecast_conf_int_95.iloc[:, 0], forecast_conf_int_95.iloc[:, 1],
                 color='purple', alpha=0.2, label='95% Confidence Interval')

# 99% Confidence Interval
plt.fill_between(forecast_index, forecast_conf_int_99.iloc[:, 0], forecast_conf_int_99.iloc[:, 1],
                 color='purple', alpha=0.1, label='99% Confidence Interval')

plt.title('12-Month Future Forecast for Rose Wine with 95% and 99% Confidence Intervals')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq3ge4igkrg4moobw8p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq3ge4igkrg4moobw8p7.png" alt="Rose Wine Forecast SARIMA 95% and 99% Confidence Interval" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The confidence intervals are extremely important: they show stakeholders how much inherent risk exists in these predictions. And of course this tends to be missing in a ton of black-box models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Issues and Tips When Using ARIMA/SARIMA
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Misinterpreting Confidence Intervals:&lt;/strong&gt; The intervals give us model uncertainty assuming the model is correct; &lt;em&gt;they do not guarantee future observations will fall within them.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ignoring Differencing:&lt;/strong&gt; Not differencing non-stationary data will give you poor model fit and unreliable forecasts.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Mistaking Noise for Trend:&lt;/strong&gt; Overfitting to noise can happen if parameters are too large; use information criteria and diagnostics to avoid this.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Seasonality Mismatch:&lt;/strong&gt; Seasonal period &lt;code&gt;m&lt;/code&gt; must match your data frequency (e.g., 12 for monthly, 4 for quarterly).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Quality:&lt;/strong&gt; Missing values or outliers will distort model estimation; preprocess like a pro please.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Overreliance on Automation:&lt;/strong&gt; Auto-ARIMA is helpful but understand the underlying assumptions and validate results.&lt;/li&gt;
&lt;/ul&gt;
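&lt;p&gt;On the seasonality-mismatch point: if you are unsure of your data's frequency, pandas can usually infer it from a datetime index before you commit to a value of &lt;code&gt;m&lt;/code&gt;. A small sketch:&lt;/p&gt;

```python
import pandas as pd

# Infer the sampling frequency from the index before picking the seasonal period m
idx = pd.date_range("1980-01-01", periods=24, freq="MS")
print(pd.infer_freq(idx))  # 'MS' -> month-start data, so m=12 is the natural choice
```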




&lt;h2&gt;
  
  
  Time Series Deserves a Comeback in the ML Toolkit
&lt;/h2&gt;

&lt;p&gt;I'll be honest - I love deep learning. It's exciting, and super fun to play with. But when it comes to time series forecasting in real-world business scenarios?&lt;/p&gt;

&lt;p&gt;It's not always the best fit.&lt;/p&gt;

&lt;p&gt;Most business data is small, seasonal, and needs to be explained to someone who doesn't care about neural networks. That's where ARIMA and SARIMA really help out.&lt;/p&gt;

&lt;p&gt;They're simple, fast, and surprisingly effective.&lt;/p&gt;

&lt;p&gt;And they make it easy to say why something was forecasted the way it was.&lt;/p&gt;




&lt;h3&gt;
  
  
  Want to Dive Deeper?
&lt;/h3&gt;

&lt;p&gt;If you want to build stronger intuition around time series forecasting, try applying ARIMA/SARIMA to your own datasets and experiment with parameter tuning - and of course, let me know how it goes!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>computerscience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>22 Lessons from 1 year in Data Science and Machine Learning</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Thu, 12 Jun 2025 09:29:07 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/22-lessons-from-1-year-in-data-science-and-machine-learning-lep</link>
      <guid>https://dev.to/shivamchhuneja/22-lessons-from-1-year-in-data-science-and-machine-learning-lep</guid>
      <description>&lt;p&gt;It's been a year in data science and machine learning.&lt;/p&gt;

&lt;p&gt;Okay, I lied.&lt;/p&gt;

&lt;p&gt;Technically a full year and a few months since I officially splooted &lt;em&gt;(wanted to show off my extensive vocabulary)&lt;/em&gt; into the world of data science and machine learning with my master's program.&lt;/p&gt;

&lt;p&gt;In late 2023 I started learning data science through a Udemy course and in January of 2024 I gave up.&lt;/p&gt;

&lt;p&gt;Well, not exactly per se.&lt;/p&gt;

&lt;p&gt;I realized it would be difficult to manage all that alongside a life, full-time work, and family time. So the solution was to do something that would hold me accountable and force me to study: I enrolled in another master's degree.&lt;/p&gt;

&lt;p&gt;Data Science with specialization in Machine Learning.&lt;/p&gt;

&lt;p&gt;And looking back, what started as "I need to have some structure, so I'll enroll into a DS Master's program" has turned into a deeply satisfying but sweaty, sometimes highly frustrating journey. It hasn't just been about attending lectures or submitting assignments.&lt;/p&gt;

&lt;p&gt;The real learning, the kind that actually matters, happened when I was diving into actual data, working on pipelines that seemed determined to break, and training countless models that initially flopped.&lt;/p&gt;

&lt;p&gt;This isn't intended as a technical tutorial or a showcase. Instead, think of it as a reflection - a download of the most significant lessons that have shaped my understanding of data science and machine learning principles over the past year.&lt;/p&gt;

&lt;p&gt;Look, my reasons for writing this article are not completely altruistic.&lt;/p&gt;

&lt;p&gt;I want to cement these lessons in my mind as well and writing about these is exactly how I plan to do that.&lt;/p&gt;

&lt;p&gt;These are insights not just about data and algorithms, but about working with ambiguity, balancing the want for speed with the need for depth, and trying to understand what truly makes someone effective as an ML engineer.&lt;/p&gt;

&lt;p&gt;If you're starting out, perhaps in a similar program, a bootcamp, or even teaching yourself, I hope sharing these reflections resonates and maybe helps you navigate your own path with a bit more clarity (or at least, a little less of that early-days panic).&lt;/p&gt;

&lt;p&gt;The lessons below aren't neatly categorized or presented in some grand order of importance.&lt;/p&gt;

&lt;p&gt;They're simply the insights that have consistently shown up, whether I was building churn prediction models, exploring clustering techniques, or trying to get a deployment pipeline to behave.&lt;/p&gt;

&lt;p&gt;Some are technical, some are more about mindset, but all of them are born from real experience.&lt;/p&gt;

&lt;p&gt;One last thing before we go ahead with the tips: I have been doing additional work apart from what is in my curriculum, and I think that is very important.&lt;/p&gt;

&lt;p&gt;From juggling learning Golang and Rust, to doing DL before college actually covers it, to trying to read and implement machine learning research papers - these are just a few of the things I have been working on.&lt;/p&gt;

&lt;p&gt;Focused effort might have gotten me where I am much faster but I think playing around has kept things much more interesting and fresh for someone like me. &lt;em&gt;(classic ADHD behaviour)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Alright, this intro was longer than most blog articles on the internet. Oops.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Data Cleaning: It Really Is 80% of the Work (And Your Most Important Teacher)
&lt;/h2&gt;

&lt;p&gt;If there's one truth that I realized early on, and am very happy about that fact then it's this: &lt;strong&gt;data cleaning isn't just a preliminary step; it's a massive chunk of the actual work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You might have heard the "80/20" rule thrown around, but I don't think you'll truly understand the weight of it until you face your first few real-world datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl67902hfa8zlm8ong3zu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl67902hfa8zlm8ong3zu.jpeg" alt="Data Cleaning Data Science Meme" width="246" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Forget the crystal clear, ready-to-model CSVs you often get in introductory courses. Real data rarely arrives gift-wrapped.&lt;/p&gt;

&lt;p&gt;I remember working on a project with customer transaction data. Excited to get to modeling, I opened the files and well, it was a significant challenge.&lt;/p&gt;

&lt;p&gt;Missing values weren't just occasional; they were substantial gaps. Formatting was all over the place - dates in multiple styles, numbers being picked up as strings with embedded characters, categorical labels with stupidly minor variations ("Region A," "region_a," "A Region") that all meant the same thing but would confuse the algorithms.&lt;/p&gt;

&lt;p&gt;And outliers? Well Well Well.&lt;/p&gt;

&lt;p&gt;Initially, it's easy to think you're doing something wrong, or that your technical skills are lacking because the data isn't behaving. At least I felt that way.&lt;/p&gt;

&lt;p&gt;It felt like I was constantly hitting blocks before I could even &lt;em&gt;begin&lt;/em&gt; what I thought was the "real" data science. But I've come to realize that dealing with this initial mess is fundamental.&lt;/p&gt;

&lt;p&gt;This isn't just "janitorial work," it's where you truly start to understand "data science".&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Tedious Data Cleaning Process is Invaluable
&lt;/h3&gt;

&lt;p&gt;The process of cleaning and preparing data forces you to engage deeply with it. When you play with missing values, you're not just deciding whether to impute or remove them; you also end up asking &lt;em&gt;why&lt;/em&gt; they're missing. &lt;em&gt;(at least I hope you do)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Is it a systematic issue with data collection? Is the absence of a value a signal in itself? These questions often lead to valuable insights. For example, in one dataset, a pattern of missing information in a particular field actually correlated with a specific customer segment we hadn't initially considered.&lt;/p&gt;

&lt;p&gt;Similarly, standardizing categories or correcting entries requires you to understand the context behind the data points.&lt;/p&gt;

&lt;p&gt;It's an iterative process of hypothesizing, checking, and refining.&lt;/p&gt;

&lt;p&gt;This back-and-forth dialogue with the data is where real understanding is built. It's less about mindlessly applying functions and more about being a good detective.&lt;/p&gt;

&lt;p&gt;Through the effort of working with broken datasets, I didn't just end up with a cleaner dataset. I gained a much better feel for its structure, limitations, and potential. You start to see patterns, anomalies, and relationships you would have completely missed otherwise.&lt;/p&gt;

&lt;p&gt;I know this might feel like I am romanticizing it too much, in fact I feel the same as I am writing this, but what's true is true.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Approaches and Key Takeaways for Data Handling
&lt;/h3&gt;

&lt;p&gt;So, how do you approach this effectively?&lt;/p&gt;

&lt;p&gt;Here's what I've learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Master Your Tools (Especially Pandas):&lt;/strong&gt; For anyone working in Python, a strong hands-on command of Pandas is non-negotiable. Functions for inspecting data (&lt;code&gt;.info()&lt;/code&gt;, &lt;code&gt;.describe()&lt;/code&gt;, &lt;code&gt;.value_counts()&lt;/code&gt;, &lt;code&gt;.isnull().sum()&lt;/code&gt;), for cleaning (&lt;code&gt;.fillna()&lt;/code&gt;, &lt;code&gt;.dropna()&lt;/code&gt;, &lt;code&gt;.astype()&lt;/code&gt;, &lt;code&gt;.replace()&lt;/code&gt;), and for transformation (&lt;code&gt;.apply()&lt;/code&gt;, &lt;code&gt;.map()&lt;/code&gt;, merging/joining dataframes) are going to be your bread and butter.&lt;/p&gt;

&lt;p&gt;A very underappreciated resource these days, especially by beginners, is the official documentation of any library. Don't skip the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/" rel="noopener noreferrer"&gt;Pandas documentation&lt;/a&gt; - it's an invaluable resource.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
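&lt;p&gt;A minimal sketch of that bread and butter in action (the frame, labels, and values are made up, echoing the "Region A" mess mentioned earlier):&lt;/p&gt;

```python
import pandas as pd

# Made-up messy data: inconsistent category labels and numbers stored as strings
df = pd.DataFrame({
    "region": ["Region A", "region_a", "A Region", "Region B"],
    "amount": ["1,200", "950", "1,050", "800"],
})

# Standardize label variants that mean the same thing
df["region"] = df["region"].replace({"region_a": "Region A", "A Region": "Region A"})

# Strip the embedded characters, then cast the strings to integers
df["amount"] = df["amount"].str.replace(",", "", regex=False).astype(int)

print(df["region"].value_counts().to_dict())  # {'Region A': 3, 'Region B': 1}
print(df["amount"].sum())  # 4000
```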

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi64vc2lutizvkqh3ugo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi64vc2lutizvkqh3ugo.jpg" alt="Pandas Data Science Meme" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Visualize for Diagnosis:&lt;/strong&gt; Often, a quick plot can show issues more effectively than staring at rows of numbers. A histogram can highlight outliers or skew in numerical data.&lt;/p&gt;

&lt;p&gt;A bar chart of categorical counts can quickly show inconsistent labeling or imbalanced classes. Libraries like Matplotlib and Seaborn are vital for this kind of exploratory and diagnostic visualization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Iterate and Document:&lt;/strong&gt; Data cleaning is rarely a linear process. You'll clean some, explore further, pull your hair out over new issues, and refine your cleaning steps.&lt;/p&gt;

&lt;p&gt;One huge mistake that I did was I did not document things enough from day 1, so document every transformation you make and &lt;em&gt;why&lt;/em&gt; you made it.&lt;/p&gt;

&lt;p&gt;Use comments in your code or take detailed notes in a Jupyter notebook; this is important for reproducibility and for your own understanding when you come back to the project later.&lt;/p&gt;

&lt;p&gt;I learned this the hard way after trying to backtrack my steps for a project I'd set aside for a few weeks.&lt;/p&gt;

&lt;p&gt;Assume the person who will maintain your code next is a pissed off version of yourself from three months in the future and document your code for that person.&lt;/p&gt;

&lt;p&gt;Make the future you's life easy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Domain Knowledge is Key:&lt;/strong&gt; If possible, talk to people who either know how the data was generated or understand what it really means. Understanding the business process or the context can clear up ambiguities and inform your cleaning decisions far better than guessing what is important and what is not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"Garbage In, Garbage Out" Principle:&lt;/strong&gt; This is absolutely central to data science. No matter how next level your model is, if it's trained on garbage data, its predictions will be, well you guessed it: garbage.&lt;/p&gt;

&lt;p&gt;The effort you put in cleaning and preprocessing is a direct investment in the quality of your results.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learning to "really" clean and prepare data is about developing a critical mindset and a deeper appreciation for the raw material.&lt;/p&gt;

&lt;p&gt;And I am not exaggerating when I say, "It's foundational to everything that follows."&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Feature Engineering Matters More Than Fancy Models (Often, Anyway)
&lt;/h2&gt;

&lt;p&gt;When I first started, like everyone else I was only interested in complex algorithms. Why? Because they make you feel smart. And what good is data science if you feel stupid?&lt;/p&gt;

&lt;p&gt;Right at the start I spent a good amount of time thinking about which model to choose, how to tune all those hyperparameters, and whether I needed ensembles or whatnot.&lt;/p&gt;

&lt;p&gt;But a funny thing happened as I was aiming for State-of-the-Art performance: I realized that simpler models often performed surprisingly well, sometimes even better, when I focused more on creating really solid features. &lt;em&gt;(in fact, my &lt;a href="https://codebynight.dev/posts/churn-analysis-capstone-ecommerce-customer-retention" rel="noopener noreferrer"&gt;capstone Churn Analysis project&lt;/a&gt; for year 1 is a great example of this)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I had evaluated 6 different machine learning models for churn prediction:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Test Accuracy&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1-Score&lt;/th&gt;
&lt;th&gt;AUC-ROC&lt;/th&gt;
&lt;th&gt;Overfitting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decision Tree&lt;/td&gt;
&lt;td&gt;0.9312&lt;/td&gt;
&lt;td&gt;0.7732&lt;/td&gt;
&lt;td&gt;0.8364&lt;/td&gt;
&lt;td&gt;0.8035&lt;/td&gt;
&lt;td&gt;0.8934&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;0.9742&lt;/td&gt;
&lt;td&gt;0.9397&lt;/td&gt;
&lt;td&gt;0.9050&lt;/td&gt;
&lt;td&gt;0.9220&lt;/td&gt;
&lt;td&gt;0.9936&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;0.9694&lt;/td&gt;
&lt;td&gt;0.9189&lt;/td&gt;
&lt;td&gt;0.8971&lt;/td&gt;
&lt;td&gt;0.9079&lt;/td&gt;
&lt;td&gt;0.9908&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AdaBoost&lt;/td&gt;
&lt;td&gt;0.8708&lt;/td&gt;
&lt;td&gt;0.5957&lt;/td&gt;
&lt;td&gt;0.7230&lt;/td&gt;
&lt;td&gt;0.6532&lt;/td&gt;
&lt;td&gt;0.9025&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naïve Bayes&lt;/td&gt;
&lt;td&gt;0.7127&lt;/td&gt;
&lt;td&gt;0.3386&lt;/td&gt;
&lt;td&gt;0.7414&lt;/td&gt;
&lt;td&gt;0.4648&lt;/td&gt;
&lt;td&gt;0.7770&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;0.7744&lt;/td&gt;
&lt;td&gt;0.4122&lt;/td&gt;
&lt;td&gt;0.7995&lt;/td&gt;
&lt;td&gt;0.5440&lt;/td&gt;
&lt;td&gt;0.8526&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tuned RF&lt;/td&gt;
&lt;td&gt;0.9765&lt;/td&gt;
&lt;td&gt;0.9822&lt;/td&gt;
&lt;td&gt;0.8760&lt;/td&gt;
&lt;td&gt;0.9261&lt;/td&gt;
&lt;td&gt;0.9960&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Random Forest &amp;amp; XGBoost were the top performers with high precision and AUC scores, and no signs of overfitting, so I ended up tuning the random forest further.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I also experimented with SMOTE for handling class imbalance, but performance dropped significantly across models, so oversampling isn't always the solution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is XGBoost the most complex model you could ever work with? No. There are 50 more things I could have done here with the models.&lt;/p&gt;

&lt;p&gt;The same goes for random forests: they aren't technically as simple as the most basic ML models, but they are relatively simple compared to what could have been done here.&lt;/p&gt;

&lt;p&gt;So, what's feature engineering?&lt;/p&gt;

&lt;p&gt;It's basically using your understanding of the data and the problem you're tackling, along with some creativity, to transform raw data into signals your model can effectively learn from.&lt;/p&gt;

&lt;p&gt;Remember how I told you about domain knowledge earlier?&lt;/p&gt;

&lt;p&gt;Knowing which ratios to create in, say, a digital marketing dataset only comes from spending a little time understanding the domain and the problem you are trying to solve.&lt;/p&gt;

&lt;p&gt;The goal is to make the underlying patterns in the data much clearer.&lt;/p&gt;

&lt;p&gt;Think of it like this: you could have a powerful telescope, but if the sky is full of clouds - meaning your raw data is noisy or doesn't offer much useful information - you won't see very much.&lt;/p&gt;

&lt;p&gt;Good feature engineering clears away those clouds, or even better, it highlights the specific constellations you need your model to focus on.&lt;/p&gt;

&lt;p&gt;I was working on a project trying to predict equipment failure. Initially, I threw a bunch of raw sensor readings into a complex model and got mediocre results. Then, I started thinking about what might &lt;em&gt;actually&lt;/em&gt; indicate an impending failure. Instead of just &lt;code&gt;temperature_reading_time_X&lt;/code&gt;, what if I created features like &lt;code&gt;rate_of_temperature_change_last_hour&lt;/code&gt;, or &lt;code&gt;time_since_last_maintenance&lt;/code&gt;, or &lt;code&gt;deviation_from_average_reading_for_this_sensor_type&lt;/code&gt;?&lt;/p&gt;
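&lt;p&gt;Here's a rough sketch of how features like those can be derived with pandas. The data, thresholds, and grouping column are all invented for illustration; only the feature ideas come from the project described above.&lt;/p&gt;

```python
import pandas as pd

# Toy sensor log - machine IDs, timestamps, and readings are made up
df = pd.DataFrame({
    "machine_id": ["A"] * 6,
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="h"),
    "temperature": [70.0, 70.5, 71.0, 74.0, 79.0, 86.0],
}).sort_values(["machine_id", "timestamp"])

# Rate of change over the last hour: difference between consecutive readings
df["rate_of_temperature_change_last_hour"] = (
    df.groupby("machine_id")["temperature"].diff()
)

# How far each reading sits from this machine's own average
df["deviation_from_average_reading"] = (
    df["temperature"]
    - df.groupby("machine_id")["temperature"].transform("mean")
)

print(df)
```

&lt;p&gt;The point isn't these exact features; it's that each derived column encodes a hypothesis about what an impending failure looks like.&lt;/p&gt;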

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodebynight.dev%2Fimages%2Fblog%2Ffeature_engineering_meme.png.crdownload" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodebynight.dev%2Fimages%2Fblog%2Ffeature_engineering_meme.png.crdownload" alt="Feature engineering meme" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even a simpler logistic regression model started showing much better results. These new features directly tied to hypotheses about the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Art and Science of Feature Creation
&lt;/h3&gt;

&lt;p&gt;Good feature engineering is a blend of art and science.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Domain Knowledge is Key:&lt;/strong&gt; Yes, I am repeating myself, but if there is one thing you should take away from this behemoth of an article, it should be this. Understanding the problem space is crucial. If you're predicting customer churn, understanding customer behavior, common pain points, or product usage lifecycles will give you ideas for features (e.g., &lt;code&gt;days_since_last_login&lt;/code&gt;, &lt;code&gt;number_of_support_tickets_opened&lt;/code&gt;, &lt;code&gt;change_in_usage_frequency&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformations are Your Friend:&lt;/strong&gt; This can be as simple as taking the log of a skewed variable, creating interaction terms (multiplying two features if you think their combined effect is important), or binning continuous variables into categories (e.g., age groups). For date/time data, features like &lt;code&gt;day_of_week&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;is_holiday&lt;/code&gt;, or &lt;code&gt;time_elapsed_since_X&lt;/code&gt; can also be powerful. &lt;em&gt;(&lt;code&gt;is_holiday&lt;/code&gt; is a sneaky one btw, especially for ecommerce problems)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It's Iterative:&lt;/strong&gt; You won't come up with all your best features on day one. It's a process of brainstorming, creating, testing, and refining. You'll build a model, see where it struggles (error analysis is also your friend here), and then go back to think if there are new features that could help it understand those tricky cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't Underestimate Simplicity:&lt;/strong&gt; Sometimes the most impactful features are surprisingly straightforward. It's not always about complex polynomial expansions or deep learning-based embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
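&lt;p&gt;To make those transformation ideas concrete, here's a minimal pandas sketch. The dataset and column names are invented, and the &lt;code&gt;is_holiday&lt;/code&gt; lookup is a hard-coded stand-in for a proper holiday calendar.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Invented mini-dataset purely to demonstrate the transformations above
df = pd.DataFrame({
    "revenue": [120.0, 300.0, 45.0, 9000.0],
    "ad_spend": [10.0, 25.0, 5.0, 400.0],
    "clicks": [30, 80, 12, 2100],
    "age": [19, 34, 52, 67],
    "order_date": pd.to_datetime(
        ["2024-12-25", "2025-01-03", "2025-01-04", "2025-01-06"]
    ),
})

# Log-transform a heavily skewed variable (log1p handles zeros safely)
df["log_revenue"] = np.log1p(df["revenue"])

# Interaction term: the combined effect of two features
df["spend_x_clicks"] = df["ad_spend"] * df["clicks"]

# Bin a continuous variable into categories
df["age_group"] = pd.cut(
    df["age"], bins=[0, 25, 50, 100], labels=["young", "middle", "senior"]
)

# Date/time features
df["day_of_week"] = df["order_date"].dt.day_name()
df["month"] = df["order_date"].dt.month
holidays = {pd.Timestamp("2024-12-25")}  # stand-in for a real holiday calendar
df["is_holiday"] = df["order_date"].isin(holidays)
```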

&lt;p&gt;I now see feature engineering as a form of storytelling. You're guiding your model, pointing out the most important clues in the data, and helping it build a more robust understanding of the relationships you're trying to predict.&lt;/p&gt;

&lt;p&gt;When you get this right, you often find that a well understood, interpretable model like logistic regression or a decision tree can outperform a black box that's struggling with suboptimal features.&lt;/p&gt;

&lt;p&gt;Spend time here.&lt;/p&gt;

&lt;p&gt;It almost always pays off.&lt;/p&gt;

&lt;p&gt;Check out books like &lt;em&gt;"Feature Engineering for Machine Learning"&lt;/em&gt; by Alice Zheng &amp;amp; Amanda Casari, or even just browse Kaggle to see how top competitors approach feature creation for similar problems.&lt;/p&gt;

&lt;p&gt;The book I shared is an old one, but the principles behind feature engineering don't really change, even if the actual tech, libraries, and frameworks change year on year.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Imbalanced Data: Uncomfortable &amp;amp; Frustrating, But It's The Reality We Must Face
&lt;/h2&gt;

&lt;p&gt;Most of the really interesting, real-world problems I worked on involved imbalanced datasets &lt;em&gt;(minus the first ML problem - that was magical even though I was simply training a linear regression model)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For example fraud detection (most transactions are &lt;em&gt;not&lt;/em&gt; fraudulent), rare disease prediction (most people are healthy), or even certain types of customer churn (if you have very low churn, thankfully!). In these scenarios, the "thing" you're trying to predict - the minority class - is heavily outnumbered by the "normal" or majority class.&lt;/p&gt;

&lt;p&gt;This was a huge learning curve. Why? Because everything you think you know about model evaluation, especially &lt;strong&gt;accuracy, gets thrown out the window.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about it: you have a dataset where only 1% of cases are fraudulent.&lt;/p&gt;

&lt;p&gt;A model that simply predicts "not fraudulent" for every single case would still be 99% accurate!&lt;/p&gt;

&lt;p&gt;Sounds great, right? But it's completely useless because it hasn't identified a single true case of fraud.&lt;/p&gt;
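&lt;p&gt;You can see this trap in a few lines of scikit-learn; the 1%-fraud numbers below are made up for illustration:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions, 1% fraudulent - and a "model" that always says "not fraud"
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 - looks great on paper
print(recall_score(y_true, y_pred))    # 0.0  - catches zero actual fraud
```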

&lt;p&gt;This is where I had my "aha!" &lt;em&gt;(or "oh damn!")&lt;/em&gt; moment.&lt;/p&gt;

&lt;p&gt;Working with imbalanced data forces you to think much more deeply about what "good performance" actually means in the context of your problem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Metrics Beyond Accuracy:&lt;/strong&gt; Learn about precision, recall, and the F1-score as soon as possible.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Precision:&lt;/strong&gt; Of all the times your model predicted as positive (e.g., fraudulent), how many actually &lt;em&gt;were&lt;/em&gt; positive?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Recall (Sensitivity):&lt;/strong&gt; Of all the actual positive instances, how many did your model correctly identify? (You want to catch as much of the "bad stuff" as possible).&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;F1-score&lt;/strong&gt; is the harmonic mean of precision and recall, basically a single metric to balance them.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ROC curves and AUC (Area Under the Curve)&lt;/strong&gt;, and especially &lt;strong&gt;Precision-Recall curves (PR-AUC)&lt;/strong&gt;, are important for understanding the trade-offs at different decision thresholds. PR-AUC is often more helpful than ROC-AUC for highly imbalanced datasets.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Understanding Business Costs:&lt;/strong&gt; Imbalanced problems force you to think about the &lt;em&gt;cost&lt;/em&gt; of different errors. Is a false positive (flagging a good transaction as fraud) more or less costly than a false negative (missing an actual fraudulent transaction)? The answer will of course influence how you'll tune your model and choose your thresholds.&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Specialized Techniques:&lt;/strong&gt; You start exploring techniques specifically designed for imbalanced data. This might include:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Resampling:&lt;/strong&gt; Oversampling the minority class (like with SMOTE - Synthetic Minority Over-sampling Technique, which creates new synthetic examples of the minority class) or undersampling the majority class. Each has its pros and cons. I found that just randomly oversampling often led to overfitting, so techniques like SMOTE, which are a bit smarter, were more helpful.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cost-Sensitive Learning:&lt;/strong&gt; Some algorithms allow you to assign different weights to misclassification errors, effectively telling the model that misclassifying a minority class instance is "worse."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ensemble Methods:&lt;/strong&gt; Sometimes, ensemble methods can be adapted to work well with imbalanced data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
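&lt;p&gt;A small scikit-learn sketch tying these ideas together: imbalance-aware metrics plus &lt;code&gt;class_weight="balanced"&lt;/code&gt; as a built-in cost-sensitive option. The synthetic dataset is just a stand-in for real fraud or churn data.&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic, roughly 95/5 imbalanced binary classification problem
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" is scikit-learn's built-in cost-sensitive option:
# errors on the rare class are weighted more heavily during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("f1:       ", f1_score(y_te, pred))
print("PR-AUC:   ", average_precision_score(y_te, proba))
```

&lt;p&gt;On a run like this, you'd compare recall and PR-AUC against an unweighted baseline before deciding the extra handling is worth it.&lt;/p&gt;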

&lt;p&gt;More than anything, working with imbalanced datasets taught me to be a more critical thinker.&lt;/p&gt;

&lt;p&gt;It pushed me away from chasing a single accuracy number and forced me to engage with the core of the problem, the limitations of the data, and the practical implications of my model's predictions.&lt;/p&gt;

&lt;p&gt;If you're facing this, I'd recommend looking up research papers on "imbalanced learning" or specific techniques like SMOTE to get a deeper understanding. On second thought, a proper research paper might be overkill for this; there are many great blog posts and articles explaining these concepts with practical code examples too, so check those out. (Official documentation is a great way to go.)&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Always Work With Real-World Data (Or As Soon As You Can)
&lt;/h2&gt;

&lt;p&gt;Clean, tidy teaching datasets like Iris or Titanic are great when you're first learning about a new algorithm or library. They allow you to focus on the mechanics without getting bogged down.&lt;/p&gt;

&lt;p&gt;But, and this is a big but, &lt;strong&gt;the faster you can transition to working with messy, real-world datasets, the faster you'll grasp what data science is &lt;em&gt;actually&lt;/em&gt; like in the real world.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What I have come to understand is that data science isn't primarily about solving perfectly defined puzzles with clean inputs. It's about making sense of ambiguity, working through complexity, and dealing with often incomplete information.&lt;/p&gt;

&lt;p&gt;Real-world datasets are going to throw all of this at you, and you will love it. &lt;em&gt;(at least after pulling out some of your hair, you will)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7qcwjg2wuvjt82o9nm3.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7qcwjg2wuvjt82o9nm3.webp" alt="Real World Dataset Data Science Meme" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Context is Everything:&lt;/strong&gt; Real data comes with real context. You're not just predicting a number; you're trying to solve a problem for a business, a scientific puzzle, or a need.&lt;/p&gt;

&lt;p&gt;This forces you to ask better questions: Who generated this data? Why? What are the known limitations? What does a "good" outcome actually mean for the stakeholders? This mind shift from a purely technical to a problem-solving one is really important.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Embrace the Mess (Again):&lt;/strong&gt; As I mentioned above in the data cleaning section, real data will test your patience and your skills.&lt;/p&gt;

&lt;p&gt;This is where the most valuable learning happens. It's one thing to read about handling missing data; it's another to stare at a column with 30% missing values for a critical predictor and have to decide on a strategy. &lt;em&gt;(or should you give up and retire into a forest completely offgrid)&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Humility:&lt;/strong&gt; Real-world data doesn't care about your perfectly crafted &lt;code&gt;sklearn&lt;/code&gt; pipeline or how well your model performed on a Kaggle competition with pre-cleaned data.&lt;/p&gt;

&lt;p&gt;It will expose the flaws in your assumptions and challenge your creativity. This is a good thing! It keeps you grounded and focused on robust, practical solutions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I still remember the first time I was handed a dataset from an ongoing business process. It was a collection of spreadsheets, Shopify exports, Google Ads campaign exports, and even some manually entered logs for various SKUs, for a company with 3 ecom stores.&lt;/p&gt;

&lt;p&gt;There was zero documentation. The task was somewhat vaguely defined: can you help us with your 'data science' to get better ROAS? (That's return on advertising spend, in case you are not from the marketing side of the world.)&lt;/p&gt;

&lt;p&gt;My initial reaction was simply me getting a bit flustered. But as I started to dig in, asking questions, and trying to piece things together, I learned more in those few weeks than in months of structured tutorials.&lt;/p&gt;

&lt;p&gt;I learned about data governance &lt;em&gt;(maybe, I guess I did)&lt;/em&gt;, the importance of clear communication with data owners, and the iterative nature of problem definition.&lt;/p&gt;

&lt;p&gt;So, my advice is: try out real-world projects.&lt;/p&gt;

&lt;p&gt;This could be through internships, capstone projects, contributing to open source projects that use real data, or even finding public datasets from government agencies, non-profits, or platforms like Kaggle (but look for the less polished "real" datasets, not just the clean competition ones).&lt;/p&gt;

&lt;p&gt;The experience will be invaluable.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Model Evaluation is So Much More Than Accuracy
&lt;/h2&gt;

&lt;p&gt;For a surprisingly long time, I equated "good model" with "high accuracy." And I think most students in my class did too.&lt;/p&gt;

&lt;p&gt;It's an easy metric to understand and often the default in many courses.&lt;/p&gt;

&lt;p&gt;Then came the churn project where a model with really good accuracy was, in reality, failing to identify most of the customers who were actually churning.&lt;/p&gt;

&lt;p&gt;That was a cruel but necessary wake up call: &lt;strong&gt;accuracy can be incredibly misleading, especially with imbalanced classes or when the costs of different errors vary significantly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Metrics like precision, recall, and the F1-score are not just academic concepts; they are important for understanding a model's performance in a way that aligns with actual business or real world objectives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Confusion Matrix is Cool:&lt;/strong&gt; Don't just look at the aggregate metrics.&lt;/p&gt;

&lt;p&gt;Dig into the confusion matrix.&lt;/p&gt;

&lt;p&gt;How many true positives, true negatives, false positives, and false negatives are you getting?&lt;/p&gt;

&lt;p&gt;Seeing these raw numbers usually tells a much clearer story.&lt;/p&gt;

&lt;p&gt;For that churn model, the confusion matrix clearly showed that we were correctly identifying most non-churners (true negatives), but we were missing a huge chunk of actual churners (false negatives).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Align with Goals:&lt;/strong&gt; A model that identifies fewer churners but is very precise about the ones it does identify might be more useful if the intervention to prevent churn is expensive.&lt;/p&gt;

&lt;p&gt;If the intervention is cheap and catching as many potential churners as possible is the most important thing, recall becomes more critical.&lt;/p&gt;

&lt;p&gt;These trade-offs are always present.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simulate Business Impact:&lt;/strong&gt; Whenever possible, try to translate model performance into potential business impact.&lt;/p&gt;

&lt;p&gt;If your fraud model catches an extra X% of fraud, what does that mean in terms of dollars saved?&lt;/p&gt;

&lt;p&gt;If your recommendation engine improves click-through by Y%, what's the potential revenue impact?&lt;/p&gt;

&lt;p&gt;This will help bridge the gap between technical metrics and stakeholder understanding.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
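&lt;p&gt;As an illustration, here's roughly what that churn confusion matrix looked like in miniature. The numbers are invented, but the failure mode is the same:&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix

# Invented miniature churn scenario: 16 loyal customers, 4 churners,
# and a model that almost always predicts "no churn"
y_true = [0] * 16 + [1] * 4
y_pred = [0] * 16 + [1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
# 85% accurate overall, yet FN=3 means 3 of the 4 actual churners were missed
```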

&lt;p&gt;Now, when I evaluate a model, I go through a checklist like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;em&gt;What is the actual goal? What problem are we trying to solve?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;What are the costs of different types of errors (false positives vs. false negatives)?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;Is the dataset imbalanced? If so, accuracy is probably not the primary metric.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;What do the confusion matrix, precision, recall, F1-score, ROC-AUC, and PR-AUC tell me?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;Can I look at predictions for specific segments or edge cases to understand where the model excels and where it fails?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;Who will use this model, and what decisions will it drive? Does my evaluation provide them with the information they need to trust and use it effectively?&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thinking beyond a single number and about the broader context as well as consequences of a model's predictions is what separates a data analyst from a data scientist who can deliver real strategic value.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. EDA Isn't Just a Checkbox
&lt;/h2&gt;

&lt;p&gt;In many tutorials and college class settings, Exploratory Data Analysis can sometimes feel like a &lt;em&gt;"let's just get done with it fast so we can move to the cool stuff"&lt;/em&gt; step: generate a few bar charts, a heatmap, calculate some summary statistics, and then quickly move on to modeling.&lt;/p&gt;

&lt;p&gt;And look, I am guilty of doing this too. "Let's get through the charts so I can build the &lt;em&gt;actual&lt;/em&gt; model." I now realize how shortsighted that was. In the real world, EDA is often where your actual understanding of the problem and the data really begins. It's where the data starts to speak to you, if you listen carefully. Okay maybe "data speaks to you" is a bit too much, but you get the point, right?&lt;/p&gt;

&lt;p&gt;Deep EDA has saved me from going down wrong paths and has often helped me find insights that were far more valuable than the eventual model itself. Yep, that happens. Ask any data scientist working with real world data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Finding Hidden Stuff:&lt;/strong&gt; EDA is your chance to find the "unknown unknowns." It helped me find outliers that weren't just errors, but represented genuinely unusual (and important) events. It meant surprising correlations between variables I hadn't initially thought were related. I've found seasonality trends, data entry biases, and missing value patterns that told a story about the underlying processes generating the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Boost to Feature Engineering:&lt;/strong&gt; The insights from EDA are direct inputs into your feature engineering process. If you see a non-linear relationship between a variable and your target, well, a transformation might be needed. If certain categories in a feature behave very differently, that might mean you need a new grouping strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Refining the Problem:&lt;/strong&gt; Sometimes, what you discover during EDA can lead you to reframe the entire problem. You might realize the target variable isn't what you thought it was, or that the available data is better to answer a slightly different, but maybe equally valuable, question.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Helping Build Intuition:&lt;/strong&gt; Simply spending time visualizing and summarizing your data in different ways builds an intuition for it. You start to "feel" it, its quirks, its strengths, and its weaknesses. This intuition is invaluable when it comes to modeling and interpreting results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My EDA process now is much more patient and curious.&lt;/p&gt;

&lt;p&gt;I use a combination of tools, even though they are basic and fundamental:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Pandas&lt;/strong&gt; (think &lt;code&gt;df.info()&lt;/code&gt;, &lt;code&gt;df.describe()&lt;/code&gt;) or profiling libraries can give a great automated overview to start with, highlighting potential issues quickly.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Matplotlib and Seaborn&lt;/strong&gt; (and sometimes Plotly for interactive visuals) are my go-to for custom plots. Histograms, box plots, scatter plots, pair plots, heatmaps of correlations - each tells a different part of the story.&lt;/li&gt;
&lt;li&gt;  I ask a lot of questions: "What does the distribution of this variable look like?" "How does this variable relate to the target?" "Are there any strange patterns or clusters?" "What happens if I segment the data by X or Y?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle36qg3rt7vpna2a6fjp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle36qg3rt7vpna2a6fjp.jpg" alt="EDA Data Science Meme" width="639" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're rushing through EDA, you're likely leaving insights on the table.&lt;/p&gt;

&lt;p&gt;The better your EDA, the better your hypotheses, your features, your models, and ultimately, your ability to solve the problem. Period.&lt;/p&gt;

&lt;p&gt;It's an investment that pays dividends throughout the entire project lifecycle.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Version Control Isn't Just for Software Engineers, It's For Data Scientists Too
&lt;/h2&gt;

&lt;p&gt;I cannot emphasize this enough: &lt;strong&gt;learn Git. Use Git. Make Git your friend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even if you're working solo on a small project.&lt;/p&gt;

&lt;p&gt;Even if you're just experimenting with a new dataset in a Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Version control isn't just a tool for large software development teams pushing production code; it's a great practice for tracking your thought process, managing your experiments, and saving yourself from countless headaches.&lt;/p&gt;

&lt;p&gt;There were times, especially early on, when I'd be tweaking a script or a notebook, trying different approaches, and suddenly realize I'd broken something that was working the day before.&lt;/p&gt;

&lt;p&gt;And I'd have no clear idea what I changed or how to get back.&lt;/p&gt;

&lt;p&gt;That feeling is awful.&lt;/p&gt;

&lt;p&gt;Git is the safety net that saves you from this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your Personal Time Machine:&lt;/strong&gt; With Git, you can commit your changes at meaningful checkpoints. Think of commits as snapshots of your project. If you mess something up, you can always revert to a previous working version.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understanding Your Own Work:&lt;/strong&gt; Writing good commit messages (e.g., "Implemented feature X," "Fixed bug in data cleaning for Y," "Experimented with Z hyperparameter for model A") forces you to articulate what you've done. When you look back at your commit history, it becomes a log of your project's evolution and your decision making process. It's surprisingly helpful for remembering your own train of thought.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Branching for Experiments:&lt;/strong&gt; Want to try a totally different approach to feature engineering without messing up your main working version? Create a new branch! You can experiment freely in that branch, and if it works out, merge it back. If it doesn't, you can just discard the branch. This is fantastic for trying out new ideas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collaboration (Even with Future You):&lt;/strong&gt; If you ever work with others, Git is non-negotiable for managing contributions. But even for solo projects, "future you" will thank "past you" for a well maintained Git repo when you need to come back to an old project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It Feels Professional:&lt;/strong&gt; Honestly, getting comfortable with Git and GitHub also just makes you feel more competent and smart. And who doesn't love a bit of an ego boost.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Git has a bit of a learning curve, I won't lie. Concepts like staging, committing, branching, merging, and remotes can be confusing at first. Start simple: &lt;code&gt;git init&lt;/code&gt;, &lt;code&gt;git add .&lt;/code&gt;, &lt;code&gt;git commit -m "My awesome work"&lt;/code&gt;, &lt;code&gt;git push&lt;/code&gt;.&lt;/p&gt;
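&lt;p&gt;For a solo project, a workflow like this is usually enough to get the safety-net and branching benefits. The commit messages and branch name are just examples:&lt;/p&gt;

```shell
# A minimal solo-project workflow
git init
git add .
git commit -m "Initial commit: data loading and first EDA notebook"

# Try a risky idea on its own branch so the working version stays safe
git checkout -b smote-experiment
# ... edit code, run experiments ...
git add .
git commit -m "Tried SMOTE oversampling on the training split"

# Happy with it? Merge it back. Not happy? Just delete the branch.
git checkout -          # jump back to the branch you started from
git merge smote-experiment
```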

&lt;p&gt;Gradually add more commands as you need them. As I started contributing to open source and other repos, I learned more and more about Git and different workflows and commands.&lt;/p&gt;

&lt;p&gt;I still wouldn't claim to know it inside out, but more often than not I can troubleshoot when I get stuck somewhere with Git now, and then there are always AI tools around to help explain the errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. One Deep Project Beats Ten Shallow Ones
&lt;/h2&gt;

&lt;p&gt;In the rush to build a portfolio or feel productive, it's easy to fall into the trap of trying to complete many small, quick projects. The "10 projects in 10 weeks" mentality. While there's some value in getting exposure to different datasets and problems, I've found that you will learn the most from going deep on a single, substantial project.&lt;/p&gt;

&lt;p&gt;My capstone project, which involved developing a customer segmentation model for an e-commerce dataset, taught me more than a dozen smaller tutorial-based projects combined. Why? Because depth forces you to deal with the complexities that shallow projects often just skim over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;End-to-End Experience:&lt;/strong&gt; A deep project usually means taking something from raw data all the way to a final output, whether that's a deployed model, a detailed report, or a presentation of insights. This means you have to work with every stage: data acquisition, cleaning, EDA, feature engineering, multiple iterations of modeling, rigorous evaluation, interpretation, and communicating your findings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real Problem-Solving:&lt;/strong&gt; Superficial projects often have well-defined paths. Deeper projects are more like real-world assignments - they tend to have ambiguity built in.&lt;/p&gt;

&lt;p&gt;You have to define the scope, make choices with incomplete information, and iterate based on what you discover. You might hit dead ends and have to backtrack. This is where real problem solving skills are built.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nuance and Trade-offs:&lt;/strong&gt; When you spend months on a project, you have the time to explore the nuances. You can try different modeling approaches, experiment with a wide set of features, and really dig into &lt;em&gt;why&lt;/em&gt; certain things work and others don't. You're more or less forced to think about trade-offs - model complexity vs. interpretability, precision vs. recall, development time vs. performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Communication and Justification:&lt;/strong&gt; If you're working on a significant project, you'll most probably have to explain your methodology and results to others (even if it's just for a course). This forces you to clarify your thinking, justify your decisions, and learn how to communicate complex technical ideas effectively. And this is one of the important skills for a data scientist I hear.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A portfolio filled with quick, similar projects might look busy, but it's the project where you can talk passionately about the challenges you faced, the iterations you went through, and the lessons you learned along the way that will actually impress and demonstrate your capabilities.&lt;/p&gt;

&lt;p&gt;Don't be afraid to dedicate significant time to one or two complex projects. The depth of understanding you gain will be far more valuable in the long run. IMHO.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Your "Best" Model Might Not Be the "Smartest" One
&lt;/h2&gt;

&lt;p&gt;There's a certain pull to using the latest, most complex, or highest-performing algorithm from some leaderboard. We all want to build "smart" models.&lt;/p&gt;

&lt;p&gt;Simply because it makes us feel smart.&lt;/p&gt;

&lt;p&gt;However, one of the most practical lessons I've learned is that the theoretically "smartest" or most complex model isn't always the &lt;em&gt;best&lt;/em&gt; one for the job in a real-world context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodebynight.dev%2Fimages%2Fblog%2Fconfused-meme.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodebynight.dev%2Fimages%2Fblog%2Fconfused-meme.avif" alt="Confused meme" width="1600" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Practicality, interpretability, and maintainability often matter more than raw predictive power.&lt;/p&gt;

&lt;p&gt;I've had situations where a well-tuned XGBoost model achieved a slightly higher AUC score than a simpler logistic regression or decision tree.&lt;/p&gt;

&lt;p&gt;But when it came time to explain &lt;em&gt;why&lt;/em&gt; the model was making certain predictions, or how a particular input was influencing the outcome, the simpler model was far better.&lt;/p&gt;

&lt;p&gt;And often, stakeholders (especially non-technical ones) need that understanding to trust and act on a model's output.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Interpretability Matters:&lt;/strong&gt; If a model is a complete "black box," it's hard for users to trust it, especially in high-stakes domains like finance, healthcare, or legal applications. Simpler models are often inherently more interpretable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ease of Deployment and Maintenance:&lt;/strong&gt; Simpler models are generally easier to deploy, faster to make predictions with, and require less compute. They are also often easier to debug and maintain over time. If a model is going into a production system, these factors can be just as important as a marginal gain in accuracy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Stability and Robustness:&lt;/strong&gt; Sometimes, highly complex models can be "brittle" - they might perform exceptionally well on the specific dataset they were trained on but fail to generalize well to slightly different or new data. Simpler models can sometimes be more robust.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;"Good Enough" is Often Good Enough:&lt;/strong&gt; What's the incremental business value of that extra 0.5% in F1 score? If it comes at the cost of significant complexity, longer development time, or reduced interpretability, is it worth it? The answer isn't always yes. We need to ask: what level of performance is actually needed to solve the business problem effectively?&lt;/li&gt;
&lt;/ul&gt;
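&lt;p&gt;To make this concrete, here's a minimal sketch of the comparison I'm describing (my own toy example on synthetic scikit-learn data, not any dataset from this article): fit both a simple and a complex model, look at the AUC gap, and remember that only one of the two hands you coefficients you can explain in a meeting:&lt;/p&gt;

```python
# Toy comparison: interpretable baseline vs. a more complex model.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Simple, interpretable model: scaled logistic regression.
simple = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
# More complex model: gradient boosting.
complex_ = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

auc_simple = roc_auc_score(y_te, simple.predict_proba(X_te)[:, 1])
auc_complex = roc_auc_score(y_te, complex_.predict_proba(X_te)[:, 1])
print(f"logistic regression AUC: {auc_simple:.3f}")
print(f"gradient boosting   AUC: {auc_complex:.3f}")

# The simple model's coefficients are directly explainable to stakeholders.
coefs = simple.named_steps["logisticregression"].coef_[0]
```

&lt;p&gt;If the AUC gap turns out to be small, those coefficients alone can be a strong argument for shipping the simpler model.&lt;/p&gt;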

&lt;p&gt;This doesn't mean you shouldn't explore and learn about advanced models.&lt;/p&gt;

&lt;p&gt;Absolutely do! But always weigh the benefits of added complexity against the practical costs.&lt;/p&gt;

&lt;p&gt;I now ask myself: "Do we &lt;em&gt;really&lt;/em&gt; need this level of complexity here? What are we gaining, and what are we potentially losing in terms of clarity, trust, and deployability?"&lt;/p&gt;

&lt;p&gt;Sometimes, a well-featured logistic regression is exactly the hero you need.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Just Because It's in the Data Doesn't Mean It Should Be Used
&lt;/h2&gt;

&lt;p&gt;This one ties into both data cleaning and feature engineering, but it deserves its own spotlight because it touches on an important aspect of responsible data science: &lt;em&gt;feature selection is just as important as feature creation, and sometimes what you exclude is as vital as what you include.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Beyond obvious target leakage, there are other reasons to be selective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Noise and Redundancy:&lt;/strong&gt; Including too many irrelevant or redundant features can actually hurt your model's performance. They can add noise, making it harder for the algorithm to discern the true signals. They can also increase computational cost and training time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Overfitting:&lt;/strong&gt; Models with too many features (especially relative to the number of observations) are more prone to overfitting - learning the noise in the training data too well and failing to generalize to new, unseen data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Interpretability (Again):&lt;/strong&gt; Fewer, well-chosen features generally make a model easier to understand and explain.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ethical Considerations and Bias:&lt;/strong&gt; This is a huge one. Some features, while potentially predictive, might be proxies for sensitive attributes like race, gender, or socioeconomic status, leading to biased or unfair outcomes. For example, using zip code as a feature in loan applications could inadvertently perpetuate historical redlining if not handled with care and awareness. It's our responsibility to critically examine our features and ask: "Does this feature make sense to use? Is it fair? Could it lead to harmful discrimination?" This is something I learnt while I was a part of Omdena's non profit machine learning project to introduce machine learning and AI into a legal scenario for US courts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learning to critically evaluate and remove features is a skill.&lt;/p&gt;

&lt;p&gt;You might get attached to a feature you spent a lot of time engineering. &lt;em&gt;Yep, you can get attached to features.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But if it's not adding real, generalizable predictive power, or if it poses an ethical risk, it needs to go.&lt;/p&gt;

&lt;p&gt;Ask yourself: "Would I, as a human, use this piece of information to make this decision in a fair and logical way?"&lt;/p&gt;

&lt;p&gt;If the answer is no, your model probably shouldn't either.&lt;/p&gt;

&lt;p&gt;This often requires not just statistical analysis (like looking at feature importance scores or running recursive feature elimination), but also strong domain knowledge and ethical reasoning.&lt;/p&gt;
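&lt;p&gt;As a small illustration of recursive feature elimination (a hypothetical toy setup, not data from this article), scikit-learn's &lt;code&gt;RFE&lt;/code&gt; repeatedly drops the weakest features until only a target number remains:&lt;/p&gt;

```python
# Recursive feature elimination: keep a compact, defensible feature set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)

# Rank features by repeatedly refitting and dropping the weakest one.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

kept = [i for i, keep in enumerate(selector.support_) if keep]
print("feature columns kept:", kept)
```

&lt;p&gt;The statistical ranking is only half the job, though - you still have to ask whether each surviving feature is fair and sensible to use.&lt;/p&gt;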

&lt;p&gt;This is slightly different from the approach you might take with an unsupervised deep learning model, but even there it pays to prepare your data carefully so the model behaves as you intend.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Machine Learning Projects Are Communication Projects in Disguise
&lt;/h2&gt;

&lt;p&gt;This was a slow-burn realization for me, but it's become one of my core beliefs about this field.&lt;/p&gt;

&lt;p&gt;You can build the most technically brilliant, highest-performing model in the world, but if you can't explain it clearly, persuasively, and with empathy, its impact will be limited.&lt;/p&gt;

&lt;p&gt;Building the model is often just step one; effectively communicating its value, insights, and limitations is where most of the real-world success lies.&lt;/p&gt;

&lt;p&gt;Stakeholders, especially non-technical ones, usually don't care about log loss or gradient descent. They care about what the model &lt;em&gt;means&lt;/em&gt; for their business goals: How will this help us reduce costs? How can this improve customer satisfaction? Can we trust these predictions to make better decisions?&lt;/p&gt;

&lt;p&gt;I now dedicate a significant portion of my project time to thinking about communication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Know Your Audience:&lt;/strong&gt; Are you talking to other data scientists, product managers, engineers, or executives? Tailor your language, level of technical detail, and visualizations accordingly. Ask "What are &lt;em&gt;their&lt;/em&gt; priorities and concerns?"&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Focus on the "So What?":&lt;/strong&gt; Don't just present results; explain their implications. Connect the model's output directly to the business problem and the decisions it can help make. Instead of saying "The model has a recall of 0.75," try "This model can help us identify 75% of high-risk customers, allowing us to proactively engage them and potentially reduce churn by X%."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Visualize for Clarity:&lt;/strong&gt; Good charts and visuals are incredibly powerful for conveying complex information simply. A well-designed graph can often tell a story much more effectively than a table of numbers or a dense paragraph of text. Think about dashboards, clear summary slides, and intuitive ways to show model predictions or feature importance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tell a Story:&lt;/strong&gt; Frame your insights as a narrative. What was the problem? What did you discover in the data? How does your model address the problem? What are the key insights and recommended actions? A good story is engaging and memorable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Be Honest About Limitations:&lt;/strong&gt; No model is perfect. Clearly communicate the assumptions you made, the potential biases in the data or model, and the situations where the model might not perform well. This builds trust and manages expectations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learning to be a good "translator" between the technical (ML) world and the practical world of business or domain experts is a skill that will make you incredibly valuable.&lt;/p&gt;

&lt;p&gt;It's not just about Python and algorithms; it's about empathy, clarity, and persuasion.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. A Good Model That No One Understands is a Bad Model (Usually)
&lt;/h2&gt;

&lt;p&gt;This builds directly on the previous point, but I wanted to take a couple of paragraphs to focus specifically on &lt;em&gt;interpretability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Especially in applications where decisions have significant consequences - like loan approvals, medical diagnoses, hiring, or churn management - if your model is a complete "black box" that spits out predictions without any discernible logic, people will be hesitant to trust it, adopt it, or rely on it.&lt;/p&gt;

&lt;p&gt;And rightly so. Business stakeholders value predictability, and a black box does nothing to reinforce predictability or stability.&lt;/p&gt;

&lt;p&gt;I used to think that as long as the model was accurate, its internal workings weren't &lt;em&gt;that&lt;/em&gt; important for the end-user.&lt;/p&gt;

&lt;p&gt;I've now realized that interpretability isn't just a "nice-to-have"; it's more or less essential for building trust, enabling debugging, ensuring fairness, and facilitating human oversight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Building Trust:&lt;/strong&gt; If stakeholders can understand &lt;em&gt;why&lt;/em&gt; a model makes a certain prediction, they are much more likely to trust its outputs and integrate it into their workflows. If a customer service agent is told to offer a specific retention incentive to a customer flagged by a churn model, understanding the key factors driving that flag (e.g., "decreased usage," "recent support complaint") makes the recommendation more actionable and credible.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Debugging and Improvement:&lt;/strong&gt; If a model is making strange or incorrect predictions, interpretability can help you diagnose the problem. Are there specific features that are having an undue influence? Is it learning spurious correlations? Understanding the "reasoning" can guide your efforts to improve it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Fairness and Bias Detection:&lt;/strong&gt; Interpretability tools can help uncover if a model is relying on sensitive or proxy variables in an unfair way. If you can see that a certain demographic group is consistently receiving adverse predictions due to a particular feature, you can investigate and mitigate that bias.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Regulatory Compliance and Accountability:&lt;/strong&gt; In some industries (like finance with credit scoring), there are legal requirements for models to be explainable. Even without explicit regulation, being able to explain how a decision was made is crucial for accountability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simpler models like linear regression or decision trees are inherently more interpretable, but even for more complex models (like ensemble methods or neural networks), there are tools and techniques to help explain their behavior simply.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;SHAP (SHapley Additive exPlanations)&lt;/strong&gt; and &lt;strong&gt;LIME (Local Interpretable Model-agnostic Explanations)&lt;/strong&gt; can help explain individual predictions by showing the contribution of each feature. I've found these quite useful for understanding what pushes my models, even if just for my own sanity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Partial Dependence Plots (PDPs)&lt;/strong&gt; or &lt;strong&gt;Individual Conditional Expectation (ICE) plots&lt;/strong&gt; can help visualize how a model's prediction changes as you vary a specific feature.&lt;/li&gt;
&lt;/ul&gt;
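&lt;p&gt;SHAP and LIME are the tools I'd reach for in practice, but the core idea - attributing a single prediction to per-feature contributions - can be seen with plain scikit-learn for a linear model. A minimal sketch on synthetic data (the generic feature names are hypothetical):&lt;/p&gt;

```python
# For a linear model, each feature's contribution to one prediction's
# log-odds is simply coefficient * feature value (after centering).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X = StandardScaler().fit_transform(X)  # zero-mean features act as the baseline
model = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0]  # one row (e.g. one customer) to explain
contributions = model.coef_[0] * x  # signed per-feature push on the log-odds
for name, c in zip(["f0", "f1", "f2", "f3"], contributions):
    print(f"{name}: {c:+.3f}")

# Sanity check: contributions plus the intercept recover the model's score.
score = contributions.sum() + model.intercept_[0]
```

&lt;p&gt;Tree ensembles and neural networks don't decompose this cleanly, which is exactly the gap SHAP and LIME are built to fill.&lt;/p&gt;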

&lt;p&gt;The key mindset shift for me was moving from "Is this model accurate?" to "Is this model useful, trustworthy, and actionable?"&lt;/p&gt;

&lt;p&gt;A model's job isn't just to predict; it's to provide insight and support better human decision-making.&lt;/p&gt;

&lt;p&gt;And that almost always requires a degree of clarity and interpretability.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. MLOps is Real, and It Will Come for You (Eventually, and Sooner Than You Think)
&lt;/h2&gt;

&lt;p&gt;When I first started, "MLOps" sounded like something only big tech companies with armies of engineers needed to worry about.&lt;/p&gt;

&lt;p&gt;My focus was on getting my models to work in a Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Reproducibility? Versioning data? Tracking experiments systematically? That felt like overkill for my relatively small student projects.&lt;/p&gt;

&lt;p&gt;The wake-up call came when I had to come back to a project after a few months because I wanted to build on it as a portfolio project.&lt;/p&gt;

&lt;p&gt;I needed to re-run a model, possibly with slightly updated data.&lt;/p&gt;

&lt;p&gt;And I realized that I couldn't quite remember all the specific preprocessing steps I'd taken before landing on the one that was now in my notebook.&lt;/p&gt;

&lt;p&gt;It was a mess.&lt;/p&gt;

&lt;p&gt;That's when I understood that &lt;strong&gt;MLOps isn't just for massive scale; it's about professionalism, reproducibility, and efficiency, even for personal projects.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzf6g29l6liac0kuun4h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzf6g29l6liac0kuun4h.jpg" alt="MLOps Meme" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MLOps is a broad field, but here are a few foundational concepts that I've started incorporating, even into my own learning projects, that have made a huge difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Experiment Tracking:&lt;/strong&gt; Tools like &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt; (which is open source and relatively easy to get started with) or Weights &amp;amp; Biases allow you to log your model parameters, metrics, code versions, and even output artifacts for every run. This creates an organized history of your experiments, making it easy to compare results and recall what worked. No more &lt;code&gt;model_final_v2_actually_final_really_final.ipynb&lt;/code&gt;!&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Code Structure and Modularity:&lt;/strong&gt; Moving away from monolithic Jupyter Notebooks for anything more than initial exploration. Structuring code into reusable Python scripts for data preprocessing, feature engineering, training, and evaluation makes it much more manageable, testable, and easier to integrate into a pipeline.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data and Model Versioning:&lt;/strong&gt; Just like you version your code with Git, you should also think about versioning your datasets and your trained models. This ensures that if you need to reproduce a result or understand why performance changed, you can trace back the exact data and model artifact used.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Basic Automation/Pipelines:&lt;/strong&gt; Even a simple shell script or a Python script that runs your sequence of data processing, training, and evaluation steps can save a lot of manual effort and reduce errors.&lt;/li&gt;
&lt;/ul&gt;
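&lt;p&gt;You don't even need MLflow to get started. Here's a tiny homegrown run logger (every name and path here is my own hypothetical sketch) that captures parameters and metrics per experiment as timestamped JSON files:&lt;/p&gt;

```python
# Minimal experiment tracking: one JSON record per training run.
import json
import time
from pathlib import Path

def log_run(run_dir: str, params: dict, metrics: dict) -> Path:
    """Append one experiment record as a timestamped JSON file."""
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    path = out / f"run_{int(record['timestamp'] * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Usage: log every run, then compare the records later instead of guessing.
p = log_run("experiments", {"model": "logreg", "C": 1.0}, {"auc": 0.81})
print("logged:", p)
```

&lt;p&gt;Once this feels limiting, graduating to MLflow's &lt;code&gt;log_param&lt;/code&gt; / &lt;code&gt;log_metric&lt;/code&gt; calls is a natural next step.&lt;/p&gt;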

&lt;p&gt;You don't need to become an MLOps expert overnight.&lt;/p&gt;

&lt;p&gt;But starting to think about these principles early and incorporating even simple practices will make your life much easier, your work more reliable, and your skills more valuable.&lt;/p&gt;

&lt;p&gt;There are tons of great blogs and video tutorials on MLOps - just start small and build up.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Churn (or Any Target Variable) Isn't Just a Number. It's a Story
&lt;/h2&gt;

&lt;p&gt;When I built my first churn prediction model, my primary focus was, naturally, on the prediction itself: Will this customer churn, yes or no? How accurate is my model at making this binary classification?&lt;/p&gt;

&lt;p&gt;But as I spent more time with the problem and the data, I realized: &lt;strong&gt;understanding &lt;em&gt;why&lt;/em&gt; customers churn, and what the underlying drivers are, is often far more valuable than just predicting the event itself.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The target variable isn't just a '0' or a '1'; it represents a story, an outcome of different factors and experiences.&lt;/p&gt;

&lt;p&gt;This shift in perspective changed how I approached the entire project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;From Prediction to Insight:&lt;/strong&gt; I started paying much more attention to feature importance scores from my models. Which factors were most strongly correlated with a customer's decision to leave?

&lt;ul&gt;
&lt;li&gt;  Was it poor customer service interactions (e.g., high number of support tickets)?&lt;/li&gt;
&lt;li&gt;  Was it a lack of engagement with key product features?&lt;/li&gt;
&lt;li&gt;  Was it price sensitivity in a particular segment?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Actionable Insights for Business Strategy:&lt;/strong&gt; These insights, derived from the model, could then be translated into potential business actions. If "number of unresolved support tickets" is a major churn driver, the business can focus on improving customer service response times or first-call resolution. If "low usage of new premium features" is a flag, perhaps better onboarding or targeted marketing for those features is needed.&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Segmentation and Personalization:&lt;/strong&gt; Understanding churn drivers can also help segment customers. Are there different "churn personas" who leave for different reasons? This allows for more targeted and effective retention strategies rather than a one-size-fits-all approach.&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Narrative Building:&lt;/strong&gt; The model and its EDA became tools for building a narrative about the customer journey and the points where trust or value might be leaking. This is often what works well with stakeholders and drives change.&lt;/li&gt;

&lt;/ul&gt;
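&lt;p&gt;A small sketch of what "reading the story" can look like in code (synthetic data; the churn-driver feature names are hypothetical stand-ins): fit a model, then rank its feature importances so you can ask &lt;em&gt;why&lt;/em&gt; customers leave, not just &lt;em&gt;who&lt;/em&gt; will:&lt;/p&gt;

```python
# Rank churn drivers by model feature importance (toy data, made-up names).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = ["support_tickets", "feature_usage", "tenure_months",
         "price_plan", "logins_last_30d"]
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=7)
model = RandomForestClassifier(random_state=7).fit(X, y)

# Sort features by their (normalized) importance, strongest driver first.
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name:18s} {score:.3f}")
```

&lt;p&gt;That ranked list is the starting point for the narrative - each top driver is a question to take back to the business.&lt;/p&gt;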

&lt;p&gt;I came to see that my role wasn't just to identify which customers were likely to leave, but to help the business understand the "story of churn" within their customer base.&lt;/p&gt;

&lt;p&gt;This meant looking at patterns, understanding the context, and connecting the dots in a way that could inform proactive strategies.&lt;/p&gt;

&lt;p&gt;So, whatever you're trying to predict, remember to look beyond the prediction itself.&lt;/p&gt;

&lt;p&gt;Dig into the "why." That's often where the best insights and the opportunities for impact tend to lie.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. You Will Feel Like an Imposter. And You Are Not Alone
&lt;/h2&gt;

&lt;p&gt;Let's be honest: there have been tons of moments this past year where I've felt like a complete fraud.&lt;/p&gt;

&lt;p&gt;When I couldn't get a model to work after hours of trying.&lt;/p&gt;

&lt;p&gt;When someone in a lecture asked a question about the underlying math of an algorithm that made me freeze right there in my shorts &lt;em&gt;(remote lectures so no pants lol)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When I read a great blog post or a cutting-edge research paper and felt hopelessly behind because I felt like I was reading gibberish.&lt;/p&gt;

&lt;p&gt;That little voice whispering, "You don't belong here. You're not smart enough for this. You're a marketer, you can't do this. You've always been behind at maths - what are you going to do in ML? So what if you did an ML project as your graduation thesis? It's been a decade. Don't be naive; quit this and go back to being a marketing guy."&lt;/p&gt;

&lt;p&gt;That's impostor syndrome, and it's incredibly common in this field (and many others, especially for learners).&lt;/p&gt;

&lt;p&gt;What I've slowly, and painfully, come to realize is that almost &lt;em&gt;everyone&lt;/em&gt; feels this way at some point, even the people you look up to as experts.&lt;/p&gt;

&lt;p&gt;In tech, a field that's constantly evolving, with new tools, techniques, and research appearing daily, it's virtually impossible to know everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;It Means You're Learning:&lt;/strong&gt; Often, that feeling of inadequacy pops up when you're tackling something new and challenging. And that's exactly where growth happens. If you always felt completely comfortable, you probably wouldn't be learning much.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Field is Vast:&lt;/strong&gt; Data science and machine learning are incredibly broad. There are specialists in NLP, computer vision, reinforcement learning, MLOps, data engineering, causal inference, ethics... No one is an expert in all of it. It's okay to have areas where you're still learning (which is, for most of us, most areas!).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Comparison is the Thief of Joy (and Confidence):&lt;/strong&gt; It's so easy to look at someone else's curated GitHub profile or their impressive job title and feel like you're falling short. But you're only seeing their highlight reel, not their struggles, their own moments of doubt, or the years of work that got them there. Focus on your own journey and your own progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, when that feeling of impostor syndrome hits (and it probably will, multiple times), try to acknowledge it without letting it derail you.&lt;/p&gt;

&lt;p&gt;Remind yourself that it's a common part of the learning process.&lt;/p&gt;

&lt;p&gt;Focus on what you &lt;em&gt;have&lt;/em&gt; learned, celebrate your small wins, and keep putting one foot in front of the other.&lt;/p&gt;

&lt;p&gt;It means you care about what you're doing and that you're challenging yourself.&lt;/p&gt;

&lt;p&gt;And that, in itself, is something to be proud of.&lt;/p&gt;




&lt;h2&gt;
  
  
  16. Learning is Bigger Than Specific Tools
&lt;/h2&gt;

&lt;p&gt;Python. Pandas, NumPy, Scikit-learn. Matplotlib, Seaborn. TensorFlow, PyTorch. SQL. Git. Docker. Cloud platforms like AWS or GCP.&lt;/p&gt;

&lt;p&gt;The list of tools and tech in the data science ecosystem can seem endless and a bit daunting when you're starting out.&lt;/p&gt;

&lt;p&gt;I definitely felt pressure to learn &lt;em&gt;all the things&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;While proficiency in relevant tools is important (and you &lt;em&gt;will&lt;/em&gt; need to get comfortable with a core set like Python and the core data science libraries), I've come to believe that the underlying concepts and the problem-solving mindset are far more critical and enduring than mastery of any single tool.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tools Evolve, Fundamentals Stay Strong:&lt;/strong&gt; The hot new library or framework of today might be displaced by something else in a few years. But the fundamental principles of statistics, how to frame a problem, how to critically evaluate data, how to design experiments, and how to interpret results - these are timeless.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focus on "How to Think":&lt;/strong&gt; Can you take an ambiguous problem and break it down into manageable, data-addressable questions?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Can you think about potential biases and limitations?&lt;/li&gt;
&lt;li&gt;  Can you choose an appropriate methodology for the task at hand?&lt;/li&gt;
&lt;li&gt;  Can you clearly articulate your assumptions and your findings?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These "thinking" skills are tool-agnostic and will serve you well regardless of the specific software you're using.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transferable Skills:&lt;/strong&gt; Once you understand &lt;em&gt;why&lt;/em&gt; you're doing something (e.g., why you need to normalize features or why a certain evaluation metric is appropriate), learning the syntax to do it in a new tool is often much easier. The conceptual understanding is the harder, more valuable part.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoid "Tool-Driven" Problem Solving:&lt;/strong&gt; Don't let your comfort with a specific tool dictate how you approach every problem ("I know how to use random forests, so I'll use a random forest here!"). Instead, let the problem decide your choice of methods and, consequently, tools.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;So yes, learn the tools.&lt;/p&gt;

&lt;p&gt;Get your hands dirty with code.&lt;/p&gt;

&lt;p&gt;Build things.&lt;/p&gt;

&lt;p&gt;But don't define yourself or your abilities solely by the list of software on your resume.&lt;/p&gt;

&lt;p&gt;Prioritize understanding the "why" behind the "what," and cultivate your analytical and critical thinking.&lt;/p&gt;

&lt;p&gt;Those are the skills that truly make a data scientist adaptable and valuable in the long run.&lt;/p&gt;




&lt;h2&gt;
  
  
  17. It's Okay to Go Slow on the Math. But Please, Don't Skip It
&lt;/h2&gt;

&lt;p&gt;There were definitely moments, particularly when digging into the theory behind certain algorithms, where the math felt overwhelming.&lt;/p&gt;

&lt;p&gt;Eigenvectors and eigenvalues for PCA?&lt;/p&gt;

&lt;p&gt;The calculus behind backpropagation in neural networks?&lt;/p&gt;

&lt;p&gt;The statistical assumptions underpinning linear regression?&lt;/p&gt;

&lt;p&gt;I'd stare at the equations, and my eyes would start watering.&lt;/p&gt;

&lt;p&gt;The temptation to just import the library, run &lt;code&gt;.fit()&lt;/code&gt; and &lt;code&gt;.predict()&lt;/code&gt;, and move on was strong.&lt;/p&gt;

&lt;p&gt;While you don't necessarily need to be able to derive every formula from scratch to be an effective applied data scientist, I've learned that &lt;strong&gt;taking the time to understand the underlying mathematical and statistical concepts - even at an intuitive level - is really helpful.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Skipping it entirely can leave you vulnerable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Deeper Understanding and Confidence:&lt;/strong&gt; When you have at least a conceptual grasp of &lt;em&gt;how&lt;/em&gt; an algorithm works, you become more confident in using it appropriately. You understand its strengths, its weaknesses, and the assumptions it's making. This helps you to make more informed choices about which algorithm to use and how to interpret its results.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Better Debugging:&lt;/strong&gt; If your model is behaving unexpectedly, having some understanding of its internal mechanics can help you troubleshoot. Is it struggling with outliers because of the way its loss function is defined? Are your features scaled inappropriately for an algorithm that's sensitive to feature magnitudes?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Informed Hyperparameter Tuning:&lt;/strong&gt; Many hyperparameters directly relate to the underlying math. Understanding what they control allows you to tune them more intelligently than just randomly trying values.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Explaining to Others (and Yourself):&lt;/strong&gt; If you need to explain why a model is making certain predictions, or why you chose a particular approach, having that foundational knowledge (even if you simplify it for your audience) adds a lot of credibility.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Moving Beyond Black Boxes:&lt;/strong&gt; Relying solely on high-level library calls without any understanding of what's happening underneath can limit your ability to customize, adapt, or innovate.&lt;/li&gt;
&lt;/ul&gt;
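&lt;p&gt;One exercise that helped the "eigenvectors for PCA" part click for me: PCA's principal components are just the eigenvectors of the data's covariance matrix, and you can verify that with a few lines of NumPy checked against scikit-learn (toy data of my own making):&lt;/p&gt;

```python
# PCA from scratch: eigendecompose the covariance matrix, compare to sklearn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 3-D data with clearly different variances along each axis.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0, 0],
                                          [0.5, 1.0, 0],
                                          [0, 0, 0.3]])
Xc = X - X.mean(axis=0)  # center the data first

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by explained variance, descending
components = eigvecs[:, order].T         # rows = principal axes

sk = PCA(n_components=3).fit(X)
# Same axes as sklearn's PCA (eigenvector signs are arbitrary, so compare abs).
match = np.allclose(np.abs(components), np.abs(sk.components_), atol=1e-6)
print("matches sklearn:", match)
```

&lt;p&gt;Once you've seen the library call reproduced from the definition, PCA stops being magic.&lt;/p&gt;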

&lt;p&gt;It's okay if it takes time.&lt;/p&gt;

&lt;p&gt;It took me weeks to get an intuitive feel for some concepts and months for others.&lt;/p&gt;

&lt;p&gt;I re-read explanations, watched videos (shoutout to resources like &lt;a href="https://www.youtube.com/c/3blue1brown" rel="noopener noreferrer"&gt;3Blue1Brown on YouTube&lt;/a&gt; for amazing visual explanations of complex math), and worked through examples slowly.&lt;/p&gt;

&lt;p&gt;For math specifically, I also took the Mathematics for Machine Learning specialization on Coursera, which was damn helpful - I finished it twice in one year, and will probably do it once more in the next couple of months.&lt;/p&gt;

&lt;p&gt;The goal isn't necessarily to become a research mathematician, but to build enough intuition so that the algorithms aren't just magic.&lt;/p&gt;

&lt;p&gt;Go at your own pace, use different resources, but don't shy away from the math.&lt;/p&gt;




&lt;h2&gt;
  
  
  18. The Best Learning Comes From Projects, Not Just Courses
&lt;/h2&gt;

&lt;p&gt;Courses, bootcamps, and textbooks are great for providing structure, introducing new concepts, and giving you a foundational roadmap.&lt;/p&gt;

&lt;p&gt;But I've found that the most impactful, sticky learning comes from applying that knowledge by working on end-to-end projects.&lt;/p&gt;

&lt;p&gt;There's a world of difference between watching a lecture on logistic regression and actually trying to implement it on a messy dataset, struggling with feature scaling, interpreting the coefficients, and then trying to explain your findings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Active vs. Passive Learning:&lt;/strong&gt; Courses can sometimes be a passive experience. Projects force you into active problem-solving. You have to make decisions, encounter errors, search for solutions, and iterate. This struggle is where &lt;em&gt;"deep learning"&lt;/em&gt; happens. &lt;em&gt;(did you see what I just did there?)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connecting the Dots:&lt;/strong&gt; Projects require you to integrate knowledge from different areas. You're not just learning about data cleaning in one module and modeling in another; you're doing it all together, seeing how the pieces fit and influence each other.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dealing with Reality:&lt;/strong&gt; Real (or realistic) project data is messy and ambiguous. Courses usually simplify this. Projects throw you into the complexities, forcing you to adapt and develop a more robust skillset.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Building a Portfolio (and Confidence):&lt;/strong&gt; Completing a project, especially one you're passionate about or that solves a real problem, gives you something tangible to show for your efforts. More importantly, it builds your confidence in your ability to actually &lt;em&gt;do&lt;/em&gt; data science.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you feel like you're stuck in "tutorial hell" - passively consuming content without feeling like your skills are truly growing - the best antidote is to start a project.&lt;/p&gt;

&lt;p&gt;Pick a dataset that interests you (Kaggle, government open data sites, or even data you collect yourself).&lt;/p&gt;

&lt;p&gt;Define a question you want to answer or a problem you want to solve.&lt;/p&gt;

&lt;p&gt;And then try to build something, however small, from start to finish.&lt;/p&gt;

&lt;p&gt;Break it. Fix it. Learn from it.&lt;/p&gt;

&lt;p&gt;That cycle of building, struggling, and overcoming is the fast track to mastery.&lt;/p&gt;




&lt;h2&gt;
  
  
  19. Just-in-Time Learning Usually Beats Just-in-Case Learning
&lt;/h2&gt;

&lt;p&gt;When I first got serious about data science, I was a resource hoarder. I had tons of bookmarked articles, downloaded PDFs of research papers, saved tutorials - all on topics I thought I &lt;em&gt;might&lt;/em&gt; need "just in case."&lt;/p&gt;

&lt;p&gt;My thinking was, "I need to learn all of this &lt;em&gt;before&lt;/em&gt; I can be effective."&lt;/p&gt;

&lt;p&gt;The result? Burnout and a lot of unread material.&lt;/p&gt;

&lt;p&gt;I've since moved much more towards a "just-in-time" learning approach.&lt;/p&gt;

&lt;p&gt;This means focusing my learning efforts on what I need to solve the specific problem or understand the specific concept I'm working with &lt;em&gt;right now&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Motivation and Relevance:&lt;/strong&gt; When you're trying to solve an immediate problem, your motivation to learn the skills or concepts is much higher. The information feels directly relevant and applicable, which helps with comprehension and retention. Trying to learn about, say, advanced time series models "just in case" is much harder than learning about them when you're actively working on a forecasting project.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reduces Overwhelm:&lt;/strong&gt; The sheer volume of information in data science can be paralyzing. A just-in-time approach allows you to learn in more manageable chunks, focusing on depth in the area you currently need, rather than trying to achieve breadth across everything simultaneously.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Faster Application, Better Retention:&lt;/strong&gt; When you learn something and immediately apply it, you're much more likely to remember it. The practical application solidifies the theoretical knowledge.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Builds Resourcefulness:&lt;/strong&gt; This approach also builds the skill of knowing &lt;em&gt;how&lt;/em&gt; to find information and learn new things efficiently when you need them. You become more confident in your ability to tackle unfamiliar challenges because you trust your capacity to learn what's required on the fly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn't mean foundational learning isn't important.&lt;/p&gt;

&lt;p&gt;You still need that core understanding from courses and broader study.&lt;/p&gt;

&lt;p&gt;But for specialized topics or new tools, learning them in the context of a current need is often far more effective and less stressful than trying to master everything beforehand.&lt;/p&gt;

&lt;p&gt;It's about being lean and targeted in your learning efforts.&lt;/p&gt;




&lt;h2&gt;
  
  
  20. The AI Hype is Real. So is the Grind of Fundamentals
&lt;/h2&gt;

&lt;p&gt;It's an exciting time to be in data science and machine learning.&lt;/p&gt;

&lt;p&gt;Every week or maybe every hour, it seems there's a new groundbreaking paper, a powerful new AI model, or a new tool that promises to revolutionize some part of the workflow.&lt;/p&gt;

&lt;p&gt;The hype is crazy, and it's easy to feel like you need to be constantly chasing the latest and greatest.&lt;/p&gt;

&lt;p&gt;While it's important to stay aware of these developments and even experiment with new tools or the latest foundation models, I've also come to appreciate that the foundational skills of data science remain timeless and essential.&lt;/p&gt;

&lt;p&gt;The "grind" of learning how to properly clean data, engineer meaningful features, understand statistical principles, evaluate models rigorously, and communicate insights effectively - these things don't go out of style.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fundamentals are Your Anchor:&lt;/strong&gt; The tools and specific algorithms will change. The core principles of how to approach data-driven problem solving will endure. A strong foundation makes you more adaptable when new technologies emerge. You'll be better equipped to understand &lt;em&gt;how&lt;/em&gt; these new tools work and where they fit into the broader landscape.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Don't Neglect the Basics:&lt;/strong&gt; It can be tempting to jump straight to playing with advanced AI models. But if you haven't mastered the fundamentals of data preprocessing or evaluation, you'll struggle to use those advanced tools effectively or critically.&lt;/p&gt;

&lt;p&gt;Garbage in still means garbage out, no matter how sophisticated the model.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmekwixxuembtim3h2rw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmekwixxuembtim3h2rw.jpeg" alt="messy data meme" width="800" height="1094"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hype vs. Practical Application:&lt;/strong&gt; Some new technologies are genuinely transformative; others are overhyped or not yet mature enough for practical application in every context. Telling the difference requires a solid understanding of the underlying challenges and trade-offs in data science.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pace Yourself and Avoid Burnout:&lt;/strong&gt; Trying to keep up with every single new development is a recipe for burnout. It's okay to focus on mastering the fundamentals and then strategically explore new areas that are relevant to your interests or projects. This is a marathon, not a sprint.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, by all means, engage with the hype.&lt;/p&gt;

&lt;p&gt;Be curious.&lt;/p&gt;

&lt;p&gt;Experiment.&lt;/p&gt;

&lt;p&gt;But don't do it at the expense of building a rock-solid foundation in the core principles and practices of data science.&lt;/p&gt;

&lt;p&gt;That foundation is what will allow you to navigate the changing landscape with confidence and make a real impact.&lt;/p&gt;




&lt;h2&gt;
  
  
  21. Stop Comparing Your Timeline to Someone Else's
&lt;/h2&gt;

&lt;p&gt;This is a lesson I have to remind myself of constantly.&lt;/p&gt;

&lt;p&gt;In a field that attracts people from such diverse backgrounds - some with computer science PhDs, some transitioning from completely unrelated careers &lt;em&gt;(my sister is a former chef turned data scientist, now doing tech consulting at one of the big 4/5 companies)&lt;/em&gt;, some starting their journey at 18, others at 40 - it is incredibly easy to fall into the comparison trap.&lt;/p&gt;

&lt;p&gt;You see someone who seems to be learning faster, achieving more, or landing impressive roles, and you start to question your own progress and abilities.&lt;/p&gt;

&lt;p&gt;I've learned that progress in this field is deeply personal, and everyone's timeline and path will look different.&lt;/p&gt;

&lt;p&gt;Comparison is not only unproductive; it can be really negative for your motivation and self-esteem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Different Starting Points, Different Goals:&lt;/strong&gt; Everyone comes in with a unique set of prior experiences, skills, and learning styles. Your background in, say, biology might give you a unique perspective on certain datasets that someone with a pure math background wouldn't have, and vice-versa. Your career goals might also be different, leading you to focus on different areas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invisible Struggles:&lt;/strong&gt; You rarely see the full picture of someone else's journey - the late nights, the failed attempts, the moments of doubt, the lucky breaks, or the specific support systems they might have. You're often comparing your "behind the scenes" with their "highlight reel."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focus on Your Own Growth:&lt;/strong&gt; The most important metric is your own progress over time. Are you learning more than you knew last month? Are you building skills? Are you tackling more challenging problems?&lt;/p&gt;

&lt;p&gt;That's what truly matters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Celebrate Your Milestones:&lt;/strong&gt; Acknowledge and celebrate your own achievements, however small they may seem. Finishing a tough project, finally understanding a complex concept, getting a piece of code to work - these are all small wins worth recognizing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This journey is your own.&lt;/p&gt;

&lt;p&gt;Be inspired by others, learn from them, but don't measure your worth or your pace against theirs.&lt;/p&gt;

&lt;p&gt;Stay focused on your learning, be persistent, and trust that you're moving forward.&lt;/p&gt;

&lt;p&gt;You're probably doing much better than you think.&lt;/p&gt;




&lt;h2&gt;
  
  
  22. Build in Public (A Lesson I'm Learning Now)
&lt;/h2&gt;

&lt;p&gt;If I could go back and tell my Day 1 self one thing, this final point would be high on the list.&lt;/p&gt;

&lt;p&gt;For a long time, I kept my learning journey pretty private.&lt;/p&gt;

&lt;p&gt;I worked on projects, struggled through concepts, but was hesitant to share what I was doing, what I was learning, or (especially) what I was confused about.&lt;/p&gt;

&lt;p&gt;My thinking was, "I'll wait until I'm an expert, until I have something polished to show."&lt;/p&gt;

&lt;p&gt;This was despite having seen plenty of Reddit comments urging people to "learn publicly."&lt;/p&gt;

&lt;p&gt;I now believe that was a missed opportunity. Building in public - sharing your learning process, your projects (even the messy ones), your insights, and your questions along the way - can be an incredible catalyst for growth, connection, and even helping others.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerated Learning:&lt;/strong&gt; The act of trying to explain something you're learning to others (even in a blog post, a tweet thread, or a small presentation to peers) forces you to clarify your own understanding. &lt;em&gt;(classic Feynman technique)&lt;/em&gt; It exposes gaps in your knowledge that you can then address.&lt;/p&gt;

&lt;p&gt;Teaching is one of the best ways to learn.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Building a Network:&lt;/strong&gt; When you share what you're working on, you connect with other people who are interested in the same things.&lt;/p&gt;

&lt;p&gt;You can find mentors, collaborators, and peers who can offer support, advice, and different perspectives.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documenting Your Journey:&lt;/strong&gt; Sharing your progress creates a public record of your growth. This can be incredibly motivating to look back on. It also acts as an informal portfolio that shows your skills and your passion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Helping Others (Even as a Beginner):&lt;/strong&gt; You don't need to be an expert to provide value. Sharing your struggles and how you overcame them as a beginner can be incredibly helpful and relatable to others who are just starting out. Your "dumb questions" might be the exact questions someone else is too afraid to ask.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feedback and Improvement:&lt;/strong&gt; Putting your work out there opens you up to feedback, which can be invaluable for improvement (once you learn to filter the constructive from the noise).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, this blog you're reading? This is me trying to get into that "build in public" philosophy, even if it feels a bit vulnerable.&lt;/p&gt;

&lt;p&gt;If you're on this journey too, I encourage you to consider sharing it.&lt;/p&gt;

&lt;p&gt;Write about what you're learning.&lt;/p&gt;

&lt;p&gt;Post your small projects on GitHub.&lt;/p&gt;

&lt;p&gt;Ask questions in online communities.&lt;/p&gt;

&lt;p&gt;You might be surprised by how much it helps your own experience and how many people you can connect with and help along the way.&lt;/p&gt;




&lt;p&gt;And that's the brain dump after a year in this awesome, challenging, and a bit crazy field.&lt;/p&gt;

&lt;p&gt;It's been a ride, and I know I'm still just scratching the surface.&lt;/p&gt;

&lt;p&gt;But these are the lessons that have felt most significant so far.&lt;/p&gt;

&lt;p&gt;Hopefully, some of them resonate with you, wherever you are on your own data science and machine learning adventure.&lt;/p&gt;

&lt;p&gt;Keep learning, keep building, and keep asking questions!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>Reducing Churn in E-Commerce: My End-to-End Capstone Project in Predictive Modeling</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Mon, 09 Jun 2025 07:14:37 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/reducing-churn-in-e-commerce-my-end-to-end-capstone-project-in-predictive-modeling-2jn4</link>
      <guid>https://dev.to/shivamchhuneja/reducing-churn-in-e-commerce-my-end-to-end-capstone-project-in-predictive-modeling-2jn4</guid>
      <description>&lt;p&gt;Customer churn isn't just a marketing problem - it's a business survival issue. In competitive industries like e-commerce, losing one customer often means losing several revenue streams, especially when one account can represent multiple users.&lt;/p&gt;

&lt;p&gt;This post is a breakdown of my churn prediction capstone project for the postgraduate data science program at UT Austin - also tied to my master's in data science at Deakin U. The project was closed-source, so I can't release the full notebook, but I'll walk you through everything I did including code snippets, results, charts, what I learned, and where this project fits in my larger journey into machine learning and MLOps.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 The Problem: Why Churn Matters So Much
&lt;/h2&gt;

&lt;p&gt;The e-commerce company in question faced intense competitive pressure and a rising churn rate. Each lost account didn't just represent a single user - it impacted groups of customers. That meant multiplicative revenue loss.&lt;/p&gt;

&lt;p&gt;Our goals were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Predict churn at the account level&lt;/li&gt;
&lt;li&gt;  Identify top drivers of churn&lt;/li&gt;
&lt;li&gt;  Segment customers for targeted retention&lt;/li&gt;
&lt;li&gt;  Recommend cost-effective, revenue-positive strategies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧪 Dataset Overview &amp;amp; Initial Processing
&lt;/h2&gt;

&lt;p&gt;The dataset had ~11,000 rows and 19 columns, combining both categorical and numerical variables. A few key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Heavy class imbalance: Only ~17% of the accounts had churned&lt;/li&gt;
&lt;li&gt;  Several columns had missing values which were handled via KNN imputation&lt;/li&gt;
&lt;li&gt;  Mix of behavioral data (logins, payments, complaints), service scores, and engagement metrics&lt;/li&gt;
&lt;/ul&gt;
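&lt;p&gt;The project notebook is closed-source, so here's only a sketch of what KNN imputation looks like with scikit-learn's &lt;code&gt;KNNImputer&lt;/code&gt; - the toy frame and column names are stand-ins, not the real schema:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with gaps; columns are illustrative, not the actual 19
df = pd.DataFrame({
    "tenure": [1.0, 5.0, np.nan, 24.0, 3.0],
    "logins": [2.0, 14.0, 11.0, np.nan, 4.0],
})

# Each missing value is filled from the k nearest rows,
# measured on the columns that are observed
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isna().sum().sum())  # no missing values remain
```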

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnna4jjt4u163kroqp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnna4jjt4u163kroqp6.png" alt="Churn Distribution Chart" width="549" height="393"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 Exploratory Data Analysis (EDA)
&lt;/h2&gt;

&lt;p&gt;Key univariate and bivariate findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Device Used&lt;/strong&gt;: Desktop users churned more (~20%) than mobile (~16%)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Revenue Patterns&lt;/strong&gt;: Lower revenue/month was linked to higher churn&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Complaints&lt;/strong&gt;: Past complaints were strongly correlated with churn&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tenure&lt;/strong&gt;: Shorter tenure customers churned more&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Payment Method&lt;/strong&gt;: Debit and UPI users churned more than Credit Card users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvbsuyo9372nathfjsfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvbsuyo9372nathfjsfm.png" alt="Churn Rate Details" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, some expected predictors (like cashback or service score) didn't show strong correlation individually - indicating multi-factorial churn behavior.&lt;/p&gt;
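&lt;p&gt;Grouped churn rates like the device-level gap above come from a simple groupby over the 0/1 churn flag; a minimal sketch (the inline data and column names are illustrative):&lt;/p&gt;

```python
import pandas as pd

# Illustrative rows; the real data came from the closed-source dataset
df = pd.DataFrame({
    "device": ["desktop", "mobile", "desktop", "mobile", "desktop"],
    "churned": [1, 0, 0, 0, 1],
})

# Churn rate per category = mean of the 0/1 churn flag within each group
rates = df.groupby("device")["churned"].mean()
print(rates)
```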




&lt;h2&gt;
  
  
  📊 Clustering for Segmentation
&lt;/h2&gt;

&lt;p&gt;Rather than just modeling churn, we wanted actionable segmentation. Using the elbow method and KMeans, we identified 4 customer clusters.&lt;/p&gt;

&lt;p&gt;Each cluster showed different churn profiles, behavior patterns, and required tailored business strategies.&lt;/p&gt;
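&lt;p&gt;A sketch of the elbow method with scikit-learn - synthetic blobs stand in here for the scaled customer features:&lt;/p&gt;

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Elbow method: inertia (within-cluster sum of squares) for increasing k;
# the "elbow" where the drop flattens suggests the cluster count
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
print(inertias)
```

&lt;p&gt;Plotting &lt;code&gt;inertias&lt;/code&gt; against k gives a curve like the elbow chart above; the bend is where adding clusters stops paying off.&lt;/p&gt;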

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqa0tyox4ao6fmkuinqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqa0tyox4ao6fmkuinqg.png" alt="Clustering Elbow Chart" width="691" height="470"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  📌 Cluster Highlights &amp;amp; Strategic Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0069f3gcmahfpt2yu2ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0069f3gcmahfpt2yu2ux.png" alt="Churn Rate by Cluster" width="700" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cluster 1&lt;/strong&gt; -- Large Segment, High Churn (~20%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Male-dominated, married, mobile-first&lt;/li&gt;
&lt;li&gt;  Mid-to-high revenue, but frequent complaints&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Strategy&lt;/strong&gt;: Premium customer care, non-discount loyalty programs, debit-card reward partnerships&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cluster 3&lt;/strong&gt; -- Smallest, but Worst Churn (~21%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Newer users with high cashback dependency&lt;/li&gt;
&lt;li&gt;  Low service scores, high complaints, revenue decline&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Strategy&lt;/strong&gt;: Early engagement triggers, reduce cashback reliance, introduce tiered perks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cluster 0&lt;/strong&gt; -- Mid-Tier Churn (~13%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  HNI and Super users, higher spenders&lt;/li&gt;
&lt;li&gt;  Credit card and UPI users, moderate complaints&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Strategy&lt;/strong&gt;: VIP retention plans, gamified reward systems, re-engagement emails&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cluster 2&lt;/strong&gt; -- Best Segment (~5% Churn)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Long-tenure, high-revenue customers&lt;/li&gt;
&lt;li&gt;  Highest satisfaction, lowest complaints&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Strategy&lt;/strong&gt;: Just keep them happy! Introduce premium perks and incentivize digital payments&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🤖 ML Model Evaluation &amp;amp; Results
&lt;/h2&gt;

&lt;p&gt;This wasn't just a segmentation project - I evaluated 6 different machine learning models for churn prediction:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Test Accuracy&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1-Score&lt;/th&gt;
&lt;th&gt;AUC-ROC&lt;/th&gt;
&lt;th&gt;Overfitting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decision Tree&lt;/td&gt;
&lt;td&gt;0.9312&lt;/td&gt;
&lt;td&gt;0.7732&lt;/td&gt;
&lt;td&gt;0.8364&lt;/td&gt;
&lt;td&gt;0.8035&lt;/td&gt;
&lt;td&gt;0.8934&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;0.9742&lt;/td&gt;
&lt;td&gt;0.9397&lt;/td&gt;
&lt;td&gt;0.9050&lt;/td&gt;
&lt;td&gt;0.9220&lt;/td&gt;
&lt;td&gt;0.9936&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;0.9694&lt;/td&gt;
&lt;td&gt;0.9189&lt;/td&gt;
&lt;td&gt;0.8971&lt;/td&gt;
&lt;td&gt;0.9079&lt;/td&gt;
&lt;td&gt;0.9908&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AdaBoost&lt;/td&gt;
&lt;td&gt;0.8708&lt;/td&gt;
&lt;td&gt;0.5957&lt;/td&gt;
&lt;td&gt;0.7230&lt;/td&gt;
&lt;td&gt;0.6532&lt;/td&gt;
&lt;td&gt;0.9025&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naïve Bayes&lt;/td&gt;
&lt;td&gt;0.7127&lt;/td&gt;
&lt;td&gt;0.3386&lt;/td&gt;
&lt;td&gt;0.7414&lt;/td&gt;
&lt;td&gt;0.4648&lt;/td&gt;
&lt;td&gt;0.7770&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;0.7744&lt;/td&gt;
&lt;td&gt;0.4122&lt;/td&gt;
&lt;td&gt;0.7995&lt;/td&gt;
&lt;td&gt;0.5440&lt;/td&gt;
&lt;td&gt;0.8526&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tuned RF&lt;/td&gt;
&lt;td&gt;0.9765&lt;/td&gt;
&lt;td&gt;0.9822&lt;/td&gt;
&lt;td&gt;0.8760&lt;/td&gt;
&lt;td&gt;0.9261&lt;/td&gt;
&lt;td&gt;0.9960&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Random Forest &amp;amp; XGBoost were the top performers with high precision and AUC scores, and no signs of overfitting, so I ended up tuning the random forest further.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I also experimented with SMOTE for handling class imbalance, but performance dropped significantly across models - so oversampling isn't always the solution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
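&lt;p&gt;The evaluation loop behind a table like this can be sketched as follows - a synthetic imbalanced dataset stands in for the real one, and only two of the six models are shown:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced stand-in dataset (roughly 17% positives, mirroring the churn rate)
X, y = make_classification(n_samples=2000, weights=[0.83], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    # Same metrics as the table: precision, recall, F1, AUC-ROC on the test split
    results[name] = {
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }
print(results)
```

&lt;p&gt;Comparing train vs. test scores in the same loop is how the "Overfitting" column gets filled in.&lt;/p&gt;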

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eduhept8u9u3a0ic0cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eduhept8u9u3a0ic0cc.png" alt="ROC &amp;amp; Precision Recall Curve" width="800" height="329"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0hv2q798496ttlokvt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0hv2q798496ttlokvt0.png" alt="ROC &amp;amp; Precision Recall Curve SMOTE" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 Feature Importance (XGBoost)
&lt;/h2&gt;

&lt;p&gt;Top predictors of churn from the model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonz3aloovfa7pqckn0jf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonz3aloovfa7pqckn0jf.png" alt="Feature Importance Bar Chart" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Tenure and Complaints were the most influential features.&lt;/li&gt;
&lt;li&gt;  Other strong indicators: customer care interaction history, cashback dependency, and service score.&lt;/li&gt;
&lt;/ul&gt;
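&lt;p&gt;Extracting importances follows the same pattern across tree ensembles; here's a sketch using scikit-learn's random forest as a stand-in for XGBoost (&lt;code&gt;XGBClassifier&lt;/code&gt; exposes the same &lt;code&gt;feature_importances_&lt;/code&gt; attribute after fitting), with hypothetical feature names:&lt;/p&gt;

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in features; the real ones included tenure, complaints,
# cashback dependency, and service score
X, y = make_classification(n_samples=500, n_features=4, n_informative=2, random_state=42)
cols = ["tenure", "complaints", "cashback", "service_score"]
frame = pd.DataFrame(X, columns=cols)

# feature_importances_ sums to 1; sorting it gives the bar chart above
model = RandomForestClassifier(random_state=42).fit(frame, y)
importances = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```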




&lt;h2&gt;
  
  
  🎯 Business Outcomes &amp;amp; Learnings
&lt;/h2&gt;

&lt;p&gt;This wasn't just a modeling exercise - it was a strategy recommendation engine.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Churn is complex&lt;/strong&gt;: No one variable dominates, but multiple small signals do.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Clustering works&lt;/strong&gt;: It lets us build segment-specific retention tactics.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Complaints matter&lt;/strong&gt;: Past complaints + low satisfaction are gold for early churn signals.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cashback ≠ loyalty&lt;/strong&gt;: High discount reliance often signals at-risk behavior.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧭 Business Recommendations by Cluster
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Cluster 1 (Large Segment, High Churn ~20%)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Premium support: Faster complaint resolution and priority escalation.&lt;/li&gt;
&lt;li&gt;  Loyalty rewards without discounts: e.g., early access perks.&lt;/li&gt;
&lt;li&gt;  Debit-card partner offers &amp;amp; wallet payment nudges.&lt;/li&gt;
&lt;li&gt;  Predictive churn alerts + personalized retention bonuses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cluster 3 (Smallest, Highest Churn ~21%)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  High-touch onboarding: Loyalty bonuses in first 3--6 months.&lt;/li&gt;
&lt;li&gt;  Reduce cashback reliance: Use membership perks instead.&lt;/li&gt;
&lt;li&gt;  VIP service model for high revenue churners.&lt;/li&gt;
&lt;li&gt;  Improve care response time: Add real-time channels.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cluster 2 (Lowest Churn ~5%)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Maintain delight: Premium memberships, early sales access.&lt;/li&gt;
&lt;li&gt;  Encourage digital payments with UPI/wallet incentives.&lt;/li&gt;
&lt;li&gt;  Reward repeat purchases via tiered cashback.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cluster 0 (Moderate Churn ~13%)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  HNI support: Personalized account managers.&lt;/li&gt;
&lt;li&gt;  Gamified engagement: Spend &amp;amp; Earn reward mechanics.&lt;/li&gt;
&lt;li&gt;  Proactive re-engagement emails.&lt;/li&gt;
&lt;li&gt;  Introduce premium subscription loyalty tier.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 What's Next?
&lt;/h2&gt;

&lt;p&gt;This project helped me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Think beyond models and into business strategy&lt;/li&gt;
&lt;li&gt;  Practice handling imbalanced data&lt;/li&gt;
&lt;li&gt;  Understand how to build insights with both EDA and clustering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's part of a broader set of real-world projects I'll be tackling throughout my second year of the master's program - with more blogs and eventually open-sourced templates (when possible).&lt;/p&gt;

&lt;p&gt;If you're interested in just nerding out - feel free to connect or drop a note.&lt;/p&gt;




&lt;p&gt;Thanks for reading! 🧠📊&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Day 4 of Learning: Switching Between Golang and Deep Learning</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Sat, 17 May 2025 07:17:12 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/day-4-of-learning-switching-between-golang-and-deep-learning-1li</link>
      <guid>https://dev.to/shivamchhuneja/day-4-of-learning-switching-between-golang-and-deep-learning-1li</guid>
      <description>&lt;p&gt;It's Day 4, and today was a little different - not heavy on code, but still a strong day of learning. I'm trying to balance two learning paths right now: backend fundamentals with Go and ML + deep learning using PyTorch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping My Head Around Pointers and Structs in Go
&lt;/h2&gt;

&lt;p&gt;Most of today's Go time was spent clarifying how data is passed and accessed when using pointers and structs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;FirstName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;LastName&lt;/span&gt;  &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;BirthDate&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;createdAt&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Struct_fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;helpers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StrUserInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Please enter your first name: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;helpers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StrUserInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Please enter your last name: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;BirthDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;helpers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StrUserInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Please enter your birthdate (MM/DD/YYYY): "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;appUser&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;BirthDate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BirthDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;outputsUserDetails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;appUser&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;outputsUserDetails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BirthDate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ClearUserName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LastName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even after completing the basics, I had to go back to ChatGPT, asking follow-up questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;When does Go actually copy a struct?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;What's the best way to mutate values without triggering memory issues?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Is referencing a pointer enough if my struct has nested fields?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
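
&lt;p&gt;For the first two questions, a tiny standalone sketch (hypothetical names, not from the course) finally made it click: passing a struct by value hands the function a copy, while passing a pointer lets it mutate the original.&lt;/p&gt;

```go
package main

import "fmt"

// Person is a hypothetical struct for this sketch.
type Person struct {
	FirstName string
}

// rename gets a copy of p; the caller's struct is untouched.
func rename(p Person) {
	p.FirstName = "Changed"
}

// renameViaPointer gets p's address; the caller's struct is mutated.
func renameViaPointer(p *Person) {
	p.FirstName = "Changed"
}

func main() {
	p := Person{FirstName: "Shivam"}

	rename(p) // Go copies the whole struct here
	fmt.Println(p.FirstName) // still "Shivam"

	renameViaPointer(&p) // only the address is copied
	fmt.Println(p.FirstName) // "Changed"
}
```

&lt;p&gt;(On the third question: a pointer to the outer struct is enough to reach and mutate value-type nested fields, while nested slice, map, or pointer fields are shared references either way.)&lt;/p&gt;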

&lt;p&gt;This kind of back-and-forth learning is slower than a typical tutorial. But the mental model is forming. I'm not rushing to build things - I'm trying to deeply understand how Go works under the hood. Especially when it comes to memory management.&lt;/p&gt;

&lt;p&gt;Ten years ago, I would've given up on a concept like this. Today, I'm just taking it step by step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Re-entering the Deep Learning World
&lt;/h2&gt;

&lt;p&gt;Alongside Go, I've started brushing up on deep learning - mostly because my second year of the master's program is around the corner, and it's going to be all about machine learning, MLOps, and system-level thinking.&lt;/p&gt;

&lt;p&gt;I've started tinkering with PyTorch again. Right now, it's about understanding artificial neural networks (ANNs) from the ground up. Not just watching videos --- but opening up a notebook and writing everything manually.&lt;/p&gt;

&lt;p&gt;I like PyTorch. Maybe because I've touched computer vision before during my internship in Denmark --- when I was at Aalborg University. We were working with OpenCV, and although it was research-focused, it gave me an early glimpse into how machines can &lt;em&gt;see&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I guess that curiosity never really went away.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Bit More Structure This Time
&lt;/h2&gt;

&lt;p&gt;I've also been thinking about structure --- not just in neural networks, but in how I organize my codebases.&lt;/p&gt;

&lt;p&gt;Since I'm keeping all my Go learning in a single repository, I had to restructure the project to keep things clean. Instead of one massive file, I split everything into packages: calculators, helpers, pointers, and so on.&lt;/p&gt;

&lt;p&gt;That required learning how to export functions properly in Go, how to keep imports clean, and how to avoid circular dependencies. It wasn't difficult, but it did take a bit of trial and error. And honestly? It's these small structural lessons that make you feel like you're learning &lt;em&gt;how to think like a developer&lt;/em&gt;.&lt;/p&gt;
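
&lt;p&gt;The export rule itself is simple once you see it: visibility of a package-level name in Go is decided purely by its first letter. A minimal sketch (hypothetical function names, not my actual repo):&lt;/p&gt;

```go
package main

import "fmt"

// Greet starts with an uppercase letter, so it is exported:
// any package importing this one could call it.
func Greet(name string) string {
	return "Hello, " + name
}

// greetQuietly starts lowercase, so it is unexported:
// only code inside this package can call it.
func greetQuietly(name string) string {
	return "hi, " + name
}

func main() {
	fmt.Println(Greet("Gopher"))        // Hello, Gopher
	fmt.Println(greetQuietly("Gopher")) // hi, Gopher
}
```

&lt;p&gt;The same first-letter rule applies to types and struct fields, which is also why things like JSON marshaling only see capitalized fields.&lt;/p&gt;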

&lt;p&gt;&lt;em&gt;(P.S. I wrote more about this restructuring in &lt;a href="https://codebynight.dev/posts/day-3-concurrency-and-go-project-structure-refactor/" rel="noopener noreferrer"&gt;Day 3's post&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Over the coming week, I plan to keep switching tracks --- Go for backend fundamentals and PyTorch for deep learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I'll be diving deeper into &lt;strong&gt;structs, interfaces, and generics&lt;/strong&gt; in Go (based on the course I'm following).&lt;/li&gt;
&lt;li&gt;  I'll continue reading &lt;strong&gt;Concurrency in Go&lt;/strong&gt;, trying to wrap my head around things like deadlocks, livelocks, and atomic memory access.&lt;/li&gt;
&lt;li&gt;  I'll also keep going with &lt;a href="https://github.com/shivamchhuneja/interpreter-in-go" rel="noopener noreferrer"&gt;&lt;strong&gt;Building an Interpreter in Go&lt;/strong&gt;&lt;/a&gt; --- that repo is where I try to apply my Go learning to something more complex.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the deep learning side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I'll keep working through &lt;strong&gt;artificial neural networks&lt;/strong&gt; using PyTorch --- and slowly move toward feedforward networks and autoencoders.&lt;/li&gt;
&lt;li&gt;  There's a &lt;strong&gt;whole MLFlow module&lt;/strong&gt; I'm yet to touch --- which includes tracking experiments and integrating ML models.&lt;/li&gt;
&lt;li&gt;  I'll also revisit &lt;strong&gt;cross-validation, generalization, and overfitting&lt;/strong&gt;, including hands-on experimentation with scikit-learn and PyTorch Dataloaders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And of course, I'll continue reading a few chapters of &lt;em&gt;AI Engineering&lt;/em&gt; by Chip Huyen --- to understand how real-world AI systems are built and deployed.&lt;/p&gt;

&lt;p&gt;It's going to be a packed week, but I'm looking forward to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  That's All for Day 4
&lt;/h2&gt;

&lt;p&gt;Today wasn't flashy. I didn't finish a big project or write hundreds of lines of code.&lt;/p&gt;

&lt;p&gt;But I did get clarity. On how Go handles memory. And on how I want to structure my learning going forward.&lt;/p&gt;

&lt;p&gt;That's progress in my book.&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Why I Chose to Learn Swift After the Age of 30</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Thu, 24 Apr 2025 12:44:16 +0000</pubDate>
      <link>https://dev.to/shivamchhuneja/why-i-chose-to-learn-swift-after-the-age-of-30-4e0o</link>
      <guid>https://dev.to/shivamchhuneja/why-i-chose-to-learn-swift-after-the-age-of-30-4e0o</guid>
      <description>&lt;h2&gt;
  
  
  The urge to create something beautiful for iPhone
&lt;/h2&gt;

&lt;p&gt;I didn’t think I’d be learning how to build iPhone apps at 30. Not while juggling a product marketing career, a master’s in data science, family life, and the sinking feeling that I really should have taken coding more seriously 10 years ago.&lt;/p&gt;

&lt;p&gt;The unfortunate reality is that the way coding was taught &lt;em&gt;(electronics engineering degree)&lt;/em&gt;, especially how "pointers" in C++ were taught, made me believe coding wasn't for me. It felt too boring, too restrictive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmzxwta0kbgupx00xlod.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmzxwta0kbgupx00xlod.jpeg" alt="learn swiftui development" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But now I know those "restrictive" rules are what open unlimited creative freedom!&lt;/p&gt;

&lt;h2&gt;
  
  
  What wasn't working for me
&lt;/h2&gt;

&lt;p&gt;Even though I’ve been a content creator for a long time - writing, building brands, running my own business - I hadn’t built something in a long time that made me feel creatively proud.&lt;/p&gt;

&lt;p&gt;Sure, I’ve shipped valuable content. My master’s in data science lets me dig deeper into the numbers. But something was still missing: that creative spark - the kind you get when you’re not just expressing ideas, but building something tangible.&lt;/p&gt;

&lt;p&gt;That itch had been quietly growing. I missed the feeling of watching something come to life because I made it happen. And weirdly enough, it was coding that could make it happen - the thing I always thought was “too late or too boring” to learn.&lt;/p&gt;

&lt;p&gt;I didn’t want to code just for the sake of it. I wanted visual feedback. That immediate “aha!” when something runs. Backend programming? Too abstract for my brain. Web dev? Maybe.&lt;/p&gt;

&lt;p&gt;But it was iOS - the smoothness, the consistency, the craft of apps that just work - that spoke to me.&lt;/p&gt;

&lt;p&gt;I’ve been an Apple user for years, not just for aesthetics but because things just… work. My iPhone is 6 years old &amp;amp; runs as smoothly today as it did on day one - no lag, no weird bugs, just clean user experience minus the storage issues lol &lt;em&gt;(it's 64GB)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;iPhones last a long time. Even Abe Lincoln had one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0928z50ubmb0afikqpwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0928z50ubmb0afikqpwz.png" alt="learn siwftui development" width="303" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compare that to a similarly aged Android phone, and the difference is night and day.&lt;/p&gt;

&lt;p&gt;That quality made me curious. Who builds iOS apps and how?&lt;/p&gt;

&lt;p&gt;And more importantly: Why not me?&lt;/p&gt;

&lt;h2&gt;
  
  
  Discovering Swift &amp;amp; iOS
&lt;/h2&gt;

&lt;p&gt;When I finally started exploring iOS development, I wasn’t exactly sure what I was walking into. Swift sounded elegant. Fast. Clean. And honestly, it felt… inviting in a way that many other programming languages didn’t.&lt;/p&gt;

&lt;p&gt;I wasn’t looking for the easiest route - I’ve been through enough to know that “easy” rarely leads to meaningful. What I was looking for was something I could see myself growing with. And Swift felt native, not just to the platform, but to the kind of builder I wanted to become. Maybe it’s the declarative nature.&lt;/p&gt;

&lt;p&gt;Anyways, what really tipped it over for me was the creative control Swift and SwiftUI gave me. I could design and code in the same space. I could visualize my layout, see it update in real-time, and tweak it with intention. It didn’t feel like hacking things together - it felt like crafting something, view by view.&lt;/p&gt;

&lt;p&gt;Swift made me feel like I could build beautiful things - not just functional ones. And that mattered to me more than I expected.&lt;/p&gt;

&lt;p&gt;This wasn’t just about learning how to code. It was about finally having a creative outlet that felt aligned with both my tech side and my artistic instincts.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Early Struggles and First Wins with Swift
&lt;/h2&gt;

&lt;p&gt;I’ve only been learning Swift for about a month and a half as of writing this post - so let’s be real: I’m still figuring things out.&lt;/p&gt;

&lt;p&gt;I have coded in Python for about 2.5 years or so, almost all of it in and around machine learning and data science.&lt;/p&gt;

&lt;p&gt;And that experience with Python makes it slightly easier for me to pick Swift up - since I get the logic.&lt;/p&gt;

&lt;p&gt;Anyways, the first few days were messy. Xcode felt like... well, you know how it feels... it's weird, peculiar, but to be honest it is a really, really good environment.&lt;/p&gt;

&lt;p&gt;I kept wondering if I had installed things wrong. Errors everywhere - specific ones in and around SwiftData and whatnot.&lt;/p&gt;

&lt;p&gt;But at the same time, there were these tiny wins that kept me going.&lt;/p&gt;

&lt;p&gt;Like the first time I got a button to actually change light mode to dark mode, with a gradient background, ahh, it was beautiful.&lt;/p&gt;

&lt;p&gt;It sounds basic - and it is - but seeing that change happen because I wrote the code? That hit different.&lt;/p&gt;

&lt;p&gt;Another small win? Just pushing my first little SwiftUI app to GitHub. Nothing fancy, just a few screens stitched together. But it felt like crossing a line. Like I went from someone watching tutorials to someone doing stuff. &lt;em&gt;I had done this one to understand how Xcode works more than to actually build the app.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I still run into weird layout bugs. I still forget what some modifiers are called. But I’m not frozen anymore. I’m moving. And that feels good.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And btw, AI suggestions, ChatGPT, the Apple docs, and even the Apple Developer YouTube channel are super helpful for all of this!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Journey Matters to Me
&lt;/h2&gt;

&lt;p&gt;This isn’t just about learning to code or adding another skill to my resume.&lt;/p&gt;

&lt;p&gt;For me, this is about building something of my own. Something that doesn’t rely on algorithms or client briefs. Something that doesn’t need to perform on social media (&lt;em&gt;or maybe it does&lt;/em&gt;). Just something real - made by me, for me, for now.&lt;/p&gt;

&lt;p&gt;It’s also about proving to myself that I can start something new, even now, even after 30. That I don’t have to “catch up” to people who started at 18. That I can learn at my own pace and still make progress.&lt;/p&gt;

&lt;p&gt;I’ve always been curious. I’ve always wanted to understand how things work. And now I’m finally channeling that into something I can shape with my own hands (well, keyboard).&lt;/p&gt;

&lt;p&gt;Swift and iOS just happen to be the tools I’m using. The real reason I’m doing this is to reconnect with that builder in me - the one I kind of lost along the way while chasing growth, deadlines, and whatever came next.&lt;/p&gt;

&lt;p&gt;This feels like coming back to something I didn’t know I needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  You’re Not Too Late Either
&lt;/h2&gt;

&lt;p&gt;If you’re in your 30s (or beyond) and thinking about learning something new - especially something like iOS development - I just want to say this: you’re not too late.&lt;/p&gt;

&lt;p&gt;I know it feels like everyone’s already ahead. I know it’s easy to scroll past yet another 22-year-old who “just shipped their fifth app or is doing $23k MRR as an indie dev” and feel like you missed the boat.&lt;/p&gt;

&lt;p&gt;But you haven’t.&lt;/p&gt;

&lt;p&gt;You’re bringing life experience, focus, and a different kind of drive to the table. You’re not doing this to show off. You’re doing this because you want to build. And that counts for something.&lt;/p&gt;

&lt;p&gt;I’m not an expert. I’m still learning.&lt;/p&gt;

&lt;p&gt;So if you’ve been sitting on the fence, unsure whether to dive in — maybe this is your sign.&lt;/p&gt;

&lt;p&gt;You’re just getting started too, just like me.&lt;/p&gt;

</description>
      <category>ios</category>
      <category>swift</category>
      <category>learning</category>
      <category>coding</category>
    </item>
    <item>
      <title>The 2024 DORA Report: State of DevOps Breakdown Summary</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Wed, 06 Nov 2024 12:49:05 +0000</pubDate>
      <link>https://dev.to/middleware/the-2024-dora-report-state-of-devops-breakdown-summary-36k8</link>
      <guid>https://dev.to/middleware/the-2024-dora-report-state-of-devops-breakdown-summary-36k8</guid>
      <description>&lt;p&gt;For the past 10 years, we have seen Accelerate State of DevOps: Report released annually, built upon the insights and data from thousands of industry respondents. The &lt;a href="https://dora.dev/research/2024/dora-report/" rel="noopener noreferrer"&gt;2024 DORA Report&lt;/a&gt; was published recently.&lt;/p&gt;

&lt;p&gt;As you can probably guess, the report is packed with data and insights that not only give a deep dive into software delivery and operations but also pack a lot of value for Engineering Managers, Engineering Leaders and Software Engineers.&lt;/p&gt;

&lt;p&gt;So, we thought why not take our notes from this year's DORA Report, expand on them a little bit and share them with you in an article. This blog provides a summarized view of the report's findings, highlighting what matters most for team productivity, AI integration, and platform engineering.&lt;/p&gt;

&lt;p&gt;Of course, since this is a summary, it cannot cover all the nuances the DORA team covers in the complete report. So while we do think this article is valuable, we still believe that going through this blog alone will not be enough, especially when it comes to context and methodology.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Granny sighed. "You have learned something," she said, and thought it safe to insert a touch of sternness into her voice. "They say a little knowledge is a dangerous thing, but it is not one half so bad as a lot of ignorance."\&lt;br&gt;
-- Terry Pratchett, Equal Rites&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want exceptional visibility &amp;amp; actionable insights into your org's software delivery process, &lt;a href="https://www.middlewarehq.com/?utm_source=blog&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;Middleware&lt;/a&gt; is your go-to tool to get started within minutes &amp;amp; kick off a process transformation!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.middlewarehq.com/?utm_campaign=blogbanner&amp;amp;utm_source=blog&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXf_uTl6oEC3O8PrluDlDC6Au8hArL_aW9TTHT0kgiIJ3v7EJ7fVMr4xWacg2UbbXjhFTCAI8ZBxfkvy45LX7OXMmBZNRAyhkDIH27QqjXThAqTD6LV3kkuS1Fnw1a3V2G21QQMKLUROfZVRCmBaNkh32tRd%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" alt="middleware open source dora metrics" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. State of Software Delivery Performance
&lt;/h2&gt;

&lt;p&gt;Performance benchmarks this year look as crazy as they do every year when we compare elite teams to low-performing teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXf51ngpFOPD_97K3SGdOKzFQDnj9Bu-HXwovx_qZbITxXLI0oHpmpXP_3HAIjsXfbRT2yXxUX_dKILmPs1M1IDsdEfea2lbwbC47KeNbgDlMQSiW4o2LWeMVXpIRi4WyCjxPlJ5cB-z3pCpkh7u72hoQEQ%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXf51ngpFOPD_97K3SGdOKzFQDnj9Bu-HXwovx_qZbITxXLI0oHpmpXP_3HAIjsXfbRT2yXxUX_dKILmPs1M1IDsdEfea2lbwbC47KeNbgDlMQSiW4o2LWeMVXpIRi4WyCjxPlJ5cB-z3pCpkh7u72hoQEQ%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" alt="2024 Dora Report Team Benchmarks" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chart from The 2024 DORA Report, Pg 14&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Elite teams show unparalleled efficiency and recovery rates, which definitely say something about the value of mature DevOps practices.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;127x faster lead time for changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;182x more frequent deployments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;8x lower change failure rate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2293x faster failure recovery time&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Elite teams set a high bar, showing how consistent, high-velocity performance can dramatically improve software delivery outcomes.&lt;/p&gt;

&lt;p&gt;However, we loved what the report says about elite performance: &lt;em&gt;"The best teams are those that achieve elite improvement, not necessarily elite performance"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is something we have mentioned in our articles around &lt;a href="https://middlewarehq.com/blog/master-dora-metrics-to-master-world-class-software-delivery-processes?utm_source=blog&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;DORA Metrics&lt;/a&gt; as well. Two distinct teams, of different sizes, delivering different software to different users, should not be compared in absolute terms without context.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. AI's Role in Software Delivery
&lt;/h2&gt;

&lt;p&gt;AI adoption is picking up rapidly, yet concerns around trust in AI-generated code remain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfmIv3eWDCw27ygQP9Qi7cZaqj5XPj7cc2zmeZMM_k9xcnhXMk1gis0Dvzd9ARij_TLtdwI3aKamn5Nre_5LJJhaMeDRKqeHFP0ECnxOmkI-fK2uW3i63hMn1_tLdNo2WxEYrto7zqjvbxGZ3yVD5DeA7GN%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfmIv3eWDCw27ygQP9Qi7cZaqj5XPj7cc2zmeZMM_k9xcnhXMk1gis0Dvzd9ARij_TLtdwI3aKamn5Nre_5LJJhaMeDRKqeHFP0ECnxOmkI-fK2uW3i63hMn1_tLdNo2WxEYrto7zqjvbxGZ3yVD5DeA7GN%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" alt="2024 dora report" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chart from The 2024 DORA Report, Pg 20&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This year's report dove into how teams are adapting to AI integration and its impact on productivity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;39% of developers report little or no trust in AI-generated code quality&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;81% say their organizations have shifted priorities to increase AI incorporation into applications&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At an individual level, 75.9% rely on AI for things like writing code, summarizing information, writing documentation, writing tests, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building trust and transparency in AI tools is essential as organizations increasingly adopt AI in development workflows. Even though AI is helping people do meaningful work with increased productivity, the overall sentiment still remains one of some concern.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One participant even likened the need to evaluate and modify the outputs of AI-generated code to "the early days of StackOverflow, [when] you always thought people on StackOverflow are really experienced, you know, that they will know exactly what to do. And then, you just copy and paste the stuff, and things explode " (P2). --- 2024 DORA Report, Pg 24&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Platform Engineering's Impact on Developer Experience
&lt;/h2&gt;

&lt;p&gt;Platform engineering transforms developer workflows, enabling self-service and reducing friction.&lt;/p&gt;

&lt;p&gt;Platforms behave much like broader transformation efforts: the early effects tend to be positive, with a dip in the mid-term and a recovery as the internal platform matures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXeJbvOKyhin1WeSBX5Zwx8k4xcHlXwD9twLfBqlUW6uT9ICnJhS2fg6bVhmuFVEnKmCdaQ4_kuzAiY9xuuNFamSE2eJ-Mlsvi1DE2GW5nE51Q3Sh8zXhnHZKZKOFxnHf2du8YiNKZ2zfA35gT3o8rIEpk2i%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXeJbvOKyhin1WeSBX5Zwx8k4xcHlXwD9twLfBqlUW6uT9ICnJhS2fg6bVhmuFVEnKmCdaQ4_kuzAiY9xuuNFamSE2eJ-Mlsvi1DE2GW5nE51Q3Sh8zXhnHZKZKOFxnHf2du8YiNKZ2zfA35gT3o8rIEpk2i%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" alt="2024 dora report chart" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chart from The 2024 DORA Report, Pg 50&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;8% increase in individual productivity with internal platforms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;6% gain in productivity at the team level&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And here we go: an 8% decrease in throughput &amp;amp; 14% decrease in change stability!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While platforms smooth out processes, they introduce new layers of complexity that can impact throughput and stability. Increased handoffs and dependencies may hinder speed.&lt;/p&gt;

&lt;p&gt;Platform teams should balance automation with flexibility to prevent these operational slowdowns.&lt;/p&gt;

&lt;p&gt;Overall, internal platforms show great promise in boosting productivity and team efficiency across development organizations; however, they are not a cure-all!&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Developer Independence and Self-Service Workflows
&lt;/h2&gt;

&lt;p&gt;Developer autonomy, a core principle of platform engineering, correlates strongly with productivity gains.&lt;/p&gt;

&lt;p&gt;Self-service capabilities reduce dependencies on enabling teams and accelerate project timelines (owing to fewer handoffs and touchpoints within the process).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  5% productivity gain at individual and team levels for developers without an "enabling team"&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Interestingly, the impact on productivity of having a dedicated platform team was negligible for individuals. However, it resulted in a 6% gain in productivity at the team level. This finding is surprising because of its uneven impact, suggesting that having a dedicated platform team is useful to individuals, but the dedicated platform team is more impactful for teams overall.&lt;/p&gt;

&lt;p&gt;Since teams have multiple developers with different responsibilities and skills, they naturally have a more diverse set of tasks when compared to an individual engineer. It is possible that having a dedicated platform engineering team allows the platform to be more supportive of the diversity in tasks represented by a team. --- 2024 DORA Report, Pg 52&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. The Role of Documentation in Developer Productivity
&lt;/h2&gt;

&lt;p&gt;While Agile emphasizes "working software over documentation," DORA's findings highlight that quality documentation is essential for effective development.&lt;/p&gt;

&lt;p&gt;Strong documentation isn't just about quantity but ensuring content is findable and reliable. A user-centered approach to documentation supports developer independence and enables smoother workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfSkmTl_HrkvLf4YPBt6vXlFOQXPMqqMa-4GhuPh-vKMSjzO8CmzrSavWFlhmG0MJQz0A_KjBMcswEs68thPMRC8_zFwYfXS7ULRt2zjG-z3JpfPr7hlsdGV1Xrw1AA5HyV9dwZhcjb4g1JTgUo2-LNKMNA%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfSkmTl_HrkvLf4YPBt6vXlFOQXPMqqMa-4GhuPh-vKMSjzO8CmzrSavWFlhmG0MJQz0A_KjBMcswEs68thPMRC8_zFwYfXS7ULRt2zjG-z3JpfPr7hlsdGV1Xrw1AA5HyV9dwZhcjb4g1JTgUo2-LNKMNA%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" alt="2024 dora report documentation chart" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chart from The 2024 DORA Report, Pg 63&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Focus on findability and reliability to keep documentation useful&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Promote a sustainable documentation culture that maintains relevance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User-focused documentation amplifies technical capabilities and organizational impact&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, well-maintained, user-centered documentation is foundational to productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. User-Centric Focus and Transformational Leadership
&lt;/h2&gt;

&lt;p&gt;Focusing on the user in software development yields notable gains, with transformational leadership playing a significant role.&lt;/p&gt;

&lt;p&gt;Leaders who empower teams and align projects with user needs increase productivity, satisfaction, and overall organizational performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;User-centered development correlates with a 40% boost in organizational performance (&lt;em&gt;2023 metric&lt;/em&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong transformational leadership (a 25% increase in transformational leadership) leads to a 9% productivity increase&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Organizations combining user-centric practices with transformational leadership push teams toward success and impactful, user-aligned products.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Wrap This
&lt;/h2&gt;

&lt;p&gt;The 2024 DORA Report highlights that robust DevOps practices, responsible AI integration, and thoughtful platform engineering are key to high performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.middlewarehq.com/?utm_campaign=blogbanner&amp;amp;utm_source=blog&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcb5rrYglh_F0We7Ubax1-QnzvC8FnN_w1DAsdsEn2FJVHqb2BQYWuXRqzVj1qzpDAE5n88bWIvRXQeb1NucbmrC2Tu7pkT7iX-Pt6XSRUyjSIFNASv90rD1YxI90bmLxw_VMwPNxvGZBMf22denf1mk6KG%3Fkey%3DHKC5P0j3wzkNuQgifEkTgA" alt="middleware dora metrics open source" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For teams looking to achieve these benchmarks, &lt;a href="https://www.middlewarehq.com/?utm_source=blog&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;Middleware&lt;/a&gt; offers a straightforward, quick way to measure DORA metrics, project flow metrics, and bottleneck analysis out of the box, helping you refine your software delivery practices.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>productivity</category>
      <category>programming</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Optimizing OpenStreetMap’s DevOps Efficiency: A Data-Driven DORA Metrics Analysis</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Mon, 23 Sep 2024 10:47:28 +0000</pubDate>
      <link>https://dev.to/middleware/optimizing-openstreetmaps-devops-efficiency-a-data-driven-dora-metrics-analysis-4h8</link>
      <guid>https://dev.to/middleware/optimizing-openstreetmaps-devops-efficiency-a-data-driven-dora-metrics-analysis-4h8</guid>
      <description>&lt;p&gt;With the core goal to build a free, editable map of the world, &lt;a href="https://github.com/openstreetmap/openstreetmap-website" rel="noopener noreferrer"&gt;OpenStreetMap's website&lt;/a&gt; repo is where all the magic happens.&lt;/p&gt;

&lt;p&gt;This open-source project, powered by Ruby, runs on a great community of developers and cartographers across the planet.&lt;/p&gt;

&lt;p&gt;We've been examining some notable repositories in our 100 Days of DORA case studies series and have already uncovered plenty of interesting findings.&lt;/p&gt;

&lt;p&gt;In this case study, we break down OpenStreetMap's DevOps game using DORA metrics, diving into real-world data to uncover how often code ships, how fast changes go live, and what's driving those numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding DORA Metrics
&lt;/h2&gt;

&lt;p&gt;DORA metrics are the go-to for measuring how well software delivery and operations are performing within DevOps teams.&lt;/p&gt;

&lt;p&gt;The four key DORA metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deployment Frequency: How often code makes it to production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lead Time for Changes: How long it takes for a commit to land in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change Failure Rate: The percentage of deployments that break something and need an immediate fix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean Time to Restore (MTTR): How fast the team recovers from a production failure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
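&lt;p&gt;To make these definitions concrete, here's a minimal Python sketch of how the four metrics fall out of basic deployment and incident records. The record shapes and numbers below are hypothetical, purely for illustration:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical deployment records: (commit time, deploy time, failed?)
deployments = [
    (datetime(2024, 6, 1, 9, 0), datetime(2024, 6, 1, 21, 0), False),
    (datetime(2024, 6, 2, 10, 0), datetime(2024, 6, 3, 2, 0), True),
    (datetime(2024, 6, 4, 8, 0), datetime(2024, 6, 4, 18, 0), False),
    (datetime(2024, 6, 6, 12, 0), datetime(2024, 6, 7, 1, 0), False),
]
# Hypothetical incidents: (failure time, restore time)
incidents = [(datetime(2024, 6, 3, 2, 0), datetime(2024, 6, 3, 5, 0))]

days_observed = 30

# Deployment Frequency: deployments per day over the window
deployment_frequency = len(deployments) / days_observed

# Lead Time for Changes: mean hours from commit to production
lead_time_hours = sum(
    (deploy - commit).total_seconds() / 3600 for commit, deploy, _ in deployments
) / len(deployments)

# Change Failure Rate: share of deployments needing an immediate fix
change_failure_rate = sum(failed for _, _, failed in deployments) / len(deployments)

# MTTR: mean hours from failure to restoration
mttr_hours = sum(
    (restored - failed).total_seconds() / 3600 for failed, restored in incidents
) / len(incidents)

print(deployment_frequency, lead_time_hours, change_failure_rate, mttr_hours)
```

&lt;p&gt;Real pipelines pull these timestamps from your VCS, CI system, and incident tracker, but the arithmetic stays this simple.&lt;/p&gt;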

&lt;p&gt;In this case study, we're zeroing in on Deployment Frequency and Lead Time for Changes---two metrics that directly reflect how fast and efficiently an organization delivers.&lt;/p&gt;

&lt;p&gt;These numbers give a clear view of engineering speed and process slowdowns, which are crucial to keep improving in today's fast-paced development world.&lt;/p&gt;

&lt;p&gt;If you want to dive a bit deeper into what &lt;a href="https://middlewarehq.com/blog/what-are-dora-metrics-how-they-can-help-your-software-delivery-process" rel="noopener noreferrer"&gt;DORA metrics are&lt;/a&gt; and how you can leverage them for your team, check out one of our articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Deployment Frequency
&lt;/h2&gt;

&lt;p&gt;The OpenStreetMap website repository pushed an average of 58 deployments per month over the last three months, signaling a robust culture of continuous delivery and rapid iteration.&lt;/p&gt;

&lt;p&gt;While these DF numbers aren't the highest we've seen, especially next to some of the god-tier repos we've covered, averaging almost 60 deployments a month is nothing to be shy about either.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Drivers of Deployment Frequency
&lt;/h3&gt;

&lt;p&gt;Robust Automation Pipelines: The repo leans heavily on automated workflows such as docker.yml, lint.yml, and tests.yml to keep the build, test, and deployment processes running smoothly. With less manual effort and fewer human errors, they've cut release times significantly.&lt;/p&gt;

&lt;p&gt;Efficient Pull Request Handling: June 2024 saw an almost-instantaneous average merge time of just 10.08 seconds! Even with a slight increase in the following months, merge times remained under 6 hours---proof of an agile review process that keeps things moving fast.&lt;/p&gt;

&lt;p&gt;Engaged Reviewer Community: Contributors like &lt;a href="https://github.com/gravitystorm/" rel="noopener noreferrer"&gt;&lt;em&gt;gravitystorm&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://github.com/kcne" rel="noopener noreferrer"&gt;&lt;em&gt;kcne&lt;/em&gt;&lt;/a&gt; are on the ball when it comes to code reviews, keeping the process fast and thorough. Prompt code reviews facilitate quick integration of changes, maintain code quality, and foster a collaborative development environment.&lt;/p&gt;

&lt;p&gt;Pull requests such as &lt;a href="https://github.com/openstreetmap/openstreetmap-website/pull/5056" rel="noopener noreferrer"&gt;#5056&lt;/a&gt; and &lt;a href="https://github.com/openstreetmap/openstreetmap-website/pull/5053" rel="noopener noreferrer"&gt;#5053&lt;/a&gt; are great examples of this active engagement.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Lead Time for Changes
&lt;/h2&gt;

&lt;p&gt;While the repository excels in deployment frequency, the Lead Time for Changes---averaging around 13.26 hours over the past three months---shows room for improvement. First response times, by contrast, hold up well against typical averages.&lt;/p&gt;

&lt;p&gt;Although a half-day lead time is impressive, especially for an open-source project with global contributors, shortening this could further boost development speed and efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3rvep86j49ry26qwju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3rvep86j49ry26qwju.png" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Influencing Factors
&lt;/h3&gt;

&lt;p&gt;First Response Time Variability: The time between a pull request's submission and the first reviewer's response fluctuated significantly.&lt;/p&gt;

&lt;p&gt;July saw an average response time of 6.77 hours, compared to just 1.47 hours in June---a 4.6x jump. A multiple like that looks great on an investment portfolio, but far less so for a repo's lead time ;)&lt;/p&gt;

&lt;p&gt;These delays in initial feedback compound, slowing progress and adding unnecessary time to the overall process.&lt;/p&gt;

&lt;p&gt;Rework Time Fluctuations: Rework time---the period spent revising code after the initial review---dropped from 11.28 hours in June to 2.94 hours in August. It's a bittersweet number.&lt;/p&gt;

&lt;p&gt;The drop suggests better code quality or a more efficient review process, yet rework still adds to the overall lead time and remains an opportunity for further optimization.&lt;/p&gt;

&lt;p&gt;For example, PR &lt;a href="https://github.com/openstreetmap/openstreetmap-website/pull/5016" rel="noopener noreferrer"&gt;#5016&lt;/a&gt; ("Allow to edit revoked blocks") required significant rework due to its complexity, in turn extending its lead time.&lt;/p&gt;

&lt;p&gt;While the repository maintains great lead times, placing more focus on reducing first response times and streamlining the rework process could drive even faster delivery cycles, enhancing both efficiency and development speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diverse Contributions Pushing Growth
&lt;/h2&gt;

&lt;p&gt;The OpenStreetMap website repository thrives on a diverse range of contributions, showcasing a vibrant and healthy open-source ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5m0c8jdhzs5a44znbk6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5m0c8jdhzs5a44znbk6n.png" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feature Development (50%): Innovation is at the forefront, with new features driving user engagement and functionality. For instance, PR &lt;a href="https://github.com/openstreetmap/openstreetmap-website/pull/5056" rel="noopener noreferrer"&gt;#5056&lt;/a&gt; added an "og:description" meta tag to diary entries, improving social media sharing and enhancing the user experience.&lt;/p&gt;

&lt;p&gt;Bug Fixes (30%): Stability is key, with bug fixes ensuring the platform remains reliable. Notably, PR &lt;a href="https://github.com/openstreetmap/openstreetmap-website/pull/5016" rel="noopener noreferrer"&gt;#5016&lt;/a&gt; resolved a critical issue with user permissions by enabling editing of revoked blocks, improving system integrity.&lt;/p&gt;

&lt;p&gt;Documentation (10%): Clear and accessible documentation is vital for the community. PR &lt;a href="https://github.com/openstreetmap/openstreetmap-website/pull/5031" rel="noopener noreferrer"&gt;#5031&lt;/a&gt; updated the "GPX Import email in text format," making it easier for new contributors to onboard and stay informed.&lt;/p&gt;

&lt;p&gt;Testing and Quality Assurance (10%): Testing contributions are crucial for maintaining code quality. By focusing on tests, the project ensures that new changes don't introduce regressions, keeping the codebase robust and dependable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Positive Impact on the Project
&lt;/h2&gt;

&lt;p&gt;The deployment frequency and streamlined workflows in the OpenStreetMap repository deliver substantial benefits to both the project and its community of contributors and users.&lt;/p&gt;

&lt;p&gt;Accelerated Innovation: With rapid deployment cycles, new features and improvements are rolled out quickly, enhancing platform functionality and user satisfaction. This speed allows the project to stay responsive to evolving user needs and technological shifts.&lt;/p&gt;

&lt;p&gt;Enhanced Contributor Experience: Swift integration of contributions motivates open-source developers by validating their work. The efficient review and merge processes foster a positive, collaborative environment that encourages ongoing participation from both new and experienced contributors.&lt;/p&gt;

&lt;p&gt;Quality Assurance: Automated testing and continuous integration maintain stability, ensuring that the fast deployment pace doesn't compromise the platform's reliability. Issues are caught early in the process, reinforcing high-quality standards across releases.&lt;/p&gt;

&lt;p&gt;Community Trust and Engagement: Regular updates build trust among users, reassuring them of the project's commitment to reliability and progress. This trust strengthens both the user base and contributor engagement, helping the project flourish.&lt;/p&gt;

&lt;p&gt;These practices demonstrate how strong DevOps strategies can fuel innovation, improve community involvement, and elevate the overall success of an open-source project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategic Recommendations for Enhanced Performance
&lt;/h2&gt;

&lt;p&gt;To elevate the OpenStreetMap repository's DevOps practices from great to exceptional, here are some targeted actions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Standardize and Expedite First Response Times
&lt;/h3&gt;

&lt;p&gt;Implement SLA Policies: Set service level agreements (SLAs) for code reviews, such as committing to an initial response within 4 hours.&lt;/p&gt;

&lt;p&gt;Automated Alert Systems: Leverage automation to notify reviewers when new pull requests (PRs) are submitted or have been pending beyond a certain period.&lt;/p&gt;

&lt;p&gt;Expand Reviewer Pool: Encourage more contributors to join code reviews by offering clear guidelines and training to reduce pressure on a small group.&lt;/p&gt;
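&lt;p&gt;An SLA check like this doesn't need heavy tooling; a short script run on a schedule can flag breaches. Below is a hedged sketch: the PR record shape and numbers are made up for illustration, and in practice you'd fetch this data from the GitHub API.&lt;/p&gt;

```python
from datetime import datetime, timezone

SLA_HOURS = 4  # target: first review response within 4 hours

# Hypothetical open PRs: (number, opened_at, first_review_at or None)
open_prs = [
    (5101, datetime(2024, 9, 1, 8, 0, tzinfo=timezone.utc), None),
    (5102, datetime(2024, 9, 1, 11, 30, tzinfo=timezone.utc),
     datetime(2024, 9, 1, 12, 0, tzinfo=timezone.utc)),
]

def overdue_prs(prs, now):
    """Return PR numbers still awaiting a first review past the SLA."""
    breaches = []
    for number, opened_at, first_review_at in prs:
        if first_review_at is None:
            waited_hours = (now - opened_at).total_seconds() / 3600
            if waited_hours > SLA_HOURS:
                breaches.append(number)
    return breaches

now = datetime(2024, 9, 1, 14, 0, tzinfo=timezone.utc)
print(overdue_prs(open_prs, now))  # → [5101]: it has waited 6 hours unreviewed
```

&lt;p&gt;Wire the output into a Slack webhook or issue comment and reviewers get nudged automatically.&lt;/p&gt;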

&lt;h3&gt;
  
  
  Reduce Rework Through Enhanced Code Quality
&lt;/h3&gt;

&lt;p&gt;Adopt Pre-Commit Hooks and Checks: Enforce coding standards with tools like linters or static code analyzers before PRs are submitted.&lt;/p&gt;

&lt;p&gt;Code Review Guidelines: Create robust guidelines to set clear expectations for contributors, minimizing the need for back-and-forth revisions.&lt;/p&gt;

&lt;p&gt;Peer Programming and Mentorship: Promote collaborative development by pairing experienced developers with newcomers, ensuring better initial code submissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Foster and Sustain Active Code Review Culture
&lt;/h3&gt;

&lt;p&gt;Recognition Programs: Acknowledge top reviewers with leaderboards, badges, or shout-outs during community meetings to incentivize participation.&lt;/p&gt;

&lt;p&gt;Contributor Onboarding: Streamline the process for new contributors to become reviewers, providing resources and tools to ease their transition.&lt;/p&gt;

&lt;p&gt;Feedback Loops: Enable contributors to provide feedback on the review process, creating a culture of continuous improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage Analytics for Continuous Improvement
&lt;/h3&gt;

&lt;p&gt;Monitor DORA Metrics Regularly: Track key metrics continuously to detect trends and pinpoint areas needing optimization.&lt;/p&gt;

&lt;p&gt;Set Performance Targets: Establish clear goals for deployment frequency, lead time, and other metrics to align the team toward common objectives.&lt;/p&gt;

&lt;p&gt;Share Insights with the Community: Promote transparency by sharing performance data and achievements, fostering collective ownership of the project.&lt;/p&gt;

&lt;p&gt;By adopting these strategies, OpenStreetMap can fine-tune its DevOps performance, further reduce lead times, and streamline workflows, strengthening its status as a leading open-source initiative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Wrap This Up
&lt;/h2&gt;

&lt;p&gt;The OpenStreetMap website repository is a shining example of how effective DevOps practices can drive success in an open-source environment. With its impressive deployment frequency and smooth workflows, the project consistently delivers value to a global user base while maintaining high standards of reliability.&lt;/p&gt;

&lt;p&gt;The strong engagement from contributors and maintainers creates a thriving community that continuously pushes innovation forward.&lt;/p&gt;

&lt;p&gt;That said, there's always room to optimize. Focusing on cutting down lead times by implementing standardized response procedures and improving code quality could make the repository's delivery pipeline even faster. By incorporating these strategies, the project can enhance performance, boost contributor satisfaction, and ensure long-term sustainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading and Resources
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Accelerate-Software-Performing-Technology-Organizations/dp/1942788339" rel="noopener noreferrer"&gt;Accelerate: The Science of Lean Software and DevOps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An in-depth exploration of DORA metrics and their impact on software delivery performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/devops/state-of-devops" rel="noopener noreferrer"&gt;DORA State of DevOps Report&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Annual research insights into DevOps trends, metrics, and best practices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://wiki.openstreetmap.org/wiki/Contributor_Guidelines" rel="noopener noreferrer"&gt;OpenStreetMap Contributor Guidelines&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Official guidelines for contributing to the OpenStreetMap project.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>softwaredevelopment</category>
      <category>development</category>
      <category>devops</category>
    </item>
    <item>
      <title>Is Next.js the Next Evolution or Just a Passing Trend? A Dora Metrics Case Study</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Sat, 21 Sep 2024 05:46:10 +0000</pubDate>
      <link>https://dev.to/middleware/is-nextjs-the-next-evolution-or-just-a-passing-trend-a-dora-metrics-case-study-362c</link>
      <guid>https://dev.to/middleware/is-nextjs-the-next-evolution-or-just-a-passing-trend-a-dora-metrics-case-study-362c</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/vercel/next.js" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; is a repo that is a relatively recent yet powerful JavaScript framework that's taking the modern web development scene by storm. Next.js strengthens React with its server-side rendering, static site generation, and SEO optimization.&lt;/p&gt;

&lt;p&gt;Developers love its versatility, and the numbers speak for themselves. In a recent survey, &lt;a href="https://survey.stackoverflow.co/2023/" rel="noopener noreferrer"&gt;60% of developers&lt;/a&gt; said they preferred Next.js for building production-ready applications due to its ease of deployment and scalability. When comparing deployment times, Next.js consistently outperforms React in speed and efficiency, making it a go-to choice for developers aiming for a seamless build process.&lt;/p&gt;

&lt;p&gt;This case study focuses on the open-source Next.js repository, particularly highlighting its exceptional deployment frequency.&lt;/p&gt;

&lt;p&gt;As developers who've tinkered with Next.js for building dynamic e-commerce sites and quick-launching blogs, we couldn't help but wonder---how did they create and manage such a legendary repo with such efficiency? So, we dug deeper using &lt;a href="https://www.middlewarehq.com/middleware-open-source" rel="noopener noreferrer"&gt;Middleware Open-Source&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you're excited to explore these insights further and connect with fellow engineering leaders, come join us in&lt;/em&gt; &lt;a href="https://join.slack.com/t/middle-out-group/shared_invite/zt-2gprfu1g3-MenEylfKe8FrmWde4TwfIg" rel="noopener noreferrer"&gt;&lt;em&gt;The Middle Out Community&lt;/em&gt;&lt;/a&gt; &lt;em&gt;or subscribe to our newsletter for exclusive case studies and more!&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But first things first: let's understand what DORA metrics are.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are DORA Metrics?
&lt;/h2&gt;

&lt;p&gt;DORA metrics are key indicators that show how efficiently a project progresses from commit to final production in software delivery. Feel free to read through our &lt;a href="https://middlewarehq.com/blog/master-dora-metrics-to-master-world-class-software-delivery-processes" rel="noopener noreferrer"&gt;detailed blog on DORA metrics&lt;/a&gt; and how they can help your engineering processes!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deployment Frequency: How often code hits production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lead Time for Changes: How long it takes for a commit to go live.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean Time to Restore (MTTR): Time taken to recover from failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change Failure Rate: How many of those deployments break something.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we're all on the same page, let's see how Next.js measures up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Next.js: Setting the Bar High with Exceptional Deployment Frequency
&lt;/h3&gt;

&lt;p&gt;The open-source Next.js repository stands out for its exceptional deployment frequency, far surpassing industry standards. Over the past three months, the repo has consistently pushed a high volume of deployments, reflecting the team's efficiency and automation prowess.&lt;/p&gt;

&lt;p&gt;In June 2024, there were 247 deployments, followed by 261 in July, and an impressive 279 in August. These numbers highlight the repository's commitment to continuous integration and rapid feature releases, making it a benchmark for deployment frequency in the open-source community.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oy9Vny7t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXeP4m0Uqelsuo-iNuD_rphR1M_MINhJxshDierzMUO4SM3c-_XuyjwVugyXMmpTDyw-2rniOIq3oT_xwXayaMOGKoCTknpXY41VfjglHv_OtVhNbRZJz1xmI4GIKZG-KZMqc1enSVnPoh-PQ7HNtw66iRFb%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oy9Vny7t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXeP4m0Uqelsuo-iNuD_rphR1M_MINhJxshDierzMUO4SM3c-_XuyjwVugyXMmpTDyw-2rniOIq3oT_xwXayaMOGKoCTknpXY41VfjglHv_OtVhNbRZJz1xmI4GIKZG-KZMqc1enSVnPoh-PQ7HNtw66iRFb%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's nearly one deployment every 3 hours! This high-frequency deployment isn't just for show; it keeps the project evolving at an incredible speed. How do they do it? A couple of key factors stand out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Automated CI/CD Pipeline: With automation in place, deployments are quick and smooth---almost like magic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frequent, Small PRs: By breaking down changes into bite-sized pull requests, they can merge and deploy faster with less risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fast Issue Resolution: Bugs get squashed quickly, and failed deployments are rare.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
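&lt;p&gt;The "one deployment every ~3 hours" figure is easy to sanity-check from the monthly counts above:&lt;/p&gt;

```python
# Monthly deployment counts from the charts above: month -> (deploys, days)
deploys = {"June": (247, 30), "July": (261, 31), "August": (279, 31)}

for month, (count, days) in deploys.items():
    hours_between = days * 24 / count
    print(f"{month}: one deployment every {hours_between:.1f} hours")
```

&lt;p&gt;Every month comes out under three hours between deployments, which is exactly the cadence the charts suggest.&lt;/p&gt;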

&lt;h3&gt;
  
  
  Cycle Time Leaves Room for Improvement
&lt;/h3&gt;

&lt;p&gt;Despite a high deployment frequency, the repository shows notable fluctuations in Cycle Time, which impacts the overall Lead Time for Changes. Cycle time includes all PRs, while Lead Time only includes PRs with available deployment data. In June 2024, the average Cycle Time was around 3.5 days, slightly increasing to 3.6 days in July, and then spiking to 5.3 days by August.&lt;/p&gt;
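&lt;p&gt;That distinction between the two metrics is worth pinning down in code. Here's a minimal sketch, assuming a simple hypothetical PR record shape (the dates are invented for illustration):&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical PR records: (first_commit, merged, deployed or None)
prs = [
    (datetime(2024, 8, 1), datetime(2024, 8, 4), datetime(2024, 8, 6)),
    (datetime(2024, 8, 2), datetime(2024, 8, 8), None),  # no deploy data
    (datetime(2024, 8, 3), datetime(2024, 8, 6), datetime(2024, 8, 7)),
]

# Cycle Time: first commit -> merge, averaged over ALL PRs
cycle_days = sum((merged - commit).days for commit, merged, _ in prs) / len(prs)

# Lead Time: first commit -> deploy, only where deployment data exists
with_deploys = [(commit, deployed) for commit, _, deployed in prs if deployed]
lead_days = sum((deployed - commit).days for commit, deployed in with_deploys) / len(with_deploys)

print(cycle_days, lead_days)
```

&lt;p&gt;Because the two averages run over different PR populations, they can move in different directions, which is why we report both.&lt;/p&gt;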

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hY_52iq6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXcrKJvegdlcr5Gxv4VrFdrXHqy30inUs6nOSQ-eXqFGvfmSrFsBAsTLYUMJ_B82e7k8_B-Lly-fYMRwHdYQNJSEJs5xkzuJY3ro9hvIqzgy7Url2iPXRTUYnTx1mbkyUh678DXdLsqHKnwpAXLgwRbCxY9r%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hY_52iq6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXcrKJvegdlcr5Gxv4VrFdrXHqy30inUs6nOSQ-eXqFGvfmSrFsBAsTLYUMJ_B82e7k8_B-Lly-fYMRwHdYQNJSEJs5xkzuJY3ro9hvIqzgy7Url2iPXRTUYnTx1mbkyUh678DXdLsqHKnwpAXLgwRbCxY9r%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cycle time has been on the rise, increasing from 3.5 to 5.3 days, indicating some delays in the development process. Identifying bottlenecks, whether in code reviews, testing, or manual tasks, and implementing automated testing or improved sprint planning could help reduce this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Key Metrics That Could Use Some Work
&lt;/h3&gt;

&lt;h4&gt;
  
  
  First Response Time
&lt;/h4&gt;

&lt;p&gt;In June 2024, the average first response time was 1.6 days (38.4 hours), indicating quick and efficient handling of issues and pull requests in the Next.js repository. By July 2024, this time increased slightly to 1.8 days (43.2 hours), hinting at a minor delay, likely due to a growing number of issues or reduced team availability.&lt;/p&gt;

&lt;p&gt;However, in August 2024, there was a significant jump to 2.6 days (62.4 hours), suggesting noticeable delays, possibly caused by higher workloads, bottlenecks in reviewer assignments, or fewer team members due to vacations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GhoM7qau--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXd_mG3pGQZx-VDrQEkZNKVhsd180ugCWsne0QNSUH8uniF2qKSm6f_fHRS_BsdVVgjLR9NlJqQz7iT5Ama1iVY_Bewxa6dtSoSVOh2u20n_9sFhEMkg1JB08XsyekBCanBE-xasKGxPnwdwRbjMqbtRLXxT%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GhoM7qau--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXd_mG3pGQZx-VDrQEkZNKVhsd180ugCWsne0QNSUH8uniF2qKSm6f_fHRS_BsdVVgjLR9NlJqQz7iT5Ama1iVY_Bewxa6dtSoSVOh2u20n_9sFhEMkg1JB08XsyekBCanBE-xasKGxPnwdwRbjMqbtRLXxT%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To improve First Response Time, several strategies can be implemented to address the recent increase in delays. Automating notifications for issues and pull requests can ensure faster assignment to reviewers, reducing wait times.&lt;/p&gt;

&lt;p&gt;Introducing SLAs (Service Level Agreements) for response times could create accountability and encourage quicker engagement. Moreover, reviewing team allocation and workload distribution could help balance responsibilities, especially during periods of high demand or reduced team availability, such as vacations.&lt;/p&gt;

&lt;p&gt;Encouraging cross-functional collaboration could also enable quicker responses by spreading the load across the team more evenly. These changes could effectively bring response times back down and maintain efficiency.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lead Time for Changes
&lt;/h4&gt;

&lt;p&gt;From June to August 2024, the lead time for changes in the Next.js repository gradually increased from 3.3 days to 4.1 days. In June, the relatively quick 3.3-day turnaround reflected an efficient workflow with smooth coordination. By July, this increased slightly to 3.6 days, hinting at minor inefficiencies, possibly due to longer code reviews or more complex tasks.&lt;/p&gt;

&lt;p&gt;However, by August, the lead time reached 4.1 days, indicating more significant delays likely caused by higher workloads, bottlenecks in reviews, or reduced team capacity due to vacations. This trend suggests a need to optimize processes and prevent further slowdowns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZFr_uF2o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXeyDYL0WmV0jL92UIqm1aptZW3IhQ3BLF_ttBl-vGkO59iNmBCsfkfGQtvbZdYakBv4HmZeg8CX_z1Rzga3yx8lmi_0NHHFrQT7lJfFQThwF6YCctvZljskD_qe2qZ-0t2cR5nc07id8JqEodceOCtVBuxh%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZFr_uF2o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh7-rt.googleusercontent.com/docsz/AD_4nXeyDYL0WmV0jL92UIqm1aptZW3IhQ3BLF_ttBl-vGkO59iNmBCsfkfGQtvbZdYakBv4HmZeg8CX_z1Rzga3yx8lmi_0NHHFrQT7lJfFQThwF6YCctvZljskD_qe2qZ-0t2cR5nc07id8JqEodceOCtVBuxh%3Fkey%3D6JfgQTlgmInbLkGy7mcKdA" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reasons?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Extended review times. For instance, &lt;a href="https://github.com/vercel/next.js/pull/67498" rel="noopener noreferrer"&gt;PR #67498&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complex tasks that involve deep scrutiny and testing. &lt;a href="https://github.com/vercel/next.js/pull/67391" rel="noopener noreferrer"&gt;PR #67391&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fluctuations in First Response Times (from 1.6 days to 2.6 days) cause uneven start times for reviews, delaying progress.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Nature of Work
&lt;/h3&gt;

&lt;p&gt;The open-source Next.js repo includes a variety of activities, ranging from feature upgrades and bug squashing to improving documentation and refining tests. Here's a breakdown of some key insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Features and Improvements: Overhauls including performance optimizations (&lt;a href="https://github.com/vercel/next.js/pull/67397" rel="noopener noreferrer"&gt;PR #67397&lt;/a&gt;) and new functionalities (&lt;a href="https://github.com/vercel/next.js/pull/67215" rel="noopener noreferrer"&gt;PR #67215&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation: Significant contributions to documentation ensure clarity and easier adoption (&lt;a href="https://github.com/vercel/next.js/pull/67056" rel="noopener noreferrer"&gt;PR #67056&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bug Fixes: Addressing critical bug fixes, highlighted by 41.5 hours (~1.7 days) of rework time (&lt;a href="https://github.com/vercel/next.js/pull/67022" rel="noopener noreferrer"&gt;PR #67022&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance Optimizations: Enhancing the speed and efficiency of the framework (&lt;a href="https://github.com/vercel/next.js/pull/67065" rel="noopener noreferrer"&gt;PR #67065&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These updates position Next.js as a leader in web development. However, many of the areas needing improvement---cycle time, first response, and lead time---can benefit from increased automation in testing, code reviews, and notifications.&lt;/p&gt;

&lt;p&gt;To further optimize, they can track where bottlenecks occur in their pipeline, whether during code review, testing, or deployment, and target those areas to boost overall performance. Fostering continuous feedback through more frequent stand-ups or retrospectives can also help identify and resolve friction points. By focusing on these strategies, the team can build a more efficient and streamlined development and deployment pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does This Affect the Next.js Community?
&lt;/h2&gt;

&lt;p&gt;For internal contributors, the rapid deployment cycle is a dream. Features and fixes roll out quickly, meaning the team gets feedback in near real-time. However, the high lead time could make some contributors feel like they're stuck in a long queue, waiting for their work to go live.&lt;/p&gt;

&lt;p&gt;For external contributors, understanding these bottlenecks can set clearer expectations. If you're contributing to Next.js, don't be surprised if your PR takes a bit of time to merge, even though deployments happen frequently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Invest in Automation: A solid CI/CD pipeline can keep deployments flowing like water.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encourage Small, Frequent PRs: Less complexity means quicker reviews and faster deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Address Bottlenecks in Cycle Time: If lead time is lagging behind, dig into what's causing the delays---whether it's rework or review times.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  DORA Score: 8/10
&lt;/h2&gt;

&lt;p&gt;Next.js has a soaring deployment frequency, but their lead time, cycle time, and first response times could use some attention. With a few strategic tweaks, they have the potential to be strong contenders against the top performers in &lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;Google's annual DORA report&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Next.js Making Other Open-Source Projects Obsolete?
&lt;/h2&gt;

&lt;p&gt;Next.js has set a new benchmark in deployment efficiency, making it the one to watch in the open-source world. With its unbeatable blend of speed, quality, and constant innovation, it leaves little room for competitors to catch up. Whether you're an active contributor or a curious onlooker, there's plenty to admire---and learn---from this trailblazing repo.&lt;/p&gt;

&lt;p&gt;As Next.js continues to evolve, it's not just keeping up with the web's demands; it's shaping them. Get ready for more groundbreaking updates as it redefines open-source excellence!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're excited to explore these insights further and connect with fellow engineering leaders, come join us in&lt;/em&gt; &lt;a href="https://join.slack.com/t/middle-out-group/shared_invite/zt-2gprfu1g3-MenEylfKe8FrmWde4TwfIg" rel="noopener noreferrer"&gt;&lt;em&gt;The Middle Out Community&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, and subscribe to the newsletter for exclusive case studies and more!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Trivia
&lt;/h2&gt;

&lt;p&gt;Next.js was created by Vercel (formerly Zeit) in 2016 and quickly gained popularity due to its powerful features for building server-side rendered React applications. It supports both static site generation (SSG) and server-side rendering (SSR), making it a flexible choice for developers. Major companies like Netflix, Twitch, and GitHub use Next.js to power their web apps, showcasing its reliability and scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/middlewarehq/middleware" rel="noopener noreferrer"&gt;Middleware and Dora Metrics&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/best-practices-continuous-integration-delivery-kubernetes" rel="noopener noreferrer"&gt;Continuous Integration Guides&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dora.dev/guides/dora-metrics-four-keys/" rel="noopener noreferrer"&gt;Dora Metrics Methodology&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>nextjs</category>
      <category>javascript</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Is FreeCAD Nailing PRs or Just Blazing Through Merges?</title>
      <dc:creator>Shivam Chhuneja</dc:creator>
      <pubDate>Fri, 20 Sep 2024 05:04:59 +0000</pubDate>
      <link>https://dev.to/middleware/is-freecad-nailing-prs-or-just-blazing-through-merges-4mnk</link>
      <guid>https://dev.to/middleware/is-freecad-nailing-prs-or-just-blazing-through-merges-4mnk</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/FreeCAD" rel="noopener noreferrer"&gt;FreeCAD&lt;/a&gt;, the open-source 3D CAD modeler, has garnered quite a community with over 12,000 commits, 1,100 contributors, and more than 5,000 pull requests (PRs) to date. Known for its versatility in mechanical engineering and product design, FreeCAD is a go-to for makers and professionals alike.&lt;/p&gt;

&lt;p&gt;But are they actually nailing their PR process, or just racing through merges? With an average PR merge time of 7.5 days, FreeCAD's workflow efficiency is up for debate. Let's break down the numbers and see what's really happening behind the scenes.&lt;/p&gt;

&lt;p&gt;We are using &lt;a href="https://www.middlewarehq.com/middleware-open-source?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;Middleware Open Source&lt;/a&gt; to get the data and insights for these open source repos, so if you want to recreate this analysis or pull the numbers for your own favorite God-level repo, you know where to go!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're curious to start a conversation with fellow engineering leaders, join &lt;a href="https://join.slack.com/t/middle-out-group/shared_invite/zt-2gprfu1g3-MenEylfKe8FrmWde4TwfIg" rel="noopener noreferrer"&gt;The Middle Out Community&lt;/a&gt;! Don't forget to subscribe to our newsletter below for exclusive case studies!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Channeling Our Inner Sam Spade: Cracking the Case of FreeCAD's PR Efficiency
&lt;/h2&gt;

&lt;p&gt;As we at Middleware dug into FreeCAD's Dora metrics, we couldn't resist going full "Sam Spade" and investigating what makes this repository tick. What we found sheds light on some best practices that other repos can take notes from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background on Dora Metrics
&lt;/h2&gt;

&lt;p&gt;Dora Metrics are a set of key performance indicators used to measure software development velocity and efficiency. They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deployment Frequency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lead Time for Changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean Time to Restore (MTTR)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change Failure Rate&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics help organizations understand their development processes and identify areas for improvement.&lt;/p&gt;
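&lt;p&gt;For a feel of how these four numbers fall out of raw deployment and incident records, here is a minimal, self-contained Python sketch. The log below is entirely made up for illustration; tools like Middleware OSS derive the real equivalents from Git and CI history.&lt;/p&gt;

```python
from datetime import datetime

# Toy deployment/incident log -- every value here is invented.
# "lead_time_hours" is commit-to-deploy time; "failed" marks a bad deploy.
deployments = [
    {"at": datetime(2024, 8, 1), "lead_time_hours": 40, "failed": False},
    {"at": datetime(2024, 8, 3), "lead_time_hours": 30, "failed": True},
    {"at": datetime(2024, 8, 6), "lead_time_hours": 50, "failed": False},
    {"at": datetime(2024, 8, 8), "lead_time_hours": 20, "failed": False},
]
restore_times_hours = [4.0]  # one incident, restored in 4 hours
days_observed = 7

# The four DORA metrics, computed the straightforward way:
deployment_frequency = len(deployments) / days_observed  # deploys per day
lead_time = sum(d["lead_time_hours"] for d in deployments) / len(deployments)
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
mttr = sum(restore_times_hours) / len(restore_times_hours)

print(deployment_frequency, lead_time, change_failure_rate, mttr)
```

&lt;p&gt;The hard part in practice isn't the arithmetic; it's deciding what counts as a "deployment" and a "failure" for your repo.&lt;/p&gt;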

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Exemplary Metric: Merge Time
&lt;/h3&gt;

&lt;p&gt;One standout metric for FreeCAD is its Merge Time for PRs. Efficient merge times are a testament to streamlined code reviews and strong team coordination.&lt;/p&gt;

&lt;p&gt;June 2024 -- Average Merge Time: 15.66 Hours&lt;/p&gt;

&lt;p&gt;In June, FreeCAD's PR management was running relatively smoothly, with an average merge time of 15.66 hours.&lt;/p&gt;

&lt;p&gt;July 2024 -- Average Merge Time: 20.4 Hours&lt;/p&gt;

&lt;p&gt;July showed a bit of a slowdown, with the average merge time stretching to 20.4 hours. This change could be attributed to more complex PRs.&lt;/p&gt;

&lt;p&gt;August 2024 -- Average Merge Time: 8.04 Hours&lt;/p&gt;

&lt;p&gt;In August, things really picked up pace, with the average merge time dropping significantly to just 8.04 hours. This lightning-fast rate suggests that the team and community were hyper-efficient in handling pull requests.&lt;/p&gt;

&lt;p&gt;Overall Average Merge Time: Approximately 11.02 hours.&lt;/p&gt;

&lt;p&gt;Rapid Merges: PRs like &lt;a href="https://github.com/FreeCAD/FreeCAD/pull/14693" rel="noopener noreferrer"&gt;#14693&lt;/a&gt;, &lt;a href="https://github.com/FreeCAD/FreeCAD/pull/14691" rel="noopener noreferrer"&gt;#14691&lt;/a&gt;, and &lt;a href="https://github.com/FreeCAD/FreeCAD/pull/14688" rel="noopener noreferrer"&gt;#14688&lt;/a&gt; were merged in under a day.&lt;/p&gt;
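&lt;p&gt;Merge time itself is simple to compute once you have PR records. The sketch below assumes records shaped like the GitHub REST API's pull-request objects (their &lt;code&gt;created_at&lt;/code&gt; and &lt;code&gt;merged_at&lt;/code&gt; fields); the sample data is invented, not FreeCAD's actual PRs.&lt;/p&gt;

```python
from datetime import datetime

def avg_merge_time_hours(prs):
    """Average open-to-merge time, in hours, over merged PRs.

    Each record mimics the GitHub REST API's pull-request shape:
    ISO-8601 `created_at`, and `merged_at` that is None if unmerged.
    """
    durations = []
    for pr in prs:
        if pr["merged_at"] is None:
            continue  # skip open or closed-without-merge PRs
        opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
        merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
        durations.append((merged - opened).total_seconds() / 3600)
    return sum(durations) / len(durations)

# Invented sample: one 6-hour merge, one 10-hour merge, one unmerged PR.
sample = [
    {"created_at": "2024-08-10T08:00:00Z", "merged_at": "2024-08-10T14:00:00Z"},
    {"created_at": "2024-08-11T09:00:00Z", "merged_at": "2024-08-11T19:00:00Z"},
    {"created_at": "2024-08-12T10:00:00Z", "merged_at": None},
]
print(avg_merge_time_hours(sample))  # 8.0
```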

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfzryft65D3Y5wAW-uvy4OTA3okVim2f_Y3G8rwfLlTxzoS30plBjdpYBKpkDz8Oupj40FHhpYGDbQ_O2oIvdsIQUjhT6GQ29l_C1tjaE9r-Xdr8p5UYphK3liMy2MuvxlWfTwl9JCc6W-KAvtHaC4UGu27%3Fkey%3DZEwcRo7dBpGEwSWFGBiMnQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXfzryft65D3Y5wAW-uvy4OTA3okVim2f_Y3G8rwfLlTxzoS30plBjdpYBKpkDz8Oupj40FHhpYGDbQ_O2oIvdsIQUjhT6GQ29l_C1tjaE9r-Xdr8p5UYphK3liMy2MuvxlWfTwl9JCc6W-KAvtHaC4UGu27%3Fkey%3DZEwcRo7dBpGEwSWFGBiMnQ"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Factors Contributing to Efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ownership: Consistent authors like PaddleStroke, Roy-043, and wwmayer contribute to swift merges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Structured PR Review: Automated checks via GH Actions, including workflows like CI_master.yml and sub_lint.yml, catch issues early.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also read: &lt;a href="https://middlewarehq.com/blog/is-freecodecamp-sacrificing-quality-for-speed-with-their-rapid-deployments?utm_source=devto&amp;amp;utm_medium=post" rel="noopener noreferrer"&gt;Is freeCodeCamp Sacrificing Quality for Speed with Their Rapid Deployments?&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Metric that needs Improvement: Cycle Time
&lt;/h3&gt;

&lt;p&gt;While FreeCAD has made commendable strides in reducing Cycle Time, with figures decreasing from 171.12 hours to 130.8 hours and then to 121.68 hours, there's still room for improvement.&lt;/p&gt;

&lt;p&gt;Continuing this trend could further enhance their efficiency. Addressing bottlenecks and optimizing workflows could help FreeCAD push these numbers even lower, achieving faster turnaround times and reinforcing its reputation for swift, effective development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Average Cycle Time: Approximately 21.42 hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fluctuation: Inconsistencies arise, as seen in longer cycle times for complex PRs like &lt;a href="https://github.com/FreeCAD/FreeCAD/pull/14674" rel="noopener noreferrer"&gt;#14674&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdU9PDobCIsqVQSUdv1BeAPFVyWcEPpsz68KIvuo6HHYwKaqNQpRVEZmwEoXSCkmg6_UyPUzZ1GUjhV71eZ0qX_GeQTzBRYWS9-0kKUW9Iyjnqr6uAjfYUynJyahFaWSqee83DeIf8MsAHDj_Y746FA6YU%3Fkey%3DZEwcRo7dBpGEwSWFGBiMnQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdU9PDobCIsqVQSUdv1BeAPFVyWcEPpsz68KIvuo6HHYwKaqNQpRVEZmwEoXSCkmg6_UyPUzZ1GUjhV71eZ0qX_GeQTzBRYWS9-0kKUW9Iyjnqr6uAjfYUynJyahFaWSqee83DeIf8MsAHDj_Y746FA6YU%3Fkey%3DZEwcRo7dBpGEwSWFGBiMnQ"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Factors Contributing to Delays
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Complex PRs: More intricate PRs understandably take longer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reviewer Availability: Delays due to the availability and assignment of reviewers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Other Key Metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lead Time
&lt;/h3&gt;

&lt;p&gt;The reduction in Lead Time from 2.51 days in June to 1.81 days in August is a clear indicator of enhanced efficiency in moving code through the development pipeline. This is a commendable achievement, suggesting that FreeCAD has improved its processes for deploying code, leading to faster delivery of new features and fixes. This efficiency boost likely contributes positively to the overall project velocity and responsiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Commit to Open
&lt;/h3&gt;

&lt;p&gt;The increase in the time taken from First Commit to Opening a Pull Request, from 3.77 hours in June to 12.01 hours in August, is a bit concerning. This rise might indicate bottlenecks or delays in the initial stages of the contribution process. Addressing this could involve streamlining the process for handling code submissions, possibly by improving workflows or increasing the availability of reviewers to handle initial requests more promptly.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Response Time
&lt;/h3&gt;

&lt;p&gt;Although there's a slight improvement from 11.42 hours in June to 10.15 hours in August, there's still potential for faster response times. This metric is crucial for maintaining an active and engaged contributor base, as prompt responses can enhance collaboration and prevent contributors from losing interest or moving on to other projects. Focusing on reducing this time further could improve contributor satisfaction and keep the development process moving smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nature of Work: Feature-Focused Development
&lt;/h2&gt;

&lt;p&gt;FreeCAD is rich in diverse contributions, spanning feature development, bug fixes, and optimization work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Development
&lt;/h3&gt;

&lt;p&gt;A major focus, e.g., "Assembly: TNP" by PaddleStroke.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug Fixes
&lt;/h3&gt;

&lt;p&gt;Substantial effort goes into maintenance and fixes, e.g., "Fem: Fix height of reference list widget in spring constraint task panel" by marioalexis84.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization and Refactor
&lt;/h3&gt;

&lt;p&gt;Also significant, as seen in PRs like "Guard all uses of basic_random_generator for thread safety" by bgbsww.&lt;/p&gt;

&lt;p&gt;Each of these contributions highlights the diverse nature of work being done on FreeCAD---from feature development that solves long-standing issues to bug fixes that refine the user experience, to optimizations that ensure the software can handle future demands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact on Project and Community
&lt;/h2&gt;

&lt;p&gt;Efficient PR management leads to better code quality, more frequent releases, and higher contributor satisfaction. FreeCAD's emphasis on swift PR reviews encourages further contributions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Structured Workflows: Implement CI/CD pipelines to automate checks; FreeCAD's CI_master.yml workflow keeps results consistent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Active PR Review Culture: Prioritize swift reviews to avoid PR pile-ups.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clear Ownership Assignments: Assign code owners to specific modules for better PR management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Balanced Work Distribution: Encourage a mix of new features, bug fixes, and refactoring for a well-maintained codebase.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  DORA Score: 8/10
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqgm2cgqb88eg2mv32tt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqgm2cgqb88eg2mv32tt.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After taking a closer look at FreeCAD with our Dora Metrics toolkit, here's the rundown: FreeCAD scores a respectable 8/10. Its PR merge times are impressively quick, thanks to a committed community that's always in motion. But---and there's always a "but"---Cycle Time is where FreeCAD hits a few speed bumps, with complex PRs dragging down the average.&lt;/p&gt;

&lt;p&gt;The takeaway? FreeCAD's team is doing a solid job, but a little change here and there could take them from great to exceptional. Our analysis, stacked against &lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;Google's Dora benchmarks&lt;/a&gt;, highlights FreeCAD's wins and its opportunities for improvement. Want to see how your project compares or optimize for efficiency? Analyze it using &lt;a href="https://github.com/middlewarehq/middleware" rel="noopener noreferrer"&gt;Middleware OSS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion - FreeCAD: A PR Powerhouse or Just Polished?
&lt;/h2&gt;

&lt;p&gt;FreeCAD's got some slick moves when it comes to handling PRs---those rapid merge times are nothing short of impressive. But like a dance partner with two left feet, its Cycle Time could use a little more rhythm. With just a bit of fine-tuning, FreeCAD could be on its way to even greater efficiency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're curious to pick the brains of fellow engineering leaders, join &lt;a href="https://join.slack.com/t/middle-out-group/shared_invite/zt-2gprfu1g3-MenEylfKe8FrmWde4TwfIg" rel="noopener noreferrer"&gt;The Middle Out Community!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Trivia
&lt;/h2&gt;

&lt;p&gt;FreeCAD was started in 2002 by Jürgen Riegel, Werner Mayer, and Yorik van Havre, making it one of the oldest free and open-source 3D CAD modeling tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/middlewarehq/middleware" rel="noopener noreferrer"&gt;Middleware and Dora Metrics&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/best-practices-continuous-integration-delivery-kubernetes" rel="noopener noreferrer"&gt;Continuous Integration Guides&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dora.dev/guides/dora-metrics-four-keys/" rel="noopener noreferrer"&gt;Dora Metrics Methodology&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opensource</category>
      <category>devops</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
