likhitha manikonda

Predicting Tea Sales With ML: Linear Regression, Gradient Descent & Regularization (Beginner Friendly + Code)

đŸ” Linear Regression for Absolute Beginners

With Code Examples and Tea‑Stall Stories

Machine learning terms like cost function, gradient descent, regularization, and overfitting can feel abstract. We’ll make them concrete with a tea stall story and then bring them to life with working Python code you can run.

What’s ahead

  • Linear Regression: predict tea sales from temperature
  • Cost Function (MSE): measure how wrong predictions are
  • Gradient Descent: improve step‑by‑step
  • Overfitting: when a model memorizes noise
  • Regularization (Ridge/Lasso): keep models simple and robust
  • Visualizing how penalties shrink coefficients
  • Handy forecast function

đŸ§Ș Setup (Run These First)

Explanation: We import the tools we’ll use. numpy and pandas handle numbers/data; matplotlib draws charts; sklearn gives us ready‑made ML models and utilities.

# Install if needed: pip install numpy pandas scikit-learn matplotlib

# Core libraries
import numpy as np                 # numerical computations
import pandas as pd               # data tables (light use here)
import matplotlib.pyplot as plt   # plotting

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Make randomness repeatable (so results are the same each run)
np.random.seed(42)

⭐ Scenario 1: Linear Regression (Tea Sales vs. Temperature)

Story: As temperature rises, fewer people want hot tea. We’ll fit a straight line that predicts tea cups sold from the temperature in °C.

Key idea: Linear regression finds the line y = m·x + c that best matches the data (best = smallest average error).

# Synthetic dataset: temperature (°C) → tea cups sold
temps = np.array([10, 12, 15, 18, 20, 22, 24, 26, 28]).reshape(-1, 1)
# reshape(-1,1) turns a 1D list into a column vector for sklearn

tea_sales = np.array([100, 95, 85, 70, 60, 55, 50, 45, 40])

# Create and train the model
lin = LinearRegression()
lin.fit(temps, tea_sales)

print("Slope (m):", lin.coef_[0])       # change in cups for +1°C
print("Intercept (c):", lin.intercept_) # cups at 0°C

# Predict tea sales for 21°C
tomorrow_temp = np.array([[21]])
pred_sales = lin.predict(tomorrow_temp)
print("Predicted tea cups at 21°C:", int(pred_sales[0]))

# Plot data and the fitted line
plt.scatter(temps, tea_sales, color="teal", label="Actual")
plt.plot(temps, lin.predict(temps), color="orange", label="Fitted line")
plt.xlabel("Temperature (°C)")
plt.ylabel("Tea cups sold")
plt.title("Linear Regression: Tea Sales vs. Temperature")
plt.legend()
plt.show()
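
To connect the fitted numbers back to y = m·x + c, you can reproduce the 21°C prediction by hand (a quick check using the slope and intercept printed above):

# Same prediction, computed directly from y = m*x + c
m_fit, c_fit = lin.coef_[0], lin.intercept_
manual_pred = m_fit * 21 + c_fit
print("Manual prediction at 21°C:", manual_pred)   # matches lin.predict([[21]])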

⭐ Scenario 2: Cost Function (How Wrong Are We?)

Explanation: Before we can improve a model, we need a number that says how wrong it is. MSE (Mean Squared Error) averages the squared gaps between actual and predicted values, so big misses are punished more than small ones.

# MSE: Mean Squared Error = average of squared differences
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

# Evaluate our fitted line on the training data
y_pred = lin.predict(temps)
print("Mean Squared Error (MSE):", mse(tea_sales, y_pred))
# Lower MSE = better fit
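
To see what “average of squared differences” means in practice, here is the same formula worked out by hand on two made‑up predictions (a tiny illustrative example, separate from the tea data):

# Worked example: actual = [100, 95], predicted = [98, 96]
# Errors: 2 and -1 → squared: 4 and 1 → mean: (4 + 1) / 2 = 2.5
print(mse(np.array([100, 95]), np.array([98, 96])))   # prints 2.5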

⭐ Scenario 3: Gradient Descent (Learning Step‑by‑Step)

Explanation: Gradient descent adjusts parameters (m and c) little by little to reduce MSE. Imagine tasting a tea recipe and tweaking sugar and milk in the direction that improves taste—repeat until it’s good enough.

# Manual gradient descent for y = m*x + c

X = temps.flatten()
y = tea_sales.astype(float)

m, c = 0.0, 0.0             # start with guesses
lr = 0.0005                 # learning rate: step size
epochs = 5000               # number of update steps

def predictions(m, c, X): 
    return m*X + c

def gradients(m, c, X, y):
    y_hat = predictions(m, c, X)
    # Partial derivatives of MSE w.r.t m and c
    dm = (-2/len(X)) * np.sum(X * (y - y_hat))
    dc = (-2/len(X)) * np.sum(y - y_hat)
    return dm, dc

history = []  # track the cost after each update

for _ in range(epochs):
    dm, dc = gradients(m, c, X, y)
    m -= lr * dm            # move m opposite the gradient
    c -= lr * dc            # move c opposite the gradient
    history.append(mse(y, predictions(m, c, X)))

print(f"GD learned slope m={m:.3f}, intercept c={c:.3f}, final MSE={history[-1]:.2f}")

# Visualize learning: MSE should go down over epochs
plt.plot(history)
plt.xlabel("Epoch")
plt.ylabel("MSE (Cost)")
plt.title("Gradient Descent: Cost vs. Epochs")
plt.show()

Tip: If lr is too big, the loss will bounce or explode. If it’s too small, learning is very slow.
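
You can see both behaviours with a small experiment that reuses the helper functions above. For this particular data a step size around 0.01 is already too large, so the cost explodes; the values below are just examples:

# Same update loop, two learning rates (reuses X, y, gradients, predictions, mse)
def run_gd(step, epochs=30):
    m_t, c_t = 0.0, 0.0
    for _ in range(epochs):
        dm, dc = gradients(m_t, c_t, X, y)
        m_t -= step * dm
        c_t -= step * dc
    return mse(y, predictions(m_t, c_t, X))

print("MSE after 30 steps, lr=0.0005:", run_gd(0.0005))   # finite and falling → converging (slowly)
print("MSE after 30 steps, lr=0.01:  ", run_gd(0.01))     # astronomically large → diverging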


⭐ Scenario 4: Overfitting (When a Model Memorizes Noise)

Explanation: Overfitting happens when the model learns not just the real pattern but also the random noise—so it looks great on training data but fails on new data. We’ll build a dataset with both useful features (temperature, rain, festival) and noisy ones (traffic, dog_barks) to see this.

n = 300

# Features: some useful, some noisy
temp = np.random.uniform(5, 35, size=n)                    # useful
rain = np.random.binomial(1, 0.3, size=n)                  # somewhat useful
festival = np.random.binomial(1, 0.1, size=n)              # occasionally useful
traffic = np.random.normal(0, 1, size=n)                   # pure noise (never affects sales)
dog_barks = np.random.normal(0, 1, size=n)                 # pure noise (never affects sales)

# True relationship (what the world actually does)
true_sales = (120 - 2.5*temp + 10*rain + 15*festival 
              + np.random.normal(0, 3, size=n))            # irreducible noise

# Feature matrix
X = np.column_stack([temp, rain, festival, traffic, dog_barks])
feature_names = ["temp", "rain", "festival", "traffic", "dog_barks"]

# Split to detect overfitting (train vs test)
X_train, X_test, y_train, y_test = train_test_split(
    X, true_sales, test_size=0.25, random_state=42
)

# Plain Linear Regression (may overfit noisy features)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

print("Linear Regression Coefficients:")
for name, coef in zip(feature_names, lr_model.coef_):
    print(f"  {name:<10} -> {coef: .3f}")

print("Train MSE:", mean_squared_error(y_train, lr_model.predict(X_train)))
print("Test  MSE:", mean_squared_error(y_test,  lr_model.predict(X_test)))

What to look for:

  • Big coefficients on obviously noisy features (e.g., dog_barks)
  • Train MSE â‰Ș Test MSE → the model memorized training quirks
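
With 300 samples and only five features, the gap above may be fairly modest. To make overfitting easier to see, refit the same model on a tiny slice of the training data (a quick sketch that reuses the split from above):

# Overfitting is much easier to provoke with very little data
X_tiny, y_tiny = X_train[:10], y_train[:10]   # only 10 rows for 5 features
tiny_model = LinearRegression().fit(X_tiny, y_tiny)

print("Tiny-data Train MSE:", mean_squared_error(y_tiny, tiny_model.predict(X_tiny)))
print("Tiny-data Test  MSE:", mean_squared_error(y_test, tiny_model.predict(X_test)))
# Expect a much wider train/test gap: the model is memorizing the 10 rows it saw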

⭐ Scenario 5: Fixing Overfitting (Regularization Is the Hero)

Explanation: To fight overfitting, you can:

  • Remove useless features (domain knowledge)
  • Gather more data (less variance)
  • Add regularization (systematic, works even when noise isn’t obvious)

Regularization (the concept): Add a penalty to the loss for large coefficients. This discourages complex models that chase noise.
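
Concretely, the penalized loss is just the old MSE plus a term that grows with the size of the weights. Here is that idea spelled out for the plain model fitted above (an illustrative calculation only; sklearn applies the same idea inside Ridge and Lasso, with slightly different scaling):

# Penalized loss = MSE + alpha * (a measure of how big the weights are)
w = lr_model.coef_
alpha = 1.0                              # penalty strength (λ), chosen here just for illustration
plain_mse = mean_squared_error(y_train, lr_model.predict(X_train))

l2_penalty = alpha * np.sum(w ** 2)      # Ridge-style penalty (squared weights)
l1_penalty = alpha * np.sum(np.abs(w))   # Lasso-style penalty (absolute weights)

print("Plain training MSE:      ", plain_mse)
print("MSE + L2 penalty (Ridge):", plain_mse + l2_penalty)
print("MSE + L1 penalty (Lasso):", plain_mse + l1_penalty)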


⭐ Scenario 6: Regularization (Penalty for Complexity)

Tea analogy: Tell your tea‑maker: “Use too many ingredients and you lose points.” The model then prefers simpler recipes that generalize better.

  • Ridge (L2): Penalizes the square of weights → smoothly shrinks them toward zero
  • Lasso (L1): Penalizes the absolute value → can push some weights exactly to zero, performing feature selection

⭐ Scenario 7: Regularized Linear Regression (Ridge & Lasso)

Explanation: We’ll fit both Ridge and Lasso to see how penalties change coefficients and test performance. (For real projects, consider scaling features and using cross‑validation to choose alpha.)

# Ridge (L2): Shrinks coefficients smoothly
ridge = Ridge(alpha=10.0)    # alpha = strength of penalty (λ)
ridge.fit(X_train, y_train)

print("\nRidge Coefficients (alpha=10):")
for name, coef in zip(feature_names, ridge.coef_):
    print(f"  {name:<10} -> {coef: .3f}")

print("Ridge Train MSE:", mean_squared_error(y_train, ridge.predict(X_train)))
print("Ridge Test  MSE:", mean_squared_error(y_test,  ridge.predict(X_test)))

# Lasso (L1): Can set some coefficients exactly to zero
lasso = Lasso(alpha=1.0)     # try 0.1, 0.5, 2.0 and compare
lasso.fit(X_train, y_train)

print("\nLasso Coefficients (alpha=1.0):")
for name, coef in zip(feature_names, lasso.coef_):
    print(f"  {name:<10} -> {coef: .3f}")

print("Lasso Train MSE:", mean_squared_error(y_train, lasso.predict(X_train)))
print("Lasso Test  MSE:", mean_squared_error(y_test,  lasso.predict(X_test)))

Interpretation:

  • Ridge should shrink noisy coefficients closer to zero
  • Lasso may set truly useless features to zero
  • Test MSE often improves vs. plain linear regression
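
A quick way to check the feature‑selection effect on the Lasso model fitted above (which features, if any, get dropped depends on alpha and the random data):

# Which features did Lasso drop entirely?
dropped = [name for name, coef in zip(feature_names, lasso.coef_) if coef == 0.0]
print("Features Lasso zeroed out:", dropped if dropped else "none")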

⭐ Scenario 8: How Regularization Fixes Overfitting (Deep Dive)

Explanation: We’ll vary alpha and visualize how coefficients shrink and how train vs test MSE behave. Look for the alpha that minimizes test MSE—that’s your sweet spot.

alphas = [0.001, 0.1, 1.0, 10.0, 50.0]  # 0.001 ≈ essentially no regularization (a positive baseline keeps the log-scale plots below happy)
coef_paths_ridge = []
train_mse_ridge, test_mse_ridge = [], []

for a in alphas:
    model = Ridge(alpha=a)   # at a tiny alpha this behaves like plain LinearRegression
    model.fit(X_train, y_train)
    coef_paths_ridge.append(model.coef_)
    train_mse_ridge.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse_ridge.append(mean_squared_error(y_test,  model.predict(X_test)))

coef_paths_ridge = np.array(coef_paths_ridge)

# Coefficient shrinkage plot
plt.figure(figsize=(8, 5))
for i, name in enumerate(feature_names):
    plt.plot(alphas, coef_paths_ridge[:, i], marker="o", label=name)
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("Coefficient value")
plt.title("Ridge: Coefficient Shrinkage with Increasing Penalty")
plt.legend()
plt.show()

# Train vs Test MSE plot (watch for over/underfitting)
plt.figure(figsize=(8, 5))
plt.plot(alphas, train_mse_ridge, marker="o", label="Train MSE")
plt.plot(alphas, test_mse_ridge, marker="o", label="Test MSE")
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("MSE")
plt.title("Ridge: Train vs Test MSE Across Penalties")
plt.legend()
plt.show()

Reading the charts:

  • Low alpha → big coefficients, risk of overfitting (low train MSE, higher test MSE)
  • Moderate alpha → coefficients shrink, generalization improves
  • Too‑high alpha → model too simple (underfitting), both MSEs rise
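
The same sweep works for Lasso, and because of the absolute‑value penalty some coefficients snap to exactly zero as alpha grows. A sketch reusing the split from above (the alpha values are just examples):

# Lasso path: watch coefficients hit exactly zero
lasso_alphas = [0.01, 0.1, 1.0, 5.0]
coef_paths_lasso = []
for a in lasso_alphas:
    model = Lasso(alpha=a, max_iter=10000)   # extra iterations help convergence at small alpha
    model.fit(X_train, y_train)
    coef_paths_lasso.append(model.coef_)
coef_paths_lasso = np.array(coef_paths_lasso)

plt.figure(figsize=(8, 5))
for i, name in enumerate(feature_names):
    plt.plot(lasso_alphas, coef_paths_lasso[:, i], marker="o", label=name)
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("Coefficient value")
plt.title("Lasso: Coefficients Hitting Exactly Zero")
plt.legend()
plt.show()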

🧠 Bonus: Simple Tea Forecast Function

Explanation: Once you’ve trained a good model (e.g., ridge), you can wrap it in a small function to quickly forecast cups for a given day.

def forecast_tea_cups(temp_c, rain=0, festival=0, model=ridge):
    """
    Predict tea cups for given conditions.
    We ignore traffic/dog_barks at prediction time since they were noise.
    """
    x = np.array([[temp_c, rain, festival, 0.0, 0.0]])
    return float(model.predict(x)[0])

print("Forecast for 18°C, raining, festival day:",
      round(forecast_tea_cups(18, rain=1, festival=1)))

print("Forecast for 30°C, no rain, normal day:",
      round(forecast_tea_cups(30, rain=0, festival=0)))

✅ Final Takeaways

  ‱ Linear Regression fits the best straight line relating features to the target.
  • Cost Function (MSE) penalizes prediction errors, especially big ones.
  • Gradient Descent iteratively improves parameters to minimize loss.
  • Overfitting = learning noise; great on training, poor on new data.
  ‱ Regularization (Ridge/Lasso) shrinks weights, damps the influence of noisy features, and improves generalization.
  • Choose α (lambda) carefully: too small → overfit; too large → underfit.

🎯 Practical Tips for Beginners

  • Scale features (e.g., StandardScaler) before Lasso/Ridge so alpha behaves consistently.
  • Use train/test split (and cross‑validation) to choose alpha that minimizes test error.
  • Start with Ridge for stability; try Lasso when you suspect some features are useless.
  • Plot residuals (actual − predicted); random scatter = good, patterns = model mis‑specification.
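
Putting the first, second, and last tips together, here is one way that workflow might look on the same tea data, using sklearn’s Pipeline and RidgeCV (a sketch; the alpha grid is arbitrary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# Scale features, then let cross-validation pick alpha
pipe = make_pipeline(StandardScaler(),
                     RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], cv=5))
pipe.fit(X_train, y_train)

print("Chosen alpha:", pipe.named_steps["ridgecv"].alpha_)
print("Test MSE:", mean_squared_error(y_test, pipe.predict(X_test)))

# Residual plot: random scatter around zero is what you want to see
residuals = y_test - pipe.predict(X_test)
plt.scatter(pipe.predict(X_test), residuals, color="teal")
plt.axhline(0, color="orange")
plt.xlabel("Predicted tea cups")
plt.ylabel("Residual (actual − predicted)")
plt.title("Residuals for the Scaled Ridge Pipeline")
plt.show()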
