🍵 Linear Regression for Absolute Beginners
With Code Examples and Tea-Stall Stories
Machine learning terms like cost function, gradient descent, regularization, and overfitting can feel abstract. We'll make them concrete with a tea stall story and then bring them to life with working Python code you can run.
What's ahead
- Linear Regression: predict tea sales from temperature
- Cost Function (MSE): measure how wrong predictions are
- Gradient Descent: improve step-by-step
- Overfitting: when a model memorizes noise
- Regularization (Ridge/Lasso): keep models simple and robust
- Visualizing how penalties shrink coefficients
- Handy forecast function
🧪 Setup (Run These First)
Explanation: We import the tools we'll use. numpy and pandas handle numbers/data; matplotlib draws charts; sklearn gives us ready-made ML models and utilities.
# Install if needed: pip install numpy pandas scikit-learn matplotlib
# Core libraries
import numpy as np # numerical computations
import pandas as pd # data tables (light use here)
import matplotlib.pyplot as plt # plotting
# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
# Utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Make randomness repeatable (so results are the same each run)
np.random.seed(42)
✅ Scenario 1: Linear Regression (Tea Sales vs. Temperature)
Story: As temperature rises, fewer people want hot tea. We'll fit a straight line to predict tea cups sold from °C.
Key idea: Linear regression finds the line y = m·x + c that best matches the data (best = smallest average error).
# Synthetic dataset: temperature (°C) → tea cups sold
temps = np.array([10, 12, 15, 18, 20, 22, 24, 26, 28]).reshape(-1, 1)
# reshape(-1,1) turns a 1D list into a column vector for sklearn
tea_sales = np.array([100, 95, 85, 70, 60, 55, 50, 45, 40])
# Create and train the model
lin = LinearRegression()
lin.fit(temps, tea_sales)
print("Slope (m):", lin.coef_[0]) # change in cups for +1°C
print("Intercept (c):", lin.intercept_) # cups at 0°C
# Predict tea sales for 21°C
tomorrow_temp = np.array([[21]])
pred_sales = lin.predict(tomorrow_temp)
print("Predicted tea cups at 21°C:", int(pred_sales[0]))
# Plot data and the fitted line
plt.scatter(temps, tea_sales, color="teal", label="Actual")
plt.plot(temps, lin.predict(temps), color="orange", label="Fitted line")
plt.xlabel("Temperature (°C)")
plt.ylabel("Tea cups sold")
plt.title("Linear Regression: Tea Sales vs. Temperature")
plt.legend()
plt.show()
✅ Scenario 2: Cost Function (How Wrong Are We?)
# MSE: Mean Squared Error = average of squared differences
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
# Evaluate our fitted line on the training data
y_pred = lin.predict(temps)
print("Mean Squared Error (MSE):", mse(tea_sales, y_pred))
# Lower MSE = better fit
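For a sense of scale, it helps to compare against a naive baseline that always predicts the average number of cups. This small check reuses mse, tea_sales, and y_pred from above; the exact numbers depend on the data.

# Naive baseline: always predict the mean of the observed sales
baseline_pred = np.full_like(tea_sales, tea_sales.mean(), dtype=float)
print("Baseline MSE (always predict the mean):", mse(tea_sales, baseline_pred))
print("Fitted-line MSE:", mse(tea_sales, y_pred))
# The fitted line should score far below the baseline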
✅ Scenario 3: Gradient Descent (Learning Step-by-Step)
Explanation: Gradient descent adjusts parameters (m and c) little by little to reduce MSE. Imagine tasting a tea recipe and tweaking sugar and milk in the direction that improves taste; repeat until it's good enough.
# Manual gradient descent for y = m*x + c
X = temps.flatten()
y = tea_sales.astype(float)
m, c = 0.0, 0.0 # start with guesses
lr = 0.0005 # learning rate: step size
epochs = 5000 # number of update steps
def predictions(m, c, X):
    return m*X + c

def gradients(m, c, X, y):
    y_hat = predictions(m, c, X)
    # Partial derivatives of MSE w.r.t. m and c
    dm = (-2/len(X)) * np.sum(X * (y - y_hat))
    dc = (-2/len(X)) * np.sum(y - y_hat)
    return dm, dc

history = []  # track the cost after each update
for _ in range(epochs):
    dm, dc = gradients(m, c, X, y)
    m -= lr * dm  # move m opposite the gradient
    c -= lr * dc  # move c opposite the gradient
    history.append(mse(y, predictions(m, c, X)))
print(f"GD learned slope m={m:.3f}, intercept c={c:.3f}, final MSE={history[-1]:.2f}")
# Visualize learning: MSE should go down over epochs
plt.plot(history)
plt.xlabel("Epoch")
plt.ylabel("MSE (Cost)")
plt.title("Gradient Descent: Cost vs. Epochs")
plt.show()
Tip: If lr is too big, the loss will bounce or explode. If it's too small, learning is very slow.
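If you want to see this for yourself, the short experiment below reruns a handful of gradient descent steps with different learning rates (it reuses X, y, gradients, predictions, and mse from above; the exact values you see will vary).

# Try a too-big, a reasonable, and a too-small learning rate for 10 steps each
for test_lr in [0.01, 0.0005, 0.000001]:
    m_t, c_t = 0.0, 0.0
    for _ in range(10):
        dm_t, dc_t = gradients(m_t, c_t, X, y)
        m_t -= test_lr * dm_t
        c_t -= test_lr * dc_t
    print(f"lr={test_lr}: MSE after 10 steps = {mse(y, predictions(m_t, c_t, X)):.2f}")
# Expect roughly: 0.01 blows up, 0.0005 improves steadily, 0.000001 barely moves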
✅ Scenario 4: Overfitting (When a Model Memorizes Noise)
Explanation: Overfitting happens when the model learns not just the real pattern but also the random noise, so it looks great on training data but fails on new data. We'll build a dataset with both useful features (temperature, rain, festival) and noisy ones (traffic, dog_barks) to see this.
n = 300
# Features: some useful, some noisy
temp = np.random.uniform(5, 35, size=n) # useful
rain = np.random.binomial(1, 0.3, size=n) # somewhat useful
festival = np.random.binomial(1, 0.1, size=n) # occasionally useful
traffic = np.random.normal(0, 1, size=n) # mostly noise
dog_barks = np.random.normal(0, 1, size=n) # pure noise
# True relationship (what the world actually does)
true_sales = (120 - 2.5*temp + 10*rain + 15*festival
+ np.random.normal(0, 3, size=n)) # irreducible noise
# Feature matrix
X = np.column_stack([temp, rain, festival, traffic, dog_barks])
feature_names = ["temp", "rain", "festival", "traffic", "dog_barks"]
# Split to detect overfitting (train vs test)
X_train, X_test, y_train, y_test = train_test_split(
X, true_sales, test_size=0.25, random_state=42
)
# Plain Linear Regression (may overfit noisy features)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
print("Linear Regression Coefficients:")
for name, coef in zip(feature_names, lr_model.coef_):
    print(f" {name:<10} -> {coef: .3f}")
print("Train MSE:", mean_squared_error(y_train, lr_model.predict(X_train)))
print("Test MSE:", mean_squared_error(y_test, lr_model.predict(X_test)))
What to look for:
- Big coefficients on obviously noisy features (e.g., dog_barks)
- Train MSE ≪ Test MSE → the model memorized training quirks
✅ Scenario 5: Fixing Overfitting (Regularization Is the Hero)
Explanation: To fight overfitting, you can:
- Remove useless features (domain knowledge)
- Gather more data (less variance)
- Add regularization (systematic, works even when noise isnât obvious)
Regularization (the concept): Add a penalty to the loss for large coefficients. This discourages complex models that chase noise.
✅ Scenario 6: Regularization (Penalty for Complexity)
Tea analogy: Tell your tea-maker: "Use too many ingredients and you lose points." The model then prefers simpler recipes that generalize better.
- Ridge (L2): Penalizes the square of the weights → smoothly shrinks them toward zero
- Lasso (L1): Penalizes the absolute value → can push some weights exactly to zero, performing feature selection (both penalties are sketched in code right after this list)
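To make the penalty idea concrete, here is a minimal sketch of the two regularized costs written with plain numpy, reusing the mse helper from Scenario 2. It is illustrative only; sklearn's actual objectives differ in small details (for example, the intercept is not penalized and the scaling conventions vary).

# w = array of model weights (coefficients), alpha = penalty strength
def ridge_cost(y_true, y_pred, w, alpha):
    return mse(y_true, y_pred) + alpha * np.sum(w**2)       # L2: squared weights

def lasso_cost(y_true, y_pred, w, alpha):
    return mse(y_true, y_pred) + alpha * np.sum(np.abs(w))  # L1: absolute weights
# Bigger weights -> bigger penalty, so the optimizer prefers smaller, simpler models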
✅ Scenario 7: Regularized Linear Regression (Ridge & Lasso)
Explanation: We'll fit both Ridge and Lasso to see how penalties change coefficients and test performance. (For real projects, consider scaling features and using cross-validation to choose alpha.)
# Ridge (L2): Shrinks coefficients smoothly
ridge = Ridge(alpha=10.0) # alpha = strength of penalty (λ)
ridge.fit(X_train, y_train)
print("\nRidge Coefficients (alpha=10):")
for name, coef in zip(feature_names, ridge.coef_):
    print(f" {name:<10} -> {coef: .3f}")
print("Ridge Train MSE:", mean_squared_error(y_train, ridge.predict(X_train)))
print("Ridge Test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
# Lasso (L1): Can set some coefficients exactly to zero
lasso = Lasso(alpha=1.0) # try 0.1, 0.5, 2.0 and compare
lasso.fit(X_train, y_train)
print("\nLasso Coefficients (alpha=1.0):")
for name, coef in zip(feature_names, lasso.coef_):
    print(f" {name:<10} -> {coef: .3f}")
print("Lasso Train MSE:", mean_squared_error(y_train, lasso.predict(X_train)))
print("Lasso Test MSE:", mean_squared_error(y_test, lasso.predict(X_test)))
Interpretation:
- Ridge should shrink noisy coefficients closer to zero
- Lasso may set truly useless features to zero (a quick check follows after this list)
- Test MSE often improves vs. plain linear regression
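A quick way to see which features Lasso dropped, reusing lasso and feature_names from above (the exact list depends on the alpha you chose):

# Features whose Lasso coefficient is exactly zero were effectively removed
dropped = [name for name, coef in zip(feature_names, lasso.coef_) if coef == 0]
print("Features Lasso zeroed out:", dropped)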
✅ Scenario 8: How Regularization Fixes Overfitting (Deep Dive)
Explanation: We'll vary alpha and visualize how coefficients shrink and how train vs. test MSE behave. Look for the alpha that minimizes test MSE; that's your sweet spot.
alphas = [0.0, 0.1, 1.0, 10.0, 50.0] # 0.0 = no regularization baseline
coef_paths_ridge = []
train_mse_ridge, test_mse_ridge = [], []
for a in alphas:
    model = LinearRegression() if a == 0.0 else Ridge(alpha=a)
    model.fit(X_train, y_train)
    coef_paths_ridge.append(model.coef_)
    train_mse_ridge.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse_ridge.append(mean_squared_error(y_test, model.predict(X_test)))
coef_paths_ridge = np.array(coef_paths_ridge)
# Coefficient shrinkage plot
plt.figure(figsize=(8, 5))
for i, name in enumerate(feature_names):
    plt.plot(alphas, coef_paths_ridge[:, i], marker="o", label=name)
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("Coefficient value")
plt.title("Ridge: Coefficient Shrinkage with Increasing Penalty")
plt.legend()
plt.show()
# Train vs Test MSE plot (watch for over/underfitting)
plt.figure(figsize=(8, 5))
plt.plot(alphas, train_mse_ridge, marker="o", label="Train MSE")
plt.plot(alphas, test_mse_ridge, marker="o", label="Test MSE")
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("MSE")
plt.title("Ridge: Train vs Test MSE Across Penalties")
plt.legend()
plt.show()
Reading the charts:
- Low alpha → big coefficients, risk of overfitting (low train MSE, higher test MSE)
- Moderate alpha → coefficients shrink, generalization improves
- Too-high alpha → model too simple (underfitting), both MSEs rise
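In code, picking that sweet spot from the sweep above could look like this (reusing alphas and test_mse_ridge):

# The alpha with the lowest test MSE is the best candidate from this sweep
best_alpha = alphas[int(np.argmin(test_mse_ridge))]
print("Alpha with the lowest test MSE:", best_alpha)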
🧠 Bonus: Simple Tea Forecast Function
Explanation: Once you've trained a good model (e.g., ridge), you can wrap it in a small function to quickly forecast cups for a given day.
def forecast_tea_cups(temp_c, rain=0, festival=0, model=ridge):
    """
    Predict tea cups for given conditions.
    We ignore traffic/dog_barks at prediction time since they were noise.
    """
    x = np.array([[temp_c, rain, festival, 0.0, 0.0]])
    return float(model.predict(x)[0])
print("Forecast for 18°C, raining, festival day:",
round(forecast_tea_cups(18, rain=1, festival=1)))
print("Forecast for 30°C, no rain, normal day:",
round(forecast_tea_cups(30, rain=0, festival=0)))
✅ Final Takeaways
- Linear Regression draws the best straight line between features and target.
- Cost Function (MSE) penalizes prediction errors, especially big ones.
- Gradient Descent iteratively improves parameters to minimize loss.
- Overfitting = learning noise; great on training, poor on new data.
- Regularization (Ridge/Lasso) shrinks weights, tames noisy features, and improves generalization.
- Choose α (lambda) carefully: too small → overfit; too large → underfit.
🎯 Practical Tips for Beginners
- Scale features (e.g., with StandardScaler) before Lasso/Ridge so alpha behaves consistently.
- Use a train/test split (and cross-validation) to choose the alpha that minimizes test error (see the sketch after this list).
- Start with Ridge for stability; try Lasso when you suspect some features are useless.
- Plot residuals (actual − predicted); random scatter = good, patterns = model mis-specification.
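As a rough sketch of the first two tips combined (scaling plus cross-validated alpha selection) using sklearn's Pipeline and RidgeCV. The alpha grid here is only an example, and it reuses the train/test split from Scenario 4.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# Scale features, then let RidgeCV pick alpha via cross-validation
scaled_ridge = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.1, 1.0, 10.0, 50.0])
)
scaled_ridge.fit(X_train, y_train)
print("Chosen alpha:", scaled_ridge.named_steps["ridgecv"].alpha_)
print("Test MSE:", mean_squared_error(y_test, scaled_ridge.predict(X_test)))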