
Ertugrul


5 Regression Projects in Python (with Full Code)

Linear regression is one of the foundational algorithms in machine learning and statistics. Beyond the theory, though, implementation matters, especially when building end-to-end predictive systems. In this post, I’ll walk you through five linear regression projects I built using Python and scikit-learn, each solving a different problem with a different dataset.


📐 Introduction to Linear Regression & Evaluation Metrics

Linear regression is a statistical technique used to model the relationship between a dependent variable (target) and one or more independent variables (features). It's one of the simplest and most widely used regression techniques due to its interpretability and ease of implementation.

📈 Why Use Linear Regression?

  • Easy to implement and explain
  • Computationally efficient
  • Good for initial baseline models
  • Interpretable coefficients

In all of the projects below, linear regression served as a baseline approach to see how well basic relationships between features and targets could be modeled.


📐 Understanding R² and MSE

When evaluating a regression model, two of the most important metrics are R² (R-squared) and Mean Squared Error (MSE). Here's how they work and how to interpret them:

🔹 R² Score (Coefficient of Determination)

  • What it measures: The proportion of variance in the target variable that is predictable from the features.
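  • Range: At most 1 (closer to 1 is better). On the training data of a linear model with an intercept it lies between 0 and 1; on held-out test data it can drop below 0.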
  • Interpretation:

    • An R² of 0.90 means 90% of the variance in the target is explained by the model.
    • A negative R² can occur when the model performs worse than simply predicting the mean — usually a sign of poor fit or flawed features.
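
To make the definition concrete, here is a minimal sketch that computes R² by hand as 1 - SS_res / SS_tot and checks it against scikit-learn's `r2_score`. The helper `r2_manual` and the toy numbers are made up for illustration; they are not taken from any of the five projects.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


# Toy targets (illustrative only); their mean is 6.0
y_true = [3.0, 5.0, 7.0, 9.0]

good_pred = [2.8, 5.1, 7.2, 8.9]  # close to the targets
mean_pred = [6.0, 6.0, 6.0, 6.0]  # always predicts the mean
bad_pred = [9.0, 7.0, 5.0, 3.0]   # gets the trend backwards

r2_good = r2_manual(y_true, good_pred)  # about 0.995
r2_mean = r2_manual(y_true, mean_pred)  # 0.0, no better than the mean
r2_bad = r2_manual(y_true, bad_pred)    # -3.0, worse than the mean

print(r2_good, r2_mean, r2_bad)
print(r2_score(y_true, bad_pred))  # scikit-learn agrees with the manual value
```

The third case shows why a negative score is possible: the residual error is four times larger than the error of the mean-only baseline.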

🔹 Mean Squared Error (MSE)

  • What it measures: The average of the squares of the prediction errors.
  • Interpretation:

    • Lower MSE = better performance
    • Sensitive to outliers because it squares the error
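
A small sketch of that outlier sensitivity, using made-up numbers: two sets of predictions are compared with both MSE and mean absolute error (MAE, included here only as a contrast; the projects themselves report MSE).

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


def mae(y_true, y_pred):
    """Mean of the absolute prediction errors (for comparison only)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


# Toy values (illustrative only)
y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
small_errors = [11.0, 11.0, 12.0, 12.0, 13.0]  # every prediction is off by 1
one_outlier = [10.0, 12.0, 11.0, 13.0, 22.0]   # four exact, one off by 10

mse_small = mse(y_true, small_errors)    # 1.0
mse_outlier = mse(y_true, one_outlier)   # 100 / 5 = 20.0
mae_small = mae(y_true, small_errors)    # 1.0
mae_outlier = mae(y_true, one_outlier)   # 10 / 5 = 2.0

print(mse_small, mse_outlier, mae_small, mae_outlier)
```

A single large miss raises MAE by a factor of 2 but MSE by a factor of 20, which is why one extreme row can dominate the MSE of an otherwise accurate model.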

🔍 How to Improve These Scores

  • Feature scaling and normalization
  • Removing or capping outliers
  • One-hot encoding for categorical variables
  • Feature selection or dimensionality reduction
  • Trying polynomial or regularized models (e.g., Ridge, Lasso)

📘 In my experiments, encoding quality and removing outliers had a huge impact on both metrics.
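
As a rough sketch of how several of these ideas can be combined, the snippet below caps target outliers, scales a numeric column, one-hot encodes a categorical column, and fits a Ridge model in one scikit-learn pipeline. The data is synthetic and the column names (`age`, `fuel`), the 1st/99th percentile caps, and `alpha=1.0` are assumptions chosen for illustration, not settings from the repository.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not from the post's datasets)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "fuel": rng.choice(["petrol", "diesel", "cng"], size=n),
})
y = 0.5 * df["age"] + 3.0 * (df["fuel"] == "diesel") + rng.normal(0, 1, size=n)

# Cap extreme target values at the 1st and 99th percentiles
low, high = y.quantile(0.01), y.quantile(0.99)
y_capped = y.clip(low, high)

# Scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])

# Ridge is a regularized variant of linear regression
model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])
model.fit(df, y_capped)

train_r2 = model.score(df, y_capped)  # R² on the training data
print(f"Training R²: {train_r2:.3f}")
```

One design note: scaling does not change the predictions of plain least-squares regression, but it does matter for Ridge and Lasso, because their penalty treats all coefficients on the same scale.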


🔍 Projects Overview

📈 1. Salary Prediction

  • Dataset: Salary_dataset.csv

🧪 My Experience:
Model Accuracy (R²): 0.9024
Mean Squared Error: 49,830,096.86

Sample Predictions:

  • Predicted: 115,791.21, Actual: 112,636.00
  • Predicted: 71,499.28, Actual: 67,939.00
  • Predicted: 102,597.87, Actual: 113,813.00
  • Predicted: 75,268.80, Actual: 83,089.00
  • Predicted: 55,478.79, Actual: 64,446.00

Coefficient (Experience): 9423.82
Intercept: 24,380.20
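
With a single feature, the fitted model is just a line: salary = intercept + coefficient × years of experience. The sketch below plugs in the coefficient and intercept reported above and also converts the MSE into a root mean squared error (RMSE), which is in the same units as the salary. The function name `predict_salary` and the 5-year input are illustrative choices.

```python
# Values reported above for the salary model
COEF_EXPERIENCE = 9423.82  # salary increase per additional year of experience
INTERCEPT = 24380.20       # predicted salary at zero years of experience
MSE = 49_830_096.86


def predict_salary(years_experience):
    """Evaluate the fitted line: intercept + coefficient * years."""
    return INTERCEPT + COEF_EXPERIENCE * years_experience


salary_5y = predict_salary(5.0)  # 24380.20 + 5 * 9423.82, about 71499.30
rmse = MSE ** 0.5                # about 7059, in salary units

print(f"Predicted salary at 5 years: {salary_5y:,.2f}")
print(f"RMSE: {rmse:,.2f}")
```

Read this way, the MSE of roughly 49.8 million corresponds to a typical error of about 7,000 in salary units, which is easier to compare with the sample predictions.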

Figure: Actual vs. Predicted Salary

🚗 2. Car Price Estimation

  • Dataset: cars24-car-price-clean2.csv

🧪 My Experience:
Model Accuracy (R²): 0.6328
Mean Squared Error: 32.15

Sample Predictions:

  • Predicted: 6.72, Actual: 7.00
  • Predicted: 6.32, Actual: 4.75
  • Predicted: 8.28, Actual: 6.30
  • Predicted: 6.04, Actual: 5.25
  • Predicted: 0.53, Actual: 2.10

With an R² of 0.63, the model explains roughly 63% of the variance in price. It captures the overall price trend but is sensitive to feature scaling and category encoding, and outliers reduce its accuracy.

Figure: Actual vs. Predicted Price

🍷 3. Wine Price Prediction

  • Dataset: wine.csv

🧪 My Experience:
Model Accuracy (R²): 0.5271
Mean Squared Error: 0.18

Sample Predictions:

  • Predicted: 7.54, Actual: 7.39
  • Predicted: 7.32, Actual: 7.59
  • Predicted: 7.67, Actual: 7.50
  • Predicted: 6.98, Actual: 6.26
  • Predicted: 5.77, Actual: 6.25

With an R² of 0.53, the model explains about half of the variance and struggles with subtle differences in wine composition. It could benefit from more domain-specific features or nonlinear modeling.

Figure: Actual vs. Predicted Price

🏥 4. Insurance Charges Prediction

  • Dataset: insurance.csv

🧪 My Experience:
Model Accuracy (R²): 0.7836
Mean Squared Error: 33,596,915.85

Sample Predictions:

  • Predicted: 8969.55, Actual: 9095.07
  • Predicted: 7068.75, Actual: 5272.18
  • Predicted: 36858.41, Actual: 29330.98
  • Predicted: 9454.68, Actual: 9301.89
  • Predicted: 26973.17, Actual: 33750.29

Most influential features: smoker, BMI, and age. With an R² of 0.78, the model explains most of the variance in charges but deviates in extreme cases, as the third and fifth sample predictions show.
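
One way to see which features carry the most weight is to pair each fitted coefficient with its column name. The sketch below does this on synthetic data; the column names (`age`, `bmi`, `smoker`) and the numbers used to generate the target are assumptions for illustration and do not come from insurance.csv or from the repository's script.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical columns and effect sizes)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
charges = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 2000, size=n)
)

# One-hot encode the categorical column; drop_first avoids a redundant column
X = pd.get_dummies(df, columns=["smoker"], drop_first=True, dtype=float)

model = LinearRegression().fit(X, charges)

# Pair coefficients with feature names, largest absolute value first
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
```

Raw coefficients depend on each feature's units (per year of age, per BMI point, smoker vs. non-smoker), so ranking them this way is only a first look; standardizing the numeric features first gives a fairer comparison.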

Figure: Actual vs. Predicted Charge

🚖 5. Trip Duration Prediction

  • Dataset: train.csv

🧪 My Experience:
Model Accuracy (R²): 0.0227
Mean Squared Error: 23,060,274.73

Sample Predictions:

  • Predicted: 791.72, Actual: 473.00
  • Predicted: 531.07, Actual: 157.00
  • Predicted: 1706.45, Actual: 1862.00
  • Predicted: 1403.61, Actual: 1573.00
  • Predicted: 1381.69, Actual: 1318.00

With an R² of 0.02, the model explains almost none of the variance in trip duration. Likely causes are high noise, categorical values that were not accounted for, or insufficient feature engineering.
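
The two reported numbers also hint at how spread out the target is. Since R² = 1 - MSE / Var(y), the variance of the test targets can be backed out as MSE / (1 - R²). The sketch below does that arithmetic; it assumes both metrics were computed on the same test set with the usual scikit-learn definitions, which the post does not state explicitly.

```python
# Values reported above for the trip duration model
R2 = 0.0227
MSE = 23_060_274.73

rmse = MSE ** 0.5                 # typical error, in the target's units
implied_var = MSE / (1.0 - R2)    # from R² = 1 - MSE / Var(y)
implied_std = implied_var ** 0.5  # spread of the test targets

print(f"RMSE: {rmse:,.1f}")
print(f"Implied standard deviation of the target: {implied_std:,.1f}")
```

Both values come out near 4,800, while the sample durations listed above range from 157 to 1,862. Under the stated assumption, that gap suggests a small number of very long trips dominates the squared error, which fits the high-noise explanation.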

Figure: Actual vs. Predicted Trip Duration


🔄 Common Project Workflow

Each of the five projects follows a similar pipeline:

  1. Data Loading & Cleaning: Load CSV, remove nulls
  2. Feature Engineering: Encode categorical features using one-hot or binary encoding
  3. Train-Test Split: Typically 80/20 split
  4. Model Training: Use LinearRegression() from scikit-learn
  5. Evaluation: R² Score and Mean Squared Error (MSE)
  6. Visualization: Scatter plot of actual vs. predicted values
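
The six steps above can be sketched roughly as follows. This is not the repository's code: it uses a small synthetic DataFrame in place of a CSV file, and the column names (`years`, `city`, `target`), the random seeds, and the helper `plot_actual_vs_predicted` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. Data loading & cleaning (synthetic stand-in for pd.read_csv(...))
rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "years": rng.uniform(0, 10, size=n),
    "city": rng.choice(["A", "B"], size=n),
})
df["target"] = 3.0 * df["years"] + 2.0 * (df["city"] == "B") + rng.normal(0, 1, size=n)
df.loc[0, "years"] = np.nan  # simulate a missing value
df = df.dropna()             # remove rows with nulls

# 2. Feature engineering: one-hot encode the categorical column
X = pd.get_dummies(df.drop(columns="target"), columns=["city"], drop_first=True, dtype=float)
y = df["target"]

# 3. Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Model training
model = LinearRegression().fit(X_train, y_train)

# 5. Evaluation on the held-out 20%
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R²: {r2:.4f}, MSE: {mse:.4f}")


# 6. Visualization: scatter plot of actual vs. predicted values
def plot_actual_vs_predicted(actual, predicted):
    import matplotlib.pyplot as plt  # imported here so steps 1-5 run without it

    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, alpha=0.6)
    lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
    ax.plot(lims, lims, linestyle="--")  # perfect predictions fall on this line
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    return fig
```

Calling `plot_actual_vs_predicted(y_test, y_pred)` and then `matplotlib.pyplot.show()` would draw the scatter plot; points close to the dashed diagonal are well predicted.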



▶️ How to Run Each Project

Clone the repo, then run any project with:

python Project_X_name.py

Replace Project_X_name.py with the script for the specific model.


📦 Installation

Ensure dependencies are installed:

pip install -r requirements.txt

📂 Access the Code

You can find the complete repository here:

📎 https://github.com/Ertugrulmutlu/5-Linear-Regression-Projects/tree/main


🧠 What’s Next?

  • Upgrade the models with Ridge or Lasso Regression
  • Add interactive dashboards with Streamlit
  • Deploy the models via web or API

💬 Feedback?

Tried one of the projects? Found a bug? Want to share your own regression experiments?
Let’s chat in the comments or connect via GitHub!


Thanks for reading — and happy modeling!
