5 Regression Projects in Python (with Full Code)
Linear regression is one of the foundational algorithms in machine learning and statistics. But beyond the theory, real-world implementation matters, especially when it comes to building end-to-end predictive systems. In this post, I’ll walk you through five hands-on linear regression projects I built using Python and scikit-learn, each solving a different problem on a different dataset.
📐 Introduction to Linear Regression & Evaluation Metrics
Linear regression is a statistical technique used to model the relationship between a dependent variable (target) and one or more independent variables (features). It's one of the simplest and most widely used regression techniques due to its interpretability and ease of implementation.
📈 Why Use Linear Regression?
- Easy to implement and explain
- Computationally efficient
- Good for initial baseline models
- Interpretable coefficients
In all of the projects below, linear regression served as a baseline approach to see how well basic relationships between features and targets could be modeled.
📐 Understanding R² and MSE
When evaluating a regression model, two of the most important metrics are R² (R-squared) and Mean Squared Error (MSE). Here's how they work and how to interpret them:
🔹 R² Score (Coefficient of Determination)
- What it measures: The proportion of variance in the target variable that is predictable from the features.
- Range: at most 1 (closer to 1 is better); it can go below 0 for bad fits
- Interpretation:
  - An R² of 0.90 means 90% of the variance in the target is explained by the model.
  - A negative R² can occur when the model performs worse than simply predicting the mean, usually a sign of poor fit or flawed features.
🔹 Mean Squared Error (MSE)
- What it measures: The average of the squared prediction errors, expressed in squared units of the target.
- Interpretation:
  - Lower MSE = better performance
  - Sensitive to outliers because the errors are squared
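Both metrics are a single call each in scikit-learn. A minimal sketch with toy numbers (illustrative only, not from any of the projects below):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual vs. predicted values (illustrative only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.3, 8.7]

mse = mean_squared_error(y_true, y_pred)  # average of squared errors
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot

print(f"MSE: {mse:.4f}")
print(f"R²:  {r2:.4f}")
```

Note that MSE depends on the target's units, which is why the salary and insurance projects below report MSE in the millions while the wine project's MSE is below 1.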
🔍 How to Improve These Scores
- Feature scaling and normalization
- Removing or capping outliers
- One-hot encoding for categorical variables
- Feature selection or dimensionality reduction
- Trying polynomial or regularized models (e.g., Ridge, Lasso)
📘 In my experiments, encoding quality and removing outliers had a huge impact on both metrics.
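As a sketch of the two fixes that helped most, here is how outlier capping and one-hot encoding look in pandas. The frame and column names are made up for illustration, not taken from the project datasets:

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative
df = pd.DataFrame({
    "price": [5.0, 6.2, 4.8, 55.0, 5.5],            # 55.0 is an outlier
    "fuel": ["petrol", "diesel", "petrol", "petrol", "cng"],
})

# Cap outliers at the 1st/99th percentiles instead of dropping rows
lo, hi = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(lo, hi)

# One-hot encode the categorical column (drop_first avoids redundancy)
df = pd.get_dummies(df, columns=["fuel"], drop_first=True)
print(df.columns.tolist())
```

Capping (rather than deleting) keeps the sample size intact while removing the leverage that extreme values have on the squared-error loss.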
🔍 Projects Overview
📈 1. Salary Prediction
- Dataset: Salary_dataset.csv
🧪 My Experience:
Model Accuracy (R²): 0.9024
Mean Squared Error: 49,830,096.86
Sample Predictions:
- Predicted: 115,791.21, Actual: 112,636.00
- Predicted: 71,499.28, Actual: 67,939.00
- Predicted: 102,597.87, Actual: 113,813.00
- Predicted: 75,268.80, Actual: 83,089.00
- Predicted: 55,478.79, Actual: 64,446.00
Coefficient (Experience): 9423.82
Intercept: 24,380.20
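With a single feature, the fitted line is just y = coef · x + intercept, so the coefficient and intercept reported above are enough to reproduce a prediction by hand:

```python
# Coefficient and intercept as reported by the fitted model above
coef = 9423.82        # salary increase per year of experience
intercept = 24380.20  # predicted salary at zero experience

def predict_salary(years_experience: float) -> float:
    """Apply the fitted line: salary = coef * years + intercept."""
    return coef * years_experience + intercept

print(predict_salary(5))  # roughly 71,499, close to the second sample prediction
```

This interpretability is the main selling point of linear regression: each additional year of experience adds about 9,424 to the predicted salary, full stop.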
🚗 2. Car Price Estimation
- Dataset: cars24-car-price-clean2.csv
🧪 My Experience:
Model Accuracy (R²): 0.6328
Mean Squared Error: 32.15
Sample Predictions:
- Predicted: 6.72, Actual: 7.00
- Predicted: 6.32, Actual: 4.75
- Predicted: 8.28, Actual: 6.30
- Predicted: 6.04, Actual: 5.25
- Predicted: 0.53, Actual: 2.10
The model performs moderately well. It shows clear potential in capturing price trends but is sensitive to feature scaling and category encoding. Outliers affect accuracy.
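Since the model is sensitive to feature scaling, one standard fix is to put a scaler in front of the regressor in a pipeline. A minimal sketch with made-up numeric features (not the actual cars24 columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales, e.g. (age in years, km driven)
X = np.array([[1.0, 20000], [3.0, 45000], [5.0, 80000], [7.0, 120000]])
y = np.array([8.0, 6.5, 5.0, 3.8])  # price in lakhs, illustrative

# Scaling inside a pipeline keeps train/test statistics separate
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
print(model.predict([[4.0, 60000]]))
```

Scaling does not change plain least-squares predictions, but it makes coefficients comparable across features and becomes essential once regularized models like Ridge enter the picture.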
🍷 3. Wine Price Prediction
- Dataset: wine.csv
🧪 My Experience:
Model Accuracy (R²): 0.5271
Mean Squared Error: 0.18
Sample Predictions:
- Predicted: 7.54, Actual: 7.39
- Predicted: 7.32, Actual: 7.59
- Predicted: 7.67, Actual: 7.50
- Predicted: 6.98, Actual: 6.26
- Predicted: 5.77, Actual: 6.25
The model performs reasonably well but struggles with subtle differences in wine composition. Could benefit from more domain-specific features or nonlinear modeling.
🏥 4. Insurance Charges Prediction
- Dataset: insurance.csv
🧪 My Experience:
Model Accuracy (R²): 0.7836
Mean Squared Error: 33,596,915.85
Sample Predictions:
- Predicted: 8969.55, Actual: 9095.07
- Predicted: 7068.75, Actual: 5272.18
- Predicted: 36858.41, Actual: 29330.98
- Predicted: 9454.68, Actual: 9301.89
- Predicted: 26973.17, Actual: 33750.29
Most influential features: smoker, BMI, and age. The model demonstrates high predictive power but shows deviation in extreme cases.
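Feature influence like this can be read off the fitted coefficients. A sketch with a tiny made-up frame standing in for insurance.csv (raw coefficients are only roughly comparable unless features are scaled, so treat the ranking as indicative):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny illustrative frame; the real insurance.csv has many more rows/columns
df = pd.DataFrame({
    "age":     [19, 33, 45, 62, 27, 51],
    "bmi":     [27.9, 22.7, 30.1, 26.3, 33.8, 28.9],
    "smoker":  [1, 0, 0, 1, 0, 1],      # already binary-encoded
    "charges": [16884, 4449, 8240, 27808, 3866, 24603],
})

X, y = df[["age", "bmi", "smoker"]], df["charges"]
model = LinearRegression().fit(X, y)

# Rank features by absolute coefficient magnitude
ranking = sorted(zip(X.columns, model.coef_), key=lambda t: abs(t[1]), reverse=True)
for name, coef in ranking:
    print(f"{name}: {coef:.2f}")
```

Even in this toy version, the smoker flag dwarfs the other coefficients, mirroring what the real dataset shows.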
🚖 5. Trip Duration Prediction
- Dataset: train.csv
🧪 My Experience:
Model Accuracy (R²): 0.0227
Mean Squared Error: 23,060,274.73
Sample Predictions:
- Predicted: 791.72, Actual: 473.00
- Predicted: 531.07, Actual: 157.00
- Predicted: 1706.45, Actual: 1862.00
- Predicted: 1403.61, Actual: 1573.00
- Predicted: 1381.69, Actual: 1318.00
Despite a reasonable setup, the model clearly underperformed, likely due to high noise in trip durations, unhandled categorical values, or insufficient feature engineering.
🔄 Common Project Workflow
Each of the five projects follows a similar pipeline:
- Data Loading & Cleaning: Load CSV, remove nulls
- Feature Engineering: Encode categorical features using one-hot or binary encoding
- Train-Test Split: Typically 80/20 split
- Model Training: Use LinearRegression() from scikit-learn
- Evaluation: R² Score and Mean Squared Error (MSE)
- Visualization: Scatter plot of actual vs. predicted values
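Put together, the whole pipeline fits in a few lines. A sketch using a synthetic frame in place of the project CSVs (the visualization step is left out to keep it self-contained):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. Data loading & cleaning (synthetic stand-in for pd.read_csv + dropna)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.uniform(0, 10, 200),
    "category": rng.choice(["a", "b"], 200),
})
df["target"] = 3.0 * df["feature"] + (df["category"] == "b") * 5 + rng.normal(0, 1, 200)
df = df.dropna()

# 2. Feature engineering: one-hot encode the categorical column
X = pd.get_dummies(df.drop(columns="target"), columns=["category"], drop_first=True)
y = df["target"]

# 3. Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4. Model training
model = LinearRegression().fit(X_train, y_train)

# 5. Evaluation
pred = model.predict(X_test)
print("R²: ", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```

Swap the synthetic frame for one of the project CSVs and the rest of the pipeline carries over unchanged.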
▶️ How to Run Each Project
Clone the repo, then run any project with:
python Project_X_name.py
Replace Project_X_name.py with the script name for the specific project.
📦 Installation
Ensure dependencies are installed:
pip install -r requirements.txt
📂 Access the Code
You can find the complete repository here:
📎 https://github.com/Ertugrulmutlu/5-Linear-Regression-Projects/tree/main
🧠 What’s Next?
- Upgrade the models with Ridge or Lasso Regression
- Add interactive dashboards with Streamlit
- Deploy the models via web or API
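Swapping in Ridge is nearly a one-line change: it keeps the linear model but adds an L2 penalty that shrinks the coefficients, which often helps when features are correlated. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy single-feature data (illustrative only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.2, 5.9, 8.1])

# alpha controls the strength of the L2 penalty (alpha=0 is plain OLS)
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)
```

Note the slope comes out smaller than the ordinary least-squares slope (about 1.97 here); that shrinkage is exactly the regularization at work. Lasso works the same way but uses an L1 penalty that can zero out coefficients entirely.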
💬 Feedback?
Tried one of the projects? Found a bug? Want to share your own regression experiments?
Let’s chat in the comments or connect via GitHub!
Thanks for reading — and happy modeling!