likhitha manikonda

How to Check if Linear Regression Works for Your Dataset

Imagine you have a dataset and want to see if linear regression is a good fit. Here’s what you do, step by step:

Step 1: Understand What Linear Regression Does

Linear regression tries to draw a straight line (or plane, if you have more features) that best fits your data.
It predicts a number (like house price) based on input features (like number of rooms, location, etc.).
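Written as a formula, the "straight line (or plane)" looks like the following. The symbols are notation assumed here for illustration: the x values are your input features, the b values are the numbers the model learns, and ŷ is the prediction.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
```

Here b_0 is the intercept and b_1 through b_n are the coefficients, one per feature.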

Step 2: Split Your Data

Divide your data into two parts:
Training set: Used to teach the model.
Testing set: Used to check how well the model predicts new data.
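A single split gives you one score, and on a small dataset that score can shift depending on which rows land in the testing set. One possible way to see how much it varies is k-fold cross-validation, which repeats the split several times. The sketch below is only an illustration: the data is synthetic, and the choice of 5 folds is an assumption, not a rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, for illustration only: 3 features with a linear effect plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Split the data into 5 parts; each part takes one turn as the testing set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R² per fold:", scores.round(3))
print("Mean R²:", scores.mean().round(3))
```

If the per-fold scores are close to each other, a single train/test split is probably telling a similar story.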

Step 3: Train the Model
Use the training set to let the model learn the relationship between features and the target value.

Step 4: Make Predictions
Use the trained model to predict values for the testing set.

Step 5: Check Model Performance
Compare the predicted values to the actual values in the testing set.

Use these simple scores:
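R-squared (R²): Tells you how much of the variation in your target value is explained by the model. Closer to 1 is better. On test data it can even drop below 0, which means the model does worse than always predicting the average.
RMSE (Root Mean Squared Error): The square root of the average squared error, in the same units as your target. Big misses count more heavily. Lower is better.
MAE (Mean Absolute Error): The average absolute difference between predicted and actual values, also in the same units as your target. Lower is better.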
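To make the three scores less abstract, here is a small sketch that computes them by hand with NumPy. The actual and predicted values are made up for illustration; with real data, scikit-learn's metric functions should give matching numbers.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 10.0])

errors = y_true - y_hat  # how far off each prediction is

mae = np.mean(np.abs(errors))         # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # square root of the average squared error

ss_res = np.sum(errors ** 2)                    # squared error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")   # MAE:  0.750
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.791
print(f"R²:   {r2:.3f}")    # R²:   0.875
```

Notice that RMSE is a little higher than MAE here, because the two larger misses (1.0 each) get extra weight when squared.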

Step 6: Visualize the Results
Make a scatter plot of actual vs predicted values.
If the points are close to a straight diagonal line, your model is doing well.

How to Know If Linear Regression Works

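Good fit: R² is high (close to 1), and errors (RMSE, MAE) are low compared to the typical size of your target values. The scatter plot hugs the diagonal line.
Poor fit: R² is low (close to 0, or even negative), and errors are high. The scatter plot looks random, not like a line.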
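"Low" error only means something next to a reference point. One simple reference, sketched below, is a baseline that always predicts the average of the training targets. The data is synthetic and the use of scikit-learn's DummyRegressor as the baseline is a choice made for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only: two features with a linear effect plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Model MAE:    {model_mae:.2f}")
```

If the model's error is not clearly lower than the baseline's, the features are not helping much, whatever the raw numbers look like.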

Simple Checklist
Split your data (training set, testing set).
Train the model.
Predict and compare.
Check R², RMSE, MAE.
Visualize actual vs predicted.
If results look good, linear regression works for your data!

This template covers all the key steps for checking if linear regression works for a dataset:

# 1. Import the necessary libraries.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# 2. Load Your Dataset
df = pd.read_csv('your_dataset.csv')  # Replace with your file name

# 3. Choose Features and Target (Selects which columns to use for prediction)
X = df[['feature1', 'feature2', 'feature3']]  # Replace with your feature columns
y = df['target']  # Replace with your target column

# 4. Split Data into Training and Testing Sets (Splits the data so you can train and test the model)
# X_train, y_train: Data used to train the model (80% of your data).
# X_test, y_test: Data used to test the model (20% of your data).
# test_size=0.2 means 20% for testing, 80% for training.
# random_state=42 ensures the split is reproducible (you get the same split every time).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Train the Linear Regression Model (Trains the model to learn from your data.)
model = LinearRegression()
model.fit(X_train, y_train)

# 6. Make Predictions (Predicts values for the test set.)
y_pred = model.predict(X_test)

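# 7. Evaluate the Model (Evaluates how well the model did.)
# RMSE is the square root of the MSE. The older squared=False argument is
# deprecated in newer scikit-learn versions, so take the root explicitly.
rmse = mean_squared_error(y_test, y_pred) ** 0.5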
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R²): {r2:.4f}")

# 8. Interpret Coefficients and Intercept (Shows what the model learned: coefficients and intercept.)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)

# 9. Visualize Actual vs Predicted Values (Visualizes the results so you can see if predictions match actual values.)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')  # Diagonal line
plt.show()

How to Use the Template Above

Replace 'your_dataset.csv' with your actual file name.
Replace ['feature1', 'feature2', 'feature3'] with the column names you want to use as features.
Replace 'target' with the column name you want to predict.

Tip: If linear regression doesn’t work well, try adding more features, cleaning your data, or using a different algorithm.

1. How to Improve the Model

Add More Features: Use more relevant columns from your dataset that might affect the target value.
Feature Engineering: Create new features from existing ones (e.g., combine two columns, create ratios).
Data Cleaning: Remove or fix missing, incorrect, or outlier values.
Scale/Normalize Data: Put features on similar scales. This does not change the predictions of plain linear regression, but it matters for Ridge and Lasso and makes coefficients easier to compare.
Try Polynomial Regression: If the relationship isn’t straight, try fitting a curve (polynomial features).
Regularization: Use Ridge or Lasso regression to prevent overfitting (when the model memorizes training data).
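Several of these ideas can be combined in one scikit-learn pipeline. The sketch below is a rough illustration on synthetic curved data, comparing plain linear regression against polynomial features plus scaling plus Ridge. The degree (2) and alpha (1.0) are assumed values picked for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data, for illustration only: the target follows a curve, not a line.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 1.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Plain straight-line model.
linear = LinearRegression().fit(X_train, y_train)

# Polynomial features -> scaling -> Ridge, chained so each step is
# fitted on the training data only.
curved = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)  # .score() returns R² for regressors
curved_r2 = curved.score(X_test, y_test)

print(f"Straight line R²: {linear_r2:.3f}")
print(f"Polynomial + Ridge R²: {curved_r2:.3f}")
```

Putting the steps in a pipeline keeps the scaler and polynomial step from ever seeing the testing set during training.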

2. Common Mistakes to Avoid

Not Splitting Data: Always split into training and testing sets.
Using Irrelevant Features: Only use columns that make sense for prediction.
Ignoring Data Quality: Check for missing or weird values.
Overfitting: Don’t use too many features or fit too closely to training data.
Not Checking Assumptions: Linear regression works best when the relationship is roughly straight (linear) and errors are randomly spread.
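One rough way to check whether errors are "randomly spread" is to group the residuals (actual minus predicted) by predicted value and look at the average in each group. The sketch below fits a straight line to synthetic curved data on purpose, so the pattern is easy to see. The five-group split is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only: a curved relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y_curved = 0.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=500)

model = LinearRegression().fit(X, y_curved)
fitted = model.predict(X)
residuals = y_curved - fitted

# Sort rows by predicted value and split them into 5 equal-sized groups.
order = np.argsort(fitted)
bin_means = [residuals[idx].mean() for idx in np.array_split(order, 5)]

for i, mean_residual in enumerate(bin_means, start=1):
    print(f"Group {i} (low to high predictions): mean residual = {mean_residual:.2f}")
```

With a truly linear relationship the group averages hover near zero. Here they are positive at both ends and negative in the middle, a U shape that suggests a straight line is the wrong form for this data.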

3. How to Interpret Coefficients and Intercept
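After training, model.coef_ holds one coefficient per feature and model.intercept_ holds the intercept.
In simple terms:

Positive coefficient = the prediction goes up as the feature increases (with the other features held fixed)
Negative coefficient = the prediction goes down as the feature increases (with the other features held fixed)
Intercept = the prediction when every feature is 0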
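A quick way to build intuition is to train on data where you already know the answer. In this sketch the rule y = 5 + 3 * x1 - 2 * x2 and the feature names x1 and x2 are hypothetical, chosen only for illustration. Because the data has no noise, the model should recover values very close to 3, -2, and 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data built from a known rule:
#   y = 5 + 3 * x1 - 2 * x2
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1]

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)    # approximately [3, -2]
print("Intercept:", model.intercept_)  # approximately 5

# A prediction is just: intercept + sum of (coefficient * feature value).
new_point = np.array([[1.0, 1.0]])
manual = model.intercept_ + new_point[0] @ model.coef_
print("Manual prediction:", manual)
print("model.predict:    ", model.predict(new_point)[0])
```

The positive coefficient on x1 pushes the prediction up by about 3 per unit, and the negative coefficient on x2 pulls it down by about 2 per unit.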

4. How to Visualize Model Predictions
Here’s a Python code example using matplotlib:

import matplotlib.pyplot as plt

# y_test: actual values, y_pred: predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')  # Diagonal line
plt.show()

How to read the plot:

Points close to the red diagonal line mean good predictions.
Points far from the line mean bigger errors.


🚀 You’ve launched through this content like a comet. Don’t stop now, the next galaxy of ideas is just a scroll away! 🌌💫 https://dev.to/codeneuron/logistic-regression-4mlc
