Vikas Gulia

Beyond the Straight Line: A Guide to Polynomial Regression

When we first step into the world of machine learning, Linear Regression is often our first stop. It's simple, intuitive, and incredibly useful for modeling a straight-line relationship between two variables. But what happens when a straight line just doesn't cut it? 🤔

Real-world data is rarely that neat. Sometimes, the relationship between your input and output is curved, like a "U" or an "S" shape. Forcing a straight line through such data will give you a poor model and inaccurate predictions.

This is where Polynomial Regression comes to the rescue! It's a powerful technique that allows us to model non-linear relationships using a linear model. Let's dive in.


Why Simple Linear Regression Sometimes Fails

Simple linear regression tries to find the best straight line that fits our data. The equation is simple:

y = β₀ + β₁x + ε

Where:

  • y is the dependent variable (what we're predicting).
  • x is the independent variable (our input).
  • β₀ is the y-intercept.
  • β₁ is the slope of the line.

This works perfectly when the data points look something like this:

[Image: scatter plot where a straight line fits the data well]

But what if your data looks more like this?

A straight line would clearly miss the mark. This is a classic case where we need to model a curve, not a line.


How Polynomial Regression Creates the Curve 🪄

Polynomial Regression builds on the linear regression model by adding new features that are powers of the original independent variable. Instead of just x, we introduce x², x³, x⁴, and so on.

The "degree" of the polynomial determines how many new features we create. For example, if we choose a degree of 2, our model won't just use the feature x; it will use three features:

  • x⁰ (which is always 1)
  • x¹ (the original feature, x)
  • x² (the squared feature)

The regression equation then becomes:

y = β₀ + β₁x + β₂x² + ε

Even though this equation produces a curved line (a parabola, in this case), it's still considered a linear model. Why? Because the equation is linear in its coefficients (β₀, β₁, β₂). We are still just finding the optimal weights for our features; it's just that the features themselves are now polynomial terms.
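
To make this concrete, here's a minimal sketch (using a tiny made-up input, separate from the dataset in the example below) of what Scikit-Learn's PolynomialFeatures produces for degree 2, one row per sample with the columns x⁰, x¹, x²:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three samples of a single feature
x = np.array([[1.0], [2.0], [3.0]])

# Degree-2 expansion: each row becomes [x^0, x^1, x^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]

The regression itself still just learns one weight per column, which is exactly why the model stays linear in its coefficients.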


Hands-On Example with Python 🐍

Let's see this in action. We'll generate some non-linear data and compare how simple linear regression and polynomial regression perform.

Step 1: Import Libraries and Create Data

First, let's set up our environment and create some sample data that follows a quadratic (x²) pattern.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Create some non-linear data based on a quadratic equation
np.random.seed(0)
X = 2 - 3 * np.random.normal(0, 1, 100)
y = X - 2 * (X ** 2) + np.random.normal(-3, 3, 100)

# Reshape for scikit-learn
X = X[:, np.newaxis]
y = y[:, np.newaxis]

# Plot the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20)
plt.title('Sample Non-Linear Data')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.grid(True)
plt.show()

This code gives us a scatter plot that clearly shows a curved, "U"-shaped relationship.

[Image: scatter plot produced by the code above]

Step 2: Fit a Simple Linear Regression Model (For Comparison)

Let's see what happens when we try to fit a simple straight line to this data.

# Fit Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Visualize the Linear Regression line
plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20)
plt.plot(X, lin_reg.predict(X), color='red', linewidth=2)
plt.title('Simple Linear Regression Fit (Poor)')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.grid(True)
plt.show()

[Image: linear regression fit on the non-linear data]

As expected, the straight red line is a terrible fit for our data points. It fails to capture the underlying trend.

Step 3: Fit a Polynomial Regression Model

Now for the magic. We'll use Scikit-Learn's PolynomialFeatures to transform our X data, and then feed it into the same LinearRegression model.

# 1. Create Polynomial Features (degree=2)
polynomial_features = PolynomialFeatures(degree=2)
X_poly = polynomial_features.fit_transform(X)

# 2. Fit the Linear Regression model on the transformed features
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)

# 3. Visualize the results
# To get a smooth curve, we'll sort the X values before predicting
X_grid = np.arange(X.min(), X.max(), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))
X_poly_grid = polynomial_features.transform(X_grid)
y_poly_pred = poly_reg.predict(X_poly_grid)

plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20, label='Data Points')
plt.plot(X_grid, y_poly_pred, color='green', linewidth=3, label='Polynomial Regression (degree 2)')
plt.title('Polynomial Regression Fit (Excellent!)')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.legend()
plt.grid(True)
plt.show()

[Image: polynomial regression fit on the non-linear data]

Look at that! The green curve from our polynomial model fits the data beautifully. It successfully captures the non-linear trend, which will lead to much more accurate predictions.
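
If you want to put a number on the improvement, one quick check (a sketch that reuses the X, y, lin_reg, poly_reg, and X_poly variables defined in the steps above) is to compare the R² scores of the two models:

from sklearn.metrics import r2_score

# R^2 of the straight-line model vs. the degree-2 model on the training data
print("Linear R^2:    ", r2_score(y, lin_reg.predict(X)))
print("Polynomial R^2:", r2_score(y, poly_reg.predict(X_poly)))

Keep in mind these are training-set scores; for a fair comparison on unseen data you'd evaluate on a held-out set or use cross-validation, which brings us to the next point.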


A Word of Caution: Choosing the Right Degree

While it might be tempting to use a very high degree to fit the data perfectly, this can lead to a problem called overfitting. A model with too high a degree will twist and turn to pass through as many training points as possible, but it will fail miserably on new, unseen data.

  • Underfitting (Low Degree): The model is too simple and doesn't capture the data's trend. (Our simple linear regression example).
  • Good Fit (Just Right Degree): The model captures the underlying trend and generalizes well. (Our degree=2 example).
  • Overfitting (High Degree): The model is too complex and learns the noise in the data, not just the signal.

Finding the right degree is a balancing act, often determined through experimentation and techniques like cross-validation.
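
Here's one way to do that in practice, a minimal sketch that reuses the X and y arrays from the example above (the degree range and the 5-fold split are arbitrary choices). It wraps PolynomialFeatures and LinearRegression in a pipeline and cross-validates each candidate degree:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Score each candidate degree with 5-fold cross-validation (R^2 by default)
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y.ravel(), cv=5)
    print(f"degree={degree}: mean R^2 = {scores.mean():.3f}")

The degree with the best cross-validated score, not the best training fit, is the one you'd want to keep.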

Conclusion

Polynomial Regression is a fantastic tool to have in your machine learning toolkit. It extends the simplicity of linear regression to handle much more complex, non-linear scenarios. By transforming your features, you can fit curves to your data, leading to more robust and accurate models.

So next time you see data that doesn't follow a straight line, remember to look beyond linear and give Polynomial Regression a try! 🚀
