DEV Community

Linear Regression with Scikit-learn (Part 1)

First off let's start with the questions on your mind:

What is Scikit-learn?

Scikit-learn is a Python library for machine learning. It provides many algorithms, such as support vector machines, random forests, and k-nearest neighbors, as well as linear regression, which is what you are going to learn here.

What is Linear Regression?

Linear Regression is a statistical way of modeling the relationship between variables by fitting a straight line to known data. Once the line is fitted, you can use it to predict a value for inputs you haven't seen yet.

There are two types of Linear Regression:

  • Simple Linear Regression
  • Multiple Linear Regression

Simple Linear Regression predicts a variable from one other variable. Multiple Linear Regression is an extension of it, used when we want to predict the value of a variable based on the values of two or more other variables.
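For reference, the two types are usually written as the equations below. This is conventional textbook notation rather than anything specific to this article: y is the value being predicted, the x terms are the input variables, the β terms are the numbers the model learns, and ε is the error the line cannot explain.

```latex
% Simple Linear Regression: one input variable
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple Linear Regression: p input variables
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

In the dataset used later in this article, x is YearsExperience and y is Salary.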

That's enough information for now. We're gonna start coding.

This first article covers Simple Linear Regression; the second part covers Multiple Linear Regression.

We have to install the following libraries using pip. Note that the package is named scikit-learn, even though it is imported as sklearn:

pip install pandas
pip install numpy
pip install scikit-learn

Click here to download the dataset we're gonna use. Then extract the Salary_Data.csv file from it.

You should see a .csv file like this:

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0

The data explanation:
As you can see, there is a column called YearsExperience. This is the feature. In ML, a feature is an individual measurable property or characteristic of a phenomenon being observed.
There is also a column called Salary. This is the label. In ML, a label is the thing we're predicting. It's the y variable in Simple Linear Regression.

Open your code editor and make a new Python file called linear_regression.py, or open a Jupyter Notebook instead.

Importing the needed libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

We use the as keyword to give the imported module an alias to make our code shorter.

Load and view dataset

df = pd.read_csv('Salary_Data.csv')
print(df.head())

OUTPUT:

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
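Before going further, it can help to check that the data loaded the way you expect. The sketch below inlines the five rows shown above so it runs without the file; it assumes the CSV is comma-separated with the two column names from the output, which is a guess about the file's layout. With the real file you would run the same three checks on your own df.

```python
import io
import pandas as pd

# The five rows shown above, inlined so the sketch runs without the file.
# Assumption: the real file is comma-separated with this header row.
csv_text = """YearsExperience,Salary
1.1,39343.0
1.3,46205.0
1.5,37731.0
2.0,43525.0
2.2,39891.0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)               # (5, 2) here; the real file has more rows
print(list(df.columns))       # ['YearsExperience', 'Salary']
print(df.isna().sum().sum())  # 0 -> no missing values in this sample
```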

Feature Extraction

# Select the feature column; scikit-learn expects a 2-D array (n_samples, n_features)
x = df['YearsExperience']
x = x.values.reshape(-1, 1)
# The label can stay one-dimensional
y = df['Salary']
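The reshape(-1, 1) call is there because scikit-learn expects the features as a two-dimensional array with one row per sample and one column per feature, while a single pandas column gives a one-dimensional array. A minimal sketch of what the reshape does, using the first five YearsExperience values from the table above:

```python
import numpy as np

# The first five YearsExperience values from the table above.
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
print(years.shape)  # (5,) -> one-dimensional: 5 values

# -1 tells NumPy to infer the number of rows; 1 fixes a single column.
x = years.reshape(-1, 1)
print(x.shape)      # (5, 1) -> 5 samples, 1 feature
```

Without the reshape, model.fit() would raise an error asking for a 2-D array.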

Making a regression model

model = LinearRegression()
model.fit(x,y)
print(model.score(x,y))

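The last line, print(model.score(x,y)), checks how well your model fits the data. For a regression model, .score() returns the coefficient of determination (R²), not a classification accuracy. It is computed here on the same data the model was trained on, so it tells you how well the line fits that data rather than how well it will do on new data.
Below is the output of the print() statement above.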

OUTPUT

0.9569566641435086

The closer it is to 1, the better the line fits the data.
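For a regression model, the number returned by .score() is R², which compares the model's squared errors with those of always predicting the mean. To make that concrete, here is a plain Python sketch with no libraries. The helper names fit_line and r_squared are made up for this illustration and are not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Points that lie exactly on y = 2x + 1, so the line fits perfectly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(r_squared(xs, ys, slope, intercept))  # 1.0 for a perfect fit

# Only the five rows shown earlier: too few points to show a clear trend,
# so the score comes out close to 0.
years = [1.1, 1.3, 1.5, 2.0, 2.2]
salary = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0]
s2, i2 = fit_line(years, salary)
r2_small = r_squared(years, salary, s2, i2)
print(r2_small)
```

The full dataset shows a much clearer trend than its first five rows, which is why the article's score is about 0.957.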

Making predictions with your model

print(model.predict([[3]]))
print(model.predict([[4]]))
print(model.predict([[5]]))

OUTPUT

[54142.08716303]
[63592.04948449]
[73042.01180594]
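Notice that the three predictions above are evenly spaced, about 9,450 apart. That gap is the slope of the fitted line: each extra year of experience adds the same amount. The sketch below shows where that comes from, using the fitted model's coef_ and intercept_ attributes. The data is a hypothetical toy set, not the salary file, so the numbers will not match the output above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on salary = 30000 + 10000 * years.
x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = 30000 + 10000 * x.ravel()

model = LinearRegression().fit(x, y)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # value predicted at 0 years

# predict() accepts several rows at once: one inner list per sample.
new_x = np.array([[3.0], [4.0], [5.0]])
predictions = model.predict(new_x)

# Each prediction is just intercept + slope * years.
manual = intercept + slope * new_x.ravel()
print(np.allclose(predictions, manual))  # True
```

Passing all the rows in a single predict() call gives the same results as the three separate calls, in one array.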

That's how simple it is. You've just predicted a person's salary from their years of experience.

You can visit Kaggle to find more datasets that you can perform Linear Regression on.

Feel free to ask questions.
GOOD LUCK 👍
