
Posted on Sep 14, 2020

Linear Regression with Scikit-learn (Part 1)

First off let's start with the questions on your mind:

What is Scikit-learn?

Scikit-learn is a Python library for machine learning. It provides many algorithms, such as support vector machines, random forests, and k-nearest neighbors, along with linear regression, which is the one you are going to learn here.

What is Linear Regression?

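Linear Regression is a statistical way of modeling the relationship between variables by fitting a straight line to the data. Once the line is fitted, you can use it to predict a value you haven't observed yet, such as the salary for a given number of years of experience.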

There are two types of Linear Regression:

Simple Linear Regression
Multiple Linear Regression

Multiple Linear Regression is an extension of Simple Linear Regression. It is used when we want to predict the value of a variable based on the values of two or more other variables.
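
In symbols, the two types can be written roughly as below. The names are notation picked for this sketch, not something from the dataset: y is the value being predicted, each x is an input variable, the β values are the numbers the model learns, and ε is the error the line cannot explain. The first line is Simple Linear Regression, the second is Multiple Linear Regression with p input variables.

```latex
y = \beta_0 + \beta_1 x + \varepsilon

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```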

That's enough information for now. We're gonna start coding.

This first article covers Simple Linear Regression; the second part covers Multiple Linear Regression.

We have to install the following libraries using pip:

pip install pandas
pip install numpy
pip install scikit-learn
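
A quick way to confirm the installs worked is to import each library and print its version. This is just a sanity check; the version numbers you see will depend on your machine.

```python
import numpy as np
import pandas as pd
import sklearn  # scikit-learn is imported under the name "sklearn"

# Each library exposes its version as a string.
versions = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(name, version)
```

If any of the imports fails with ModuleNotFoundError, that library did not install into the Python environment you are running.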

Click here to download the dataset we're gonna use. Then extract the Salary_Data.csv file inside it.

You should see a .csv file like this:

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0

The data explanation:
As you can see, there is a column called YearsExperience. This is the feature. In ML, a feature is an individual measurable property or characteristic of the phenomenon being observed. It's the x variable in Simple Linear Regression.
There is also a column called Salary. This is the label. In ML, a label is the thing we're predicting. It's the y variable in Simple Linear Regression.

Open your code editor and make a new Python file called linear_regression.py, or open a Jupyter Notebook instead.

Importing the needed libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

We use the as keyword to give the imported module an alias to make our code shorter.

Load and view dataset

df = pd.read_csv('Salary_Data.csv')
print(df.head())

OUTPUT:

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0

Feature Extraction

x = df['YearsExperience']
x = x.values.reshape(-1, 1)
y = df['Salary']
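
Why the reshape(-1, 1)? Scikit-learn expects the features as a 2-D array with one row per sample and one column per feature, while a single pandas column is 1-D. The sketch below shows the shapes, using only the five preview rows from this article typed in by hand so it runs without the CSV file.

```python
import pandas as pd

# The five preview rows shown in this article (not the full dataset).
df = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Salary": [39343.0, 46205.0, 37731.0, 43525.0, 39891.0],
})

x_1d = df["YearsExperience"].values  # 1-D array: just a list of values
x_2d = x_1d.reshape(-1, 1)           # 2-D array: one row per sample, one column
x_alt = df[["YearsExperience"]]      # double brackets keep a 2-D DataFrame

print(x_1d.shape)   # (5,)
print(x_2d.shape)   # (5, 1)
print(x_alt.shape)  # (5, 1)
```

The -1 tells NumPy to work out the number of rows on its own. Selecting the column with double brackets is another way to get a 2-D input without reshaping.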

Making a regression model

model = LinearRegression()
model.fit(x,y)
print(model.score(x,y))

The last line, print(model.score(x,y)), checks how well your model fits the data. For a regression model, .score() returns the coefficient of determination (R²), not a classification accuracy. Note that here it is computed on the same data the model was trained on.
Below is the output of the print() statement above.

OUTPUT

0.9569566641435086

The closer it is to 1, the better the fitted line explains the data.
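
To see what .score() measures, here is a small sketch that computes R² by hand and compares it with the value from scikit-learn. The numbers are made-up toy values chosen for easy arithmetic; they are not taken from Salary_Data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy values for illustration only (NOT from Salary_Data.csv).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([40000.0, 50000.0, 55000.0, 70000.0, 75000.0])

model = LinearRegression().fit(x, y)
predictions = model.predict(x)

# R² = 1 - (squared error of the model) / (squared error of always guessing the mean)
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand)
print(model.score(x, y))
```

Both lines should print the same value. A score of 1 means the line passes through every point, and a score near 0 means the line does no better than always guessing the average.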

Making predictions with your model

print(model.predict([[3]]))
print(model.predict([[4]]))
print(model.predict([[5]]))

OUTPUT

[54142.08716303]
[63592.04948449]
[73042.01180594]

That's how simple it is. What you've just done is predict the salary of a person from their years of experience.
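
In the output above, each extra year of experience adds about 9,450 to the predicted salary. That constant step is the slope of the fitted line. The sketch below shows where to find the slope and intercept on a fitted model and checks them against the usual least-squares formulas. It uses made-up toy values, not Salary_Data.csv, so the numbers differ from the ones in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy values for illustration only (NOT from Salary_Data.csv).
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 50000.0, 55000.0, 70000.0, 75000.0])

model = LinearRegression().fit(years.reshape(-1, 1), salary)
slope = model.coef_[0]        # change in predicted salary per extra year
intercept = model.intercept_  # predicted salary at zero years

# The same line from the ordinary least-squares formulas.
dx = years - years.mean()
slope_by_hand = np.sum(dx * (salary - salary.mean())) / np.sum(dx ** 2)
intercept_by_hand = salary.mean() - slope_by_hand * years.mean()

# A prediction is just intercept + slope * x.
by_formula = intercept + slope * 6.0
by_model = model.predict([[6.0]])[0]

print(slope, slope_by_hand)
print(intercept, intercept_by_hand)
print(by_formula, by_model)

# predict() also accepts several rows at once.
print(model.predict([[3.0], [4.0], [5.0]]))
```

Running the same two attribute lookups, model.coef_ and model.intercept_, on the model from this article gives you the line behind the salary predictions.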

You can visit Kaggle to find more datasets that you can perform Linear Regression on.

Feel free to ask questions.
GOOD LUCK 👍
