Loading the Required Packages
To proceed with reading the data, performing numerical operations, and visualizing relationships, we need the following libraries:
- pandas – for reading and handling CSV files
- numpy – for working with arrays and numerical transformations
- matplotlib – for plotting and visual exploration of the data
Installation (run once):
pip install pandas numpy matplotlib
Dataset Introduction – The Classic Advertising Dataset
This is the famous Advertising dataset from the book Introduction to Statistical Learning (ISLR).
- All monetary values are in thousands of dollars
- TV – advertising budget spent on television
- Radio – advertising budget spent on radio
- Newspaper – advertising budget spent on newspapers
- Sales – product sales, in thousands of units (the target variable)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Level 2: Loading and Initial Inspection of the Dataset
Loading the Dataset
Reads the CSV file and stores it in a pandas DataFrame called df.
(If your file has a different name or path, adjust the string accordingly.)
df = pd.read_csv("/home/pyrz-tech/Desktop/MachineLearning/advertising.csv")
Preview the First Rows
df.head() displays the first 5 rows of the DataFrame, allowing a quick visual verification of the loaded data.
df.head()
Dataset Dimensions
df.shape returns the number of rows and columns in the dataset as a (rows, columns) tuple.
df.shape
(200, 4)
Column Information
df.info() shows column names, data types, non-null counts, and memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Radio 200 non-null float64
2 Newspaper 200 non-null float64
3 Sales 200 non-null float64
dtypes: float64(4)
memory usage: 6.4 KB
Descriptive Statistics
df.describe() provides summary statistics (count, mean, std, min, quartiles, max) for the numerical columns.
df.describe()
Quick Summary of What We’ve Seen So Far
After running the basic checks, we confirmed:
- Shape: 200 rows × 4 columns
- All feature columns (TV, Radio, Newspaper) and the target (Sales) are of type float64
- No missing values (a quick explicit check is sketched below)
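As a sanity check on the "no missing values" point, we can ask pandas directly for the per-column null counts. This is a minimal sketch and assumes df has already been loaded as shown above.

```python
# count missing values per column; every entry should be 0 for this dataset
print(df.isnull().sum())
```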
Visual Inspection of Individual Feature–Sales Relationships
We now examine the relationship between each advertising channel and Sales using individual scatter plots. The goal is to visually assess:
- Strength of the linear relationship
- Density and spread of the points around a linear trend
- Which feature appears to have the strongest and most compact linear relationship with Sales
Visual Inspection Using Matplotlib’s scatter() Method
We now plot the relationship between each advertising feature and Sales using matplotlib's plt.scatter() directly (no seaborn regplot), so that we can fully control the appearance and clearly see the raw data points.
plt.scatter(df.TV, df.Sales)
plt.scatter(df.Radio, df.Sales)
plt.scatter(df.Newspaper, df.Sales)
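The three calls above give quick raw views. If you prefer each channel in its own labeled figure, a small loop like the following is enough; this is an optional sketch that reuses the same df columns.

```python
# one labeled scatter plot per advertising channel
for feature in ['TV', 'Radio', 'Newspaper']:
    plt.figure()
    plt.scatter(df[feature], df.Sales)
    plt.xlabel(f'{feature} advertising budget (thousands of $)')
    plt.ylabel('Sales')
    plt.title(f'{feature} vs Sales')
    plt.show()
```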
Visual Analysis Summary and Feature Selection for Simple Linear Regression
As observed in the scatter plots above:
- All three advertising channels (TV, Radio, Newspaper) show a positive relationship with Sales.
- The TV advertising budget exhibits the strongest, most densely clustered, and clearest linear relationship with Sales.
- The TV feature shows the tightest spread around a linear trend and the fewest apparent outliers.
Therefore, based on visual inspection and exploratory analysis, we select TV as the single predictor variable for our Simple Linear Regression model.
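If you want a numeric companion to this visual judgment, the pairwise correlations with Sales tell the same story. The sketch below assumes all four columns are numeric, as confirmed by df.info() above.

```python
# linear correlation of each column with Sales, strongest first
print(df.corr()['Sales'].sort_values(ascending=False))
# TV is expected to show the highest correlation with Sales among the three channels
```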
Selected Feature
Feature: TV
Target: Sales
Creating a Clean Subset for Focused Analysis
To work more cleanly and concentrate only on the selected feature (TV) and the target (Sales), we create a new DataFrame called cdf (clean DataFrame) containing just these two columns.
From now on, we will perform all subsequent steps (visualization, modeling, evaluation) using cdf instead of the full df. This keeps our workspace focused and readable.
cdf = df[['TV', 'Sales']]
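One optional refinement: selecting columns with a list of labels generally returns a new DataFrame, but if you plan to modify cdf later, an explicit .copy() makes that intent unambiguous and avoids pandas' SettingWithCopyWarning in edge cases.

```python
# explicit copy, so later changes to cdf never warn about chained assignment
cdf = df[['TV', 'Sales']].copy()
```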
Train-Test Split (Manual Random Split)
We now split the clean dataset (cdf) into training and test sets using a simple random mask.
Approximately 80% of the data will be used for training and the remaining 20% for testing.
This is a common manual approach when we want full control over the splitting process without importing train_test_split from scikit-learn.
Once the cells below have run, the train and test DataFrames are ready for model training and evaluation. Note that np.random.rand draws new random numbers on every run, so without fixing a seed (e.g., with np.random.seed) the exact split, and therefore the train/test sizes shown below, will vary slightly.
msk = np.random.rand(len(cdf)) < 0.8
train = cdf[msk]
test = cdf[~msk]
print(f'msk => {msk[:4]} ...')
print(f'train => {train.head()}')
print('...')
print(f'test => {test.head()} ...')
print('...')
print(f'len(train) => {len(train)}')
print(f'len(test) => {len(test)}')
msk => [ True True True False] ...
train => TV Sales
0 230.1 22.1
1 44.5 10.4
2 17.2 12.0
5 8.7 7.2
6 57.5 11.8
...
test => TV Sales
3 151.5 16.5
4 180.8 17.9
8 8.6 4.8
9 199.8 15.6
10 66.1 12.6 ...
...
len(train) => 156
len(test) => 44
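For reference, the same 80/20 split can be done with scikit-learn's train_test_split, which also lets us fix a random_state for reproducible results. This is an alternative sketch only; the rest of this post keeps using the manual mask split above.

```python
from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed so the result is reproducible
train_sk, test_sk = train_test_split(cdf, test_size=0.2, random_state=42)
print(len(train_sk), len(test_sk))  # 160 40
```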
### Visualizing the Training and Test Sets on the Same Plot
Before training the model, we plot both the training and test data points on the same scatter plot (with different colors) to visually confirm that:
- The split appears random
- Both sets cover the same range of TV and Sales values
- There is no systematic bias in the split
plt.scatter(train.TV, train.Sales, color='blue', label='train')
plt.scatter(test.TV, test.Sales, color='green', label='test')
plt.legend()
#### Converting Training Data to NumPy Arrays
For the scikit-learn LinearRegression model, we need the feature and target variables as NumPy arrays (or array-like objects).
We use np.asanyarray() to convert the pandas columns from the training set into the required format.
train_x = np.asanyarray(train[['TV']])
train_y = np.asanyarray(train[['Sales']])
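scikit-learn expects the feature matrix to be two-dimensional (one column per feature), which is why we index with the double brackets [['TV']]. A quick shape check confirms this; the exact row count depends on the random mask above.

```python
# both arrays should be 2-D with shape (n_train, 1)
print(train_x.shape, train_y.shape)
```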
Fitting the Simple Linear Regression Model
We now import the LinearRegression class from scikit-learn, create a model instance, and train it using the prepared training arrays (train_x and train_y).
After running, the simple linear regression model is fully trained using only the TV advertising budget to predict Sales.
The coefficient tells us how much Sales increases (in thousand units) for every additional thousand dollars spent on TV advertising.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(train_x, train_y)
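To see the learned parameters directly, we can print the slope and intercept; the exact numbers will differ from run to run because the train/test split is random.

```python
# slope: extra Sales (thousands of units) per extra thousand dollars of TV budget
print('coefficient (slope):', reg.coef_[0][0])
print('intercept:', reg.intercept_[0])
```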
Visualizing the Fitted Regression Line
In this step we plot the training data points together with the regression line found by the model. This allows us to visually verify that the fitted line reasonably captures the linear relationship between TV advertising and Sales.
The line is drawn using the learned parameters:
- reg.coef_[0][0] → slope of the line
- reg.intercept_[0] → y-intercept
plt.scatter(train_x, train_y)
plt.plot(train_x, reg.coef_[0][0] * train_x + reg.intercept_[0], '-g')
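An equivalent way to draw the same line is to let the model generate the predictions itself, which avoids referencing coef_ and intercept_ by hand:

```python
# same regression line, drawn from the model's own predictions
plt.scatter(train_x, train_y)
plt.plot(train_x, reg.predict(train_x), '-g')
```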
Preparing Test Data and Making Predictions
We convert the test set to NumPy arrays (required format for scikit-learn) and use the trained model to predict Sales values for the test observations.
test_x = np.asanyarray(test[['TV']])
test_y = np.asanyarray(test[['Sales']])
predict_y = reg.predict(test_x)  # predict() already returns a NumPy array
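Before scoring the model, it can help to eyeball a few predictions next to the true values. This sketch just prints the first five pairs from the test set; the exact numbers depend on the random split.

```python
# first few (actual, predicted) Sales pairs from the test set
for actual, predicted in zip(test_y[:5].ravel(), predict_y[:5].ravel()):
    print(f'actual: {actual:5.1f}   predicted: {predicted:5.1f}')
```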
Evaluating Model Performance with R² Score
We import the r2_score metric from scikit-learn to measure how well our Simple Linear Regression model performs on the test set.
The R² score (coefficient of determination) tells us the proportion of variance in Sales that is explained by the TV advertising budget.
- R² ≈ 1.0 → perfect fit
- R² ≈ 0 → the model explains almost nothing (on unseen test data, R² can even be negative)
from sklearn.metrics import r2_score
Computing and Displaying the R² Score
We use the imported r2_score function to calculate the coefficient of determination on the test data and print the result directly.
This single line gives us the final performance metric: the higher the value (closer to 1.0),
the better our simple linear regression model using only TV advertising explains the variation in Sales.
print(f'r^2 score is : {r2_score(test_y, predict_y)}')
r^2 score is : 0.8674734235783073
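(Because the split is random, the exact score will vary slightly from run to run.) As a final check, R² can also be computed by hand from its definition, one minus the ratio of the residual sum of squares to the total sum of squares; the sketch below should match r2_score up to floating-point precision.

```python
# manual R^2: 1 - SS_res / SS_tot
ss_res = np.sum((test_y - predict_y) ** 2)       # residual sum of squares
ss_tot = np.sum((test_y - test_y.mean()) ** 2)   # total sum of squares
print('manual r^2 :', 1 - ss_res / ss_tot)
```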