Njeri Muriithi

Predicting Employee Salary Using Linear Regression

In this project, I predict an employee's salary from their years of experience using a linear regression model. Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables.
In this case:

  • X (independent variable) represents years of experience
  • Y (dependent variable) represents the salary

Libraries Used

The following Python libraries are used in this project:

  • pandas for handling data frames
  • seaborn and matplotlib for visualization
  • scikit-learn (sklearn) for data preprocessing, model training, and evaluation

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.feature_selection import f_regression
%matplotlib inline  


Data Preparation

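Read the Excel file that contains the dataset into a DataFrame (df).

[Image: first five rows of the Excel dataset]

Then define the X and y variables.

X is stored as a DataFrame because scikit-learn expects the features as a two-dimensional structure, even when there is only one feature.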

X = df[['YearsExperience']]  # features must be a DataFrame or 2D array
y = df['Salary']
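
The difference between the two selections is easy to check. The sketch below uses a small made-up DataFrame as a stand-in for the Excel data (the values are illustrative assumptions, not the real dataset); only the column names come from the code above.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data; these values are made up.
df = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9],
    "Salary": [39000, 43500, 54500, 61000, 81000],
})

X = df[["YearsExperience"]]  # double brackets -> DataFrame, shape (n_rows, 1)
y = df["Salary"]             # single brackets -> Series, shape (n_rows,)

print(type(X).__name__, X.shape)  # DataFrame (5, 1)
print(type(y).__name__, y.shape)  # Series (5,)
```

Passing a one-dimensional Series as X would make scikit-learn raise an error asking for a 2D array, which is why the double brackets matter.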


Splitting the Data set

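The dataset is split into training and testing sets. X and y are split together so that each row of features stays aligned with its salary.
In this project, 25% of the data (test_size=0.25) is used for testing and the remaining 75% is used to train the model. Setting random_state=42 makes the split reproducible.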

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
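
To see what the split produces, here is a minimal sketch on synthetic data. The 20 rows, the noise level, and the slope and intercept used to generate them are assumptions made for illustration; they are not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values, not the real Excel file).
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 20)
df = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9500 * years + rng.normal(0, 3000, size=20),
})
X = df[["YearsExperience"]]
y = df["Salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 25% of 20 rows go to the test set, the rest to the training set.
print(len(X_train), len(X_test))  # 15 5

# Features and targets keep the same row labels after the split.
print(X_train.index.equals(y_train.index))  # True

# The same random_state gives the same split again.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_test.index.equals(X_test_again.index))  # True
```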

Model validation

To check whether a statistically valid relationship exists between years of experience and salary, f_regression is used to calculate:

  • F-value (622.5): measures how much of the variation in the dependent variable is explained by the independent variable, relative to the variation left unexplained.
  • p-value (0.0): indicates whether the relationship is statistically significant.

[Image: f_regression output]

Since the p-value is less than 0.05, we reject the null hypothesis, which states that there is no relationship between salary and years of experience. Together with the large F-value, this indicates a statistically significant relationship between the two variables.

In general, a higher F-value and a smaller p-value indicate a stronger and more significant relationship.
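
The F-test described above can be sketched as follows. The data is synthetic (assumed values), and running the test on the full dataset rather than only the training split is also an assumption, so the numbers will differ from the 622.5 reported for the real data. For a single feature, the F-value can be cross-checked against the Pearson correlation r using F = r² / (1 − r²) × (n − 2).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in data (assumed values, not the real Excel file).
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 20)
df = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9500 * years + rng.normal(0, 3000, size=20),
})
X = df[["YearsExperience"]]
y = df["Salary"]

# f_regression returns one F-value and one p-value per feature.
f_values, p_values = f_regression(X, y)
f_value, p_value = f_values[0], p_values[0]

# Cross-check: with one feature, F follows from the correlation coefficient.
r = np.corrcoef(df["YearsExperience"], df["Salary"])[0, 1]
n = len(df)
f_manual = r**2 / (1 - r**2) * (n - 2)

print(f"F-value: {f_value:.1f}, p-value: {p_value:.3g}")
print("Significant at the 5% level:", p_value < 0.05)
print("Matches the manual formula:", np.isclose(f_value, f_manual))
```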

Prediction

The Linear Regression model is trained by fitting it to the training dataset using model.fit().

model = LinearRegression()
model.fit(X_train, y_train)

The Linear Regression model follows the formula:

Y = b₀ + b₁X
Where:
Y = Dependent Variable (Salary)
b₀ = Intercept (Predicted salary when years of experience = 0)
b₁ = Slope (Rate of change in salary per year of experience)
X = Independent Variable (Years of Experience)

From the trained model, we can obtain:

  1. The intercept: the predicted salary for an employee with 0 years of experience
model.intercept_
  2. The slope (coefficient): the change in predicted salary for each additional year of experience
model.coef_

The trained model is then used to predict salaries on the test dataset using model.predict(X_test).

[Image: predicted salaries]
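
Calling model.predict() simply applies the formula Y = b₀ + b₁X with the fitted intercept and slope. A minimal sketch on synthetic data (assumed values, not the real dataset) shows that the two agree:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values, not the real Excel file).
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 20)
df = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9500 * years + rng.normal(0, 3000, size=20),
})
X = df[["YearsExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

b0 = model.intercept_  # a single float
b1 = model.coef_[0]    # coef_ is an array with one entry per feature

# Apply Y = b0 + b1 * X by hand and compare with model.predict().
manual_pred = b0 + b1 * X_test["YearsExperience"].to_numpy()
y_pred = model.predict(X_test)

print(np.allclose(manual_pred, y_pred))  # True
```

Note that model.coef_ is an array even with a single feature, so the slope itself is model.coef_[0].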

Model Evaluation

The model’s performance is evaluated using the R-squared (R²) metric.
The R² score measures how much of the variation in salary is explained by years of experience.
In this case, the R² score is 0.93, meaning that years of experience explains about 93% of the variation in salary, which indicates a strong model fit.

[Image: R² score output]
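
R² is defined as 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand and with r2_score, assuming the score is calculated on the test set. The data is synthetic (assumed values), so the result will not be the 0.93 obtained on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values, not the real Excel file).
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 20)
df = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9500 * years + rng.normal(0, 3000, size=20),
})
X = df[["YearsExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R² by hand: 1 - (unexplained variation / total variation).
actual = y_test.to_numpy()
ss_res = np.sum((actual - y_pred) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# r2_score takes the true values first, then the predictions.
r2 = r2_score(y_test, y_pred)

print(f"R² (r2_score): {r2:.3f}")
print("Matches the manual formula:", np.isclose(r2, r2_manual))
```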

Plot the Model

To visualize the relationship and the regression line:
[Image: salary vs. years of experience with the regression line]
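
One possible way to draw such a plot is sketched below. It uses matplotlib only and synthetic data (assumed values), so it is not necessarily the code behind the original figure; the axis labels and title are also my own choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values, not the real Excel file).
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 20)
df = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9500 * years + rng.normal(0, 3000, size=20),
})
X = df[["YearsExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Evenly spaced X values so the fitted line spans the whole data range.
x_line = pd.DataFrame({
    "YearsExperience": np.linspace(years.min(), years.max(), 100)
})

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X_test["YearsExperience"], y_test, label="Actual salary (test set)")
ax.plot(x_line["YearsExperience"], model.predict(x_line),
        color="red", label="Regression line")
ax.set_xlabel("Years of experience")
ax.set_ylabel("Salary")
ax.set_title("Salary vs. years of experience")
ax.legend()
```

In a notebook with %matplotlib inline the figure renders automatically; in a plain script, call plt.show() afterwards.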

Conclusion

This project demonstrates that Linear Regression can effectively model the relationship between years of experience and salary.

Thank you for reading!❤️.
