Oluwafunmilola Obisesan

Insurance Cost Prediction using Machine Learning with Python.

Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.

In this project, I developed an end-to-end machine learning model using linear regression.
Data cleaning, extensive data visualization, and exploratory data analysis were also carried out.

Data Description:

The dataset used for this project is an insurance-focused dataset that contains columns such as age, sex, bmi, and region, among others, which were used to determine the cost of each person’s insurance.

Steps

  • Importing the necessary libraries: NumPy, pandas, Matplotlib, seaborn, and scikit-learn were imported.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
%matplotlib inline 

  • Loading in the dataset: The csv was loaded using the code below:

Insurance = pd.read_csv("https://raw.githubusercontent

  • Information about the data: to get some information about the data, such as the data type of each column, we use the code below:

Insurance.info()

  • Checking the statistical description of the data:
Insurance.describe()

  • Checking for the number of rows and columns present in the dataset:
Insurance.shape

Data Cleaning and preparation:

Working with “unclean” data leads to inaccuracy in results, so it’s necessary to carry out data cleaning before any analysis or prediction is done.

  • Checking for null values:

To check for null values in our dataset, we use the code below:

Insurance.isnull().any()

  • Checking for duplicates:
Insurance.duplicated().any()
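
As a rough sketch of what the two checks above look for, here is a made-up toy table (not the real insurance data) with one missing value and one duplicated row. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the insurance dataset
toy = pd.DataFrame({
    "age": [19, 18, 28, 28, 33],
    "bmi": [27.9, 33.8, 33.0, 33.0, np.nan],
    "charges": [16884.9, 1725.6, 4449.5, 4449.5, 21984.5],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Number of rows that exactly repeat an earlier row
duplicate_count = int(toy.duplicated().sum())

# One possible clean-up: drop repeated rows, then rows with missing values
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(null_counts)
print(duplicate_count)
print(cleaned)
```

Using `.sum()` instead of `.any()` tells you how many problem values there are, not just whether any exist.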

Exploratory Data Analysis:

Exploratory data analysis helps in understanding the patterns, trends and metrics in a dataset. It also helps in detecting outliers and anomalous events.

  • Using a correlation matrix to check for correlations among the columns in the dataset:
sns.heatmap(Insurance.corr(numeric_only=True))

The correlation matrix shows there’s little or no correlation between “age” and “charges”.
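
If you prefer numbers to colours, the same correlations can be read off directly. This is only a sketch on a made-up table; the `toy` DataFrame and its values are assumptions, not the real dataset, so the printed figures say nothing about the actual insurance data.

```python
import pandas as pd

# Hypothetical toy data; "charges" is deliberately built to rise with "age"
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [30.0, 22.0, 27.0, 25.0, 31.0],
    "smoker": ["no", "yes", "no", "yes", "no"],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0, 6000.0],
})

# Text columns cannot be correlated, so keep only the numeric ones
corr = toy.select_dtypes(include="number").corr()

# Correlation of every numeric column with "charges", strongest first
charges_corr = corr["charges"].drop("charges").sort_values(ascending=False)
print(charges_corr)
```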

  • Checking for the distribution pattern of the “charges” column
sns.histplot(Insurance['charges'], kde=True)

  • Plotting a pairplot to check the relationship between each pair of columns:
sns.pairplot(Insurance);

Extracting dependent and independent variables:

The dependent variable in this case is “charges”, while the independent variables are the other columns.

X = Insurance.drop(columns = ["charges"])
X.head(5)
y = Insurance["charges"]
y

Splitting the dataset into test and train.

To build a machine learning model, you “train” it on one part of the data and use the remaining part to “test” it.
So we split our data into “train” and “test” sets, using 80 percent to train the model and the other 20 percent to test it.


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 0)
X_train.head()
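
To see what the 80/20 split does, here is a small sketch on made-up arrays. The names `X_demo` and `y_demo` are placeholders for illustration, not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 puts 20 percent of the rows (2 of 10) in the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The same random_state gives the same split on every run
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the results reproducible; leave it out and every run produces a different split.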

One hot encoding to transform categorical text data

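The data contains some columns that hold text, such as sex, smoker and region.
Since we can’t build the model with text data, we need to convert these columns into numbers.
Taking the sex column as an example, female is encoded as 0 and male as 1.
We can do this with one-hot encoding, using the code below. The test set is encoded too and aligned to the training columns, since the model will need it for prediction later.

X_train_ = pd.get_dummies(X_train, columns=["sex", "smoker", "region"], drop_first=True)
X_test_ = pd.get_dummies(X_test, columns=["sex", "smoker", "region"]).reindex(columns=X_train_.columns, fill_value=0)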
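
Here is a small sketch of what one-hot encoding produces, and of one way to make a test set end up with exactly the same columns as the training set. The `train_demo` and `test_demo` tables are made up for illustration, and the approach assumes every category appears in the training data.

```python
import pandas as pd

# Hypothetical training and test rows
train_demo = pd.DataFrame({
    "age": [19, 33, 45],
    "region": ["southwest", "northwest", "southeast"],
})
test_demo = pd.DataFrame({
    "age": [52, 27],
    "region": ["southeast", "northwest"],
})

# drop_first=True drops the first category (alphabetically "northwest"),
# which is then represented by zeros in all remaining region columns
train_enc = pd.get_dummies(train_demo, columns=["region"], drop_first=True, dtype=int)

# Encode the test rows in full, then keep only the training columns;
# categories missing from the test rows are filled with 0
test_enc = pd.get_dummies(test_demo, columns=["region"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc)
print(test_enc)
```

Aligning with `reindex` matters because a linear model expects the test columns to match the training columns in both name and order.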

Building and fitting the model.

Here is the most interesting part of this project. Now that we are done with data cleaning and converting text data to numbers, we can build our model using the lines of code below:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

lm.fit(X_train_,y_train)
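
To get a feel for what fitting does, here is a sketch on synthetic data where the true relationship is known in advance. The data and the names `X_demo`, `y_demo` and `demo_model` are assumptions for illustration; the real insurance data is noisier and will not be recovered this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x1 + 2*x2 + 5, with no noise
X_demo = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]], dtype=float)
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 5

# Fitting estimates one coefficient per feature plus an intercept
demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)       # learned weights for x1 and x2
print(demo_model.intercept_)  # learned constant term

# Predict for a new, unseen row
new_prediction = demo_model.predict(np.array([[6.0, 4.0]]))
print(new_prediction)
```

On the insurance model, `lm.coef_` and `lm.intercept_` can be inspected in the same way after fitting.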

Predicting the “test” set results.

Remember that we trained our model on 80 percent of our data. Now that we’ve built the model, we can use it to predict the outcome for the 20 percent we set aside.
Here’s the code for the prediction using our “test” data.

predictions = lm.predict(X_test_)

Now let’s check how accurate our model is at predicting the “test” set results.

Model evaluation:

To evaluate the accuracy of our model, we’ll use the R2 score.
The R2 score measures the proportion of the variance in the dependent variable (“charges”) that is explained by the model’s predictions.

If the value of the R2 score is 1, the model predicts the test data perfectly; if it’s 0, the model does no better than always predicting the mean of “charges”. The score can even be negative when the model does worse than that.
The closer the R2 score is to 1, the better the model fits the data.

To check our R2 score, we use the code below:

from sklearn.metrics import r2_score
r2_score(y_test, predictions)
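
For the curious, the R2 score can also be worked out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers (`y_true` and `y_pred` are assumptions, not the project’s results) and checks the manual value against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far the predictions are from the truth
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

# A "model" that always predicts the mean scores exactly 0
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_baseline)
```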

Oops
Not a bad model I must say!

View the entire code here:

https://github.com/heyfunmi/Insurance_Cost_Prediction_using_Machine_Learning_with_Python

See you in another project!
Cheers!!

Top comments (1)

NeoSpy

Machine learning can significantly enhance insurance cost prediction. By utilizing Python, data scientists can deploy algorithms that analyze vast datasets, uncover patterns, and predict costs more accurately. This not only optimizes pricing strategies but also personalizes customer experiences, making the process more efficient and data-driven.
