Contents
- Introduction
- Exploratory Data Analysis
- Hypothesis Testing
- Modelling
- Model comparison and performance
- Deployment
- GitHub Repository
1. Introduction
1.1 About Project
Insurance companies need to accurately predict the cost of health insurance for individuals to set premiums appropriately. However, traditional methods of cost prediction often rely on broad actuarial tables and historical averages, which may not account for the nuanced differences among individuals. By leveraging machine learning techniques, insurers can more accurately predict insurance costs tailored to individual profiles, leading to more competitive pricing and better risk management.
1.2 Objectives
The primary need for this project arises from the challenges insurers face in pricing policies accurately while remaining competitive in the market. Inaccurate predictions can lead to losses for insurers and unfairly high premiums for policyholders. By implementing a machine learning model, insurers can:
- Enhance Precision in Pricing: Use individual data points to determine premiums that reflect actual risk more closely than generic estimates.
- Increase Competitiveness: Offer rates that are attractive to consumers while ensuring that the pricing is sustainable for the insurer.
- Improve Customer Satisfaction: Fair and transparent pricing based on personal health data can increase trust and satisfaction among policyholders.
- Enable Personalized Offerings: Create customized insurance packages based on predicted costs, which can cater more directly to the needs and preferences of individuals.
- Improve Risk Assessment: Refine risk assessment processes by identifying the key factors that influence costs most significantly.
1.3 Concepts used
- Data Cleaning and pre-processing
- EDA
- Hypothesis testing (using scipy library for statistical test)
- Predictions using regression models (using sklearn library)
1.4 Data Source
Dataset provided by Scaler
2. Exploratory Data Analysis (EDA)
This is often the most important and time-consuming part of a data science project, since it takes imagination to dig out the important insights while keeping them interpretable. Good EDA helps us, and other stakeholders, understand the data better, which in turn supports better business decisions. It also shows how the target variable behaves across different features, which will guide feature engineering later.
Before moving ahead with EDA we first import some libraries such as pandas, NumPy, and matplotlib, and then read the data, which is provided in .csv format.
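A typical import block for this stage is sketched below; seaborn is an assumption, added because distribution plots and heatmaps appear later in the analysis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # assumed, for the distribution plots and heatmaps
```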
Data Description
The dataset comprises the following 11 attributes:
- Age: Numeric, ranging from 18 to 66 years.
- Diabetes: Binary (0 or 1), where 1 indicates the presence of diabetes.
- BloodPressureProblems: Binary (0 or 1), indicating the presence of blood pressure-related issues.
- AnyTransplants: Binary (0 or 1), where 1 indicates the person has had a transplant.
- AnyChronicDiseases: Binary (0 or 1), indicating the presence of any chronic diseases.
- Height: Numeric, measured in centimeters, ranging from 145 cm to 188 cm.
- Weight: Numeric, measured in kilograms, ranging from 51 kg to 132 kg.
- KnownAllergies: Binary (0 or 1), where 1 indicates known allergies.
- HistoryOfCancerInFamily: Binary (0 or 1), indicating a family history of cancer.
- NumberOfMajorSurgeries: Numeric, counting the number of major surgeries, ranging from 0 to 3 surgeries.
- PremiumPrice: Numeric, representing the premium price in currency, ranging from 15,000 to 40,000.
Reading Dataset
```python
def read_data(url):
    # Read the CSV and rename columns to snake_case for consistency
    df = pd.read_csv(url)
    df.rename(columns={
        'Age': 'age',
        'Diabetes': 'diabetes',
        'BloodPressureProblems': 'blood_pressure_problems',
        'AnyTransplants': 'any_transplants',
        'AnyChronicDiseases': 'any_chronic_diseases',
        'Height': 'height',
        'Weight': 'weight',
        'KnownAllergies': 'known_allergies',
        'HistoryOfCancerInFamily': 'history_of_cancer_in_family',
        'NumberOfMajorSurgeries': 'number_of_major_surgeries',
        'PremiumPrice': 'premium_price'
    }, inplace=True)
    return df

CSV_URL = 'https://drive.google.com/uc?id=1NBk1TFkK4NeKdodR2DxIdBp2Mk1mh4AS'
df = read_data(CSV_URL)
df.head()
```
Dataset Shape
```python
rows, cols = df.shape
print(f'Number of rows : {rows}')
print(f'Number of columns : {cols}')
```
Column Datatypes
Numerical Data Description
Categorical Data Description
Null Value Counts
Number of outliers
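These summaries were presumably produced with standard pandas calls; a sketch (not the author's exact code) follows, with the 1.5 × IQR rule assumed for the outlier counts:

```python
print(df.dtypes)                      # Column Datatypes
print(df.describe())                  # Numerical Data Description
print(df['diabetes'].value_counts())  # Categorical description, per binary column
print(df.isnull().sum())              # Null Value Counts

# Outlier count via the 1.5 * IQR rule (a common convention, assumed here)
q1, q3 = df['premium_price'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['premium_price'] < q1 - 1.5 * iqr) | (df['premium_price'] > q3 + 1.5 * iqr)
print(f'Outliers in premium_price: {mask.sum()}')
```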
Mean Premium by All Disease Interaction Combinations
Target Variable Distribution (Premium Price)
The data is approximately normally distributed, with a slight left skew.
There aren't many outliers in the target variable.

Height, Weight, BMI Distributions
BMI Category counts
A significant share of the records fall into the overweight category, followed by obese.
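The BMI feature and its categories can be derived as below; the standard WHO bin edges are assumed here, since the exact cut-offs aren't stated in this write-up:

```python
import pandas as pd

df['bmi'] = df['weight'] / (df['height'] / 100) ** 2
df['bmi_category'] = pd.cut(
    df['bmi'],
    bins=[0, 18.5, 25, 30, float('inf')],  # standard WHO cut-offs (assumed)
    labels=['underweight', 'normal', 'overweight', 'obese'],
)
print(df['bmi_category'].value_counts())
```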

Correlation Analysis
Pearson
Spearman
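A sketch of how both correlation matrices can be computed and plotted (seaborn assumed):

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_df = df.select_dtypes(include='number')
for method in ('pearson', 'spearman'):
    sns.heatmap(numeric_df.corr(method=method), annot=True, fmt='.2f', cmap='coolwarm')
    plt.title(f'{method.capitalize()} Correlation')
    plt.show()
```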
3. Hypothesis Testing
T-Test
For each binary health condition, two-sample t-tests compare the group means of the numeric variables (age, BMI, premium price, etc.) to check for significant differences (a code sketch follows the first result below).
Diabetes
- H0: There is no difference in means between diabetic and non-diabetic groups
- H1: There is a significant difference in means between diabetic and non-diabetic groups
- Result: Age and number of major surgeries significantly differ between diabetic and non-diabetic patients. Physical measurements like BMI, weight, and height show no significant differences.
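A minimal sketch of one such comparison, assuming df is the frame loaded earlier and using Welch's t-test from scipy (the exact variant used by the author is not stated):

```python
from scipy.stats import ttest_ind

# Compare premium price between diabetic and non-diabetic groups;
# the same pattern repeats for age, BMI, weight, height, etc.
diabetic = df.loc[df['diabetes'] == 1, 'premium_price']
non_diabetic = df.loc[df['diabetes'] == 0, 'premium_price']
t_stat, p_value = ttest_ind(diabetic, non_diabetic, equal_var=False)  # Welch's t-test
print(f't = {t_stat:.2f}, p = {p_value:.4f}')
```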
Blood Pressure Problems
- H0: There is no difference in means between groups with and without blood pressure problems
- H1: There is a significant difference in means between groups with and without blood pressure problems
- Result: Age and number of major surgeries are significantly higher in people with blood pressure problems. Physical stats don't show meaningful differences between the two groups.
Any Transplants
- H0: There is no difference in means between transplant and non-transplant groups
- H1: There is a significant difference in means between transplant and non-transplant groups
- Result: Only premium price differs significantly between groups, while age becomes non-significant. Physical measurements and surgery history show no significant differences.
Chronic Diseases
- H0: There is no difference in means between groups with and without chronic diseases
- H1: There is a significant difference in means between groups with and without chronic diseases
- Result: Premium price is significantly higher for people with chronic diseases, but age doesn't differ significantly. Physical measurements remain unimportant across groups.
Known Allergies
- H0: There is no difference in means between groups with and without known allergies
- H1: There is a significant difference in means between groups with and without known allergies
- Result: Only number of major surgeries differs significantly between allergy groups. Age, premium price, and physical measurements show no meaningful differences.
Cancer
- H0: There is no difference in means between groups with and without family cancer history
- H1: There is a significant difference in means between groups with and without family cancer history
- Result: People with family cancer history have significantly higher premiums and more major surgeries. Age and physical measurements don't differ significantly between groups.
ANOVA
One-way ANOVA is conducted to compare mean values across the levels of each multi-level categorical variable (a code sketch follows the first result below).
Number of Major Surgeries
- H0: Number of major surgeries has no effect on insurance premium prices
- H1: Number of major surgeries significantly affects insurance premium prices
- Result: Age strongly affects insurance prices, but BMI, weight, and height don't matter much. Insurance companies care more about how old you are than your basic body measurements.
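A minimal sketch of the test, assuming df as loaded earlier:

```python
from scipy.stats import f_oneway

# One-way ANOVA: does mean premium differ across surgery counts (0-3)?
groups = [grp['premium_price'].values
          for _, grp in df.groupby('number_of_major_surgeries')]
f_stat, p_value = f_oneway(*groups)
print(f'F = {f_stat:.2f}, p = {p_value:.4f}')
```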
Age Group
- H0: Age group has no effect on insurance premium prices
- H1: Age group significantly affects insurance premium prices
- Result: Age is by far the biggest factor in determining insurance costs. Your physical stats like weight and height don't significantly impact pricing.
Health Score
- H0: Health score has no effect on insurance premium prices
- H1: Health score significantly affects insurance premium prices
- Result: Age still matters for insurance pricing, but less so when health scores are involved. Physical measurements like BMI and weight consistently don't affect prices much.
Chi-Squared Contingency Test
Chi-squared tests of independence are conducted between each pair of binary features (a code sketch follows the first result below).
Diabetes vs other diseases
- H0: There is no association between diabetes and other health conditions/factors
- H1: There is a significant association between diabetes and other health conditions/factors
- Result: Diabetes shows strong associations with blood pressure problems, chronic diseases, allergies, surgeries, age group, and health score. No significant link with transplants or family cancer history.
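A minimal sketch of one pairwise test, assuming df as loaded earlier:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of diabetes vs blood pressure problems
table = pd.crosstab(df['diabetes'], df['blood_pressure_problems'])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4f}')
```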
Blood Pressure
- H0: There is no association between blood pressure problems and other health conditions/factors
- H1: There is a significant association between blood pressure problems and other health conditions/factors
- Result: Blood pressure problems are significantly linked to diabetes, number of surgeries, age group, and health score. No associations found with transplants, chronic diseases, allergies, or family cancer history.
Any Transplants
- H0: There is no association between transplant history and other health conditions/factors
- H1: There is a significant association between transplant history and other health conditions/factors
- Result: Transplants show no significant associations with any other health conditions or factors. This suggests transplant patients are distributed randomly across other health categories.
Chronic Diseases
- H0: There is no association between chronic diseases and other health conditions/factors
- H1: There is a significant association between chronic diseases and other health conditions/factors
- Result: Chronic diseases are significantly associated with diabetes, age group, and health score. No links found with blood pressure, transplants, allergies, or family cancer history.
Known Allergies
- H0: There is no association between known allergies and other health conditions/factors
- H1: There is a significant association between known allergies and other health conditions/factors
- Result: Allergies show significant associations with diabetes, family cancer history, number of surgeries, and health score. No connections with blood pressure, transplants, chronic diseases, or age group.
Cancer
- H0: There is no association between family cancer history and other health conditions/factors
- H1: There is a significant association between family cancer history and other health conditions/factors
- Result: Family cancer history is significantly linked to allergies, number of surgeries, and health score. No associations with diabetes, blood pressure, transplants, chronic diseases, or age group.
Data Preprocessing before modelling
Handling the missing values
A Random Forest-based Iterative Imputer is used to fill in missing values. Unlike simple imputation methods (mean, median, or mode), it predicts each missing value from the other features using a machine learning model, in this case a Random Forest.
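A minimal sketch with scikit-learn, using default hyperparameters since the author's exact settings are not stated:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, exposes IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    max_iter=10,
    random_state=42,
)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```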
Ordinal Encoding
Ordinal encoding is a way to convert categorical features into numerical values, but it’s used specifically for ordinal categories—categories with a clear, meaningful order.
This step is done for the features overall_risk_category and bmi_category
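A minimal sketch; the category labels below are assumptions based on their natural ordering, since the exact labels aren't listed in this write-up:

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[
    ['low', 'medium', 'high'],                         # overall_risk_category (assumed labels)
    ['underweight', 'normal', 'overweight', 'obese'],  # bmi_category
])
cols = ['overall_risk_category', 'bmi_category']
df[cols] = encoder.fit_transform(df[cols])
```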
Standard Scaling
Standard Scaling (also called Standardization) is a technique to rescale numeric features so that they have a mean of 0 and a standard deviation of 1.
This helps many machine learning algorithms work better, especially those that are sensitive to the scale of features.
This step is done for all the numerical features
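A minimal sketch; the column list is an assumption standing in for whatever numeric features exist after feature engineering:

```python
from sklearn.preprocessing import StandardScaler

num_cols = ['age', 'height', 'weight', 'bmi']  # assumed numeric feature list
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```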
Model Development
Regression Problem Framework
Objective: Predict continuous insurance premium values using multiple regression approaches.
Evaluation Metrics:
- RMSE (Root Mean Squared Error): Measures prediction accuracy magnitude
- MAE (Mean Absolute Error): Assesses average absolute prediction error
- R² Score: Quantifies variance explanation capability
Data Preprocessing Pipeline
Missing Value Treatment:
- Method: Random Forest Iterative Imputer
- Advantage: Predicts missing values using machine learning rather than simple statistical measures
- Implementation: Leverages feature relationships for accurate imputation
Feature Encoding:
- Ordinal Encoding: Applied to hierarchical categories (risk levels, BMI categories)
- Rationale: Preserves natural ordering in categorical variables
Feature Scaling:
- Method: Standard Scaling (Standardization)
- Process: Transforms features to mean=0, standard deviation=1
- Benefit: Ensures algorithm performance optimization across different feature scales
Model Architecture
Algorithm Selection:
Five distinct regression approaches were implemented to ensure comprehensive performance comparison:
- Linear Regression: Baseline linear relationship model
- Decision Tree Regressor: Non-linear, interpretable tree-based approach
- Random Forest Regressor: Ensemble method combining multiple decision trees
- Gradient Boosting Regressor: Sequential boosting with error correction
- XGBoost Regressor: Optimized gradient boosting with advanced regularization
Cross-Validation Strategy:
- Method: 5-fold cross-validation with shuffling (a code sketch follows this list)
- Purpose: Ensures robust performance estimation and reduces overfitting risk
- Reproducibility: Fixed random seed for consistent results
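A minimal sketch of the comparison loop under these settings; hyperparameters are left at defaults since the tuned values aren't listed here:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42),
}

X, y = df.drop(columns='premium_price'), df['premium_price']
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # fixed seed for reproducibility

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_root_mean_squared_error')
    print(f'{name}: RMSE = {-scores.mean():.2f} (+/- {scores.std():.2f})')
```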
Performance Evaluation Process
Model Training:
- Feature-target separation
- Cross-validation implementation
- Hyperparameter optimization
- Full dataset retraining
- Performance metric calculation
Statistical Validation:
- Confidence Intervals: 95% confidence intervals for prediction reliability (a bootstrap sketch follows this list)
- Residual Analysis: Error pattern examination
- Bias-Variance Assessment: Model stability evaluation
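One plausible way to derive the 95% intervals reported in the comparison table is a bootstrap over a model's predictions; this is an assumption, as the exact procedure isn't described here:

```python
import numpy as np

def bootstrap_mean_ci(predictions, n_boot=1000, alpha=0.05, seed=42):
    # Resample predictions with replacement and collect the mean of each resample
    rng = np.random.default_rng(seed)
    boot_means = [rng.choice(predictions, size=len(predictions), replace=True).mean()
                  for _ in range(n_boot)]
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```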
Performance Analysis
Comprehensive Model Comparison
| Model | RMSE | MAE | R² | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|
| Linear Regression | 3542.13 | 2419.16 | 0.678 | 23987.81 | 24635.28 |
| Decision Tree | 3889.32 | 1147.06 | 0.612 | 24033.06 | 24808.72 |
| Random Forest | 2858.16 | 1249.14 | 0.791 | 24053.53 | 24743.77 |
| Gradient Boosting | 3109.07 | 1724.89 | 0.752 | 24027.90 | 24719.85 |
| XGBoost | 3039.57 | 1509.54 | 0.763 | 24026.34 | 24730.57 |
Model Performance Rankings
RMSE Performance (Lower is Better):
- Random Forest (2858.16) - Superior accuracy
- XGBoost (3039.57) - Strong performance
- Gradient Boosting (3109.07) - Competitive results
- Linear Regression (3542.13) - Baseline performance
- Decision Tree (3889.32) - Highest error rate
R² Score Performance (Higher is Better):
- Random Forest (0.791) - Explains 79.1% of variance
- XGBoost (0.763) - Explains 76.3% of variance
- Gradient Boosting (0.752) - Explains 75.2% of variance
- Linear Regression (0.678) - Explains 67.8% of variance
- Decision Tree (0.612) - Explains 61.2% of variance
Key Performance Insights
Champion Model: Random Forest
- Accuracy: Achieves lowest RMSE and highest R² score
- Stability: Balanced performance across all metrics
- Generalization: Strong cross-validation performance indicates robust generalization
- Reliability: Tight confidence intervals suggest consistent predictions
Algorithm Analysis:
Ensemble Method Superiority:
Tree-based ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperform individual models, demonstrating the power of combining multiple learners.
Decision Tree Characteristics:
Shows lowest MAE but highest RMSE, indicating good performance on typical cases but poor handling of outliers and extreme values.
Linear Model Performance:
Despite its simplicity, Linear Regression delivers respectable performance, suggesting underlying linear relationships in the data.
Prediction Reliability:
All models demonstrate tight confidence intervals, indicating stable and reliable prediction capabilities across the dataset.
Deployment Strategy
Application Architecture
Technology Stack:
- Frontend: Streamlit for interactive web interface
- Backend: Python with scikit-learn for model inference
- Deployment: Docker containerization for scalable deployment
- Version Control: Git with structured repository organization
Project Structure
```
Insurance-Cost-Prediction/
├── app.py                         # Streamlit application entry point
├── requirements.txt               # Python dependencies
├── README.md                      # Documentation
├── Dockerfile                     # Container configuration
├── .gitignore                     # Version control exclusions
├── tableau/
│   └── insurance_workbook.twb     # Tableau visualization
├── src/
│   ├── __init__.py
│   ├── config.py                  # Configuration management
│   ├── features.py                # Feature engineering
│   ├── model_utils.py             # Model utilities
│   └── preprocessing.py           # Data preprocessing
├── notebooks/
│   └── Insurance_Analysis.ipynb   # Jupyter analysis notebook
└── models/
    └── trained_model.pkl          # Serialized model
```
Application Features
User Interface Components:
- Input Forms: Intuitive data collection interfaces
- Validation: Real-time input validation with error messaging
- Visualization: Interactive charts showing risk factors and premium breakdowns
- Responsive Design: Mobile-optimized interface for accessibility
Prediction Pipeline (a minimal sketch follows this list):
- Data Collection: Streamlit widgets capture user inputs
- Feature Engineering: Automatic BMI and health score calculations
- Preprocessing: Data scaling using trained StandardScaler
- Model Inference: Random Forest generates premium predictions
- Results Presentation: Premium estimates with confidence intervals and risk analysis
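A minimal sketch of the Streamlit flow described above; the model path, feature order, and widget set are assumptions, and the trained StandardScaler step is omitted for brevity:

```python
import pickle
import numpy as np
import streamlit as st

@st.cache_resource
def load_model(path='models/trained_model.pkl'):  # path assumed from the repo layout
    with open(path, 'rb') as f:
        return pickle.load(f)

model = load_model()

st.title('Insurance Premium Predictor')
age = st.number_input('Age', min_value=18, max_value=66, value=30)
height = st.number_input('Height (cm)', min_value=145, max_value=188, value=170)
weight = st.number_input('Weight (kg)', min_value=51, max_value=132, value=70)
diabetes = st.checkbox('Diabetes')

if st.button('Predict premium'):
    bmi = weight / (height / 100) ** 2                          # derived feature
    X = np.array([[age, int(diabetes), height, weight, bmi]])   # assumed feature order
    # NOTE: the real pipeline also applies the trained StandardScaler first
    st.success(f'Estimated premium: {model.predict(X)[0]:,.0f}')
```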
Deployment Options
Local Development:
```bash
git clone https://github.com/mhdSharuk/Insurance-Cost-Prediction.git
cd Insurance-Cost-Prediction
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
Docker Deployment:
```bash
docker build -t insurance-prediction .
docker run -p 8501:8501 insurance-prediction
```
Production Considerations:
- Scalability: Container orchestration for high-traffic scenarios
- Monitoring: Application performance and prediction accuracy tracking
- Security: Input validation and data protection measures
- Maintenance: Model retraining pipelines for performance maintenance
Project Repository
GitHub Repository: [Insurance Cost Prediction](https://github.com/mhdSharuk/Insurance-Cost-Prediction)
Repository Features:
- Complete Codebase: Full implementation with documentation
- Jupyter Notebooks: Detailed analysis and experimentation
- Model Artifacts: Trained models and preprocessing pipelines
- Deployment Scripts: Docker and local deployment configurations
- Documentation: Comprehensive README and inline code documentation