Mohammed Sharuk

Insurance Cost Prediction

Contents

  1. Introduction
  2. Exploratory Data Analysis
  3. Hypothesis Testing
  4. Modelling
  5. Model comparison and performance
  6. Deployment
  7. GitHub Repository

1. Introduction

1.1 About Project

Insurance companies need to predict the cost of health insurance for individuals accurately in order to set premiums appropriately. However, traditional methods of cost prediction often rely on broad actuarial tables and historical averages, which may not account for the nuanced differences among individuals. By leveraging machine learning techniques, insurers can predict insurance costs tailored to individual profiles more accurately, leading to more competitive pricing and better risk management.

1.2 Objectives

The primary need for this project arises from the challenges insurers face in pricing policies accurately while remaining competitive in the market. Inaccurate predictions can lead to losses for insurers and unfairly high premiums for policyholders. By implementing a machine learning model, insurers can:

  • Enhance Precision in Pricing: Use individual data points to determine premiums that reflect actual risk more closely than generic estimates.
  • Increase Competitiveness: Offer rates that are attractive to consumers while ensuring that the pricing is sustainable for the insurer.
  • Improve Customer Satisfaction: Fair and transparent pricing based on personal health data can increase trust and satisfaction among policyholders.
  • Enable Personalized Offerings: Create customized insurance packages based on predicted costs, which can cater more directly to the needs and preferences of individuals.
  • Risk Assessment: Insurers can use the model to refine their risk assessment processes, identifying key factors that influence costs most significantly.

1.3 Concepts used

  • Data Cleaning and pre-processing
  • EDA
  • Hypothesis testing (using the scipy library for statistical tests)
  • Predictions using regression models (using the sklearn library)

1.4 Data Source

Dataset provided by Scaler

2. Exploratory Data Analysis (EDA)

This is often the most important and time-consuming part of any data science project, as you have to dig out important insights while keeping them interpretable. It helps us, and other stakeholders, understand the data better, which in turn supports better business decisions. EDA also shows how the target variable behaves across different features, which will guide feature engineering later.

Before moving ahead with EDA, we first import some libraries such as pandas, numpy, and matplotlib, and then read the data, which is in .csv format.

Data Description

The dataset comprises the following 11 attributes:

  • Age: Numeric, ranging from 18 to 66 years.
  • Diabetes: Binary (0 or 1), where 1 indicates the presence of diabetes.
  • BloodPressureProblems: Binary (0 or 1), indicating the presence of blood pressure-related issues.
  • AnyTransplants: Binary (0 or 1), where 1 indicates the person has had a transplant.
  • AnyChronicDiseases: Binary (0 or 1), indicating the presence of any chronic diseases.
  • Height: Numeric, measured in centimeters, ranging from 145 cm to 188 cm.
  • Weight: Numeric, measured in kilograms, ranging from 51 kg to 132 kg.
  • KnownAllergies: Binary (0 or 1), where 1 indicates known allergies.
  • HistoryOfCancerInFamily: Binary (0 or 1), indicating a family history of cancer.
  • NumberOfMajorSurgeries: Numeric, counting the number of major surgeries, ranging from 0 to 3 surgeries.
  • PremiumPrice: Numeric, representing the premium price in currency, ranging from 15,000 to 40,000.

Reading Dataset

import pandas as pd

def read_data(url):
  """Read the CSV and rename columns to snake_case."""
  df = pd.read_csv(url)
  df.rename(columns={
      'Age'                     : 'age',
      'Diabetes'                : 'diabetes',
      'BloodPressureProblems'   : 'blood_pressure_problems',
      'AnyTransplants'          : 'any_transplants',
      'AnyChronicDiseases'      : 'any_chronic_diseases',
      'Height'                  : 'height',
      'Weight'                  : 'weight',
      'KnownAllergies'          : 'known_allergies',
      'HistoryOfCancerInFamily' : 'history_of_cancer_in_family',
      'NumberOfMajorSurgeries'  : 'number_of_major_surgeries',
      'PremiumPrice'            : 'premium_price'
  }, inplace=True)

  return df

CSV_URL = 'https://drive.google.com/uc?id=1NBk1TFkK4NeKdodR2DxIdBp2Mk1mh4AS'
df = read_data(CSV_URL)

df.head()

Dataset Shape

rows, cols = df.shape
print(f'Number of rows : {rows}')
print(f'Number of columns : {cols}')

Column Datatypes

Numerical Data Description

Categorical Data Description

Null Value Counts

Number of outliers

Mean Premium by All Disease Interaction Combinations

Target Variable Distribution (Premium Price)

The data is approximately normally distributed, though slightly left-skewed.
There are not many outliers in the target variable.

Height, Weight, BMI Distributions

BMI Category counts

A significant share of the records fall in the overweight category, followed by obese.
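The dataset has no BMI column, but it can be derived from height and weight. A minimal sketch of that derivation follows; the bucket boundaries are the standard WHO cutoffs, and the helper name and sample rows are illustrative, not taken from the original notebook:

```python
import pandas as pd

def add_bmi_features(df):
    """Derive BMI (kg/m^2) from height (cm) and weight (kg), then bucket it."""
    df = df.copy()
    df['bmi'] = df['weight'] / (df['height'] / 100) ** 2
    df['bmi_category'] = pd.cut(
        df['bmi'],
        bins=[0, 18.5, 25, 30, float('inf')],
        labels=['underweight', 'normal', 'overweight', 'obese'],
    )
    return df

# Toy rows standing in for the real dataframe
sample = add_bmi_features(pd.DataFrame({'height': [170, 160], 'weight': [85, 60]}))
```

Counting `bmi_category` values on the full dataset then gives the distribution discussed above.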

Correlation Analysis

Pearson

Spearman

3. Hypothesis Testing

T-Test

This test is conducted across the different binary variables (health conditions) against premium prices to check whether there are any significant differences between group means.
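As a sketch, such a two-sample comparison with scipy might look like the following. The group samples here are synthetic stand-ins for slices of the real dataframe (real code would split `premium_price` by the binary flag):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical premium samples for the two groups
diabetic = rng.normal(loc=26000, scale=3000, size=200)
non_diabetic = rng.normal(loc=24000, scale=3000, size=300)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(diabetic, non_diabetic, equal_var=False)
significant = p_value < 0.05  # reject H0 at the 5% level
```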

Diabetes

  • H0: There is no difference in means between diabetic and non-diabetic groups
  • H1: There is a significant difference in means between diabetic and non-diabetic groups
  • Result: Age and number of major surgeries significantly differ between diabetic and non-diabetic patients. Physical measurements like BMI, weight, and height show no significant differences.

Blood Pressure Problems

  • H0: There is no difference in means between groups with and without blood pressure problems
  • H1: There is a significant difference in means between groups with and without blood pressure problems
  • Result: Age and number of major surgeries are significantly higher in people with blood pressure problems. Physical stats don't show meaningful differences between the two groups.

Any Transplants

  • H0: There is no difference in means between transplant and non-transplant groups
  • H1: There is a significant difference in means between transplant and non-transplant groups
  • Result: Only premium price differs significantly between groups, while age becomes non-significant. Physical measurements and surgery history show no significant differences.

Chronic Diseases

  • H0: There is no difference in means between groups with and without chronic diseases
  • H1: There is a significant difference in means between groups with and without chronic diseases
  • Result: Premium price is significantly higher for people with chronic diseases, but age doesn't differ significantly. Physical measurements remain unimportant across groups.

Known Allergies

  • H0: There is no difference in means between groups with and without known allergies
  • H1: There is a significant difference in means between groups with and without known allergies
  • Result: Only number of major surgeries differs significantly between allergy groups. Age, premium price, and physical measurements show no meaningful differences.

Cancer

  • H0: There is no difference in means between groups with and without family cancer history
  • H1: There is a significant difference in means between groups with and without family cancer history
  • Result: People with family cancer history have significantly higher premiums and more major surgeries. Age and physical measurements don't differ significantly between groups.

ANOVA

This test compares mean values across the levels of different categorical variables.
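A one-way ANOVA with scipy might be sketched as follows; the per-group samples are synthetic stand-ins for premium prices grouped by, say, number of major surgeries:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical premium samples for surgery counts 0, 1, 2, 3
groups = [rng.normal(loc=22000 + 2000 * k, scale=3000, size=100) for k in range(4)]

# One-way ANOVA: are all group means equal (H0), or does at least one differ?
f_stat, p_value = stats.f_oneway(*groups)
```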

Number of Major Surgeries

  • H0: Number of major surgeries has no effect on insurance premium prices
  • H1: Number of major surgeries significantly affects insurance premium prices
  • Result: Age strongly affects insurance prices, but BMI, weight, and height don't matter much. Insurance companies care more about how old you are than your basic body measurements.

Age Group

  • H0: Age group has no effect on insurance premium prices
  • H1: Age group significantly affects insurance premium prices
  • Result: Age is by far the biggest factor in determining insurance costs. Your physical stats like weight and height don't significantly impact pricing.

Health Score

  • H0: Health score has no effect on insurance premium prices
  • H1: Health score significantly affects insurance premium prices
  • Result: Age still matters for insurance pricing, but less so when health scores are involved. Physical measurements like BMI and weight consistently don't affect prices much.

Chi-Squared Contingency Test

This test is conducted between each pair of binary features.
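A chi-squared independence test with scipy might be sketched like this; the contingency counts below are hypothetical, not taken from the actual dataset:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table
# rows: no diabetes / diabetes; cols: no BP problems / BP problems
table = np.array([[300, 150],
                  [120, 180]])

# Tests H0: the two binary features are independent
chi2, p_value, dof, expected = chi2_contingency(table)
```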

Diabetes vs other diseases

  • H0: There is no association between diabetes and other health conditions/factors
  • H1: There is a significant association between diabetes and other health conditions/factors
  • Result: Diabetes shows strong associations with blood pressure problems, chronic diseases, allergies, surgeries, age group, and health score. No significant link with transplants or family cancer history.

Blood Pressure

  • H0: There is no association between blood pressure problems and other health conditions/factors
  • H1: There is a significant association between blood pressure problems and other health conditions/factors
  • Result: Blood pressure problems are significantly linked to diabetes, number of surgeries, age group, and health score. No associations found with transplants, chronic diseases, allergies, or family cancer history.

Any Transplants

  • H0: There is no association between transplant history and other health conditions/factors
  • H1: There is a significant association between transplant history and other health conditions/factors
  • Result: Transplants show no significant associations with any other health conditions or factors. This suggests transplant patients are distributed randomly across other health categories.

Chronic Diseases

  • H0: There is no association between chronic diseases and other health conditions/factors
  • H1: There is a significant association between chronic diseases and other health conditions/factors
  • Result: Chronic diseases are significantly associated with diabetes, age group, and health score. No links found with blood pressure, transplants, allergies, or family cancer history.

Known Allergies

  • H0: There is no association between known allergies and other health conditions/factors
  • H1: There is a significant association between known allergies and other health conditions/factors
  • Result: Allergies show significant associations with diabetes, family cancer history, number of surgeries, and health score. No connections with blood pressure, transplants, chronic diseases, or age group.

Cancer

  • H0: There is no association between family cancer history and other health conditions/factors
  • H1: There is a significant association between family cancer history and other health conditions/factors
  • Result: Family cancer history is significantly linked to allergies, number of surgeries, and health score. No associations with diabetes, blood pressure, transplants, chronic diseases, or age group.

Data Preprocessing before modelling

Handling the missing values

The Random Forest Iterative Imputer is used to fill in missing values in the dataset. Unlike simple imputation methods (mean, median, or mode), it predicts each missing value from the other features using a machine learning model, in this case a Random Forest.
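A minimal sketch with scikit-learn follows. Note that `IterativeImputer` is still experimental and needs the explicit enabling import; the toy dataframe is illustrative, and the real code would run on the full feature set:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy frame with one missing weight
toy = pd.DataFrame({
    'age':    [25, 40, 55, 33, 60],
    'height': [170, 165, 180, 172, 168],
    'weight': [70, np.nan, 95, 74, 82],
})

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```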

Ordinal Encoding

Ordinal encoding is a way to convert categorical features into numerical values, used specifically for ordinal categories, i.e. categories with a clear, meaningful order.
This step is done for the features overall_risk_category and bmi_category.
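A minimal sketch with scikit-learn's `OrdinalEncoder`; the category levels shown are assumed for illustration (passing them explicitly preserves their natural order instead of alphabetical order):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'bmi_category': ['normal', 'obese', 'overweight', 'underweight']})

# Explicit category order: underweight < normal < overweight < obese
encoder = OrdinalEncoder(
    categories=[['underweight', 'normal', 'overweight', 'obese']]
)
toy['bmi_category_encoded'] = encoder.fit_transform(toy[['bmi_category']]).ravel()
```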

Standard Scaling

Standard Scaling (also called Standardization) is a technique to rescale numeric features so that they have a mean of 0 and a standard deviation of 1.
This helps many machine learning algorithms work better, especially those that are sensitive to the scale of features.
This step is done for all the numerical features.
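A minimal sketch with scikit-learn's `StandardScaler`, here applied to a single toy column standing in for age:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[18.0], [30.0], [45.0], [66.0]])  # e.g. the age column

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # now mean ~ 0, std ~ 1
```

In practice the scaler is fit on the training data and reused (unchanged) on new inputs at prediction time.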

Model Development

Regression Problem Framework

Objective: Predict continuous insurance premium values using multiple regression approaches.

Evaluation Metrics:

  • RMSE (Root Mean Squared Error): Measures prediction accuracy magnitude
  • MAE (Mean Absolute Error): Assesses average absolute prediction error
  • R² Score: Quantifies variance explanation capability
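The three metrics above can be computed with scikit-learn; the `y_true`/`y_pred` arrays here are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([24000, 26500, 31000, 18500])  # actual premiums (toy)
y_pred = np.array([24500, 26000, 30000, 19500])  # model predictions (toy)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
```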

Data Preprocessing Pipeline

Missing Value Treatment:

  • Method: Random Forest Iterative Imputer
  • Advantage: Predicts missing values using machine learning rather than simple statistical measures
  • Implementation: Leverages feature relationships for accurate imputation

Feature Encoding:

  • Ordinal Encoding: Applied to hierarchical categories (risk levels, BMI categories)
  • Rationale: Preserves natural ordering in categorical variables

Feature Scaling:

  • Method: Standard Scaling (Standardization)
  • Process: Transforms features to mean=0, standard deviation=1
  • Benefit: Ensures algorithm performance optimization across different feature scales

Model Architecture

Algorithm Selection:
Five distinct regression approaches were implemented to ensure comprehensive performance comparison:

  1. Linear Regression: Baseline linear relationship model
  2. Decision Tree Regressor: Non-linear, interpretable tree-based approach
  3. Random Forest Regressor: Ensemble method combining multiple decision trees
  4. Gradient Boosting Regressor: Sequential boosting with error correction
  5. XGBoost Regressor: Optimized gradient boosting with advanced regularization
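Instantiating these candidates might look like the sketch below. XGBoost lives in a separate package, so it is shown commented out; the fixed seed mirrors the reproducibility note that follows:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# from xgboost import XGBRegressor  # fifth model, if the xgboost package is installed

SEED = 42  # fixed seed for reproducible results

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=SEED),
    'Random Forest': RandomForestRegressor(random_state=SEED),
    'Gradient Boosting': GradientBoostingRegressor(random_state=SEED),
    # 'XGBoost': XGBRegressor(random_state=SEED),
}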

Cross-Validation Strategy:

  • Method: 5-fold cross-validation with shuffling
  • Purpose: Ensures robust performance estimation and reduces overfitting risk
  • Reproducibility: Fixed random seed for consistent results
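The cross-validation setup above can be sketched as follows; the synthetic data stands in for the real feature matrix and premium target:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the preprocessed features and premium target
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# 5-fold CV with shuffling and a fixed seed, as described above
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X, y, cv=cv, scoring='neg_root_mean_squared_error',
)
rmse_per_fold = -scores  # sklearn returns negated errors for "higher is better"
```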

Performance Evaluation Process

Model Training:

  1. Feature-target separation
  2. Cross-validation implementation
  3. Hyperparameter optimization
  4. Full dataset retraining
  5. Performance metric calculation

Statistical Validation:

  • Confidence Intervals: 95% confidence intervals for prediction reliability
  • Residual Analysis: Error pattern examination
  • Bias-Variance Assessment: Model stability evaluation
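The article does not show how its confidence intervals were computed; one common approach, sketched here on hypothetical predictions, is a 95% normal-approximation interval for the mean predicted premium:

```python
import numpy as np

rng = np.random.default_rng(42)
predictions = rng.normal(loc=24400, scale=3000, size=500)  # hypothetical predicted premiums

# 95% CI for the mean prediction (normal approximation, z = 1.96)
mean = predictions.mean()
sem = predictions.std(ddof=1) / np.sqrt(len(predictions))
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem
```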

Performance Analysis

Comprehensive Model Comparison

| Model             | RMSE    | MAE     | R²    | CI_low   | CI_high  |
|-------------------|---------|---------|-------|----------|----------|
| Linear Regression | 3542.13 | 2419.16 | 0.678 | 23987.81 | 24635.28 |
| Decision Tree     | 3889.32 | 1147.06 | 0.612 | 24033.06 | 24808.72 |
| Random Forest     | 2858.16 | 1249.14 | 0.791 | 24053.53 | 24743.77 |
| Gradient Boosting | 3109.07 | 1724.89 | 0.752 | 24027.90 | 24719.85 |
| XGBoost           | 3039.57 | 1509.54 | 0.763 | 24026.34 | 24730.57 |

Model Performance Rankings

RMSE Performance (Lower is Better):

  1. Random Forest (2858.16) - Superior accuracy
  2. XGBoost (3039.57) - Strong performance
  3. Gradient Boosting (3109.07) - Competitive results
  4. Linear Regression (3542.13) - Baseline performance
  5. Decision Tree (3889.32) - Highest error rate

R² Score Performance (Higher is Better):

  1. Random Forest (0.791) - Explains 79.1% of variance
  2. XGBoost (0.763) - Explains 76.3% of variance
  3. Gradient Boosting (0.752) - Explains 75.2% of variance
  4. Linear Regression (0.678) - Explains 67.8% of variance
  5. Decision Tree (0.612) - Explains 61.2% of variance

Key Performance Insights

Champion Model: Random Forest

  • Accuracy: Achieves lowest RMSE and highest R² score
  • Stability: Balanced performance across all metrics
  • Generalization: Strong cross-validation performance indicates robust generalization
  • Reliability: Tight confidence intervals suggest consistent predictions

Algorithm Analysis:

Ensemble Method Superiority:
Tree-based ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperform individual models, demonstrating the power of combining multiple learners.

Decision Tree Characteristics:
Shows lowest MAE but highest RMSE, indicating good performance on typical cases but poor handling of outliers and extreme values.

Linear Model Performance:
Despite its simplicity, Linear Regression delivers respectable performance, suggesting underlying linear relationships in the data.

Prediction Reliability:
All models demonstrate tight confidence intervals, indicating stable and reliable prediction capabilities across the dataset.


Deployment Strategy

Application Architecture

Technology Stack:

  • Frontend: Streamlit for interactive web interface
  • Backend: Python with scikit-learn for model inference
  • Deployment: Docker containerization for scalable deployment
  • Version Control: Git with structured repository organization

Project Structure

Insurance-Cost-Prediction/
├── app.py                          # Streamlit application entry point
├── requirements.txt                # Python dependencies
├── README.md                       # Documentation
├── Dockerfile                      # Container configuration
├── .gitignore                      # Version control exclusions
├── tableau/
│   └── insurance_workbook.twb      # Tableau visualization
├── src/
│   ├── __init__.py
│   ├── config.py                   # Configuration management
│   ├── features.py                 # Feature engineering
│   ├── model_utils.py              # Model utilities
│   └── preprocessing.py            # Data preprocessing
├── notebooks/
│   └── Insurance_Analysis.ipynb    # Jupyter analysis notebook
└── models/
    └── trained_model.pkl           # Serialized model

Application Features

User Interface Components:

  • Input Forms: Intuitive data collection interfaces
  • Validation: Real-time input validation with error messaging
  • Visualization: Interactive charts showing risk factors and premium breakdowns
  • Responsive Design: Mobile-optimized interface for accessibility

Prediction Pipeline:

  1. Data Collection: Streamlit widgets capture user inputs
  2. Feature Engineering: Automatic BMI and health score calculations
  3. Preprocessing: Data scaling using trained StandardScaler
  4. Model Inference: Random Forest generates premium predictions
  5. Results Presentation: Premium estimates with confidence intervals and risk analysis

Deployment Options

Local Development:

git clone https://github.com/mhdSharuk/Insurance-Cost-Prediction.git
cd Insurance-Cost-Prediction
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
streamlit run app.py

Docker Deployment:

docker build -t insurance-prediction .
docker run -p 8501:8501 insurance-prediction

Production Considerations:

  • Scalability: Container orchestration for high-traffic scenarios
  • Monitoring: Application performance and prediction accuracy tracking
  • Security: Input validation and data protection measures
  • Maintenance: Model retraining pipelines for performance maintenance

Project Repository

GitHub Repository: Insurance Cost Prediction (https://github.com/mhdSharuk/Insurance-Cost-Prediction)

Repository Features:

  • Complete Codebase: Full implementation with documentation
  • Jupyter Notebooks: Detailed analysis and experimentation
  • Model Artifacts: Trained models and preprocessing pipelines
  • Deployment Scripts: Docker and local deployment configurations
  • Documentation: Comprehensive README and inline code documentation
