Contents
- Introduction
- Exploratory Data Analysis
- Hypothesis Testing
- Modelling
- Model comparison and performance
- Deployment
- GitHub Repository
1. Introduction
1.1 About Project
Insurance companies need to accurately predict the cost of health insurance for individuals to set premiums appropriately. However, traditional methods of cost prediction often rely on broad actuarial tables and historical averages, which may not account for the nuanced differences among individuals. By leveraging machine learning techniques, insurers can more accurately predict insurance costs tailored to individual profiles, leading to more competitive pricing and better risk management.
1.2 Objectives
The primary need for this project arises from the challenges insurers face in pricing policies accurately while remaining competitive in the market. Inaccurate predictions can lead to losses for insurers and unfairly high premiums for policyholders. By implementing a machine learning model, insurers can:
- Enhance Precision in Pricing: Use individual data points to determine premiums that reflect actual risk more closely than generic estimates.
- Increase Competitiveness: Offer rates that are attractive to consumers while ensuring that the pricing is sustainable for the insurer.
- Improve Customer Satisfaction: Fair and transparent pricing based on personal health data can increase trust and satisfaction among policyholders.
- Enable Personalized Offerings: Create customized insurance packages based on predicted costs, which can cater more directly to the needs and preferences of individuals.
- Improve Risk Assessment: Refine risk assessment processes by identifying the key factors that influence costs most significantly.
1.3 Concepts used
- Data Cleaning and pre-processing
- EDA
- Hypothesis testing (using scipy library for statistical test)
- Predictions using regression models (using sklearn library)
1.4 Data Source
Dataset provided by Scaler
2. Exploratory Data Analysis (EDA)
This is often the most important and time-consuming part of a data science project, since it takes imagination to dig out the important insights while keeping them interpretable. Good EDA helps us, and other stakeholders, understand the data better, which in turn supports better business decisions. It also shows how the target variable behaves across different features, which will guide feature engineering later.
Before moving ahead with EDA we first import some libraries such as pandas, NumPy, and matplotlib, and then read the data, which is provided in .csv format.
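A typical import block for this stage is sketched below; seaborn is an assumption, added because distribution plots and heatmaps appear later in the analysis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # assumed, for the distribution plots and heatmaps
```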
Data Description
The dataset comprises the following 11 attributes:
- Age: Numeric, ranging from 18 to 66 years.
- Diabetes: Binary (0 or 1), where 1 indicates the presence of diabetes.
- BloodPressureProblems: Binary (0 or 1), indicating the presence of blood pressure-related issues.
- AnyTransplants: Binary (0 or 1), where 1 indicates the person has had a transplant.
- AnyChronicDiseases: Binary (0 or 1), indicating the presence of any chronic diseases.
- Height: Numeric, measured in centimeters, ranging from 145 cm to 188 cm.
- Weight: Numeric, measured in kilograms, ranging from 51 kg to 132 kg.
- KnownAllergies: Binary (0 or 1), where 1 indicates known allergies.
- HistoryOfCancerInFamily: Binary (0 or 1), indicating a family history of cancer.
- NumberOfMajorSurgeries: Numeric, counting the number of major surgeries, ranging from 0 to 3 surgeries.
- PremiumPrice: Numeric, representing the premium price in currency, ranging from 15,000 to 40,000.
Reading Dataset
```python
def read_data(url):
    # Read the CSV and rename columns to snake_case for consistency
    df = pd.read_csv(url)
    df.rename(columns={
        'Age': 'age',
        'Diabetes': 'diabetes',
        'BloodPressureProblems': 'blood_pressure_problems',
        'AnyTransplants': 'any_transplants',
        'AnyChronicDiseases': 'any_chronic_diseases',
        'Height': 'height',
        'Weight': 'weight',
        'KnownAllergies': 'known_allergies',
        'HistoryOfCancerInFamily': 'history_of_cancer_in_family',
        'NumberOfMajorSurgeries': 'number_of_major_surgeries',
        'PremiumPrice': 'premium_price'
    }, inplace=True)
    return df

CSV_URL = 'https://drive.google.com/uc?id=1NBk1TFkK4NeKdodR2DxIdBp2Mk1mh4AS'
df = read_data(CSV_URL)
df.head()
```
Dataset Shape
```python
rows, cols = df.shape
print(f'Number of rows : {rows}')
print(f'Number of columns : {cols}')
```
Column Datatypes
Numerical Data Description
Categorical Data Description
Null Value Counts
Number of outliers
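These summaries were presumably produced with standard pandas calls; a sketch (not the author's exact code) follows, with the 1.5 × IQR rule assumed for the outlier counts:

```python
print(df.dtypes)                      # Column Datatypes
print(df.describe())                  # Numerical Data Description
print(df['diabetes'].value_counts())  # Categorical description, per binary column
print(df.isnull().sum())              # Null Value Counts

# Outlier count via the 1.5 * IQR rule (a common convention, assumed here)
q1, q3 = df['premium_price'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['premium_price'] < q1 - 1.5 * iqr) | (df['premium_price'] > q3 + 1.5 * iqr)
print(f'Outliers in premium_price: {mask.sum()}')
```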
Mean Premium by All Disease Interaction Combinations
Target Variable Distribution (Premium Price)
The data is approximately normally distributed, with a slight left skew.
There aren't many outliers in the target variable.

Height, Weight, BMI Distributions
BMI Category counts
A significant share of the records fall into the overweight category, followed by obese.
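The BMI feature and its categories can be derived as below; the standard WHO bin edges are assumed here, since the exact cut-offs aren't stated in this write-up:

```python
import pandas as pd

df['bmi'] = df['weight'] / (df['height'] / 100) ** 2
df['bmi_category'] = pd.cut(
    df['bmi'],
    bins=[0, 18.5, 25, 30, float('inf')],  # standard WHO cut-offs (assumed)
    labels=['underweight', 'normal', 'overweight', 'obese'],
)
print(df['bmi_category'].value_counts())
```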

Correlation Analysis
Pearson
Spearman
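A sketch of how both correlation matrices can be computed and plotted (seaborn assumed):

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_df = df.select_dtypes(include='number')
for method in ('pearson', 'spearman'):
    sns.heatmap(numeric_df.corr(method=method), annot=True, fmt='.2f', cmap='coolwarm')
    plt.title(f'{method.capitalize()} Correlation')
    plt.show()
```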
3. Hypothesis Testing
T-Test
For each binary health condition, two-sample t-tests compare the group means of the numeric variables (age, BMI, premium price, etc.) to check for significant differences (a code sketch follows the first result below).
Diabetes
- H0: There is no difference in means between diabetic and non-diabetic groups
- H1: There is a significant difference in means between diabetic and non-diabetic groups
- Result: Age and number of major surgeries significantly differ between diabetic and non-diabetic patients. Physical measurements like BMI, weight, and height show no significant differences.
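A minimal sketch of one such comparison, assuming df is the frame loaded earlier and using Welch's t-test from scipy (the exact variant used by the author is not stated):

```python
from scipy.stats import ttest_ind

# Compare premium price between diabetic and non-diabetic groups;
# the same pattern repeats for age, BMI, weight, height, etc.
diabetic = df.loc[df['diabetes'] == 1, 'premium_price']
non_diabetic = df.loc[df['diabetes'] == 0, 'premium_price']
t_stat, p_value = ttest_ind(diabetic, non_diabetic, equal_var=False)  # Welch's t-test
print(f't = {t_stat:.2f}, p = {p_value:.4f}')
```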
Blood Pressure Problems
- H0: There is no difference in means between groups with and without blood pressure problems
- H1: There is a significant difference in means between groups with and without blood pressure problems
- Result: Age and number of major surgeries are significantly higher in people with blood pressure problems. Physical stats don't show meaningful differences between the two groups.
Any Transplants
- H0: There is no difference in means between transplant and non-transplant groups
- H1: There is a significant difference in means between transplant and non-transplant groups
- Result: Only premium price differs significantly between groups, while age becomes non-significant. Physical measurements and surgery history show no significant differences.
Chronic Diseases
- H0: There is no difference in means between groups with and without chronic diseases
- H1: There is a significant difference in means between groups with and without chronic diseases
- Result: Premium price is significantly higher for people with chronic diseases, but age doesn't differ significantly. Physical measurements remain unimportant across groups.
Known Allergies
- H0: There is no difference in means between groups with and without known allergies
- H1: There is a significant difference in means between groups with and without known allergies
- Result: Only number of major surgeries differs significantly between allergy groups. Age, premium price, and physical measurements show no meaningful differences.
Cancer
- H0: There is no difference in means between groups with and without family cancer history
- H1: There is a significant difference in means between groups with and without family cancer history
- Result: People with family cancer history have significantly higher premiums and more major surgeries. Age and physical measurements don't differ significantly between groups.
ANOVA
One-way ANOVA is conducted to compare mean values across the levels of each multi-level categorical variable (a code sketch follows the first result below).
Number of Major Surgeries
- H0: Number of major surgeries has no effect on insurance premium prices
- H1: Number of major surgeries significantly affects insurance premium prices
- Result: Age strongly affects insurance prices, but BMI, weight, and height don't matter much. Insurance companies care more about how old you are than your basic body measurements.
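A minimal sketch of the test, assuming df as loaded earlier:

```python
from scipy.stats import f_oneway

# One-way ANOVA: does mean premium differ across surgery counts (0-3)?
groups = [grp['premium_price'].values
          for _, grp in df.groupby('number_of_major_surgeries')]
f_stat, p_value = f_oneway(*groups)
print(f'F = {f_stat:.2f}, p = {p_value:.4f}')
```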
Age Group
- H0: Age group has no effect on insurance premium prices
- H1: Age group significantly affects insurance premium prices
- Result: Age is by far the biggest factor in determining insurance costs. Your physical stats like weight and height don't significantly impact pricing.
Health Score
- H0: Health score has no effect on insurance premium prices
- H1: Health score significantly affects insurance premium prices
- Result: Age still matters for insurance pricing, but less so when health scores are involved. Physical measurements like BMI and weight consistently don't affect prices much.
Chi-Squared Contingency Test
Chi-squared tests of independence are conducted between each pair of binary features (a code sketch follows the first result below).
Diabetes vs other diseases
- H0: There is no association between diabetes and other health conditions/factors
- H1: There is a significant association between diabetes and other health conditions/factors
- Result: Diabetes shows strong associations with blood pressure problems, chronic diseases, allergies, surgeries, age group, and health score. No significant link with transplants or family cancer history.
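A minimal sketch of one pairwise test, assuming df as loaded earlier:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of diabetes vs blood pressure problems
table = pd.crosstab(df['diabetes'], df['blood_pressure_problems'])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4f}')
```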
Blood Pressure
- H0: There is no association between blood pressure problems and other health conditions/factors
- H1: There is a significant association between blood pressure problems and other health conditions/factors
- Result: Blood pressure problems are significantly linked to diabetes, number of surgeries, age group, and health score. No associations found with transplants, chronic diseases, allergies, or family cancer history.
Any Transplants
- H0: There is no association between transplant history and other health conditions/factors
- H1: There is a significant association between transplant history and other health conditions/factors
- Result: Transplants show no significant associations with any other health conditions or factors. This suggests transplant patients are distributed randomly across other health categories.
Chronic Diseases
- H0: There is no association between chronic diseases and other health conditions/factors
- H1: There is a significant association between chronic diseases and other health conditions/factors
- Result: Chronic diseases are significantly associated with diabetes, age group, and health score. No links found with blood pressure, transplants, allergies, or family cancer history.
Known Allergies
- H0: There is no association between known allergies and other health conditions/factors
- H1: There is a significant association between known allergies and other health conditions/factors
- Result: Allergies show significant associations with diabetes, family cancer history, number of surgeries, and health score. No connections with blood pressure, transplants, chronic diseases, or age group.
Cancer
- H0: There is no association between family cancer history and other health conditions/factors
- H1: There is a significant association between family cancer history and other health conditions/factors
- Result: Family cancer history is significantly linked to allergies, number of surgeries, and health score. No associations with diabetes, blood pressure, transplants, chronic diseases, or age group.
Data Preprocessing before modelling
Handling the missing values
A Random Forest-based Iterative Imputer is used to fill in missing values. Unlike simple imputation methods (mean, median, or mode), it predicts each missing value from the other features using a machine learning model, in this case a Random Forest.
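A minimal sketch with scikit-learn, using default hyperparameters since the author's exact settings are not stated:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, exposes IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    max_iter=10,
    random_state=42,
)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```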
Ordinal Encoding
Ordinal encoding is a way to convert categorical features into numerical values, but it’s used specifically for ordinal categories—categories with a clear, meaningful order.
This step is done for the features overall_risk_category and bmi_category
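A minimal sketch; the category labels below are assumptions based on their natural ordering, since the exact labels aren't listed in this write-up:

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[
    ['low', 'medium', 'high'],                         # overall_risk_category (assumed labels)
    ['underweight', 'normal', 'overweight', 'obese'],  # bmi_category
])
cols = ['overall_risk_category', 'bmi_category']
df[cols] = encoder.fit_transform(df[cols])
```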
Standard Scaling
Standard Scaling (also called Standardization) is a technique to rescale numeric features so that they have a mean of 0 and a standard deviation of 1.
This helps many machine learning algorithms work better, especially those that are sensitive to the scale of features.
This step is done for all the numerical features
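A minimal sketch; the column list is an assumption standing in for whatever numeric features exist after feature engineering:

```python
from sklearn.preprocessing import StandardScaler

num_cols = ['age', 'height', 'weight', 'bmi']  # assumed numeric feature list
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```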
Model Development
Regression Problem Framework
Objective: Predict continuous insurance premium values using multiple regression approaches.
Evaluation Metrics:
- RMSE (Root Mean Squared Error): Measures prediction accuracy magnitude
- MAE (Mean Absolute Error): Assesses average absolute prediction error
- R² Score: Quantifies variance explanation capability
Data Preprocessing Pipeline
Missing Value Treatment:
- Method: Random Forest Iterative Imputer
- Advantage: Predicts missing values using machine learning rather than simple statistical measures
- Implementation: Leverages feature relationships for accurate imputation
Feature Encoding:
- Ordinal Encoding: Applied to hierarchical categories (risk levels, BMI categories)
- Rationale: Preserves natural ordering in categorical variables
Feature Scaling:
- Method: Standard Scaling (Standardization)
- Process: Transforms features to mean=0, standard deviation=1
- Benefit: Ensures algorithm performance optimization across different feature scales
Model Architecture
Algorithm Selection:
Five distinct regression approaches were implemented to ensure comprehensive performance comparison:
- Linear Regression: Baseline linear relationship model
- Decision Tree Regressor: Non-linear, interpretable tree-based approach
- Random Forest Regressor: Ensemble method combining multiple decision trees
- Gradient Boosting Regressor: Sequential boosting with error correction
- XGBoost Regressor: Optimized gradient boosting with advanced regularization
Cross-Validation Strategy:
- Method: 5-fold cross-validation with shuffling (a code sketch follows this list)
- Purpose: Ensures robust performance estimation and reduces overfitting risk
- Reproducibility: Fixed random seed for consistent results
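A minimal sketch of the comparison loop under these settings; hyperparameters are left at defaults since the tuned values aren't listed here:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42),
}

X, y = df.drop(columns='premium_price'), df['premium_price']
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # fixed seed for reproducibility

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_root_mean_squared_error')
    print(f'{name}: RMSE = {-scores.mean():.2f} (+/- {scores.std():.2f})')
```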
Performance Evaluation Process
Model Training:
- Feature-target separation
- Cross-validation implementation
- Hyperparameter optimization
- Full dataset retraining
- Performance metric calculation
Statistical Validation:
- Confidence Intervals: 95% confidence intervals for prediction reliability (a bootstrap sketch follows this list)
- Residual Analysis: Error pattern examination
- Bias-Variance Assessment: Model stability evaluation
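One plausible way to derive the 95% intervals reported in the comparison table is a bootstrap over a model's predictions; this is an assumption, as the exact procedure isn't described here:

```python
import numpy as np

def bootstrap_mean_ci(predictions, n_boot=1000, alpha=0.05, seed=42):
    # Resample predictions with replacement and collect the mean of each resample
    rng = np.random.default_rng(seed)
    boot_means = [rng.choice(predictions, size=len(predictions), replace=True).mean()
                  for _ in range(n_boot)]
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```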
Performance Analysis
Comprehensive Model Comparison
| Model | RMSE | MAE | R² | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|
| Linear Regression | 3542.13 | 2419.16 | 0.678 | 23987.81 | 24635.28 |
| Decision Tree | 3889.32 | 1147.06 | 0.612 | 24033.06 | 24808.72 |
| Random Forest | 2858.16 | 1249.14 | 0.791 | 24053.53 | 24743.77 |
| Gradient Boosting | 3109.07 | 1724.89 | 0.752 | 24027.90 | 24719.85 |
| XGBoost | 3039.57 | 1509.54 | 0.763 | 24026.34 | 24730.57 |
Model Performance Rankings
RMSE Performance (Lower is Better):
- Random Forest (2858.16) - Superior accuracy
- XGBoost (3039.57) - Strong performance
- Gradient Boosting (3109.07) - Competitive results
- Linear Regression (3542.13) - Baseline performance
- Decision Tree (3889.32) - Highest error rate
R² Score Performance (Higher is Better):
- Random Forest (0.791) - Explains 79.1% of variance
- XGBoost (0.763) - Explains 76.3% of variance
- Gradient Boosting (0.752) - Explains 75.2% of variance
- Linear Regression (0.678) - Explains 67.8% of variance
- Decision Tree (0.612) - Explains 61.2% of variance
Key Performance Insights
Champion Model: Random Forest
- Accuracy: Achieves lowest RMSE and highest R² score
- Stability: Balanced performance across all metrics
- Generalization: Strong cross-validation performance indicates robust generalization
- Reliability: Tight confidence intervals suggest consistent predictions
Algorithm Analysis:
Ensemble Method Superiority:
Tree-based ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperform individual models, demonstrating the power of combining multiple learners.
Decision Tree Characteristics:
Shows lowest MAE but highest RMSE, indicating good performance on typical cases but poor handling of outliers and extreme values.
Linear Model Performance:
Despite its simplicity, Linear Regression delivers respectable performance, suggesting underlying linear relationships in the data.
Prediction Reliability:
All models demonstrate tight confidence intervals, indicating stable and reliable prediction capabilities across the dataset.
Deployment Strategy
Application Architecture
Technology Stack:
- Frontend: Streamlit for interactive web interface
- Backend: Python with scikit-learn for model inference
- Deployment: Docker containerization for scalable deployment
- Version Control: Git with structured repository organization
Project Structure
```
Insurance-Cost-Prediction/
├── app.py                         # Streamlit application entry point
├── requirements.txt               # Python dependencies
├── README.md                      # Documentation
├── Dockerfile                     # Container configuration
├── .gitignore                     # Version control exclusions
├── tableau/
│   └── insurance_workbook.twb     # Tableau visualization
├── src/
│   ├── __init__.py
│   ├── config.py                  # Configuration management
│   ├── features.py                # Feature engineering
│   ├── model_utils.py             # Model utilities
│   └── preprocessing.py           # Data preprocessing
├── notebooks/
│   └── Insurance_Analysis.ipynb   # Jupyter analysis notebook
└── models/
    └── trained_model.pkl          # Serialized model
```
Application Features
User Interface Components:
- Input Forms: Intuitive data collection interfaces
- Validation: Real-time input validation with error messaging
- Visualization: Interactive charts showing risk factors and premium breakdowns
- Responsive Design: Mobile-optimized interface for accessibility
Prediction Pipeline (a minimal sketch follows this list):
- Data Collection: Streamlit widgets capture user inputs
- Feature Engineering: Automatic BMI and health score calculations
- Preprocessing: Data scaling using trained StandardScaler
- Model Inference: Random Forest generates premium predictions
- Results Presentation: Premium estimates with confidence intervals and risk analysis
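A minimal sketch of the Streamlit flow described above; the model path, feature order, and widget set are assumptions, and the trained StandardScaler step is omitted for brevity:

```python
import pickle
import numpy as np
import streamlit as st

@st.cache_resource
def load_model(path='models/trained_model.pkl'):  # path assumed from the repo layout
    with open(path, 'rb') as f:
        return pickle.load(f)

model = load_model()

st.title('Insurance Premium Predictor')
age = st.number_input('Age', min_value=18, max_value=66, value=30)
height = st.number_input('Height (cm)', min_value=145, max_value=188, value=170)
weight = st.number_input('Weight (kg)', min_value=51, max_value=132, value=70)
diabetes = st.checkbox('Diabetes')

if st.button('Predict premium'):
    bmi = weight / (height / 100) ** 2                          # derived feature
    X = np.array([[age, int(diabetes), height, weight, bmi]])   # assumed feature order
    # NOTE: the real pipeline also applies the trained StandardScaler first
    st.success(f'Estimated premium: {model.predict(X)[0]:,.0f}')
```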
Deployment Options
Local Development:
```bash
git clone https://github.com/mhdSharuk/Insurance-Cost-Prediction.git
cd Insurance-Cost-Prediction
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
Docker Deployment:
```bash
docker build -t insurance-prediction .
docker run -p 8501:8501 insurance-prediction
```
Production Considerations:
- Scalability: Container orchestration for high-traffic scenarios
- Monitoring: Application performance and prediction accuracy tracking
- Security: Input validation and data protection measures
- Maintenance: Model retraining pipelines for performance maintenance
Project Repository
GitHub Repository: [Insurance Cost Prediction](https://github.com/mhdSharuk/Insurance-Cost-Prediction)
Repository Features:
- Complete Codebase: Full implementation with documentation
- Jupyter Notebooks: Detailed analysis and experimentation
- Model Artifacts: Trained models and preprocessing pipelines
- Deployment Scripts: Docker and local deployment configurations
- Documentation: Comprehensive README and inline code documentation