<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jenish</title>
    <description>The latest articles on DEV Community by jenish (@jenish_d1a020f6fc2cc4e90d).</description>
    <link>https://dev.to/jenish_d1a020f6fc2cc4e90d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3593975%2F1dbb99c4-be3e-4468-89e8-4d9479e3a03a.jpg</url>
      <title>DEV Community: jenish</title>
      <link>https://dev.to/jenish_d1a020f6fc2cc4e90d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jenish_d1a020f6fc2cc4e90d"/>
    <language>en</language>
    <item>
      <title>Insurance Cost Prediction</title>
      <dc:creator>jenish</dc:creator>
      <pubDate>Mon, 03 Nov 2025 12:51:34 +0000</pubDate>
      <link>https://dev.to/jenish_d1a020f6fc2cc4e90d/insurance-cost-prediction-1lbj</link>
      <guid>https://dev.to/jenish_d1a020f6fc2cc4e90d/insurance-cost-prediction-1lbj</guid>
      <description>&lt;h1&gt;
  
  
  Contents
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis&lt;/li&gt;
&lt;li&gt;Hypothesis Testing&lt;/li&gt;
&lt;li&gt;Modelling&lt;/li&gt;
&lt;li&gt;Model comparison and performance&lt;/li&gt;
&lt;li&gt;Deployment&lt;/li&gt;
&lt;li&gt;GitHub Repository&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;1. Introduction&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;1.1 About Project&lt;/strong&gt;&lt;br&gt;
Insurance companies need to accurately predict the cost of health insurance for individuals to set premiums appropriately. However, traditional methods of cost prediction often rely on broad actuarial tables and historical averages, which may not account for the nuanced differences among individuals. By leveraging machine learning techniques, insurers can predict more accurately the insurance costs tailored to individual profiles, leading to more competitive pricing and better risk management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2 Objectives&lt;/strong&gt;&lt;br&gt;
The primary need for this project arises from the challenges insurers face in pricing policies accurately while remaining competitive in the market. Inaccurate predictions can lead to losses for insurers and unfairly high premiums for policyholders. By implementing a machine learning model, insurers can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhance Precision in Pricing:&lt;/strong&gt; Use individual data points to determine premiums that reflect actual risk more closely than generic estimates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase Competitiveness:&lt;/strong&gt; Offer rates that are attractive to consumers while ensuring that the pricing is sustainable for the insurer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve Customer Satisfaction:&lt;/strong&gt; Fair and transparent pricing based on personal health data can increase trust and satisfaction among policyholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable Personalized Offerings:&lt;/strong&gt; Create customized insurance packages based on predicted costs, which can cater more directly to the needs and preferences of individuals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Assessment:&lt;/strong&gt; Insurers can use the model to refine their risk assessment processes, identifying key factors that influence costs most significantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1.3 Concepts used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Cleaning and pre-processing&lt;/li&gt;
&lt;li&gt;EDA&lt;/li&gt;
&lt;li&gt;Hypothesis testing (using the scipy library for statistical tests)&lt;/li&gt;
&lt;li&gt;Predictions using regression models (using the sklearn library)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1.4 Data Source&lt;/strong&gt;&lt;br&gt;
Dataset provided by Scaler&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;2. Exploratory Data Analysis (EDA)&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;EDA is often the most important and time-consuming part of a data science project, because it takes some imagination to dig out useful insights while keeping them interpretable. It helps us, and other stakeholders, understand the data better, which in turn supports better business decisions. It also shows how the target variable behaves with respect to different features, which guides feature engineering later on.&lt;/p&gt;

&lt;p&gt;Before starting the EDA, we first import libraries such as pandas, numpy, and matplotlib, and then read the data, which is in .csv format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Description&lt;/strong&gt;&lt;br&gt;
The dataset comprises the following 11 attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Age:&lt;/strong&gt; Numeric, ranging from 18 to 66 years.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diabetes:&lt;/strong&gt; Binary (0 or 1), where 1 indicates the presence of diabetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BloodPressureProblems:&lt;/strong&gt; Binary (0 or 1), indicating the presence of blood pressure-related issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AnyTransplants:&lt;/strong&gt; Binary (0 or 1), where 1 indicates the person has had a transplant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AnyChronicDiseases:&lt;/strong&gt; Binary (0 or 1), indicating the presence of any chronic diseases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Height:&lt;/strong&gt; Numeric, measured in centimeters, ranging from 145 cm to 188 cm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight:&lt;/strong&gt; Numeric, measured in kilograms, ranging from 51 kg to 132 kg.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KnownAllergies:&lt;/strong&gt; Binary (0 or 1), where 1 indicates known allergies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HistoryOfCancerInFamily:&lt;/strong&gt; Binary (0 or 1), indicating a family history of cancer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NumberOfMajorSurgeries:&lt;/strong&gt; Numeric, counting the number of major surgeries, ranging from 0 to 3 surgeries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PremiumPrice:&lt;/strong&gt; Numeric, representing the premium price in currency, ranging from 15,000 to 40,000.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reading Dataset&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

def read_data(url):
  # Load the CSV and rename the columns to snake_case
  df = pd.read_csv(url)
  df.rename(columns={
      'Age'                     : 'age',
      'Diabetes'                : 'diabetes',
      'BloodPressureProblems'   : 'blood_pressure_problems',
      'AnyTransplants'          : 'any_transplants',
      'AnyChronicDiseases'      : 'any_chronic_diseases',
      'Height'                  : 'height',
      'Weight'                  : 'weight',
      'KnownAllergies'          : 'known_allergies',
      'HistoryOfCancerInFamily' : 'history_of_cancer_in_family',
      'NumberOfMajorSurgeries'  : 'number_of_major_surgeries',
      'PremiumPrice'            : 'premium_price'
  }, inplace=True)

  return df

CSV_URL = 'https://drive.google.com/uc?id=1NBk1TFkK4NeKdodR2DxIdBp2Mk1mh4AS'
df = read_data(CSV_URL)

df.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s4nn6r6w46me5tw8aw0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s4nn6r6w46me5tw8aw0.webp" alt=" " width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset Shape&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rows, cols = df.shape
print(f'Number of rows : {rows}')
print(f'Number of columns : {cols}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgousjic9g3vdk9l6giqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgousjic9g3vdk9l6giqv.png" alt=" " width="170" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Column Datatypes&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftts6eow96x89dfzta787.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftts6eow96x89dfzta787.png" alt=" " width="242" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Numerical Data Description&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70gsvhbh9c816kor9uz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70gsvhbh9c816kor9uz8.png" alt=" " width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Categorical Data Description&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c8nlsxa7k66bp8umrec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c8nlsxa7k66bp8umrec.png" alt=" " width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Null Value Counts&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmqqfldzmyxw63gpa9bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmqqfldzmyxw63gpa9bq.png" alt=" " width="202" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Number of outliers&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbk7n32a22v8560j23xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbk7n32a22v8560j23xa.png" alt=" " width="296" height="323"&gt;&lt;/a&gt;&lt;/p&gt;
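&lt;p&gt;The outlier counts above are shown only as a screenshot. As a minimal sketch, assuming the common 1.5 x IQR rule (the article does not state which rule was used), per-column outlier counts could be computed as follows. The helper name count_iqr_outliers and the toy data are hypothetical.&lt;/p&gt;

```python
import pandas as pd


def count_iqr_outliers(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # True for values inside the whiskers (bounds inclusive)
        inside = frame[col].between(q1 - k * iqr, q3 + k * iqr)
        counts[col] = int((~inside).sum())
    return pd.Series(counts, name="outliers")


# Toy data for illustration only (not the project dataset)
toy = pd.DataFrame({
    "weight": [60, 62, 65, 70, 72, 75, 78, 132],
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
})
outlier_counts = count_iqr_outliers(toy, ["weight", "age"])
print(outlier_counts)
```

&lt;p&gt;On the real data, the same helper would be called with the numeric column names of df.&lt;/p&gt;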

&lt;p&gt;&lt;strong&gt;Mean Premium by All Disease Interaction Combinations&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7zcy6lgrb09taw9odsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7zcy6lgrb09taw9odsp.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target Variable Distribution (Premium Price)&lt;/strong&gt;&lt;br&gt;
The data is close to normally distributed, with a slight left skew.&lt;br&gt;
There are not many outliers in the target variable.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuqpwvkbtwe3gtwu87dz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuqpwvkbtwe3gtwu87dz.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Height, Weight, BMI Distributions&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhogdlek6yv4utqd668x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhogdlek6yv4utqd668x.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhszktkj07ysua2tdprj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhszktkj07ysua2tdprj.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lycugdkpczc7jzg19jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lycugdkpczc7jzg19jg.png" alt=" " width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BMI Category counts&lt;/strong&gt;&lt;br&gt;
A large share of the records fall in the overweight category, followed by obese.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06fhub92yy0pp9thte58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06fhub92yy0pp9thte58.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;
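&lt;p&gt;BMI is not a column in the raw dataset, so it has to be derived from height and weight. The sketch below assumes the standard formula (weight in kg divided by height in metres squared) and WHO-style cut-offs of 18.5, 25, and 30. The cut-offs, the label names, and the helper add_bmi are assumptions, not taken from the project code.&lt;/p&gt;

```python
import pandas as pd


def add_bmi(frame):
    """Add bmi and bmi_category columns from height (cm) and weight (kg)."""
    out = frame.copy()
    height_m = out["height"] / 100
    out["bmi"] = out["weight"] / height_m ** 2
    # Assumed WHO-style cut-offs; right=False makes each bin [lower, upper)
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,
    )
    return out


# Toy data for illustration only
toy = pd.DataFrame({"height": [170, 160, 180], "weight": [65, 80, 55]})
toy_bmi = add_bmi(toy)
print(toy_bmi)
print(toy_bmi["bmi_category"].value_counts())
```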

&lt;p&gt;&lt;strong&gt;Correlation Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pearson&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe86n3jsccqvukorezulc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe86n3jsccqvukorezulc.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spearman&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqegj4m4lob5pqp1e8ue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqegj4m4lob5pqp1e8ue.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;
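&lt;p&gt;Both correlation matrices can be produced with pandas. This is a small sketch on toy data, not the project's plotting code. Pearson measures linear association, while Spearman works on ranks and therefore captures any monotonic relationship.&lt;/p&gt;

```python
import pandas as pd

# Toy data for illustration only: premium rises with age, but not linearly
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "premium_price": [15000, 16000, 20000, 30000, 39000],
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

&lt;p&gt;Here the Spearman coefficient is exactly 1 because the relationship is strictly increasing, while the Pearson coefficient is slightly lower because it is not perfectly linear.&lt;/p&gt;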
&lt;h1&gt;
  
  
  3. Hypothesis Testing
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;T-Test&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This test compares group means across the binary variables (health conditions), for premium price and the other numeric features, to see whether there are any significant differences.&lt;/p&gt;
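&lt;p&gt;A two-sample t-test of this kind can be sketched with scipy as below. The helper ttest_by_flag and the toy data are hypothetical, and Welch's variant (equal_var=False) is an assumption, since the article does not say whether equal variances were assumed. H0 is rejected when the p-value falls below the chosen significance level, commonly 0.05.&lt;/p&gt;

```python
import pandas as pd
from scipy import stats


def ttest_by_flag(frame, flag, value):
    """Welch two-sample t-test of `value` between flag == 1 and flag == 0."""
    group1 = frame.loc[frame[flag] == 1, value]
    group0 = frame.loc[frame[flag] == 0, value]
    t_stat, p_value = stats.ttest_ind(group1, group0, equal_var=False)
    return t_stat, p_value


# Toy data for illustration only
toy = pd.DataFrame({
    "diabetes": [0, 0, 0, 0, 1, 1, 1, 1],
    "premium_price": [15000, 16000, 15500, 16500, 28000, 29000, 30000, 31000],
})
t_stat, p_value = ttest_by_flag(toy, "diabetes", "premium_price")
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
```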

&lt;p&gt;&lt;strong&gt;Diabetes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no difference in means between diabetic and non-diabetic groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant difference in means between diabetic and non-diabetic groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Age and number of major surgeries significantly differ between diabetic and non-diabetic patients. Physical measurements like BMI, weight, and height show no significant differences. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewt5xilmmlw3taxvmgj3.png" alt=" " width="443" height="186"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Blood Pressure Problems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no difference in means between groups with and without blood pressure problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant difference in means between groups with and without blood pressure problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Age and number of major surgeries are significantly higher in people with blood pressure problems. Physical stats don't show meaningful differences between the two groups.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uhihs2eohrk2mwswr6j.png" alt=" " width="465" height="201"&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Any Transplants&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no difference in means between transplant and non-transplant groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant difference in means between transplant and non-transplant groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Only premium price differs significantly between groups, while age becomes non-significant. Physical measurements and surgery history show no significant differences.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ak65bnzcv8z45lc4sza.png" alt=" " width="461" height="200"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chronic Diseases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no difference in means between groups with and without chronic diseases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant difference in means between groups with and without chronic diseases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Premium price is significantly higher for people with chronic diseases, but age doesn't differ significantly. Physical measurements remain unimportant across groups. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02piwmdbp844hqv6g7an.png" alt=" " width="467" height="210"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Known Allergies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no difference in means between groups with and without known allergies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant difference in means between groups with and without known allergies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Only number of major surgeries differs significantly between allergy groups. Age, premium price, and physical measurements show no meaningful differences.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh4b44a8k0fcxupdgcmy.png" alt=" " width="431" height="203"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cancer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no difference in means between groups with and without family cancer history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant difference in means between groups with and without family cancer history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; People with family cancer history have significantly higher premiums and more major surgeries. Age and physical measurements don't differ significantly between groups.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffczrz3gioa7x8vftpio4.png" alt=" " width="448" height="186"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;ANOVA&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This test compares mean values across the levels of different categorical variables.&lt;/p&gt;
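&lt;p&gt;A one-way ANOVA can be run with scipy.stats.f_oneway by passing one sample per category level. The sketch below uses toy data, and the helper anova_by_group is hypothetical.&lt;/p&gt;

```python
import pandas as pd
from scipy import stats


def anova_by_group(frame, group_col, value_col):
    """One-way ANOVA of value_col across the levels of group_col."""
    samples = [grp[value_col].to_numpy() for _, grp in frame.groupby(group_col)]
    f_stat, p_value = stats.f_oneway(*samples)
    return f_stat, p_value


# Toy data for illustration only
toy = pd.DataFrame({
    "number_of_major_surgeries": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "premium_price": [15000, 16000, 17000, 21000, 22000, 23000,
                      28000, 29000, 30000],
})
f_stat, p_value = anova_by_group(toy, "number_of_major_surgeries", "premium_price")
print(f"F = {f_stat:.1f}, p = {p_value:.6f}")
```

&lt;p&gt;A small p-value only says that at least one group mean differs. It does not say which one, so a post-hoc comparison would be needed for that.&lt;/p&gt;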

&lt;p&gt;&lt;strong&gt;Number of Major Surgeries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; Number of major surgeries has no effect on insurance premium prices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; Number of major surgeries significantly affects insurance premium prices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Age strongly affects insurance prices, but BMI, weight, and height don't matter much. Insurance companies care more about how old you are than your basic body measurements. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16btbnfe622jbzbboaq8.png" alt=" " width="388" height="167"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Age Group&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; Age group has no effect on insurance premium prices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; Age group significantly affects insurance premium prices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Age is by far the biggest factor in determining insurance costs. Your physical stats like weight and height don't significantly impact pricing. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnhrsry2fxzii0c2iaf1.png" alt=" " width="372" height="178"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Health Score&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; Health score has no effect on insurance premium prices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; Health score significantly affects insurance premium prices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Age still matters for insurance pricing, but less so when health scores are involved. Physical measurements like BMI and weight consistently don't affect prices much.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb82n742moex3b34uw70r.png" alt=" " width="381" height="182"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Chi-Squared Contingency Test&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This test is conducted between each binary feature and the other categorical features.&lt;/p&gt;
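&lt;p&gt;The chi-squared test of independence works on a contingency table of counts. A minimal sketch with pandas.crosstab and scipy.stats.chi2_contingency follows. The helper chi2_between and the toy data are hypothetical. Note that scipy applies Yates' continuity correction by default for 2x2 tables.&lt;/p&gt;

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_between(frame, col_a, col_b):
    """Chi-squared test of independence between two categorical columns."""
    table = pd.crosstab(frame[col_a], frame[col_b])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof


# Toy data for illustration only: counts are [[8, 2], [3, 7]]
toy = pd.DataFrame({
    "diabetes": [0] * 10 + [1] * 10,
    "blood_pressure_problems": [0] * 8 + [1] * 2 + [0] * 3 + [1] * 7,
})
chi2, p_value, dof = chi2_between(toy, "diabetes", "blood_pressure_problems")
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```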

&lt;p&gt;&lt;strong&gt;Diabetes vs other diseases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no association between diabetes and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant association between diabetes and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Diabetes shows strong associations with blood pressure problems, chronic diseases, allergies, surgeries, age group, and health score. No significant link with transplants or family cancer history. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fced12ib3btv63ldfwjto.png" alt=" " width="515" height="252"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Blood Pressure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no association between blood pressure problems and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant association between blood pressure problems and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Blood pressure problems are significantly linked to diabetes, number of surgeries, age group, and health score. No associations found with transplants, chronic diseases, allergies, or family cancer history. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkpt28h1aaws57wbt79h.png" alt=" " width="581" height="245"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Any Transplants&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no association between transplant history and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant association between transplant history and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Transplants show no significant associations with any other health conditions or factors. This suggests transplant patients are distributed randomly across other health categories.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeaub23a43rckutt3agj.png" alt=" " width="583" height="242"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Chronic Diseases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no association between chronic diseases and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant association between chronic diseases and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Chronic diseases are significantly associated with diabetes, age group, and health score. No links found with blood pressure, transplants, allergies, or family cancer history. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwbjycqxqv20i5m4t2d.png" alt=" " width="581" height="242"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Known Allergies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no association between known allergies and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant association between known allergies and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Allergies show significant associations with diabetes, family cancer history, number of surgeries, and health score. No connections with blood pressure, transplants, chronic diseases, or age group. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeqojrlr8goknac3r229.png" alt=" " width="581" height="250"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cancer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;H0:&lt;/strong&gt; There is no association between family cancer history and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;H1:&lt;/strong&gt; There is a significant association between family cancer history and other health conditions/factors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Family cancer history is significantly linked to allergies, number of surgeries, and health score. No associations with diabetes, blood pressure, transplants, chronic diseases, or age group. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flervue2q06nr7furze0g.png" alt=" " width="611" height="235"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;4. Modelling&lt;/strong&gt;
&lt;/h1&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Data Preprocessing before modelling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Handling the missing values&lt;/strong&gt;&lt;br&gt;
The Random Forest Iterative Imputer is used to fill in missing values in a dataset. Unlike simple imputation methods (like mean, median, or mode), this method predicts missing values based on other features using a machine learning model, in this case a Random Forest model&lt;/p&gt;
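&lt;p&gt;One way to build such an imputer is scikit-learn's IterativeImputer with a RandomForestRegressor as its estimator. This is an assumption about the setup, since the article does not show the imputation code, and the hyperparameters and toy data below are illustrative only.&lt;/p&gt;

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Toy data with missing values, for illustration only
toy = pd.DataFrame({
    "age": [25, 32, 41, 50, 58, 63],
    "weight": [60, 72, np.nan, 80, 85, np.nan],
    "height": [165, 170, 172, np.nan, 168, 175],
})

# Each feature with gaps is predicted from the other features, round by round
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed.round(1))
```

&lt;p&gt;To avoid leakage, the imputer would normally be fitted on the training split only and then applied to the test split with transform.&lt;/p&gt;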

&lt;p&gt;&lt;strong&gt;Ordinal Encoding&lt;/strong&gt;&lt;br&gt;
Ordinal encoding is a way to convert categorical features into numerical values, but it’s used specifically for ordinal categories—categories with a clear, meaningful order.&lt;br&gt;
This step is done for the features overall_risk_category and bmi_category&lt;/p&gt;
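&lt;p&gt;A short sketch of ordinal encoding with scikit-learn's OrdinalEncoder is shown below. The category labels and their order are assumptions, since the article names the two features but does not list their levels.&lt;/p&gt;

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed category orders, lowest to highest; the project's labels may differ
bmi_order = ["underweight", "normal", "overweight", "obese"]
risk_order = ["low", "medium", "high"]

# Toy data for illustration only
toy = pd.DataFrame({
    "bmi_category": ["normal", "obese", "underweight", "overweight"],
    "overall_risk_category": ["low", "high", "low", "medium"],
})

# Passing explicit category lists fixes the integer assigned to each level
encoder = OrdinalEncoder(categories=[bmi_order, risk_order])
encoded = encoder.fit_transform(toy[["bmi_category", "overall_risk_category"]])
print(encoded)
```

&lt;p&gt;Without the explicit categories argument the encoder sorts labels alphabetically, which would not preserve the intended order.&lt;/p&gt;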

&lt;p&gt;&lt;strong&gt;Standard Scaling&lt;/strong&gt;&lt;br&gt;
Standard Scaling (also called Standardization) is a technique to rescale numeric features so that they have a mean of 0 and a standard deviation of 1.&lt;br&gt;
This helps many machine learning algorithms work better, especially those that are sensitive to the scale of features.&lt;br&gt;
This step is done for all the numerical features&lt;/p&gt;
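&lt;p&gt;Standardization replaces each value x with (x - mean) / standard deviation, computed per column. A minimal sketch with scikit-learn's StandardScaler on toy data:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only
toy = pd.DataFrame({"age": [20, 30, 40, 50], "weight": [55, 65, 75, 85]})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# After scaling, every column has mean 0 and standard deviation 1
print(np.round(scaled.mean(axis=0), 6))
print(np.round(scaled.std(axis=0), 6))
```

&lt;p&gt;In a train/test setup the scaler would be fitted on the training data only, and the same fitted scaler reused for the test data and at prediction time.&lt;/p&gt;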
&lt;h2&gt;
  
  
  &lt;strong&gt;Model Development&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Regression Problem Framework&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Predict continuous insurance premium values using multiple regression approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RMSE&lt;/strong&gt; (Root Mean Squared Error): Measures the typical size of prediction errors, penalizing large errors more heavily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAE&lt;/strong&gt; (Mean Absolute Error): Measures the average absolute prediction error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R² Score&lt;/strong&gt;: Quantifies the proportion of variance in the target that the model explains&lt;/li&gt;
&lt;/ul&gt;
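&lt;p&gt;The three metrics can be computed with scikit-learn as sketched below. The true and predicted premiums are invented numbers, used only to show the calls.&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented premiums for illustration only.
y_true = np.array([20000.0, 25000.0, 30000.0, 22000.0])
y_pred = np.array([21000.0, 24000.0, 27000.0, 22500.0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R2:   {r2:.3f}")
```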
&lt;h3&gt;
  
  
  &lt;strong&gt;Data Preprocessing Pipeline&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Missing Value Treatment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Method:&lt;/strong&gt; Random Forest Iterative Imputer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advantage:&lt;/strong&gt; Predicts missing values using machine learning rather than simple statistical measures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt; Leverages feature relationships for accurate imputation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Feature Encoding:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ordinal Encoding:&lt;/strong&gt; Applied to hierarchical categories (risk levels, BMI categories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rationale:&lt;/strong&gt; Preserves natural ordering in categorical variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Feature Scaling:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Method:&lt;/strong&gt; Standard Scaling (Standardization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process:&lt;/strong&gt; Transforms features to mean=0, standard deviation=1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefit:&lt;/strong&gt; Keeps features with large numeric ranges from dominating scale-sensitive algorithms&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Model Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Algorithm Selection:&lt;/strong&gt;&lt;br&gt;
Five distinct regression approaches were implemented to ensure comprehensive performance comparison:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Linear Regression: Baseline linear relationship model&lt;/li&gt;
&lt;li&gt;Decision Tree Regressor: Non-linear, interpretable tree-based approach&lt;/li&gt;
&lt;li&gt;Random Forest Regressor: Ensemble method combining multiple decision trees&lt;/li&gt;
&lt;li&gt;Gradient Boosting Regressor: Sequential boosting with error correction&lt;/li&gt;
&lt;li&gt;XGBoost Regressor: Optimized gradient boosting with advanced regularization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Cross-Validation Strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Method:&lt;/strong&gt; 5-fold cross-validation with shuffling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Gives a more robust performance estimate than a single train/test split and helps detect overfitting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility:&lt;/strong&gt; Fixed random seed for consistent results&lt;/li&gt;
&lt;/ul&gt;
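&lt;p&gt;A sketch of how the listed regressors could be compared under 5-fold shuffled cross-validation with a fixed seed. The synthetic dataset, the default hyperparameters, and the seed value 42 are assumptions for illustration; XGBoost is included only if the xgboost package is installed.&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance features and premiums.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
try:
    # Optional dependency: skipped when xgboost is not available.
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(random_state=42)
except ImportError:
    pass

# 5 folds, shuffled, with a fixed seed so the splits are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # scikit-learn reports errors as negative scores, so flip the sign.
    results[name] = {
        "RMSE": -scores["test_rmse"].mean(),
        "MAE": -scores["test_mae"].mean(),
        "R2": scores["test_r2"].mean(),
    }

for name, row in results.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```

&lt;p&gt;Reusing the same KFold object for every model means all of them are scored on identical splits, which keeps the comparison fair.&lt;/p&gt;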
&lt;h3&gt;
  
  
  &lt;strong&gt;Performance Evaluation Process&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model Training:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Feature-target separation&lt;/li&gt;
&lt;li&gt;Cross-validation implementation&lt;/li&gt;
&lt;li&gt;Hyperparameter optimization&lt;/li&gt;
&lt;li&gt;Full dataset retraining&lt;/li&gt;
&lt;li&gt;Performance metric calculation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Statistical Validation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confidence Intervals:&lt;/strong&gt; 95% confidence intervals for prediction reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual Analysis:&lt;/strong&gt; Error pattern examination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bias-Variance Assessment:&lt;/strong&gt; Model stability evaluation&lt;/li&gt;
&lt;/ul&gt;
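&lt;p&gt;The article does not state how the 95% confidence intervals were computed. One common approach, shown here purely as an assumption, is a t-based interval on the mean predicted premium. The predictions below are synthetic.&lt;/p&gt;

```python
import numpy as np
from scipy import stats

# Synthetic predictions standing in for a model's output.
rng = np.random.default_rng(42)
predictions = rng.normal(loc=24000.0, scale=5000.0, size=500)

mean_pred = predictions.mean()
# Standard error of the mean (sample standard deviation / sqrt(n)).
sem = stats.sem(predictions)

# 95% t-interval around the mean prediction.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(predictions) - 1, loc=mean_pred, scale=sem
)

print(f"mean={mean_pred:.2f}, CI_low={ci_low:.2f}, CI_high={ci_high:.2f}")
```

&lt;p&gt;An interval of this kind describes uncertainty about the average prediction, not about any single customer's premium; per-prediction intervals would need a different method.&lt;/p&gt;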
&lt;h2&gt;
  
  
  &lt;strong&gt;5. Model comparison and performance&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Comprehensive Model Comparison&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;th&gt;MAE&lt;/th&gt;
&lt;th&gt;R²&lt;/th&gt;
&lt;th&gt;CI_low&lt;/th&gt;
&lt;th&gt;CI_high&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear Regression&lt;/td&gt;
&lt;td&gt;3542.13&lt;/td&gt;
&lt;td&gt;2419.16&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;23987.81&lt;/td&gt;
&lt;td&gt;24635.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision Tree&lt;/td&gt;
&lt;td&gt;3889.32&lt;/td&gt;
&lt;td&gt;1147.06&lt;/td&gt;
&lt;td&gt;0.612&lt;/td&gt;
&lt;td&gt;24033.06&lt;/td&gt;
&lt;td&gt;24808.72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;2858.16&lt;/td&gt;
&lt;td&gt;1249.14&lt;/td&gt;
&lt;td&gt;0.791&lt;/td&gt;
&lt;td&gt;24053.53&lt;/td&gt;
&lt;td&gt;24743.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Boosting&lt;/td&gt;
&lt;td&gt;3109.07&lt;/td&gt;
&lt;td&gt;1724.89&lt;/td&gt;
&lt;td&gt;0.752&lt;/td&gt;
&lt;td&gt;24027.90&lt;/td&gt;
&lt;td&gt;24719.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;3039.57&lt;/td&gt;
&lt;td&gt;1509.54&lt;/td&gt;
&lt;td&gt;0.763&lt;/td&gt;
&lt;td&gt;24026.34&lt;/td&gt;
&lt;td&gt;24730.57&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Model Performance Rankings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RMSE Performance (Lower is Better):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Random Forest&lt;/strong&gt; (2858.16) - Superior accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XGBoost&lt;/strong&gt; (3039.57) - Strong performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Boosting&lt;/strong&gt; (3109.07) - Competitive results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear Regression&lt;/strong&gt; (3542.13) - Baseline performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision Tree&lt;/strong&gt; (3889.32) - Highest error&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;R² Score Performance (Higher is Better):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Random Forest&lt;/strong&gt; (0.791) - Explains 79.1% of variance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XGBoost&lt;/strong&gt; (0.763) - Explains 76.3% of variance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Boosting&lt;/strong&gt; (0.752) - Explains 75.2% of variance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear Regression&lt;/strong&gt; (0.678) - Explains 67.8% of variance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision Tree&lt;/strong&gt; (0.612) - Explains 61.2% of variance&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Key Performance Insights&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Champion Model: Random Forest&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; Achieves lowest RMSE and highest R² score&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stability:&lt;/strong&gt; Balanced performance across all metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization:&lt;/strong&gt; Strong cross-validation performance indicates robust generalization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Tight confidence intervals suggest consistent predictions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Algorithm Analysis:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ensemble Method Superiority:&lt;/strong&gt;&lt;br&gt;
Tree-based ensemble methods (Random Forest, XGBoost, Gradient Boosting) outperform both single models (Linear Regression and the Decision Tree) on RMSE and R², demonstrating the power of combining multiple learners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Tree Characteristics:&lt;/strong&gt;&lt;br&gt;
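The Decision Tree shows the lowest MAE (1147.06) but the highest RMSE (3889.32). Its errors are small on typical cases, but a minority of large misses on outliers and extreme values, which RMSE penalizes quadratically, pull its overall score down.&lt;/p&gt;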
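&lt;p&gt;To see how a model can have a lower MAE yet a higher RMSE, consider the invented error vectors below. They are not taken from the project's results; they only illustrate the arithmetic.&lt;/p&gt;

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a vector of prediction errors."""
    return np.abs(errors).mean()

def rmse(errors):
    """Root mean squared error of a vector of prediction errors."""
    return np.sqrt((errors ** 2).mean())

# Model A: a moderate error on every case.
errors_a = np.array([1500.0] * 10)
# Model B: small errors on nine cases, one very large miss.
errors_b = np.array([200.0] * 9 + [12000.0])

print(f"A: MAE={mae(errors_a):.2f}, RMSE={rmse(errors_a):.2f}")
print(f"B: MAE={mae(errors_b):.2f}, RMSE={rmse(errors_b):.2f}")
```

&lt;p&gt;Model B has the lower MAE (1380 against 1500), but its single large miss is squared before averaging, so its RMSE ends up well above Model A's.&lt;/p&gt;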

&lt;p&gt;&lt;strong&gt;Linear Model Performance:&lt;/strong&gt;&lt;br&gt;
Despite its simplicity, Linear Regression delivers respectable performance, suggesting underlying linear relationships in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prediction Reliability:&lt;/strong&gt;&lt;br&gt;
All models demonstrate tight confidence intervals, indicating stable and reliable prediction capabilities across the dataset.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;6. Deployment Strategy&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Application Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Technology Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Streamlit for interactive web interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Python with scikit-learn for model inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Docker containerization for scalable deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Control:&lt;/strong&gt; Git with structured repository organization&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Project Structure&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Insurance-Cost-Prediction/
├── app.py                          # Streamlit application entry point
├── requirements.txt                # Python dependencies
├── README.md                       # Documentation
├── Dockerfile                      # Container configuration
├── .gitignore                      # Version control exclusions
├── tableau/
│   └── insurance_workbook.twb      # Tableau visualization
├── src/
│   ├── __init__.py
│   ├── config.py                   # Configuration management
│   ├── features.py                 # Feature engineering
│   ├── model_utils.py              # Model utilities
│   └── preprocessing.py            # Data preprocessing
├── notebooks/
│   └── Insurance_Analysis.ipynb    # Jupyter analysis notebook
└── models/
    └── trained_model.pkl           # Serialized model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Application Features&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;User Interface Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Forms:&lt;/strong&gt; Intuitive data collection interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation:&lt;/strong&gt; Real-time input validation with error messaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization:&lt;/strong&gt; Interactive charts showing risk factors and premium breakdowns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responsive Design:&lt;/strong&gt; Mobile-optimized interface for accessibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prediction Pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Collection:&lt;/strong&gt; Streamlit widgets capture user inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Engineering:&lt;/strong&gt; Automatic BMI and health score calculations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing:&lt;/strong&gt; Data scaling using trained StandardScaler&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Inference:&lt;/strong&gt; Random Forest generates premium predictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results Presentation:&lt;/strong&gt; Premium estimates with confidence intervals and risk analysis&lt;/li&gt;
&lt;/ol&gt;
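&lt;p&gt;The feature engineering, preprocessing, and inference steps of this pipeline can be sketched without the Streamlit layer as follows. Everything here is an assumption for illustration: the two input features, the synthetic training data, and the function names compute_bmi and predict_premium are hypothetical, and the health score is left out because the article does not give its formula. In the real app the fitted scaler and model would presumably be loaded from disk rather than trained inline.&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

def compute_bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)

# Synthetic training data standing in for the real dataset (columns: age, bmi).
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(18, 65, 200), rng.uniform(17, 40, 200)])
y_train = 15000 + 150 * X_train[:, 0] + 200 * X_train[:, 1] + rng.normal(0, 500, 200)

# Fit the scaler and model once; the app would reuse these fitted objects.
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(scaler.transform(X_train), y_train)

def predict_premium(age, weight_kg, height_cm):
    """Engineer features, apply the fitted scaler, and return one prediction."""
    features = np.array([[age, compute_bmi(weight_kg, height_cm)]])
    return float(model.predict(scaler.transform(features))[0])

print(round(compute_bmi(70, 175), 2))
print(predict_premium(age=35, weight_kg=70, height_cm=175))
```

&lt;p&gt;The key point is that new inputs go through the scaler fitted on the training data (transform, not fit_transform), so inference sees features on the same scale as training.&lt;/p&gt;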
&lt;h2&gt;
  
  
  &lt;strong&gt;Deployment Options&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Local Development:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/mhdSharuk/Insurance-Cost-Prediction.git
cd Insurance-Cost-Prediction
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Docker Deployment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t insurance-prediction .
docker run -p 8501:8501 insurance-prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Container orchestration for high-traffic scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring:&lt;/strong&gt; Application performance and prediction accuracy tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Input validation and data protection measures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Model retraining pipelines for performance maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;7. Project Repository&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/jenishoza/DS-Portfolio-Project---Insurace-Cost-Prediction/tree/main" rel="noopener noreferrer"&gt;Insurance Cost Prediction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Codebase:&lt;/strong&gt; Full implementation with documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter Notebooks:&lt;/strong&gt; Detailed analysis and experimentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Artifacts:&lt;/strong&gt; Trained models and preprocessing pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Scripts:&lt;/strong&gt; Docker and local deployment configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; Comprehensive README and inline code documentation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>github</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
