
EricMWaimiri

How I Explored a US Health Dataset with Python — EDA + Hypothesis Testing

I recently completed an exploratory data analysis project on the NHANES (National Health and Nutrition Examination Survey) dataset from Kaggle. It's a real-world health survey collected by the CDC covering body measurements, lifestyle habits, and demographic data from thousands of US adults.

In this article I'll walk you through exactly what I did — from loading and cleaning the data all the way to running statistical tests — and share what I found along the way.


The Dataset

The dataset has 5,735 rows and 28 columns, but for this project I focused on 8 of them: the respondent ID (SEQN, dropped during cleaning) plus the 7 columns below, which were relevant to the questions I wanted to answer:

| Column | Description |
|---|---|
| smoking | Has the person smoked at least 100 cigarettes? |
| gender | Male or Female |
| age | Age in years |
| education | Highest level of education |
| weight | Weight in kg |
| height | Height in cm |
| bmi | Body Mass Index |

Step 1 — Loading and Selecting Columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

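# Load the full NHANES file downloaded from Kaggle
db = pd.read_csv('NHANES.csv')

# Keep only the columns needed here; a list of labels is the idiomatic selector,
# and .copy() avoids chained-assignment warnings during the in-place cleaning later
data = db[['SEQN', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
           'DMDEDUC2', 'BMXWT', 'BMXHT', 'BMXBMI']].copy()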

data = data.rename(columns={
    'SEQN': 'id', 'SMQ020': 'smoking', 'RIAGENDR': 'gender',
    'RIDAGEYR': 'age', 'DMDEDUC2': 'education',
    'BMXWT': 'weight', 'BMXHT': 'height', 'BMXBMI': 'bmi'
})

One thing worth knowing about NHANES: the categorical columns come in as numeric codes. For gender, 1 means Male and 2 means Female. For smoking, 1 means the person has smoked at least 100 cigarettes and 2 means they haven't. You have to map these to readable labels before doing any analysis, otherwise your charts are meaningless.
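Before mapping, it helps to see which codes actually occur in a column. Here is a minimal sketch on a small made-up frame (the values are illustrative, not NHANES data, and `known` is just the set of codes I expected for the smoking question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the renamed columns (values are made up).
raw = pd.DataFrame({
    "gender": [1, 2, 2, 1, 2],
    "smoking": [1, 2, 7, 2, np.nan],
})

# Count every code, including missing values, before deciding on a mapping.
smoking_codes = raw["smoking"].value_counts(dropna=False)
print(smoking_codes)

# Codes not covered by the intended mapping stand out immediately.
known = {1, 2, 7, 9}
unexpected = set(raw["smoking"].dropna().unique()) - known
print("Unexpected codes:", unexpected)
```

If `unexpected` is not empty, the mapping dictionary needs another entry before you chart anything.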


Step 2 — Cleaning the Data

Drop the ID column and remove nulls

data.drop('id', axis=1, inplace=True)
data.dropna(inplace=True)

This brought us from 5,735 rows down to 5,406 — about 6% lost, which is acceptable.
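To see where the missing values come from before dropping them, a per-column count is a quick check. This is a sketch on a made-up frame (not the real data) that also computes the share of rows lost, the same arithmetic as (5,735 − 5,406) / 5,735 ≈ 5.7% above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the selected columns (values are made up).
toy = pd.DataFrame({
    "weight": [70.0, np.nan, 82.5, 64.0],
    "height": [175.0, 160.0, np.nan, 168.0],
    "bmi": [22.9, np.nan, np.nan, 22.7],
})

# Missing values per column, then the overall share of rows dropna() removes.
missing_per_column = toy.isna().sum()
rows_before = len(toy)
rows_after = len(toy.dropna())
share_lost = (rows_before - rows_after) / rows_before

print(missing_per_column)
print(f"Rows lost: {rows_before - rows_after} of {rows_before} ({share_lost:.1%})")
```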

Remove outliers using the IQR method

The IQR (Interquartile Range) method flags values that fall too far outside the middle 50% of the data:

  • Lower bound = Q1 − 1.5 × IQR
  • Upper bound = Q3 + 1.5 × IQR

# Height outlier removal: only the quartiles are needed to build the bounds
hq25, hq75 = data['height'].quantile([0.25, 0.75])
hiqr = hq75 - hq25
hlower = hq25 - 1.5 * hiqr
hupper = hq75 + 1.5 * hiqr
data = data[(data['height'] >= hlower) & (data['height'] <= hupper)]

I applied this to height, weight, and bmi. After removing outliers the final dataset had 5,171 clean rows.
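Since the same steps repeat for each column, they can be wrapped in a small helper. The function name `remove_iqr_outliers` and the toy data below are my own illustration. The sketch also assumes the filters run one after another, so each column's bounds are computed on the rows that survived the previous step:

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Toy data (made up): 250 cm is an obvious height outlier.
toy = pd.DataFrame({
    "height": [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 250.0],
    "weight": [55.0, 60.0, 68.0, 72.0, 80.0, 85.0, 90.0],
})

cleaned = toy
for col in ["height", "weight"]:
    cleaned = remove_iqr_outliers(cleaned, col)

print(len(toy), "->", len(cleaned))
```

On the real data the loop would run over `['height', 'weight', 'bmi']`.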

Map coded values to labels

data['smoking'] = data['smoking'].replace({1: 'Yes', 2: 'No', 7: np.nan, 9: np.nan})
data['gender']  = data['gender'].replace({1: 'Male', 2: 'Female'})
data['education'] = data['education'].replace({
    1: 'Less than 9th grade', 2: '9th to 12th grade',
    3: 'High school graduate',  4: 'Some college',
    5: 'College graduate',      9: 'Others'
})

Note: codes 7 and 9 in the smoking column mean "Refused" and "Don't know" — I converted these to NaN rather than treating them as valid answers.


Step 3 — Distribution Analysis

Before looking at relationships between variables, I first looked at each variable individually using histograms and boxplots.

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.histplot(data=data, x='age',    kde=True, bins=30, ax=axes[0,0], color='skyblue')
sns.histplot(data=data, x='weight', kde=True, bins=30, ax=axes[0,1], color='lime')
sns.histplot(data=data, x='height', kde=True, bins=30, ax=axes[1,0], color='red')
sns.histplot(data=data, x='bmi',    kde=True, bins=30, ax=axes[1,1], color='orange')
plt.tight_layout()
plt.show()

What I found:

  • age is fairly uniform — the survey was designed to cover all adult age groups
  • bmi and weight are right-skewed — a few very high values pull the mean above the median
  • height is roughly normally distributed

The boxplots before and after outlier removal made it easy to confirm the cleaning worked — the extreme dots beyond the whiskers were gone after applying IQR.


Step 4 — Correlation Analysis

numerical = data.select_dtypes(include='number')
corr = numerical.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()

Key findings from the heatmap:

  • weight and bmi have a very strong positive correlation — expected, since BMI is calculated from weight and height
  • age and bmi have a weak positive correlation — BMI tends to increase slightly with age
  • height and weight show a moderate positive correlation
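The weight and bmi link follows directly from the standard BMI formula: weight in kilograms divided by the square of height in metres. A tiny sketch, with a hypothetical helper name and a made-up person:

```python
def bmi_from(weight_kg, height_cm):
    """Body Mass Index: weight in kg divided by height in metres, squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 175 cm.
print(round(bmi_from(70, 175), 2))
```

Weight sits in the numerator, so at a given height BMI rises in step with it. That is why the two columns move together so closely.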

I also ran pairplots split by gender and by smoking status to see if patterns differed across groups.


Step 5 — Group Comparisons

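I binned the continuous age column into roughly decade-wide bands for group analysis (the first band, 18 to 30, is slightly wider). Note that pd.cut builds right-closed intervals by default, so the first band is (18, 30] and anyone aged exactly 18 falls outside it: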

data['age'] = pd.cut(data['age'], [18, 30, 40, 50, 60, 70, 80])

Then I looked at smoking proportions and BMI by age band and gender:

data.groupby(['age', 'gender']).agg({'smoking': lambda x: np.mean(x == 'Yes')})

Some patterns that emerged:

  • Smoking rates are highest in the 30–40 age band
  • Males smoke at a much higher rate than females across all age groups
  • BMI peaks in the 50–60 age band for both genders
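The smoking rate and the mean BMI per group can be computed in one pass with named aggregations. The sketch below runs on made-up rows (not NHANES data). Unlike the code above, it stores the bands in a separate, hypothetical `age_band` column so `age` stays numeric. It also shows how the default bin edges treat a respondent aged exactly 18:

```python
import pandas as pd

# Toy respondents (values are made up, not NHANES data).
toy = pd.DataFrame({
    "age": [18, 25, 35, 38, 45, 47],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "smoking": ["No", "Yes", "Yes", "No", "Yes", "No"],
    "bmi": [22.0, 24.5, 27.0, 26.0, 29.5, 30.5],
})

# Same bin edges as above, stored in a separate column.
toy["age_band"] = pd.cut(toy["age"], [18, 30, 40, 50, 60, 70, 80])

# Intervals are open on the left, so age 18 gets no band.
print(toy.loc[toy["age"] == 18, "age_band"].isna().all())

# Rows without a band are left out of the groups; observed=True skips empty bands.
summary = (
    toy.groupby(["age_band", "gender"], observed=True)
       .agg(smoking_rate=("smoking", lambda s: (s == "Yes").mean()),
            mean_bmi=("bmi", "mean"))
)
print(summary)
```

If 18-year-olds should be counted, lowering the first edge (for example to 17) is one option. I have not checked how that would change the numbers reported here.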

Step 6 — Hypothesis Testing

This is where the analysis gets interesting. I defined three hypotheses and ran statistical tests on each one.


H01 — Are females aged 40–50 predominantly obese?

from statsmodels.stats.proportion import proportions_ztest

females_40_50 = data[
    (data['gender'] == 'Female') & (data['age'].astype(str) == '(40, 50]')
]
obese = (females_40_50['bmi'] > 30).sum()
n = len(females_40_50)

stat, p = proportions_ztest(obese, n, value=0.5, alternative='larger')
print(f'Obese: {obese}/{n} ({obese/n*100:.1f}%)')
print(f'Z-stat: {stat:.3f}, p-value: {p:.4f}')

Result: 224 out of 469 females aged 40–50 had a BMI above 30, the cutoff I used for obesity (the WHO defines obesity as a BMI of 30 or more). That's 47.8%. The p-value was greater than 0.05, so we fail to reject H₀. Just under half are obese: close, but not a statistically significant majority.
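To make the test less of a black box, here is the same one-sided z-test worked by hand with only the standard library. The helper name is hypothetical. The standard error uses the sample proportion, which is my understanding of the default in proportions_ztest, so treat it as a cross-check and not as a reimplementation:

```python
import math

def one_sided_proportion_z(count, nobs, value):
    """z statistic and upper-tail p-value for H1: proportion > value."""
    p_hat = count / nobs
    # Standard error based on the sample proportion (an assumption, see lead-in).
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - value) / se
    # Upper-tail probability of the standard normal via the error function.
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))
    return z, p_value

# Counts reported in the result above.
z, p = one_sided_proportion_z(224, 469, 0.5)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Because the observed proportion is below 0.5, the z statistic is negative and the upper-tail p-value lands well above 0.05. That is why the test cannot support "more than half".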


H02 — Do males and females smoke at different rates?

from scipy.stats import chi2_contingency

smokers_only = data[data['smoking'].isin(['Yes', 'No'])]
ct = pd.crosstab(smokers_only['gender'], smokers_only['smoking'])
chi2, p, dof, expected = chi2_contingency(ct)
print(f'Chi-square: {chi2:.3f}, p-value: {p:.4f}')

Result: Male smoking rate was 53.3%, female was 31.2%. The p-value was well below 0.05 — we reject H₀. The difference in smoking rates between males and females is statistically significant.
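The per-gender rates quoted here can be read straight off a row-normalized crosstab. A small sketch on made-up data (the counts below are illustrative only):

```python
import pandas as pd

# Toy data (made up): 4 males with 3 smokers, 4 females with 1 smoker.
toy = pd.DataFrame({
    "gender": ["Male"] * 4 + ["Female"] * 4,
    "smoking": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# normalize="index" turns each row of counts into proportions that sum to 1.
rates = pd.crosstab(toy["gender"], toy["smoking"], normalize="index")
print(rates["Yes"])
```

The chi-square test itself should still be run on the raw counts, as in the code above. The normalized table is only for reporting.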


H03 — Is BMI significantly different between males and females?

from scipy.stats import ttest_ind

male_bmi   = data[data['gender'] == 'Male']['bmi'].dropna()
female_bmi = data[data['gender'] == 'Female']['bmi'].dropna()

stat, p = ttest_ind(male_bmi, female_bmi)
print(f'Male BMI mean: {male_bmi.mean():.2f}')
print(f'Female BMI mean: {female_bmi.mean():.2f}')
print(f'T-stat: {stat:.3f}, p-value: {p:.4f}')

Result: Male mean BMI was 28.21, female was 29.09. The p-value was below 0.05 — we reject H₀. The difference, while small, is statistically significant.


Summary of findings

| Hypothesis | Test used | Result |
|---|---|---|
| Females 40–50 are predominantly obese | Proportion z-test | Fail to reject H₀ (47.8%, not a majority) |
| Smoking rates differ by gender | Chi-square | Reject H₀ (53.3% male vs 31.2% female) |
| BMI differs by gender | Independent t-test | Reject H₀ (28.21 male vs 29.09 female) |

What I learned

A few things that stood out doing this project:

Numeric codes will catch you off guard. NHANES stores everything as numbers. If you forget to map them, your heatmap will show a correlation between "gender" and "BMI" that is actually just the correlation between the numbers 1 and 2 and BMI values — meaningless.

Mean vs median gap is your quickest signal. BMI mean was 29.1 but median was 27.8. That 1.3 gap immediately told me there were high-end outliers pulling the average up before I even plotted anything.
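The same check takes a couple of lines in pandas. The values below are made up to mimic a right-skewed column and are not the survey data:

```python
import pandas as pd

# Toy right-skewed values (made up): one large value drags the mean upward.
values = pd.Series([22.0, 24.0, 25.0, 26.0, 27.0, 28.0, 45.0])

mean, median = values.mean(), values.median()
print(f"mean={mean:.2f}, median={median:.2f}, gap={mean - median:.2f}")

# Positive sample skewness points the same way as mean > median.
print(f"skew={values.skew():.2f}")
```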

Statistical significance ≠ practical significance. H03 rejected the null — male and female BMI are statistically different. But the actual difference is less than 1 BMI point. Significant in the mathematical sense, but probably not meaningful in a clinical one.
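One common way to put a number on practical significance is a standardized effect size such as Cohen's d. This is not something I computed in the original analysis. The sketch below uses synthetic groups whose means differ by about 0.9, with an assumed standard deviation of 6, purely to show the calculation:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic groups (not NHANES data): means about 0.9 apart, SD of 6 assumed.
rng = np.random.default_rng(0)
group_a = rng.normal(29.1, 6.0, size=2000)
group_b = rng.normal(28.2, 6.0, size=2000)

print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")
```

By the usual rule of thumb, a d of around 0.2 counts as a small effect. With thousands of rows, even a small effect can produce a p-value below 0.05.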


Tools used

  • Python 3
  • pandas, numpy
  • matplotlib, seaborn
  • scipy, statsmodels



Thanks for reading! If you have questions about any of the steps or want to suggest improvements, drop them in the comments.
