EricMWaimiri

Posted on Jun 28

How I Explored a US Health Dataset with Python — EDA + Hypothesis Testing

#data #datascience #python #tutorial

I recently completed an exploratory data analysis project on the NHANES (National Health and Nutrition Examination Survey) dataset from Kaggle. It's a real-world health survey collected by the CDC covering body measurements, lifestyle habits, and demographic data from thousands of US adults.

In this article I'll walk you through exactly what I did — from loading and cleaning the data all the way to running statistical tests — and share what I found along the way.

The Dataset

The dataset has 5,735 rows and 28 columns, but for this project I focused on 8 of them: a respondent ID (dropped during cleaning) plus the 7 columns below, which were relevant to the questions I wanted to answer:

Column	Description
`smoking`	Has the person smoked at least 100 cigarettes?
`gender`	Male or Female
`age`	Age in years
`education`	Highest level of education
`weight`	Weight in kg
`height`	Height in cm
`bmi`	Body Mass Index

Step 1 — Loading and Selecting Columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

db = pd.read_csv('NHANES.csv')

# Select columns with a list of labels (a tuple can be read as a single MultiIndex key)
data = db.loc[:, ['SEQN', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
                   'DMDEDUC2', 'BMXWT', 'BMXHT', 'BMXBMI']]

data = data.rename(columns={
    'SEQN': 'id', 'SMQ020': 'smoking', 'RIAGENDR': 'gender',
    'RIDAGEYR': 'age', 'DMDEDUC2': 'education',
    'BMXWT': 'weight', 'BMXHT': 'height', 'BMXBMI': 'bmi'
})

One thing worth knowing about NHANES: the categorical columns come in as numeric codes. For gender, 1 means Male and 2 means Female. For smoking, 1 means the person has smoked at least 100 cigarettes and 2 means they haven't. You have to map these to readable labels before doing any analysis; otherwise the codes get treated as quantities and your chart labels are unreadable.
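Before writing a mapping, it helps to list which codes actually occur in each column. The snippet below is a minimal sketch on a made-up frame (the values are illustrative, not taken from the real file); only the column names `SMQ020` and `RIAGENDR` come from the dataset.

```python
import pandas as pd

# Toy stand-in for two raw NHANES columns (values invented for illustration).
raw = pd.DataFrame({
    "SMQ020":   [1, 2, 2, 7, 1, 9],
    "RIAGENDR": [1, 2, 2, 1, 2, 1],
})

# Count every distinct code, including missing values, before deciding on a mapping.
code_counts = {
    col: raw[col].value_counts(dropna=False).sort_index()
    for col in raw.columns
}

for col, counts in code_counts.items():
    print(col)
    print(counts, "\n")
```

Any code that shows up here but is missing from your mapping dictionary will pass through `replace` unchanged, so this check is a cheap way to spot special values such as 7 or 9.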

Step 2 — Cleaning the Data

Drop the ID column and remove nulls

data.drop('id', axis=1, inplace=True)
data.dropna(inplace=True)

This brought us from 5,735 rows down to 5,406 — about 6% lost, which is acceptable.
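To see where the missing values sit and how much of the data a blanket `dropna` costs, something like the following can be run first. This is a sketch on a toy frame; the last lines simply redo the arithmetic with the row counts quoted above.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values invented for illustration).
toy = pd.DataFrame({
    "bmi":    [22.5, np.nan, 31.0, 27.4],
    "height": [170.0, 165.0, np.nan, 180.0],
    "weight": [65.0, 70.0, 90.0, 88.0],
})

# Missing values per column, then the share of rows lost by dropping any row with a gap.
missing_per_column = toy.isna().sum()
rows_before = len(toy)
rows_after = len(toy.dropna())
pct_lost = 100 * (rows_before - rows_after) / rows_before
print(missing_per_column)
print(f"toy data: {pct_lost:.1f}% of rows lost")

# Same arithmetic with the row counts reported in this article.
article_pct_lost = 100 * (5735 - 5406) / 5735
print(f"article data: {article_pct_lost:.1f}% of rows lost")
```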

Remove outliers using the IQR method

The IQR (Interquartile Range) method flags values that fall too far outside the middle 50% of the data:

Lower bound = Q1 − 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR

# Height outlier removal
hq25, hq75 = data['height'].quantile([0.25, 0.75])  # the median is not needed for the bounds
hiqr = hq75 - hq25
hlower = hq25 - 1.5 * hiqr
hupper = hq75 + 1.5 * hiqr
data = data[(data['height'] >= hlower) & (data['height'] <= hupper)]

I applied this to height, weight, and bmi. After removing outliers the final dataset had 5,171 clean rows.
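Since the same steps repeat for each column, they can be wrapped in a small helper. The function name `remove_iqr_outliers` and the toy values are my own for this sketch, and it assumes the filters are applied one after another, so each column's bounds are computed on rows that survived the previous filter.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends, matching the >= / <= filter above.
    return df[df[column].between(lower, upper)]

# Toy data with one implausible height (values invented for illustration).
toy = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 250],
    "weight": [50, 60, 65, 70, 75, 80, 85],
})

cleaned = toy
for col in ["height", "weight"]:
    cleaned = remove_iqr_outliers(cleaned, col)

print(len(toy), "->", len(cleaned), "rows")
```

Computing all bounds on the original data and filtering once is an equally valid choice; the two approaches can give slightly different row counts.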

Map coded values to labels

data['smoking'] = data['smoking'].replace({1: 'Yes', 2: 'No', 7: np.nan, 9: np.nan})
data['gender']  = data['gender'].replace({1: 'Male', 2: 'Female'})
data['education'] = data['education'].replace({
    1: 'Less than 9th grade', 2: '9th to 12th grade',
    3: 'High school graduate',  4: 'Some college',
    5: 'College graduate',      9: 'Others'
})

Note: codes 7 and 9 in the smoking column mean "Refused" and "Don't know" — I converted these to NaN rather than treating them as valid answers.

Step 3 — Distribution Analysis

Before looking at relationships between variables, I first looked at each variable individually using histograms and boxplots.

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.histplot(data=data, x='age',    kde=True, bins=30, ax=axes[0,0], color='skyblue')
sns.histplot(data=data, x='weight', kde=True, bins=30, ax=axes[0,1], color='lime')
sns.histplot(data=data, x='height', kde=True, bins=30, ax=axes[1,0], color='red')
sns.histplot(data=data, x='bmi',    kde=True, bins=30, ax=axes[1,1], color='orange')
plt.tight_layout()
plt.show()

What I found:

age is fairly uniform — the survey was designed to cover all adult age groups
bmi and weight are right-skewed — a few very high values pull the mean above the median
height is roughly normally distributed
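The skew visible in the histograms can also be checked numerically. Here is a minimal sketch on invented BMI values: a positive mean-minus-median gap and a positive skewness both point to a right tail.

```python
import pandas as pd

# Toy BMI values with one high value in the right tail (invented for illustration).
toy_bmi = pd.Series([21.0, 23.5, 24.0, 26.0, 27.5, 29.0, 45.0], name="bmi")

# A mean above the median suggests a right-skewed distribution.
mean_minus_median = toy_bmi.mean() - toy_bmi.median()

# Sample skewness: > 0 for a right tail, near 0 for a symmetric distribution.
skewness = toy_bmi.skew()

print(f"mean - median = {mean_minus_median:.2f}, skewness = {skewness:.2f}")
```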

The boxplots before and after outlier removal made it easy to confirm the cleaning worked — the extreme dots beyond the whiskers were gone after applying IQR.
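The before-and-after comparison can be drawn side by side. This is a sketch on toy heights, not the original plotting code; the `Agg` backend line is only there so it runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy heights with one extreme value (invented for illustration).
before = pd.Series([150, 160, 165, 170, 175, 180, 250], name="height")

# Same IQR rule as in Step 2.
q1, q3 = before.quantile([0.25, 0.75])
iqr = q3 - q1
after = before[before.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Shared y-axis makes the removed points easy to spot.
fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
axes[0].boxplot(before.to_numpy())
axes[0].set_title("Before IQR filter")
axes[1].boxplot(after.to_numpy())
axes[1].set_title("After IQR filter")
fig.tight_layout()
```

One caveat: the whiskers in the second plot are recomputed from the filtered data, so on real data a few points can still land beyond them even after a correct filter.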

Step 4 — Correlation Analysis

numerical = data.select_dtypes(include='number')
corr = numerical.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()

Key findings from the heatmap:

weight and bmi have a very strong positive correlation — expected, since BMI is calculated from weight
age and bmi have a weak positive correlation — BMI tends to increase slightly with age
height and weight show a moderate positive correlation
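The weight and BMI link follows directly from the standard formula, BMI = weight in kg divided by the square of height in metres. The sketch below recomputes BMI on invented values and shows the resulting correlation; it illustrates the formula and does not reproduce the article's heatmap.

```python
import pandas as pd

# Toy measurements (invented for illustration).
toy = pd.DataFrame({
    "weight": [60.0, 72.0, 85.0, 95.0, 110.0],     # kg
    "height": [165.0, 170.0, 175.0, 180.0, 172.0],  # cm
})

# BMI = weight (kg) / height (m)^2; height is stored in cm, so divide by 100 first.
toy["bmi"] = toy["weight"] / (toy["height"] / 100) ** 2

# Pearson correlation matrix, the same default that DataFrame.corr() uses above.
corr = toy.corr()
print(corr.round(2))
```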

I also ran pairplots split by gender and by smoking status to see if patterns differed across groups.

Step 5 — Group Comparisons

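I binned the continuous age column into roughly decade-wide bands for group analysis (the first band spans 18 to 30). Note that pd.cut builds right-closed intervals by default, so (40, 50] covers ages 41 through 50, and an age of exactly 18 would fall outside the first bin and become NaN. The binned labels also overwrite the numeric age column: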

data['age'] = pd.cut(data['age'], [18, 30, 40, 50, 60, 70, 80])

Then looked at smoking proportions and BMI by age band and gender:

data.groupby(['age', 'gender']).agg({'smoking': lambda x: np.mean(x == 'Yes')})

Some patterns that emerged:

Smoking rates are highest in the 30–40 age band
Males smoke at a much higher rate than females across all age groups
BMI peaks in the 50–60 age band for both genders
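Smoking rate and mean BMI can be computed in a single grouped summary. This is a sketch on invented rows with two assumptions of mine: the bands go into a separate `age_band` column (a hypothetical name) so the numeric age stays available, and rows with a missing smoking answer are dropped so they are not silently counted as "No".

```python
import numpy as np
import pandas as pd

# Toy rows (invented for illustration); one smoking answer is missing.
toy = pd.DataFrame({
    "age":     [25, 28, 35, 38, 45, 48],
    "gender":  ["Male", "Female", "Male", "Female", "Male", "Female"],
    "smoking": ["Yes", "No", "Yes", np.nan, "No", "Yes"],
    "bmi":     [24.0, 26.0, 28.0, 27.0, 30.0, 31.0],
})
toy["age_band"] = pd.cut(toy["age"], [18, 30, 40, 50])

summary = (
    toy.dropna(subset=["smoking"])
       .groupby(["age_band", "gender"], observed=True)
       .agg(
           smoking_rate=("smoking", lambda s: (s == "Yes").mean()),
           mean_bmi=("bmi", "mean"),
           n=("bmi", "size"),  # group size, useful for judging how stable each rate is
       )
)
print(summary)
```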

Step 6 — Hypothesis Testing

This is where the analysis gets interesting. I defined three hypotheses and ran statistical tests on each one.

H01 — Are females aged 40–50 predominantly obese?

from statsmodels.stats.proportion import proportions_ztest

females_40_50 = data[
    (data['gender'] == 'Female') & (data['age'].astype(str) == '(40, 50]')
]
obese = (females_40_50['bmi'] > 30).sum()
n = len(females_40_50)

stat, p = proportions_ztest(obese, n, value=0.5, alternative='larger')
print(f'Obese: {obese}/{n} ({obese/n*100:.1f}%)')
print(f'Z-stat: {stat:.3f}, p-value: {p:.4f}')

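Result: 224 out of 469 females aged 40–50 (the (40, 50] band) had BMI > 30, which is 47.8%. WHO defines obesity as BMI ≥ 30, so the strict > in the code leaves out anyone at exactly 30. The p-value was greater than 0.05, so we fail to reject H₀. Because the sample proportion is itself below 50%, a one-sided test for "more than half" cannot reject; the data give no evidence that a majority of this group is obese.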
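The z-statistic can be reproduced by hand from the two counts above. This sketch uses only the standard library and assumes the variance is estimated from the sample proportion, which, as far as I know, is what `proportions_ztest` does by default.

```python
import math

# Counts reported above; p0 is the proportion under H0.
obese, n, p0 = 224, 469, 0.5

p_hat = obese / n                          # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error from the sample proportion
z = (p_hat - p0) / se                      # test statistic

# One-sided p-value for alternative='larger': P(Z >= z) under the standard normal.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, p-value = {p_value:.3f}")
```

A negative z with a "larger" alternative always gives a p-value above 0.5, which is why this test has no chance of rejecting here.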

H02 — Do males and females smoke at different rates?

from scipy.stats import chi2_contingency

smokers_only = data[data['smoking'].isin(['Yes', 'No'])]
ct = pd.crosstab(smokers_only['gender'], smokers_only['smoking'])
chi2, p, dof, expected = chi2_contingency(ct)
print(f'Chi-square: {chi2:.3f}, p-value: {p:.4f}')

Result: Male smoking rate was 53.3%, female was 31.2%. The p-value was well below 0.05 — we reject H₀. The difference in smoking rates between males and females is statistically significant.
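The per-gender rates quoted in the result can be read from a row-normalized crosstab. This is a minimal sketch on invented rows, so the numbers will not match the article's.

```python
import pandas as pd

# Toy answers (invented for illustration).
toy = pd.DataFrame({
    "gender":  ["Male"] * 4 + ["Female"] * 4,
    "smoking": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
})

# Raw counts are what the chi-square test needs.
counts = pd.crosstab(toy["gender"], toy["smoking"])

# normalize='index' turns each row into proportions that sum to 1.
rates = pd.crosstab(toy["gender"], toy["smoking"], normalize="index")

print(counts)
print(rates)
```

Pass the raw counts, not the proportions, to `chi2_contingency`; the test depends on sample size.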

H03 — Is BMI significantly different between males and females?

from scipy.stats import ttest_ind

male_bmi   = data[data['gender'] == 'Male']['bmi'].dropna()
female_bmi = data[data['gender'] == 'Female']['bmi'].dropna()

stat, p = ttest_ind(male_bmi, female_bmi)
print(f'Male BMI mean: {male_bmi.mean():.2f}')
print(f'Female BMI mean: {female_bmi.mean():.2f}')
print(f'T-stat: {stat:.3f}, p-value: {p:.4f}')

Result: Male mean BMI was 28.21, female was 29.09. The p-value was below 0.05 — we reject H₀. The difference, while small, is statistically significant.

Summary of findings

Hypothesis	Test used	Result
Females 40–50 are predominantly obese	Proportion z-test	Fail to reject H₀ (47.8%, not a majority)
Smoking rates differ by gender	Chi-square	Reject H₀ (53.3% male vs 31.2% female)
BMI differs by gender	Independent t-test	Reject H₀ (28.21 male vs 29.09 female)

What I learned

A few things that stood out doing this project:

Numeric codes will catch you off guard. NHANES stores categorical answers as numbers. If you forget to map them, your heatmap will treat the codes as quantities: a "Don't know" code of 9 in smoking or education counts as a larger value than 1 or 2, and the resulting correlation is meaningless.

Mean vs median gap is your quickest signal. BMI mean was 29.1 but median was 27.8. That 1.3 gap immediately told me there were high-end outliers pulling the average up before I even plotted anything.

Statistical significance ≠ practical significance. H03 rejected the null — male and female BMI are statistically different. But the actual difference is less than 1 BMI point. Significant in the mathematical sense, but probably not meaningful in a clinical one.
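One way to put a number on practical significance is a standardized effect size such as Cohen's d, which I did not compute in the project. The sketch below defines a hypothetical helper `cohens_d` and runs it on simulated data; the group means echo the article, but the standard deviation of 6.0 and the group sizes are assumptions made only for illustration.

```python
import numpy as np

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Simulated BMI samples: the SD (6.0) and sizes (2500) are assumed, not from the dataset.
rng = np.random.default_rng(0)
group_a = rng.normal(28.2, 6.0, 2500)
group_b = rng.normal(29.1, 6.0, 2500)

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, an absolute d around 0.2 counts as a small effect, so a difference of under one BMI point with a spread like this is small even when the p-value is tiny.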

Tools used

Python 3
pandas, numpy
matplotlib, seaborn
scipy, statsmodels
