I recently completed an exploratory data analysis project on the NHANES (National Health and Nutrition Examination Survey) dataset from Kaggle. It's a real-world health survey collected by the CDC covering body measurements, lifestyle habits, and demographic data from thousands of US adults.
In this article I'll walk you through exactly what I did — from loading and cleaning the data all the way to running statistical tests — and share what I found along the way.
The Dataset
The dataset has 5,735 rows and 28 columns, but for this project I focused on 8 columns that were relevant to the questions I wanted to answer:
| Column | Description |
|---|---|
| smoking | Has the person smoked at least 100 cigarettes? |
| gender | Male or Female |
| age | Age in years |
| education | Highest level of education |
| weight | Weight in kg |
| height | Height in cm |
| bmi | Body Mass Index |
Step 1 — Loading and Selecting Columns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
db = pd.read_csv('NHANES.csv')
# Select columns with a list of labels (a tuple can be misread as a single MultiIndex key)
data = db.loc[:, ['SEQN', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
                  'DMDEDUC2', 'BMXWT', 'BMXHT', 'BMXBMI']]
data = data.rename(columns={
'SEQN': 'id', 'SMQ020': 'smoking', 'RIAGENDR': 'gender',
'RIDAGEYR': 'age', 'DMDEDUC2': 'education',
'BMXWT': 'weight', 'BMXHT': 'height', 'BMXBMI': 'bmi'
})
One thing worth knowing about NHANES: the categorical columns come in as numeric codes. For gender, 1 means Male and 2 means Female. For smoking, 1 means the person has smoked at least 100 cigarettes and 2 means they haven't. You have to map these to readable labels before doing any analysis, otherwise your charts are meaningless.
Step 2 — Cleaning the Data
Drop the ID column and remove nulls
data.drop('id', axis=1, inplace=True)
data.dropna(inplace=True)
This brought us from 5,735 rows down to 5,406 — about 6% lost, which is acceptable.
Remove outliers using the IQR method
The IQR (Interquartile Range) method flags values that fall too far outside the middle 50% of the data:
- Lower bound = Q1 − 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
# Height outlier removal
hq25, hq50, hq75 = data['height'].quantile([0.25, 0.5, 0.75])
hiqr = hq75 - hq25
hlower = hq25 - 1.5 * hiqr
hupper = hq75 + 1.5 * hiqr
data = data[(data['height'] >= hlower) & (data['height'] <= hupper)]
I applied this to height, weight, and bmi. After removing outliers the final dataset had 5,171 clean rows.
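Repeating the same five lines for each column gets tedious, so here is a minimal sketch of how the IQR rule could be wrapped in a helper. The function name `remove_iqr_outliers` and the toy values are my own illustration, and the sketch assumes the columns are filtered one after another, with bounds recomputed on the already-filtered rows. The original notebook may have done it differently.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], one column at a time."""
    out = df
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Bounds are computed on the rows that survived the previous columns
        out = out[out[col].between(lower, upper)]
    return out

# Made-up values, not NHANES data
toy = pd.DataFrame({
    'height': [150, 160, 165, 170, 175, 180, 250],
    'weight': [50, 60, 65, 70, 75, 80, 85],
})
cleaned = remove_iqr_outliers(toy, ['height', 'weight'])
print(len(toy), '->', len(cleaned))
```

On the real data the call would look something like `remove_iqr_outliers(data, ['height', 'weight', 'bmi'])`. Filtering sequentially versus computing all bounds up front can give slightly different row counts, so it is worth picking one and sticking with it.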
Map coded values to labels
data['smoking'] = data['smoking'].replace({1: 'Yes', 2: 'No', 7: np.nan, 9: np.nan})
data['gender'] = data['gender'].replace({1: 'Male', 2: 'Female'})
data['education'] = data['education'].replace({
1: 'Less than 9th grade', 2: '9th to 12th grade',
3: 'High school graduate', 4: 'Some college',
5: 'College graduate', 9: 'Others'
})
Note: codes 7 and 9 in the smoking column mean "Refused" and "Don't know" — I converted these to NaN rather than treating them as valid answers.
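One side effect to keep in mind: because this mapping runs after `dropna`, it puts missing values back into the smoking column. A small sketch on made-up data shows the effect and one way to handle it before computing rates (the `answered` variable is just an illustrative name):

```python
import numpy as np
import pandas as pd

# Made-up codes, not NHANES data
toy = pd.DataFrame({'smoking': [1, 2, 7, 9, 1, 2]})
# 7 = "Refused", 9 = "Don't know" in the smoking column
toy['smoking'] = toy['smoking'].replace({1: 'Yes', 2: 'No', 7: np.nan, 9: np.nan})

# Missing values reintroduced by the mapping
print(toy['smoking'].isna().sum())

# Keep only valid Yes/No answers for any smoking analysis
answered = toy.dropna(subset=['smoking'])
print(answered['smoking'].value_counts().to_dict())
```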
Step 3 — Distribution Analysis
Before looking at relationships between variables, I first looked at each variable individually using histograms and boxplots.
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.histplot(data=data, x='age', kde=True, bins=30, ax=axes[0,0], color='skyblue')
sns.histplot(data=data, x='weight', kde=True, bins=30, ax=axes[0,1], color='lime')
sns.histplot(data=data, x='height', kde=True, bins=30, ax=axes[1,0], color='red')
sns.histplot(data=data, x='bmi', kde=True, bins=30, ax=axes[1,1], color='orange')
plt.tight_layout()
plt.show()
What I found:
- age is fairly uniform: the survey was designed to cover all adult age groups
- bmi and weight are right-skewed: a few very high values pull the mean above the median
- height is roughly normally distributed
The boxplots before and after outlier removal made it easy to confirm the cleaning worked — the extreme dots beyond the whiskers were gone after applying IQR.
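The skew described above can also be checked numerically rather than by eye. This is a rough sketch using made-up values; `skew_summary` is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd

def skew_summary(series):
    """Return mean, median, and sample skewness for a numeric Series."""
    return {
        'mean': series.mean(),
        'median': series.median(),
        'skew': series.skew(),  # a positive value suggests a right tail
    }

# Made-up BMI-like values with a couple of large ones
toy_bmi = pd.Series([22, 23, 24, 25, 26, 27, 28, 35, 45])
summary = skew_summary(toy_bmi)
print(summary)
```

A mean above the median together with a positive skew is consistent with a right-skewed histogram. On the real data, `skew_summary(data['bmi'])` would give the same kind of summary.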
Step 4 — Correlation Analysis
numerical = data.select_dtypes(include='number')
corr = numerical.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
Key findings from the heatmap:
- weight and bmi have a very strong positive correlation, which is expected since BMI is calculated from weight
- age and bmi have a weak positive correlation: BMI tends to increase slightly with age
- height and weight show a moderate positive correlation
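To see why the weight and BMI correlation is almost built in, here is a small sketch using the standard BMI formula (weight in kg divided by height in metres squared). The heights and weights are simulated and, as a simplifying assumption, drawn independently of each other, which is not true of real people.

```python
import numpy as np

rng = np.random.default_rng(0)
height_cm = rng.normal(168, 9, size=500)   # simulated heights
weight_kg = rng.normal(80, 15, size=500)   # simulated weights, independent of height here
bmi = weight_kg / (height_cm / 100) ** 2   # BMI = kg / m^2

corr = np.corrcoef(weight_kg, bmi)[0, 1]
print(f'corr(weight, bmi) = {corr:.2f}')
```

Even with no real relationship between height and weight, BMI tracks weight closely because weight sits in the numerator. So a strong correlation in the heatmap says more about the formula than about the population.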
I also ran pairplots split by gender and by smoking status to see if patterns differed across groups.
Step 5 — Group Comparisons
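I binned the continuous age column into bands (18–30, then one per decade up to 80) for group analysis:
# Bins are right-closed, so (18, 30] leaves anyone aged exactly 18 unbinned (NaN);
# pass include_lowest=True to keep them. This also overwrites the numeric age column.
data['age'] = pd.cut(data['age'], [18, 30, 40, 50, 60, 70, 80])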
Then looked at smoking proportions and BMI by age band and gender:
data.groupby(['age', 'gender']).agg({'smoking': lambda x: np.mean(x == 'Yes')})
Some patterns that emerged:
- Smoking rates are highest in the 30–40 age band
- Males smoke at a much higher rate than females across all age groups
- BMI peaks in the 50–60 age band for both genders
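The binning and grouping steps can be sketched on a toy frame. Everything here is made up for illustration, and the `age_band` column name is my own choice so that the numeric age stays available.

```python
import pandas as pd

# Made-up rows, not NHANES data
toy = pd.DataFrame({
    'age':     [18, 25, 35, 35, 45, 45, 55, 80],
    'gender':  ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'smoking': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No'],
})
bins = [18, 30, 40, 50, 60, 70, 80]

# Default intervals are right-closed: (18, 30], (30, 40], ...
default_bands = pd.cut(toy['age'], bins)
print(default_bands.isna().sum())  # counts ages that fall outside every interval

# include_lowest=True pulls the left edge (18) into the first band
toy['age_band'] = pd.cut(toy['age'], bins, include_lowest=True)

# Share of 'Yes' answers per band and gender
rates = (
    toy.assign(smoker=toy['smoking'].eq('Yes'))
       .groupby(['age_band', 'gender'], observed=True)['smoker']
       .mean()
)
print(rates)
```

Writing the bands to a separate column is a design choice: it keeps the original ages around in case you want to rerun the correlation step later.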
Step 6 — Hypothesis Testing
This is where the analysis gets interesting. I defined three hypotheses and ran statistical tests on each one.
H01 — Are females aged 40–50 predominantly obese?
from statsmodels.stats.proportion import proportions_ztest
females_40_50 = data[
(data['gender'] == 'Female') & (data['age'].astype(str) == '(40, 50]')
]
obese = (females_40_50['bmi'] > 30).sum()
n = len(females_40_50)
stat, p = proportions_ztest(obese, n, value=0.5, alternative='larger')
print(f'Obese: {obese}/{n} ({obese/n*100:.1f}%)')
print(f'Z-stat: {stat:.3f}, p-value: {p:.4f}')
Result: 224 out of 469 females aged 40–50 had a BMI above 30, the obesity cutoff used here (WHO defines obesity as a BMI of 30 or higher). That's 47.8%. The p-value was greater than 0.05, so we fail to reject H₀. Just under half are obese: close, but not a statistically significant majority.
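For anyone curious what the z-test does under the hood, here is a hand-rolled sketch using only the standard library. It assumes the variance is estimated from the sample proportion, which as far as I understand is what `proportions_ztest` does by default. The function name `one_sample_prop_ztest` is hypothetical.

```python
from math import erf, sqrt

def one_sample_prop_ztest(count, nobs, value):
    """One-sided ('larger') z-test for a proportion, variance from the sample proportion."""
    p_hat = count / nobs
    se = sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - value) / se
    # Upper-tail p-value: P(Z >= z) under the standard normal
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Counts reported above: 224 obese out of 469
z, p = one_sample_prop_ztest(224, 469, 0.5)
print(f'z = {z:.3f}, p = {p:.3f}')
```

Because the observed share is below 0.5, the z statistic is negative and the upper-tail p-value lands well above 0.05, in line with the conclusion above.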
H02 — Do males and females smoke at different rates?
from scipy.stats import chi2_contingency
smokers_only = data[data['smoking'].isin(['Yes', 'No'])]
ct = pd.crosstab(smokers_only['gender'], smokers_only['smoking'])
chi2, p, dof, expected = chi2_contingency(ct)
print(f'Chi-square: {chi2:.3f}, p-value: {p:.4f}')
Result: Male smoking rate was 53.3%, female was 31.2%. The p-value was well below 0.05 — we reject H₀. The difference in smoking rates between males and females is statistically significant.
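The chi-square statistic itself is simple enough to compute by hand: compare each observed count with the count expected if gender and smoking were independent. The sketch below uses hypothetical counts, not the NHANES crosstab, and skips the continuity correction that `chi2_contingency` applies by default to 2×2 tables, so SciPy would report a slightly smaller statistic for the same table.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a table of counts, no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = gender, columns = (non-smoker, smoker)
toy_table = [[60, 40],   # Female
             [45, 55]]   # Male
stat = chi_square_2x2(toy_table)
print(f'chi-square = {stat:.3f}')
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05.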
H03 — Is BMI significantly different between males and females?
from scipy.stats import ttest_ind
male_bmi = data[data['gender'] == 'Male']['bmi'].dropna()
female_bmi = data[data['gender'] == 'Female']['bmi'].dropna()
stat, p = ttest_ind(male_bmi, female_bmi)
print(f'Male BMI mean: {male_bmi.mean():.2f}')
print(f'Female BMI mean: {female_bmi.mean():.2f}')
print(f'T-stat: {stat:.3f}, p-value: {p:.4f}')
Result: Male mean BMI was 28.21, female was 29.09. The p-value was below 0.05 — we reject H₀. The difference, while small, is statistically significant.
Summary of findings
| Hypothesis | Test used | Result |
|---|---|---|
| Females 40–50 are predominantly obese | Proportion z-test | Fail to reject H₀ (47.8%, not a majority) |
| Smoking rates differ by gender | Chi-square | Reject H₀ (53.3% male vs 31.2% female) |
| BMI differs by gender | Independent t-test | Reject H₀ (28.21 male vs 29.09 female) |
What I learned
A few things that stood out doing this project:
Numeric codes will catch you off guard. NHANES stores everything as numbers. If you forget to map them, your heatmap will show a correlation between "gender" and "BMI" that is actually just the correlation between the numbers 1 and 2 and BMI values — meaningless.
Mean vs median gap is your quickest signal. BMI mean was 29.1 but median was 27.8. That 1.3 gap immediately told me there were high-end outliers pulling the average up before I even plotted anything.
Statistical significance ≠ practical significance. H03 rejected the null — male and female BMI are statistically different. But the actual difference is less than 1 BMI point. Significant in the mathematical sense, but probably not meaningful in a clinical one.
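One way to put a number on "practical significance" is an effect size such as Cohen's d, which I did not compute in the project. It is shown here only as a possible follow-up. The sketch uses made-up BMI values and a pooled standard deviation; `cohens_d` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Made-up BMI values, not NHANES data
group_a = [24.0, 27.5, 29.0, 31.0, 35.5]
group_b = [23.5, 26.5, 28.5, 30.0, 34.0]
d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values around 0.2 count as a small effect. With thousands of rows, even a small effect like that can produce a tiny p-value.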
Tools used
- Python 3
- pandas, numpy
- matplotlib, seaborn
- scipy, statsmodels
Thanks for reading! If you have questions about any of the steps or want to suggest improvements, drop them in the comments.