What 3.9M powerlifting records tell us about competition strategy — an EDA with Python

#python #datascience #data #sportscience

When I started this EDA project for my Data Science Master at Evolve, I picked the Open Powerlifting dataset because, beyond being a gym rat, I've always been curious about competition strategy in powerlifting.

The dataset

Open Powerlifting is an open-source project that tracks powerlifting competition results worldwide. The full dataset has ~3.9M rows and 42 columns covering athlete info, every single lift attempt, and performance metrics.

Before any analysis I filtered it down to sanctioned, drug-tested competitions only and kept only the columns I'd actually use. The main challenge: negative values mean a failed lift, not bad data. That required building boolean columns to track success/failure before converting negatives to NaN.
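
To make the ordering concrete, here is a minimal sketch of flagging attempts before the negatives are blanked out. It runs on a toy frame; the column names (Bench1Kg and so on) follow the Open Powerlifting naming convention as I understand it, and the `_attempted` / `_success` suffixes are hypothetical names chosen for illustration, not necessarily the ones used in the project.

```python
import numpy as np
import pandas as pd

# Toy data: positive = good lift, negative = failed lift, NaN = attempt not taken.
df = pd.DataFrame({
    "Bench1Kg": [100.0, -105.0, np.nan],
    "Bench2Kg": [105.0, 105.0, np.nan],
    "Bench3Kg": [-110.0, 110.0, np.nan],
})

attempt_cols = ["Bench1Kg", "Bench2Kg", "Bench3Kg"]

for col in attempt_cols:
    # 1) Build the flags while the sign information still exists.
    df[f"{col}_attempted"] = df[col].notna()
    df[f"{col}_success"] = df[col] > 0  # NaN > 0 evaluates to False

    # 2) Only now turn failed (negative) attempts into NaN.
    df[col] = df[col].where(df[col] > 0)
```

If step 2 ran first, every failed attempt would already be NaN and could no longer be told apart from an attempt that was never taken.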

The process

The project is fully modularized in Python using pandas, numpy, seaborn, matplotlib and pingouin. The pipeline runs end-to-end from main.py:

raw CSV → filter → clean → features → assert → analyze
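
As a rough illustration of how such a chain can be wired together, the sketch below strings a few stages with DataFrame.pipe. Every function name here is hypothetical, the "Tested" and "Sanctioned" column names are my assumption about the dataset, and the stages are deliberately tiny stand-ins rather than the project's real modules. The analyze step is left out.

```python
import pandas as pd

def filter_meets(df):
    # Keep only sanctioned, drug-tested meets (assumed column names and values).
    mask = (df["Tested"] == "Yes") & (df["Sanctioned"] == "Yes")
    return df[mask].copy()

def clean(df):
    # Keep only the columns the later stages use.
    return df[["Sex", "BodyweightKg", "TotalKg"]].copy()

def add_features(df):
    # Example feature: total normalized by bodyweight.
    return df.assign(RelativeTotal=df["TotalKg"] / df["BodyweightKg"])

def check(df):
    # Fail fast if the earlier stages left something inconsistent behind.
    assert df["BodyweightKg"].gt(0).all()
    assert df["RelativeTotal"].notna().all()
    return df

raw = pd.DataFrame({
    "Tested": ["Yes", "Yes", "", "Yes"],
    "Sanctioned": ["Yes", "No", "Yes", "Yes"],
    "Sex": ["M", "M", "F", "F"],
    "BodyweightKg": [80.0, 90.0, 60.0, 60.0],
    "TotalKg": [600.0, 650.0, 300.0, 330.0],
})

result = raw.pipe(filter_meets).pipe(clean).pipe(add_features).pipe(check)
```

Returning the frame from the assert stage keeps it chainable, so the checks sit inside the pipeline instead of in a separate script.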

Imputation was done conservatively: age from AgeClass ranges, bodyweight from weight class, never synthetic values. NaN values were then filtered dynamically for each question.
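
Here is one way the age step could look. It assumes AgeClass holds strings such as "24-34" and uses the midpoint of the range as the estimate. The midpoint rule and the helper name `age_from_class` are my assumptions for this sketch, and anything that doesn't parse as a range stays missing.

```python
import numpy as np
import pandas as pd

def age_from_class(age_class):
    """Return the midpoint of a 'lo-hi' range string, or NaN if it can't be parsed."""
    if not isinstance(age_class, str) or "-" not in age_class:
        return np.nan
    lo, hi = age_class.split("-", 1)
    try:
        return (float(lo) + float(hi)) / 2
    except ValueError:
        return np.nan

df = pd.DataFrame({
    "Age": [23.0, np.nan, np.nan],
    "AgeClass": ["20-23", "24-34", None],
})

# Only fill rows where Age is missing; known ages are never overwritten.
df["Age"] = df["Age"].fillna(df["AgeClass"].map(age_from_class))
```

Rows with neither an age nor a usable class stay NaN and can be dropped later, per question, with something like `df.dropna(subset=["Age"])`.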

Results

Peak performance age: Athletes peak between 22-24 years old and decline steadily after. No meaningful difference between men and women once normalized by bodyweight.

Where do athletes fail most? Bench press has a 54% fail rate on the 3rd attempt. Squat and deadlift sit around 36-40%. The gap is consistent across sexes and equipment types — bench just behaves differently.
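
For reference, a fail rate like this can be computed straight from the sign of the raw attempt columns. The sketch below uses made-up toy numbers (not the real results) and assumed column names, and it counts only attempts that were actually taken.

```python
import numpy as np
import pandas as pd

# Toy data: negative = failed 3rd attempt, NaN = attempt not taken.
df = pd.DataFrame({
    "Sex": ["M", "M", "F", "F"],
    "Squat3Kg": [200.0, -210.0, 120.0, np.nan],
    "Bench3Kg": [-120.0, -125.0, 60.0, -62.5],
    "Deadlift3Kg": [250.0, 240.0, -150.0, 140.0],
})

# Long format: one row per (lifter, lift), skipping attempts that were not taken.
long = (
    df.melt(id_vars="Sex", var_name="lift", value_name="kg")
      .dropna(subset=["kg"])
)
long["failed"] = long["kg"] < 0

# Share of failed attempts per lift, split by sex.
fail_rates = long.groupby(["lift", "Sex"])["failed"].mean().unstack("Sex")
```

Adding an equipment column to `id_vars` and to the groupby keys would give the same breakdown by equipment type.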

The 4th attempt: When athletes take a 4th attempt, they succeed ~77% of the time on average. Deadlift leads at 83%. This is the most actionable insight of the whole project — just take the 4th attempt.
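
The key detail in this number is the denominator: only lifters who actually took a 4th attempt are counted. A minimal sketch on toy data, assuming columns named Squat4Kg, Bench4Kg and Deadlift4Kg:

```python
import numpy as np
import pandas as pd

# Toy data: most lifters never take a 4th attempt, so most cells are NaN.
df = pd.DataFrame({
    "Squat4Kg": [np.nan, 205.0, np.nan, -190.0],
    "Bench4Kg": [np.nan, np.nan, np.nan, 122.5],
    "Deadlift4Kg": [260.0, np.nan, 255.0, -270.0],
})

fourth_cols = ["Squat4Kg", "Bench4Kg", "Deadlift4Kg"]

taken = df[fourth_cols].notna()  # attempt was made, good or failed
made = df[fourth_cols] > 0       # attempt was a good lift

# Success rate per lift, conditional on the attempt being taken.
success_rate = made.sum() / taken.sum()

# One possible overall figure: pool all 4th attempts across the three lifts.
pooled_rate = made.to_numpy().sum() / taken.to_numpy().sum()
```

Pooling all attempts and averaging the three per-lift rates can give different overall figures, so it is worth stating which one is reported.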

What I learned

About powerlifting
Athletes peak between 22 and 24. Always take the 4th attempt, and make sure you don't fail the 3rd one: it can change the whole competition.

About analysing data
If you have enough data, it may be better not to fill the gaps with artificial values. Also, some features must be built before cleaning, or you'll spend an hour chatting with an AI wondering why all your booleans are NaN.

Full code: github.com/rubengil-dev/power_lifting_analisis

Project developed during the Data Science Master at Evolve.