I’ve always been fascinated by the Olympics.
The stories, the records, the triumphs… but when I went looking for a clean dataset of every athlete in history, I hit a wall.
Sure, there’s Olympedia.org — an incredible resource — but no “Download” button.
So I decided:
If the dataset doesn’t exist, I’ll build it myself.
The result? A Python scraper that can pull every athlete profile from 1896 to today — perfect for data analysis and visualization projects.
📌 What This Script Does
With one command, you get:
✅ Name, gender, height, weight
✅ Birth & death info (date, city, country)
✅ National Olympic Committee (NOC)
✅ Last Olympic Games and sport
✅ Medal counts (gold, silver, bronze)
Saved neatly in a CSV ready for Pandas or Excel.
📊 What You Can Do With It
This isn’t just about scraping.
Once you have the data, you can:
- Visualize medal trends over decades
- Explore which sports certain countries dominate
- Study athlete physique trends (height/weight) over time
- Map birthplaces of medalists with GeoPandas
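To make the first idea concrete, here is a minimal sketch of a decade-level medal chart. It assumes the olympedia.csv produced by the scraper and the column names shown later in this post, and it attributes each athlete's career medal total to the decade of their last Games, which is a simplification.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the scraped CSV (column names as listed in "The Data Format" below).
df = pd.read_csv("olympedia.csv")

# Career medal total per athlete, bucketed by the decade of their last Games.
df["total_medals"] = df[["gold_medal", "silver_medal", "bronze_medal"]].sum(axis=1)
df["decade"] = (df["year"] // 10) * 10

medals_by_decade = df.groupby("decade")["total_medals"].sum()

medals_by_decade.plot(kind="bar", title="Career medals by decade of last Games")
plt.xlabel("Decade")
plt.ylabel("Medals")
plt.tight_layout()
plt.show()
```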
⚡ How Fast Is It?
With 10 threads and a 0.4-second delay per request, you can scrape thousands of athletes in under an hour, without hammering the site.
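A rough sanity check, assuming the 0.4 s delay is applied per worker thread: 10 workers each pausing 0.4 s between requests can issue at most 10 / 0.4 = 25 requests per second, or about 90,000 per hour before network latency, so even a fraction of that rate covers thousands of profiles comfortably.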
🚀 Quick Start
1️⃣ Clone the repo and install the dependencies
git clone https://github.com/Wydoinn/Olympedia-Athlete-Scraper.git
cd Olympedia-Athlete-Scraper
pip install -r requirements.txt
2️⃣ Run the scraper
# Start fresh
python scraper.py --start 1 --concurrency 10 --delay 0.4 --csv olympedia.csv
# Or resume where you left off
python scraper.py --resume
3️⃣ Open olympedia.csv and start exploring.
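A first look in pandas can be as simple as this (just a sketch; the column names are the ones documented in the next section):

```python
import pandas as pd

df = pd.read_csv("olympedia.csv")

print(df.shape)                            # how many athletes made it into the file
print(df.head())                           # a first glance at the rows
print(df["noc"].value_counts().head(10))   # most common National Olympic Committees
```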
📂 The Data Format
Header and an example row:
athlete_id,name,sex,height_cm,weight_kg,born_date,died_date,born_city,born_region,born_country,died_city,died_region,died_country,noc,games,year,sport,gold_medal,silver_medal,bronze_medal
19,Maurice Germot,M,178,68,1882-11-15,1958-01-06,Vichy,Allier,FRA,Vichy,Allier,FRA,FRA,Stockholm 1912,1912,Tennis,0,2,0
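The date columns in the example are ISO-formatted and the medal columns are plain integers, so a typed load is straightforward. This is a sketch of my own, with errors="coerce" to tolerate partial or missing dates:

```python
import pandas as pd

df = pd.read_csv("olympedia.csv")

# Coerce anything that is not a full ISO date (or is empty) to NaT.
for col in ("born_date", "died_date"):
    df[col] = pd.to_datetime(df[col], errors="coerce")

df["total_medals"] = df[["gold_medal", "silver_medal", "bronze_medal"]].sum(axis=1)
df["lifespan_years"] = (df["died_date"] - df["born_date"]).dt.days / 365.25

print(df.loc[df["name"] == "Maurice Germot",
             ["born_date", "died_date", "lifespan_years", "total_medals"]])
```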
🧠 How It Works (in 20 Seconds)
- Multi-threaded with ThreadPoolExecutor
- Resumable via a progress.json checkpoint
- Auto-stops after 1000 consecutive missing IDs
- Parses HTML using BeautifulSoup
- Writes the CSV as it runs (so you can peek mid-scrape)
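For anyone curious what that loop looks like in code, here is a stripped-down sketch of the same pattern. It is not the repo's actual implementation: the athlete URL pattern, the progress.json fields, and the two-column output are assumptions made for brevity.

```python
import csv
import json
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.olympedia.org/athletes/{}"  # assumed URL pattern
CHECKPOINT = Path("progress.json")                  # assumed layout: {"next_id": ...}
DELAY = 0.4        # seconds each worker sleeps before a request
WORKERS = 10
MISS_LIMIT = 1000  # stop after this many consecutive missing IDs


def fetch(athlete_id):
    """Fetch one profile and pull a couple of fields; return None if the ID is missing."""
    time.sleep(DELAY)
    resp = requests.get(BASE_URL.format(athlete_id), timeout=30)
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    heading = soup.find("h1")
    return {"athlete_id": athlete_id,
            "name": heading.get_text(strip=True) if heading else ""}


def main():
    next_id = 1
    if CHECKPOINT.exists():  # resume where the last run stopped
        next_id = json.loads(CHECKPOINT.read_text())["next_id"]

    misses = 0
    with open("olympedia.csv", "a", newline="", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=WORKERS) as pool:
        writer = csv.DictWriter(f, fieldnames=["athlete_id", "name"])
        if f.tell() == 0:  # brand-new file: write the header once
            writer.writeheader()

        while misses < MISS_LIMIT:
            ids = range(next_id, next_id + WORKERS)
            for row in pool.map(fetch, ids):  # results come back in ID order
                if row is None:
                    misses += 1
                else:
                    misses = 0
                    writer.writerow(row)      # CSV grows while the scrape runs
            f.flush()
            next_id += WORKERS
            CHECKPOINT.write_text(json.dumps({"next_id": next_id}))


if __name__ == "__main__":
    main()
```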
🧹 A Note on Responsible Scraping
Please be respectful:
- Keep a delay between requests
- Don’t flood the server
- Always credit the source (Olympedia)
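If you fork the scraper or roll your own, a small wrapper like this is an easy way to honor the first two points; the User-Agent string is only an example, so put your own contact details there:

```python
import time

import requests

session = requests.Session()
# Identify yourself so the site operators can reach you if needed (example value).
session.headers["User-Agent"] = "olympedia-hobby-scraper (contact: you@example.com)"


def polite_get(url, delay=0.4):
    """GET with a built-in pause so requests never hammer the server."""
    time.sleep(delay)
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return resp
```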
💬 What would you want to analyze first?
Drop a comment and let’s brainstorm!