The Hook: Why Sports Data Matters More Than Ever
Last season, a mid-tier English football club made headlines when they announced a dramatic shift in their recruitment strategy. Their secret? A Python script that analyzed 15,000+ player actions across 500+ matches. Within two years, they'd climbed 14 positions in the league using data-driven insights that cost less than a single journeyman player's salary.
This isn't fiction anymore. Sports data analysis has moved from luxury to necessity, and the best part? You don't need a six-figure budget to get started. With Python, open APIs, and publicly available datasets, you can build enterprise-grade sports analytics pipelines in your spare time.
In this tutorial, I'll walk you through building a complete sports data pipeline that ingests StatsBomb data, processes it with pandas, and surfaces actionable insights. By the end, you'll have a reusable framework you can adapt to other sports and data providers.
Part 1: Understanding Your Data Sources
Before writing a single line of code, you need to understand where sports data lives and what you're actually working with.
The Sports Data Ecosystem
The sports data landscape has three main tiers:
Tier 1 - Commercial APIs (StatsBomb, Opta, InStat)
- Highest quality, most comprehensive event data
- Premium pricing ($5,000-$50,000+ annually)
- Real-time or near real-time availability
- StatsBomb also publishes a free open-data set for learning and research
Tier 2 - Public/Semi-Public APIs (RapidAPI, ESPN, Football-Data.org)
- Solid data quality for most use cases
- Free or freemium pricing
- Some restrictions on rate limits and historical depth
- Great for learning and non-commercial projects
Tier 3 - Web Scraping (Understat, FBref, WhoScored)
- Variable quality depending on source
- Requires legal and ethical consideration
- No API = more maintenance overhead
- Best for supplementary data
For this tutorial, I'm using StatsBomb's open data, which is freely available on GitHub and includes detailed event-level data for 3,000+ professional matches.
What Data Will We Analyze?
StatsBomb provides event-level data with approximately 500 attributes per match, including:
- Positional data: x/y coordinates for every action
- Event types: 28+ categories (pass, shot, tackle, dribble, etc.)
- Outcome data: success/failure, pressure, defensive actions
- Player/team metadata: IDs, names, positions, jersey numbers
- Match context: dates, competition, stadiums, lineups
A single match generates 2,000-3,000 events. Our pipeline will aggregate this granular data into meaningful statistics.
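To make the event structure concrete, here is a minimal sketch of how nested event records flatten into dotted column names with `pd.json_normalize`. The two records are hand-written for illustration and only mimic the shape of a StatsBomb pass event; the values are invented, and real events carry many more fields.

```python
import pandas as pd

# Hand-written records that mimic the shape of StatsBomb pass events.
# The values are illustrative, not taken from a real match.
sample_events = [
    {
        "id": "example-1",
        "period": 1,
        "timestamp": "00:00:04.200",
        "minute": 0,
        "type": {"id": 30, "name": "Pass"},
        "team": {"id": 1, "name": "Team A"},
        "player": {"id": 10, "name": "Player One"},
        "location": [60.0, 40.0],
        "pass": {"end_location": [72.0, 45.0], "length": 13.0},
    },
    {
        "id": "example-2",
        "period": 1,
        "timestamp": "00:00:07.900",
        "minute": 0,
        "type": {"id": 30, "name": "Pass"},
        "team": {"id": 1, "name": "Team A"},
        "player": {"id": 11, "name": "Player Two"},
        "location": [72.0, 45.0],
        "pass": {
            "end_location": [90.0, 30.0],
            "length": 23.4,
            "outcome": {"id": 9, "name": "Incomplete"},
        },
    },
]

# Nested dicts become dotted columns; lists such as "location" stay as lists.
flat = pd.json_normalize(sample_events)
print(sorted(flat.columns))
print(flat[["type.name", "player.name", "pass.outcome.name"]])
```

Note that the first record has no `pass.outcome` key, so its `pass.outcome.name` cell comes out as NaN. That missing-value pattern is what the cleaning step later relies on.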
Part 2: Environment Setup and Core Libraries
Let's build the foundation for a production-ready pipeline.
Installation
# Create virtual environment
python -m venv sports_pipeline
source sports_pipeline/bin/activate # On Windows: sports_pipeline\Scripts\activate
# Install required packages
pip install pandas numpy matplotlib seaborn requests scipy scikit-learn jupyter
# Optional: for advanced visualization
pip install plotly kaleido
# Optional: StatsBomb's own Python client for the same data
pip install statsbombpy
Import Structure
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json
import requests
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')
# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
Key Libraries Explained
- pandas: DataFrames handle flattened JSON event data well. Expect to spend most of your time here.
- numpy: Fast numerical operations and statistical calculations
- matplotlib/seaborn: Publication-quality visualizations
- requests: Clean API interactions
- scipy: Statistical tests and distributions
Part 3: Building the Data Pipeline
Now for the practical implementation. Here's where theory meets code.
Step 1: Fetching StatsBomb Data
StatsBomb's open data is hosted on GitHub. We'll fetch match data programmatically:
class StatsBombPipeline:
    """Production-ready StatsBomb data pipeline"""

    # StatsBomb's open data lives in the statsbomb/open-data repository
    BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"

    def __init__(self):
        self.matches = None
        self.events = None
        self.lineups = None

    def _get_json(self, url):
        """GET a URL and return parsed JSON, raising on HTTP errors"""
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.json()

    def fetch_competitions(self):
        """Fetch available competitions"""
        url = f"{self.BASE_URL}/competitions.json"
        return pd.DataFrame(self._get_json(url))

    def fetch_matches(self, competition_id, season_id):
        """Fetch all matches for a specific season"""
        url = f"{self.BASE_URL}/matches/{competition_id}/{season_id}.json"
        # Flatten nested JSON to DataFrame
        return pd.json_normalize(self._get_json(url))

    def fetch_events(self, match_id):
        """Fetch detailed event data for a single match"""
        url = f"{self.BASE_URL}/events/{match_id}.json"
        # Flatten nested structure
        events_df = pd.json_normalize(self._get_json(url))
        events_df['match_id'] = match_id
        return events_df

    def fetch_season_events(self, competition_id, season_id, limit=None):
        """Fetch events for entire season"""
        matches_df = self.fetch_matches(competition_id, season_id)
        all_events = []
        match_list = matches_df['match_id'].tolist()
        if limit is not None:
            match_list = match_list[:limit]
        print(f"Fetching events for {len(match_list)} matches...")
        for idx, match_id in enumerate(match_list):
            try:
                events_df = self.fetch_events(match_id)
                all_events.append(events_df)
                if (idx + 1) % 10 == 0:
                    print(f"✓ Processed {idx + 1}/{len(match_list)} matches")
            except Exception as e:
                # Skip matches that fail to download or parse
                print(f"✗ Error fetching match {match_id}: {e}")
                continue
        if not all_events:
            raise RuntimeError("No events could be fetched for this season")
        # Concatenate and reset index
        events_df = pd.concat(all_events, ignore_index=True)
        self.events = events_df
        self.matches = matches_df
        return events_df
# Initialize pipeline
pipeline = StatsBombPipeline()

# Fetch competitions and list the available competition/season ID pairs
competitions = pipeline.fetch_competitions()
print(competitions[['competition_id', 'competition_name',
                    'season_id', 'season_name']].drop_duplicates())

# Fetch one season (sample). Confirm the competition_id/season_id pair appears
# in the table printed above; only listed pairs can be fetched.
events = pipeline.fetch_season_events(
    competition_id=37,
    season_id=1,
    limit=50  # First 50 matches for this example
)
print(f"Loaded {len(events)} events across {events['match_id'].nunique()} matches")
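Because each match is a separate HTTP request, re-running the pipeline downloads the same files again. One possible mitigation is a small on-disk cache. The sketch below is not part of the pipeline above: `cached_json`, the cache file naming, and the injectable `fetch` callable are hypothetical names chosen for illustration, and the demo uses a stand-in fetcher so it runs offline.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def cached_json(url: str, cache_dir: Path, fetch: Callable[[str], Any]) -> Any:
    """Return parsed JSON for `url`, reading from disk when already cached."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use the last two URL segments as the file name, e.g. "events_12345.json".
    cache_file = cache_dir / "_".join(url.rstrip("/").split("/")[-2:])
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = fetch(url)
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data


# Stand-in fetcher that records how often it is called (no network needed).
calls = []


def fake_fetch(url: str) -> Any:
    calls.append(url)
    return [{"match_id": 12345, "source": url}]


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.invalid/data/events/12345.json"
    first = cached_json(url, Path(tmp), fake_fetch)
    second = cached_json(url, Path(tmp), fake_fetch)

print(len(calls))  # the second call is served from disk
```

In real use, `fetch` could be something like `lambda u: requests.get(u, timeout=30).json()`, with `cache_dir` pointing at a persistent folder instead of a temporary one.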
Step 2: Data Cleaning and Enrichment
Raw sports data is messy. Let's clean it:
def clean_and_enrich_events(events_df):
    """Clean and add useful features to event data"""
    df = events_df.copy()

    # StatsBomb timestamps are clock strings ("HH:MM:SS.mmm") that restart
    # each period, so parse them as durations rather than calendar dates
    df['timestamp'] = pd.to_timedelta(df['timestamp'])
    df['minute'] = df['minute'].fillna(0).astype(int)

    # Extract pass-specific columns. After json_normalize the nested outcome
    # appears as 'pass.outcome.name'; a pass with no outcome is complete
    is_pass = df['type.name'] == 'Pass'
    df['pass_completed'] = (is_pass & df['pass.outcome.name'].isna()).astype(int)

    def pass_length(row):
        start, end = row['location'], row['pass.end_location']
        if not isinstance(start, list) or not isinstance(end, list):
            return np.nan  # not a pass, or coordinates missing
        return float(np.hypot(end[0] - start[0], end[1] - start[1]))

    df['pass_length'] = df.apply(pass_length, axis=1)

    # Extract shot data
    df['shot_result'] = df['shot.outcome.name'].fillna('Not Shot')

    # Team and player names
    df['team_name'] = df['team.name'].fillna('Unknown')
    df['player_name'] = df['player.name'].fillna('Unknown')
    df['position'] = df['position.name'].fillna('Unknown')

    # Event type simplification
    df['event_type'] = df['type.name'].fillna('Other')

    # Sort by match, period, and time within the period
    df = df.sort_values(['match_id', 'period', 'timestamp']).reset_index(drop=True)
    return df

events_clean = clean_and_enrich_events(events)
print(events_clean[['timestamp', 'team_name', 'player_name', 'event_type', 'pass_completed']].head(10))
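The completion flag relies on a StatsBomb convention: a pass event with no recorded outcome counts as completed. A minimal sketch on a hand-made frame shows the effect, and why non-pass events need to be masked out. It assumes the flattened outcome column is named `pass.outcome.name`; the rows are invented.

```python
import numpy as np
import pandas as pd

# Hand-made rows in the flattened, dotted-column style; values are invented.
toy = pd.DataFrame({
    "type.name": ["Pass", "Pass", "Shot", "Pass"],
    "team.name": ["Team A", "Team A", "Team A", "Team B"],
    "pass.outcome.name": [np.nan, "Incomplete", np.nan, np.nan],
})

is_pass = toy["type.name"] == "Pass"
# A pass with no outcome recorded is treated as completed. The shot row
# also has a missing outcome, so it must be excluded explicitly.
toy["pass_completed"] = (is_pass & toy["pass.outcome.name"].isna()).astype(int)

# Pass accuracy per team, computed over pass rows only.
accuracy = toy[is_pass].groupby("team.name")["pass_completed"].mean()
print(accuracy)
```

Without the `is_pass` mask, every shot, tackle, and carry would be counted as a completed pass, which inflates any total taken over the whole frame.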
Step 3: Aggregating to Team-Level Statistics
Now we transform event-level data into actionable metrics:
def calculate_team_stats(events_df, matches_df):
    """Calculate comprehensive team statistics"""
    # Initialize results dictionary
    team_stats = defaultdict(lambda: {
        'matches_played': 0,
        'passes_attempted': 0,
        'passes_completed': 0,
        'pass_accuracy': 0,
        'shots': 0,
        'shots_on_target': 0,
        'tackles': 0,
        'interceptions': 0,
        'fouls_committed': 0,
        'goals': 0,
        'possession_time': 0
    })

    # Get match durations for possession calculations. The match files may
    # not include a 'duration' column, so fall back to an empty mapping
    if 'duration' in matches_df.columns:
        match_durations = matches_df.set_index('match_id')['duration'].to_dict()
    else:
        match_durations = {}

    # Iterate through events
    for _, event in events_df.iterrows():
        team = event['team_name']
        event_type = event['event_type']
        match_id = event['match_id']

        # Passes
        if event_type == 'Pass':
            team_stats[team]['passes_attempted'] += 1
            if event['pass_completed'] == 1:
                team_stats[team]['passes_completed'] += 1

        # Shots
        elif event_type == 'Shot':
            team_stats[team]['shots'] += 1
            if event['shot_result'] in ['Saved', 'Goal']:
                team_stats[team]['shots_on_target'] += 1
            if event['shot_result'] == 'Goal':
                team_stats[team]['goals'] += 1

        # Defensive actions (StatsBomb records tackles as Duel events)
        elif event_type == 'Duel' and event.get('duel.type.name') == 'Tackle':
            team_stats[team]['tackles'] += 1
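Looping over rows with `iterrows()` becomes slow once a full season of events is loaded. As an alternative sketch, the pass and shot counts can be computed with boolean columns and a single `groupby`. This assumes the cleaned columns from Step 2 (`team_name`, `event_type`, `pass_completed`, `shot_result`); `toy_events` and `team_totals` are illustrative names, and the rows are invented so the example runs on its own.

```python
import pandas as pd

# Invented events shaped like the cleaned frame from Step 2.
toy_events = pd.DataFrame({
    "team_name": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "event_type": ["Pass", "Pass", "Shot", "Pass", "Shot", "Shot"],
    "pass_completed": [1, 0, 0, 1, 0, 0],
    "shot_result": ["Not Shot", "Not Shot", "Goal", "Not Shot", "Saved", "Blocked"],
})

is_pass = toy_events["event_type"] == "Pass"
is_shot = toy_events["event_type"] == "Shot"

# One 0/1 indicator column per statistic, so a plain sum gives the counts.
flags = pd.DataFrame({
    "team_name": toy_events["team_name"],
    "passes_attempted": is_pass.astype(int),
    "passes_completed": (is_pass & (toy_events["pass_completed"] == 1)).astype(int),
    "shots": is_shot.astype(int),
    "shots_on_target": (
        is_shot & toy_events["shot_result"].isin(["Saved", "Goal"])
    ).astype(int),
    "goals": (is_shot & (toy_events["shot_result"] == "Goal")).astype(int),
})

team_totals = flags.groupby("team_name").sum()
team_totals["pass_accuracy"] = (
    team_totals["passes_completed"] / team_totals["passes_attempted"]
)
print(team_totals)
```

The on-target rule mirrors the loop above (`'Saved'` or `'Goal'`). A team with zero attempted passes would produce a NaN accuracy here, so real data may need an explicit guard.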