The Hook: From API Calls to Actionable Insights
I spent three weeks manually downloading soccer match data from scattered sources, copying statistics into Excel, and creating ad-hoc visualizations. When I discovered StatsBomb's open data API and learned to automate the entire pipeline with Python, I recovered those 60+ hours and gained insights no spreadsheet could deliver.
In this tutorial, I'll show you exactly how to build an automated sports data pipeline from scratch. By the end, you'll have a reproducible system that fetches match data, transforms it with pandas, and generates visualizations that reveal patterns invisible to the naked eye.
Part 1: Understanding Your Data Sources
Before writing a single line of Python, it helps to know which data sources are available and what each one offers.
Primary Open Sports Data Sources
StatsBomb Open Data (https://statsbomb.com/what-we-do/soccer-data/)
- Free access to 3,000+ matches across selected women's and men's competitions
- Comprehensive event-level data including passes, shots, pressures, and defensive actions
- JSON format via GitHub repository
Understat (https://understat.com/)
- Expected Goals (xG) and underlying metrics
- Shot maps and player heat maps
- Requires scraping or API access (limited free tier)
Football-Data.org (https://www.football-data.org/)
- Structured match results and standings
- League information across multiple countries
- CSV and JSON formats
Wyscout (https://www.wyscout.com/)
- Professional video data with automated tracking
- API available (subscription-based)
FBref (Sports Reference) (https://fbref.com/)
- Detailed player and team statistics
- Scrapable without explicit API
For this tutorial, I'll focus on StatsBomb's open data because it's freely accessible, comprehensive, and requires no authentication keys.
Part 2: Setup and Required Libraries
Installation
pip install pandas numpy matplotlib seaborn requests statsbombpy
Core Libraries Overview
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
from datetime import datetime
from collections import Counter
# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
Why these libraries?
- pandas: Tabular data manipulation and aggregation
- numpy: Numerical operations and array handling
- matplotlib/seaborn: Publication-quality visualizations
- requests: HTTP requests to fetch data from APIs
- statsbombpy: StatsBomb's own Python package for loading its data into pandas
Part 3: Fetching and Loading StatsBomb Data
Method 1: Using the StatsBomb Python Package
The easiest approach uses the statsbombpy package (installed with pip install statsbombpy):
from statsbombpy import sb
# Get all available competitions (the open data needs no credentials)
comps = sb.competitions()
print(comps[['competition_name', 'season_name']].drop_duplicates())
# Filter for a specific competition (e.g., Women's Super League)
wsl_competitions = comps[comps['competition_name'] == "FA Women's Super League"]
print(wsl_competitions)
# Extract competition and season IDs from the first matching row
comp_id = wsl_competitions.iloc[0]['competition_id']
season_id = wsl_competitions.iloc[0]['season_id']
# Fetch all matches in this competition
matches_df = sb.matches(competition_id=comp_id, season_id=season_id)
print(f"Total matches: {len(matches_df)}")
print(matches_df[['match_date', 'home_team', 'away_team', 'home_score', 'away_score']].head(10))
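Indexing with .iloc[0] raises an IndexError when the name filter matches nothing, and a small spelling difference is enough to cause that. Here is a minimal sketch of a more defensive lookup, assuming a competitions table with the columns used above. The helper name find_season is hypothetical and the sample rows are made up for illustration, not real StatsBomb IDs.

```python
import pandas as pd

def find_season(comps, competition_name, season_name=None):
    """Return (competition_id, season_id), or raise a readable error."""
    # Case-insensitive match on the competition name
    mask = comps["competition_name"].str.casefold() == competition_name.casefold()
    if season_name is not None:
        mask &= comps["season_name"] == season_name
    # When no season is given, prefer the latest season label
    hits = comps.loc[mask].sort_values("season_name", ascending=False)
    if hits.empty:
        available = sorted(comps["competition_name"].unique())
        raise LookupError(f"No match for {competition_name!r}; available: {available}")
    row = hits.iloc[0]
    return int(row["competition_id"]), int(row["season_id"])

# Illustrative rows only; the IDs are placeholders
sample = pd.DataFrame({
    "competition_id": [100, 100, 200],
    "season_id": [1, 2, 3],
    "competition_name": ["League A", "League A", "Cup B"],
    "season_name": ["2018/2019", "2019/2020", "2018"],
})
print(find_season(sample, "league a"))
print(find_season(sample, "League A", "2018/2019"))
```

Raising a LookupError that lists the available names tends to be easier to debug than a bare IndexError from .iloc[0].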
Method 2: Fetching via GitHub (if package fails)
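import requests
import pandas as pd
def fetch_statsbomb_data(data_type='matches', comp_id=None, season_id=None, match_id=None):
    """
    Fetch StatsBomb open data directly from GitHub
    data_type: 'matches', 'events', 'lineups'
    """
    base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"
    if data_type == 'matches':
        url = f"{base_url}/matches/{comp_id}/{season_id}.json"
    elif data_type in ('events', 'lineups'):
        # Events and lineups are stored per match, keyed by match_id
        url = f"{base_url}/{data_type}/{match_id}.json"
    else:
        raise ValueError(f"Unknown data_type: {data_type}")
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 404s instead of parsing an error page
    return response.json()
# Fetch matches. IDs 37 / 4 are assumed to be the FA Women's Super League 2018/2019;
# confirm them against competitions.json in the repository before relying on them.
matches_data = fetch_statsbomb_data('matches', comp_id=37, season_id=4)
matches_df = pd.json_normalize(matches_data)
print(matches_df.shape)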
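Fetching from GitHub on every run is slow and unfriendly to the host, so a small on-disk cache is worth having once the pipeline runs repeatedly. The following is a sketch under stated assumptions: fetch_json_cached, the statsbomb_cache directory, and the injectable fetcher parameter are all hypothetical names introduced here, and it uses only the standard library.

```python
import json
from pathlib import Path
from urllib.request import urlopen

def fetch_json_cached(url, cache_dir="statsbomb_cache", fetcher=None):
    """Return parsed JSON for url, reading from a local file cache when possible."""
    # Derive a flat file name from the part of the URL after /data/
    cache_name = url.split("/data/", 1)[-1].replace("/", "_")
    cache_path = Path(cache_dir) / cache_name
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    if fetcher is None:
        # Default fetcher: plain HTTP GET with a timeout
        def fetcher(u):
            with urlopen(u, timeout=30) as resp:
                return resp.read().decode("utf-8")
    text = fetcher(url)
    data = json.loads(text)  # parse first so a non-JSON body is never cached
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return data

# Usage (performs a network request on the first call only):
# matches_data = fetch_json_cached(
#     "https://raw.githubusercontent.com/statsbomb/open-data/master/data/matches/37/4.json"
# )
```

The fetcher argument exists mainly so the function can be exercised without network access; in normal use it can be left at its default.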
Part 4: Building the Data Pipeline
Step 1: Clean and Normalize Match Data
def process_matches(raw_matches):
    """
    Transform raw StatsBomb match data into analysis-ready format
    """
    matches = pd.json_normalize(raw_matches)
    # Extract relevant columns (nested team fields flatten to
    # 'home_team.home_team_name' and 'away_team.away_team_name')
    matches_clean = matches[[
        'match_id',
        'match_date',
        'kick_off',
        'home_team.home_team_name',
        'away_team.away_team_name',
        'home_score',
        'away_score',
        'match_week',
        'competition.competition_name',
        'season.season_name'
    ]].copy()
    # Rename columns for clarity
    matches_clean.columns = [
        'match_id', 'match_date', 'kick_off', 'home_team',
        'away_team', 'home_score', 'away_score', 'match_week',
        'competition', 'season'
    ]
    # Convert date columns
    matches_clean['match_date'] = pd.to_datetime(matches_clean['match_date'])
    # Create derived metrics
    matches_clean['total_goals'] = matches_clean['home_score'] + matches_clean['away_score']
    matches_clean['goal_diff'] = matches_clean['home_score'] - matches_clean['away_score']
    matches_clean['result'] = matches_clean['goal_diff'].apply(
        lambda x: 'Home Win' if x > 0 else ('Away Win' if x < 0 else 'Draw')
    )
    return matches_clean
# Process the data
matches_processed = process_matches(matches_data)
print(matches_processed.head())
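The result column above is built with a row-by-row apply. On larger tables a vectorized version is usually faster and arguably easier to read. A minimal sketch using numpy.select, assuming only the home_score and away_score columns; add_result_column is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def add_result_column(df):
    """Label each match as Home Win, Away Win, or Draw without a Python-level loop."""
    out = df.copy()
    diff = out["home_score"] - out["away_score"]
    # Conditions are checked in order; rows matching neither fall through to the default
    out["result"] = np.select(
        [diff > 0, diff < 0],
        ["Home Win", "Away Win"],
        default="Draw",
    )
    return out

demo = pd.DataFrame({"home_score": [2, 0, 1], "away_score": [1, 3, 1]})
print(add_result_column(demo)["result"].tolist())
# ['Home Win', 'Away Win', 'Draw']
```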
Step 2: Fetch and Parse Events Data
def fetch_match_events(match_id):
    """
    Fetch all events for a specific match
    """
    base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/events"
    url = f"{base_url}/{match_id}.json"
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error fetching match {match_id} (HTTP {response.status_code})")
        return None
def parse_events(events_raw):
    """
    Normalize and flatten nested event JSON
    """
    events_list = []
    for event in events_raw:
        event_type = event.get('type', {}).get('name')
        # Not every event has a location, so fall back to a pair of Nones
        location = event.get('location') or [None, None]
        event_dict = {
            'event_id': event.get('id'),
            'event_type': event_type,
            'timestamp': event.get('timestamp'),
            'minute': event.get('minute'),
            'second': event.get('second'),
            'period': event.get('period'),
            'player_id': event.get('player', {}).get('id'),
            'player_name': event.get('player', {}).get('name'),
            'team': event.get('team', {}).get('name'),
            'x': location[0],
            'y': location[1],
        }
        # Add type-specific data
        if event_type == 'Pass':
            pass_data = event.get('pass', {})
            end_location = pass_data.get('end_location') or [None, None]
            event_dict['pass_length'] = pass_data.get('length')
            event_dict['pass_angle'] = pass_data.get('angle')
            event_dict['pass_outcome'] = pass_data.get('outcome', {}).get('name')
            event_dict['pass_end_x'] = end_location[0]
            event_dict['pass_end_y'] = end_location[1]
        elif event_type == 'Shot':
            shot_data = event.get('shot', {})
            event_dict['shot_outcome'] = shot_data.get('outcome', {}).get('name')
            # StatsBomb stores expected goals under the 'statsbomb_xg' key
            event_dict['xg'] = shot_data.get('statsbomb_xg')
            event_dict['shot_type'] = shot_data.get('type', {}).get('name')
        events_list.append(event_dict)
    return pd.DataFrame(events_list)
# Fetch events for first 5 matches
all_events = []
for match_id in matches_processed['match_id'].head(5):
    events_raw = fetch_match_events(match_id)
    if events_raw:
        events_df = parse_events(events_raw)
        events_df['match_id'] = match_id
        all_events.append(events_df)
        print(f"Processed match {match_id}")
events_combined = pd.concat(all_events, ignore_index=True)
print(events_combined.head())
print(f"Total events: {len(events_combined)}")
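Event JSON is deeply nested and many keys are optional, so chained .get() calls pile up quickly and list indexing on a missing location can raise. One way to keep the parsing code short is a small nested-lookup helper. This is a sketch, not part of any StatsBomb tooling; dig is a hypothetical name and the sample event is made up to mirror the fields used above.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts and lists, returning default as soon as a step is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            # KeyError: missing dict key; IndexError: list too short;
            # TypeError: tried to index into None or a scalar
            return default
    return obj

event = {"type": {"name": "Pass"}, "location": [61.0, 40.5], "pass": {"length": 12.3}}
print(dig(event, "type", "name"))             # Pass
print(dig(event, "pass", "outcome", "name"))  # None
print(dig(event, "location", 1))              # 40.5
print(dig({"type": {"name": "Half Start"}}, "location", 0))  # None
```

As far as I understand the data, a pass with no outcome field is a completed pass, so a None in pass_outcome is informative and should not be treated as missing data.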
Step 3: Data Validation and Quality Checks
def validate_data(matches_df, events_df):
    """
    Perform quality checks on datasets
    """
    print("=== MATCHES DATA VALIDATION ===")
    print(f"Shape: {matches_df.shape}")
    print(f"Date range: {matches_df['match_date'].min()} to {matches_df['match_date'].max()}")
    print(f"Missing values:\n{matches_df.isnull().sum()}")
    print(f"Competitions: {matches_df['competition'].unique()}")
    print("\n=== EVENTS DATA VALIDATION ===")
    print(f"Shape: {events_df.shape}")
    print(f"Event types: {events_df['event_type'].value_counts()}")
    print(f"Matches covered: {events_df['match_id'].nunique()}")
    print(f"Missing values:\n{events_df.isnull().sum()}")
    # Check for data consistency
    match_ids_in_events = set(events_df['match_id'].unique())
    match_ids_in_matches = set(matches_df['match_id'].unique())
    print(f"\nMatches with events: {len(match_ids_in_events)}")
    print(f"Total matches: {len(match_ids_in_matches)}")
validate_data(matches_processed, events_combined)
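Printed summaries are handy while exploring, but an automated pipeline usually needs checks it can act on. Below is a minimal sketch that returns the findings as a dictionary instead of printing them, assuming the match_id and event_id columns built in the previous steps. The name find_coverage_gaps and the sample frames are hypothetical.

```python
import pandas as pd

def find_coverage_gaps(matches_df, events_df):
    """Report mismatches between the matches table and the events table."""
    match_ids = {int(m) for m in matches_df["match_id"]}
    event_match_ids = {int(m) for m in events_df["match_id"]}
    return {
        # Matches we know about but have no events for (expected if only a subset was fetched)
        "matches_without_events": sorted(match_ids - event_match_ids),
        # Events pointing at a match that is not in the matches table
        "events_without_match": sorted(event_match_ids - match_ids),
        # Event IDs should be unique; duplicates suggest a match was loaded twice
        "duplicate_event_ids": int(events_df["event_id"].duplicated().sum()),
    }

demo_matches = pd.DataFrame({"match_id": [1, 2, 3]})
demo_events = pd.DataFrame({
    "match_id": [1, 1, 2, 9],
    "event_id": ["a", "b", "c", "c"],
})
print(find_coverage_gaps(demo_matches, demo_events))
# {'matches_without_events': [3], 'events_without_match': [9], 'duplicate_event_ids': 1}
```

A caller can then decide what counts as fatal, for example stopping the run when events_without_match is non-empty.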
Part 5: Exploratory Analysis and Insights
Analysis 1: Home Advantage Investigation
# Home advantage analysis
home_stats = matches_processed.groupby('result').size()
print("Match Results Distribution:")
print(home_stats)
# Calculate home win percentage
home_wins = (matches_processed['goal_diff'] > 0).sum()
home_advantage_pct = (home_wins / len(matches_processed)) * 100
print(f"\nHome team win percentage: {home_advantage_pct:.1f}%")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Result distribution
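The listing stops right after the figure is created. Here is a minimal sketch of how the two panels could be filled, assuming the result and goal_diff columns built in Step 1. The choice of a bar chart next to a goal-difference histogram, the helper name plot_home_advantage, and the sample data are assumptions for illustration, not the original code.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_home_advantage(matches_df):
    """Draw result counts (left) and the goal-difference distribution (right)."""
    order = ["Home Win", "Draw", "Away Win"]
    # reindex keeps a fixed bar order and shows zero for outcomes that never occurred
    counts = matches_df["result"].value_counts().reindex(order, fill_value=0)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Result distribution
    axes[0].bar(counts.index, counts.values)
    axes[0].set_title("Match results")
    axes[0].set_ylabel("Matches")
    # Goal difference from the home team's perspective, one bin per integer value
    lo = int(matches_df["goal_diff"].min())
    hi = int(matches_df["goal_diff"].max())
    axes[1].hist(matches_df["goal_diff"], bins=list(range(lo, hi + 2)), align="left")
    axes[1].set_title("Goal difference (home minus away)")
    axes[1].set_xlabel("Goal difference")
    fig.tight_layout()
    return fig, counts

demo = pd.DataFrame({
    "result": ["Home Win", "Draw", "Away Win", "Home Win"],
    "goal_diff": [1, 0, -2, 3],
})
fig, counts = plot_home_advantage(demo)
print(counts.tolist())  # [2, 1, 1]
```

Calling plt.show() or fig.savefig("home_advantage.png") afterwards displays or saves the figure.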