Edge Lab

Posted on Jun 27

Building a Sports Data Pipeline: Python, StatsBomb API, and pandas in Practice

#tutorial

The Hook: From API Calls to Actionable Insights

I spent three weeks manually downloading soccer match data from scattered sources, copying statistics into Excel, and creating ad-hoc visualizations. When I discovered StatsBomb's open data API and learned to automate the entire pipeline with Python, I recovered those 60+ hours and gained insights no spreadsheet could deliver.

In this tutorial, I'll show you exactly how to build an automated sports data pipeline from scratch. By the end, you'll have a reproducible system that fetches match data, transforms it with pandas, and generates visualizations that reveal patterns invisible to the naked eye.

Part 1: Understanding Your Data Sources

Before writing a single line of Python, it helps to understand which data sources are available and what each one offers.

Primary Open Sports Data Sources

StatsBomb Open Data (https://statsbomb.com/what-we-do/soccer-data/)

Free access to 3,000+ matches across men's and women's competitions (for example the FA Women's Super League, FIFA World Cups, and La Liga seasons)
Comprehensive event-level data including passes, shots, pressures, and defensive actions
JSON format via GitHub repository

Understat (https://understat.com/)

Expected Goals (xG) and underlying metrics
Shot maps and player heat maps
Requires scraping or API access (limited free tier)

Football-Data.org (https://www.football-data.org/)

Structured match results and standings
League information across multiple countries
CSV and JSON formats

Wyscout (https://www.wyscout.com/)

Professional video data with automated tracking
API available (subscription-based)

FBref (Sports Reference) (https://fbref.com/)

Detailed player and team statistics
Scrapable without explicit API

For this tutorial, I'll focus on StatsBomb's open data because it's freely accessible, comprehensive, and requires no authentication keys.
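To make the "JSON via GitHub" point concrete, here is a minimal sketch of how the open-data files can be addressed. It assumes the layout of the `statsbomb/open-data` repository on GitHub (a `data/` folder holding `competitions.json`, `matches/<competition_id>/<season_id>.json`, `events/<match_id>.json`, and `lineups/<match_id>.json`). The helper name `open_data_url` and the example IDs are my own illustrations, not part of any library.

```python
# Assumed base URL for raw files in the statsbomb/open-data GitHub repository.
OPEN_DATA_BASE = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def open_data_url(kind, competition_id=None, season_id=None, match_id=None):
    """Build the raw-file URL for one open-data JSON file (hypothetical helper)."""
    if kind == "competitions":
        return f"{OPEN_DATA_BASE}/competitions.json"
    if kind == "matches":
        if competition_id is None or season_id is None:
            raise ValueError("matches needs competition_id and season_id")
        return f"{OPEN_DATA_BASE}/matches/{competition_id}/{season_id}.json"
    if kind in ("events", "lineups"):
        if match_id is None:
            raise ValueError(f"{kind} needs match_id")
        return f"{OPEN_DATA_BASE}/{kind}/{match_id}.json"
    raise ValueError(f"unknown kind: {kind!r}")


# The IDs below are placeholders; real ones come from competitions.json.
print(open_data_url("competitions"))
print(open_data_url("matches", competition_id=37, season_id=4))
```

Keeping URL construction in one small function means the rest of the pipeline never has to hard-code paths, and a wrong `kind` fails immediately instead of producing a 404 later.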

Part 2: Setup and Required Libraries

Installation

pip install pandas numpy matplotlib seaborn requests statsbombpy

Core Libraries Overview

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
from datetime import datetime
from collections import Counter

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

Why these libraries?

pandas: Tabular data manipulation and aggregation
numpy: Numerical operations and array handling
matplotlib/seaborn: Publication-quality visualizations
requests: HTTP requests to fetch data from APIs
statsbombpy: StatsBomb's Python package for loading its data into pandas DataFrames

Part 3: Fetching and Loading StatsBomb Data

Method 1: Using the StatsBomb Python Package

The easiest approach uses the statsbombpy package (pip install statsbombpy):

from statsbombpy import sb

# Get all available competitions
# (without API credentials, statsbombpy reads the free open data and prints a warning)
comps = sb.competitions()
print(comps[['competition_name', 'season_name']].drop_duplicates())

# Filter for a specific competition (e.g., Women's Super League)
wsl_competitions = comps[comps['competition_name'] == "FA Women's Super League"]
print(wsl_competitions)

# Extract the competition and season IDs of the first listed season
comp_id = int(wsl_competitions.iloc[0]['competition_id'])
season_id = int(wsl_competitions.iloc[0]['season_id'])

# Fetch all matches in this competition season
matches_df = sb.matches(competition_id=comp_id, season_id=season_id)
print(f"Total matches: {len(matches_df)}")
print(matches_df[['match_date', 'home_team', 'away_team', 'home_score', 'away_score']].head(10))
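Taking `iloc[0]` picks whichever season happens to be listed first. A minimal sketch of a more explicit lookup follows. It runs on a toy table that only mimics the column names used above; `pick_season` is a hypothetical helper and the IDs in the toy table are illustrative, not guaranteed to match the real data.

```python
import pandas as pd


def pick_season(comps, competition_name, season_name=None):
    """Return (competition_id, season_id) for one competition season.

    Hypothetical helper: if season_name is None, the alphabetically last
    season_name is used, which corresponds to the most recent season for
    labels such as "2019/2020".
    """
    rows = comps[comps["competition_name"] == competition_name]
    if season_name is not None:
        rows = rows[rows["season_name"] == season_name]
    if rows.empty:
        raise ValueError(f"no season found for {competition_name!r}")
    row = rows.sort_values("season_name").iloc[-1]
    return int(row["competition_id"]), int(row["season_id"])


# Toy table with the same column names as the competitions listing.
toy_comps = pd.DataFrame({
    "competition_id": [37, 37, 43],
    "season_id": [4, 42, 3],
    "competition_name": [
        "FA Women's Super League",
        "FA Women's Super League",
        "FIFA World Cup",
    ],
    "season_name": ["2018/2019", "2019/2020", "2018"],
})

latest = pick_season(toy_comps, "FA Women's Super League")
specific = pick_season(toy_comps, "FA Women's Super League", "2018/2019")
print(latest, specific)
```

Raising on an empty match is deliberate: a misspelled competition name should stop the pipeline rather than surface later as an `IndexError`.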

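Method 2: Fetching via GitHub (if the package fails)

import requests
import pandas as pd

# Raw files live in the statsbomb/open-data repository
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"

def fetch_statsbomb_data(data_type='matches', comp_id=None, season_id=None, match_id=None):
    """
    Fetch StatsBomb open data directly from GitHub
    data_type: 'matches' (needs comp_id and season_id),
               'events' or 'lineups' (need match_id)
    """
    if data_type == 'matches':
        url = f"{BASE_URL}/matches/{comp_id}/{season_id}.json"
    elif data_type in ('events', 'lineups'):
        url = f"{BASE_URL}/{data_type}/{match_id}.json"
    else:
        raise ValueError(f"Unknown data_type: {data_type}")

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on a 404 instead of parsing an error page
    return response.json()

# Fetch matches. IDs come from competitions.json; 37 / 4 should be the
# FA Women's Super League 2018/2019 season, but check the file for current values.
matches_data = fetch_statsbomb_data('matches', comp_id=37, season_id=4)
matches_df = pd.json_normalize(matches_data)
print(matches_df.shape)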
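Fetching from GitHub on every run is slow and unkind to the host, so a local cache is a natural addition to this method. The sketch below is one possible approach, not part of the original pipeline: `cached_json` is a hypothetical helper, and it takes the download function as a parameter so the example can run offline with a fake fetcher.

```python
import json
import tempfile
from pathlib import Path


def cached_json(url, cache_dir, fetch):
    """Return parsed JSON for url, reusing a local copy when one exists.

    fetch is any callable that takes a URL and returns parsed JSON, for
    example: lambda u: requests.get(u, timeout=30).json()
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Build a file name from the last two URL parts,
    # e.g. ".../events/123.json" becomes "events_123.json".
    name = "_".join(url.rstrip("/").split("/")[-2:])
    path = cache_dir / name
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(url)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


# Demonstration with a fake fetcher so the sketch runs without network access.
calls = []


def fake_fetch(url):
    calls.append(url)
    return [{"match_id": 1}]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_json("https://example.com/data/events/1.json", tmp, fake_fetch)
    second = cached_json("https://example.com/data/events/1.json", tmp, fake_fetch)

# The second call is served from disk, so the fetcher ran only once.
print(first == second, len(calls))
```

In real use you would pass a persistent directory and a `requests`-based fetcher; the cache never expires in this sketch, so delete the folder when you want fresh data.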

Part 4: Building the Data Pipeline

Step 1: Clean and Normalize Match Data

def process_matches(raw_matches):
    """
    Transform raw StatsBomb match data into analysis-ready format
    """
    df = pd.json_normalize(raw_matches)

    # Map the flattened JSON column names to short, clear names.
    # Team names sit under 'home_team.home_team_name' / 'away_team.away_team_name';
    # the raw match files do not appear to carry a 'duration' field, so it is not selected.
    column_map = {
        'match_id': 'match_id',
        'match_date': 'match_date',
        'kick_off': 'kick_off',
        'home_team.home_team_name': 'home_team',
        'away_team.away_team_name': 'away_team',
        'home_score': 'home_score',
        'away_score': 'away_score',
        'competition.competition_name': 'competition',
        'season.season_name': 'season',
    }
    matches_clean = df[list(column_map)].rename(columns=column_map)

    # Convert date columns
    matches_clean['match_date'] = pd.to_datetime(matches_clean['match_date'])

    # Create derived metrics
    matches_clean['total_goals'] = matches_clean['home_score'] + matches_clean['away_score']
    matches_clean['goal_diff'] = matches_clean['home_score'] - matches_clean['away_score']
    matches_clean['result'] = matches_clean['goal_diff'].apply(
        lambda x: 'Home Win' if x > 0 else ('Away Win' if x < 0 else 'Draw')
    )

    return matches_clean

# Process the data
matches_processed = process_matches(matches_data)
print(matches_processed.head())
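The dotted column names come from how `pd.json_normalize` flattens nested dictionaries: each nested key is joined to its parent with a dot. A minimal sketch on a hand-written record follows; the values are invented and the key names are my assumption about the open-data match format, so print the columns of your own data to confirm them.

```python
import pandas as pd

# Hand-written record shaped like one entry of a matches file (values invented).
sample = [{
    "match_id": 1,
    "match_date": "2019-01-05",
    "home_team": {"home_team_id": 10, "home_team_name": "Team A"},
    "away_team": {"away_team_id": 20, "away_team_name": "Team B"},
    "home_score": 2,
    "away_score": 1,
}]

flat = pd.json_normalize(sample)

# Nested keys become "parent.child" columns.
print(sorted(flat.columns))
```

Running this kind of check on the first record of a real file is a quick way to catch a wrong column name before it raises a `KeyError` inside the cleaning function.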

Step 2: Fetch and Parse Events Data

def fetch_match_events(match_id):
    """
    Fetch all events for a specific match
    """
    base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/events"
    url = f"{base_url}/{match_id}.json"

    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error fetching match {match_id}")
        return None

def parse_events(events_raw):
    """
    Normalize and flatten nested event JSON
    """
    events_list = []

    for event in events_raw:
        event_type = event.get('type', {}).get('name')
        # Not every event has a location; pad so that indexing never fails
        location = event.get('location') or [None, None]

        event_dict = {
            'event_id': event.get('id'),
            'event_type': event_type,
            'timestamp': event.get('timestamp'),
            'minute': event.get('minute'),
            'second': event.get('second'),
            'period': event.get('period'),
            'player_id': event.get('player', {}).get('id'),
            'player_name': event.get('player', {}).get('name'),
            'team': event.get('team', {}).get('name'),
            'x': location[0],
            'y': location[1],
        }

        # Add type-specific data
        if event_type == 'Pass':
            pass_data = event.get('pass', {})
            end_location = pass_data.get('end_location') or [None, None]
            event_dict['pass_length'] = pass_data.get('length')
            event_dict['pass_angle'] = pass_data.get('angle')
            # A missing outcome means the pass was completed
            event_dict['pass_outcome'] = pass_data.get('outcome', {}).get('name')
            event_dict['pass_end_x'] = end_location[0]
            event_dict['pass_end_y'] = end_location[1]

        elif event_type == 'Shot':
            shot_data = event.get('shot', {})
            event_dict['shot_outcome'] = shot_data.get('outcome', {}).get('name')
            # StatsBomb stores expected goals under the 'statsbomb_xg' key
            event_dict['xg'] = shot_data.get('statsbomb_xg')
            event_dict['shot_type'] = shot_data.get('type', {}).get('name')

        events_list.append(event_dict)

    return pd.DataFrame(events_list)

# Fetch events for first 5 matches
all_events = []
for match_id in matches_processed['match_id'].head(5):
    events_raw = fetch_match_events(match_id)
    if events_raw:
        events_df = parse_events(events_raw)
        events_df['match_id'] = match_id
        all_events.append(events_df)
        print(f"Processed match {match_id}")

# pd.concat raises on an empty list, so stop with a clearer message
if not all_events:
    raise RuntimeError("No events were fetched; check the match IDs and network access")

events_combined = pd.concat(all_events, ignore_index=True)
print(events_combined.head())
print(f"Total events: {len(events_combined)}")
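With events in tabular form, simple team metrics fall out of a few pandas calls. The sketch below computes pass completion per team and relies on the StatsBomb convention that a completed pass carries no outcome, so a missing `pass_outcome` counts as completed. It uses a toy frame with the same column names as the parsed events; `pass_completion` is a hypothetical helper and the rows are invented.

```python
import pandas as pd


def pass_completion(events_df):
    """Completion rate per team, treating a missing pass_outcome as completed."""
    passes = events_df[events_df["event_type"] == "Pass"]
    completed = passes["pass_outcome"].isna()
    return completed.groupby(passes["team"]).mean().rename("pass_completion")


# Toy events: team A completes 1 of 2 passes, team B completes 2 of 2.
toy_events = pd.DataFrame({
    "event_type": ["Pass", "Pass", "Shot", "Pass", "Pass"],
    "team": ["A", "A", "A", "B", "B"],
    "pass_outcome": [None, "Incomplete", None, None, None],
})

completion = pass_completion(toy_events)
print(completion)
```

Filtering to passes first matters here: shots and other events also have an empty `pass_outcome`, and counting them would inflate the rate.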

Step 3: Data Validation and Quality Checks

def validate_data(matches_df, events_df):
    """
    Perform quality checks on datasets
    """
    print("=== MATCHES DATA VALIDATION ===")
    print(f"Shape: {matches_df.shape}")
    print(f"Date range: {matches_df['match_date'].min()} to {matches_df['match_date'].max()}")
    print(f"Missing values:\n{matches_df.isnull().sum()}")
    print(f"Competitions: {matches_df['competition'].unique()}")

    print("\n=== EVENTS DATA VALIDATION ===")
    print(f"Shape: {events_df.shape}")
    print(f"Event types: {events_df['event_type'].value_counts()}")
    print(f"Matches covered: {events_df['match_id'].nunique()}")
    print(f"Missing values:\n{events_df.isnull().sum()}")

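    # Check for data consistency between the two tables
    match_ids_in_events = set(events_df['match_id'].unique())
    match_ids_in_matches = set(matches_df['match_id'].unique())

    print(f"\nMatches with events: {len(match_ids_in_events)}")
    print(f"Total matches: {len(match_ids_in_matches)}")
    print(f"Matches without events: {len(match_ids_in_matches - match_ids_in_events)}")
    print(f"Event match IDs missing from matches: {len(match_ids_in_events - match_ids_in_matches)}")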

validate_data(matches_processed, events_combined)
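Printed summaries are easy to skim past, so it can help to add a check that returns the offending rows. The sketch below flags events whose coordinates fall outside the pitch, assuming StatsBomb's coordinate system of 120 units long by 80 units wide. `out_of_bounds` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd

# Assumed StatsBomb pitch coordinate ranges: x in [0, 120], y in [0, 80].
PITCH_LENGTH, PITCH_WIDTH = 120, 80


def out_of_bounds(events_df):
    """Return rows whose x/y location lies outside the pitch.

    Events without a location are ignored, since many event types have none.
    """
    located = events_df.dropna(subset=["x", "y"])
    inside = located["x"].between(0, PITCH_LENGTH) & located["y"].between(0, PITCH_WIDTH)
    return located[~inside]


# Toy events: one valid location, one with x beyond the pitch, one with no location.
toy_locations = pd.DataFrame({
    "x": [60.0, 130.0, None],
    "y": [40.0, 20.0, None],
})

bad_rows = out_of_bounds(toy_locations)
print(len(bad_rows))
```

An empty result is the expected outcome on clean data; anything else usually points to a parsing mistake, such as swapped coordinates, rather than a problem in the source files.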

Part 5: Exploratory Analysis and Insights

Analysis 1: Home Advantage Investigation


# Home advantage analysis
home_stats = matches_processed.groupby('result').size()
print("Match Results Distribution:")
print(home_stats)

# Calculate home win percentage
home_wins = (matches_processed['goal_diff'] > 0).sum()
home_advantage_pct = (home_wins / len(matches_processed)) * 100
print(f"\nHome team win percentage: {home_advantage_pct:.1f}%")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Result distribution
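The home win percentage computed above covers only one of the three outcomes. A minimal sketch of getting all three shares in one call follows, using a toy frame with the same `result` column; `result_shares` is a hypothetical helper and the rows are invented.

```python
import pandas as pd


def result_shares(matches_df):
    """Percentage of matches per result label, rounded to one decimal place."""
    return (matches_df["result"].value_counts(normalize=True) * 100).round(1)


# Toy results: 2 home wins, 1 draw, 1 away win.
toy_results = pd.DataFrame({
    "result": ["Home Win", "Home Win", "Draw", "Away Win"],
})

shares = result_shares(toy_results)
print(shares)
```

Comparing the home and away shares side by side is a fairer reading of home advantage than the home win rate alone, because it shows how much of the remainder is draws.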
