Edge Lab

Posted on Jul 2

Building a Sports Data Pipeline: Python, StatsBomb API, and pandas in Practice

#tutorial

The Hook: Why Sports Data Matters More Than Ever

Last season, a mid-tier English football club made headlines when they announced a dramatic shift in their recruitment strategy. Their secret? A Python script that analyzed 15,000+ player actions across 500+ matches. Within two years, they'd climbed 14 positions in the league using data-driven insights that cost less than a single journeyman player's salary.

This isn't fiction anymore. Sports data analysis has moved from luxury to necessity, and the best part? You don't need a six-figure budget to get started. With Python, open APIs, and publicly available datasets, you can build enterprise-grade sports analytics pipelines in your spare time.

In this tutorial, I'll walk you through building a complete sports data pipeline that ingests StatsBomb data, processes it with pandas, and surfaces actionable insights. By the end, you'll have a reusable framework you can apply to any sport.

Part 1: Understanding Your Data Sources

Before writing a single line of code, you need to understand where sports data lives and what you're actually working with.

The Sports Data Ecosystem

The sports data landscape has three main tiers:

Tier 1 - Commercial APIs (StatsBomb, Opta, InStat)

Highest quality, most comprehensive event data
Premium pricing ($5,000-$50,000+ annually)
Real-time or near real-time availability
StatsBomb offers free educational access

Tier 2 - Public/Semi-Public APIs (RapidAPI, ESPN, Football-Data.org)

Solid data quality for most use cases
Free or freemium pricing
Some restrictions on rate limits and historical depth
Great for learning and non-commercial projects

Tier 3 - Web Scraping (Understat, FBref, WhoScored)

Variable quality depending on source
Requires legal and ethical consideration
No API = more maintenance overhead
Best for supplementary data

For this tutorial, I'm using StatsBomb's open data, which is freely available on GitHub and includes detailed event-level data for 3,000+ professional matches.

What Data Will We Analyze?

StatsBomb provides event-level data with approximately 500 attributes per match, including:

Positional data: x/y coordinates for every action
Event types: 28+ categories (pass, shot, tackle, dribble, etc.)
Outcome data: success/failure, pressure, defensive actions
Player/team metadata: IDs, names, positions, jersey numbers
Match context: dates, competition, stadiums, lineups

A single match generates 2,000-3,000 events. Our pipeline will aggregate this granular data into meaningful statistics.
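
To make the event structure concrete, here is a minimal sketch of how nested events flatten into dotted column names with pandas. The two sample events are hand-written and abbreviated for illustration (they are not copied from a real match), though the field names follow the layout of the StatsBomb open-data event files.

```python
import pandas as pd

# Hand-written, abbreviated sample events (illustrative values only)
sample_events = [
    {
        "id": "a1",
        "type": {"id": 30, "name": "Pass"},
        "team": {"id": 1, "name": "Home FC"},
        "player": {"id": 10, "name": "Player A"},
        "location": [60.0, 40.0],
        "pass": {"end_location": [75.0, 44.0], "length": 15.5},
    },
    {
        "id": "b2",
        "type": {"id": 16, "name": "Shot"},
        "team": {"id": 1, "name": "Home FC"},
        "player": {"id": 9, "name": "Player B"},
        "location": [108.0, 38.0],
        "shot": {"outcome": {"id": 97, "name": "Goal"}},
    },
]

# Nested dicts become dotted columns; coordinate lists stay as Python lists
flat = pd.json_normalize(sample_events)
print(sorted(flat.columns))
print(flat[["type.name", "player.name", "shot.outcome.name"]])
```

Fields that only exist for one event type (such as `shot.outcome.name`) come out as NaN on every other row, which is why the cleaning step later has to handle missing values explicitly.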

Part 2: Environment Setup and Core Libraries

Let's build the foundation for a production-ready pipeline.

Installation

# Create virtual environment
python -m venv sports_pipeline
source sports_pipeline/bin/activate  # On Windows: sports_pipeline\Scripts\activate

# Install required packages
pip install pandas numpy matplotlib seaborn requests scipy scikit-learn jupyter

# Optional: for advanced visualization
pip install plotly kaleido

# Optional: for StatsBomb-specific convenience
pip install statsbombpy

Import Structure

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json
import requests
from collections import defaultdict
import warnings

warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

Key Libraries Explained

pandas: DataFrames are a good fit for sports event data once the nested JSON is flattened. You'll spend about 60% of your time here.
numpy: Fast numerical operations and statistical calculations
matplotlib/seaborn: Publication-quality visualizations
requests: Clean API interactions
scipy: Statistical tests and distributions

Part 3: Building the Data Pipeline

Now for the practical implementation. Here's where theory meets code.

Step 1: Fetching StatsBomb Data

StatsBomb's open data is hosted on GitHub. We'll fetch match data programmatically:

class StatsBombPipeline:
    """Production-ready StatsBomb data pipeline"""

    BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"

    def __init__(self):
        self.matches = None
        self.events = None
        self.lineups = None

    def fetch_competitions(self):
        """Fetch available competitions"""
        url = f"{self.BASE_URL}/competitions.json"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail fast on a 404 instead of parsing an error page
        competitions = response.json()
        return pd.DataFrame(competitions)

    def fetch_matches(self, competition_id, season_id):
        """Fetch all matches for a specific season"""
        url = f"{self.BASE_URL}/matches/{competition_id}/{season_id}.json"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # a wrong competition/season ID surfaces here
        matches = response.json()

        # Flatten nested JSON to DataFrame
        matches_df = pd.json_normalize(matches)
        return matches_df

    def fetch_events(self, match_id):
        """Fetch detailed event data for a single match"""
        url = f"{self.BASE_URL}/events/{match_id}.json"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # caught per match by fetch_season_events
        events = response.json()

        # Flatten nested structure
        events_df = pd.json_normalize(events)
        events_df['match_id'] = match_id
        return events_df

    def fetch_season_events(self, competition_id, season_id, limit=None):
        """Fetch events for entire season"""
        matches_df = self.fetch_matches(competition_id, season_id)

        all_events = []
        match_list = matches_df['match_id'].tolist()

        if limit:
            match_list = match_list[:limit]

        print(f"Fetching events for {len(match_list)} matches...")

        for idx, match_id in enumerate(match_list):
            try:
                events_df = self.fetch_events(match_id)
                all_events.append(events_df)

                if (idx + 1) % 10 == 0:
                    print(f"✓ Processed {idx + 1}/{len(match_list)} matches")

            except Exception as e:
                print(f"✗ Error fetching match {match_id}: {e}")
                continue

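        # pd.concat raises on an empty list, so stop with a clearer message
        if not all_events:
            raise RuntimeError("No events could be fetched; check the competition and season IDs")

        # Concatenate and reset index
        events_df = pd.concat(all_events, ignore_index=True)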
        self.events = events_df
        self.matches = matches_df

        return events_df

# Initialize pipeline
pipeline = StatsBombPipeline()

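# Fetch competitions and look up the IDs you need
# (in the open data, the Premier League is competition_id 2; 37 is the FA Women's Super League)
competitions = pipeline.fetch_competitions()
print(competitions[['competition_id', 'season_id', 'competition_name', 'season_name']].drop_duplicates())

# Fetch Premier League 2015/16 (season_id 27) as a sample;
# confirm both IDs against the table printed above before running
events = pipeline.fetch_season_events(
    competition_id=2,
    season_id=27,
    limit=50  # First 50 matches for this example
)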

print(f"Loaded {len(events)} events across {events['match_id'].nunique()} matches")
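
Re-downloading the same match files on every run is slow and puts needless load on GitHub. Below is a minimal sketch of an on-disk cache. The helper name `fetch_json_cached`, the `sb_cache` directory, and the injectable `fetcher` argument are all assumptions made for illustration; none of them is part of the pipeline class above.

```python
import hashlib
import json
from pathlib import Path


def _download(url):
    """Default fetcher: HTTP GET that raises on 4xx/5xx responses."""
    # Imported lazily so the cache logic can be exercised offline with a stub fetcher
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_json_cached(url, cache_dir="sb_cache", fetcher=_download):
    """Return parsed JSON for url, reading from disk when already cached."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Hash the URL so any URL maps to a safe, fixed-length file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_path / f"{key}.json"

    if target.exists():
        return json.loads(target.read_text(encoding="utf-8"))

    data = fetcher(url)
    target.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Inside `fetch_events`, the `requests.get` and `response.json()` pair could then be swapped for a single `fetch_json_cached(url)` call. Historical match files rarely change, so the sketch has no expiry logic; delete the cache directory to force a refresh.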

Step 2: Data Cleaning and Enrichment

Raw sports data is messy. Let's clean it:

def clean_and_enrich_events(events_df):
    """Clean and add useful features to event data"""

    df = events_df.copy()

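    # Convert timestamps: StatsBomb stores them as "HH:MM:SS.mmm" offsets
    # from the start of each period, so a timedelta is the natural type
    df['timestamp'] = pd.to_timedelta(df['timestamp'])
    df['minute'] = df['minute'].fillna(0).astype(int)

    # Extract pass-specific columns. An outcome is recorded only for
    # unsuccessful passes, so a Pass event with no outcome was completed.
    is_pass = df['type.name'] == 'Pass'
    df['pass_completed'] = (is_pass & df['pass.outcome.name'].isna()).astype(int)

    def _pass_length(row):
        start, end = row.get('location'), row.get('pass.end_location')
        # Non-pass events carry NaN here rather than a coordinate list
        if not isinstance(start, list) or not isinstance(end, list):
            return np.nan
        return np.hypot(end[0] - start[0], end[1] - start[1])

    df['pass_length'] = df.apply(_pass_length, axis=1)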

    # Extract shot data
    df['shot_result'] = df['shot.outcome.name'].fillna('Not Shot')

    # Team and player names
    df['team_name'] = df['team.name'].fillna('Unknown')
    df['player_name'] = df['player.name'].fillna('Unknown')
    df['position'] = df['position.name'].fillna('Unknown')

    # Event type simplification
    df['event_type'] = df['type.name'].fillna('Other')

    # Sort by match, period, and timestamp (timestamps restart each period)
    df = df.sort_values(['match_id', 'period', 'timestamp']).reset_index(drop=True)

    return df

events_clean = clean_and_enrich_events(events)
print(events_clean[['timestamp', 'team_name', 'player_name', 'event_type', 'pass_completed']].head(10))
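
Row-wise `apply` gets slow once a full season of events is loaded. As a sketch of a vectorized alternative, the coordinate lists can be split into numeric arrays first and the distance computed with NumPy in one pass. The function name `add_pass_length_vectorized` is hypothetical and the sample rows are made up; the column names `location` and `pass.end_location` are the ones `pd.json_normalize` produces in the fetch step.

```python
import numpy as np
import pandas as pd


def add_pass_length_vectorized(df):
    """Add a pass_length column using array math instead of a per-row function."""
    out = df.copy()

    def split_xy(series):
        # Lists become (x, y); anything else (NaN on non-pass rows) becomes (NaN, NaN)
        pairs = [v[:2] if isinstance(v, list) else (np.nan, np.nan) for v in series]
        return np.asarray(pairs, dtype=float).reshape(-1, 2)

    start = split_xy(out["location"])
    end = split_xy(out["pass.end_location"])

    # Euclidean distance between start and end; NaN propagates for non-passes
    out["pass_length"] = np.hypot(end[:, 0] - start[:, 0], end[:, 1] - start[:, 1])
    return out


# Made-up rows: one pass, one event without an end location, one with no location
sample = pd.DataFrame({
    "location": [[10.0, 20.0], [50.0, 40.0], np.nan],
    "pass.end_location": [[13.0, 24.0], np.nan, np.nan],
})
result = add_pass_length_vectorized(sample)
print(result["pass_length"].tolist())
```

The list comprehension is still a Python loop, but it only does a cheap type check per row; the arithmetic itself runs on whole arrays.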

Step 3: Aggregating to Team-Level Statistics

Now we transform event-level data into actionable metrics:


def calculate_team_stats(events_df, matches_df):
    """Calculate comprehensive team statistics"""

    # Initialize results dictionary
    team_stats = defaultdict(lambda: {
        'matches_played': 0,
        'passes_attempted': 0,
        'passes_completed': 0,
        'pass_accuracy': 0,
        'shots': 0,
        'shots_on_target': 0,
        'tackles': 0,
        'interceptions': 0,
        'fouls_committed': 0,
        'goals': 0,
        'possession_time': 0
    })

    # Approximate match length (in minutes) from the last recorded event per match;
    # the open-data match files do not include a 'duration' column
    match_durations = events_df.groupby('match_id')['minute'].max().to_dict()

    # Iterate through events
    for _, event in events_df.iterrows():
        team = event['team_name']
        event_type = event['event_type']
        match_id = event['match_id']

        # Passes
        if event_type == 'Pass':
            team_stats[team]['passes_attempted'] += 1
            if event['pass_completed'] == 1:
                team_stats[team]['passes_completed'] += 1

        # Shots
        elif event_type == 'Shot':
            team_stats[team]['shots'] += 1
            if event['shot_result'] in ['Saved', 'Goal']:
                team_stats[team]['shots_on_target'] += 1
            if event['shot_result'] == 'Goal':
                team_stats[team]['goals'] += 1

        # Defensive actions
        elif event_type == 'Tackle':

DEV Community