Sunkanmi Otitolaye
⚽ The Data XI: Building a Modern Football Data Platform — Chapter 1: Taming the Data Beast

Introduction

Football has always been a game of passion. For me, it's also been a source of frustration, especially when it comes to betting and Fantasy Leagues. Relying on gut feelings and basic form guides felt like guesswork. I knew there had to be a better way—a data-driven way. In today’s world, football is also a game of data. From scouting to tactics to betting, data shapes how the sport is played, managed, and monetized.

This project, The Data XI, is my journey to build that better way. It's an end-to-end, open-source data platform for football analytics, built with the same tools used in real-world data engineering.

In this series, I’ll document everything: from gathering millions of data points to building a cloud-ready ELT pipeline with Postgres, Airflow, and dbt, and eventually powering predictive models and web applications. This is Chapter 1: The quest for data.

Acquiring the Data

Every data platform begins with a simple question: where do you get the data? I didn't want just any data. I needed:

  • Comprehensive Coverage: All 5 big European leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1).

  • Historical Depth: Enough seasons to uncover meaningful trends and patterns.

  • Granular Detail: More than just final scores. I wanted lineups, advanced stats, heatmaps, betting odds, and individual player events.

To achieve this, I combined data from several public sources. FBRef, for instance, was a goldmine for advanced metrics like xG, along with detailed possession and defensive stats. For more granular, event-level data (like shot maps and player heatmaps), I used a variety of publicly available sports data APIs to gather the necessary JSON files.

The Scale of the Dataset

This wasn't a weekend project. The final raw dataset includes:

  • 7 full seasons of football data
  • Top 5 European leagues
  • ~380 matches per season, per league
  • 28 distinct JSON/CSV files per match

That’s tens of thousands of files and well over 1.5 million rows of raw football data before the real work even begins.


Organizing the Chaos

At this scale, file management is an engineering challenge in itself. A flat folder of 100,000 files is useless. I designed a hierarchical folder structure that is both human-readable and programmatically accessible:

Done/
  ├── PremierLeague/
  │     ├── 2020-21/
  │     │     ├── GW1/
  │     │     │     ├── 2023-08-19ManArs/   ← combo_id (unique match ID)
  │     │     │     │    ├── game_summary.csv
  │     │     │     │    ├── match_stats.json
  │     │     │     │    └── ...
  │     │     ├── GW2/
  │     │     └── ...
  └── LaLiga/
        └── ...

This Tournament → Season → Gameweek → Match structure means I can pinpoint the exact files for any given game instantly, which is critical for our ETL pipeline. It also means that if the scraping fails for any reason, I can quickly identify which games (using the combo_id) or gameweeks are affected.
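
To make that concrete, here's a minimal sketch of how the hierarchy can be queried with nothing more than Python's pathlib. The root folder name matches the tree above, but the find_match helper is illustrative, not the project's actual code:

```python
from pathlib import Path

ROOT = Path("Done")  # top of the folder tree shown above

def find_match(combo_id: str) -> list[Path]:
    """Collect every raw file for one match by searching the
    league/season/gameweek folders for its combo_id directory."""
    return sorted(ROOT.glob(f"*/*/GW*/{combo_id}/*"))

# All ~28 JSON/CSV files for a single fixture
for f in find_match("2023-08-19ManArs"):
    print(f)
```

Because the combo_id folder always sits at the same depth, a failed or partial scrape shows up as a missing or half-empty match folder, which makes backfilling straightforward.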

Lessons from the Trenches

Acquiring this data was a project in itself. Here are a few lessons learned:

  • APIs Can Be a Black Box: Many public sports data APIs are powerful but often undocumented. It took significant network analysis and trial-and-error to understand request patterns and map out the necessary endpoints.
  • Schemas Are Not Sacred: JSON is schema-less by default, so fields can appear, disappear, or change shape between seasons (schema evolution/drift). The ETL script has to be resilient to this.
  • Scraping Requires Patience: Downloading and parsing this volume of data required building a robust, batched process with error handling and retries to avoid getting blocked or losing data (a minimal retry sketch appears at the end of this section).
  • Incremental Updates: The process had to make it easy to add new seasons and gameweeks without breaking existing data.
  • Lots of Data Sources: Combining and grouping data from multiple sources for a single match led to the creation of the combo_id: the date the game was played concatenated with the first three characters of the home and away team names. It turned out to be a surprisingly reliable unique identifier (a small sketch follows this list).
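
For concreteness, here's a tiny sketch of how such an ID can be derived. The function name and team-name handling are simplifications of whatever the real scraper does:

```python
def make_combo_id(match_date: str, home_team: str, away_team: str) -> str:
    """Build a match ID from the kickoff date plus the first three
    characters of the home and away team names."""
    return f"{match_date}{home_team[:3]}{away_team[:3]}"

# Team names here are illustrative; the output matches the folder-tree example
make_combo_id("2023-08-19", "Manchester City", "Arsenal")  # '2023-08-19ManArs'
```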

Each of these required careful design, and the lessons will feed directly into how I build the pipeline.
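
As an illustration of the scraping lesson above, here is a minimal retry-and-backoff sketch. The fetch_json helper, the delay values, and the list of endpoint URLs are assumptions for the example, not the scraper I actually run:

```python
import time
import requests

def fetch_json(url: str, retries: int = 3, backoff: float = 5.0) -> dict | None:
    """Fetch a JSON payload, retrying with an increasing delay on failure."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            time.sleep(backoff * attempt)  # wait a bit longer each retry
    return None  # caller records the failure so the match can be re-scraped

# Process one batch (e.g. a gameweek) at a time, pausing between requests
batch_of_match_urls: list[str] = []  # filled with one gameweek's endpoints
for url in batch_of_match_urls:
    data = fetch_json(url)
    time.sleep(2)  # be polite to the source
```

Anything that still fails after all retries gets logged against its combo_id, which is exactly what the folder structure above makes easy to backfill.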


What’s Next

The beast has been captured; now it's time to tame it. The data is acquired and organized, ready to be piped into our platform.

In Chapter 2, we'll get our hands dirty with the core infrastructure:

  • Containerizing a PostgreSQL database with Docker to serve as our data warehouse.
  • Designing the three core schemas: raw, staging, and analytics.
  • Writing the first Python ETL script to load our thousands of raw files into Postgres.
  • Incorporating Airflow for orchestration and dbt for transformations.

✍️ If you’re into football, data, or engineering, follow my journey! I’d love to hear your thoughts. What’s the first question you would try to answer with this dataset?

📌 Next up: Chapter 2: Building the Raw Data Warehouse with Postgres + Docker
