<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sunkanmi Otitolaye</title>
    <description>The latest articles on DEV Community by Sunkanmi Otitolaye (@leosuky).</description>
    <link>https://dev.to/leosuky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3543605%2F8dc53efe-755d-4d35-945d-a38ec6fe076e.png</url>
      <title>DEV Community: Sunkanmi Otitolaye</title>
      <link>https://dev.to/leosuky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leosuky"/>
    <language>en</language>
    <item>
      <title>⚽ The Data XI: Building a Modern Football Data Platform — Chapter 1: Taming the Data Beast</title>
      <dc:creator>Sunkanmi Otitolaye</dc:creator>
      <pubDate>Mon, 06 Oct 2025 10:37:58 +0000</pubDate>
      <link>https://dev.to/leosuky/the-data-xi-building-a-modern-football-data-platform-chapter-1-taming-the-data-beast-5bg7</link>
      <guid>https://dev.to/leosuky/the-data-xi-building-a-modern-football-data-platform-chapter-1-taming-the-data-beast-5bg7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Football has always been a game of passion. For me, it's also been a source of frustration, especially when it comes to betting and Fantasy Leagues. Relying on gut feelings and basic form guides felt like guesswork. I knew there had to be a better way—a data-driven way. In today’s world, football is also a game of data. From scouting to tactics to betting, data shapes how the sport is played, managed, and monetized.&lt;/p&gt;

&lt;p&gt;This project, &lt;strong&gt;The Data XI&lt;/strong&gt;, is my journey to build that better way. It's an end-to-end, open-source data platform for football analytics, built with the same tools used in real-world data engineering.&lt;/p&gt;

&lt;p&gt;In this series, I’ll document everything: from gathering millions of data points to building a cloud-ready ELT pipeline with Postgres, Airflow, and dbt, and eventually powering predictive models and web applications. This is Chapter 1: The quest for data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Acquiring the Data
&lt;/h2&gt;

&lt;p&gt;Every data platform begins with a simple question: where do you get the data? I didn't want just any data. I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comprehensive Coverage&lt;/strong&gt;: All 5 big European leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Historical Depth&lt;/strong&gt;: Enough seasons to uncover meaningful trends and patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Granular Detail&lt;/strong&gt;: More than just final scores. I wanted lineups, advanced stats, heatmaps, betting odds, and individual player events.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To achieve this, I combined data from several public sources. &lt;strong&gt;FBRef&lt;/strong&gt;, for instance, was a goldmine for advanced stats like xG and detailed possession and defensive stats. For more granular, event-level data—like shot maps and player heatmaps—I utilized a variety of publicly available sports data APIs to gather the necessary JSON files.&lt;/p&gt;
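&lt;p&gt;As a rough sketch of that gathering step, here is a minimal JSON fetch with retries. The endpoint URL and function name are placeholders, not the actual sources or scripts used; only the retry pattern is the point.&lt;/p&gt;

```python
import json
import time
import urllib.request


def fetch_json(url, retries=3, backoff=2.0):
    """Fetch and parse one JSON payload, retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except (OSError, ValueError):  # URLError / timeout / malformed JSON
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(backoff * 2 ** attempt)  # wait longer each retry
```

&lt;p&gt;In practice each response gets written straight to disk in the match's folder, so a failed run can resume without re-downloading everything.&lt;/p&gt;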

&lt;h2&gt;
  
  
  The Scale of the Dataset
&lt;/h2&gt;

&lt;p&gt;This wasn't a weekend project. The final raw dataset includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7 full seasons&lt;/strong&gt; of football data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top 5&lt;/strong&gt; European leagues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~380 matches per season&lt;/strong&gt; per league (the 18-team Bundesliga plays 306)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28 distinct&lt;/strong&gt; JSON/CSV files per match&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s tens of thousands of files and well over 1.5 million rows of raw football data before the real work even begins.&lt;/p&gt;
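&lt;p&gt;A quick back-of-the-envelope check of that scale, treating every league as a full 380-match season (an upper bound, since the 18-team Bundesliga plays 306):&lt;/p&gt;

```python
seasons = 7
leagues = 5
matches_per_season = 380  # upper bound; the 18-team Bundesliga plays 306
files_per_match = 28

matches = seasons * leagues * matches_per_season  # 13,300 matches
files = matches * files_per_match                 # 372,400 files at the ceiling
print(f"{matches:,} matches, up to {files:,} raw files")
```

&lt;p&gt;Not every file type exists for every match, so the real count lands below that ceiling, but it is still far too much to manage by hand.&lt;/p&gt;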

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x5g2kipk1e2puoet7j0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x5g2kipk1e2puoet7j0.gif" alt="mind-blown" width="196" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Organizing the Chaos
&lt;/h2&gt;

&lt;p&gt;At this scale, file management is an engineering challenge in itself. A flat folder of 100,000 files is useless. I designed a hierarchical folder structure that is both human-readable and programmatically accessible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Done/
  ├── PremierLeague/
  │     ├── 2023-24/
  │     │     ├── GW1/
  │     │     │     ├── 2023-08-19ManArs/   ← combo_id (unique match ID)
  │     │     │     │    ├── game_summary.csv
  │     │     │     │    ├── match_stats.json
  │     │     │     │    └── ...
  │     │     ├── GW2/
  │     │     └── ...
  └── LaLiga/
        └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;strong&gt;Tournament&lt;/strong&gt; → &lt;strong&gt;Season&lt;/strong&gt; → &lt;strong&gt;Gameweek&lt;/strong&gt; → &lt;strong&gt;Match&lt;/strong&gt; structure means I can pinpoint the exact files for any given game instantly, which is critical for the ETL pipeline. It also means that if scraping fails for any reason, I can easily identify which games (using the &lt;code&gt;combo_id&lt;/code&gt;) or gameweeks are affected.&lt;/p&gt;
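&lt;p&gt;With the layout fixed, programmatic access is just path composition. A minimal sketch using &lt;code&gt;pathlib&lt;/code&gt; (the directory names mirror the tree above; the helper names are mine, not from the actual pipeline):&lt;/p&gt;

```python
from pathlib import Path


def match_dir(root, tournament, season, gameweek, combo_id):
    """Compose the directory that holds one match's raw files."""
    return Path(root) / tournament / season / gameweek / combo_id


def match_files(root, tournament, season, gameweek, combo_id):
    """List every raw JSON/CSV file stored for one match."""
    folder = match_dir(root, tournament, season, gameweek, combo_id)
    return sorted(p for p in folder.glob("*") if p.suffix in {".json", ".csv"})
```

&lt;p&gt;The mapping also inverts cleanly: given any file path, its parent directories identify the tournament, season, gameweek, and match.&lt;/p&gt;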

&lt;h2&gt;
  
  
  Lessons from the Trenches
&lt;/h2&gt;

&lt;p&gt;Acquiring this data was a project in itself. Here are a few lessons learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;APIs Can Be a Black Box&lt;/strong&gt;: Many public sports data APIs are powerful but often undocumented. It took significant network analysis and trial-and-error to understand request patterns and map out the necessary endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schemas Are Not Sacred&lt;/strong&gt;: JSON is schema-less by default, so fields can appear, disappear, or change type between payloads (schema evolution/drift). The ETL script has to be resilient to this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping Requires Patience&lt;/strong&gt;: Downloading and parsing this volume of data required building a robust, batched process with error handling and retries to avoid getting blocked or losing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updates Must Be Planned For&lt;/strong&gt;: I built the process so I can add new seasons and gameweeks without breaking existing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lots of Data Sources&lt;/strong&gt;: Combining data from multiple sources for a single match led to the creation of the &lt;code&gt;combo_id&lt;/code&gt;: the match date concatenated with the first three characters of the home and away team names. It turned out to be a surprisingly reliable unique identifier.&lt;/li&gt;
&lt;/ul&gt;
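&lt;p&gt;The &lt;code&gt;combo_id&lt;/code&gt; rule from that last point is simple enough to sketch directly (an illustrative reconstruction; the real script may normalize team names differently):&lt;/p&gt;

```python
def make_combo_id(match_date, home_team, away_team):
    """Concatenate the ISO match date with the first three characters
    of the home and away team names, e.g. '2023-08-19' + 'Man' + 'Ars'."""
    return f"{match_date}{home_team[:3]}{away_team[:3]}"
```

&lt;p&gt;Two fixtures rarely share the same date &lt;em&gt;and&lt;/em&gt; both three-letter prefixes, which is why this composite key stays unique in practice without a lookup table.&lt;/p&gt;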

&lt;p&gt;Each of these required careful design, and the lessons will feed directly into how I build the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz0zikvdol5xkkshlnpf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz0zikvdol5xkkshlnpf.gif" alt="applause" width="300" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The beast has been captured; now it's time to tame it. The data is acquired and organized, ready to be piped into our platform.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Chapter 2&lt;/strong&gt;, we'll get our hands dirty with the core infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containerizing a &lt;strong&gt;PostgreSQL&lt;/strong&gt; database with Docker to serve as our data warehouse.&lt;/li&gt;
&lt;li&gt;Designing the three core schemas: &lt;code&gt;raw&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, and &lt;code&gt;analytics&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writing the first Python ETL script to load our thousands of raw files into Postgres.&lt;/li&gt;
&lt;li&gt;Incorporating &lt;strong&gt;Airflow&lt;/strong&gt; for orchestration and &lt;strong&gt;dbt&lt;/strong&gt; for transformations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✍️ If you’re into football, data, or engineering, follow my journey! I’d love to hear your thoughts. What’s the first question you would try to answer with this dataset?&lt;/p&gt;

&lt;p&gt;📌 Next up: &lt;strong&gt;Chapter 2&lt;/strong&gt;: Building the Raw Data Warehouse with Postgres + Docker&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>webscraping</category>
      <category>python</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
