James K.

Building an Automated Data Pipeline: Injuries vs Performance in the Premier League

Having spent the better part of the last two months learning the basics of data engineering, I wanted to build a project to test my understanding of the concepts. This article documents the process of building an automated ETL pipeline using Python, PostgreSQL, and Apache Airflow — including the bugs, the lessons learned, and the insights drawn from the data.

I settled on analyzing whether player injuries contributed to the worst season in Manchester United’s recent history. The club cited injuries as a major reason for their poor performance, but I wasn’t convinced. Since data rarely lies, I decided to test that claim myself by combining injury data and team performance statistics, using my newly acquired data engineering skills.

What started as a curiosity about one team quickly turned into a league-wide project.

Tech Stack

  • Language: Python
  • Database: PostgreSQL
  • Orchestration: Apache Airflow
  • Libraries: BeautifulSoup, Requests, psycopg2
  • Tools: ScraperAPI

The GitHub repository for this project can be found here.

ETL Process

Data Extraction

The project began by obtaining raw data from two main sources:

  • Premier Injuries – for player injury data
  • Understat – for match and player statistics

a. Injury Data

To scrape injury data, I initially tried Selenium, but quickly ran into Cloudflare’s human verification challenge (the “I’m not a robot” checkbox).
After experimenting with Playwright, I found a more reliable solution — ScraperAPI, a service that handles CAPTCHAs and rotating proxies automatically.

Using ScraperAPI with BeautifulSoup, I extracted HTML content and parsed the table containing injury data:

import requests
from bs4 import BeautifulSoup

def extract_injury_data():
    # ScraperAPI renders the page and handles the Cloudflare challenge and proxy rotation
    payload = {
        'api_key': 'your_api_key',
        'url': 'https://www.premierinjuries.com/injury-table.php',
        'render': 'true'
    }

    response = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Basic validation — check that the table or expected content exists
    if not soup.find("table"):
        raise ValueError("Expected injury table not found in the page HTML.")

    return soup


This function fetches the rendered page via ScraperAPI and returns the parsed HTML tree for later processing.
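For later processing, the rows of that table need to be turned into structured records. A minimal sketch of what that parsing step might look like is below; the helper name, column order, and row markup are assumptions, since the real Premier Injuries table layout may differ:

def parse_injury_rows(soup):
    # Hypothetical sketch: convert the injury table into a list of dicts
    injuries = []
    table = soup.find("table")
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 4:
            continue  # skip spacer or malformed rows
        injuries.append({
            "team": cells[0],
            "player": cells[1],
            "injury": cells[2],
            "expected_return": cells[3],
        })
    return injuries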

b. Performance Data

I then needed the match statistics so that I could compare the team's performance when certain players were unavailable through injury with their performance when those players were fit. I scraped this data from the Understat website by extracting the JSON that is embedded inside the JavaScript on the page. I first sent the HTTP request, then parsed the HTML response with BeautifulSoup. Next, I extracted the script tags from the page, since that is how Understat embeds the match data, and looped through each one looking for the datesData variable where the match information is stored. The JavaScript on the page looks something like this:

var datesData = JSON.parse('{"2024-08-16":[{"id":"12345",...}]}');

The loop extracts the JSON string between JSON.parse(' and '), giving something like {"date":"2024-08-16"}.

The resulting JSON string contains escaped characters, such as apostrophes and backslashes, that need to be decoded before the string can be parsed into a Python data structure the function can return. Extracting player statistics was a fairly similar process, except that I had to look for the playersData variable instead.


import json
import requests
from bs4 import BeautifulSoup

def extract_match_data():
    # Send HTTP request
    url = 'https://understat.com/league/EPL/2024'
    res = requests.get(url)
    # Parse response
    soup = BeautifulSoup(res.content, 'lxml')

    # Loop through <script> tags to locate the datesData variable and extract the JSON string
    script_tags = soup.find_all('script')
    for script in script_tags:
        if script.string and 'datesData' in script.string:
            raw_text = script.string
            start = raw_text.find("JSON.parse('") + len("JSON.parse('")
            end = raw_text.find("')", start)
            json_str = raw_text[start:end]
            # Decode escaped characters before parsing into a Python structure
            decoded_json = json_str.encode('utf8').decode('unicode_escape')
            return json.loads(decoded_json)

    # Raise an error if no data is extracted
    raise RuntimeError("Failed to extract match data from Understat.")

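For completeness, the player-stats extractor (extract_understat_player_stats, used in the DAG further down) follows the same pattern; this is a hedged sketch rather than the exact code in the repo:

def extract_understat_player_stats():
    # Same approach as extract_match_data, but targeting the playersData variable
    url = 'https://understat.com/league/EPL/2024'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')

    for script in soup.find_all('script'):
        if script.string and 'playersData' in script.string:
            raw_text = script.string
            start = raw_text.find("JSON.parse('") + len("JSON.parse('")
            end = raw_text.find("')", start)
            decoded = raw_text[start:end].encode('utf8').decode('unicode_escape')
            return json.loads(decoded)

    raise RuntimeError("Failed to extract player stats from Understat.")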

Data Transformation

Now that I had the raw data, I had to clean and validate it before it could be used for analysis. Some of the operations performed on the data included:

  • Name Standardization: both team and player names are normalized to ensure consistent matching across the two data sources (see the sketch after this list).
  • Foreign Key Resolution: player and team names are mapped to database IDs using lookup tables.
  • Date Parsing: various date formats are converted to the ISO standard format.
  • Validation: incomplete match entries are skipped.
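As an illustration of the name standardization step, a helper along these lines strips accents, punctuation, and casing differences so that the same player matches across sources. This is a sketch only; the actual function in the repo may differ:

import re
import unicodedata

def normalize_name(name: str) -> str:
    # Lower-case, strip accents, and collapse whitespace and punctuation
    name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    name = re.sub(r"[^a-zA-Z\s]", "", name)  # drop apostrophes, hyphens, etc.
    return re.sub(r"\s+", " ", name).strip().lower()

# normalize_name("N'Golo Kanté") -> "ngolo kante"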

After transformation, the data was structured into two main entities:

  • Matches – containing match ID, date, teams involved, expected goals (xG), and result.
  • Players and Player Stats – separating static player information from season performance data. Each player's record included (an illustrative record follows this list):
      • Matches and minutes played
      • Goals and assists
      • Expected goals (xG) and expected assists (xA)
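For a concrete picture, a transformed player-stats record might look roughly like this; the field names and values are illustrative, not the exact keys used in the repo:

player_stats_record = {
    "player_id": 8260,   # hypothetical database ID
    "team_id": 89,       # hypothetical database ID
    "matches": 34,
    "minutes": 2890,
    "goals": 8,
    "assists": 5,
    "xg": 7.4,
    "xa": 4.9,
}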

Data Loading

Transformed data was loaded into a PostgreSQL database using psycopg2.
The schema consisted of the following tables:

  • player_details
  • player_stats
  • matches
  • injuries

To ensure integrity and efficiency, I implemented the following (a sketch of the UPSERT pattern comes after the list):

  • UPSERT operations using ON CONFLICT for handling duplicate updates
  • Batch inserts for efficiency
  • Explicit transactions for commit/rollback safety
  • Timestamp tracking for debugging and data lineage
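As an illustration of the UPSERT-plus-batch pattern with psycopg2, here is a hedged sketch of how a loader like insert_matches could work; the column names and conflict target are assumptions rather than the repo's exact schema:

from psycopg2.extras import execute_values

def insert_matches(cur, matches):
    # Batch insert with ON CONFLICT so re-runs update existing rows instead of failing
    query = """
        INSERT INTO matches (match_id, match_date, home_team_id, away_team_id, home_xg, away_xg, result)
        VALUES %s
        ON CONFLICT (match_id) DO UPDATE SET
            home_xg = EXCLUDED.home_xg,
            away_xg = EXCLUDED.away_xg,
            result  = EXCLUDED.result;
    """
    rows = [
        (m["match_id"], m["date"], m["home_team_id"], m["away_team_id"],
         m["home_xg"], m["away_xg"], m["result"])
        for m in matches
    ]
    execute_values(cur, query, rows)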

The database structure diagram below shows how the schema is set up.
dbSchema

Automation with Apache Airflow

Running scripts manually became inefficient and error-prone, especially as data volumes grew.
To automate the ETL process, I used Apache Airflow to define and schedule the workflow as a Directed Acyclic Graph (DAG).

from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="injury_etl_dag",
    description="ETL pipeline for Understat and Premier Injuries data",
    schedule=timedelta(days=1),
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,  # default_args (owner, retries, etc.) defined elsewhere in the DAG file
    tags=["injury", "understat", "ETL"]
) as dag:

Next, I added the tasks to be carried out, encapsulating each one in its own @task() function. Intermediate data is stored temporarily in Airflow's XCom and passed between tasks. Finally, I specified the task dependencies and the workflow sequence.

    @task()
    def extract_task(**kwargs):
        injury_html = extract_injury_data()
        matches_json = extract_match_data()
        players_raw = extract_understat_player_stats()

        kwargs["ti"].xcom_push(key="injury_html", value=injury_html.prettify())
        kwargs["ti"].xcom_push(key="matches_json", value=matches_json)
        kwargs["ti"].xcom_push(key="players_raw", value=players_raw)

    @task()
    def transform_task(**kwargs):
        from bs4 import BeautifulSoup

        ti = kwargs["ti"]
        conn = get_db_connection()
        team_name_to_id = get_team_name_to_id(conn)

        # Load XComs
        injury_html = ti.xcom_pull(key="injury_html", task_ids="extract_task")
        matches_json = ti.xcom_pull(key="matches_json", task_ids="extract_task")
        players_raw = ti.xcom_pull(key="players_raw", task_ids="extract_task")

        # Transform
        match_data = transform_match_data(matches_json, team_name_to_id)
        players, stats = transform_player_stats(players_raw, team_name_to_id, conn)
        injuries = transform_injury_data(BeautifulSoup(injury_html, "lxml"), conn, team_name_to_id)

        # Push to XComs
        ti.xcom_push(key="matches", value=match_data)
        ti.xcom_push(key="players", value=players)
        ti.xcom_push(key="player_stats", value=stats)
        ti.xcom_push(key="injuries", value=injuries)

    @task()
    def load_task(**kwargs):
        ti = kwargs["ti"]
        conn = get_db_connection()

        matches = ti.xcom_pull(key="matches", task_ids="transform_task")
        players = ti.xcom_pull(key="players", task_ids="transform_task")
        player_stats = ti.xcom_pull(key="player_stats", task_ids="transform_task")
        injuries = ti.xcom_pull(key="injuries", task_ids="transform_task")

        with conn.cursor() as cur:
            create_injury_schema(conn)
            load_teams(cur)
            insert_matches(cur, matches)
            insert_player_details(cur, players)
            insert_player_stats(cur, player_stats)
            conn.commit()

        load_injuries_data(conn, injuries)

    # Set dependencies
    extract_task() >> transform_task() >> load_task()


The screenshot below shows the DAG runs eventually succeeding:

DAG runs

Data Analysis

Once the data was consolidated, it could be used for statistical analysis to look for a correlation between injury occurrences and team performance metrics. I was able to get team and player stats for the whole season, but the injury data only spanned a few weeks before my free trial on ScraperAPI expired, so a proper analysis of the entire season wasn't possible. With data covering roughly the last third of the 2024 season, I was able to do the following analysis:

Frequency of injuries by team

Purpose: Determine which teams had the most injuries
Finding: Man United had fewer injuries than most other teams over the period covered. Chelsea had the most with four, Arsenal had the fewest with none, and Man United had one.
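The injury counts came from a query along these lines; the table and column names here are assumptions (including a teams lookup table), so treat it as a sketch:

conn = get_db_connection()

INJURY_COUNTS = """
    SELECT t.team_name, COUNT(i.injury_id) AS injury_count
    FROM teams t
    LEFT JOIN injuries i ON i.team_id = t.team_id
    GROUP BY t.team_name
    ORDER BY injury_count DESC;
"""

with conn.cursor() as cur:
    cur.execute(INJURY_COUNTS)
    for team, count in cur.fetchall():
        print(team, count)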

Injury Rate vs Win Percentage Analysis

Purpose: Determine whether teams with more injuries perform worse
Finding: Man United had fewer injuries than most other teams but performed worse, suggesting that injuries were not the only cause of the poor performance.
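Comparing injuries against results meant joining the injury counts with a win percentage derived from the matches table. A rough sketch is below, assuming matches stores a result column from the home team's perspective ('w', 'd', 'l'); the real query in the repo may differ:

INJURIES_VS_WINS = """
    WITH wins AS (
        SELECT team_id,
               100.0 * SUM(CASE WHEN won THEN 1 ELSE 0 END) / COUNT(*) AS win_pct
        FROM (
            SELECT home_team_id AS team_id, (result = 'w') AS won FROM matches
            UNION ALL
            SELECT away_team_id AS team_id, (result = 'l') AS won FROM matches
        ) per_team
        GROUP BY team_id
    )
    SELECT t.team_name, COUNT(i.injury_id) AS injuries, w.win_pct
    FROM teams t
    LEFT JOIN injuries i ON i.team_id = t.team_id
    LEFT JOIN wins w ON w.team_id = t.team_id
    GROUP BY t.team_name, w.win_pct
    ORDER BY injuries DESC;
"""
# Executed with conn.cursor() in the same way as the injury-count query above.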

Challenges Faced While Building the Pipeline

a. Data extraction

The Cloudflare CAPTCHA check prevented me from using basic Python libraries to extract the injury data, which meant paying for an API to handle extraction. It also meant I could not get injury data spanning the entire season for a proper analysis.

b. Data Normalization

Since data was collected from multiple sources referencing the same players, each source used different identifiers to represent them. To ensure consistency and eliminate redundancy, I applied database normalization techniques — standardizing player identifiers, resolving data inconsistencies, and structuring the data into related tables for more efficient storage and retrieval.
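In practice, the foreign key resolution relied on lookup dictionaries built from the database, such as the get_team_name_to_id helper used in the DAG. Its body here is a sketch under the assumption of a teams table with team_id and team_name columns, reusing the normalize_name helper sketched earlier:

def get_team_name_to_id(conn):
    # Build a {normalized team name: team_id} lookup for foreign key resolution
    with conn.cursor() as cur:
        cur.execute("SELECT team_id, team_name FROM teams;")
        return {normalize_name(name): team_id for team_id, name in cur.fetchall()}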

Conclusion

Building this pipeline was more than just a coding exercise — it was a practical test of my data engineering foundations. From web scraping and normalization to ETL automation and database design, the project taught me how data moves through every stage of an analytical workflow.
Even with partial data, the findings challenged assumptions about Manchester United’s poor season and highlighted how data engineering enables evidence-based analysis.
This project strengthened my grasp of ETL orchestration, workflow automation, and schema design — setting the stage for future work involving analytics dashboards and machine learning-driven insights.
