How to Run Stateful ML Pipelines for Free using GitHub Actions

#python #githubactions #devops #machinelearning

Today marks the start of the 2026 FIFA World Cup, a tournament held once every four years and among the biggest events in sport. As a fun project, I decided to build a model to predict it.

In a setting like this, a traditional machine learning model that is trained once tends to go stale, because each day's results never feed back into it in real time. So I built a different kind of predictive engine. The core math is a Monte Carlo simulation running 10,000 iterations, but the real production challenge was state management: updating and reading a changing dataset every single day without manual intervention or expensive cloud compute.

I solved this by building an autonomous pipeline using GitHub Actions, flat CSV files, and Streamlit. Here is how the live state management and fault tolerance work.

Live State Management & Engineering Fault Tolerance

What sets this project apart is how it manages live state during the World Cup. Once the tournament begins (today), the system shifts from a pure predictive model to a live tracker, which means handling two major risks: the Elimination Trap and Timezone Offsets.

The Elimination Trap

The trap is re-simulating matches that have already been played, which can leave an eliminated team with a nonzero chance of winning. To avoid it, at the start of each run the engine reads elo_results.csv and checks whether a match already has a real-world score recorded. If it does, that score is locked in for all 10,000 runs. This immediately drops any eliminated team to a 0% probability, so the engine keeps predicting accurately without running random simulations on games that have already concluded.
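A minimal sketch of the lock-in idea, assuming a four-team knockout and the standard Elo expected-score formula. The team names, ratings, and the `locked` dictionary are hypothetical stand-ins for what the engine reads from elo_results.csv; the real engine may be structured differently.

```python
import random
from collections import Counter

def win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

def play(a, b, elo, locked, rng):
    """Return the winner of a vs b, using the recorded result if one exists."""
    key = frozenset((a, b))
    if key in locked:
        return locked[key]  # real result: identical in every run
    return a if rng.random() < win_probability(elo[a], elo[b]) else b

def simulate(semifinals, elo, locked, runs=10_000, seed=0):
    """Estimate title probabilities for a four-team knockout."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(runs):
        finalists = [play(a, b, elo, locked, rng) for a, b in semifinals]
        titles[play(finalists[0], finalists[1], elo, locked, rng)] += 1
    return {team: titles[team] / runs for team in elo}

# Hypothetical ratings and one already-played semifinal (made-up data).
elo = {"A": 1900, "B": 1850, "C": 1800, "D": 1750}
semifinals = [("A", "B"), ("C", "D")]
locked = {frozenset(("A", "B")): "B"}  # B beat A in real life

probs = simulate(semifinals, elo, locked)
print(probs["A"])  # 0.0, because A's loss is locked into every run
```

Because the recorded winner is returned before any random draw, the eliminated team can never reach the final in any iteration, and no simulation budget is spent on a match whose outcome is known.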

Timezone Offsets

Many matches across North America finish late at night, which spills into the next calendar day in UTC. I set up a cron job to pull the latest scores and results every day, but a standard UTC cloud cron job will miss these late results. To fix this, I anchored the API request parameters to the West Coast timezone:

params = {
    'league': '1',
    'season': '2026',
    'timezone': 'America/Los_Angeles'
}
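To see why the calendar date matters, here is a small sketch using Python's standard `zoneinfo` module. The kickoff time is a made-up example, not a real fixture.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")

# Hypothetical kickoff: 9 PM on the West Coast (not a real fixture).
kickoff_local = datetime(2026, 6, 12, 21, 0, tzinfo=PACIFIC)
kickoff_utc = kickoff_local.astimezone(timezone.utc)

print(kickoff_local.date())  # 2026-06-12
print(kickoff_utc.date())    # 2026-06-13

# A daily job that fires at 06:00 UTC, seen from the West Coast.
run_utc = datetime(2026, 6, 13, 6, 0, tzinfo=timezone.utc)
run_local = run_utc.astimezone(PACIFIC)
print(run_local.date(), run_local.hour)  # 2026-06-12 23
```

The same match belongs to June 12 in Pacific time and June 13 in UTC, so a query keyed on the UTC date and one keyed on the Pacific date can return different sets of fixtures.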

The script also filters by match status and accepts only completed games, so corrupted or partial data cannot break the pipeline. It explicitly verifies that each record belongs to a finished match before touching the stateful historical files.
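The status filter could look roughly like the sketch below. The flat record shape, the field names, and the status codes (`FT`, `AET`, `PEN`, `1H`) are assumptions made for illustration; the actual API payload and the script's parsing may differ.

```python
# Assumed status codes for finished matches: full time, extra time, penalties.
FINISHED = {"FT", "AET", "PEN"}

def completed_results(fixtures):
    """Keep only fixtures that are finished and carry a full score."""
    rows = []
    for item in fixtures:
        status = item.get("status")
        home, away = item.get("home_goals"), item.get("away_goals")
        if status not in FINISHED or home is None or away is None:
            continue  # in progress, postponed, or a partial payload
        rows.append({
            "match_id": item["match_id"],
            "home": item["home"],
            "away": item["away"],
            "home_goals": int(home),
            "away_goals": int(away),
        })
    return rows

# Hypothetical records: one finished, one in progress, one missing its score.
sample = [
    {"match_id": 1, "status": "FT", "home": "A", "away": "B", "home_goals": 2, "away_goals": 1},
    {"match_id": 2, "status": "1H", "home": "C", "away": "D", "home_goals": 0, "away_goals": 0},
    {"match_id": 3, "status": "FT", "home": "E", "away": "F", "home_goals": None, "away_goals": None},
]
results = completed_results(sample)
print([r["match_id"] for r in results])  # [1]
```

Checking both the status and the presence of a score means a record that claims to be finished but arrives without goals is skipped rather than written into the historical file.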

Autonomous CI/CD Pipeline

To pull live data, I configured a GitHub Actions workflow. It handles the live data ingestion, runs the 10,000 simulations, and saves the new state fully autonomously.
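One way to make the "save the new state" step safe to repeat is an idempotent merge keyed by match. The sketch below is an assumption about how that could be done with flat CSV files, not the project's actual code; the column names and the `upsert_results` helper are hypothetical.

```python
import csv
import os
import tempfile

# Assumed columns for the state file.
FIELDS = ["match_id", "home", "away", "home_goals", "away_goals"]

def upsert_results(path, new_rows):
    """Merge new_rows into the CSV at path, keyed by match_id."""
    merged = {}
    if os.path.exists(path):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                merged[row["match_id"]] = {k: row[k] for k in FIELDS}
    for row in new_rows:
        merged[str(row["match_id"])] = {k: str(row[k]) for k in FIELDS}

    # Write to a temp file in the same directory, then swap it in, so a
    # crash mid-write never leaves a half-written state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(merged.values())
    os.replace(tmp, path)
    return len(merged)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "elo_results.csv")
    row = {"match_id": 1, "home": "A", "away": "B", "home_goals": 2, "away_goals": 1}
    first = upsert_results(path, [row])
    second = upsert_results(path, [row])  # same data again, as on a re-run
    print(first, second)  # 1 1
```

Running the merge twice with the same result leaves a single row, so a manual re-trigger or an overlapping scheduled run does not duplicate matches in the state file.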

GitHub-hosted runners are ephemeral, so anything written during a run disappears unless it is committed back to the repository. The workflow therefore needs explicit write permission to push the updated datasets directly to the main branch. The cron job is timed for 06:00 UTC, which is 11 PM Pacific Daylight Time, so that late-night North American games have concluded before it runs.

name: Daily World Cup Data Update

on:
  schedule:
    # Runs at 06:00 UTC every day to ensure all matches have concluded
    - cron: '0 6 * * *'
  # Allows you to trigger the run manually from the GitHub Actions tab
  workflow_dispatch: 

permissions:
  contents: write # Needed so the bot can push changes back to the repo

jobs:
  update-data:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run live update pipeline
        env:
          # This pulls your API key from GitHub Secrets
          API_SPORTS_KEY: ${{ secrets.API_SPORTS_KEY }}
        run: python src/update_live_data.py

      - name: Commit and push updated data
        run: |
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          # Stage the updated data files
          git add data/processed/elo_results.csv
          git add data/processed/simulation_results.csv

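          # Commit and push only if the staged files actually changed.
          # Checking the staged diff alone avoids a failed "nothing to commit"
          # when some other, unstaged file was modified during the run.
          git diff --staged --quiet || (git commit -m "Auto-update World Cup live data & simulations" && git push)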

Streamlit Integration

The frontend is a simple Streamlit dashboard linked directly to the repository. When the GitHub Action finishes, it pushes the fresh simulation_results.csv and sample_bracket.json files. Streamlit Community Cloud watches the connected repository for new commits, so once the push lands, the public dashboard picks up the updated files and re-renders the presentation layer.

Check out the full code on GitHub
Check out the live dashboard on Streamlit Cloud

Top comments (2)

Luis Cruz • Jun 11

This is a fantastic example of running stateful, autonomous ML pipelines entirely on GitHub Actions. I really appreciate the careful handling of live state, elimination trap logic, and timezone offsets—these are exactly the kinds of edge cases that often break naive pipelines.

The integration with Streamlit for a live dashboard, combined with fully automated commits of updated CSV datasets, demonstrates a production-ready approach to ML workflow automation using entirely free infrastructure.

I’d love to collaborate and explore extending this pattern—experimenting with multi-model simulations, cross-timezone event handling, and reproducible CI/CD pipelines for live ML projects. Sharing strategies for state management, fault tolerance, and automated verification could benefit teams working on live predictive systems.

Would you be open to discussing a collaboration to test similar autonomous pipelines for other competitions or real-world streaming datasets?

Adarsh • Jul 9

Hi Luis! Thanks for the feedback! I'm glad you found the engineering choices around live state management and timezone offsets useful. I'd definitely be open to collaborating on this pattern for other competitions or real-world streaming datasets. Let's connect to discuss how we can team up on it. What's the best way to reach you?