DEV Community

Thiru G

Building a Self-Healing CI/CD Pipeline on GitLab

Auto-resume from stuck jobs. Improve resilience. Save time.


🤔 The Problem: Flaky Pipelines, Delayed Delivery

If you've worked with CI/CD in GitLab long enough, you've likely run into this:

  • Pipelines hang or fail midway because a GitLab runner disconnects.
  • A flaky environment causes a single job to fail, and the entire pipeline has to be restarted.
  • No checkpointing, no fallback... just re-run from the top.

These issues waste time, delay releases, and pull engineers away from their actual work.

So I built something to fix it.


🔧 Introducing: GitLab Self-Healing Pipeline Framework

A fully open-source system that lets your pipelines automatically resume after the last successful stage, without manual intervention.

👉 GitHub: gThiru/gitlab-self-healing-pipeline


🧠 How It Works

  1. Each stage of your GitLab pipeline records its progress in .ci-progress.json on a shared volume.
  2. A Python watchdog script checks these files on a schedule (Linux cron or a Kubernetes CronJob).
  3. If a pipeline is stuck or has timed out, the watchdog:
    • Reads the last successful stage
    • Triggers a new pipeline with RESUME_STAGE set to the next stage that still needs to run
    • Enforces retry limits and pipeline age cutoffs
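The watchdog's decision step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stage order, the thresholds, and the progress-file fields (`started_at`, `updated_at`, `retries`, `stages`) are all assumptions made for the example. The form fields at the end follow GitLab's pipeline trigger endpoint (`POST /projects/:id/trigger/pipeline`), which accepts `token`, `ref`, and `variables[KEY]` parameters.

```python
import json
from pathlib import Path

# All values below are hypothetical; tune them to your own pipeline.
STAGES = ["build", "test", "deploy"]   # assumed stage order
MAX_RETRIES = 3                        # assumed retry limit
MAX_AGE_SECONDS = 6 * 3600             # assumed pipeline age cutoff
STUCK_AFTER_SECONDS = 30 * 60          # assumed "no update" timeout


def load_progress(path):
    """Read one pipeline's progress record from the shared volume."""
    return json.loads(Path(path).read_text())


def decide_resume(progress, now):
    """Return the stage to resume from, or None if no resume is warranted."""
    if now - progress["started_at"] > MAX_AGE_SECONDS:
        return None  # pipeline is too old to be worth resuming
    if progress.get("retries", 0) >= MAX_RETRIES:
        return None  # retry limit reached
    if now - progress["updated_at"] < STUCK_AFTER_SECONDS:
        return None  # updated recently, so not considered stuck
    stages = progress.get("stages", {})
    for stage in STAGES:
        if stages.get(stage) != "done":
            return stage  # first stage that has not finished
    return None  # every stage already finished


def trigger_form(resume_stage, ref, token):
    """Form fields for a trigger request that sets RESUME_STAGE."""
    return {
        "token": token,
        "ref": ref,
        "variables[RESUME_STAGE]": resume_stage,
    }
```

Keeping the decision in a pure function that takes `now` as an argument makes the retry and age guardrails easy to unit-test without touching the shared volume or the GitLab API.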

🧰 Tech Stack

  • 🟣 GitLab CI/CD
  • 🐍 Python for watchdog
  • 📂 Shared mount (e.g. NFS, EFS) across runners
  • ⏱️ Linux cron or Kubernetes CronJob
  • Environment-safe, with retry guardrails (retry limit and maximum pipeline age)

Key Features

  • ✅ Pipeline resumes from last good stage
  • ✅ JSON-based per-pipeline tracking
  • ✅ Retry limit + max age protection
  • ✅ Works in hybrid GitLab runner setups
  • ✅ Dev-friendly Bash helper to update stage status
  • ✅ Linux + K8s CronJob support

🏁 Quick Example

In .gitlab-ci.yml:

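rules:
  # Runs on a normal pipeline (RESUME_STAGE unset or empty) or when resuming at "test"
  - if: '$RESUME_STAGE == "test" || $RESUME_STAGE == "" || $RESUME_STAGE == null'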

In the build job:

source ./update_stage_status.sh
update_stage_status build in_progress
# ... your build steps ...
update_stage_status build done
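To picture what a call like `update_stage_status build done` leaves behind for the watchdog, here is a Python stand-in for the Bash helper. It is only a sketch under assumptions: the JSON layout (`started_at`, `updated_at`, `retries`, `stages`) is hypothetical and may differ from what the real helper writes.

```python
import json
import os
import tempfile
import time


def update_stage_status(path, stage, status, now=None):
    """Record `status` for `stage` in the JSON progress file at `path`."""
    now = time.time() if now is None else now
    try:
        with open(path) as fh:
            progress = json.load(fh)
    except FileNotFoundError:
        # First update for this pipeline: start a fresh record (assumed layout).
        progress = {"started_at": now, "retries": 0, "stages": {}}

    progress["stages"][stage] = status
    progress["updated_at"] = now

    # Write to a temporary file in the same directory, then rename it into
    # place, so a reader never sees a half-written progress file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(progress, fh)
    os.replace(tmp_path, path)
    return progress
```

The write-then-rename step matters on a shared mount, where the watchdog may read the file while a job is still updating it.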

🚀 OSS Ready

This project is:

  • ✅ Released under the MIT License
  • ✅ Submitted to awesome-devops
  • ✅ Includes full documentation, examples, and templates
  • 🏁 Ready for production use in GitLab-based orgs

🌱 What's Next

I'm planning to add:

  • Job-level tracking (not just stage)
  • Webhook-based watchdog
  • S3/GCS support instead of local disk
  • Slack/email notifications on resume

🙌 Get Involved

👉 GitHub Repo

💫 Star it if you like it

💬 Open issues or suggest ideas

🤝 Contributions are welcome!


📢 Let's build smarter pipelines, together.

Thirunavukkarasu Ganesan

DevOps Manager / Architect
