Building a Self-Healing CI/CD Pipeline on GitLab
Auto-resume from stuck jobs. Improve resilience. Save time.
The Problem: Flaky Pipelines, Delayed Delivery
If you've worked with CI/CD in GitLab long enough, you've likely run into this:
- Pipelines hang or fail midway because a GitLab runner disconnects.
- A flaky environment causes a single failed job, and the entire pipeline restarts.
- No checkpointing, no fallback... just re-run from the top.
These issues waste time, delay releases, and block engineering focus.
So I built something to fix it.
Introducing: GitLab Self-Healing Pipeline Framework
An open-source system that lets your pipelines resume automatically after the last successful stage, without manual intervention.
GitHub: gThiru/gitlab-self-healing-pipeline
How It Works
- Each stage of your GitLab pipeline records its progress to .ci-progress.json on a shared volume
- A Python watchdog script checks these files on a schedule (cron or Kubernetes)
- If a pipeline is stuck or timed out, it:
  - Reads the last successful stage
  - Triggers a new pipeline with RESUME_STAGE set to the next needed stage
  - Enforces retry limits and pipeline age cutoffs
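
The watchdog's decision step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the progress-file fields (started_at, updated_at, retries, stages), the stage list, and the thresholds are all assumptions made for the example. The trigger_fields helper assumes the new pipeline is started through GitLab's pipeline trigger token endpoint; the post does not say which mechanism the project uses.

```python
# Hypothetical stage order and limits; adjust to your own pipeline.
STAGES = ["build", "test", "deploy"]
STUCK_AFTER_SECONDS = 30 * 60        # no update for 30 minutes => treat as stuck
MAX_RETRIES = 3                      # retry limit
MAX_AGE_SECONDS = 24 * 60 * 60       # pipeline age cutoff


def next_stage(progress):
    """Return the first stage not marked 'done', or None if every stage is done."""
    recorded = progress.get("stages", {})
    for stage in STAGES:
        if recorded.get(stage) != "done":
            return stage
    return None


def decide(progress, now):
    """Return the RESUME_STAGE to trigger, or None to leave the pipeline alone.

    `progress` is the parsed content of one progress file (assumed schema).
    `now` is the current time in seconds since the epoch.
    """
    if now - progress["updated_at"] < STUCK_AFTER_SECONDS:
        return None                  # updated recently, so not stuck
    if now - progress["started_at"] > MAX_AGE_SECONDS:
        return None                  # past the age cutoff, give up
    if progress.get("retries", 0) >= MAX_RETRIES:
        return None                  # retry limit reached
    return next_stage(progress)


def trigger_fields(ref, token, resume_stage):
    """Form fields for POST /projects/:id/trigger/pipeline (assumed mechanism)."""
    return {
        "token": token,
        "ref": ref,
        "variables[RESUME_STAGE]": resume_stage,
    }
```

Keeping the decision a pure function of the file content and the current time makes the guardrails easy to unit test without a running GitLab instance.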
Tech Stack
- GitLab CI/CD
- Python for the watchdog
- Shared mount (e.g. NFS, EFS) across runners
- Linux cron or Kubernetes CronJob
- Environment-safe with retry guardrails
Key Features
- Pipeline resumes from the last good stage
- JSON-based per-pipeline tracking
- Retry limit + max age protection
- Works in hybrid GitLab runner setups
- Dev-friendly Bash helper to update stage status
- Linux + K8s CronJob support
Quick Example
In .gitlab-ci.yml:

    rules:
      - if: '$RESUME_STAGE == "test" || $RESUME_STAGE == ""'
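
The rule above covers a single stage. If a resumed pipeline is meant to run RESUME_STAGE and every stage after it, each job's condition would have to accept its own stage plus all earlier ones. The post does not show how the project handles this, so the following is only a sketch of one way to generate such expressions; resume_rule, should_run, and the stage list are hypothetical names. Note also that GitLab's documentation distinguishes an empty variable from an undefined one (checked with == null), so the == "" comparison likely assumes RESUME_STAGE is declared with an empty default.

```python
def resume_rule(job_stage, stages):
    """Build a `rules: if` expression for a job that lives in `job_stage`.

    The job runs on a normal pipeline (RESUME_STAGE empty) or when the
    resume point is this job's stage or any earlier stage.
    """
    allowed = stages[: stages.index(job_stage) + 1]
    clauses = [f'$RESUME_STAGE == "{name}"' for name in allowed]
    clauses.append('$RESUME_STAGE == ""')
    return " || ".join(clauses)


def should_run(job_stage, resume_stage, stages):
    """Python mirror of the generated rule, handy for checking the logic."""
    if not resume_stage:
        return True
    return stages.index(resume_stage) <= stages.index(job_stage)
```

For stages build, test, deploy, a pipeline triggered with RESUME_STAGE set to test would skip build and run both test and deploy.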
In the build job:

    source ./update_stage_status.sh
    update_stage_status build in_progress
    # ... your build steps ...
    update_stage_status build done
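
To make the idea behind the helper concrete, here is a rough Python equivalent of what a status update might do. The project's helper is a Bash script whose internals are not shown in this post, so the JSON layout (started_at, updated_at, retries, stages) and the write-then-rename approach are assumptions for illustration only.

```python
import json
import os
import tempfile
import time


def update_stage_status(path, stage, status):
    """Record `status` for `stage` in the progress file at `path`."""
    try:
        with open(path) as fh:
            progress = json.load(fh)
    except FileNotFoundError:
        # First update for this pipeline: start a fresh record (assumed schema).
        progress = {"started_at": time.time(), "retries": 0, "stages": {}}

    progress["stages"][stage] = status
    progress["updated_at"] = time.time()

    # Write to a temporary file in the same directory and rename it into
    # place, so a watchdog reading concurrently does not see a partial file.
    # Rename semantics on shared mounts can vary; verify on your storage.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(progress, fh)
    os.replace(tmp_path, path)
    return progress
```

Refreshing updated_at on every call is what would let a watchdog tell a slow pipeline from a stuck one.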
OSS Ready
This project is:
- Released under the MIT License
- Submitted to awesome-devops
- Includes full documentation, examples, and templates
- Ready for production use in GitLab-based orgs
What's Next
I'm planning to add:
- Job-level tracking (not just stage)
- Webhook-based watchdog
- S3/GCS support instead of local disk
- Slack/email notifications on resume
Get Involved
GitHub repo: gThiru/gitlab-self-healing-pipeline
Star it if you like it
Open issues or suggest ideas
Contributions are welcome!
Let's build smarter pipelines, together.
Thirunavukkarasu Ganesan
DevOps Manager / Architect