Every engineering team I've talked to has the same frustration.
Pipeline fails. Engineer gets a Slack notification. Opens GitHub Actions. Stares at 300 lines of logs. Googles the error. Checks Stack Overflow. 45 minutes later — maybe they've found the fix.
I built PipelineIQ to eliminate that 45 minutes.
What it does
When a CI/CD pipeline fails, PipelineIQ automatically:
Captures the error logs
Sends them to Claude AI for analysis
Delivers a Slack alert with the exact root cause and fix steps — within seconds
Here's a real example of what lands in Slack:
🔴 Pipeline Failure: Stripe API connection timeout blocking payment webhooks
AI Diagnosis: The deployment is failing because the application cannot establish a connection to Stripe's API within the 30-second timeout limit. This is preventing payment webhook processing.
Recommended Fix: Check STRIPE_SECRET_KEY and STRIPE_PUBLISHABLE_KEY in production environment variables. Test connectivity to api.stripe.com from your deployment environment. Increase API timeout from 30s to 60s.
No log diving. No guessing. Specific, actionable steps.
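Under the hood, an alert like that is just a few Slack Block Kit blocks. Here's a minimal sketch of the formatting step — the field names mirror the insight payload, but the helper itself is illustrative, not PipelineIQ's actual code:

```python
def build_slack_blocks(insight: dict) -> list:
    """Turn an AI insight into Slack Block Kit blocks for chat.postMessage."""
    return [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"🔴 Pipeline Failure: {insight['title']}",
            },
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*AI Diagnosis:* {insight['diagnosis']}"},
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Recommended Fix:* {insight['recommendation']}"},
        },
    ]
```

The blocks list goes into the `blocks` field of a `chat.postMessage` call (or an incoming webhook payload), which is what gives the alert its rich formatting instead of a plain text dump.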
The stack
FastAPI — Python backend with async support
Supabase — PostgreSQL database with Row Level Security
Anthropic Claude API — AI diagnosis engine
Slack API — Rich block-based alerts
Railway — Production deployment
GitHub Actions — Integration via one workflow step
How the integration works
You add one step to any existing GitHub Actions workflow:
```yaml
- name: Notify PipelineIQ
  if: always()
  run: |
    curl -X POST $PIPELINEIQ_URL/api/v1/pipelines/runs \
      -H "X-PipelineIQ-Key: $PIPELINEIQ_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "repo_full_name": "${{ github.repository }}",
        "branch": "${{ github.ref_name }}",
        "commit_sha": "${{ github.sha }}",
        "commit_message": "${{ github.event.head_commit.message }}",
        "workflow_name": "${{ github.workflow }}",
        "status": "${{ job.status }}",
        "started_at": "${{ github.event.head_commit.timestamp }}"
      }'
```
That's it. Every run — success or failure — gets stored. Every failure triggers AI diagnosis automatically.
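On the receiving side, the endpoint only has to persist the payload and decide whether to kick off diagnosis. A stdlib-only sketch of that decision — the field names match the curl payload above, but the dataclass and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class PipelineRun:
    repo_full_name: str
    branch: str
    commit_sha: str
    workflow_name: str
    status: str  # GitHub Actions reports e.g. "success", "failure", "cancelled"


def should_diagnose(run: PipelineRun) -> bool:
    """Every run is stored, but only failures trigger the AI diagnosis task."""
    return run.status == "failure"
```

In FastAPI this check sits in the route handler: store the run, then schedule the diagnosis as a background task only when `should_diagnose` returns true, so the webhook responds immediately either way.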
The AI diagnosis engine
The core of PipelineIQ is a FastAPI background task that fires Claude when a failure is stored:
```python
async def run_ai_diagnosis(run: dict, org_id: str, supabase: Client):
    insight = await diagnose_from_run(run)
    if not insight:
        return
    supabase.table("insights").insert({
        "org_id": org_id,
        "severity": insight.get("severity"),
        "title": insight.get("title"),
        "diagnosis": insight.get("diagnosis"),
        "recommendation": insight.get("recommendation"),
        "estimated_time_save_minutes": insight.get("estimated_time_save_minutes"),
        "confidence": insight.get("confidence"),
    }).execute()
    await send_pipeline_alert(insight, run)
```
Claude returns structured JSON with severity, diagnosis, recommendation, confidence score, and estimated time saved. The whole thing runs in under 5 seconds.
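Getting reliable structured output means validating what the model returns before storing it. Here's a sketch of that parsing step, assuming Claude is prompted to reply with a single JSON object — the key names match the insight fields above, but the function itself is illustrative:

```python
import json
from typing import Optional

REQUIRED_KEYS = {
    "severity", "title", "diagnosis", "recommendation",
    "estimated_time_save_minutes", "confidence",
}


def parse_insight(raw: str) -> Optional[dict]:
    """Parse the model's reply into an insight dict; return None if unusable."""
    # Models sometimes wrap JSON in markdown fences; strip them first.
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    try:
        insight = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(insight, dict) or not REQUIRED_KEYS.issubset(insight):
        return None
    return insight
```

Returning `None` on any malformed reply is what lets `run_ai_diagnosis` bail out early instead of inserting a half-empty row or crashing the background task.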
What's next
Right now PipelineIQ stores pipeline runs, generates AI diagnosis on failures, and sends Slack alerts. The roadmap includes:
Web dashboard with pipeline health across all repos
DORA metrics (deployment frequency, change failure rate, recovery time)
Environment drift detection
Industry benchmarks — how does your team compare?
Try it free
PipelineIQ is in free beta. I'm looking for engineering teams to try it and give honest feedback on what's missing.
👉 pipelineiq.dev
API docs: pipelineiq-production-3496.up.railway.app/docs
Happy to answer questions in the comments — especially from DevOps engineers who deal with pipeline failures daily.
