How I Used AI Agents to Automate My Entire CI/CD Pipeline

We were deploying like it was 2019. Manual steps, Slack prayers, and a 45-minute pipeline that broke twice a week. Then I gave AI agents the keys, and everything changed.


😩 The Problem: Death by a Thousand Manual Steps

Let me paint you a picture of our old deploy process:

1. Developer pushes to main
2. Someone notices (maybe)
3. Manually trigger CI in Jenkins
4. Wait for tests (pray they pass)
5. Manually approve staging deploy
6. Run smoke tests (manually, of course)
7. Ping Slack: "staging looks good?"
8. Wait for someone to say 👍
9. Manually trigger production deploy
10. Monitor dashboards for 20 minutes
11. If something breaks → rollback (manually)
12. Write incident report
13. Question life choices
14. Repeat tomorrow

14 steps. 45 minutes. 8–12 failed deploys per month. 😵

We weren't shipping software; we were performing rituals.



💡 The Idea: What If the Pipeline Could Think?

The breakthrough moment came during a 2 AM incident. Our deploy broke because someone forgot to run a database migration. I thought:

"An AI agent would have caught that. It would've looked at the diff, seen the migration file, and known to run it."

That's when I decided to build an AI-agent-driven CI/CD pipeline: not just automation scripts, but agents that understand what's being deployed and decide how to handle it.


๐Ÿ—๏ธ The Architecture

Here's what the final system looks like:

AI Agent CI/CD Architecture Diagram

The Three Agents

I built three specialized agents, each with a distinct job:

🧪 Agent 1: The Test Agent

# test_agent.py
class TestAgent:
    """Analyzes code changes and generates/updates tests automatically."""

    def on_push(self, commit):
        diff = self.get_diff(commit)
        changed_files = self.analyze_changes(diff)

        # AI analyzes what changed and why
        analysis = self.llm.analyze(
            prompt=f"""
            Analyze this code change (files touched: {changed_files}):
            {diff}

            What could break? What edge cases should be tested?
            Generate targeted test cases.
            """,
            context=self.get_codebase_context()
        )

        # Generate tests for uncovered paths
        new_tests = self.generate_tests(analysis)
        self.run_and_validate(new_tests)

        # Fix any flaky tests it detects
        flaky_tests = self.detect_flaky_tests()
        for test in flaky_tests:
            self.fix_flaky_test(test)

What it does:

  • 🔍 Reads the actual diff, not just "run all tests"
  • 🧬 Generates tests for new code paths automatically
  • 🔧 Detects and fixes flaky tests before they block deploys (see the sketch after this list)
  • 📊 Reports coverage gaps with suggestions
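The flaky-test detection, by the way, is not magic: the agent re-runs anything that failed a few times and flags tests whose results disagree with each other. A minimal sketch of that check (assuming pytest; run_test is just a thin wrapper around your test runner):

# flaky_check.py - minimal flaky-test detector (sketch, assuming pytest)
import subprocess

def run_test(test_id: str) -> bool:
    """Run a single test node and report whether it passed."""
    result = subprocess.run(["pytest", test_id, "-q"], capture_output=True)
    return result.returncode == 0

def is_flaky(test_id: str, retries: int = 5) -> bool:
    """A test is flaky if repeated runs of the same code disagree."""
    outcomes = {run_test(test_id) for _ in range(retries)}
    return len(outcomes) > 1  # saw both pass and fail -> flaky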

🔨 Agent 2: The Build Agent

# build_agent.py
class BuildAgent:
    """Optimizes build process based on what actually changed."""

    def on_tests_pass(self, commit):
        changes = self.analyze_changes(commit)

        # Smart Dockerfile optimization
        if changes.has_dependency_changes():
            self.rebuild_base_layer()
        elif changes.only_app_code():
            self.use_cached_layers()  # Saves 8-12 minutes

        # AI optimizes the Dockerfile itself
        optimized_dockerfile = self.llm.optimize(
            prompt=f"""
            Optimize this Dockerfile for the current changes:
            {self.current_dockerfile}

            Changes: {changes.summary}

            Focus on: layer caching, multi-stage builds, image size.
            """,
            constraints=["must pass security scan", "under 500MB"]
        )

        self.build(optimized_dockerfile)

What it does:

  • 🏎️ Skips full rebuilds when only app code changed (saves 8–12 min)
  • 📦 Optimizes Dockerfiles on the fly: smaller images, better caching
  • 🛡️ Runs security scans and blocks vulnerable dependencies (gate sketched after this list)
  • 📝 Generates build reports with size diffs
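For the security scan, the agent does not negotiate: the build fails hard if the image has serious findings. A sketch of that gate, assuming Trivy is available on the runner (the image name is whatever your build just produced):

# security_gate.py - fail the build on HIGH/CRITICAL findings (sketch, assumes Trivy)
import subprocess
import sys

def scan_image(image: str) -> None:
    # --exit-code 1 makes Trivy return non-zero when matching findings exist
    result = subprocess.run(
        ["trivy", "image", "--exit-code", "1", "--severity", "HIGH,CRITICAL", image]
    )
    if result.returncode != 0:
        sys.exit(f"Security gate failed: {image} has vulnerable dependencies")

if __name__ == "__main__":
    scan_image(sys.argv[1])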

🚀 Agent 3: The Deploy Agent

# deploy_agent.py
from datetime import datetime

class DeployAgent:
    """Handles deployment strategy and rollback decisions."""

    def on_build_pass(self, artifact):
        # AI decides deployment strategy
        strategy = self.llm.decide(
            prompt=f"""
            Decide deployment strategy for:
            - Change type: {artifact.change_type}
            - Risk level: {artifact.risk_score}
            - Affected services: {artifact.services}
            - Time of day: {datetime.now()}

            Options: rolling, blue-green, canary, hotfix
            """,
            rules=self.deployment_rules
        )

        # Execute with monitoring
        result = self.deploy(artifact, strategy)

        # Watch metrics for anomalies
        anomalies = self.monitor_deployment(duration="10m")
        if anomalies:
            self.auto_rollback(reason=anomalies.summary)
            self.notify_team(f"🚨 Auto-rolled back: {anomalies.summary}")
        else:
            self.notify_team(f"✅ Deploy successful! {strategy.name}")

What it does:

  • 🎯 Chooses the deployment strategy based on risk (not one-size-fits-all)
  • 📈 Monitors key metrics for 10 minutes post-deploy (see the sketch after this list)
  • ⏪ Auto-rolls back in 30 seconds if anomalies are detected
  • 📱 Smart notifications: no more "deployed!" spam
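The post-deploy monitoring is deliberately boring: poll Prometheus, compare against a threshold, roll back if it spikes. A rough sketch of that loop (the PromQL query, the threshold, and the rollback command are illustrative, not our exact values):

# monitor_deploy.py - watch the error rate after a deploy and roll back on a spike
# (sketch: query, threshold and rollback target are placeholders)
import subprocess
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = 'sum(rate(http_requests_total{status=~"5.."}[1m]))'

def error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def watch(duration_s: int = 600, threshold: float = 1.0) -> str:
    """Sample every 30s for duration_s; undo the rollout if errors spike."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if error_rate() > threshold:
            subprocess.run(["kubectl", "rollout", "undo", "deployment/my-app"])
            return "rolled_back"
        time.sleep(30)
    return "healthy"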

🧠 The Brain: How the Orchestrator Works

The three agents don't work in isolation. An orchestrator coordinates them:

# pipeline.yaml
pipeline:
  trigger: on_push

  stages:
    - name: analyze
      agent: orchestrator
      action: "Analyze commit, determine risk, route to appropriate pipeline"

    - name: test
      agent: test_agent
      timeout: 10m
      on_failure: "Generate fix suggestions, retry once"

    - name: build
      agent: build_agent
      timeout: 5m
      depends_on: test

    - name: deploy
      agent: deploy_agent
      timeout: 15m
      depends_on: build
      strategy: "AI-selected based on risk score"

    - name: monitor
      agent: deploy_agent
      duration: 10m
      on_anomaly: auto_rollback

The magic is in the analyze stage. Before any agent runs, the orchestrator:

  1. Reads the commit: what files changed, what the diff looks like
  2. Assesses risk: database migration? config change? just a typo fix?
  3. Routes accordingly: low-risk changes get the fast pipeline, high-risk ones get extra scrutiny
🟡 Low risk    (typo fix, docs)      → Skip tests, fast build, rolling deploy
🟠 Medium risk (feature code)        → Full tests, standard build, rolling deploy
🔴 High risk   (DB migration, auth)  → Full tests + extra, canary deploy, 30 min monitor
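Under the hood, the routing is an LLM risk score with a few hard rules layered on top (migrations and auth changes always escalate, docs never do). A simplified sketch of just the rule layer, with illustrative file patterns:

# risk_router.py - simplified analyze-stage routing (rule layer only; patterns are illustrative)
def assess_risk(changed_files: list[str]) -> str:
    """Return 'low', 'medium' or 'high' based on what the commit touches."""
    high_risk_prefixes = ("migrations/", "auth/", "Dockerfile", "infra/")
    low_risk_suffixes = (".md", ".rst", ".txt")

    if any(f.startswith(high_risk_prefixes) for f in changed_files):
        return "high"    # canary deploy, 30 min monitoring
    if changed_files and all(f.endswith(low_risk_suffixes) for f in changed_files):
        return "low"     # skip tests, fast build, rolling deploy
    return "medium"      # full tests, standard build, rolling deploy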

📊 The Results: Real Numbers

After 3 months of running this system:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| ⏱️ Deploy time | 45 min | 3 min | -93% |
| 🐛 Failed deploys/month | 8–12 | 0–1 | -92% |
| 🧑‍💻 Manual steps | 14 | 0 | -100% |
| 🔄 Rollback time | 20 min | 30 sec | -97% |
| 😴 After-hours deploys | Frequent | Never | ∞ better |
| 💸 Dev time wasted/week | ~6 hrs | ~0 hrs | +6 hrs/week |

That's 6 extra hours per week of actual coding time. Per developer. Across the team.


🛠️ How to Build Your Own (Step by Step)

Step 1: Start with the Test Agent

This is the easiest win. Here's a minimal version:

# minimal_test_agent.py
import os

import openai
from github import Github

def analyze_and_test(pr_number):
    g = Github(os.environ["GITHUB_TOKEN"])
    repo = g.get_repo("your-org/your-repo")
    pr = repo.get_pull(pr_number)

    # Get the changed files and their diffs
    diff = "\n\n".join(
        f"--- {f.filename} ---\n{f.patch or ''}" for f in pr.get_files()
    )

    # Ask AI what to test
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "system",
            "content": "You are a senior QA engineer. Analyze code changes and suggest test cases."
        }, {
            "role": "user",
            "content": f"Changes:\n{diff}\n\nWhat tests should we add or update?"
        }]
    )

    # Return the suggested test cases
    test_suggestions = response.choices[0].message.content
    return test_suggestions
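To make the output visible where people actually look, post it straight back on the PR. With the same PyGithub client that's a couple of lines (the repo name is the same placeholder as above):

# post_suggestions.py - drop the AI's test suggestions into the PR thread (sketch)
import os
from github import Github

def post_suggestions(pr_number: int, suggestions: str) -> None:
    repo = Github(os.environ["GITHUB_TOKEN"]).get_repo("your-org/your-repo")
    # PRs are issues under the hood, so this shows up as a normal PR comment
    repo.get_pull(pr_number).create_issue_comment(
        f"Suggested tests for this change:\n\n{suggestions}"
    )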

Step 2: Add the Build Optimizer

# build_optimizer.py
def optimize_build(changed_files):
    """Decide if we need a full rebuild or can use cache."""

    needs_full_rebuild = any(
        f in changed_files for f in [
            "package.json", "requirements.txt",
            "Dockerfile", "docker-compose.yml"
        ]
    )

    if needs_full_rebuild:
        return "full"
    else:
        return "cached"  # Saves 8-12 minutes!
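To feed it, all you need is the file list from the commit you're building. One way to get that in CI (this assumes the checkout has at least two commits of history, i.e. fetch-depth greater than 1):

# changed_files.py - feed the optimizer from git (assumes fetch-depth > 1)
import subprocess
from build_optimizer import optimize_build

def changed_files() -> list[str]:
    out = subprocess.check_output(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"], text=True
    )
    return [line for line in out.splitlines() if line]

if __name__ == "__main__":
    print(optimize_build(changed_files()))  # prints "full" or "cached"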

Step 3: Wire It All Together

# .github/workflows/ai-pipeline.yml
name: AI-Powered CI/CD

on:
  push:
    branches: [main]

jobs:
  ai-analyze:
    runs-on: ubuntu-latest
    outputs:
      risk_level: ${{ steps.analyze.outputs.risk }}
    steps:
      - uses: actions/checkout@v4
      - id: analyze
        run: python scripts/ai_analyze.py

  test:
    needs: ai-analyze
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/ai_test_agent.py

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/ai_build_agent.py

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/ai_deploy_agent.py
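One wiring detail: the risk_level output only exists if ai_analyze.py actually writes it. A sketch of what that script's ending looks like (the risk logic itself is whatever you built for the analyze stage; here it's a stub):

# scripts/ai_analyze.py - sketch: hand the risk level to downstream jobs
import os

def assess_risk() -> str:
    """Stub: plug in your analyze-stage routing / LLM scoring here."""
    return "medium"

# GitHub Actions picks up "name=value" lines appended to the $GITHUB_OUTPUT file
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"risk={assess_risk()}\n")

Downstream jobs can then read it as needs.ai-analyze.outputs.risk_level and, for example, go straight to a rolling deploy for low-risk changes.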

⚠️ What Went Wrong (Honesty Time)

It wasn't all smooth sailing. Here's what broke:

🤖 Hallucinated Test Cases

The test agent occasionally generated tests for functionality that didn't exist. Fix: Added a validation step that runs generated tests against the codebase first.
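Concretely, the validation step just runs each generated test file against the current codebase and throws away anything that doesn't pass, since a hallucinated test usually blows up on import or on a missing attribute. A stripped-down version (the directory name is a placeholder):

# validate_generated_tests.py - keep only generated tests that actually run (sketch)
import pathlib
import subprocess

def passes(test_file: pathlib.Path) -> bool:
    result = subprocess.run(["pytest", "-q", str(test_file)], capture_output=True)
    return result.returncode == 0

def filter_generated(directory: str = "generated_tests") -> list[pathlib.Path]:
    kept = []
    for path in sorted(pathlib.Path(directory).glob("test_*.py")):
        if passes(path):
            kept.append(path)
        else:
            path.unlink()  # references code that doesn't exist, or is just broken
    return kept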

๐ŸŒ Over-Conservative Risk Scoring

Early on, the orchestrator flagged everything as high-risk. Every deploy was a canary deploy. Fix: Trained it on 3 months of historical deploy data to calibrate risk scores.

💸 API Costs

Running GPT-4 on every commit got expensive fast. Fix: Used GPT-4 only for risk assessment, GPT-3.5-turbo for test generation and build optimization. Cost dropped 70%.
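The cost fix is one routing table: every LLM call goes through a single helper that picks the model based on the task, so the expensive model only sees the decisions that gate everything else. Roughly (model names are what we used at the time):

# model_router.py - send each task to the cheapest model that can handle it (sketch)
from openai import OpenAI

client = OpenAI()

MODEL_FOR_TASK = {
    "risk_assessment": "gpt-4",          # the decision that gates the whole pipeline
    "test_generation": "gpt-3.5-turbo",
    "build_optimization": "gpt-3.5-turbo",
}

def ask(task: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_FOR_TASK[task],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content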

🔇 Alert Fatigue (The Irony)

The deploy agent was too cautious and sent too many alerts. Fix: Added an "alert agent" that batches and deduplicates notifications.
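That alert agent is mostly a buffer with a timer: identical messages inside a window collapse into one Slack post with a count. A stripped-down version (the webhook URL is a placeholder):

# alert_batcher.py - collapse duplicate alerts into one batched Slack message (sketch)
import time
from collections import Counter
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

class AlertBatcher:
    def __init__(self, window_s: int = 300):
        self.window_s = window_s
        self.buffer: Counter = Counter()
        self.window_start = time.time()

    def add(self, message: str) -> None:
        self.buffer[message] += 1
        if time.time() - self.window_start >= self.window_s:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            lines = [f"{msg} (x{count})" for msg, count in self.buffer.items()]
            requests.post(SLACK_WEBHOOK, json={"text": "\n".join(lines)})
        self.buffer.clear()
        self.window_start = time.time()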


🧰 The Tech Stack

For those who want the full picture:

| Component | Tool | Why |
| --- | --- | --- |
| 🧠 LLM | OpenAI GPT-4 + GPT-3.5 | Best reasoning for risk; fast + cheap for routine |
| 🤖 Agent Framework | LangChain | Tool use, memory, chaining |
| 🔄 CI/CD | GitHub Actions | Native integration, easy webhooks |
| 📦 Container | Docker + BuildKit | Layer caching, multi-stage |
| 🚀 Deploy | ArgoCD + Kubernetes | GitOps, auto-sync, rollback |
| 📊 Monitoring | Prometheus + Grafana | Metrics for anomaly detection |
| 🔔 Notifications | Slack Bot | Smart, batched, contextual |

🔮 What's Next

I'm currently working on:

  1. 🧠 Self-Healing Pipelines: if a step fails, the agent diagnoses why and fixes it automatically
  2. 📈 Predictive Deploys: the agent suggests "deploy now" based on traffic patterns and team availability
  3. 🤝 Multi-Repo Coordination: agents that understand microservice dependencies and deploy in the right order
  4. 📝 Auto-Generated Changelogs: AI writes release notes from the actual code changes

🎯 Key Takeaways

  1. Start small: the test agent alone saved us 2 hours/week
  2. Let AI decide, not just do: the routing logic (risk assessment) is more valuable than the automation itself
  3. Monitor the monitor: AI agents need oversight too; build in feedback loops
  4. Cost-optimize aggressively: use expensive models for decisions, cheap models for execution
  5. Be honest about failures: every system breaks; the goal is faster recovery

💬 Let's Talk

Have you tried using AI agents in your DevOps workflow? What worked? What exploded?

Drop a comment below. I'd love to hear your war stories. 👇


If this post saved you time, it'll save your friends time too. Share it. 🔄

Follow me for more on AI-powered development workflows.
