How I Used AI Agents to Automate My Entire CI/CD Pipeline

We were deploying like it was 2019. Manual steps, Slack prayers, and a 45-minute pipeline that broke twice a week. Then I gave AI agents the keys, and everything changed.


😩 The Problem: Death by a Thousand Manual Steps

Let me paint you a picture of our old deploy process:

1. Developer pushes to main
2. Someone notices (maybe)
3. Manually trigger CI in Jenkins
4. Wait for tests (pray they pass)
5. Manually approve staging deploy
6. Run smoke tests (manually, of course)
7. Ping Slack: "staging looks good?"
8. Wait for someone to say 👍
9. Manually trigger production deploy
10. Monitor dashboards for 20 minutes
11. If something breaks → rollback (manually)
12. Write incident report
13. Question life choices
14. Repeat tomorrow

14 steps. 45 minutes. 8–12 failed deploys per month. 😵

We weren't shipping software; we were performing rituals.



💡 The Idea: What If the Pipeline Could Think?

The breakthrough moment came during a 2 AM incident. Our deploy broke because someone forgot to run a database migration. I thought:

"An AI agent would have caught that. It would've looked at the diff, seen the migration file, and known to run it."

That's when I decided to build an AI-agent-driven CI/CD pipeline: not just automation scripts, but agents that understand what's being deployed and decide how to handle it.


๐Ÿ—๏ธ The Architecture

Here's what the final system looks like:

AI Agent CI/CD Architecture Diagram

The Three Agents

I built three specialized agents, each with a distinct job:

🧪 Agent 1: The Test Agent

# test_agent.py
class TestAgent:
    """Analyzes code changes and generates/updates tests automatically."""

    def on_push(self, commit):
        diff = self.get_diff(commit)
        changed_files = self.analyze_changes(diff)

        # AI analyzes what changed and why
        analysis = self.llm.analyze(
            prompt=f"""
            Analyze this code change (files touched: {changed_files}):
            {diff}

            What could break? What edge cases should be tested?
            Generate targeted test cases.
            """,
            context=self.get_codebase_context()
        )

        # Generate tests for uncovered paths
        new_tests = self.generate_tests(analysis)
        self.run_and_validate(new_tests)

        # Fix any flaky tests it detects
        flaky_tests = self.detect_flaky_tests()
        for test in flaky_tests:
            self.fix_flaky_test(test)

What it does:

  • 🔍 Reads the actual diff, not just "run all tests"
  • 🧬 Generates tests for new code paths automatically
  • 🔧 Detects and fixes flaky tests before they block deploys (see the sketch after this list)
  • 📊 Reports coverage gaps with suggestions
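The flaky-test detection, by the way, is not magic: the agent re-runs anything that failed a few times and flags tests whose results disagree with each other. A minimal sketch of that check (assuming pytest; run_test is just a thin wrapper around your test runner):

# flaky_check.py - minimal flaky-test detector (sketch, assuming pytest)
import subprocess

def run_test(test_id: str) -> bool:
    """Run a single test node and report whether it passed."""
    result = subprocess.run(["pytest", test_id, "-q"], capture_output=True)
    return result.returncode == 0

def is_flaky(test_id: str, retries: int = 5) -> bool:
    """A test is flaky if repeated runs of the same code disagree."""
    outcomes = {run_test(test_id) for _ in range(retries)}
    return len(outcomes) > 1  # saw both pass and fail -> flaky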

🔨 Agent 2: The Build Agent

# build_agent.py
class BuildAgent:
    """Optimizes build process based on what actually changed."""

    def on_tests_pass(self, commit):
        changes = self.analyze_changes(commit)

        # Smart Dockerfile optimization
        if changes.has_dependency_changes():
            self.rebuild_base_layer()
        elif changes.only_app_code():
            self.use_cached_layers()  # Saves 8-12 minutes

        # AI optimizes the Dockerfile itself
        optimized_dockerfile = self.llm.optimize(
            prompt=f"""
            Optimize this Dockerfile for the current changes:
            {self.current_dockerfile}

            Changes: {changes.summary}

            Focus on: layer caching, multi-stage builds, image size.
            """,
            constraints=["must pass security scan", "under 500MB"]
        )

        self.build(optimized_dockerfile)

What it does:

  • 🏎️ Skips full rebuilds when only app code changed (saves 8–12 min)
  • 📦 Optimizes Dockerfiles on the fly: smaller images, better caching
  • 🛡️ Runs security scans and blocks vulnerable dependencies (gate sketched after this list)
  • 📝 Generates build reports with size diffs
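For the security scan, the agent does not negotiate: the build fails hard if the image has serious findings. A sketch of that gate, assuming Trivy is available on the runner (the image name is whatever your build just produced):

# security_gate.py - fail the build on HIGH/CRITICAL findings (sketch, assumes Trivy)
import subprocess
import sys

def scan_image(image: str) -> None:
    # --exit-code 1 makes Trivy return non-zero when matching findings exist
    result = subprocess.run(
        ["trivy", "image", "--exit-code", "1", "--severity", "HIGH,CRITICAL", image]
    )
    if result.returncode != 0:
        sys.exit(f"Security gate failed: {image} has vulnerable dependencies")

if __name__ == "__main__":
    scan_image(sys.argv[1])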

🚀 Agent 3: The Deploy Agent

# deploy_agent.py
from datetime import datetime

class DeployAgent:
    """Handles deployment strategy and rollback decisions."""

    def on_build_pass(self, artifact):
        # AI decides deployment strategy
        strategy = self.llm.decide(
            prompt=f"""
            Decide deployment strategy for:
            - Change type: {artifact.change_type}
            - Risk level: {artifact.risk_score}
            - Affected services: {artifact.services}
            - Time of day: {datetime.now()}

            Options: rolling, blue-green, canary, hotfix
            """,
            rules=self.deployment_rules
        )

        # Execute with monitoring
        result = self.deploy(artifact, strategy)

        # Watch metrics for anomalies
        anomalies = self.monitor_deployment(duration="10m")
        if anomalies:
            self.auto_rollback(reason=anomalies.summary)
            self.notify_team(f"🚨 Auto-rolled back: {anomalies.summary}")
        else:
            self.notify_team(f"✅ Deploy successful! {strategy.name}")

What it does:

  • 🎯 Chooses the deployment strategy based on risk (not one-size-fits-all)
  • 📈 Monitors key metrics for 10 minutes post-deploy (see the sketch after this list)
  • ⏪ Auto-rolls back in 30 seconds if anomalies are detected
  • 📱 Smart notifications: no more "deployed!" spam
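The post-deploy monitoring is deliberately boring: poll Prometheus, compare against a threshold, roll back if it spikes. A rough sketch of that loop (the PromQL query, the threshold, and the rollback command are illustrative, not our exact values):

# monitor_deploy.py - watch the error rate after a deploy and roll back on a spike
# (sketch: query, threshold and rollback target are placeholders)
import subprocess
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = 'sum(rate(http_requests_total{status=~"5.."}[1m]))'

def error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def watch(duration_s: int = 600, threshold: float = 1.0) -> str:
    """Sample every 30s for duration_s; undo the rollout if errors spike."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if error_rate() > threshold:
            subprocess.run(["kubectl", "rollout", "undo", "deployment/my-app"])
            return "rolled_back"
        time.sleep(30)
    return "healthy"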

🧠 The Brain: How the Orchestrator Works

The three agents don't work in isolation. An orchestrator coordinates them:

# pipeline.yaml
pipeline:
  trigger: on_push

  stages:
    - name: analyze
      agent: orchestrator
      action: "Analyze commit, determine risk, route to appropriate pipeline"

    - name: test
      agent: test_agent
      timeout: 10m
      on_failure: "Generate fix suggestions, retry once"

    - name: build
      agent: build_agent
      timeout: 5m
      depends_on: test

    - name: deploy
      agent: deploy_agent
      timeout: 15m
      depends_on: build
      strategy: "AI-selected based on risk score"

    - name: monitor
      agent: deploy_agent
      duration: 10m
      on_anomaly: auto_rollback

The magic is in the analyze stage. Before any agent runs, the orchestrator:

  1. Reads the commit: what files changed, what the diff looks like
  2. Assesses risk: database migration? config change? just a typo fix?
  3. Routes accordingly: low-risk changes get the fast pipeline, high-risk ones get extra scrutiny
🟡 Low risk    (typo fix, docs)      → Skip tests, fast build, rolling deploy
🟠 Medium risk (feature code)        → Full tests, standard build, rolling deploy
🔴 High risk   (DB migration, auth)  → Full tests + extra, canary deploy, 30 min monitor
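Under the hood, the routing is an LLM risk score with a few hard rules layered on top (migrations and auth changes always escalate, docs never do). A simplified sketch of just the rule layer, with illustrative file patterns:

# risk_router.py - simplified analyze-stage routing (rule layer only; patterns are illustrative)
def assess_risk(changed_files: list[str]) -> str:
    """Return 'low', 'medium' or 'high' based on what the commit touches."""
    high_risk_prefixes = ("migrations/", "auth/", "Dockerfile", "infra/")
    low_risk_suffixes = (".md", ".rst", ".txt")

    if any(f.startswith(high_risk_prefixes) for f in changed_files):
        return "high"    # canary deploy, 30 min monitoring
    if changed_files and all(f.endswith(low_risk_suffixes) for f in changed_files):
        return "low"     # skip tests, fast build, rolling deploy
    return "medium"      # full tests, standard build, rolling deploy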

📊 The Results: Real Numbers

After 3 months of running this system:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| ⏱️ Deploy time | 45 min | 3 min | -93% |
| 🐛 Failed deploys/month | 8–12 | 0–1 | -92% |
| 🧑‍💻 Manual steps | 14 | 0 | -100% |
| 🔄 Rollback time | 20 min | 30 sec | -97% |
| 😴 After-hours deploys | Frequent | Never | ∞ better |
| 💸 Dev time wasted/week | ~6 hrs | ~0 hrs | +6 hrs/week |

That's 6 extra hours per week of actual coding time. Per developer. Across the team.


🛠️ How to Build Your Own (Step by Step)

Step 1: Start with the Test Agent

This is the easiest win. Here's a minimal version:

# minimal_test_agent.py
import os

import openai
from github import Github

def analyze_and_test(pr_number):
    g = Github(os.environ["GITHUB_TOKEN"])
    repo = g.get_repo("your-org/your-repo")
    pr = repo.get_pull(pr_number)

    # Get the changed files and their diffs
    diff = "\n\n".join(
        f"--- {f.filename} ---\n{f.patch or ''}" for f in pr.get_files()
    )

    # Ask AI what to test
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "system",
            "content": "You are a senior QA engineer. Analyze code changes and suggest test cases."
        }, {
            "role": "user",
            "content": f"Changes:\n{diff}\n\nWhat tests should we add or update?"
        }]
    )

    # Return the suggested test cases
    test_suggestions = response.choices[0].message.content
    return test_suggestions
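To make the output visible where people actually look, post it straight back on the PR. With the same PyGithub client that's a couple of lines (the repo name is the same placeholder as above):

# post_suggestions.py - drop the AI's test suggestions into the PR thread (sketch)
import os
from github import Github

def post_suggestions(pr_number: int, suggestions: str) -> None:
    repo = Github(os.environ["GITHUB_TOKEN"]).get_repo("your-org/your-repo")
    # PRs are issues under the hood, so this shows up as a normal PR comment
    repo.get_pull(pr_number).create_issue_comment(
        f"Suggested tests for this change:\n\n{suggestions}"
    )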

Step 2: Add the Build Optimizer

# build_optimizer.py
def optimize_build(changed_files):
    """Decide if we need a full rebuild or can use cache."""

    needs_full_rebuild = any(
        f in changed_files for f in [
            "package.json", "requirements.txt",
            "Dockerfile", "docker-compose.yml"
        ]
    )

    if needs_full_rebuild:
        return "full"
    else:
        return "cached"  # Saves 8-12 minutes!
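To feed it, all you need is the file list from the commit you're building. One way to get that in CI (this assumes the checkout has at least two commits of history, i.e. fetch-depth greater than 1):

# changed_files.py - feed the optimizer from git (assumes fetch-depth > 1)
import subprocess
from build_optimizer import optimize_build

def changed_files() -> list[str]:
    out = subprocess.check_output(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"], text=True
    )
    return [line for line in out.splitlines() if line]

if __name__ == "__main__":
    print(optimize_build(changed_files()))  # prints "full" or "cached"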

Step 3: Wire It All Together

# .github/workflows/ai-pipeline.yml
name: AI-Powered CI/CD

on:
  push:
    branches: [main]

jobs:
  ai-analyze:
    runs-on: ubuntu-latest
    outputs:
      risk_level: ${{ steps.analyze.outputs.risk }}
    steps:
      - uses: actions/checkout@v4
      - id: analyze
        run: python scripts/ai_analyze.py

  test:
    needs: ai-analyze
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/ai_test_agent.py

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/ai_build_agent.py

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/ai_deploy_agent.py
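One wiring detail: the risk_level output only exists if ai_analyze.py actually writes it. A sketch of what that script's ending looks like (the risk logic itself is whatever you built for the analyze stage; here it's a stub):

# scripts/ai_analyze.py - sketch: hand the risk level to downstream jobs
import os

def assess_risk() -> str:
    """Stub: plug in your analyze-stage routing / LLM scoring here."""
    return "medium"

# GitHub Actions picks up "name=value" lines appended to the $GITHUB_OUTPUT file
with open(os.environ["GITHUB_OUTPUT"], "a") as fh:
    fh.write(f"risk={assess_risk()}\n")

Downstream jobs can then read it as needs.ai-analyze.outputs.risk_level and, for example, go straight to a rolling deploy for low-risk changes.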

⚠️ What Went Wrong (Honesty Time)

It wasn't all smooth sailing. Here's what broke:

🤖 Hallucinated Test Cases

The test agent occasionally generated tests for functionality that didn't exist. Fix: Added a validation step that runs generated tests against the codebase first.
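Concretely, the validation step just runs each generated test file against the current codebase and throws away anything that doesn't pass, since a hallucinated test usually blows up on import or on a missing attribute. A stripped-down version (the directory name is a placeholder):

# validate_generated_tests.py - keep only generated tests that actually run (sketch)
import pathlib
import subprocess

def passes(test_file: pathlib.Path) -> bool:
    result = subprocess.run(["pytest", "-q", str(test_file)], capture_output=True)
    return result.returncode == 0

def filter_generated(directory: str = "generated_tests") -> list[pathlib.Path]:
    kept = []
    for path in sorted(pathlib.Path(directory).glob("test_*.py")):
        if passes(path):
            kept.append(path)
        else:
            path.unlink()  # references code that doesn't exist, or is just broken
    return kept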

๐ŸŒ Over-Conservative Risk Scoring

Early on, the orchestrator flagged everything as high-risk. Every deploy was a canary deploy. Fix: Trained it on 3 months of historical deploy data to calibrate risk scores.

💸 API Costs

Running GPT-4 on every commit got expensive fast. Fix: Used GPT-4 only for risk assessment, GPT-3.5-turbo for test generation and build optimization. Cost dropped 70%.
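The cost fix is one routing table: every LLM call goes through a single helper that picks the model based on the task, so the expensive model only sees the decisions that gate everything else. Roughly (model names are what we used at the time):

# model_router.py - send each task to the cheapest model that can handle it (sketch)
from openai import OpenAI

client = OpenAI()

MODEL_FOR_TASK = {
    "risk_assessment": "gpt-4",          # the decision that gates the whole pipeline
    "test_generation": "gpt-3.5-turbo",
    "build_optimization": "gpt-3.5-turbo",
}

def ask(task: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_FOR_TASK[task],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content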

🔇 Alert Fatigue (The Irony)

The deploy agent was too cautious and sent too many alerts. Fix: Added an "alert agent" that batches and deduplicates notifications.
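That alert agent is mostly a buffer with a timer: identical messages inside a window collapse into one Slack post with a count. A stripped-down version (the webhook URL is a placeholder):

# alert_batcher.py - collapse duplicate alerts into one batched Slack message (sketch)
import time
from collections import Counter
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

class AlertBatcher:
    def __init__(self, window_s: int = 300):
        self.window_s = window_s
        self.buffer: Counter = Counter()
        self.window_start = time.time()

    def add(self, message: str) -> None:
        self.buffer[message] += 1
        if time.time() - self.window_start >= self.window_s:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            lines = [f"{msg} (x{count})" for msg, count in self.buffer.items()]
            requests.post(SLACK_WEBHOOK, json={"text": "\n".join(lines)})
        self.buffer.clear()
        self.window_start = time.time()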


🧰 The Tech Stack

For those who want the full picture:

| Component | Tool | Why |
| --- | --- | --- |
| 🧠 LLM | OpenAI GPT-4 + GPT-3.5 | Best reasoning for risk; fast + cheap for routine |
| 🤖 Agent Framework | LangChain | Tool use, memory, chaining |
| 🔄 CI/CD | GitHub Actions | Native integration, easy webhooks |
| 📦 Container | Docker + BuildKit | Layer caching, multi-stage |
| 🚀 Deploy | ArgoCD + Kubernetes | GitOps, auto-sync, rollback |
| 📊 Monitoring | Prometheus + Grafana | Metrics for anomaly detection |
| 🔔 Notifications | Slack Bot | Smart, batched, contextual |

🔮 What's Next

I'm currently working on:

  1. 🧠 Self-Healing Pipelines: if a step fails, the agent diagnoses why and fixes it automatically
  2. 📈 Predictive Deploys: the agent suggests "deploy now" based on traffic patterns and team availability
  3. 🤝 Multi-Repo Coordination: agents that understand microservice dependencies and deploy in the right order
  4. 📝 Auto-Generated Changelogs: AI writes release notes from the actual code changes

🎯 Key Takeaways

  1. Start small: the test agent alone saved us 2 hours/week
  2. Let AI decide, not just do: the routing logic (risk assessment) is more valuable than the automation itself
  3. Monitor the monitor: AI agents need oversight too; build in feedback loops
  4. Cost-optimize aggressively: use expensive models for decisions, cheap models for execution
  5. Be honest about failures: every system breaks; the goal is faster recovery

💬 Let's Talk

Have you tried using AI agents in your DevOps workflow? What worked? What exploded?

Drop a comment below. I'd love to hear your war stories. 👇


If this post saved you time, it'll save your friends time too. Share it. 🔄

Follow me for more on AI-powered development workflows.
