Ravi Teja Reddy Mandala

Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)

A production-grade GitHub Actions workflow + an SRE reliability rubric that transforms AI from a code suggester into a structured risk detection system.

We tried AI code review in CI.

It was fast.
It was confident.
It was mostly noise.

It praised trivial refactors.
It nitpicked formatting.
It occasionally hallucinated “critical issues.”

And it did absolutely nothing to reduce production incidents.

The mistake wasn’t using AI.

The mistake was asking AI to “review code.”

In reliability engineering, we don’t ask:
“Is this code good?”

We ask:

  • What is the blast radius?
  • What is the rollback plan?
  • What happens under failure?
  • What is the operational risk?

So we rebuilt our AI reviewer using SRE principles.

This is the exact system.


🚨 Why Most AI Code Review Systems Fail

Most implementations:

  • Run an LLM over the PR diff
  • Ask for general feedback
  • Post suggestions as a comment

The result?

Unstructured opinions.

But production incidents are rarely caused by style issues.
They’re caused by:

  • Missing rollback strategy
  • Untested edge cases
  • Configuration drift
  • Silent failure paths
  • Inconsistent validation
  • Operational blind spots

If your AI doesn’t classify risk, it cannot reduce incidents.


🧠 The Shift: From “Suggestions” to “Structured Risk Classification”

We introduced a mandatory review schema.

AI must output:

  1. Category
  2. Severity
  3. Confidence
  4. Production Impact
  5. Required Action

If it cannot classify something within this structure — it doesn’t get posted.

This immediately reduced noise by ~60%.

Because vague suggestions were eliminated.
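Pinned down as a type, the schema looks roughly like this (a minimal sketch; the enum values and field names are our illustration, not a standard):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class Confidence(Enum):
    CERTAIN = "certain"
    LIKELY = "likely"
    UNCERTAIN = "uncertain"

@dataclass
class Finding:
    category: str           # "reliability", "security", "testing", ...
    severity: Severity
    confidence: Confidence
    production_impact: str  # one sentence: what breaks in production
    required_action: str    # the concrete remediation step
```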


📋 The Reliability Review Rubric

This is the foundation.

| Category | Severity | Confidence | Required Output | Production Lens |
|-------------|--------|-----------|------------------|-----------------------|
| Reliability | High   | Certain   | Rollback plan    | Data loss? Downtime?  |
| Security    | High   | Likely    | Validation proof | External input risk?  |
| Testing     | Medium | Certain   | Missing tests    | Edge-case exposure?   |
| Operability | Medium | Likely    | Logging/metrics  | Debuggability risk?   |
| Performance | Medium | Uncertain | Benchmark proof  | Latency spike risk?   |

Key Rule:

The AI is not allowed to:

  • Approve code
  • Suggest stylistic improvements unless they impact reliability
  • Comment without severity classification

This converts AI from opinion engine → reliability signal engine.
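In practice the gate is a dumb filter in front of the comment step. A sketch, reusing the `Finding` type above (the style-keyword heuristic is an assumption for illustration, not our exact filter):

```python
STYLE_KEYWORDS = ("formatting", "naming", "whitespace", "style")

def should_post(finding: Finding) -> bool:
    """Enforce the key rules: no unclassified or purely stylistic comments."""
    # No comment without a severity + confidence classification
    if finding.severity is None or finding.confidence is None:
        return False
    # No stylistic nitpicks unless they carry reliability impact
    text = f"{finding.production_impact} {finding.required_action}".lower()
    if any(k in text for k in STYLE_KEYWORDS) and finding.category != "reliability":
        return False
    return True
```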


⚙️ GitHub Actions Architecture

We designed the workflow in 4 stages:

  1. Diff Extraction
  2. Context Enrichment
  3. AI Risk Classification
  4. Structured PR Feedback

🛠️ Step 1: Extract the True PR Surface Area

We only feed:

  • Changed files
  • Unified diff
  • File ownership context
  • Environment metadata (service type, criticality level)

```yaml
name: AI Reliability Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  reliability-review:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so the base branch exists for diffing

      - name: Generate PR Diff
        run: git diff "origin/${{ github.base_ref }}...HEAD" > pr.diff

      - name: Collect Metadata
        run: |
          echo "service=payments-api" >> context.txt
          echo "tier=critical" >> context.txt
```

Why this matters:

Context drastically improves classification accuracy.
A migration change in a critical payments service ≠ UI change in a dashboard.


🤖 Step 2: AI Classification Layer

Instead of prompting:

“Review this code.”

We prompt:

“Classify each risk under this schema. If uncertain, mark confidence as uncertain. Do not speculate.”

Example expected output:

AI Risk Report

| Category | Severity | Confidence | Finding |
|-------------|--------|---------|----------------------------------|
| Reliability | High   | Certain | DB migration lacks rollback path |
| Testing     | Medium | Likely  | No null-input test coverage      |
| Operability | Medium | Certain | No structured error logging      |

No essays.
No praise.
Just structured risk.
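A sketch of the classification call, assuming the OpenAI Python client with JSON mode (any model that can emit structured JSON works; the prompt is abbreviated):

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Classify each risk in the diff under this schema: category, severity, "
    "confidence, production_impact, required_action. If uncertain, mark "
    "confidence as 'uncertain'. Do not speculate."
)

def classify(diff: str, context: str) -> list[dict]:
    # response_format forces parseable JSON instead of free-form prose
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nDiff:\n{diff}"},
        ],
    )
    return json.loads(response.choices[0].message.content).get("findings", [])
```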


💬 Step 3: Structured PR Comment Output

We auto-generate:

```markdown
### 🔎 AI Reliability Review

| Category | Severity | Confidence | Impact |
|----------|----------|------------|--------|
| Reliability | High | Certain | Possible migration failure without rollback |
| Testing | Medium | Likely | Edge case failure under null input |

### Required Actions
- [ ] Document rollback strategy
- [ ] Add null-input test
- [ ] Add structured logging for error path
```

Now the human reviewer can triage immediately.
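One way to render and post that comment, assuming the `gh` CLI (preinstalled on GitHub-hosted runners, authenticated via `GH_TOKEN`):

```python
import subprocess

def post_review(findings: list[dict], pr_number: str) -> None:
    rows = "\n".join(
        f"| {f['category']} | {f['severity']} | {f['confidence']} | {f['production_impact']} |"
        for f in findings
    )
    actions = "\n".join(f"- [ ] {f['required_action']}" for f in findings)
    body = (
        "### 🔎 AI Reliability Review\n\n"
        "| Category | Severity | Confidence | Impact |\n"
        "|----------|----------|------------|--------|\n"
        f"{rows}\n\n"
        f"### Required Actions\n{actions}\n"
    )
    # gh reads GH_TOKEN from the environment on Actions runners
    subprocess.run(["gh", "pr", "comment", pr_number, "--body", body], check=True)
```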


📊 Real Operational Impact

After deploying this system:

  • Review noise dropped significantly
  • Reviewers triaged high-severity items first
  • More PRs shipped with documented rollback plans
  • Edge-case test coverage improved
  • Incident retros showed fewer “missing test / missing rollback” root causes

AI didn’t reduce incidents.

Structured enforcement did.

AI simply enforced discipline consistently.


🔒 Guardrails That Made It Production-Safe

We added strict constraints:

  • AI cannot block merge directly
  • High severity items require human acknowledgment
  • Low confidence findings are labeled informational
  • The model cannot auto-edit code
  • Outputs must match strict JSON schema

If schema validation fails → comment not posted.

This prevents hallucination-driven chaos.
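The validation gate itself is a few lines, e.g. with the `jsonschema` package (the schema below is a trimmed illustration):

```python
from jsonschema import ValidationError, validate

FINDING_SCHEMA = {
    "type": "object",
    "required": [
        "category", "severity", "confidence",
        "production_impact", "required_action",
    ],
    "properties": {
        "severity": {"enum": ["high", "medium", "low"]},
        "confidence": {"enum": ["certain", "likely", "uncertain"]},
    },
}

def validated(findings: list[dict]) -> list[dict]:
    """Fail closed: if any finding breaks the schema, post nothing."""
    try:
        for finding in findings:
            validate(instance=finding, schema=FINDING_SCHEMA)
    except ValidationError:
        return []
    return findings
```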


📦 Sample Rubric Configuration

```json
{
  "rules": [
    {
      "name": "Rollback Plan Missing",
      "category": "reliability",
      "severity": "high",
      "confidence_threshold": "likely",
      "required_output": "Explicit rollback steps documented"
    },
    {
      "name": "Edge Case Test Missing",
      "category": "testing",
      "severity": "medium",
      "confidence_threshold": "certain",
      "required_output": "Add test covering null and boundary inputs"
    }
  ]
}
```
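Matching findings against the rubric is then a threshold check, with confidence ranked uncertain < likely < certain (the ranking and the `rubric.json` filename are our assumptions):

```python
import json

CONFIDENCE_RANK = {"uncertain": 0, "likely": 1, "certain": 2}

def rule_triggered(finding: dict, rule: dict) -> bool:
    """A finding triggers a rule only at or above the rule's confidence threshold."""
    return (
        finding["category"] == rule["category"]
        and CONFIDENCE_RANK[finding["confidence"]]
        >= CONFIDENCE_RANK[rule["confidence_threshold"]]
    )

with open("rubric.json") as f:
    rules = json.load(f)["rules"]
```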

🧭 Why This Works (Engineering Psychology)

Developers ignore vague feedback.

They respond to:

  • Severity
  • Production impact
  • Explicit required actions

By aligning AI output with how SRE teams think during incidents, we shifted code review from “opinion discussion” to “risk mitigation workflow.”


🚀 Final Takeaway

AI in CI is not about automation.

It is about structured risk visibility at scale.

If you:

  • Force classification
  • Enforce severity
  • Include confidence
  • Require action
  • Validate schema

You transform AI from a novelty into a reliability multiplier.

The difference isn’t the model.

It’s the framework around it.


What reliability rule would you add to this rubric to prevent your most painful incident?

Drop it below. I’ll expand this framework in a follow-up post.
