Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)
A production-grade GitHub Actions workflow plus an SRE reliability rubric that turns AI from a code suggester into a structured risk-detection system.
We tried AI code review in CI.
It was fast.
It was confident.
It was mostly noise.
It praised trivial refactors.
It nitpicked formatting.
It occasionally hallucinated “critical issues.”
And it did absolutely nothing to reduce production incidents.
The mistake wasn’t using AI.
The mistake was asking AI to “review code.”
In reliability engineering, we don’t ask:
“Is this code good?”
We ask:
- What is the blast radius?
- What is the rollback plan?
- What happens under failure?
- What is the operational risk?
So we rebuilt our AI reviewer using SRE principles.
This is the exact system.
🚨 Why Most AI Code Review Systems Fail
Most implementations:
- Run an LLM over the PR diff
- Ask for general feedback
- Post suggestions as a comment
The result?
Unstructured opinions.
But production incidents are rarely caused by style issues.
They’re caused by:
- Missing rollback strategy
- Untested edge cases
- Configuration drift
- Silent failure paths
- Inconsistent validation
- Operational blind spots
If your AI doesn’t classify risk, it cannot reduce incidents.
🧠 The Shift: From “Suggestions” to “Structured Risk Classification”
We introduced a mandatory review schema.
AI must output:
- Category
- Severity
- Confidence
- Production Impact
- Required Action
If it cannot classify a finding within this structure, it doesn't get posted.
This immediately cut review noise by roughly 60%, because vague, unclassifiable suggestions were filtered out before they ever reached the PR.
📋 The Reliability Review Rubric
This is the foundation.
| Category | Severity | Confidence | Required Output | Production Lens |
|---|---|---|---|---|
| Reliability | High | Certain | Rollback plan | Data loss? Downtime? |
| Security | High | Likely | Validation proof | External input risk? |
| Testing | Medium | Certain | Missing tests | Edge-case exposure? |
| Operability | Medium | Likely | Logging/metrics | Debuggability risk? |
| Performance | Medium | Uncertain | Benchmark proof | Latency spike risk? |
Key Rule:
The AI is not allowed to:
- Approve code
- Suggest stylistic improvements unless they impact reliability
- Comment without severity classification
This converts AI from opinion engine → reliability signal engine.
⚙️ GitHub Actions Architecture
We designed the workflow in 4 stages:
- Diff Extraction
- Context Enrichment
- AI Risk Classification
- Structured PR Feedback
🛠️ Step 1: Extract the True PR Surface Area
We only feed:
- Changed files
- Unified diff
- File ownership context
- Environment metadata (service type, criticality level)
```yaml
name: AI Reliability Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  reliability-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          # Full history so origin/main is available for the three-dot diff
          fetch-depth: 0

      - name: Generate PR Diff
        run: git diff origin/main...HEAD > pr.diff

      - name: Collect Metadata
        run: |
          echo "service=payments-api" >> context.txt
          echo "tier=critical" >> context.txt
```
Why this matters:
Context drastically improves classification accuracy.
A migration change in a critical payments service ≠ UI change in a dashboard.
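To make that concrete, here is a hypothetical severity-weighting step. The tier names and the one-level escalation rule are illustrative assumptions, not part of the workflow above; the point is that the same finding should rank differently depending on service criticality:

```python
# Hypothetical sketch: escalate finding severity for critical-tier services.
# Tier names and the one-level bump are illustrative assumptions.
SEVERITY_ORDER = ["low", "medium", "high"]
TIER_ESCALATION = {"critical": 1, "standard": 0}

def effective_severity(base: str, tier: str) -> str:
    """Return the severity after applying the service-tier escalation."""
    idx = SEVERITY_ORDER.index(base) + TIER_ESCALATION.get(tier, 0)
    return SEVERITY_ORDER[min(idx, len(SEVERITY_ORDER) - 1)]

# The same migration finding is "medium" in a dashboard, "high" in payments.
print(effective_severity("medium", "standard"))  # medium
print(effective_severity("medium", "critical"))  # high
```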
🤖 Step 2: AI Classification Layer
Instead of prompting:
“Review this code.”
We prompt:
“Classify each risk under this schema. If uncertain, mark confidence as uncertain. Do not speculate.”
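A minimal sketch of how such a prompt could be assembled from the diff and the metadata collected in Step 1. The template wording here is an assumption; only the schema fields come from the rubric above:

```python
# Sketch of a schema-constrained classification prompt.
# The template wording is an assumption; the fields mirror the rubric.
PROMPT_TEMPLATE = """\
You are reviewing a change to the {service} service (tier: {tier}).

Classify each risk in the diff using this schema:
- Category: reliability | security | testing | operability | performance
- Severity: high | medium | low
- Confidence: certain | likely | uncertain
- Production Impact: one sentence
- Required Action: one concrete step

If uncertain, mark confidence as uncertain. Do not speculate.
Do not comment on style unless it affects reliability.

Diff:
{diff}
"""

def build_prompt(service: str, tier: str, diff: str) -> str:
    """Fill the template with the metadata and diff gathered in Step 1."""
    return PROMPT_TEMPLATE.format(service=service, tier=tier, diff=diff)
```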
Example expected output:
AI Risk Report
| Category | Severity | Confidence | Finding |
|---|---|---|---|
| Reliability | High | Certain | DB migration lacks rollback path |
| Testing | Medium | Likely | No null-input test coverage |
| Operability | Medium | Certain | No structured error logging |
No essays.
No praise.
Just structured risk.
💬 Step 3: Structured PR Comment Output
We auto-generate:
### 🔎 AI Reliability Review
| Category | Severity | Confidence | Impact |
|----------|----------|------------|--------|
| Reliability | High | Certain | Possible migration failure without rollback |
| Testing | Medium | Likely | Edge case failure under null input |
### Required Actions
- [ ] Document rollback strategy
- [ ] Add null-input test
- [ ] Add structured logging for error path
Now the human reviewer can triage immediately.
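A sketch of the comment generator, assuming the model's findings have already passed schema validation. The field names (`impact`, `required_action`) are illustrative:

```python
# Render validated findings into the PR comment format shown above.
# Field names ("impact", "required_action") are illustrative assumptions.
def render_comment(findings: list[dict]) -> str:
    lines = [
        "### 🔎 AI Reliability Review",
        "",
        "| Category | Severity | Confidence | Impact |",
        "|----------|----------|------------|--------|",
    ]
    for f in findings:
        lines.append(
            f"| {f['category']} | {f['severity']} | {f['confidence']} | {f['impact']} |"
        )
    lines += ["", "### Required Actions"]
    for f in findings:
        lines.append(f"- [ ] {f['required_action']}")
    return "\n".join(lines)
```

The resulting string can then be posted from the workflow, for example with `gh pr comment` or the GitHub issue-comment API.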
📊 Real Operational Impact
After deploying this system:
- Review noise reduced significantly
- Reviewers focused on high-severity items first
- Rollback plans increased across PRs
- Edge-case test coverage improved
- Incident retros showed fewer “missing test / missing rollback” root causes
AI didn’t reduce incidents.
Structured enforcement did.
AI simply enforced discipline consistently.
🔒 Guardrails That Made It Production-Safe
We added strict constraints:
- AI cannot block merge directly
- High severity items require human acknowledgment
- Low confidence findings are labeled informational
- The model cannot auto-edit code
- Outputs must match strict JSON schema
If schema validation fails → comment not posted.
This prevents hallucination-driven chaos.
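A minimal stdlib-only sketch of that gate. A real pipeline might use a full JSON Schema validator instead; the field names are the ones used throughout this post:

```python
# Schema gate: a finding is posted only if every field validates.
REQUIRED_FIELDS = {"category", "severity", "confidence", "impact", "required_action"}
ALLOWED_VALUES = {
    "category": {"reliability", "security", "testing", "operability", "performance"},
    "severity": {"high", "medium", "low"},
    "confidence": {"certain", "likely", "uncertain"},
}

def should_post(finding: dict) -> bool:
    """Reject any finding with missing fields or out-of-vocabulary values."""
    if not REQUIRED_FIELDS <= finding.keys():
        return False
    return all(finding[field] in allowed for field, allowed in ALLOWED_VALUES.items())
```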
📦 Sample Rubric Configuration
```json
{
  "rules": [
    {
      "name": "Rollback Plan Missing",
      "category": "reliability",
      "severity": "high",
      "confidence_threshold": "likely",
      "required_output": "Explicit rollback steps documented"
    },
    {
      "name": "Edge Case Test Missing",
      "category": "testing",
      "severity": "medium",
      "confidence_threshold": "certain",
      "required_output": "Add test covering null and boundary inputs"
    }
  ]
}
```
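One way to apply the `confidence_threshold` field is to rank confidence levels and require the finding to meet or exceed the rule's threshold. The ordering `uncertain < likely < certain` is an assumption here, since the post doesn't define one explicitly:

```python
# Ordered confidence scale for applying each rule's confidence_threshold.
# The ordering uncertain < likely < certain is an assumption.
CONFIDENCE_RANK = {"uncertain": 0, "likely": 1, "certain": 2}

def meets_threshold(finding_confidence: str, rule_threshold: str) -> bool:
    """A rule fires only if the finding is at least as confident as required."""
    return CONFIDENCE_RANK[finding_confidence] >= CONFIDENCE_RANK[rule_threshold]

# "Rollback Plan Missing" (threshold: likely) fires on likely-or-better findings.
print(meets_threshold("certain", "likely"))    # True
print(meets_threshold("uncertain", "likely"))  # False
```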
🧭 Why This Works (Engineering Psychology)
Developers ignore vague feedback.
They respond to:
- Severity
- Production impact
- Explicit required actions
By aligning AI output with how SRE teams think during incidents, we shifted code review from “opinion discussion” to “risk mitigation workflow.”
🚀 Final Takeaway
AI in CI is not about automation.
It is about structured risk visibility at scale.
If you:
- Force classification
- Enforce severity
- Include confidence
- Require action
- Validate schema
You transform AI from a novelty into a reliability multiplier.
The difference isn’t the model.
It’s the framework around it.
What reliability rule would you add to this rubric to prevent your most painful incident?
Drop it below. I’ll expand this framework in a follow-up post.
Top comments (4)
The key insight here is powerful: AI shouldn’t review code quality — it should surface operational risk.
The schema approach (category + severity + confidence + action) is exactly how incident postmortems are structured in SRE teams.
I’d also add a rule for “silent failure paths” — cases where errors are swallowed without alerts, logs, or metrics.
Those are responsible for a surprising number of production incidents.
Really appreciate this insight; completely agree.
That shift from “code quality” → “operational risk” is exactly what made the system useful in practice. Traditional AI reviews tend to over-focus on style and refactoring, but incidents are rarely caused by those.
“Silent failure paths” is a great callout; we’ve seen similar patterns where missing logs/metrics or swallowed exceptions turn out to be the real root cause in production.
I’m actually thinking of extending the rubric to explicitly detect:
• missing observability (logs/metrics/traces)
• retry/timeout gaps
• unhandled edge-case paths
Would love to hear how you’ve approached detecting these in your setup.
The JSON schema gate is the key insight here. We went a similar direction with structured audit output in our own shell toolchain — the hard part is getting the context enrichment right without sending raw kubectl dumps to the LLM.
Great point — the JSON schema gate was a turning point for us as well.
Without structured output, the signal-to-noise ratio was too low to act on. The schema forced the AI to think in terms of risk, severity, and actionability instead of generic suggestions.
Totally agree on context enrichment; that’s the hardest part. We avoided sending raw kubectl dumps and instead passed:
• summarized service context
• recent deployment diffs
• relevant config snippets
This helped keep the signal high without overwhelming the model.
Curious: how are you balancing context depth against token limits in your pipeline?