Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)
A production-grade GitHub Actions workflow + an SRE reliability rubric that transforms AI from a code suggester into a structured risk detection system.
We tried AI code review in CI.
It was fast.
It was confident.
It was mostly noise.
It praised trivial refactors.
It nitpicked formatting.
It occasionally hallucinated “critical issues.”
And it did absolutely nothing to reduce production incidents.
The mistake wasn’t using AI.
The mistake was asking AI to “review code.”
In reliability engineering, we don’t ask:
“Is this code good?”
We ask:
- What is the blast radius?
- What is the rollback plan?
- What happens under failure?
- What is the operational risk?
So we rebuilt our AI reviewer using SRE principles.
This is the exact system.
🚨 Why Most AI Code Review Systems Fail
Most implementations:
- Run an LLM over a PR diff
- Ask for general feedback
- Post suggestions as a comment
The result?
Unstructured opinions.
But production incidents are rarely caused by style issues.
They’re caused by:
- Missing rollback strategy
- Untested edge cases
- Configuration drift
- Silent failure paths
- Inconsistent validation
- Operational blind spots
If your AI doesn’t classify risk, it cannot reduce incidents.
🧠 The Shift: From “Suggestions” to “Structured Risk Classification”
We introduced a mandatory review schema.
AI must output:
- Category
- Severity
- Confidence
- Production Impact
- Required Action
If a finding cannot be classified within this structure, it doesn't get posted.
This immediately cut noise by roughly 60%, because vague suggestions were eliminated before they ever reached the PR.
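A minimal sketch of that mandatory schema as a validation gate. The field names mirror the list above; the allowed value sets are assumptions you would tune to your own rubric:

```python
from dataclasses import dataclass

# Allowed values for the mandatory fields; anything outside these sets is dropped.
CATEGORIES = {"reliability", "security", "testing", "operability", "performance"}
SEVERITIES = {"high", "medium", "low"}
CONFIDENCES = {"certain", "likely", "uncertain"}

@dataclass
class Finding:
    category: str
    severity: str
    confidence: str
    production_impact: str
    required_action: str

    def is_valid(self) -> bool:
        """A finding is posted only if every field is classified."""
        return (
            self.category in CATEGORIES
            and self.severity in SEVERITIES
            and self.confidence in CONFIDENCES
            and bool(self.production_impact.strip())
            and bool(self.required_action.strip())
        )
```

Anything the model cannot fit into this shape simply never becomes a comment.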
📋 The Reliability Review Rubric
This is the foundation.
| Category | Severity | Confidence | Required Output | Production Lens |
|---|---|---|---|---|
| Reliability | High | Certain | Rollback plan | Data loss? Downtime? |
| Security | High | Likely | Validation proof | External input risk? |
| Testing | Medium | Certain | Missing tests | Edge-case exposure? |
| Operability | Medium | Likely | Logging/metrics | Debuggability risk? |
| Performance | Medium | Uncertain | Benchmark proof | Latency spike risk? |
Key Rule:
The AI is not allowed to:
- Approve code
- Suggest stylistic improvements unless they impact reliability
- Comment without severity classification
This converts AI from opinion engine → reliability signal engine.
⚙️ GitHub Actions Architecture
We designed the workflow in 4 stages:
- Diff Extraction
- Context Enrichment
- AI Risk Classification
- Structured PR Feedback
🛠️ Step 1: Extract the True PR Surface Area
We only feed:
- Changed files
- Unified diff
- File ownership context
- Environment metadata (service type, criticality level)
```yaml
name: AI Reliability Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  reliability-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so the merge-base diff against main works

      - name: Generate PR Diff
        run: git diff origin/main...HEAD > pr.diff

      - name: Collect Metadata
        run: |
          echo "service=payments-api" >> context.txt
          echo "tier=critical" >> context.txt
```
Why this matters:
Context drastically improves classification accuracy.
A migration change in a critical payments service ≠ UI change in a dashboard.
🤖 Step 2: AI Classification Layer
Instead of prompting:
“Review this code.”
We prompt:
“Classify each risk under this schema. If uncertain, mark confidence as uncertain. Do not speculate.”
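A sketch of how that prompt can be assembled from the diff and metadata collected in Step 1. The function name and exact wording are illustrative, not a tuned prompt; the model call itself depends on your provider and is omitted:

```python
import json

# The mandatory schema fields the model must emit for every finding.
SCHEMA_FIELDS = ["category", "severity", "confidence", "production_impact", "required_action"]

def build_prompt(diff: str, context: dict) -> str:
    """Assemble the risk-classification prompt from the PR diff and service metadata."""
    return "\n".join([
        "You are a reliability reviewer. Classify each risk in the diff below.",
        f"Output a JSON array of findings with exactly these fields: {', '.join(SCHEMA_FIELDS)}.",
        "If uncertain, set confidence to 'uncertain'. Do not speculate. Do not praise.",
        f"Service context: {json.dumps(context)}",
        "--- DIFF ---",
        diff,
    ])
```

Feeding the service tier into the prompt is what lets the model weigh a payments migration differently from a dashboard tweak.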
Example expected output:
AI Risk Report
| Category | Severity | Confidence | Finding |
|---|---|---|---|
| Reliability | High | Certain | DB migration lacks rollback path |
| Testing | Medium | Likely | No null-input test coverage |
| Operability | Medium | Certain | No structured error logging |
No essays.
No praise.
Just structured risk.
💬 Step 3: Structured PR Comment Output
We auto-generate:
```markdown
### 🔎 AI Reliability Review

| Category | Severity | Confidence | Impact |
|----------|----------|------------|--------|
| Reliability | High | Certain | Possible migration failure without rollback |
| Testing | Medium | Likely | Edge case failure under null input |

### Required Actions

- [ ] Document rollback strategy
- [ ] Add null-input test
- [ ] Add structured logging for error path
```
Now the human reviewer can triage immediately.
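Rendering that comment from validated findings is deliberately dumb string assembly. A sketch, assuming findings are the plain dicts produced by the classification layer:

```python
def render_comment(findings: list) -> str:
    """Render validated findings as the PR comment body: markdown table + checklist."""
    lines = [
        "### 🔎 AI Reliability Review",
        "| Category | Severity | Confidence | Impact |",
        "|----------|----------|------------|--------|",
    ]
    for f in findings:
        lines.append(
            f"| {f['category']} | {f['severity']} | {f['confidence']} | {f['production_impact']} |"
        )
    lines.append("### Required Actions")
    for f in findings:
        lines.append(f"- [ ] {f['required_action']}")
    return "\n".join(lines)
```

Because the model only supplies classified fields, the comment layout is fixed and can never degrade into an essay.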
📊 Real Operational Impact
After deploying this system:
- Review noise reduced significantly
- Reviewers focused on high-severity items first
- Rollback plans increased across PRs
- Edge-case test coverage improved
- Incident retros showed fewer "missing test / missing rollback" root causes
AI didn’t reduce incidents.
Structured enforcement did.
AI simply enforced discipline consistently.
🔒 Guardrails That Made It Production-Safe
We added strict constraints:
- AI cannot block merge directly
- High severity items require human acknowledgment
- Low confidence findings are labeled informational
- The model cannot auto-edit code
- Outputs must match strict JSON schema
If schema validation fails → comment not posted.
This prevents hallucination-driven chaos.
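The "no valid schema, no comment" gate can be a strict parse of the raw model output. A sketch using only the standard library; any deviation (prose preamble, missing keys, extra keys) rejects the whole response:

```python
import json

# Exact key set every finding object must carry; no more, no less.
REQUIRED_KEYS = {"category", "severity", "confidence", "production_impact", "required_action"}

def parse_model_output(raw: str):
    """Return the findings list only if raw is a JSON array of well-formed
    finding objects; otherwise return None and post nothing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            return None
    return data
```

Rejecting the entire response on any malformed item is intentional: a partially hallucinated report is worse than no report.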
📦 Sample Rubric Configuration
```json
{
  "rules": [
    {
      "name": "Rollback Plan Missing",
      "category": "reliability",
      "severity": "high",
      "confidence_threshold": "likely",
      "required_output": "Explicit rollback steps documented"
    },
    {
      "name": "Edge Case Test Missing",
      "category": "testing",
      "severity": "medium",
      "confidence_threshold": "certain",
      "required_output": "Add test covering null and boundary inputs"
    }
  ]
}
```
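Applying the `confidence_threshold` from this config is a simple rank comparison. A sketch, assuming findings and rules are plain dicts shaped like the JSON above:

```python
# Ordering of confidence levels, lowest to highest.
CONFIDENCE_RANK = {"uncertain": 0, "likely": 1, "certain": 2}

def apply_rules(findings: list, rules: list) -> list:
    """Keep only findings whose confidence meets the matching rule's threshold.
    Categories with no rule are handled upstream (e.g. labeled informational)."""
    thresholds = {r["category"]: r["confidence_threshold"] for r in rules}
    kept = []
    for f in findings:
        threshold = thresholds.get(f["category"])
        if threshold is not None and (
            CONFIDENCE_RANK[f["confidence"]] >= CONFIDENCE_RANK[threshold]
        ):
            kept.append(f)
    return kept
```

This is where the rubric bites: a "likely" testing finding never fires a rule that demands "certain".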
🧭 Why This Works (Engineering Psychology)
Developers ignore vague feedback.
They respond to:
- Severity
- Production impact
- Explicit required actions
By aligning AI output with how SRE teams think during incidents, we shifted code review from “opinion discussion” to “risk mitigation workflow.”
🚀 Final Takeaway
AI in CI is not about automation.
It is about structured risk visibility at scale.
If you:
- Force classification
- Enforce severity
- Include confidence
- Require action
- Validate schema
You transform AI from a novelty into a reliability multiplier.
The difference isn’t the model.
It’s the framework around it.
What reliability rule would you add to this rubric to prevent your most painful incident?
Drop it below. I’ll expand this framework in a follow-up post.