Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)
A production-grade GitHub Actions workflow plus an SRE reliability rubric that turns AI from a code suggester into a structured risk-detection system.
We tried AI code review in CI.
It was fast.
It was confident.
It was mostly noise.
It praised trivial refactors.
It nitpicked formatting.
It occasionally hallucinated “critical issues.”
And it did absolutely nothing to reduce production incidents.
The mistake wasn’t using AI.
The mistake was asking AI to “review code.”
In reliability engineering, we don’t ask:
“Is this code good?”
We ask:
- What is the blast radius?
- What is the rollback plan?
- What happens under failure?
- What is the operational risk?
So we rebuilt our AI reviewer using SRE principles.
This is the exact system.
🚨 Why Most AI Code Review Systems Fail
Most implementations:
- Run an LLM over the PR diff
- Ask for general feedback
- Post suggestions as a comment
The result?
Unstructured opinions.
But production incidents are rarely caused by style issues.
They’re caused by:
- Missing rollback strategy
- Untested edge cases
- Configuration drift
- Silent failure paths
- Inconsistent validation
- Operational blind spots
If your AI doesn’t classify risk, it cannot reduce incidents.
🧠 The Shift: From “Suggestions” to “Structured Risk Classification”
We introduced a mandatory review schema.
AI must output:
- Category
- Severity
- Confidence
- Production Impact
- Required Action
If it cannot classify a finding within this structure, it doesn't get posted.
This immediately cut review noise by roughly 60%, because vague, unclassifiable suggestions were filtered out before they ever reached the PR.
📋 The Reliability Review Rubric
This is the foundation.
| Category | Severity | Confidence | Required Output | Production Lens |
|---|---|---|---|---|
| Reliability | High | Certain | Rollback plan | Data loss? Downtime? |
| Security | High | Likely | Validation proof | External input risk? |
| Testing | Medium | Certain | Missing tests | Edge-case exposure? |
| Operability | Medium | Likely | Logging/metrics | Debuggability risk? |
| Performance | Medium | Uncertain | Benchmark proof | Latency spike risk? |
Key Rule:
The AI is not allowed to:
- Approve code
- Suggest stylistic improvements unless they impact reliability
- Comment without severity classification
This converts AI from opinion engine → reliability signal engine.
⚙️ GitHub Actions Architecture
We designed the workflow in 4 stages:
- Diff Extraction
- Context Enrichment
- AI Risk Classification
- Structured PR Feedback
🛠️ Step 1: Extract the True PR Surface Area
We only feed:
- Changed files
- Unified diff
- File ownership context
- Environment metadata (service type, criticality level)
```yaml
name: AI Reliability Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  reliability-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          # Full history so origin/main is available for the three-dot diff
          fetch-depth: 0

      - name: Generate PR Diff
        run: git diff origin/main...HEAD > pr.diff

      - name: Collect Metadata
        run: |
          echo "service=payments-api" >> context.txt
          echo "tier=critical" >> context.txt
```
Why this matters:
Context drastically improves classification accuracy.
A migration change in a critical payments service ≠ UI change in a dashboard.
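To make that concrete, here is a hypothetical severity-weighting step. The tier names and the one-level escalation rule are illustrative assumptions, not part of the workflow above; the point is that the same finding should rank differently depending on service criticality:

```python
# Hypothetical sketch: escalate finding severity for critical-tier services.
# Tier names and the one-level bump are illustrative assumptions.
SEVERITY_ORDER = ["low", "medium", "high"]
TIER_ESCALATION = {"critical": 1, "standard": 0}

def effective_severity(base: str, tier: str) -> str:
    """Return the severity after applying the service-tier escalation."""
    idx = SEVERITY_ORDER.index(base) + TIER_ESCALATION.get(tier, 0)
    return SEVERITY_ORDER[min(idx, len(SEVERITY_ORDER) - 1)]

# The same migration finding is "medium" in a dashboard, "high" in payments.
print(effective_severity("medium", "standard"))  # medium
print(effective_severity("medium", "critical"))  # high
```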
🤖 Step 2: AI Classification Layer
Instead of prompting:
“Review this code.”
We prompt:
“Classify each risk under this schema. If uncertain, mark confidence as uncertain. Do not speculate.”
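A minimal sketch of how such a prompt could be assembled from the diff and the metadata collected in Step 1. The template wording here is an assumption; only the schema fields come from the rubric above:

```python
# Sketch of a schema-constrained classification prompt.
# The template wording is an assumption; the fields mirror the rubric.
PROMPT_TEMPLATE = """\
You are reviewing a change to the {service} service (tier: {tier}).

Classify each risk in the diff using this schema:
- Category: reliability | security | testing | operability | performance
- Severity: high | medium | low
- Confidence: certain | likely | uncertain
- Production Impact: one sentence
- Required Action: one concrete step

If uncertain, mark confidence as uncertain. Do not speculate.
Do not comment on style unless it affects reliability.

Diff:
{diff}
"""

def build_prompt(service: str, tier: str, diff: str) -> str:
    """Fill the template with the metadata and diff gathered in Step 1."""
    return PROMPT_TEMPLATE.format(service=service, tier=tier, diff=diff)
```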
Example expected output:
AI Risk Report
| Category | Severity | Confidence | Finding |
|---|---|---|---|
| Reliability | High | Certain | DB migration lacks rollback path |
| Testing | Medium | Likely | No null-input test coverage |
| Operability | Medium | Certain | No structured error logging |
No essays.
No praise.
Just structured risk.
💬 Step 3: Structured PR Comment Output
We auto-generate:
### 🔎 AI Reliability Review
| Category | Severity | Confidence | Impact |
|----------|----------|------------|--------|
| Reliability | High | Certain | Possible migration failure without rollback |
| Testing | Medium | Likely | Edge case failure under null input |
### Required Actions
- [ ] Document rollback strategy
- [ ] Add null-input test
- [ ] Add structured logging for error path
Now the human reviewer can triage immediately.
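A sketch of the comment generator, assuming the model's findings have already passed schema validation. The field names (`impact`, `required_action`) are illustrative:

```python
# Render validated findings into the PR comment format shown above.
# Field names ("impact", "required_action") are illustrative assumptions.
def render_comment(findings: list[dict]) -> str:
    lines = [
        "### 🔎 AI Reliability Review",
        "",
        "| Category | Severity | Confidence | Impact |",
        "|----------|----------|------------|--------|",
    ]
    for f in findings:
        lines.append(
            f"| {f['category']} | {f['severity']} | {f['confidence']} | {f['impact']} |"
        )
    lines += ["", "### Required Actions"]
    for f in findings:
        lines.append(f"- [ ] {f['required_action']}")
    return "\n".join(lines)
```

The resulting string can then be posted from the workflow, for example with `gh pr comment` or the GitHub issue-comment API.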
📊 Real Operational Impact
After deploying this system:
- Review noise reduced significantly
- Reviewers focused on high-severity items first
- Rollback plans increased across PRs
- Edge-case test coverage improved
- Incident retros showed fewer “missing test / missing rollback” root causes
AI didn’t reduce incidents.
Structured enforcement did.
AI simply enforced discipline consistently.
🔒 Guardrails That Made It Production-Safe
We added strict constraints:
- AI cannot block merge directly
- High severity items require human acknowledgment
- Low confidence findings are labeled informational
- The model cannot auto-edit code
- Outputs must match strict JSON schema
If schema validation fails → comment not posted.
This prevents hallucination-driven chaos.
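A minimal stdlib-only sketch of that gate. A real pipeline might use a full JSON Schema validator instead; the field names are the ones used throughout this post:

```python
# Schema gate: a finding is posted only if every field validates.
REQUIRED_FIELDS = {"category", "severity", "confidence", "impact", "required_action"}
ALLOWED_VALUES = {
    "category": {"reliability", "security", "testing", "operability", "performance"},
    "severity": {"high", "medium", "low"},
    "confidence": {"certain", "likely", "uncertain"},
}

def should_post(finding: dict) -> bool:
    """Reject any finding with missing fields or out-of-vocabulary values."""
    if not REQUIRED_FIELDS <= finding.keys():
        return False
    return all(finding[field] in allowed for field, allowed in ALLOWED_VALUES.items())
```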
📦 Sample Rubric Configuration
```json
{
  "rules": [
    {
      "name": "Rollback Plan Missing",
      "category": "reliability",
      "severity": "high",
      "confidence_threshold": "likely",
      "required_output": "Explicit rollback steps documented"
    },
    {
      "name": "Edge Case Test Missing",
      "category": "testing",
      "severity": "medium",
      "confidence_threshold": "certain",
      "required_output": "Add test covering null and boundary inputs"
    }
  ]
}
```
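One way to apply the `confidence_threshold` field is to rank confidence levels and require the finding to meet or exceed the rule's threshold. The ordering `uncertain < likely < certain` is an assumption here, since the post doesn't define one explicitly:

```python
# Ordered confidence scale for applying each rule's confidence_threshold.
# The ordering uncertain < likely < certain is an assumption.
CONFIDENCE_RANK = {"uncertain": 0, "likely": 1, "certain": 2}

def meets_threshold(finding_confidence: str, rule_threshold: str) -> bool:
    """A rule fires only if the finding is at least as confident as required."""
    return CONFIDENCE_RANK[finding_confidence] >= CONFIDENCE_RANK[rule_threshold]

# "Rollback Plan Missing" (threshold: likely) fires on likely-or-better findings.
print(meets_threshold("certain", "likely"))    # True
print(meets_threshold("uncertain", "likely"))  # False
```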
🧭 Why This Works (Engineering Psychology)
Developers ignore vague feedback.
They respond to:
- Severity
- Production impact
- Explicit required actions
By aligning AI output with how SRE teams think during incidents, we shifted code review from “opinion discussion” to “risk mitigation workflow.”
🚀 Final Takeaway
AI in CI is not about automation.
It is about structured risk visibility at scale.
If you:
- Force classification
- Enforce severity
- Include confidence
- Require action
- Validate schema
You transform AI from a novelty into a reliability multiplier.
The difference isn’t the model.
It’s the framework around it.
What reliability rule would you add to this rubric to prevent your most painful incident?
Drop it below. I’ll expand this framework in a follow-up post.
Top comments (4)
The key insight here is powerful: AI shouldn’t review code quality — it should surface operational risk.
The schema approach (category + severity + confidence + action) is exactly how incident postmortems are structured in SRE teams.
I’d also add a rule for “silent failure paths” — cases where errors are swallowed without alerts, logs, or metrics.
Those are responsible for a surprising number of production incidents.
Really appreciate this insight; completely agree.
That shift from “code quality” → “operational risk” is exactly what made the system useful in practice. Traditional AI reviews tend to over-focus on style and refactoring, but incidents are rarely caused by those.
“Silent failure paths” is a great callout; we’ve seen similar patterns where missing logs/metrics or swallowed exceptions turn out to be the real root cause in production.
I’m actually thinking of extending the rubric to explicitly detect:
• missing observability (logs/metrics/traces)
• retry/timeout gaps
• unhandled edge-case paths
Would love to hear how you’ve approached detecting these in your setup.
The JSON schema gate is the key insight here. We went a similar direction with structured audit output in our own shell toolchain — the hard part is getting the context enrichment right without sending raw kubectl dumps to the LLM.
Great point — the JSON schema gate was a turning point for us as well.
Without structured output, the signal-to-noise ratio was too low to act on. The schema forced the AI to think in terms of risk, severity, and actionability instead of generic suggestions.
Totally agree on context enrichment; that’s the hardest part. We avoided sending raw kubectl dumps and instead passed:
• summarized service context
• recent deployment diffs
• relevant config snippets
This helped keep the signal high without overwhelming the model.
Curious: how are you balancing context depth against token limits in your pipeline?