# Request for Comments: Meridian
I'm building an AI-powered code review agent for the GitLab AI Hackathon and would love feedback from practicing engineers.
## The Hypothesis
**Problem 1:** MRs get merged without fully implementing their acceptance criteria, causing requirement drift and rework.

**Problem 2:** Developers change code without understanding historical design constraints, causing regressions.

**Cost:** An estimated 20-30% of merged code needs follow-up work (based on anecdotal observation, not hard data).
## The Proposed Solution
An autonomous agent that performs two checks:

### Acceptance Criteria Validation
```yaml
Issue #123:
  criteria:
    - Export to CSV ✓
    - Export to JSON ✗
    - Include all fields ✗

MR Analysis:
  implemented: 1/3 criteria
  action: Block merge
  recommendation: Complete remaining criteria or update issue scope
```
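The criteria-extraction step above could be sketched as follows. This is a minimal illustration (not the actual Meridian implementation), assuming acceptance criteria live as GitLab task-list checkboxes in the issue description; the LLM-backed matching of criteria against the MR diff is out of scope here.

```python
import re

# Matches GitLab task-list items: "- [ ] ..." (open) or "- [x] ..." (done).
TASK_RE = re.compile(r"^\s*[-*]\s*\[([ xX])\]\s*(.+)$", re.MULTILINE)

def parse_criteria(issue_description: str) -> list[dict]:
    """Extract each acceptance criterion and whether it is checked off."""
    return [
        {"text": text.strip(), "checked": mark.lower() == "x"}
        for mark, text in TASK_RE.findall(issue_description)
    ]

def coverage(criteria: list[dict]) -> str:
    """Summarize implemented vs. total criteria, e.g. '1/3 criteria'."""
    done = sum(c["checked"] for c in criteria)
    return f"{done}/{len(criteria)} criteria"
```

In the real agent, "checked" would come from semantic analysis of the MR diff rather than the checkbox state, but the same coverage summary drives the block/allow decision.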
### Historical Context Surfacing
```yaml
File: auth_flow.py
Lines changed: 45-67

Historical Context:
  original_mr: "#89 (8 months ago)"
  design_decision: "SSO requires token refresh every 30s"
  edge_case: "Enterprise customers need persistent sessions"
  warning: "Your changes remove refresh logic. SSO may break."
```
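The lookup behind this could be sketched as an overlap query against a prebuilt index. The sketch below is a hypothetical illustration: it assumes an offline job (e.g. walking `git log` and the GitLab MR API) has already produced a history index of which line ranges each past MR touched, with a note summarizing the design decision.

```python
# Hypothetical history index: each entry records a past MR, the line range
# it touched in this file, and a one-line summary of its design decision.
history = [
    {"mr": "#89", "lines": (40, 70), "note": "SSO requires token refresh every 30s"},
    {"mr": "#54", "lines": (100, 120), "note": "Legacy LDAP fallback path"},
]

def overlapping_mrs(changed: tuple[int, int], history: list[dict]) -> list[dict]:
    """Return past MRs whose touched line range overlaps the changed range."""
    lo, hi = changed
    return [h for h in history if h["lines"][0] <= hi and lo <= h["lines"][1]]

# A diff touching lines 45-67 should surface MR #89's token-refresh decision.
hits = overlapping_mrs((45, 67), history)
```

Line numbers shift over time, so a production version would track ranges through renames and refactors (e.g. via `git log -L` or blame), not a static index like this.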
## Technical Approach
- LLM: Anthropic Claude 3.5 Sonnet (semantic understanding)
- Platform: GitLab Duo Agent Platform
- Architecture: Event-driven (webhooks → async analysis → automated comments)
- Stack: Python, FastAPI, PostgreSQL, Redis
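The event-driven flow (webhook in, async analysis, comment out) could be sketched with a queue and a worker. This is a stdlib-only toy, not the FastAPI service itself: the `analyze` stub stands in for the LLM call, and appending to a list stands in for posting an MR comment via the GitLab API.

```python
import asyncio

async def analyze(payload: dict) -> str:
    """Stub for the LLM-backed analysis step (hypothetical output format)."""
    return f"Analyzed MR !{payload['mr_iid']}: 1/3 criteria implemented"

async def worker(queue: asyncio.Queue, comments: list[str]) -> None:
    """Drain webhook payloads and turn each analysis into an MR comment."""
    while True:
        payload = await queue.get()
        comments.append(await analyze(payload))
        queue.task_done()

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    comments: list[str] = []
    task = asyncio.create_task(worker(queue, comments))
    await queue.put({"mr_iid": 42})   # simulate an incoming merge-request webhook
    await queue.join()                # wait for the async analysis to finish
    task.cancel()
    return comments

comments = asyncio.run(main())
```

In the real service, the FastAPI webhook endpoint would enqueue and return 200 immediately, so slow LLM calls never block GitLab's webhook delivery.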
## Questions for You
### 1. Problem Validation
Does this problem exist in your team?
- [ ] Yes, constantly
- [ ] Yes, occasionally
- [ ] Rarely
- [ ] No, not a problem
### 2. Solution Validation
Would automated blocking help or create friction?
Scenarios:
- MR implements 3/5 criteria → Agent blocks merge
- Dev changes old code → Agent warns about design constraint
- Both scenarios happen
Your reaction:
- [ ] This would save us hours
- [ ] This would be annoying
- [ ] Depends on accuracy
### 3. Workflow Fit
Do you document acceptance criteria in a parseable format?
- [ ] Yes (checkboxes, bullet points in issues)
- [ ] Partially (sometimes)
- [ ] No (verbal/Slack/tribal knowledge)
### 4. Alternative Solutions
What have you tried?
- PR templates with checklists?
- Manual gating process?
- Code ownership + tribal knowledge?
- Nothing?
### 5. False Positive Tolerance
How accurate would this need to be?
- 50% accurate → Would you use it?
- 70% accurate → Would you use it?
- 90% accurate → Would you use it?
- 100% accurate or nothing?
## Why This Matters
I'm building this for the GitLab AI Hackathon (45-day timeline). I'm targeting the $10K prize, but more importantly:
- Learning distributed systems
- Leveling up engineering practices
- Building something people actually want
I'd rather pivot now than build something useless.
## How to Give Feedback
Comment below with:
- Your role (engineer/lead/manager)
- Team size
- Answers to questions above
- Any other thoughts
Thanks for your time! 🙏
---

## Top comments
This is a really interesting approach to the "acceptance criteria drift" problem. I've seen this exact issue in multiple teams — the PR looks good, tests pass, code is clean, but it only implements 60% of what was actually requested.
A few thoughts from the trenches:
1. The blocking approach needs nuance
I'd suggest a "confidence score" rather than a binary block. 90%+ confidence = block; 70-89% = warning with required acknowledgment; below 70% = just a comment. This gives teams a dial to tune based on their false positive tolerance.
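The tiered policy described above could be sketched as a tiny function with per-team thresholds. This is a hypothetical illustration of the commenter's suggestion, not part of the proposed design; the threshold values are the ones from the comment and would be tunable.

```python
def policy(confidence: float, block_at: float = 0.90, warn_at: float = 0.70) -> str:
    """Map the agent's confidence that criteria are unmet to an action.

    >= block_at -> hard block; >= warn_at -> warning requiring acknowledgment;
    otherwise just an informational comment.
    """
    if confidence >= block_at:
        return "block"
    if confidence >= warn_at:
        return "warn"     # requires explicit acknowledgment before merge
    return "comment"      # informational only
```

Exposing `block_at` and `warn_at` as project-level settings is the "dial" the commenter describes: teams with low false-positive tolerance raise the block threshold instead of disabling the agent entirely.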
2. Historical context is the real killer feature
The acceptance criteria validation is useful, but the historical design constraint surfacing is where the 10x value lives. "You just broke the enterprise SSO flow that was carefully architected 8 months ago" — that's the kind of thing that saves days of debugging and customer escalations.
3. Consider the "why" not just the "what"
If you can capture the rationale behind design decisions (not just the decisions themselves), you'd have something truly powerful. "Don't just tell me we need token refresh every 30s — tell me why (enterprise security policy #47) so I know if my workaround is acceptable."
Team context: Former tech lead, 12-person team, GitLab self-hosted. We absolutely have this problem — probably 15-20% of merged MRs need follow-up work for missed edge cases.
Would definitely try this at 70%+ accuracy. Great RFC, good luck with the hackathon!
This feedback is incredible. You just validated the entire hypothesis AND gave me a roadmap for making this production-worthy. Thank you!