Bridge ACE

Posted on Mar 21

Evidence-Based Task Completion: Why AI Agents Should Prove Their Work


"Task complete."

Three words that cost us two days of debugging.

An agent said it fixed a bug. It did not. It changed the wrong file. The bug persisted. Three other agents built on top of the "fix." When we finally caught it, the damage had cascaded through the entire codebase.

Never again.

The Rule

In Bridge ACE, no agent can mark a task as done without evidence:

bridge_task_done(
  task_id="abc123",
  result_summary="Fixed WebSocket reconnection bug in server.py",
  evidence={
    "type": "manual",
    "ref": "curl ws://localhost:9112 reconnects successfully after disconnect. 5 test cycles, 0 failures."
  }
)

If result_summary or evidence is missing → HTTP 400. Task stays open.

What counts as evidence

Type	Example	When to use
Test output	"pytest: 22 passed, 0 failed"	Code changes
curl response	"HTTP 200, body contains expected data"	API changes
Screenshot	"/tmp/screenshot_after.png"	UI changes
Log excerpt	"No errors in last 5 minutes of server.log"	Bug fixes
Diff	"3 lines changed in server.py:450-452"	Refactors
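
As a minimal sketch of how an agent might produce the first kind of evidence in the table, the helper below runs a verification command and packages the outcome. The "type" and "ref" keys follow the bridge_task_done example above. The function name collect_evidence, the "test" type value, and the extra "passed" key are assumptions made for illustration, not part of Bridge ACE.

```python
import subprocess
import sys


def collect_evidence(command, evidence_type="test"):
    """Run a verification command and package its outcome as an evidence dict.

    Hypothetical helper: only the "type" and "ref" keys come from the article;
    the "passed" key and the default type value are assumptions.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    output_lines = (completed.stdout + completed.stderr).strip().splitlines()
    # Keep only the final line (test runners usually print a summary there).
    summary = output_lines[-1] if output_lines else "(no output)"
    return {
        "type": evidence_type,
        "ref": f"{' '.join(command)} -> exit {completed.returncode}: {summary}",
        "passed": completed.returncode == 0,
    }


if __name__ == "__main__":
    # Stand-in for a real test run such as ["pytest", "-q"].
    fake_test_run = [sys.executable, "-c", "print('22 passed, 0 failed')"]
    print(collect_evidence(fake_test_run))
```

Recording the exit code next to the summary line matters: a summary alone can be copied from an earlier run, while the exit code is tied to the command that was actually executed.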

What does NOT count

"Should work" — not evidence
"I updated the file" — proves nothing about correctness
"Tests were passing before" — before is not after
"Looks good to me" — from the agent that wrote it

The review chain

For critical tasks, we enforce a review chain:

1. Agent A completes the task with evidence
2. Agent B reviews the evidence (different agent!)
3. B either confirms or rejects with specific feedback
4. If rejected → A fixes → back to step 1

This catches what self-review misses. Agents are not objective about their own work — just like humans.
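
The review chain can be sketched as a small state machine. This is an illustration under assumptions, not the Bridge ACE implementation: the class name ReviewedTask, the status strings, and the method names are all hypothetical. What it does take from the steps above is that the reviewer must differ from the author, that a rejection must carry feedback, and that a rejected task returns to the start of the chain.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewedTask:
    """Hypothetical model of one task moving through the review chain."""

    task_id: str
    author: str
    evidence: dict
    status: str = "awaiting_review"  # awaiting_review -> confirmed | rejected
    feedback: list = field(default_factory=list)

    def review(self, reviewer, approved, feedback=""):
        # Self-review is exactly what the chain is meant to prevent.
        if reviewer == self.author:
            raise ValueError("reviewer must be a different agent than the author")
        if self.status != "awaiting_review":
            raise ValueError(f"task is {self.status}, not awaiting review")
        if approved:
            self.status = "confirmed"
            return
        # A rejection without specifics gives the author nothing to fix.
        if not feedback:
            raise ValueError("a rejection needs specific feedback")
        self.feedback.append(feedback)
        self.status = "rejected"

    def resubmit(self, evidence):
        # The author fixes the problem and re-enters the chain with new evidence.
        if self.status != "rejected":
            raise ValueError("only rejected tasks can be resubmitted")
        self.evidence = evidence
        self.status = "awaiting_review"
```

A typical round trip would be review(..., approved=False, feedback=...), then resubmit(...) with fresh evidence, then a second review by an agent other than the author.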

Implementation

The evidence system is built into the Bridge ACE server:

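# Only completions that claim success (fully or partly) must carry proof.
if status in ("success", "partial"):
    # A missing or empty summary or evidence is rejected outright.
    if not result_summary or not evidence:
        return 400, {"error": "result_summary and evidence required for success/partial"}
    # Evidence must be a dict carrying both keys named in the error message.
    if not isinstance(evidence, dict) or "type" not in evidence or "ref" not in evidence:
        return 400, {"error": "evidence must have type and ref"}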

No workarounds. No overrides. The server enforces it.
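
To try the check in isolation, here is a self-contained sketch that wraps the same logic in a function. The function name validate_completion and the 200 response body are assumptions. The sketch also requires a "ref" key, since the error message asks for both type and ref. Statuses other than "success" and "partial" pass through without evidence, which is how the fragment above reads.

```python
def validate_completion(status, result_summary, evidence):
    """Return an (http_status, body) pair for a task-completion request.

    Hypothetical wrapper around the server-side check shown in this section.
    """
    if status in ("success", "partial"):
        # A missing or empty summary or evidence is rejected outright.
        if not result_summary or not evidence:
            return 400, {"error": "result_summary and evidence required for success/partial"}
        # Evidence must be a dict that carries both "type" and "ref".
        if not isinstance(evidence, dict) or "type" not in evidence or "ref" not in evidence:
            return 400, {"error": "evidence must have type and ref"}
    # Assumed success response; the real server's body is not shown in the article.
    return 200, {"ok": True}


if __name__ == "__main__":
    print(validate_completion("success", "Fixed reconnection bug", None))
    print(validate_completion(
        "success",
        "Fixed reconnection bug",
        {"type": "manual", "ref": "5 test cycles, 0 failures"},
    ))
```

Returning a status and body pair instead of raising keeps the check easy to call from any request handler and easy to test without a running server.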

Results

False completions dropped from ~30% to <5%
Cascading bugs from bad fixes: eliminated
Agent accountability: dramatically improved
Debugging time: cut in half (evidence tells you what was tested)

Try it

git clone https://github.com/Luanace-lab/bridge-ide.git
cd bridge-ide && ./start_platform.sh

Create a task. Complete it without evidence. Watch the server reject it.


Trust your agents. But make them prove it.
