DEV Community

Bridge ACE
Bridge ACE

Posted on

Evidence-Based Task Completion: Why AI Agents Should Prove Their Work

"Task complete."

Three words that cost us two days of debugging.

An agent said it fixed a bug. It did not. It changed the wrong file. The bug persisted. Three other agents built on top of the "fix." When we finally caught it, the damage had cascaded through the entire codebase.

Never again.

The Rule

In Bridge ACE, no agent can mark a task as done without evidence:

bridge_task_done(
  task_id="abc123",
  result_summary="Fixed WebSocket reconnection bug in server.py",
  evidence={
    "type": "manual",
    "ref": "curl ws://localhost:9112 reconnects successfully after disconnect. 5 test cycles, 0 failures."
  }
)
Enter fullscreen mode Exit fullscreen mode

If result_summary or evidence is missing → HTTP 400. Task stays open.

What counts as evidence

Type Example When to use
Test output "pytest: 22 passed, 0 failed" Code changes
curl response "HTTP 200, body contains expected data" API changes
Screenshot "/tmp/screenshot_after.png" UI changes
Log excerpt "No errors in last 5 minutes of server.log" Bug fixes
Diff "3 lines changed in server.py:450-452" Refactors

What does NOT count

  • "Should work" — not evidence
  • "I updated the file" — proves nothing about correctness
  • "Tests were passing before" — before is not after
  • "Looks good to me" — from the agent that wrote it

The review chain

For critical tasks, we enforce a review chain:

  1. Agent A completes the task with evidence
  2. Agent B reviews the evidence (different agent!)
  3. B either confirms or rejects with specific feedback
  4. If rejected → A fixes → back to step 1

This catches what self-review misses. Agents are not objective about their own work — just like humans.

Implementation

The evidence system is built into Bridge ACE server:

if status in ("success", "partial"):
    if not result_summary or not evidence:
        return 400, {"error": "evidence required for success/partial"}
    if not isinstance(evidence, dict) or "type" not in evidence:
        return 400, {"error": "evidence must have type and ref"}
Enter fullscreen mode Exit fullscreen mode

No workarounds. No overrides. The server enforces it.

Results

  • False completions dropped from ~30% to <5%
  • Cascading bugs from bad fixes: eliminated
  • Agent accountability: dramatically improved
  • Debugging time: cut in half (evidence tells you what was tested)

Try it

git clone https://github.com/Luanace-lab/bridge-ide.git
cd bridge-ide && ./start_platform.sh
Enter fullscreen mode Exit fullscreen mode

Create a task. Complete it without evidence. Watch the server reject it.

GitHub


Trust your agents. But make them prove it.

Top comments (0)