"Task complete."
Three words that cost us two days of debugging.
An agent said it fixed a bug. It did not. It changed the wrong file. The bug persisted. Three other agents built on top of the "fix." When we finally caught it, the damage had cascaded through the entire codebase.
Never again.
The Rule
In Bridge ACE, no agent can mark a task as done without evidence:
bridge_task_done(
task_id="abc123",
result_summary="Fixed WebSocket reconnection bug in server.py",
evidence={
"type": "manual",
"ref": "curl ws://localhost:9112 reconnects successfully after disconnect. 5 test cycles, 0 failures."
}
)
If result_summary or evidence is missing → HTTP 400. Task stays open.
What counts as evidence
| Type | Example | When to use |
|---|---|---|
| Test output | "pytest: 22 passed, 0 failed" | Code changes |
| curl response | "HTTP 200, body contains expected data" | API changes |
| Screenshot | "/tmp/screenshot_after.png" | UI changes |
| Log excerpt | "No errors in last 5 minutes of server.log" | Bug fixes |
| Diff | "3 lines changed in server.py:450-452" | Refactors |
What does NOT count
- "Should work" — not evidence
- "I updated the file" — proves nothing about correctness
- "Tests were passing before" — before is not after
- "Looks good to me" — from the agent that wrote it
The review chain
For critical tasks, we enforce a review chain:
- Agent A completes the task with evidence
- Agent B reviews the evidence (different agent!)
- B either confirms or rejects with specific feedback
- If rejected → A fixes → back to step 1
This catches what self-review misses. Agents are not objective about their own work — just like humans.
Implementation
The evidence system is built into Bridge ACE server:
if status in ("success", "partial"):
if not result_summary or not evidence:
return 400, {"error": "evidence required for success/partial"}
if not isinstance(evidence, dict) or "type" not in evidence:
return 400, {"error": "evidence must have type and ref"}
No workarounds. No overrides. The server enforces it.
Results
- False completions dropped from ~30% to <5%
- Cascading bugs from bad fixes: eliminated
- Agent accountability: dramatically improved
- Debugging time: cut in half (evidence tells you what was tested)
Try it
git clone https://github.com/Luanace-lab/bridge-ide.git
cd bridge-ide && ./start_platform.sh
Create a task. Complete it without evidence. Watch the server reject it.
Trust your agents. But make them prove it.
Top comments (0)