
Xu Bian

Originally published at marlinbian-site.pages.dev

Evidence Contract: AI Delivery Must Come With Proof

The most dangerous sentence in AI delivery is: "It is done."

That sentence is not evidence. AI can write confidently. A summary can look complete. A PR description can be polished. None of that proves the work is actually complete.

A project-specific AI delivery pipeline should redefine "done" as an evidence question: what reviewable proof supports each acceptance criterion?

That is the evidence contract.

Tests matter, but they are not everything

Tests are one of the most important forms of evidence. They are not the only form.

A backend function fix may be covered by unit and integration tests. A frontend interaction change may also need screenshots or a recording. A data-link fix may need API output, logs, read-only SQL, or queue observation. A change to a SketchUp modeling tool may need a design model diff, a bridge trace, a top-view screenshot, and a live bridge smoke test.

The question is not only "did tests run?" The question is "what evidence does this delivery require?"

Evidence must map to acceptance criteria

Many projects enforce evidence by changed file type. If frontend files changed, screenshots are required. If service or database code changed, data proof is required.

That is already much better than no evidence. But the stronger version maps evidence to acceptance criteria.
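
As a sketch, the file-type version can be expressed as a small rule table. The extensions, path patterns, and evidence names below are illustrative assumptions, not rules from any real pipeline:

```typescript
// Illustrative sketch: derive required evidence kinds from changed file paths.
// The extension-to-evidence rules here are hypothetical, not a standard.
type EvidenceKind = "tests" | "screenshots" | "data_proof";

function requiredEvidence(changedFiles: string[]): Set<EvidenceKind> {
  const required = new Set<EvidenceKind>(["tests"]); // tests are always expected
  for (const file of changedFiles) {
    if (/\.(tsx|css|vue)$/.test(file)) required.add("screenshots");
    if (/(services|db|migrations)\//.test(file)) required.add("data_proof");
  }
  return required;
}

// Example: a PR touching a component and a migration needs all three kinds.
console.log(requiredEvidence(["src/ui/Button.tsx", "db/migrations/004.sql"]));
```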

If the task has three acceptance criteria, the manifest should answer:

  • which test or screenshot proves the first one;
  • which API output or log proves the second one;
  • whether the third is uncovered, and why.

That lets reviewers decide whether the AI solved the user problem, not merely whether it ran some commands.
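
One way to make that mapping concrete is to treat it as data. A minimal sketch, assuming hypothetical field names:

```typescript
// Hypothetical shape: each acceptance criterion points at concrete evidence,
// or explicitly states why it is uncovered. Field names are illustrative.
interface CriterionEvidence {
  criterion: string;
  evidence: string[];        // test paths, screenshots, logs, API output
  uncoveredReason?: string;  // required when evidence is empty
}

const criteria: CriterionEvidence[] = [
  { criterion: "Login succeeds with valid credentials",
    evidence: ["tests/auth/login.test.ts", "screenshots/login-success.png"] },
  { criterion: "Session is persisted to Redis",
    evidence: ["logs/redis-session-write.txt"] },
  { criterion: "Rate limiting holds under burst load",
    evidence: [],
    uncoveredReason: "No load-test environment available; listed as residual risk." },
];
```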

The evidence manifest should be a file

Evidence should not live only in chat.

An evidence manifest can include:

  • task ID or PR;
  • change summary;
  • acceptance criteria;
  • evidence for each criterion;
  • test commands and results;
  • screenshot or data proof paths;
  • checks that were not run and why;
  • residual risks;
  • generation time;
  • worker or tool version.
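
Rendered as a file, a manifest along those lines might be typed like this. The schema and field names are assumptions for illustration, not an established format:

```typescript
// Hypothetical manifest schema; in practice this could live as
// evidence-manifest.json next to the PR. Names are illustrative.
interface EvidenceManifest {
  taskId: string;                 // task ID or PR reference
  summary: string;                // change summary
  criteria: {
    criterion: string;            // one entry per acceptance criterion
    evidence: string[];           // test commands, screenshot or data paths
    uncoveredReason?: string;     // set when a criterion has no evidence
  }[];
  checksSkipped: { check: string; reason: string }[];
  residualRisks: string[];
  generatedAt: string;            // ISO timestamp
  workerVersion: string;          // worker or tool version
}
```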

The manifest does not guarantee correctness. It gives reviewers something durable to inspect.

Different projects need different evidence

Evidence contracts must be project-specific.

In systems like TidalFi, changes that touch APIs, services, databases, queues, Redis, or event flows cannot rely only on unit tests. They need data proof. Frontend flow changes need screenshots. Release-related changes need a release boundary and production verification.

In SketchUp Agent Harness, "there is a visible model" is not enough. The project needs to know where the model came from, whether the structured design model is consistent, whether the bridge trace is explainable, whether the SketchUp scene came from a clean replay, and whether visual review is backed by source evidence.

In knowledge publication, "the article was generated" is not enough. The system needs source trace, bilingual siblings, series metadata, language switching, site build validation, and clear ownership between knowledge and site.
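
One way to express that specificity is a per-project policy table. The entries below are invented examples that loosely echo the projects above, not their actual contracts:

```typescript
// Illustrative per-project evidence policies. The rule values are examples,
// not the real contracts of these projects.
type Policy = { alwaysRequired: string[]; releaseGate?: string };

const policies: Record<string, Policy> = {
  tidalfi: {
    alwaysRequired: ["unit_tests", "data_proof", "screenshots"],
    releaseGate: "production_verification",
  },
  sketchup_agent_harness: {
    alwaysRequired: ["design_model_diff", "bridge_trace", "top_view_screenshot"],
  },
  knowledge_publication: {
    alwaysRequired: ["source_trace", "bilingual_siblings", "site_build_validation"],
  },
};
```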

Without evidence, it is not done

This rule changes AI behavior.

Without an evidence gate, AI tends to declare completion in natural language. With an evidence contract, AI must collect test results, screenshots, logs, traces, and risk notes during execution.

It behaves more like an engineering worker and less like a chat assistant.
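
A minimal sketch of such a gate, assuming the hypothetical manifest format sketched earlier: a CI step that refuses to call the task done when a criterion has neither evidence nor a stated gap.

```typescript
// Hypothetical CI gate: fail the build unless every acceptance criterion
// either has evidence or carries an explicit uncovered reason.
import { readFileSync } from "node:fs";

const manifest = JSON.parse(readFileSync("evidence-manifest.json", "utf8"));

const failures: string[] = [];
for (const c of manifest.criteria ?? []) {
  if ((c.evidence ?? []).length === 0 && !c.uncoveredReason) {
    failures.push(`No evidence and no stated gap for: ${c.criterion}`);
  }
}

if (failures.length > 0) {
  console.error(failures.join("\n"));
  process.exit(1); // "done" is refused without proof
}
console.log("Evidence gate passed.");
```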

Conclusion

The completion standard for AI delivery should not be "AI believes it is done."

Done should mean that the acceptance criteria in the task contract have matching evidence, missing coverage is explicitly stated, and high-risk boundaries were not crossed silently.

That is the value of the evidence contract.


Originally published on my personal site:
https://marlinbian-site.pages.dev/writing/evidence-contract-for-ai-delivery/

More links: GitHub · YouTube · LinkedIn · Bluesky · Mastodon · Discord
