Dmytro Huz
I Built ac-trace to Check What Tests Actually Protect

AI-assisted coding is making one part of software development much faster than another.

It is now easier than ever to generate implementation code, unit tests, fixtures, mocks, and even test structure. But while output is getting faster, confidence is not automatically getting deeper. In fact, the opposite can happen: the more quickly code and tests are produced, the easier it becomes to confuse visible testing activity with real protection.

That gap is exactly why I built ac-trace, a new open-source tool.

The core problem is simple: passing tests are often a weaker signal than teams think. Coverage is not enough either. A test suite can be green, a code path can be exercised, and the intended behavior can still be only weakly defended.

What I care about is not just whether code ran, or whether assertions passed. The harder question is this:

Are the acceptance criteria actually protected?

The problem: green tests do not prove much by themselves

In many teams, these ideas get blended together:
• tests are passing
• code is covered
• therefore the requirement is safe

But those are different signals.

A passing test tells you that some expectation held in one scenario. Coverage tells you that code executed. Neither one, by itself, proves that the important business behavior is strongly defended against breakage.

This becomes more important with AI-assisted coding.

AI is good at producing plausible implementations and plausible tests very quickly. That is useful. But it also lowers the cost of producing code that looks well tested. You get more test files, more green checks, more visible structure — and sometimes only shallow confidence underneath.

A concrete example

Imagine a billing service with this acceptance criterion:

Premium users must never be charged above their contractual monthly cap.

Now imagine the code has tests for invoice creation. It has tests for premium-user billing flow. It has good coverage around the billing function. The relevant lines all execute. The pipeline is green.

Looks fine.

But now remove the cap check. Or flip the comparison. Or mutate the mapped billing logic in a way that breaks the intended behavior.

Do the tests fail?

If they do not, then the acceptance criterion was never really protected. The system had tests. The code was covered. But the thing that mattered was still weakly defended.
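To make the gap concrete, here is a minimal sketch (all names are invented for illustration) of a test that exercises the cap logic, so coverage looks good, yet survives the mutations described above:

```python
MONTHLY_CAP = 100.00

def charge_premium_user(amount: float) -> float:
    # Intended behavior: never charge above the contractual monthly cap.
    return min(amount, MONTHLY_CAP)

def test_premium_billing_flow():
    # This test executes the cap branch, so the line is "covered",
    # but it only asserts that some charge happened at all.
    charged = charge_premium_user(150.00)
    assert charged > 0  # still passes if the cap check is removed
```

Delete the `min(...)` call and `charged` becomes 150.00, which is still greater than zero, so the suite stays green while the acceptance criterion is broken.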

That is the gap I wanted to make more visible.

That is why I built ac-trace

ac-trace (Repo: https://github.com/DmytroHuzz/ac-trace) is an open-source tool that maps acceptance criteria to code and tests, then mutates the mapped code to verify whether the tests actually catch the breakage.

In plain terms:

it tries to answer whether the tests defend the behavior they are supposed to defend.

This is not just traceability for documentation. The point is not only to show links between requirements, code, and tests. The point is to test whether those links have teeth.

How it works

The current workflow is intentionally simple:
1. Define acceptance criteria
2. Map them to relevant source code and tests
3. Infer some links from annotated tests
4. Mutate the mapped implementation
5. Run the relevant tests
6. Generate a report showing what failed and what survived

So the flow is roughly:

acceptance criteria → code → tests → mutation → report
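Steps 1 and 2 live in a YAML manifest. The shape below is my own illustration of the idea, not necessarily the exact schema ac-trace uses; see the repo for the real format:

```yaml
# Hypothetical manifest shape for illustration only.
acceptance_criteria:
  - id: AC-1
    description: >
      Premium users must never be charged above their
      contractual monthly cap.
    code:
      - billing/premium.py::charge_premium_user
    tests:
      - tests/test_billing.py::test_premium_billing_flow
```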

If the mapped code is changed and the linked tests fail, that is a good sign: the behavior really is defended.

If the mapped code is changed and the linked tests still pass, that is also useful — because it shows a confidence gap that might otherwise stay hidden behind a green suite.
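The mutation step (step 4) is the part that gives the report teeth. This is not ac-trace's actual implementation, but a standalone sketch of the general technique: apply a small AST-level change to the mapped code and see whether behavior diverges in a way the tests would catch:

```python
import ast

SOURCE = "def charge(amount, cap):\n    return min(amount, cap)\n"

class DropCap(ast.NodeTransformer):
    """Mutant: replace min(amount, cap) with just amount."""
    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == "min":
            return node.args[0]
        return node

mutated = ast.fix_missing_locations(DropCap().visit(ast.parse(SOURCE)))

# Execute the original and the mutant, then compare behavior.
orig_ns, mut_ns = {}, {}
exec(SOURCE, orig_ns)
exec(compile(mutated, "<mutant>", "exec"), mut_ns)

print(orig_ns["charge"](150, 100))  # 100: cap enforced
print(mut_ns["charge"](150, 100))   # 150: mutant drops the cap
```

A weak test like `assert charge(150, 100) > 0` passes against both versions, so the mutant "survives"; a test asserting the capped value kills it. That survived/killed distinction is what the report surfaces.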


Why this matters more now

I do not think AI-assisted coding is the problem by itself.

The problem is that AI increases output faster than it increases justified confidence.

When implementation and tests both become cheap to generate, teams need better ways to distinguish between:
• code that looks tested
• code that is covered
• code whose important behavior is actually defended

Without that distinction, it becomes very easy to over-trust green pipelines.

That is the broader reason for ac-trace. I wanted something practical that pushes on this exact point.

Current scope

ac-trace is still early and intentionally narrow.

Right now it focuses on:
• Python
• pytest
• YAML manifests
• inferred links from annotated tests
• generated reports
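For the inferred links, I will not reproduce the real annotation syntax here (check the repo for that), but the idea can be sketched with an invented decorator that tags a test with the criterion it defends:

```python
MONTHLY_CAP = 100.00

def charge_premium_user(amount: float) -> float:
    # Never charge above the contractual monthly cap.
    return min(amount, MONTHLY_CAP)

def ac(criterion_id: str):
    """Hypothetical annotation: link a test to an acceptance criterion."""
    def wrap(fn):
        fn.acceptance_criterion = criterion_id
        return fn
    return wrap

@ac("AC-1")
def test_premium_cap_is_enforced():
    # Unlike a coverage-only test, this fails if the cap check is
    # removed or the comparison is flipped.
    assert charge_premium_user(150.00) == MONTHLY_CAP
    assert charge_premium_user(50.00) == 50.00
```

A tool can then collect tagged tests and run exactly the ones linked to the criterion whose code it just mutated.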

I kept the scope small on purpose. I would rather build a narrow tool around one precise question than make broad claims too early.

This is an experiment in making one software-quality problem more concrete.

Launch note

This post is also the announcement: ac-trace is now open source at https://github.com/DmytroHuzz/ac-trace.

If you work on backend systems, care about software quality, or are thinking seriously about how AI changes testing and confidence, I think this problem is worth exploring.

I built ac-trace because I kept coming back to the same thought:

Passing tests are useful, but they do not necessarily mean the acceptance criteria are protected.

I want a more direct way to inspect that gap.

Conclusion

ac-trace is my open-source attempt to make the gap between green tests and justified confidence more visible.

CTA

Check out the repo, try it on a small Python project, and tell me where the idea is useful, naive, or worth pushing further.