Alejandro Sosa

Your Test Coverage Is Lying to You

$ testreg trace auth.login

  Feature: User Login (auth.login)
  Priority: critical

route:/login                                              apps/web/src/router.tsx:142
└─ LoginPage                                              apps/web/src/pages/auth/LoginPage.tsx:13
   └─ useAuth                                             apps/web/src/hooks/useAuth.ts:19
      └─ authApi.login                                    apps/web/src/services/api/auth.ts:46
         └─ POST /api/v1/auth/login                       src/infrastructure/http/handlers/auth_handler.go:576
            └─ AuthHandler.Login                          src/infrastructure/http/handlers/auth_handler.go:249
               └─ authService.Login                       src/application/services/auth_service.go:172
                  ├─ JWTGenerator.GenerateTokenPair        src/infrastructure/auth/jwt_generator.go:70
                  │  ├─ JWTGenerator.GenerateAccessToken   src/infrastructure/auth/jwt_generator.go:97
                  │  └─ JWTGenerator.GenerateRefreshToken  src/infrastructure/auth/jwt_generator.go:123
                  ├─ authRepository.StoreRefreshToken      src/domain/repositories/auth_repository.go:329
                  ├─ repositories.HashToken                src/domain/repositories/auth_repository.go:90
                  └─ sql:GetUserByEmail                    src/domain/repositories/queries/user.sql:21

[Screenshot: testreg trace auth.login output showing the full dependency tree from the React route through Go handlers to the SQL query, with color-coded nodes and file paths]

That's one command. It traces every function your login feature touches — from the React route to the SQL query. It tells you which of those nodes have tests, which don't, and which are critical enough to block a release.

Your coverage tool tells you a percentage. This tells you where you'll get paged at 2 AM.

It's not a magic wand. The tool auto-maps roughly 95% of call edges through AST parsing, but the remaining edges need lightweight annotations — a comment above a handler, a tag on a test file. The kind of overhead that takes seconds per file. The initial setup is real. What it replaces is realer.


Here's a question most teams can't answer: which business features does your test suite actually validate?

Not which files have tests. Not which lines were executed. Which features — the things your users pay for, the things your PM tracks, the things that wake you up at night — are meaningfully covered across every layer of the stack?

I worked on a full-stack nutrition platform — 749 test files. CI was green. Coverage sat at a comfortable number. And I couldn't tell you whether the login flow was tested end-to-end or whether the meal logging feature had a single integration test that hit a real database.

The problem wasn't a lack of tests. It was a lack of visibility. The metrics we use — line coverage, branch coverage, "all tests pass" — answer the wrong question. They tell you how much code was executed during tests. They say nothing about which business outcomes those tests protect.

When I dug in, here's what I found:

  • 78% of tests used mocks. Only 22% hit real services.
  • 22 source files had zero test files — invisible behind the aggregate coverage number.
  • A load test had been "passing" for weeks. It was testing a rate limiter that was never wired into the middleware chain. The test could not fail because the feature it tested didn't exist in production.
  • A Maestro E2E test for the training session screen checked for the presence of labels (assertVisible: text: 'Total Volume') but not their values. Total Volume: 0 kg was a passing test.

Every one of these problems was invisible to standard coverage tools. Green dashboard. Broken assumptions.


Full-Stack Dependency Tracing: From React Route to SQL Query

I built testreg to answer the question coverage tools can't: for each business feature, what is the test coverage across every layer of the stack?

It works by combining three things:

1. Static analysis of the actual source code. testreg parses Go source files using go/ast and go/parser — the same packages the Go compiler uses. It doesn't instrument your code or run your tests. It reads the AST, discovers functions and methods, resolves struct field types, follows call chains through handlers → services → repositories → SQL queries. For projects that need exact cross-package type resolution, an experimental type_checking: true option enables go/types — the same type checker the Go compiler uses. It's not battle-tested yet, but the direction is clear: opt-in precision when the default heuristic isn't enough. For TypeScript, it runs a scanner using the TypeScript compiler API to trace React Router routes → components → hooks → API service calls.
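To make this concrete, here is a minimal sketch of the underlying technique — not testreg's actual code, just the stdlib go/parser plus ast.Inspect pattern it builds on, run against a toy handler/service chain:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// callEdges parses Go source and returns "Receiver.Caller -> Callee" pairs
// for every method-style call (x.Foo()) found inside each function body.
// Callees are reported by selector name only; a real tool would resolve
// the receiver's type to get the full chain.
func callEdges(src string) []string {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "app.go", src, 0)
	if err != nil {
		panic(err)
	}
	var edges []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		// Prefix the caller with its pointer-receiver type, if any.
		name := fn.Name.Name
		if fn.Recv != nil && len(fn.Recv.List) > 0 {
			if star, ok := fn.Recv.List[0].Type.(*ast.StarExpr); ok {
				if id, ok := star.X.(*ast.Ident); ok {
					name = id.Name + "." + name
				}
			}
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if call, ok := n.(*ast.CallExpr); ok {
				if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
					edges = append(edges, name+" -> "+sel.Sel.Name)
				}
			}
			return true
		})
	}
	return edges
}

func main() {
	// Toy source standing in for real application code.
	src := `package app

func (h *AuthHandler) Login() { h.service.Login() }

func (s *authService) Login() {
	s.repo.StoreRefreshToken()
	s.jwt.GenerateTokenPair()
}
`
	for _, e := range callEdges(src) {
		fmt.Println(e)
	}
	// Output:
	// AuthHandler.Login -> Login
	// authService.Login -> StoreRefreshToken
	// authService.Login -> GenerateTokenPair
}
```

No instrumentation, no test execution: the edges fall out of the parsed syntax tree alone.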

2. Dependency injection and query resolution. testreg parses DI framework registrations to resolve interface-to-concrete mappings automatically — Google Wire (wire.Bind()), Uber Fx (fx.Provide(), fx.Options()), and Dig (dig.Provide()). If you use SQLC for database queries, it maps generated Go methods back to their source SQL files. These integrations mean the graph doesn't stop at an interface boundary — it follows the real implementation.
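A rough illustration of how a static pass can recover Wire bindings — a sketch, not testreg's implementation — parsing a hypothetical wire.go and unwrapping each wire.Bind(new(Iface), new(*Impl)) call:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// wireBindings scans source for wire.Bind(new(Iface), new(*Impl)) calls
// and returns "Iface => *Impl" interface-to-concrete mappings.
func wireBindings(src string) []string {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "wire.go", src, 0)
	if err != nil {
		panic(err)
	}
	// unwrap turns a new(T) argument into the string "T".
	unwrap := func(e ast.Expr) string {
		if c, ok := e.(*ast.CallExpr); ok && len(c.Args) == 1 {
			return types.ExprString(c.Args[0])
		}
		return types.ExprString(e)
	}
	var out []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) != 2 {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "Bind" {
			return true
		}
		if pkg, ok := sel.X.(*ast.Ident); !ok || pkg.Name != "wire" {
			return true
		}
		out = append(out, unwrap(call.Args[0])+" => "+unwrap(call.Args[1]))
		return true
	})
	return out
}

func main() {
	// Hypothetical provider set, as it would appear in a wire.go file.
	src := `package di

var set = wire.NewSet(
	NewAuthService,
	wire.Bind(new(AuthRepository), new(*PostgresAuthRepository)),
)
`
	for _, b := range wireBindings(src) {
		fmt.Println(b)
	}
	// Output: AuthRepository => *PostgresAuthRepository
}
```

Because Wire's bindings are ordinary Go code, no runtime reflection is needed: the interface-to-concrete edge is readable straight off the AST.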

3. A feature registry. You define your business features in YAML — auth.login, meals.log-create, billing.checkout — with their API surfaces and priority levels. Test files get a // @testreg auth.login annotation. The tool maps features to tests, traces the dependency graph from each feature's entry point, and produces a per-feature health score weighted by architectural layer:

Health = (handler_coverage × 0.30)
       + (service_coverage × 0.30)
       + (repository_coverage × 0.25)
       + (query_coverage × 0.15)
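In code, the weighting is nothing exotic. A minimal sketch, with hypothetical per-layer coverage ratios:

```go
package main

import "fmt"

// health applies the layer-weighted formula above. Inputs are per-layer
// coverage ratios in [0, 1]; the weights sum to 1.0 and emphasize the
// handler and service layers over repositories and queries.
func health(handler, service, repo, query float64) float64 {
	return handler*0.30 + service*0.30 + repo*0.25 + query*0.15
}

func main() {
	// Hypothetical feature: handlers and services fully tested,
	// repositories three-quarters covered, queries untested.
	fmt.Printf("%.0f%%\n", 100*health(1.0, 1.0, 0.75, 0.0))
	// prints: 79%
}
```

The point of the weighting is that an untested SQL query drags the score down less than an untested handler, because the layers closer to the user carry more release risk.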

The output looks like this:

Feature: auth.login (critical)  Health: 74%
═══════════════════════════════════════════════════════

  Dependency Chain:
    route:/login                                            ✓ tested
    └─ LoginPage                                            ✓ tested
       └─ useAuth                                           ✓ tested
          └─ authApi.login                                  ✓ tested
             └─ POST /api/v1/auth/login                     ✓ tested
                └─ AuthHandler.Login                        ✓ tested
                   └─ authService.Login                     ✓ tested
                      ├─ JWTGenerator.GenerateTokenPair     ✓ tested
                      ├─ authRepository.StoreRefreshToken   ✘ NO TEST
                      ├─ repositories.HashToken             ✘ NO TEST
                      └─ sql:GetUserByEmail                 ✘ NO TEST

  Gaps (3):
    ✘ [CRITICAL] authRepository.StoreRefreshToken — no unit test
    ✘ [HIGH]     repositories.HashToken — no unit test
    ✘ [MEDIUM]   sql:GetUserByEmail — no query-level test

That's not a percentage. That's a map.

[Screenshot: testreg audit output showing dependency chain with tested, partial, and untested markers, coverage bars by architectural layer, and E2E coverage status]
[Screenshot: testreg audit showing performance gaps, benchmark and race test coverage bars, and 13 recommended actions with exact file paths]

The --format prompt flag produces the same gap data as a structured work order, designed for AI agents:

## Feature: auth.login
Priority: critical | Health: 74% | Target: 100%

### Gaps (3):
1. CRITICAL: authRepository.StoreRefreshToken has no unit test
   - Source: src/domain/repositories/auth_repository.go:329
   - Write: unit test for authRepository.StoreRefreshToken
   - Annotation: // @testreg auth.login #real

Every field is actionable: source file path, line number, what test to write, where to put it, the exact annotation to add. The AI receives a work order, not a reading list.

The graph powers more than gap analysis. testreg contract auth.login shows the full API contract and implementation chain — not just which functions are called, but what data flows through each layer with exact function signatures:

$ testreg contract auth.login

  Feature: Login (auth.login)
  Entry:   POST /api/v1/auth/login
  ═══════════════════════════════════════════════════════════════

  Layer 1: Endpoint
    File: apps/web/src/router.tsx:142
    Delegates to: LoginPage

  Layer 2: Component
    File: apps/web/src/pages/auth/LoginPage.tsx:13
    Delegates to: useAuth

  Layer 3: Hook
    File: apps/web/src/hooks/useAuth.ts:19
    Delegates to: authApi.login

  Layer 4: Service
    File: apps/web/src/services/api/auth.ts:46
    Delegates to: POST /api/v1/auth/login

  Layer 5: Handler
    File: src/infrastructure/http/handlers/auth_handler.go:249
    func (*AuthHandler) Login(w http.ResponseWriter, r *http.Request)
    Delegates to: authService.Login

  Layer 6: Service
    File: src/application/services/auth_service.go:172
    func (*authService) Login(ctx context.Context, email string,
                              password string) (*AuthResponse, error)
    Also calls: JWTGenerator.GenerateTokenPair,
                authRepository.StoreRefreshToken,
                repositories.HashToken, sql:GetUserByEmail

  Layer 7: Service
    File: src/infrastructure/auth/password.go:68
    func (*Argon2Hasher) Verify(password string,
                                encodedHash string) (bool, error)

  Test Coverage: 31 test files across Go, TypeScript, Playwright,
                 and Maestro — unit, integration, and E2E

Seven layers, from the React route to the password hasher, with exact function signatures and file locations. A frontend developer runs one command and sees the complete implementation chain without reading Go source. An AI agent implementing a new consumer of this endpoint gets the exact function signatures and delegation chain. The contract is always current because it's derived from source code, not documentation.


How Static Analysis Speeds Up AI-Assisted Development

Without structured tooling, every AI-assisted session starts the same way: scan the codebase, load files into context, figure out what's tested and what isn't. The same orientation work, repeated every session, burning context window before any productive work begins.

The capability gap matters more than the token savings. An AI reading source files as text cannot replicate what AST-based dependency tracing produces. It can't systematically resolve Wire bindings across 839 Go files. It can't trace struct field types through multi-level call chains. It can't map SQLC-generated methods to their source SQL files. These require parsed abstract syntax trees, not pattern matching on text. No amount of context window changes this.

But the token savings are real too. I measured them on a 2,122-file production codebase (184 features, 16 domains):

| testreg command | Output size (chars) | ~Tokens |
| --- | --- | --- |
| sprint -n 10 (prioritized gap list) | 939 | ~235 |
| audit --summary (health by priority tier) | 386 | ~97 |
| gaps --feature auth.login --format prompt (work order) | 3,645 | ~911 |

That sprint + summary is roughly 330 tokens of structured, prioritized output — exact file paths, line numbers, gap severities, weighted health scores — for a codebase that totals approximately 5 million tokens of source code.

An AI exploring that same codebase from scratch to answer "what should I work on next?" needs to read directory structures, sample test files, cross-reference source directories, and reason about coverage. Conservative estimate: 10,000-25,000 tokens of exploration, producing a qualitative approximation. The ratio is roughly 30x-75x for orientation tasks, depending on how selectively the AI reads. For single-feature work on a codebase the AI already knows, the ratio narrows to roughly 5x-12x — still meaningful, but the real value is that the output is structured and actionable, not approximate. (The "with testreg" numbers are measured from actual CLI output at ~4 chars/token, ±30%. The "without" numbers are conservative estimates, not benchmarks.)

Where the savings compound most: parallel agent dispatch. In a sprint where I dispatched 25 agents across 7 batches, each agent received a structured work order of ~900 tokens. Without testreg, each agent independently explores the codebase — multiplying the orientation cost by the number of agents. Five agents exploring independently: five times the orientation cost. Five work orders from one testreg run: the analysis is paid once, offline, in milliseconds.

[Screenshot: testreg sprint output showing 10 features ranked by priority-weighted gap score for sprint planning]

The pattern: cheap deterministic analysis (static tool, milliseconds) feeds expensive creative execution (AI agent, minutes). The tool finds gaps in milliseconds. The AI writes tests in minutes. Neither alone achieves what both together produce.

This isn't just about testing. The same dependency graph powers testreg diagnose auth.login --symptom "401 unauthorized" — match an error pattern against the dependency chain and rank which files to check first. It powers testreg trace for understanding a feature before refactoring it. It powers testreg sprint for data-driven sprint planning instead of gut-driven guessing.

The graph is the product. The commands are lenses.
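The diagnose idea can be sketched in a few lines. This toy version — my illustration, not testreg's actual ranking logic — scores chain nodes by symptom keyword overlap:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rankBySymptom orders a dependency chain by how many symptom keywords
// appear in each node's name, highest first. A real implementation would
// also weight by layer and recent change history.
func rankBySymptom(chain []string, symptom string) []string {
	words := strings.Fields(strings.ToLower(symptom))
	score := func(node string) int {
		n := strings.ToLower(node)
		s := 0
		for _, w := range words {
			if strings.Contains(n, w) {
				s++
			}
		}
		return s
	}
	ranked := append([]string(nil), chain...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return score(ranked[i]) > score(ranked[j])
	})
	return ranked
}

func main() {
	chain := []string{"LoginPage", "authApi.login", "AuthHandler.Login", "sql:GetUserByEmail"}
	for _, n := range rankBySymptom(chain, "auth unauthorized") {
		fmt.Println(n)
	}
	// Output:
	// authApi.login
	// AuthHandler.Login
	// LoginPage
	// sql:GetUserByEmail
}
```

Even this naive heuristic narrows "somewhere in the login flow" to a shortlist of files worth opening first.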


Bugs That Test Coverage Metrics Missed

Theory is cheap. Here's what happened when I applied this methodology across production codebases.

The 5-Bug Cascade Nobody Could See

On the nutrition platform, testreg audit flagged training.record-exercise at 0% health despite having 20+ passing unit tests at the service layer. The audit showed why: the outer GraphQL resolver — the entry point that sits between the API and the business logic — had zero tests.

I traced the dependency chain:

mutationResolver.TrainingLogSet        ← NO TESTS (resolver layer)
  → TrainingResolver.LogSet            ← delegation
    → sessionService.LogSet            ← tested, but...
      → setRepo.Create                 ← repository
        → sql:CreateExerciseSet        ← SQL

The unit tests covered the service layer and below. Nobody had tested the resolver — the integration seam where data transformations happen. Five bugs were hiding there:

| Layer | Bug |
| --- | --- |
| Go resolver | set.exercise always null — a comment said "resolved via DataLoader" but no DataLoader existed |
| Go service | ListHistory returned sessions without segments — stats showed 0 kg, 0 exercises |
| React Native state | Stale closure captured initial values — handleCompleteSet sent weight=0, reps=0 |
| React Native navigation | Race condition on summary screen — re-fetch competed with cache invalidation |
| React Native persistence | Unhandled SQLite foreign key errors polluting error state |

All existing tests passed. The Maestro E2E test passed — it checked for label presence, not values. Traditional code coverage would show the resolver as "covered" because other test paths touched it. testreg showed it was uncovered for this specific feature's entry point.

No single test, no code review, no coverage tool would catch all five. They required tracing the full stack and finding the untested seam.

The Rate Limiter That Was Never There

On a different project, testreg audit flagged load test files as "unknown language." Adding JavaScript assertion patterns made them auditable — which led to actually running the load tests. Which revealed:

  • All GraphQL load tests returned 401 (no auth headers in the test)
  • The rate limiter load test always passed — because the rate limiter was never wired into the middleware chain

A "passing" test that can never fail is worse than no test at all. The chain reaction: testreg audit → unknown language flag → add JS support → run tests → find 401s → fix auth → find rate limiter passing with 0 rejections → discover it was never wired → wire it → verify.

Without the audit flagging an "unknown language," none of this surfaces.

The Numbers

In one sprint against the nutrition platform:

| Metric | Before | After |
| --- | --- | --- |
| Features at 80%+ health | 29 | 46 |
| Critical features at 100% | 10 of 23 | 20 of 23 |
| Tests written | | 500+ |
| Production bugs found | | 5 (multi-layer cascade) |
| Audit time (184 features) | 1m 52s | 14s |

The 7.9x performance improvement came from a single architectural change: build the dependency graph once and query it per-feature, instead of rebuilding it 184 times. A 2-minute command doesn't get integrated into CI. A 14-second command runs on every push.
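The change is easy to sketch. A toy version of the build-once, query-many pattern, with a hypothetical two-node graph standing in for the real AST pass:

```go
package main

import "fmt"

// Graph is a toy dependency graph: adjacency list keyed by function name.
type Graph map[string][]string

// buildGraph stands in for the expensive one-time AST pass.
func buildGraph() Graph {
	return Graph{
		"AuthHandler.Login": {"authService.Login"},
		"authService.Login": {"JWTGenerator.GenerateTokenPair", "sql:GetUserByEmail"},
	}
}

// reachable answers a per-feature query against the prebuilt graph —
// cheap, because the parse work is already done.
func reachable(g Graph, entry string) []string {
	var out []string
	seen := map[string]bool{}
	var walk func(string)
	walk = func(n string) {
		if seen[n] {
			return
		}
		seen[n] = true
		out = append(out, n)
		for _, m := range g[n] {
			walk(m)
		}
	}
	walk(entry)
	return out
}

func main() {
	g := buildGraph() // built once...
	for _, feature := range []string{"AuthHandler.Login", "authService.Login"} {
		// ...then queried per feature, instead of re-parsing per feature.
		fmt.Println(feature, "reaches", len(reachable(g, feature)), "nodes")
	}
	// Output:
	// AuthHandler.Login reaches 4 nodes
	// authService.Login reaches 3 nodes
}
```

Amortizing the parse across all features is the whole trick: the per-feature cost drops from "walk the source tree" to "walk an in-memory map".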


Running Static Analysis on a Codebase You've Never Seen

Everything above was tested on projects I built. Here's what happens on a project I didn't.

Metro-Grama is a university transit app built with Echo (Go) and React/TypeScript — a stack I didn't build and hadn't seen before. I cloned it and ran testreg with zero annotations.

The entire setup:

# 1. Clone the project
git clone https://github.com/MetroTech-UNIMET/Metro-Grama.git
cd Metro-Grama

# 2. Create .testreg.yaml (4 lines)
cat > .testreg.yaml << 'EOF'
graph:
  backend_root: server
  frontend_roots:
    - client/src
  max_depth: 10
EOF

# 3. Auto-discover routes and generate features
testreg init --discover

# 4. Scan for existing tests
testreg scan

[Screenshot: testreg init --discover output auto-detecting 43 routes from the Echo router across 15 domains with zero manual configuration]

Results — no annotations written, no features manually defined:

Discovered 43 routes → 43 features across 15 domains
  auth: 5 features
  careers: 5 features
  enroll: 5 features
  subjects: 4 features
  ...

Scan complete.
  Total test files: 6
  Mapped:           3 (auto-mapped by directory proximity)
  Unmapped:         3

testreg init --discover parsed the Echo router, found all 43 routes, grouped them by module, and generated features with correct API surfaces. testreg scan found 6 existing test files and auto-mapped 3 of them to features by matching directory names to domains — server/modules/auth/auth_test.go mapped to auth features, client/src/features/student/Profile/Profile.test.tsx mapped to student features.

  ┌──────────────────────┬───────┬──────────┬──────────┬──────────┐
  │ Domain               │ Total │ Unit     │ Integ.   │ E2E      │
  ├──────────────────────┼───────┼──────────┼──────────┼──────────┤
  │ auth                 │ 5     │ 5/5 OK   │ 0/5 !!   │ 0/5 !!   │
  │ careers              │ 5     │ 0/5 !!   │ 0/5 !!   │ 0/5 !!   │
  │ enroll               │ 5     │ 3/5 ✓    │ 0/5 !!   │ 0/5 !!   │
  │ student              │ 3     │ 3/3 OK   │ 0/3 !!   │ 0/3 !!   │
  │ ...                  │       │          │          │          │
  ├──────────────────────┼───────┼──────────┼──────────┼──────────┤
  │ TOTAL                │ 43    │ 26% !!   │ 0% !!    │ 0% !!    │
  └──────────────────────┴───────┴──────────┴──────────┴──────────┘

[Screenshot: testreg status dashboard for Metro-Grama showing 43 features across 15 domains with 26% unit coverage and zero integration or E2E tests]

In under a minute, with 4 lines of configuration, testreg produced a complete coverage dashboard for a foreign codebase. The dependency graph was auto-discovered from the Echo router. The test mapping used directory proximity, not annotations. The coverage gaps are immediately visible: auth has unit tests, careers has nothing, nobody has integration or E2E tests.

The annotations would make it more precise — mapping specific test functions to specific features. But even without them, the tool produces actionable output on day zero.


What This Tool Can't Do (And Why That's the Point)

testreg makes specific assumptions about your codebase. Understanding them upfront determines whether it's useful to you.

Router support:

| Router | Support Level |
| --- | --- |
| Chi | Auto-detected |
| Echo | Auto-detected |
| stdlib net/http | Auto-detected |
| Gin, Fiber, gorilla/mux | Via @api annotations |

Chi is battle-tested across 184 features on a production monorepo. Echo was validated against Metro-Grama (43 routes). Other routers are supported through @api annotations but haven't seen the same volume of real-world usage.

Dependency injection support:

| DI Framework | Support Level |
| --- | --- |
| Google Wire | Full — parses wire.Bind() and provider functions |
| Uber Fx / Dig | Full — parses fx.Provide(), fx.Options(), dig.Provide() |
| Manual wiring (struct fields) | Full |
| Manual wiring (constructor params) | Partial — constructor visible, internal calls not traced |
| Closures / globals | None |

Type resolution:

By default, testreg uses go/ast heuristics — fast, works on any source code, no build required. An experimental type_checking: true option enables go/types for cross-package resolution — not yet battle-tested, but the foundation for exact interface resolution in future versions.
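For a feel of what go/types buys you, here is a small self-contained example — an illustration of the stdlib API, not testreg's code — that proves a concrete type satisfies an interface exactly, rather than guessing by name:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/importer"
	"go/parser"
	"go/token"
	"go/types"
)

// implementsIface type-checks src with go/types and reports whether the
// named concrete type satisfies the named interface — exact resolution,
// not a name-matching heuristic.
func implementsIface(src, concrete, iface string) bool {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "app.go", src, 0)
	if err != nil {
		panic(err)
	}
	conf := types.Config{Importer: importer.Default()}
	pkg, err := conf.Check("app", fset, []*ast.File{file}, nil)
	if err != nil {
		panic(err)
	}
	i := pkg.Scope().Lookup(iface).Type().Underlying().(*types.Interface)
	c := pkg.Scope().Lookup(concrete).Type()
	return types.Implements(c, i)
}

func main() {
	// Hypothetical repository interface and Postgres implementation.
	src := `package app

type AuthRepository interface{ StoreRefreshToken(hash string) error }

type PostgresAuthRepository struct{}

func (PostgresAuthRepository) StoreRefreshToken(hash string) error { return nil }
`
	fmt.Println(implementsIface(src, "PostgresAuthRepository", "AuthRepository"))
	// Output: true
}
```

The trade-off is real: go/types needs resolvable imports and a consistent package, where the go/ast heuristic happily chews on any single file.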

Data access:

| Tool | Support |
| --- | --- |
| SQLC | Full — maps generated methods to SQL source files |
| GORM, ent, raw SQL | None — too dynamic for AST analysis |

Frontend:

| Framework | Support |
| --- | --- |
| React Router + TanStack Query | Full tracing |
| Next.js, Vue, Svelte, Angular | None |

Architecture assumptions:

The health score weights assume a layered architecture: handler → service → repository → query. Flat architectures get skewed scores. The graph still builds, but the weights won't reflect your actual risk distribution.

What's heuristic, not deterministic:

  • Test status is file existence, not pass/fail. A broken test still shows as "tested."
  • partial coverage counts the same as tested in the health score — which can inflate it.
  • Gap severity is fixed by architectural layer, not code complexity or change frequency.

The hidden dependency: coding conventions.

None of this works without disciplined engineering practices underneath it. testreg can classify a function as a "handler" because the directory is named handlers/. It can trace h.service.Login() because dependencies are struct fields, not closures. It can weight health scores by layer because the architecture has layers. The tool doesn't create structure — it reads structure that's already there.

Naming conventions, folder organization, layered architecture, struct field injection, code-generation DI — these aren't just "clean code" preferences. They're machine-readable metadata. Every convention your team follows is a signal that static analysis can consume. Every shortcut — a global variable here, a closure-captured dependency there — is an edge the graph can't trace.

This is why testreg rewards teams that already follow good practices and offers less to codebases that don't. The tool is a consequence of disciplined engineering, not a substitute for it.

I'm explicit about these limitations because they are the design. I built testreg on tools that already exist in the Go ecosystem — go/ast, go/parser, Wire, SQLC, the TypeScript compiler API (with experimental go/types support in progress). I didn't invent a new static analysis framework. I composed existing ones into a workflow that answers a question they couldn't answer individually.


Why Your Stack Choice Matters for AI-Assisted Development

There's a thesis hiding inside this tool that has nothing to do with testing.

The reason testreg works is not that the approach is clever. It's that Go's ecosystem provides the raw materials for deterministic static analysis:

  • go/ast and go/parser are standard library — zero dependencies to parse any Go file
  • go/types provides full type resolution — opt-in, first-party, currently experimental in testreg but the capability exists in the stdlib
  • Wire and SQLC use code generation, not runtime reflection — their bindings are visible to static analysis
  • Even Uber Fx/Dig, which use runtime reflection at execution time, register providers through static Go code that AST parsing can trace

Compare this to ecosystems where the same approach is structurally harder:

| Static Analysis Feasibility | Examples |
| --- | --- |
| Traceable — static typing + stdlib AST tooling | Go, Rust, Java, C# |
| Partially traceable — provider registration parseable, runtime behavior not | Uber Fx/Dig (Go), Spring (Java) |
| Not statically resolvable — dynamic typing, metaprogramming | Python, Ruby, plain JavaScript |

This doesn't mean Go is the only viable stack for AI-assisted development. Java and C# have excellent static analysis ecosystems. Rust's type system is even stricter. TypeScript's compiler API powers testreg's frontend scanning. Python has the ast module and tools like mypy.

But the principle is clear: stacks with accessible AST tooling and deterministic DI resolution give you a structural advantage for building the kind of tooling that makes AI-assisted development efficient. The graph testreg builds is possible because Go was designed with tooling in mind. Every piece — the parser, the type checker, the convention of struct field injection — contributes to making the codebase statically analyzable.

The next time you evaluate a tech stack, consider asking: does this ecosystem give me the tools to build deterministic static analysis? Because in AI-assisted development, the stack that's easiest to analyze statically is the stack where AI delivers the most value.


The Future: OpenAPI, Richer Metadata, and Self-Describing Codebases

What follows are directions, not shipped features.

testreg uses a minimal set of custom annotations — @testreg for feature mapping, @api for route binding, #mocked/#real for test classification. Even this minimal metadata layer produces significant value. But there's more on the table.

Projects that already have structured metadata — OpenAPI specs, GraphQL schemas, Swagger comments — are sitting on information that tools like testreg could consume. An OpenAPI spec already defines routes, request/response schemas, and authentication requirements. A GraphQL schema already defines the type contracts between layers. Today, testreg doesn't parse these standards. It could. Integrating with OpenAPI would eliminate the need for @api annotations entirely on projects that already document their APIs — one less custom tag, one less thing to maintain.

Beyond existing standards, richer annotations could unlock more precise AI work orders. If a test annotation included the expected invariant (@invariant "never stores plaintext password"), the AI could write assertion-rich tests targeting specific guarantees instead of generic happy-path coverage. If features declared dependencies (@depends-on auth.session), the tool could answer "if auth breaks, which downstream features are at risk?"

There's also the reverse direction. The same conventions testreg reads for analysis — directory structure, DI patterns, test naming — could drive generation in reverse. Define meals.log-create in the registry YAML, run a scaffold command, and get backend stubs — handler, service, repository, SQL query, test files — all wired, all annotated, all following the conventions the tool already understands. Frontend scaffolding from the same definition is a longer-term goal that requires bridging the Go-to-TypeScript generation boundary. The feature registry becomes a blueprint, not just documentation. Scaffolding generates the structure; an AI agent fills in the implementation; testreg verifies the coverage. The tool bookends the creative work.

Each additional annotation makes AI work orders more precise. Each one is also maintenance burden. An annotation that drifts from reality is worse than no annotation: it gives false confidence. The right strategy is: derive metadata from code wherever possible (testreg already does this with AST parsing and init --discover), build on existing standards where they cover the use case (OpenAPI, GraphQL schemas), and add custom annotations only for what no standard covers (business feature mapping, test classification).

The more self-describing your codebase is, the less an AI agent needs to explore. testreg is one implementation of this principle for one stack. The principle is stack-agnostic. And we're just getting started.


This is still early. testreg solves visibility into test coverage — not test quality, not code correctness, not security. It's a memory bank and a gap finder for specific workflows: sprint planning, codebase onboarding, bug triage, agent dispatch. One tool in a larger toolkit that's still being built.

But the piece it solves sits at a foundation layer. Knowing which business features are at risk, across every layer of the stack, without re-exploring the codebase every session — that's what other tools can build on. And the pattern underneath it — structured metadata feeding AI execution, static analysis compressing exploration into structured output, ecosystem tooling making codebases self-describing — that pattern has a long way to run.

Your coverage percentage is lying to you. Now you know what to ask instead.

testreg on GitHub
