Wes Nishio for GitAuto

Posted on • Originally published at gitauto.ai

Test File Discovery Is Still Unsolved

Given a file like src/pages/checkout/index.tsx, which test files should you look at? Sounds simple. It's not.

We build an AI agent that writes tests. Before the agent starts, we need to find existing test files so it can match the project's testing patterns. We looked at the agent's logs for one real run: 34 iterations total, and 18 of them were spent just reading files - fetching imported modules, searching for type definitions, re-reading files it had already seen. The agent can read 2-3 files per iteration in parallel, but it still burned half its budget on discovery instead of writing tests.

The agent can solve this on its own: it searches, reads, and eventually finds the right files. But each iteration costs tokens. We want to pre-load as much context as possible before the agent loop begins, doing deterministically what the agent would do heuristically. Same work, but programmatic, stable, and cheaper. The discovery algorithm is the hard part - especially when you're language-agnostic and can't rely on any single project's conventions.

Approach 1: Stem Matching

Extract the filename stem and search the tree. Say you have src/auth/SessionProvider.tsx - the stem is SessionProvider. Walk the file tree, find test files containing "SessionProvider" in their path. This works for most files.
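A minimal sketch of this approach, assuming Python tooling. The function names (`is_test_file`, `find_tests_by_stem`) and the test-marker conventions are illustrative, not GitAuto's actual implementation:

```python
from pathlib import PurePosixPath

# Assumed naming conventions for test files; real tooling would need more.
TEST_MARKERS = (".test.", ".spec.", "_test.", "test_")

def is_test_file(path: str) -> bool:
    name = PurePosixPath(path).name.lower()
    return any(marker in name for marker in TEST_MARKERS)

def find_tests_by_stem(target: str, all_paths: list[str]) -> list[str]:
    # PurePosixPath("src/auth/SessionProvider.tsx").stem == "SessionProvider"
    stem = PurePosixPath(target).stem.lower()
    return [p for p in all_paths if is_test_file(p) and stem in p.lower()]
```

For `SessionProvider` this returns exactly the colocated test; for a stem like `index`, the same substring check is what produces the flood of matches described below.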

It fails for generic stems. A file like src/pages/checkout/index.tsx has stem index. Grepping for "index" across a codebase matches almost everything - 29 test files in one real repo. The signal drowns in noise.

We considered falling back to the parent directory name for generic stems (index -> checkout). This helps for some cases, but "generic" is a judgment call. Is utils generic? config? handler? Every heuristic creates a new edge case.

Approach 2: Content Grepping

Instead of matching paths, grep test file contents for the stem. If a test file imports SessionProvider, it references that implementation. This catches tests in completely different directories - e.g. a test in src/pages/checkout/ might import ../../auth/SessionProvider.

But content grep has a different failure mode. Many JavaScript projects use barrel exports (index.ts re-exporting everything). A test might import from '@/pages/checkout' which resolves to index.tsx at runtime, but the string "index" never appears in the import. The connection exists at the module resolution level, not the string level.

PHP and Go hit the same problem in different ways. A PHP test file might reference InvoiceService by class name without any file path in the import. A Go test lives in the same package directory and imports nothing explicitly.

Approach 3: Hybrid (Current)

We now combine both approaches. Path matching (walk the tree for test files whose path contains the stem) plus content grep (find test files that reference the stem in source code). Take the union. This catches both colocated tests and distant tests that import the file.
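The union can be sketched in a few lines. This is a self-contained illustration under the same assumptions as before (test contents pre-loaded into a `dict`); the real pipeline would stream files and handle encoding:

```python
import re
from pathlib import PurePosixPath

def hybrid_candidates(target: str, test_files: dict[str, str]) -> set[str]:
    """Union of path-match hits and content-grep hits (illustrative)."""
    stem = PurePosixPath(target).stem
    # Path matching: test paths containing the stem.
    path_hits = {p for p in test_files if stem.lower() in p.lower()}
    # Content grep: test sources referencing the stem as an identifier.
    pattern = re.compile(rf"\b{re.escape(stem)}\b")
    grep_hits = {p for p, src in test_files.items() if pattern.search(src)}
    return path_hits | grep_hits
```

The union is deliberately greedy - false negatives are worse than false positives here, because ranking (below) gets a chance to demote the noise.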

The problem shifts from discovery to ranking. A real repo produces 29 test file hits for index.tsx (from 51 raw grep matches). Five of them are highly relevant (in src/pages/checkout/ subtree). The other 24 are noise. Which 5 do we load into context?

The Ranking Bug That Toy Tests Missed

We scored each test file: +100 for name match, +50 for same directory, +30 for shared parent, -1 per distance. We wrote tests with 3 handcrafted files. They passed.

Then we ran the ranker against 29 real file paths from a production repo. src/index.test.tsx (the root app test, completely unrelated) ranked #2. src/pages/checkout/components/PayButton/index.test.tsx (actually relevant) ranked #4.

The bug: +30 was a flat bonus for any shared parent. One shared component (src/) got the same +30 as three shared components (src/pages/checkout/). With 3 synthetic inputs, other scoring factors dominated. With 29 real inputs at varying depths, the flat bonus broke everything.

The fix was one line: change +30 to common_len * 10 so deeper shared paths score higher.
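A hedged reconstruction of the fixed scorer. The constants (+100, +50, ×10, -1) match the post; the function name, tie-breaking, and distance definition are assumptions:

```python
from pathlib import PurePosixPath

def score(target: str, test_path: str) -> int:
    t, c = PurePosixPath(target), PurePosixPath(test_path)
    s = 0
    if t.stem.lower() in c.name.lower():
        s += 100                  # name match
    if t.parent == c.parent:
        s += 50                   # same directory
    common_len = 0                # shared leading directory components
    for a, b in zip(t.parts[:-1], c.parts[:-1]):
        if a != b:
            break
        common_len += 1
    s += common_len * 10          # the fix: was a flat +30 for any shared parent
    # -1 per directory separating the two files beyond the common prefix.
    distance = (len(t.parts) - 1 - common_len) + (len(c.parts) - 1 - common_len)
    return s - distance
```

With the multiplier, `src/pages/checkout/components/PayButton/index.test.tsx` (three shared components) now outranks `src/index.test.tsx` (one shared component) for the target `src/pages/checkout/index.tsx`.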

This is the mutation testing principle. Imagine an "evil coder" who changes your constant: +30 to +0 or +1000. Do your tests fail? With 3 synthetic inputs, no. The tests pass regardless of the constant's value. That means they prove nothing about it. Only 29 real inputs exposed the flaw.
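The "evil coder" check can be made concrete. This illustrative snippet isolates the shared-parent bonus and runs the same real-path assertion with the constant mutated - showing the assertion is actually sensitive to the constant, which a toy fixture's wasn't:

```python
from pathlib import PurePosixPath

def common_parent_bonus(target: str, test_path: str, per_level: int) -> int:
    # Count shared leading directory components, score per level (illustrative).
    t = PurePosixPath(target).parts[:-1]
    c = PurePosixPath(test_path).parts[:-1]
    common = 0
    for a, b in zip(t, c):
        if a != b:
            break
        common += 1
    return common * per_level

target = "src/pages/checkout/index.tsx"
root_test = "src/index.test.tsx"        # 1 shared component: src/
deep_test = "src/pages/checkout/components/PayButton/index.test.tsx"  # 3 shared

for mutant in (0, 10, 1000):
    relevant_wins = (common_parent_bonus(target, deep_test, mutant)
                     > common_parent_bonus(target, root_test, mutant))
    # Only a nonzero per-level bonus separates these two real paths;
    # a flat bonus (equivalent to mutant=0 differences) cannot.
    assert relevant_wins == (mutant > 0)
```

With three handcrafted files at the same depth, every mutant would have passed; the two real paths at different depths are what make the assertion discriminating.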

What Remains Unsolved

The fundamental issue is that the mapping between implementation files and test files is a convention, not a computable relationship. Every project invents its own rules:

  • Colocated: Button.tsx and Button.test.tsx side by side
  • Mirror tree: src/auth/Provider.tsx tested by tests/auth/Provider.test.tsx
  • Separate dir with different naming: core/app/Services/Foo.php tested by core/tests/Unit/Service/FooTest.php
  • Framework magic: Go tests in the same package, Python tests discovered by pytest markers
  • Barrel re-exports: The actual file path never appears in any import statement

No single algorithm handles all of these. Path matching fails for different directory structures. Content grep fails for barrel exports and framework-level imports. Even the hybrid approach requires a ranking function, and that ranking function needs real data to validate - not 3 handcrafted inputs.

If you're building developer tooling that needs to answer "which test covers this file?" - there's no clean answer. The best we've found is: try multiple discovery methods, take the union, rank aggressively, and validate with real repository data at real scale. And even then, you'll miss cases.
