Alexey Spinov

Posted on Jun 23 • Originally published at finops.spinov.online

Dependency Gap in AI Code: Declared 1, Imported 4

#testing #ai #devops #python

The dependency gap in AI-generated code is what the source imports minus what the manifest declares. A green CI run only proves the packages were already present wherever the tests ran; it says nothing about a fresh checkout. So measure the gap statically, before merge. repro_probe.py reads the source with ast and never runs it. On a project that declared one package but imported four: gap=3, coverage 25.0%, exit 1.

AI disclosure: I drafted this with an AI writing assistant. The tool, the three fixtures, and every number below come from a real local run on Python 3.13.5, stdlib only, no network. I ran it, checked the exit codes, hashed STDOUT from two runs to confirm it's byte-for-byte deterministic, and edited every line before publishing.

Green CI is a machine telling you it agreed with itself. The interesting question is what happens on a machine that isn't yours.

Here's the failure I keep seeing. An agent writes a feature. It imports pandas, numpy, a YAML parser. The tests pass, because the agent's environment already had those installed from some earlier task. The diff lands. A teammate pulls it, runs pip install -r requirements.txt, runs the code, and gets ModuleNotFoundError: No module named 'numpy'. The manifest never learned about the import. Nobody lied. The information just never made it from the source file into the dependency list.

That gap is small, boring, and statically measurable. So I measured it.

What I actually ran

TL;DR.

A green checkmark proves the suite ran where the packages were already present. It does not prove the project installs from its own manifest.
repro_probe.py walks every *.py with the stdlib ast, reads the manifest as text, and never imports, installs, or runs anything in the target project.
Four deterministic rules per import: R1 stdlib, R2 local module, R3 declared-match, R4 undeclared third-party (the gap).
On a project that declared requests and imported four third-party packages: gap=3 (numpy, pandas, yaml), coverage 25.0%, exit 1.
On an honest project where every import is declared, local, or stdlib: gap=0, coverage 100.0%, exit 0. The claim is falsifiable and it passed the honest case.
STDOUT is byte-for-byte identical across two runs (sha256 matched). No key, no network. No manifest → exit 2.

The run comes before the argument, because the run is the argument.

The contrarian bit: passing is not reproducible

Most discussion of AI-generated code stops at "does it run." Did the agent produce something? Did the tests go green? Ship it. The expensive part is the part nobody checks at merge time: would this install and import on a clean machine, from nothing but its own manifest?

There's recent evidence the gap is real and not rare. In AI-Generated Code Is Not Reproducible (Yet) (arXiv 2512.22387, v3, March 2026), Vangala, Adibifar, Gehani and Malik generated 300 projects from 100 prompts across Claude Code, OpenAI Codex and Gemini, then tried to run each one. Their abstract reports that only 68.3% of projects execute out-of-the-box, so roughly a third fail on first run. By language the spread is wide: Python 89.2%, Java 44.0%. And the line that made me build this tool: they measured a 13.5x average expansion from declared to actual runtime dependencies. At that ratio, three declared packages turn into about forty at runtime. My broken fixture imitates the same shape, just smaller.

I want to be careful with those numbers. They are their measurement, not mine: a controlled study of generated projects, cited here with the source so you can read the methodology yourself. My own numbers below are only from my fixtures.

The four rules

The whole thing is one file. Every import in the source gets sorted into exactly one bucket, deterministically:

R1 stdlib. The import is in sys.stdlib_module_names (Python 3.10+ ships this set). os, json, pathlib: no declaration needed.
R2 local. There's a name.py or name/__init__.py in the project. It's your own module, not a dependency.
R3 declared-match. The normalized import name is in the manifest. This is where the annoying cases live: you import yaml but the package is PyYAML; you import cv2 but install opencv-python. A small map handles the common ones.
R4 undeclared. Not stdlib, not local, not declared. That's the gap. It imports a third-party package the manifest never mentions, so a fresh pip install won't pull it.
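The rules above assume the imports have already been collected. A minimal sketch of that step, assuming the tool keeps only the top-level name of each import and skips relative imports (both are my assumptions; the function names here are hypothetical, not taken from repro_probe.py):

```python
import ast
from pathlib import Path


def top_level_imports(source: str) -> set[str]:
    """Return the top-level module names imported anywhere in the source.

    ast.parse only builds a syntax tree; nothing in the source is executed.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import (from . import x).
            # Assumption: those are always local, so they are skipped.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def project_imports(root: str) -> set[str]:
    """Union of top-level imports across every *.py file under root."""
    found = set()
    for path in sorted(Path(root).rglob("*.py")):
        found |= top_level_imports(path.read_text(encoding="utf-8"))
    return found


sample = (
    "import os\n"
    "import numpy as np\n"
    "from yaml import safe_load\n"
    "from . import utils\n"
    "import xml.etree.ElementTree as ET\n"
)
print(sorted(top_level_imports(sample)))  # ['numpy', 'os', 'xml', 'yaml']
```

Using ast.walk rather than scanning only module-level statements means imports inside functions or try blocks are picked up too, which matters for the optional code paths discussed later.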

The metric is just the size of the R4 set. Here's the classification loop, verbatim:

for name in sorted(imports):
    if name in std:
        rule = "R1 stdlib"
    elif name in local:
        rule = "R2 local"
    elif norm(DIST_MAP.get(name, name)) in decl:
        rule = "R3 declared"
    else:
        rule = "R4 UNDECLARED"; gap.append(name)
    verdicts.append((name, rule))

No model call. No pip. No subprocess. It reads text and walks an AST.

The broken project: declared one, imported four

The fixture is a tiny "agent-style" file. It imports requests, numpy, pandas, and yaml, plus a local utils module and two stdlib modules. The requirements.txt has exactly one line: requests. Here is the real, unedited output:

repro_probe v1 | project=broken_project
  logging          R1 stdlib
  numpy            R4 UNDECLARED
  os               R1 stdlib
  pandas           R4 UNDECLARED
  requests         R3 declared
  utils            R2 local
  yaml             R4 UNDECLARED
reproducibility_gap=3 declared_coverage=25.0% gate=0
imported_but_not_declared=numpy,pandas,yaml
verdict=BROKEN exit=1

Three undeclared third-party imports out of four. Coverage 25.0%. Exit 1, which in CI fails the build.
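The arithmetic behind that summary line can be sketched as follows. I am assuming coverage is declared third-party imports divided by all third-party imports, and that the build fails when the gap exceeds the gate; both readings fit the output above, but the function itself is hypothetical:

```python
def summarize(verdicts, gate=0):
    """Return (gap, coverage_percent, exit_code) from (name, rule) pairs."""
    third_party = [n for n, rule in verdicts if rule.startswith(("R3", "R4"))]
    gap = [n for n, rule in verdicts if rule.startswith("R4")]
    if third_party:
        coverage = 100.0 * (len(third_party) - len(gap)) / len(third_party)
    else:
        # Assumption: no third-party imports at all counts as fully covered.
        coverage = 100.0
    # Assumption: exit 1 only when the gap is larger than the tolerated gate.
    exit_code = 1 if len(gap) > gate else 0
    return len(gap), round(coverage, 1), exit_code


# Verdicts copied from the two fixture outputs in this article.
broken = [
    ("logging", "R1 stdlib"), ("numpy", "R4 UNDECLARED"), ("os", "R1 stdlib"),
    ("pandas", "R4 UNDECLARED"), ("requests", "R3 declared"),
    ("utils", "R2 local"), ("yaml", "R4 UNDECLARED"),
]
clean = [
    ("helpers", "R2 local"), ("io", "R1 stdlib"), ("json", "R1 stdlib"),
    ("os", "R1 stdlib"), ("pathlib", "R1 stdlib"),
    ("requests", "R3 declared"), ("yaml", "R3 declared"),
]
print(summarize(broken))  # (3, 25.0, 1)
print(summarize(clean))   # (0, 100.0, 0)
```

Stdlib and local imports stay out of the denominator, which is why one declared package out of four third-party imports gives 25.0% rather than a figure diluted by os and logging.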

Look at yaml. It's flagged R4 here because the manifest doesn't list its distribution name, PyYAML. That's the import-name vs distribution-name trap that makes a plain grep for import lines so unreliable: the strings don't match even when the dependency is "obvious." The tool resolves each import name through its map before comparing, so yaml is flagged here only because PyYAML really is absent, and (as the next section shows) it is cleared when PyYAML actually is declared.

The honest project: the claim has to be falsifiable

A check that flags everything is worthless. If my contrarian line ("passing is not reproducible") is real, then a genuinely clean project must come back clean. So the second fixture declares every third-party import it uses, imports a local helpers module, and leans on stdlib. Same tool, same rules:

repro_probe v1 | project=clean_project
  helpers          R2 local
  io               R1 stdlib
  json             R1 stdlib
  os               R1 stdlib
  pathlib          R1 stdlib
  requests         R3 declared
  yaml             R3 declared
reproducibility_gap=0 declared_coverage=100.0% gate=0
imported_but_not_declared=(none)
verdict=CLEAN exit=0

Gap 0, coverage 100.0%, exit 0. Note yaml is R3 declared here. Same import, opposite verdict, because this manifest lists PyYAML. The map cuts both ways: it doesn't false-flag a dependency that's declared under its real distribution name. The honest project passes. That's the part that makes the tool a check and not an alarm that always fires.

The third fixture, bad_project, has source but no requirements.txt and no pyproject.toml. It exits 2. Bad input, refuse to guess. A tool that invented a verdict from a missing manifest would be worse than no tool.

Is it deterministic?

A pre-merge gate that flickers is noise. I ran each fixture twice and hashed STDOUT only (service messages go to stderr, kept out of the hashed stream):

clean_project   run1=5177ac0a...  run2=5177ac0a...  -> IDENTICAL
broken_project  run1=52a677f9...  run2=52a677f9...  -> IDENTICAL
bad_project     run1=e3b0c442...  run2=e3b0c442...  -> IDENTICAL

Same input, same bytes, every time. The output is sorted, so there's no set-ordering wobble. You can wire the exit code into CI and trust that a clean tree stays green.
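The double-run check is easy to reproduce. A minimal sketch, assuming the probe is invoked as a subprocess and only STDOUT is hashed; the demo command is a stand-in so the snippet runs without repro_probe.py present:

```python
import hashlib
import subprocess
import sys


def stdout_sha256(cmd: list[str]) -> str:
    """Run a command and hash only its STDOUT; stderr is captured and ignored."""
    result = subprocess.run(cmd, capture_output=True, check=False)
    return hashlib.sha256(result.stdout).hexdigest()


def is_deterministic(cmd: list[str], runs: int = 2) -> bool:
    """True when every run produced byte-identical STDOUT."""
    return len({stdout_sha256(cmd) for _ in range(runs)}) == 1


# Stand-in command. For the real check, swap in something like
# [sys.executable, "repro_probe.py", "path/to/your/project"].
demo = [sys.executable, "-c", "print('reproducibility_gap=0')"]
print(is_deterministic(demo))  # True

# SHA-256 of empty input, for comparison with the table above.
print(hashlib.sha256(b"").hexdigest()[:8])  # e3b0c442
```

The e3b0c442 prefix in the bad_project row is the SHA-256 of empty input, which is consistent with that fixture writing nothing to STDOUT and sending its message to stderr. check=False matters here: exit 1 and exit 2 are expected outcomes, not errors in the harness.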

What this is NOT

This is the part that keeps the tool honest, so read it before you quote a number.

A dependency gap is a reproducibility signal, not proof the project won't run. It says "a third-party package is imported but not declared," which is a strong reason a fresh install would fail, but not a guarantee. Maybe the package is provided by the base image. Maybe it's an optional code path. The gap flags risk; it doesn't render a verdict on the project's fate. I called the metric gap, not brokenness, on purpose.

And it has real blind spots I'm not going to hide:

It doesn't check version pins. numpy declared but pinned to a version that doesn't exist, or a range that conflicts, sails through as R3. The gap is about presence, not resolvability.
It doesn't follow transitive dependencies. It sees your direct imports, not what those packages drag in. The 13.5x expansion the paper measured lives mostly in the transitive layer, which a static import-vs-manifest check can't reach.
It doesn't understand extras or environment markers. package[extra] and ; sys_platform == ... are normalized down to the base name; the extra is not verified.
The stdlib allowlist is version-bound. It's whatever sys.stdlib_module_names reports on the Python you run it with. Run it on a different minor version and a module could shift buckets.
It only reads requirements.txt and PEP 621 [project] dependencies. Poetry's [tool.poetry.dependencies], setup.py/setup.cfg, and optional-dependencies are not parsed as the dependency set. Point it at a Poetry project and every third-party import comes back undeclared, a false BROKEN, not a real gap. Run it on requirements-based projects, or read the verdict knowing this.
Local detection is top-level only. It treats name.py or name/__init__.py in the project root as local (R2). A src/-layout package or a PEP 420 namespace package (a local dir with no __init__.py) gets flagged R4 instead. That's a false positive on a perfectly reproducible project, not a missing dependency.

The last two items are false-BROKEN cases, and they err the safe way for a gate (it complains too much rather than too little). They're also why you read the named list, not just the exit code, before you trust a verdict.
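To make the local-detection blind spot concrete, here is a sketch of root-only detection next to one possible extension that also looks under src/. The extension is hypothetical and not part of the tool; it is only meant to show where the false positive comes from:

```python
import tempfile
from pathlib import Path


def local_modules(root: str, extra_roots: tuple[str, ...] = ("src",)) -> set[str]:
    """Names of name.py or name/__init__.py under root and any extra roots.

    extra_roots=() reproduces the root-only behavior described above.
    """
    names = set()
    bases = [Path(root)] + [Path(root) / extra for extra in extra_roots]
    for base in bases:
        if not base.is_dir():
            continue
        for entry in base.iterdir():
            if entry.is_file() and entry.suffix == ".py":
                names.add(entry.stem)
            elif entry.is_dir() and (entry / "__init__.py").is_file():
                names.add(entry.name)
    return names


# Throwaway project: utils.py at the root, a package under src/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "utils.py").write_text("")
    (Path(tmp) / "src" / "mypkg").mkdir(parents=True)
    (Path(tmp) / "src" / "mypkg" / "__init__.py").write_text("")
    root_only = local_modules(tmp, extra_roots=())
    with_src = local_modules(tmp)

print(sorted(root_only))  # ['utils']
print(sorted(with_src))   # ['mypkg', 'utils']
```

With root-only detection, import mypkg falls through to R4 even though the package ships in the repository. PEP 420 namespace packages would still be missed by both variants, since neither accepts a directory without __init__.py.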

So: a clean exit 0 is necessary, not sufficient. It rules out the dumbest, most common failure (imported but never declared) and nothing more. That one class is worth ruling out, because it's the one that turns a green PR into a teammate's ModuleNotFoundError.

How this sits next to the other checks

This isn't a runtime guard. It's an artifact check that runs before the merge button, on the diff, with nothing executing. It's a cousin of the green-checkmark auditor, which asks whether passing tests carry independent signal; here the question is whether the manifest matches the imports. Both come from the same suspicion that a green status is a claim, not a proof. It's the same suspicion behind "your agent returns 200 and lies." If you already parse manifests for drift, it rhymes with pinning and verifying MCP tool manifests. And the whole instinct, put the barrier before the irreversible step rather than after, is the pre-execution gate applied to the merge instead of the runtime.

Run it on your own repo

The tool is one Python file, stdlib only. Point it at a real project:

python3 repro_probe.py path/to/your/project
echo "exit: $?"

Exit 0 means every imported third-party package is declared (or it's stdlib/local). Exit 1 hands you the list of imported-but-not-declared names. Exit 2 means there's no manifest to check against. Add --gate N if you want to tolerate a known gap of N while you fix it.

It won't catch a bad version pin or a missing transitive dep. It will catch the most common reason an AI-written diff passes CI and then won't install: the package the source imports and the manifest forgot.

If you run it on something an agent wrote recently, I'd genuinely like to know the gap you got, and whether any of it was the import-name vs distribution-name trap. That's the case I'm least sure my little map covers well. Follow along for the next batch of numbers, and drop your worst gap in the comments.
