Dead Code in Python Is Undecidable — So I Built a Detector That Admits It
An AST-based dead-code detector for Python with explicit confidence tiers and a first-class allowlist, designed to be introduced to a legacy codebase without drowning the team in false-positive noise.
Every Python codebase older than two years has dead code in it. Refactors leave orphans. Feature flags outlive the features they gated. A helper gets copy-pasted and the original is never deleted. Manually, you can sometimes find this stuff with git grep — but Python's dynamism fights you: attribute access is string-keyed, imports are runtime-evaluated, decorators rewrite things, frameworks invoke handlers by name. Grep lies to you in both directions.
There's an existing tool for this: vulture. It's good, it's mature, and if you already use it, great. I wanted something with three specific properties vulture doesn't quite give me:
- Explicit confidence tiers — not a single opaque 0–100 score, but a small labeled set you can gate on.
- A first-class allowlist format with reasons — because on a 100k-line legacy codebase, day-one CI integration is impossible if every finding is a blocker.
- CI-friendly output out of the box — GitHub Actions annotations, a sane default that won't explode.
So I built deadcode-py. Zero runtime dependencies, stdlib ast and tomllib only. This post is about what I learned about Python's static-analysis limits while writing it, and how the design decisions fall out of those limits.
GitHub: https://github.com/sen-ltd/deadcode-py
The problem: why grep is insufficient
Let's make the pain concrete. Here's a function in a legacy module:
def _parse_v1_header(raw):
    parts = raw.split(":", 1)
    return parts[0].strip(), parts[1].strip()
Is it dead? git grep _parse_v1_header shows only the definition. Seems dead. Delete it?
No — because:
# elsewhere
handler = getattr(module, f"_parse_{version}_header")
result = handler(raw)
Grep won't find that call site. The function name is synthesized at runtime. You can't delete it without breaking the dynamic dispatch.
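You can see that blindness directly with a small ast experiment of my own (not part of deadcode-py): parse the dispatch line and list every identifier that appears syntactically. The callee's name is nowhere in the tree.

```python
import ast

# The dynamic dispatch site from above: the handler name is built at runtime.
dispatch_src = 'handler = getattr(module, f"_parse_{version}_header")'

tree = ast.parse(dispatch_src)
# Every identifier that appears syntactically in the snippet.
names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}

print(sorted(names))                 # ['getattr', 'handler', 'module', 'version']
print("_parse_v1_header" in names)   # False: the callee's name exists only at runtime
```

No static tool, grep or AST-based, can recover `_parse_v1_header` from that line; the best it can do is notice the getattr and get more cautious.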
The opposite pitfall exists too. Here's another function:
def cleanup():
    ...
git grep cleanup shows thirty hits. Most of them are unrelated methods called cleanup on completely different classes. Grep has no idea that self.cleanup() on Foo and other.cleanup() on Bar are different symbols. So you get a dozen false positives and give up.
Manual dead-code hunting in Python is this dance between under- and over-approximation, and it doesn't scale. Static analysis can do better — but not all the way to "perfect", and that partial-correctness is the interesting part.
Design decision 1: AST, not regex
The first thing deadcode-py does right that grep does not: it uses ast.parse and walks the tree, so it understands the difference between a definition and a use, and it understands scope.
import ast

def analyze_source(source: str, file: str) -> FileAnalysis:
    tree = ast.parse(source, filename=file)
    analysis = FileAnalysis(file=file)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            analysis.symbols.append(Symbol(
                name=node.name, kind="function",
                lineno=node.lineno, file=file,
                decorators=_decorator_names(node.decorator_list),
                is_dunder=_is_dunder(node.name),
            ))
        elif isinstance(node, ast.ClassDef):
            analysis.symbols.append(Symbol(name=node.name, kind="class", ...))
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    analysis.symbols.append(Symbol(
                        name=child.name, kind="method",
                        parent=node.name, ...
                    ))
Top-level FunctionDef is a function. A FunctionDef inside a ClassDef is a method — and crucially, we keep the parent class name, because C.foo and D.foo are different. We also record decorators, because @app.route("/login") marks login as used even if nothing else ever calls it.
References are the second pass:
for node in ast.walk(tree):
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
        analysis.references.add(node.id)
    elif isinstance(node, ast.Attribute):
        analysis.references.add(node.attr)
    elif isinstance(node, ast.ImportFrom):
        for alias in node.names:
            analysis.references.add(alias.asname or alias.name)
The key insight: ast.walk gives us every reference in the file, and the Load context filter distinguishes x = 1 (definition) from print(x) (use). For attributes, we record the attribute name — obj.foo contributes foo to the reference set. This is a deliberate over-approximation: we don't know which foo got hit, so we treat any foo as "used". That's how we avoid false-positive explosions on methods.
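Here's the Load-context filter as a self-contained miniature (my own illustration, mirroring the reference pass above): the assignment to x does not count as a reference, the read of x does, and the attribute call contributes its name.

```python
import ast

src = """
x = 1          # Store context: a definition, not a use
print(x)       # Load context: a genuine use
obj.foo()      # Attribute: contributes "foo" to the reference set
"""

refs = set()
for node in ast.walk(ast.parse(src)):
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
        refs.add(node.id)
    elif isinstance(node, ast.Attribute):
        refs.add(node.attr)

print(sorted(refs))  # ['foo', 'obj', 'print', 'x']
```

Note that the `x = 1` target never enters the set: its ctx is ast.Store, so a symbol that is only ever assigned still counts as unreferenced.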
Try doing any of this with regex. You'll burn an afternoon and it'll still be wrong.
Design decision 2: confidence tiers are a concession to undecidability
Here's the uncomfortable truth: static dead-code detection in Python is undecidable in the general case. Not "hard", not "tricky" — undecidable. The reasons:
- getattr(obj, name) can call any method if name comes from a runtime string. We can't know name statically.
- Duck typing means obj.process() resolves to whichever process method exists on whichever class obj turned out to be. There's no type information.
- Import-by-string: Django's INSTALLED_APPS, Celery task routing, pytest plugin discovery, entry points in setup.cfg — all bring names into reachability without syntactic import statements.
- Metaclasses and class decorators can rewrite class bodies at runtime.
Any tool that tells you "this function is definitely dead" is lying to you in some subset of cases. The question isn't "can we be perfect?" (we can't), it's "how do we present our findings so humans can act on them safely?"
deadcode-py uses three tiers:
def _classify_symbol(sym, analysis, global_refs, exported, used_decorators):
    if sym.is_dunder: return None
    if sym.name in exported: return None
    if sym.name in global_refs: return None
    if _is_pytest_test(sym, analysis): return None
    for dec in sym.decorators:
        if dec in used_decorators:
            return None

    is_private = sym.name.startswith("_")
    if sym.kind == "function":
        return "high" if is_private else "medium"
    if sym.kind == "method":
        # Methods are fundamentally undecidable statically.
        return "low"
    if sym.kind == "class":
        return "high" if is_private else "medium"
    if sym.kind == "var":
        return "low" if is_private else None
The logic:
- high: module-private function (name starts with _), never referenced, not in __all__, not a dunder, not a route handler, not a test. This one's safe enough to fail CI on. If you have a private helper that nothing in the whole scanned set references, it really is dead 99% of the time. The 1% is getattr, and that's what the allowlist is for.
- medium: public function, unreferenced in the scanned set. Might be used by an external consumer — a library user, a plugin, a CLI entry point. Don't auto-break CI on this tier for libraries. For applications where you control all the code, it's actionable.
- low: methods. We literally cannot tell statically whether SomeClass.foo is reachable, because obj.foo() anywhere in the codebase could hit it. We flag it only so a human can look.
The naming is deliberate. "high" confidence doesn't mean "certainly dead" — it means "high confidence that this is worth looking at, and low risk of false positive". Confidence is about your action, not about ground truth.
Design decision 3: heuristics for framework conventions
Most of the "false positives" a naive analyzer produces are symbols that Python's dynamism marks as used without a syntactic call site. The trick is knowing the common patterns:
DEFAULT_USED_DECORATORS = frozenset({
    "app.route", "bp.route",                              # Flask
    "app.get", "app.post", "router.get", "router.post",   # FastAPI
    "receiver", "register.filter", "register.simple_tag", # Django
    "pytest.fixture", "fixture",                          # pytest
    "cli.command", "click.command",                       # click
    "app.task", "celery.task",                            # celery
})
If a function is decorated with any of these, we treat it as used. This is a heuristic, not sound — someone could define their own route decorator that doesn't mark the function as reachable — but it's the right heuristic for the 90% case, and the --used-decorator flag lets projects extend it.
Similarly, pytest tests are discovered by name convention:
def _is_pytest_test(sym, analysis):
    if sym.kind != "function": return False
    if not sym.name.startswith("test_"): return False
    basename = analysis.file.rsplit("/", 1)[-1]
    return basename.startswith("test_") or basename.endswith("_test.py")
A function called test_foo inside a file called test_bar.py is never flagged, because pytest will discover and run it. test_foo inside src/utils.py is still fair game — that's a name collision, not a test.
And __all__ is respected as module export policy:
if len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
    if node.targets[0].id == "__all__":
        analysis.exported |= _literal_strings(node.value)
Names listed in __all__ are the module's explicit public surface. We trust that declaration.
Design decision 4: the allowlist is a first-class concern
Here's the failure mode I most wanted to avoid: team runs deadcode-py on their legacy codebase, gets 800 findings, half are false positives due to reflection-based plugin discovery, CI is red forever, tool gets abandoned. This is the story of every half-landed static-analysis adoption.
The fix is to make "suppress this specific finding, with a reason" a first-class operation, and to make bootstrapping easy:
# Day one: bootstrap an allowlist of everything currently reported.
deadcode-py --allowlist-emit src/ > deadcode.toml
# Commit it. CI passes. Nothing is broken.
git add deadcode.toml
The generated file looks like this:
# deadcode.toml — generated by deadcode-py --allowlist-emit
[[allow]]
name = "_build_old_payload"
file = "src/utils/legacy.py"
reason = "TODO: high-confidence, kind=function"
[[allow]]
name = "UserRequest.format_v2"
file = "src/handlers.py"
reason = "TODO: high-confidence, kind=method"
Every entry has a reason field pre-populated with a TODO marker. The deal with the team is: over time, each TODO gets either a real reason ("called by external SDK consumer") or the underlying symbol gets deleted. Either way, the file shrinks. You're converting dead code from "invisible" into "tracked tech debt".
Default --min-confidence high matters here too: on day one, deadcode-py only reports the obviously-dead private helpers. That's a manageable number to review even without the allowlist escape hatch.
The tradeoffs I can't fix
Honest disclosure time. Here's what deadcode-py will miss:
- Dynamic attribute access: getattr(module, name) where name is a runtime string. If your plugin loader works this way, you'll get false positives unless you allowlist.
- Magic methods defined on new classes you register with a framework: e.g. a Django Field subclass's from_db_value method. It's a method, so we'd only report at "low" anyway, but the symbol is never directly called.
- Plugin entry points: declared in pyproject.toml or setup.cfg, imported by string at runtime. Invisible to AST analysis.
- Reflection-heavy DI containers: anything that uses type hints to decide what to call.
- Conditional definitions: if sys.version_info >= (3, 12): def modern_impl(): .... We see the definition; we might not see the call.
- Methods called via super() in subclasses you don't scan: if base.py defines Base.foo and subclasses live in a different package, cross-package resolution is fragile.
And here's what it will over-report:
- Library public API: any public function that constitutes your library's surface area and isn't called internally. That's why public functions are medium, not high. The correct response for library authors is to use __all__ or a rigorous __init__.py re-export pattern — both of which deadcode-py understands.
- Methods used via duck typing: if things is a list of objects that all have .serialize(), and the only call site is [t.serialize() for t in things], then serialize on every one of those classes counts as live, because the attribute name serialize appears in the reference set. We get this right by construction — and the flip side is that a genuinely dead serialize on some unrelated class survives too. This is the over-approximation earning its keep.
None of this is fixable without type inference, and full type inference in Python is… another undecidable problem. If you want that level of rigor, use pyright or mypy with --strict and let their unused-function diagnostics do some of this work. deadcode-py occupies a different niche: lightweight, dependency-free, easy to adopt, honest about its limits.
Method resolution: why I couldn't do better
The method case deserves one more paragraph because it's the most interesting undecidability.
Consider:
class Handler:
    def process(self, req):
        return req.data

class AltHandler:
    def process(self, req):
        return req.data * 2

def dispatch(handler, req):
    return handler.process(req)
Is AltHandler.process dead? Syntactically, it's defined. The only call site is handler.process(req) in dispatch — but statically, we don't know whether handler is a Handler or an AltHandler. Without type inference we can't resolve handler.process to a concrete class, so we're forced into an all-or-nothing choice: treat both methods as reachable, or neither.
deadcode-py chooses over-approximation: the attribute name process appears in the reference set, so both methods are "referenced". Neither is flagged. This is almost always the right call, because the alternative (flagging both as dead) would be useless noise.
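That choice is visible in a runnable miniature (my own illustration, reusing the reference pass from earlier): the single attribute name process enters the reference set once, which is enough to keep both definitions alive.

```python
import ast

src = """
class Handler:
    def process(self, req):
        return req.data

class AltHandler:
    def process(self, req):
        return req.data * 2

def dispatch(handler, req):
    return handler.process(req)
"""

# Attribute names anywhere in the file, exactly as the reference pass collects them.
refs = {n.attr for n in ast.walk(ast.parse(src)) if isinstance(n, ast.Attribute)}

print("process" in refs)  # True: one dynamic call site keeps BOTH methods off the dead list
```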
The cost: a truly dead method on a class nobody instantiates will survive. You'll only notice via the "low" tier, which is suggestive hints, not a CI gate. This is the concession.
Try it in 30 seconds
docker run --rm -v "$PWD":/work ghcr.io/sen-ltd/deadcode-py:latest src/
# JSON for tooling
docker run --rm -v "$PWD":/work ghcr.io/sen-ltd/deadcode-py:latest --format json src/
# GitHub Actions annotations
docker run --rm -v "$PWD":/work ghcr.io/sen-ltd/deadcode-py:latest --format github src/
# Bootstrap a legacy allowlist
docker run --rm -v "$PWD":/work ghcr.io/sen-ltd/deadcode-py:latest --allowlist-emit src/ > deadcode.toml
Or pip install deadcode-py if you prefer. No runtime dependencies. About 1,000 lines of Python including tests.
Takeaways
- Static analysis in Python has hard limits, but "hard limits" is not the same as "useless". You can catch a large fraction of real dead code with AST analysis plus a handful of framework heuristics.
- Confidence tiers are an honest UX for any static analyzer whose underlying problem is undecidable. Don't hide uncertainty behind a single number — label it so the user knows which findings to trust.
- First-class allowlists with reasons turn dead code from an invisible liability into tracked tech debt. This is the difference between a tool that gets adopted and a tool that gets disabled.
- CI-friendly defaults matter more than thoroughness. --min-confidence high by default means day-one adoption doesn't fail.
- Write your fixtures as Python strings and ast.parse them in tests. It makes the analyzer trivially testable. 62 tests, no filesystem for most of them.
The whole tool is about 1,000 lines split across scanner.py, analyzer.py, classifier.py, allowlist.py, formatters.py, and cli.py. Each module is independently testable because the boundaries follow the data flow: files → sources → AST → symbol graph → classification → output. When an analyzer is this cleanly layered, adding heuristics is cheap and regressions are obvious.
Go dig in your own repos. I bet you find at least one _helper from 2021.
