DEV Community: Nick-111

I Built a Machine-Verifiable Contract System for Python Code — Here's How It Works

Nick-111 — Sat, 20 Jun 2026 12:28:57 +0000

Last week I wrote about building a governance layer for an autonomous AI system. One module got more questions than anything else: ContractVerifier.

People asked: "How do you automatically verify that code matches its documentation?" and "Does this actually catch bugs?"

Here's the deep dive.

The Problem
I have 1,240 Python files. When I change a function signature in one module, at least three things can break silently:

The INTERFACE.md for that module is now wrong
Every caller that uses the old signature is now broken (but Python won't tell you until runtime)
Any test that mocks the old function name now tests nothing
In a team, code review catches some of this. In a solo autonomous system with 380K lines, there is no code review. There's only the CI pipeline.

I needed the CI pipeline to understand Python code.

The Core Idea: AST → Contract → Verification
ContractVerifier has three stages:

Stage 1: Extract the truth. Parse the actual Python source with ast, walk every node, and extract every public function, class, method, and async function — with their full signatures, return types, docstrings, and raised exceptions.

Stage 2: Generate the contract. Write an INTERFACE.md that formally lists every public API entry, grouped by kind, with machine-parseable structure.

Stage 3: Verify the contract. Compare the generated INTERFACE.md against the actual code. Any mismatch — a function in code but not in docs, or in docs but not in code — is a violation.

The key word is machine-parseable. This is not documentation for humans. It's a verifiable behavior specification that a script can check.

Stage 1: Walking the AST
Here's what the AST visitor extracts from a single Python file:

# From hermes/strategies/grid_strategy.py

class GridStrategy:
    def place_orders(self, mid_price: float, levels: int = 5) -> list[Order]:
        """Place grid orders around a mid price."""
        ...

async def run_grid_loop(symbol: str, interval: int = 60) -> None:
    """Main grid trading loop."""
    ...
The visitor produces:

GridStrategy kind=class signature=class GridStrategy
GridStrategy.place_orders kind=method signature=place_orders(mid_price: float, levels: int = 5) -> list[Order]
run_grid_loop kind=async_function signature=run_grid_loop(symbol: str, interval: int = 60) -> None
It handles relative imports (resolving from . import X to full dotted paths), nested classes, async functions, and exception tracking. It knows when it's inside an __init__.py and adjusts the module name accordingly.

The visitor is ~140 lines of AST walking code. No regex, no string matching — it uses the same parser that Python itself uses.
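To make the extraction step concrete, here is a minimal sketch of such a visitor. It is not the ARES code: the names ApiVisitor and format_signature are hypothetical, "public" is assumed to mean "no leading underscore", and only positional parameters are rendered (no *args, **kwargs, or keyword-only arguments). It needs Python 3.9+ for ast.unparse.

```python
import ast


def format_signature(node, in_class):
    """Render name(args) -> return for a function node (positional args only)."""
    positional = node.args.posonlyargs + node.args.args
    # Defaults align with the *last* positional parameters.
    defaults = [None] * (len(positional) - len(node.args.defaults)) + list(node.args.defaults)
    parts = []
    for arg, default in zip(positional, defaults):
        text = arg.arg
        if arg.annotation is not None:
            text += f": {ast.unparse(arg.annotation)}"
        if default is not None:
            text += f" = {ast.unparse(default)}"
        parts.append(text)
    if in_class and parts and positional[0].arg in ("self", "cls"):
        parts = parts[1:]  # hide the implicit receiver, as in the sample output
    signature = f"{node.name}({', '.join(parts)})"
    if node.returns is not None:
        signature += f" -> {ast.unparse(node.returns)}"
    return signature


class ApiVisitor(ast.NodeVisitor):
    """Hypothetical visitor: collects (qualified_name, kind, signature) tuples."""

    def __init__(self):
        self.entries = []
        self._classes = []  # names of the enclosing classes

    def visit_ClassDef(self, node):
        if node.name.startswith("_"):
            return  # assumption: a leading underscore means private
        qualified = ".".join(self._classes + [node.name])
        self.entries.append((qualified, "class", f"class {node.name}"))
        self._classes.append(node.name)
        self.generic_visit(node)  # descend to find methods and nested classes
        self._classes.pop()

    def _record(self, node, kind):
        if node.name.startswith("_"):
            return
        in_class = bool(self._classes)
        qualified = ".".join(self._classes + [node.name])
        self.entries.append(
            (qualified, "method" if in_class else kind, format_signature(node, in_class))
        )
        # No generic_visit here: functions nested inside functions are not public API.

    def visit_FunctionDef(self, node):
        self._record(node, "function")

    def visit_AsyncFunctionDef(self, node):
        self._record(node, "async_function")


SOURCE = '''
class GridStrategy:
    def place_orders(self, mid_price: float, levels: int = 5) -> list[Order]:
        """Place grid orders around a mid price."""
        ...

async def run_grid_loop(symbol: str, interval: int = 60) -> None:
    """Main grid trading loop."""
    ...
'''

visitor = ApiVisitor()
visitor.visit(ast.parse(SOURCE))
for name, kind, signature in visitor.entries:
    print(name, f"kind={kind}", f"signature={signature}")
```

Because ast.parse never executes or imports the module, the unresolved name Order in the annotation is harmless here.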

Stage 2: Generating the Contract
The contract generator takes the extracted API entries and produces an INTERFACE.md:

INTERFACE.md — hermes.strategies.grid_strategy

Auto-generated by ARES ContractGenerator at 2026-06-20 10:15:00

Module Info

Path: hermes/strategies/grid_strategy.py
Lines: 312
Public APIs: 8

Public API

Classes

`class GridStrategy`

Grid trading strategy with configurable level count and spacing.

Line: 24

Methods

`GridStrategy.place_orders(mid_price: float, levels: int = 5) -> list[Order]`

Place grid orders around a mid price.

Returns: list[Order]

Line: 45

Async Functions

`run_grid_loop(symbol: str, interval: int = 60) -> None`

Main grid trading loop.

Line: 287

Dependencies

hermes.exchanges.adapter
shared.models

Imported By

hermes.live_runner
scripts.run_grid

Two output formats: per-module (one INTERFACE.md per Python file) and consolidated (one INTERFACE.md summarizing an entire package). The consolidated format puts N modules | M public APIs | K lines in its header so the parser can detect the format automatically.
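Format detection from that header line could look roughly like this. The exact header wording, the regex, and the function name detect_format are assumptions for illustration; the real parser may differ.

```python
import re

# Assumed header shape: "<N> modules | <M> public APIs | <K> lines"
CONSOLIDATED_HEADER = re.compile(
    r"(\d+)\s+modules\s*\|\s*(\d+)\s+public APIs\s*\|\s*(\d+)\s+lines"
)


def detect_format(text, header_lines=10):
    """Return a dict describing which INTERFACE.md format the text appears to use."""
    # Only the first few lines are scanned: the marker is expected in the header.
    for line in text.splitlines()[:header_lines]:
        match = CONSOLIDATED_HEADER.search(line)
        if match:
            modules, apis, lines = map(int, match.groups())
            return {
                "format": "consolidated",
                "modules": modules,
                "public_apis": apis,
                "lines": lines,
            }
    return {"format": "per-module"}


# Illustrative inputs, not real ARES output.
consolidated = detect_format("INTERFACE.md\n12 modules | 87 public APIs | 4210 lines\n")
per_module = detect_format("Path: hermes/strategies/grid_strategy.py\nLines: 312\nPublic APIs: 8\n")
print(consolidated["format"], per_module["format"])
```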

Stage 3: Verification
This is where it gets useful. ContractVerifier.verify() compares the INTERFACE.md against the actual code:

verifier = ContractVerifier()
report = verifier.verify(
    "hermes/strategies/grid_strategy.py",
    "hermes/strategies/INTERFACE.md"
)
It returns:

{
    "consistent": false,
    "module": "hermes/strategies/grid_strategy.py",
    "code_api_count": 8,
    "doc_api_count": 7,
    "missing_in_doc": ["GridStrategy.cancel_all_orders"],
    "missing_in_code": ["GridStrategy.old_legacy_method"]
}
Two violations:

missing_in_doc: cancel_all_orders exists in the code but nobody documented it. This blocks the build — you either document it or you explain why it shouldn't be public.

missing_in_code: old_legacy_method is in the INTERFACE.md but the code removed it. The contract is stale. This also blocks the build — regenerate the contract.

batch_verify() does this for every INTERFACE.md in a directory tree, handling both single-module and consolidated formats automatically.
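At its core, the comparison is a set difference in both directions. The sketch below assumes each contract entry sits on its own line wrapped in backticks, as in the sample INTERFACE.md; documented_names and compare are hypothetical helper names, and the sample data is illustrative rather than taken from the real module.

```python
import re

# Matches `class Name` or `dotted.name(...)` at the start of a line.
ENTRY = re.compile(r"^`(?:class\s+)?([A-Za-z_][\w.]*)")


def documented_names(markdown):
    """Collect the API names that an INTERFACE.md declares."""
    names = set()
    for line in markdown.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


def compare(code_names, doc_names, module):
    """Build a report shaped like the one shown above."""
    code_names, doc_names = set(code_names), set(doc_names)
    return {
        "consistent": code_names == doc_names,
        "module": module,
        "code_api_count": len(code_names),
        "doc_api_count": len(doc_names),
        "missing_in_doc": sorted(code_names - doc_names),
        "missing_in_code": sorted(doc_names - code_names),
    }


# Illustrative contract text and extracted names (not the real module contents).
DOC = """\
`class GridStrategy`
`GridStrategy.place_orders(mid_price: float, levels: int = 5) -> list[Order]`
`GridStrategy.old_legacy_method() -> None`
"""
code_names = {"GridStrategy", "GridStrategy.place_orders", "GridStrategy.cancel_all_orders"}

report = compare(code_names, documented_names(DOC), "hermes/strategies/grid_strategy.py")
print(report["missing_in_doc"], report["missing_in_code"])
```

This version compares names only. Catching a changed signature on an existing function would mean comparing (name, signature) pairs instead.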

The CI Gate That Makes It Real
Generating contracts is easy. Enforcing them is what matters.

CiEnforcer has three gates. Gate 2 is the contract gate:

def _gate_contract(self, modified, deleted):
    # Excerpt: resolving source_file and interface_md from mod_path,
    # and initializing violations, are elided here.
    for mod_path in modified:
        py_mtime = source_file.stat().st_mtime
        md_mtime = interface_md.stat().st_mtime
        if py_mtime > md_mtime:
            # Source was edited after the contract was generated.
            # Contract is stale. BLOCK.
            violations.append(...)
The check is timestamp-based, not content-based. This is deliberate: comparing two modification times is a constant-time check per file and happens before any commit lands. Full AST verification (Stage 3) runs in the nightly audit. The timestamp gate catches 95% of contract drift at commit time with negligible overhead.
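A self-contained version of the timestamp gate might look like the sketch below. It assumes the contract is a file named INTERFACE.md in the same directory as the source file; the function name stale_contracts and the violation tuples are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def stale_contracts(modified, contract_name="INTERFACE.md"):
    """Return (path, reason) pairs for sources newer than their contract."""
    violations = []
    for mod_path in modified:
        source_file = Path(mod_path)
        interface_md = source_file.parent / contract_name  # assumed layout
        if not source_file.exists():
            continue  # deleted or moved files are not handled by this sketch
        if not interface_md.exists():
            violations.append((mod_path, "missing contract"))
        elif source_file.stat().st_mtime > interface_md.stat().st_mtime:
            violations.append((mod_path, "stale contract"))
    return violations


# Demo with explicit modification times so the outcome is deterministic.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "grid_strategy.py"
    contract = Path(tmp) / "INTERFACE.md"
    source.write_text("x = 1\n")
    contract.write_text("contract\n")
    os.utime(contract, (1_000, 1_000))  # contract generated first
    os.utime(source, (2_000, 2_000))    # source edited afterwards
    stale = stale_contracts([str(source)])
    os.utime(contract, (3_000, 3_000))  # contract regenerated
    fresh = stale_contracts([str(source)])

print(stale, fresh)
```

One design consideration: Git does not store modification times, so a check like this is only meaningful in the working tree where the edits happened, which fits a pre-commit hook.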

The complete enforcement flow:

git commit
  → pre-commit hook
    → CiEnforcer.enforce(modified=["hermes/strategies/grid_strategy.py"])
      → Gate: dead code → PASS (not a new module)
      → Gate: contract → BLOCK (source modified, INTERFACE.md stale)
      → Gate: integration → PASS

Commit rejected. Run:
  python scripts/_ares_gen_contracts.py
  git add hermes/strategies/INTERFACE.md
  git commit
The Bonus: Automatic Test Generation
Once you have formal contracts, you can generate tests from them. ContractTestGenerator produces two types:

Provider stub tests — verify the module's public API is importable and has the expected attributes:

def test_GridStrategy_place_orders_stub():
    from hermes.strategies.grid_strategy import GridStrategy
    assert hasattr(GridStrategy, 'place_orders')
Consumer mock tests — scaffold for verifying that callers use the API correctly:

def test_GridStrategy_place_orders_consumer_mock():
    from unittest.mock import patch
    with patch("hermes.strategies.grid_strategy.GridStrategy.place_orders") as mock_fn:
        # Replace with actual consumer import
        pass
These are scaffolds, not complete tests. They ensure every contract entry has a corresponding test file. Fill in the actual assertions as you use each API. If you delete a function, the stub test fails because hasattr returns False. If you rename a method, the mock target breaks.

What This Doesn't Solve
ContractVerifier tells you whether the code matches the documented interface. It does NOT tell you:

Whether the interface is well-designed. A function that takes 15 parameters can be perfectly documented and still terrible.

Whether the behavior is correct. def withdraw(amount: float) -> bool can return False for every call and still be contract-compliant. The contract verifies the signature, not the semantics.

Whether the contract is complete. A module can have 100% contract compliance and still have undocumented side effects (file I/O, network calls, database writes).

These are fundamental limitations of any static analysis approach. The contract verifies form, not function.

The next layer — runtime monitoring via ActivityMonitor — catches some of what static analysis misses. If a function's contract says it returns bool but it actually raises an unhandled exception 5% of the time, the runtime monitor catches that. The contract verifier doesn't.

Connection to AI Safety
The AI safety community has a concept called behavioral specification: how do you formally specify what an AI agent should do, and then verify it does that?

ContractVerifier is a micro-scale implementation of this idea applied to Python modules. The "agent" is a Python file. The "behavioral specification" is INTERFACE.md. The "verification" is the AST comparison.

This is obviously not a solution to the full AI alignment problem. But it's a concrete, working example of the pattern: specify → verify → enforce. If you're designing governance infrastructure for multi-agent systems where each agent is a Python module (or an LLM with tool calls), this pattern scales.

The key insight I'd offer to anyone building agent governance systems: don't let the same party write the specification and perform the verification. The closed evaluation loop I wrote about last week applies here too. If the same code generates the contract and verifies it, you haven't proven anything; you've only proven that the generator and the verifier agree with each other.

The only real verification is external use. In ContractVerifier's case, that means: someone else running it on their codebase and finding violations I never anticipated.

What's Next
I'm extracting ContractVerifier as the next open-source module from ARES. Same approach as dead-scanner: zero dependencies, pure Python, works on any project.

If you're interested in the intersection of static analysis, CI enforcement, and AI governance — I'd like to hear from you. Especially if you're working on similar problems at the agent level rather than the module level.

GitHub: github.com/Nick-lll/dead-scanner
X: x.com/senlin

This is part 2 of a series on building governance infrastructure for autonomous AI systems. Part 1: Building a Governance Layer Without Knowing AI Safety Existed

Building a Governance Layer for an Autonomous AI System Without Knowing AI Safety Existed

Nick-111 — Sat, 20 Jun 2026 09:38:30 +0000

I spent one month building an autonomous multi-agent trading system alone. Six engines, sixty trading strategies, twenty-five ML models, 8,600 tests. After 380,000 lines of Python, I discovered something that changed how I think about AI systems entirely.

The strategies don't make money. But the governance layer I built to manage them might be more valuable than any trading strategy.

What I Built
ZEUS is an autonomous trading system with six engines: HERMES (execution), ATHENA (ML/quant), AEGIS (security), APOLLO (evolution), HADES (infrastructure), and ARES (governance).

ARES is the one that matters for this post. It has 12 modules designed to answer a single question: how do you know if any part of your autonomous system is broken?

The answer turned out to be surprisingly close to what the AI Safety community calls "agent governance infrastructure." I just didn't know that term when I built it.

The Problem: Solo Development at Scale
When you write 380,000 lines of code alone in one month, you hit a wall that has nothing to do with programming skill. You open a file you wrote three weeks ago and have no idea if it's still used. You find modules that import other modules that import modules that go nowhere. You discover that 40% of your "production system" is dead code that nothing touches.

This isn't a code quality problem. It's a system observability problem. In a traditional team, institutional knowledge tells you what's alive and what's dead. When you're one person and Claude, that institutional knowledge lives in your head — and your head is unreliable.

So I built tools to externalize it.

The ARES Governance Stack
Here's what 12 modules of AI system governance look like in practice:

1. Module Vitality: SilenceScanner
The most fundamental question: is this module alive?

SilenceScanner parses the entire Python AST, builds a directed import graph, and classifies every module into five categories:

truly_dead: Exported symbols exist but nothing references them. These are candidates for deletion.
fake_alive: Imported by something that is itself dead. A zombie module — looks alive, does nothing.
standalone: CLI scripts, database migrations, build tools. No imports in, but that's by design.
island: Imported by something, but missing its INTERFACE.md contract. It works, but nobody can verify it does what it claims.
structural: __init__.py files and API surfaces. The skeleton that holds everything together.
The key insight: not all unreferenced code is dead. A database migration script has zero imports because it's called by Alembic at deploy time, not by other Python modules. If you flag it as "dead," you'll break production. SilenceScanner knows the difference because it has a taxonomy of standalone patterns (CLI, migrations, build scripts, test infra, server entrypoints).
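One way to separate truly_dead from fake_alive is reachability from known entry points. The sketch below is an interpretation, not SilenceScanner's algorithm: it takes a prebuilt import graph and a set of standalone roots, covers only three of the five categories, and uses "live" as a placeholder label for everything reachable. All module names are made up.

```python
def classify(imports, standalone):
    """imports maps each module to the set of modules it imports."""
    modules = set(imports) | {dep for deps in imports.values() for dep in deps}

    # Invert the graph: who imports each module?
    importers = {module: set() for module in modules}
    for source, deps in imports.items():
        for dep in deps:
            importers[dep].add(source)

    # Everything reachable from a standalone root counts as live.
    live, stack = set(), list(standalone)
    while stack:
        module = stack.pop()
        if module in live:
            continue
        live.add(module)
        stack.extend(imports.get(module, ()))

    result = {}
    for module in sorted(modules):
        if module in standalone:
            result[module] = "standalone"
        elif module in live:
            result[module] = "live"
        elif not importers[module]:
            result[module] = "truly_dead"   # nothing references it
        else:
            result[module] = "fake_alive"   # referenced only by dead code
    return result


# Hypothetical graph: cli is an entry point, old_a and old_b are leftovers.
graph = {"cli": {"core"}, "core": set(), "old_a": {"old_b"}, "old_b": set()}
labels = classify(graph, standalone={"cli"})
print(labels)
```

Here old_b looks alive in a plain "is it imported?" check, but its only importer is itself unreachable, which is the zombie case described above.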

2. Contract Compliance: ContractVerifier
Once you know which modules are alive, the next question: do they do what they claim?

ContractVerifier reads Python AST, generates an INTERFACE.md describing every public function, class, and method with their signatures, and then verifies that the actual code matches the documented contract.

This is not documentation for humans. It's a machine-verifiable behavior contract. If you change a function signature without updating the contract, CI blocks the merge. If you add a new public method without documenting it, CI blocks the merge.

In AI Safety terms: this is a basic form of behavioral specification enforcement. The system cannot drift from its documented behavior without explicit approval.

3. Dependency Graph Analysis: IntegrationGraph
Builds a directed graph of all module dependencies. Detects:

Dead ends: modules that import nothing and are imported by nothing. True orphans.
Cycles: A imports B imports C imports A. These are maintainability time bombs.
Hubs: modules imported by 50+ others. Changes here have massive blast radius.
Bridges: modules that are the sole connection between two subsystems. Single points of failure.
This matters because in an autonomous system, dependency structure is attack surface. A cycle means a bug in one module can propagate back to itself through the loop. A hub means a single module's failure cascades to half the system.
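Two of these detections are short enough to sketch: a depth-first search that returns the first import cycle it finds, and a hub count by number of importers. The function names and the sample graph are hypothetical. The 50-importer threshold comes from the description above.

```python
from collections import Counter

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def find_cycle(imports):
    """Return one import cycle as a list of modules, or None if the graph is acyclic."""
    color, path = {}, []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in sorted(imports.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GRAY:  # back edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(imports):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def hubs(imports, threshold=50):
    """Modules imported by at least `threshold` other modules."""
    counts = Counter(dep for deps in imports.values() for dep in deps)
    return sorted(module for module, n in counts.items() if n >= threshold)


# Hypothetical graphs for illustration.
cyclic = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"a"}}
acyclic = {"a": {"b"}, "b": set()}
print(find_cycle(cyclic))   # ['a', 'b', 'c', 'a']
print(find_cycle(acyclic))  # None
print(hubs({"a": {"x"}, "b": {"x"}, "c": {"y"}}, threshold=2))  # ['x']
```

The recursive search is fine for a sketch, but a very deep import chain could hit Python's recursion limit; an explicit stack avoids that.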

4. Gate Enforcement: CiEnforcer
Three automated gates run before any commit lands:

Dead code gate: any new truly_dead module blocks the build. (Standalone modules are exempt.)
Contract gate: any public function without a contract entry blocks the build.
Integration gate: any new cycle or orphan blocks the build.
The gates are enforced by the CI pipeline — not by human code review. This is crucial because humans get tired, miss things, or make exceptions. Automated gates don't.

5. Technical Debt Tracking: DebtTracker
Zero tolerance. Any module flagged as dead, any contract violation, any integration gap gets a debt item with a timestamp and severity. The CI gate treats all debt as blocking.

This sounds extreme. In a team setting, you'd negotiate debt. In a solo autonomous system, there's nobody to negotiate with. Either the debt is real (fix it now) or the rule is wrong (change the rule). No exceptions.

6. Runtime Monitoring: ActivityMonitor
Static analysis tells you what the code says. Runtime monitoring tells you what it does.

ActivityMonitor hooks into the production event bus and counts every inter-module call. A module that appears alive in the import graph but has zero runtime calls in 24 hours is fake-alive — it exists, it's wired up, but nothing actually uses it.
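The counting side can be sketched without the event bus. In the sketch below, CallCounter is a hypothetical stand-in for ActivityMonitor: something else is assumed to call record() for every inter-module call, and timestamps are passed in explicitly so the example is deterministic.

```python
import time
from collections import defaultdict


class CallCounter:
    """Hypothetical stand-in: tracks when each module was last called."""

    def __init__(self, window_seconds=24 * 3600):
        self.window = window_seconds
        self._calls = defaultdict(list)  # callee -> [(caller, timestamp), ...]

    def record(self, caller, callee, now=None):
        timestamp = time.time() if now is None else now
        self._calls[callee].append((caller, timestamp))

    def silent_modules(self, wired_modules, now=None):
        """Modules from the import graph with no calls inside the window."""
        now = time.time() if now is None else now
        cutoff = now - self.window
        return sorted(
            module
            for module in wired_modules
            if not any(ts >= cutoff for _, ts in self._calls.get(module, ()))
        )


# athena.old_model is a made-up module name for the demo.
wired = ["hermes.strategies.grid_strategy", "athena.old_model"]
counter = CallCounter()
counter.record("hermes.live_runner", "hermes.strategies.grid_strategy", now=1_000_000)

after_one_hour = counter.silent_modules(wired, now=1_000_000 + 3600)
after_25_hours = counter.silent_modules(wired, now=1_000_000 + 25 * 3600)
print(after_one_hour, after_25_hours)
```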

7-12. Lifecycle, Self-Destruct, Negotiation, Twin, Patterns, Health
The remaining modules handle what happens after you detect a problem:

LifecycleManager: four-stage module lifecycle (active → deprecated → removed → archived). No module gets deleted without going through this pipeline.
SelfDestructManager: orchestrates the deactivation of silent modules. No Python file just gets rm -rf'd — that breaks imports. Deactivation means: log the dependency tree, verify no live callers, move to archive, regenerate contracts, verify the build still passes.
NegotiationRegistry: before a module calls another module's function, it checks the version contract. If the callee's interface has changed, the caller gets a warning before production breaks.
DigitalTwin: a sandboxed copy of the dependency graph where you can simulate removing a module and see what breaks. Like git branch for architecture changes.
PatternMiner: five anti-pattern detection engines — circular dependency, god module (too many imports), shotgun surgery (too many callers), feature envy (importing across unrelated subsystems), and unstable dependency (depending on a module that changes frequently).
HealthScorer: 0-100 score per module based on four weighted factors: references (30%), contract compliance (20%), test coverage (25%), runtime call count (25%).
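The HealthScorer weighting reduces to a weighted sum. The sketch below assumes each factor has already been normalized to the range 0 to 1; how ARES normalizes reference counts or runtime call counts is not described here, so that step is left out. The function and key names are hypothetical.

```python
# Weights from the description above; they sum to 100.
WEIGHTS = {
    "references": 30,
    "contract_compliance": 20,
    "test_coverage": 25,
    "runtime_calls": 25,
}


def health_score(factors):
    """Combine normalized factors (each 0.0 to 1.0) into a 0-100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(factors.get(name, 0.0), 0.0), 1.0)  # clamp; missing counts as 0
        total += weight * value
    return total


perfect = health_score({name: 1.0 for name in WEIGHTS})
untested_and_idle = health_score(
    {"references": 1.0, "contract_compliance": 1.0, "test_coverage": 0.5, "runtime_calls": 0.0}
)
print(perfect, untested_and_idle)  # 100.0 62.5
```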
What I Got Wrong
Here's the honest part.

For months, ARES reported: zero dead modules, 100% contract compliance, health score 100, all six engines rated 3/3.

This was a lie. Not an intentional lie — a structural lie.

When the same person designs the scoring criteria, writes the scoring tools, runs the scoring process, and interprets the results, the output is guaranteed to look good. I had built a closed evaluation loop — a system that validates itself against standards it defined for itself.

This is not unique to my project. Every AI system that evaluates its own safety without external reference points has this problem. If you define "safe" as "passes my safety tests," and you wrote the safety tests, you haven't proven safety — you've proven that your tests match your assumptions.

The only way to break the loop is to expose the system to an evaluation framework you don't control.

For a trading system, that's the market. Sharpe ratios don't care about your self-assessment. For an AI safety tool, that's external users. GitHub issues, bug reports, people using your tool in ways you didn't anticipate.

What I'm Doing About It
I extracted the first module — SilenceScanner — as a standalone open-source tool:

pip install dead-scanner
dead-scanner /path/to/your/project
GitHub: github.com/Nick-lll/dead-scanner

It's MIT licensed, zero dependencies, pure Python. It works on any Python project. It's not the most sophisticated module in ARES — ContractVerifier and PatternMiner are deeper — but it's the one that's easiest to verify independently. Download it, run it, and in 30 seconds you know whether it's useful.

I expect to be wrong about a lot of things. Maybe dead module detection isn't as valuable as I think. Maybe the classification taxonomy needs adjustment. Maybe the whole approach of static analysis + graph theory is the wrong frame.

That's exactly why I'm doing this. I need to be wrong in public, with data, so I can become less wrong. The alternative — being wrong in private while the system tells me everything is 100% — is far worse.

Connection to AI Safety
I didn't build ARES because I read AI safety papers. I built it because managing a 380,000-line autonomous system alone forced me to solve problems that the AI safety community has been thinking about for years:

Behavioral specification: how do you verify an agent does what it claims?
Containment: how do you safely decommission an agent without breaking the system?
Observability: how do you know if any part of the system has silently stopped working?
Dependency integrity: how do you prevent a failure in one agent from cascading to others?
These are not theoretical problems when you have 60 strategies running autonomous trading loops. A "hallucination" in this context isn't a weird chatbot response — it's a trade at the wrong size, or a position held through a stop-loss because the risk module was silently deactivated.

ARES is not a solution to AI safety. It's a set of concrete engineering patterns that address a subset of AI governance problems at the module/agent level. I'm sharing them because I suspect other people building autonomous agent systems are hitting similar walls, and the patterns might be useful even if the implementation is imperfect.

What I'm Looking For
I'm not selling anything. I'm not raising money. I'm looking for:

Technical feedback: what am I wrong about? What would you do differently?
Related work: are there papers, projects, or people I should be reading?
Collaboration: if you're working on AI agent governance, observability, or safety infrastructure, I want to talk.
If you're at Anthropic, DeepMind, or any team working on making AI systems more governable — I'd especially like to hear from you. I built this in isolation. I know there's a lot I'm missing. I want to learn from people who've been thinking about these problems longer than I have.

I built a trading system that doesn't make money. But the governance layer I built to manage it — contract compliance, lifecycle management, dead code detection, dependency graph analysis, pattern mining, digital twin simulation — might actually be useful to people building autonomous AI systems. I open-sourced the first module. If this resonates, reach out: github.com/Nick-lll