I Built a Machine-Verifiable Contract System for Python Code — Here's How It Works
Last week I wrote about building a governance layer for an autonomous AI system. One module got more questions than anything else: ContractVerifier.
People asked: "How do you automatically verify that code matches its documentation?" and "Does this actually catch bugs?"
Here's the deep dive.
The Problem
I have 1,240 Python files. When I change a function signature in one module, at least three things can break silently:
The INTERFACE.md for that module is now wrong
Every caller that uses the old signature is now broken (but Python won't tell you until runtime)
Any test that mocks the old function name now tests nothing
In a team, code review catches some of this. In a solo autonomous system with 380K lines, there is no code review. There's only the CI pipeline.
I needed the CI pipeline to understand Python code.
The Core Idea: AST → Contract → Verification
ContractVerifier has three stages:
Stage 1: Extract the truth. Parse the actual Python source with ast, walk every node, and extract every public function, class, method, and async function — with their full signatures, return types, docstrings, and raised exceptions.
Stage 2: Generate the contract. Write an INTERFACE.md that formally lists every public API entry, grouped by kind, with machine-parseable structure.
Stage 3: Verify the contract. Compare the generated INTERFACE.md against the actual code. Any mismatch — a function in code but not in docs, or in docs but not in code — is a violation.
The key word is machine-parseable. This is not documentation for humans. It's a verifiable behavior specification that a script can check.
Stage 1: Walking the AST
Here's what the AST visitor extracts from a single Python file:
# From hermes/strategies/grid_strategy.py
class GridStrategy:
    def place_orders(self, mid_price: float, levels: int = 5) -> list[Order]:
        """Place grid orders around a mid price."""
        ...

async def run_grid_loop(symbol: str, interval: int = 60) -> None:
    """Main grid trading loop."""
    ...
The visitor produces:
GridStrategy kind=class signature=class GridStrategy
GridStrategy.place_orders kind=method signature=place_orders(mid_price: float, levels: int = 5) -> list[Order]
run_grid_loop kind=async_function signature=run_grid_loop(symbol: str, interval: int = 60) -> None
It handles relative imports (resolving from . import X to full dotted paths), nested classes, async functions, and exception tracking. It knows when it's inside an __init__.py and adjusts the module name accordingly.
The visitor is ~140 lines of AST walking code. No regex, no string matching — it uses the same parser that Python itself uses.
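To make the extraction step concrete, here is a minimal sketch of such a visitor. It is not the ARES implementation: the names PublicApiVisitor, format_signature, and extract_public_api are hypothetical, it only handles plain positional parameters, it treats every function inside a class as a method, and it drops the first parameter of methods on the assumption that it is self or cls. It needs Python 3.9+ for ast.unparse.

```python
import ast


def format_signature(node, drop_first):
    """Render 'name(params) -> return' for a (possibly async) function node."""
    args = node.args.args
    # Defaults align with the last len(defaults) positional parameters.
    defaults = [None] * (len(args) - len(node.args.defaults)) + list(node.args.defaults)
    parts = []
    for arg, default in zip(args, defaults):
        text = arg.arg
        if arg.annotation is not None:
            text += f": {ast.unparse(arg.annotation)}"
        if default is not None:
            text += f" = {ast.unparse(default)}"
        parts.append(text)
    if drop_first:
        parts = parts[1:]  # assumption: first parameter of a method is self/cls
    signature = f"{node.name}({', '.join(parts)})"
    if node.returns is not None:
        signature += f" -> {ast.unparse(node.returns)}"
    return signature


class PublicApiVisitor(ast.NodeVisitor):
    """Collect (qualified_name, kind, signature) for public definitions."""

    def __init__(self):
        self.entries = []
        self._class_stack = []

    def visit_ClassDef(self, node):
        if node.name.startswith("_"):
            return  # private class: skip it and everything inside it
        qualname = ".".join(self._class_stack + [node.name])
        self.entries.append((qualname, "class", f"class {node.name}"))
        self._class_stack.append(node.name)
        self.generic_visit(node)  # descend into methods and nested classes
        self._class_stack.pop()

    def _record_function(self, node, is_async):
        if node.name.startswith("_"):
            return
        in_class = bool(self._class_stack)
        if in_class:
            kind = "method"
        else:
            kind = "async_function" if is_async else "function"
        qualname = ".".join(self._class_stack + [node.name])
        self.entries.append((qualname, kind, format_signature(node, drop_first=in_class)))
        # No generic_visit here: functions nested in a body are not public API.

    def visit_FunctionDef(self, node):
        self._record_function(node, is_async=False)

    def visit_AsyncFunctionDef(self, node):
        self._record_function(node, is_async=True)


def extract_public_api(source):
    visitor = PublicApiVisitor()
    visitor.visit(ast.parse(source))
    return visitor.entries


SAMPLE = '''
class GridStrategy:
    def place_orders(self, mid_price: float, levels: int = 5) -> list[Order]:
        """Place grid orders around a mid price."""
        ...

async def run_grid_loop(symbol: str, interval: int = 60) -> None:
    """Main grid trading loop."""
    ...
'''

for name, kind, signature in extract_public_api(SAMPLE):
    print(f"{name} kind={kind} signature={signature}")
```

Parsing never imports or executes the module, so the undefined name Order in the sample is not a problem.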
Stage 2: Generating the Contract
The contract generator takes the extracted API entries and produces an INTERFACE.md:
INTERFACE.md — hermes.strategies.grid_strategy
Auto-generated by ARES ContractGenerator at 2026-06-20 10:15:00
Module Info
- Path: hermes/strategies/grid_strategy.py
- Lines: 312
- Public APIs: 8
Public API
Classes
class GridStrategy
Grid trading strategy with configurable level count and spacing.
- Line: 24
Methods
GridStrategy.place_orders(mid_price: float, levels: int = 5) -> list[Order]
Place grid orders around a mid price.
- Returns: list[Order]
- Line: 45
Async Functions
run_grid_loop(symbol: str, interval: int = 60) -> None
Main grid trading loop.
- Line: 287
Dependencies
- hermes.exchanges.adapter
- shared.models
Imported By
- hermes.live_runner
- scripts.run_grid
Two output formats are supported: per-module (one INTERFACE.md per Python file) and consolidated (one INTERFACE.md summarizing an entire package). The consolidated format carries an N modules | M public APIs | K lines summary in its header, which lets the parser detect the format automatically.
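A rough sketch of the per-module rendering step might look like the following. The ApiEntry dataclass, the render_contract function, the section order, and the exact Markdown heading levels are all assumptions made for illustration; the real generator's layout may differ.

```python
from dataclasses import dataclass


@dataclass
class ApiEntry:
    name: str
    kind: str        # "class", "method", "function", or "async_function"
    signature: str
    doc: str
    line: int


# Hypothetical mapping from entry kind to section title, in output order.
SECTION_TITLES = {
    "class": "Classes",
    "method": "Methods",
    "function": "Functions",
    "async_function": "Async Functions",
}


def render_contract(module: str, path: str, entries: list[ApiEntry]) -> str:
    """Render a per-module contract as Markdown text, grouped by kind."""
    lines = [
        f"# INTERFACE.md: {module}",
        "",
        "## Module Info",
        f"- Path: {path}",
        f"- Public APIs: {len(entries)}",
        "",
        "## Public API",
    ]
    for kind, title in SECTION_TITLES.items():
        group = [entry for entry in entries if entry.kind == kind]
        if not group:
            continue  # omit empty sections entirely
        lines += ["", f"### {title}"]
        for entry in group:
            lines += [f"#### {entry.signature}", entry.doc, f"- Line: {entry.line}"]
    return "\n".join(lines) + "\n"


entries = [
    ApiEntry("GridStrategy", "class", "class GridStrategy",
             "Grid trading strategy with configurable level count and spacing.", 24),
    ApiEntry("run_grid_loop", "async_function",
             "run_grid_loop(symbol: str, interval: int = 60) -> None",
             "Main grid trading loop.", 287),
]
print(render_contract("hermes.strategies.grid_strategy",
                      "hermes/strategies/grid_strategy.py", entries))
```

Keeping one fixed line shape per field (for example "- Line: N") is what makes the file easy to parse back during verification.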
Stage 3: Verification
This is where it gets useful. ContractVerifier.verify() compares the INTERFACE.md against the actual code:
verifier = ContractVerifier()
report = verifier.verify(
    "hermes/strategies/grid_strategy.py",
    "hermes/strategies/INTERFACE.md"
)
It returns:
{
    "consistent": false,
    "module": "hermes/strategies/grid_strategy.py",
    "code_api_count": 8,
    "doc_api_count": 8,
    "missing_in_doc": ["GridStrategy.cancel_all_orders"],
    "missing_in_code": ["GridStrategy.old_legacy_method"]
}
Two violations:
missing_in_doc: cancel_all_orders exists in the code but nobody documented it. This blocks the build — you either document it or you explain why it shouldn't be public.
missing_in_code: old_legacy_method is in the INTERFACE.md but the code removed it. The contract is stale. This also blocks the build — regenerate the contract.
batch_verify() does this for every INTERFACE.md in a directory tree, handling both single-module and consolidated formats automatically.
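At its core, the comparison can be expressed as two set differences over API names. The sketch below assumes the names have already been extracted from the code and parsed out of INTERFACE.md; compare_api is a hypothetical helper, not the actual ContractVerifier method, and it compares names only, not signatures.

```python
def compare_api(code_names, doc_names, module):
    """Build a verification report from two collections of API names."""
    code, doc = set(code_names), set(doc_names)
    missing_in_doc = sorted(code - doc)    # in the code, not documented
    missing_in_code = sorted(doc - code)   # documented, no longer in the code
    return {
        "consistent": not missing_in_doc and not missing_in_code,
        "module": module,
        "code_api_count": len(code),
        "doc_api_count": len(doc),
        "missing_in_doc": missing_in_doc,
        "missing_in_code": missing_in_code,
    }


report = compare_api(
    code_names=["GridStrategy", "GridStrategy.place_orders",
                "GridStrategy.cancel_all_orders"],
    doc_names=["GridStrategy", "GridStrategy.place_orders",
               "GridStrategy.old_legacy_method"],
    module="hermes/strategies/grid_strategy.py",
)
print(report)
```

Sorting the differences keeps the report stable between runs, which matters if the output is diffed or logged by CI.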
The CI Gate That Makes It Real
Generating contracts is easy. Enforcing them is what matters.
CiEnforcer has three gates. Gate 2 is the contract gate:
def _gate_contract(self, modified, deleted):
    for mod_path in modified:
        py_mtime = source_file.stat().st_mtime
        md_mtime = interface_md.stat().st_mtime
        if py_mtime > md_mtime:
            # Source was edited after the contract was generated.
            # Contract is stale. BLOCK.
            violations.append(...)
The check is timestamp-based, not content-based. This is deliberate: comparing two modification times is a constant-time check per file, cheap enough to run in a pre-commit hook before any commit lands. Full AST verification (Stage 3) runs in the nightly audit. The timestamp gate catches 95% of contract drift at commit time with negligible overhead.
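A self-contained version of the timestamp check could look roughly like this. The function name stale_contracts is hypothetical, and two things are assumptions rather than facts about CiEnforcer: that the contract lives next to the source file, and that a missing contract counts as a violation.

```python
from pathlib import Path


def stale_contracts(modified, contract_name="INTERFACE.md"):
    """Return one violation message per source file newer than its contract."""
    violations = []
    for mod_path in modified:
        source_file = Path(mod_path)
        if not source_file.exists():
            continue  # deleted or moved files are left to other gates
        # Assumption: the contract sits in the same directory as the source.
        interface_md = source_file.parent / contract_name
        if not interface_md.exists():
            violations.append(f"{mod_path}: no {contract_name} found")
            continue
        if source_file.stat().st_mtime > interface_md.stat().st_mtime:
            # Source was edited after the contract was generated.
            violations.append(f"{mod_path}: edited after {interface_md} was generated")
    return violations


if __name__ == "__main__":
    import sys

    problems = stale_contracts(sys.argv[1:])
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)
```

Run as a script with the modified paths as arguments, a non-zero exit code is what lets a pre-commit hook reject the commit.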
The complete enforcement flow:
git commit
  → pre-commit hook
    → CiEnforcer.enforce(modified=["hermes/strategies/grid_strategy.py"])
      → Gate: dead code → PASS (not a new module)
      → Gate: contract → BLOCK (source modified, INTERFACE.md stale)
      → Gate: integration → PASS
Commit rejected. Run:
  python scripts/_ares_gen_contracts.py
  git add hermes/strategies/INTERFACE.md
  git commit
The Bonus: Automatic Test Generation
Once you have formal contracts, you can generate tests from them. ContractTestGenerator produces two types:
Provider stub tests — verify the module's public API is importable and has the expected attributes:
def test_GridStrategy_place_orders_stub():
    from hermes.strategies.grid_strategy import GridStrategy
    assert hasattr(GridStrategy, 'place_orders')
Consumer mock tests — scaffold for verifying that callers use the API correctly:
def test_GridStrategy_place_orders_consumer_mock():
    from unittest.mock import patch
    with patch("hermes.strategies.grid_strategy.GridStrategy.place_orders") as mock_fn:
        # Replace with actual consumer import
        pass
These are scaffolds, not complete tests. They ensure every contract entry has a corresponding test file. Fill in the actual assertions as you use each API. If you delete a function, the stub test fails because hasattr returns False. If you rename a method, the mock target breaks.
What This Doesn't Solve
ContractVerifier tells you whether the code matches the documented interface. It does NOT tell you:
Whether the interface is well-designed. A function that takes 15 parameters can be perfectly documented and still terrible.
Whether the behavior is correct. def withdraw(amount: float) -> bool can return False for every call and still be contract-compliant. The contract verifies the signature, not the semantics.
Whether the contract is complete. A module can have 100% contract compliance and still have undocumented side effects (file I/O, network calls, database writes).
These limits are inherent to checking signatures statically. The contract verifies form, not function.
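The gap between signature and semantics is easy to demonstrate. In this small illustration, both functions are hypothetical stand-ins for the withdraw example above, and BALANCE is an invented constant; a signature-level check cannot tell them apart.

```python
import inspect

BALANCE = 100.0  # invented for the illustration


def withdraw_checked(amount: float) -> bool:
    """Succeed only for a positive amount that the balance can cover."""
    return 0 < amount <= BALANCE


def withdraw_broken(amount: float) -> bool:
    """Same signature, but rejects every request."""
    return False


def same_signature(first, second):
    """Compare parameters and return annotation, ignoring the function name."""
    return inspect.signature(first) == inspect.signature(second)


print(same_signature(withdraw_checked, withdraw_broken))
print(withdraw_checked(25.0), withdraw_broken(25.0))
```

Both functions would satisfy the same contract entry, yet they disagree on a valid withdrawal, which is exactly the kind of defect that needs behavioral tests or runtime monitoring.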
The next layer — runtime monitoring via ActivityMonitor — catches some of what static analysis misses. If a function's contract says it returns bool but it actually raises an unhandled exception 5% of the time, the runtime monitor catches that. The contract verifier doesn't.
Connection to AI Safety
The AI safety community has a concept called behavioral specification: how do you formally specify what an AI agent should do, and then verify it does that?
ContractVerifier is a micro-scale implementation of this idea applied to Python modules. The "agent" is a Python file. The "behavioral specification" is INTERFACE.md. The "verification" is the AST comparison.
This is obviously not a solution to the full AI alignment problem. But it's a concrete, working example of the pattern: specify → verify → enforce. If you're designing governance infrastructure for multi-agent systems where each agent is a Python module (or an LLM with tool calls), this pattern scales.
The key insight I'd offer to anyone building agent governance systems: don't let the same party write the specification and verify it. The closed evaluation loop I wrote about last week applies here too. If the same code generates the contract and verifies it, you haven't proven the code is correct; you've only proven that the generator and verifier agree with each other.
The only real verification is external use. In ContractVerifier's case, that means: someone else running it on their codebase and finding violations I never anticipated.
What's Next
I'm extracting ContractVerifier as the next open-source module from ARES. Same approach as dead-scanner: zero dependencies, pure Python, works on any project.
If you're interested in the intersection of static analysis, CI enforcement, and AI governance — I'd like to hear from you. Especially if you're working on similar problems at the agent level rather than the module level.
GitHub: github.com/Nick-lll/dead-scanner
X: x.com/senlin
This is part 2 of a series on building governance infrastructure for autonomous AI systems. Part 1: Building a Governance Layer Without Knowing AI Safety Existed