Don Johnson

Posted on May 19

AI Tools Need Contracts, Not Prompts

#rust #cli #ai #devtools

The executable as the interface agents can discover, verify, and trust

An AI agent can read your README, scan your tests, inspect your source tree, and
infer a plausible architecture. Then it can still break your tool by renaming a
flag, weakening an exit condition, or "simplifying" JSON output another script
depends on.

The failure is not always that the model misunderstood the code. Often the
failure is simpler: the contract was not executable.

For AI-assisted engineering, a CLI is no longer just a human interface. It is
the narrowest place where intent, behavior, and verification can meet. It is the
surface an agent can run, observe, validate, and compose.

That is the premise behind entropyx.

entropyx is a local-first Rust CLI for codebase forensics. It scans a git
repository and emits a typed summary of temporal, structural, authorship, and
semantic signals. It can tell you which files absorb change, which ones carry
coupling stress, which public APIs drifted without tests, and which events in
the repository explain the pattern.

But the important idea is not the specific metric set. The important idea is the
shape of the interface.

The executable is the contract.

Not a prompt. Not a dashboard. Not a model-specific integration layer. The
binary exposes what it can do, what it accepts, what it emits, and how to ask
for evidence. Humans can run it. CI can run it. An AI assistant can run it. All
three see the same contract surface.

This is the design pattern: make the tool legible to agents by making the
interface more explicit, not more magical.

The Failure Mode

Many developer tools have implicit contracts.

The README says one thing. --help says another. The tests exercise internal
functions, but not the CLI behavior downstream tools depend on. The JSON output
looks stable until a field is renamed. A command returns text that is easy for a
person to read but brittle for a machine to parse. A failure exits with 0
because a human would notice the error line on stderr.

Humans compensate for this with context. They remember the old behavior. They
know which fields are load-bearing. They know that "pretty output" is not the
automation path. They know which flags are part of the public contract even if
the code does not say so.

Agents do not get that memory for free.

They can infer. They can search. They can run tests. But if the contract lives
in prose, convention, and tribal knowledge, an agent will eventually step on it.
It may even make a locally reasonable change that passes the test suite while
breaking the actual interface.

That is the operational problem AI-first tooling has to solve.

The answer is not more prompt text explaining how to behave. The answer is a
contract surface the agent can execute.

The Reframe: CLI As Protocol

The command line is often treated as a wrapper around the real product. For
AI-first tools, that is backwards.

The CLI is the protocol.

It has named commands, explicit inputs, observable outputs, exit codes, and a
natural place for versioning. It runs locally. It composes with files, pipes, CI
jobs, scripts, and terminals. It is already the interface most coding agents can
operate.

That makes it a good contract boundary.

The point is not that every product should be only a CLI. The point is that if a
tool claims to support AI-assisted engineering, it should expose a surface that
agents can discover and verify without guessing.

For entropyx, that surface is intentionally small:

entropyx describe
entropyx schema
entropyx scan /path/to/repo
entropyx explain /path/to/repo file:<blob-prefix>
entropyx calibrate --summary summary.json --labels labels.json

The core agent loop is the first four commands. calibrate is the offline
weight-fitting path: take a prior summary, join it with labeled file scores, fit
new weights, and feed those weights back into a later scan.

Five commands are enough because the commands are not just verbs. They are the
contract anatomy.

The Contract Anatomy

An AI-first executable needs more than --help.

--help is optimized for a person who already knows what kind of thing they are
running. An agent needs a machine-readable answer to a different set of
questions:

What is this executable?
What can it do?
What inputs does it accept?
What output shapes does it produce?
What invariants does it promise?
How expensive is this operation likely to be?
How do I drill from a summary into evidence?

In entropyx, describe answers the identity, capability, input, output,
invariant, and cost questions:

entropyx describe

It emits the protocol root as JSON: name, version, contract version, purpose,
capabilities, input types, output formats, cost model, and invariants.
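
As a rough illustration of how a consumer might use that root, the sketch below mirrors the listed fields in a Rust struct and runs a preflight check before any scan. The field types, the `Preflight` enum, and the `preflight` function are assumptions made for this example, not the entropyx source.

```rust
/// Hypothetical in-memory mirror of the `describe` output. The field list
/// follows the article; the field types are assumptions.
#[derive(Debug, Clone)]
struct Descriptor {
    name: String,
    version: String,
    contract_version: String,
    purpose: String,
    capabilities: Vec<String>,
    input_types: Vec<String>,
    output_formats: Vec<String>,
    cost_model: String,
    invariants: Vec<String>,
}

/// Result of checking a descriptor against what the caller needs.
#[derive(Debug, PartialEq)]
enum Preflight {
    Ready,
    MissingCapability(String),
    MissingOutputFormat(String),
}

/// Report the first missing capability or output format, if any.
fn preflight(d: &Descriptor, needed_capabilities: &[&str], needed_format: &str) -> Preflight {
    for cap in needed_capabilities {
        if !d.capabilities.iter().any(|c| c == cap) {
            return Preflight::MissingCapability(cap.to_string());
        }
    }
    if !d.output_formats.iter().any(|f| f == needed_format) {
        return Preflight::MissingOutputFormat(needed_format.to_string());
    }
    Preflight::Ready
}
```

An agent that gets anything other than `Preflight::Ready` can stop and report the gap instead of guessing at flags.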

schema answers the shape question:

entropyx schema > tq1-schema.json

It emits the JSON Schema for the tq1 summary envelope. The schema $id is tied
to the protocol contract version, so a breaking output change cannot pass
unnoticed.

scan answers the evidence question:

entropyx scan /path/to/repo > summary.json

It walks local git history and emits a dense repository summary.

explain answers the drill-down question:

entropyx explain /path/to/repo file:<blob-prefix>
entropyx explain /path/to/repo commit:<sha>
entropyx explain /path/to/repo range:<base>..<head>

The summary currently mints file:<blob-prefix> handles for HEAD blobs.
explain also accepts commit:<sha> and range:<base>..<head> address forms,
which lets an agent ask for commit or release-window evidence without relying on
a prior summary entry.

That distinction matters. A contract should say exactly which identifiers it
emits and which identifiers it accepts.
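
One way a consumer could hold itself to the same rule is to parse addresses into a closed type before calling explain. The sketch below covers the three forms listed above. The `Address` enum and `parse_address` function are hypothetical names, and the hex check on blob prefixes and commit SHAs is an assumption based on how git object ids look.

```rust
/// The three address forms the article says `explain` accepts.
#[derive(Debug, PartialEq)]
enum Address {
    File { blob_prefix: String },
    Commit { sha: String },
    Range { base: String, head: String },
}

/// Assumption: blob prefixes and commit SHAs are non-empty hexadecimal strings.
fn is_hex(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_hexdigit())
}

/// Parse `file:<blob-prefix>`, `commit:<sha>`, or `range:<base>..<head>`.
fn parse_address(input: &str) -> Result<Address, String> {
    let (kind, rest) = input
        .split_once(':')
        .ok_or_else(|| format!("missing ':' in address {:?}", input))?;
    match kind {
        "file" if is_hex(rest) => Ok(Address::File { blob_prefix: rest.to_string() }),
        "commit" if is_hex(rest) => Ok(Address::Commit { sha: rest.to_string() }),
        // Range endpoints may be refs or tags, so they are not hex-checked.
        "range" => match rest.split_once("..") {
            Some((base, head)) if !base.is_empty() && !head.is_empty() => Ok(Address::Range {
                base: base.to_string(),
                head: head.to_string(),
            }),
            _ => Err(format!("range address needs <base>..<head>: {:?}", input)),
        },
        "file" | "commit" => Err(format!("expected a hex id after '{}:' in {:?}", kind, input)),
        other => Err(format!("unknown address kind {:?}", other)),
    }
}
```

Rejecting an unknown kind locally is cheaper than discovering it through a failed command, and it keeps the accepted forms written down in one place.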

What The Contract Looks Like

The most important output path is not prose. It is tq1, a typed JSON envelope.

A simplified summary looks like this:

{
  "schema": {"name": "tq1", "version": "0.1.0"},
  "dict": {
    "files": ["src/lib.rs"],
    "authors": ["a@example.com"],
    "metrics": [
      "change_density",
      "author_entropy",
      "temporal_volatility",
      "coupling_stress",
      "blame_youth",
      "semantic_drift",
      "test_cooevolution",
      "composite"
    ]
  },
  "files": [
    {
      "file": 0,
      "values": [0.4, 0.2, 0.1, 0.7, 0.3, 0.6, 0.5, 0.43],
      "lineage_confidence": 1.0,
      "signal_class": "api_drift"
    }
  ],
  "events": [],
  "handles": {
    "file:abc123def456": {
      "kind": "file",
      "file": 0,
      "blob_prefix": "abc123def456"
    }
  },
  "enrichments": {"pull_requests": {}}
}

This is not pretty terminal output that an agent has to reverse-engineer. It is
the protocol.

The dictionary pins the string tables. The metric column order is explicit. The
file rows are dense. Events are typed. Handles are keyed by their canonical
string form. Optional enrichments live in a sidecar.

That design gives an agent a stable path:

Read the schema and contract version.
Decode the dictionaries.
Rank or filter file rows.
Inspect typed events.
Ask explain for evidence by address.
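
The decode-and-rank steps can be sketched in a few lines. The example assumes the summary has already been deserialized into plain structs. `Dict`, `FileRow`, and `rank_by_metric` are hypothetical names that mirror only part of the simplified envelope above.

```rust
/// Hand-written mirror of part of the simplified tq1 summary.
struct Dict {
    files: Vec<String>,
    metrics: Vec<String>,
}

struct FileRow {
    file: usize,      // index into dict.files
    values: Vec<f64>, // one value per entry in dict.metrics, same order
}

/// Return (path, value) pairs for one metric, highest value first.
/// Returns None when the metric name is not in the dictionary.
fn rank_by_metric<'a>(
    dict: &'a Dict,
    rows: &[FileRow],
    metric: &str,
) -> Option<Vec<(&'a str, f64)>> {
    // Look the column up by name instead of hard-coding its position.
    let col = dict.metrics.iter().position(|m| m == metric)?;
    let mut ranked: Vec<(&str, f64)> = rows
        .iter()
        .filter_map(|row| {
            let path = dict.files.get(row.file)?;
            let value = *row.values.get(col)?;
            Some((path.as_str(), value))
        })
        .collect();
    // Ties break on path so the ordering is stable across runs.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    Some(ranked)
}
```

Resolving the column through the dictionary is the point: a consumer that indexes `values[3]` directly breaks the moment the metric order changes.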

The final answer can be prose. The instrument should emit structure.

Entropyx As The Case Study

entropyx measures seven axes for each file:

D_n: change density
H_a: author dispersion
V_t: temporal volatility
C_s: coupling stress
B_y: blame youth
S_n: semantic drift
T_c: test co-evolution

The composite score is useful, but it is not the contract by itself. Every score
decomposes into the axes that produced it.

That is the difference between a measurement and an opinion. If a file scores
high because coupling stress is high, the assistant should say that. If the
score is driven by semantic drift without test co-evolution, that is a different
claim. If author dispersion is rising on a hot file, that is an ownership
problem, not a generic risk label.
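
A minimal sketch of that decomposition, assuming a row has been decoded into parallel slices of metric names and values: it reports the largest non-composite axis so the assistant can name it. `dominant_axis` is a hypothetical helper, and a plain maximum is not how entropyx assigns its signal classes.

```rust
/// Return the name and value of the largest axis, ignoring "composite".
/// Returns None if the slices differ in length or hold no axes.
fn dominant_axis<'a>(metrics: &[&'a str], values: &[f64]) -> Option<(&'a str, f64)> {
    if metrics.len() != values.len() {
        return None;
    }
    metrics
        .iter()
        .zip(values.iter())
        .filter(|(name, _)| **name != "composite")
        .map(|(name, value)| (*name, *value))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Run against the simplified row shown earlier, this picks coupling_stress at 0.7, while that row's signal_class is api_drift. The gap between the two is a reminder to cite the typed classification and the axes together rather than collapse them into one label.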

The rule-based signal taxonomy is also part of the protocol:

incident_aftershock
coupled_amplifier
refactor_convergence
api_drift
ownership_fragmentation
frozen_neglect

And the event stream has five variants:

rename
hotspot
incident_aftershock
ownership_split
api_drift

These are not generated explanations. They are typed outputs. An assistant can
use them, cite them, filter them, and ask for the evidence behind them.
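
On the consumer side, the two vocabularies map naturally onto closed enums. The sketch below is one possible mirror of the names listed above. The type names and `FromStr` implementations are assumptions for illustration, not the entropyx types.

```rust
use std::str::FromStr;

/// The six rule-based signal classes named in the article.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SignalClass {
    IncidentAftershock,
    CoupledAmplifier,
    RefactorConvergence,
    ApiDrift,
    OwnershipFragmentation,
    FrozenNeglect,
}

/// The five event variants named in the article.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventKind {
    Rename,
    Hotspot,
    IncidentAftershock,
    OwnershipSplit,
    ApiDrift,
}

impl FromStr for SignalClass {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "incident_aftershock" => Ok(Self::IncidentAftershock),
            "coupled_amplifier" => Ok(Self::CoupledAmplifier),
            "refactor_convergence" => Ok(Self::RefactorConvergence),
            "api_drift" => Ok(Self::ApiDrift),
            "ownership_fragmentation" => Ok(Self::OwnershipFragmentation),
            "frozen_neglect" => Ok(Self::FrozenNeglect),
            other => Err(format!("unknown signal class {:?}", other)),
        }
    }
}

impl FromStr for EventKind {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "rename" => Ok(Self::Rename),
            "hotspot" => Ok(Self::Hotspot),
            "incident_aftershock" => Ok(Self::IncidentAftershock),
            "ownership_split" => Ok(Self::OwnershipSplit),
            "api_drift" => Ok(Self::ApiDrift),
            other => Err(format!("unknown event kind {:?}", other)),
        }
    }
}
```

Keeping the two enums separate matters because some names, such as api_drift, appear in both lists but describe different things: a per-file classification in one case and an event in the other.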

That is the central division of labor:

The executable measures.
The protocol preserves structure.
The assistant chooses what to inspect.
The assistant explains the evidence to the user.

Determinism Is Agent UX

Determinism is not just a testing preference. It is user experience for agents.

If the same command on the same repository produces different results, the
assistant has to reason about the measurement system instead of the codebase.
Did the repository change? Did the tool change? Did a clock, random seed,
network call, parallel reduction, or model update move the result?

entropyx keeps the core scan deterministic for the same repo state, entropyx
version, flags, and local inputs.

That means:

no wall-clock reads in the core measurement layer
deterministic floating-point reductions
stable interning across serialization round trips
local git history as the source of truth
no ML scoring in the deterministic physics layer
versioned protocol contracts
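
Floating-point addition is not associative, so the order of a reduction can change the last bits of a result. The sketch below shows one generic way to make a sum order-independent: sort the inputs into a canonical order, then accumulate sequentially. It illustrates the idea only. `deterministic_sum` is a hypothetical function, and the article does not say which technique entropyx uses.

```rust
/// Sum values after sorting by key, so the result does not depend on the
/// order in which samples arrived (for example from parallel workers).
fn deterministic_sum(samples: &[(&str, f64)]) -> f64 {
    let mut ordered: Vec<(&str, f64)> = samples.to_vec();
    // Canonical order: by key, then by value for duplicate keys.
    ordered.sort_by(|a, b| a.0.cmp(b.0).then_with(|| a.1.total_cmp(&b.1)));
    // Fixed left-to-right accumulation over the canonical order.
    ordered.iter().fold(0.0, |acc, (_, v)| acc + v)
}
```

Two permutations of the same samples produce bit-identical output here, whereas `(0.1 + 0.2) + 0.3` and `0.1 + (0.2 + 0.3)` differ in `f64`.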

Optional GitHub enrichment is deliberately separate. It can attach pull request
metadata to event commit SHAs, but that remote sidecar is not the foundation of
the local measurement.

This separation is important. A deterministic core can be cached, diffed,
tested, signed, and reproduced. Remote enrichment can add context without
turning the core finding into a network-dependent claim.

For a human, this makes the output auditable. For an agent, it makes the output
safe to build on.

Evidence Before Interpretation

An AI assistant is good at interpretation. That does not mean the tool should
ask the assistant to invent the evidence.

entropyx starts with the repository. Commits, diffs, authorship, renames, blame
snapshots, public API deltas, test co-change, and co-change graphs become the
measurement layer. The assistant does not need to infer the whole history from
raw git log output before it can answer a question.

The result is a smaller and more reliable loop.

Instead of:

Read the README.
Guess which git commands to run.
Inspect a pile of diffs.
Infer which files matter.
Hope the answer is grounded.

The assistant can:

Run entropyx describe.
Run entropyx scan.
Inspect the typed summary.
Ask entropyx explain for the few addresses that matter.
Write an answer tied to concrete evidence.
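
The second loop can be driven from a small harness. The sketch below builds the argument lists for the commands named in this article and treats any non-zero exit as a hard failure. `plan` and `run_step` are hypothetical helpers, and in a real loop the addresses would be chosen after reading the scan output rather than passed in up front.

```rust
use std::process::Command;

/// Build argument lists for describe, scan, and one explain per address.
fn plan(repo: &str, addresses: &[&str]) -> Vec<Vec<String>> {
    let mut steps = vec![
        vec!["describe".to_string()],
        vec!["scan".to_string(), repo.to_string()],
    ];
    for address in addresses {
        steps.push(vec![
            "explain".to_string(),
            repo.to_string(),
            address.to_string(),
        ]);
    }
    steps
}

/// Run one step and return stdout. A failed start or a non-zero exit is an
/// error, so a broken step cannot be mistaken for an empty result.
fn run_step(program: &str, args: &[String]) -> Result<String, String> {
    let output = Command::new(program)
        .args(args)
        .output()
        .map_err(|e| format!("could not start {}: {}", program, e))?;
    if !output.status.success() {
        return Err(format!(
            "{} {} failed ({}): {}",
            program,
            args.join(" "),
            output.status,
            String::from_utf8_lossy(&output.stderr).trim()
        ));
    }
    String::from_utf8(output.stdout).map_err(|e| format!("stdout was not valid UTF-8: {}", e))
}
```

A caller would iterate over `plan(repo, &addresses)` and pass each argument list to `run_step("entropyx", &args)`, stopping at the first `Err`.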

This is not about replacing engineering judgment. It is about giving judgment a
better input.

The same pattern applies beyond codebase forensics. Test selection, dependency
analysis, migration planning, security review, release risk, documentation
drift, and compliance evidence all benefit from the same contract shape:
summarize the domain, expose typed findings, and let the agent fetch proof.

Handles Make The Protocol Navigable

Handle-addressable evidence is the most important part of the pattern.

A summary should be compact enough to read as a map. It should not dump the
entire warehouse into the agent's context. But a summary without drill-down is
just another report.

Handles bridge that gap.

A handle is a stable, user-facing pointer from a finding to evidence that can be
retrieved on demand. In entropyx, file handles are content-addressed by blob
prefix. Commit and range address forms let explain resolve git objects and
release windows directly.

That gives the assistant a clean workflow:

read the map
choose the region
fetch the evidence
cite the address
repeat only where needed

This matters because context is expensive, even when the context window is
large. A tool that emits everything at once forces the assistant to pay for
everything before it knows what matters. A handle-driven protocol lets the
assistant spend attention where the evidence points.
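
A sketch of the "choose the region" step under a fixed budget: given the handles from a summary and a composite score per file index, keep only the top `k` handles to pass to explain. `top_handles` is a hypothetical helper, and the flat `composite` slice is a simplification of the file rows in the summary.

```rust
use std::collections::BTreeMap;

/// Pick the `k` handles whose files have the highest composite score.
/// `handles` maps a handle string to a file index, and `composite[i]` is
/// assumed to be the composite score for file index `i`.
fn top_handles<'a>(
    handles: &'a BTreeMap<String, usize>,
    composite: &[f64],
    k: usize,
) -> Vec<&'a str> {
    let mut scored: Vec<(&str, f64)> = handles
        .iter()
        .filter_map(|(handle, &file)| {
            // Skip handles that point at a file index with no score.
            composite.get(file).map(|&score| (handle.as_str(), score))
        })
        .collect();
    // Highest score first; ties break on the handle string for stability.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    scored.into_iter().take(k).map(|(handle, _)| handle).collect()
}
```

The budget `k` is where the assistant decides how much context to spend. Everything below the cut stays addressable and can be fetched later if the first round of evidence points there.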

It also helps humans. A reviewer can take the same handle, run the same command,
and inspect the same evidence. The handle becomes a shared reference, not an
opaque explanation generated in a chat transcript.

Honest Absence

AI systems are vulnerable to confident falsehoods. Tools should not make that
worse.

entropyx is designed to return explicit absence when it cannot measure
something. If a language backend is unknown, semantic drift contributes zero for
that file. That zero means "unmeasured by this axis," not "stable." If optional
GitHub enrichment is missing, the pull request sidecar is empty. If a handle
cannot be resolved, the command fails cleanly.

This is less flashy than a tool that always has an answer. It is more useful.

An AI-first instrument should not try to sound intelligent. It should be precise
about what it knows, where the measurement came from, and where the evidence
stops.

The assistant can then say, "the scan did not measure semantic drift for this
file type," rather than treating silence as safety.
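
One way a consumer could keep that distinction from getting lost is to lift the raw value into a type that cannot confuse "zero" with "not measured". The sketch below assumes the caller already knows which file extensions have a language backend. That list, the `AxisReading` enum, and both functions are hypothetical.

```rust
use std::path::Path;

/// A semantic-drift value that records whether the axis was measured at all.
#[derive(Debug, PartialEq)]
enum AxisReading {
    Measured(f64),
    Unmeasured,
}

/// Classify a raw value using a caller-supplied list of extensions that have
/// a language backend. Files outside that list are Unmeasured, whatever the
/// raw number says.
fn semantic_drift_reading(path: &str, raw: f64, supported_exts: &[&str]) -> AxisReading {
    let ext = Path::new(path).extension().and_then(|e| e.to_str());
    match ext {
        Some(e) if supported_exts.contains(&e) => AxisReading::Measured(raw),
        _ => AxisReading::Unmeasured,
    }
}

/// Turn a reading into a sentence that does not treat silence as safety.
fn report(path: &str, reading: &AxisReading) -> String {
    match reading {
        AxisReading::Measured(v) => format!("{}: semantic drift {:.2}", path, v),
        AxisReading::Unmeasured => format!(
            "{}: the scan did not measure semantic drift for this file type",
            path
        ),
    }
}
```

With this shape, a measured 0.0 and an unmeasured file produce different sentences even though the raw number in the summary is the same.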

The Tradeoff

Executable contracts are less flexible than informal interfaces.

That is a feature.

If a JSON field is part of the protocol, changing it should feel like changing
an API. If an exit code communicates failure, weakening it should break a test.
If a schema version is pinned to a contract version, a breaking change should be
visible to consumers. If a command emits handles, the accepted handle forms
should be documented and tested.

This creates friction. It should.

AI agents need stable edges. CI needs stable edges. Human operators need stable
edges during incidents. A tool that changes shape casually forces every consumer
to rediscover the boundary.

The lesson is not that a CLI can never evolve. It is that the evolution should
be explicit. Version the contract. Preserve compatibility where possible. Break
it deliberately when necessary.
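
A consumer can make that explicitness enforceable by pinning the contract version it was written against and checking it before trusting any output. The sketch below assumes versions are three dot-separated numbers, like the 0.1.0 in the sample summary. The compatibility rule is a common convention chosen for this example, not one the article states: a 0.x version must match on the minor number, and later versions must match on the major number and be at least as new on the minor number.

```rust
/// Parse "major.minor.patch" into numbers. Returns None for any other shape.
fn parse_version(s: &str) -> Option<(u64, u64, u64)> {
    let mut parts = s.trim().split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Decide whether a tool reporting `reported` can serve a consumer pinned to
/// `pinned`. Returns None if either string is not a valid version.
fn is_compatible(pinned: &str, reported: &str) -> Option<bool> {
    let (p_major, p_minor, _) = parse_version(pinned)?;
    let (r_major, r_minor, _) = parse_version(reported)?;
    Some(if p_major == 0 {
        // Assumed rule: before 1.0, a minor bump is treated as breaking.
        r_major == 0 && r_minor == p_minor
    } else {
        r_major == p_major && r_minor >= p_minor
    })
}
```

Failing fast on `Some(false)` or `None` turns a silent shape change into a visible error at the boundary.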

The Blueprint

The entropyx pattern generalizes to other AI-first developer tools.

Start with a small executable surface:

tool describe
tool schema
tool scan <target>
tool explain <target> <address>

Add domain-specific commands only when they have a clear role, as calibrate
does for entropyx.

Then make the contract explicit:

describe exposes capabilities, inputs, outputs, costs, and invariants.
schema exposes machine-readable output shapes.
scan produces a dense typed map of the target domain.
explain resolves stable addresses into evidence.
exit codes are commitments, not decoration.
structured output is the automation path.
prose is for humans and final answers.
optional network enrichment is sidecar data, not the measurement core.
absence is explicit.
breaking changes bump the contract.
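
To make "exit codes are commitments" concrete, a tool can route every outcome through a single mapping and test that mapping. The sketch below is illustrative only. The `Outcome` variants and the specific numeric codes are hypothetical and are not taken from entropyx.

```rust
/// Hypothetical outcomes of a single command invocation.
#[derive(Debug)]
enum Outcome {
    Success,
    Usage(String),
    UnresolvedAddress(String),
    Internal(String),
}

/// The one place where outcomes become process exit codes.
/// The numbers are assumptions for this example; what matters is that
/// every failure maps to a non-zero code.
fn exit_code(outcome: &Outcome) -> u8 {
    match outcome {
        Outcome::Success => 0,
        Outcome::Internal(_) => 1,
        Outcome::Usage(_) => 2,
        Outcome::UnresolvedAddress(_) => 3,
    }
}

/// The line to print on stderr, if any. Success prints nothing there, so
/// stdout can carry structured output alone.
fn stderr_line(outcome: &Outcome) -> Option<String> {
    match outcome {
        Outcome::Success => None,
        Outcome::Usage(msg) => Some(format!("usage error: {}", msg)),
        Outcome::UnresolvedAddress(addr) => Some(format!("cannot resolve address: {}", addr)),
        Outcome::Internal(msg) => Some(format!("internal error: {}", msg)),
    }
}
```

In a binary, `main` could return `std::process::ExitCode::from(exit_code(&outcome))`. A test that asserts every failure variant maps to a non-zero code is what stops a later "simplification" from quietly returning 0.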

This is the difference between a prompt-shaped tool and a contract-shaped tool.

A prompt-shaped tool relies on instructions around the tool. A contract-shaped
tool exposes behavior the agent can run and verify.

What Changes Next

The next generation of developer tools will not be judged only by how well
humans can read their documentation. They will also be judged by how reliably
agents can discover, run, validate, and compose their behavior.

That changes what a CLI is.

It is not the wrapper around the product.

It is the contract boundary.

If the contract lives only in prose, an agent can misunderstand it. If it lives
only in internal tests, the agent may never see it. If it lives in the
executable surface, the agent can run it.

That is the standard AI-first developer tools should meet: not documented
intent, but executable commitment.

entropyx is one concrete implementation of that idea. It measures codebase
history, emits a typed protocol, preserves deterministic local evidence, and
lets an assistant drill from summary to proof.

The broader pattern is the part worth carrying forward.

Build tools the agent can call. Make them describe themselves. Give them
schemas. Give them stable addresses. Make the core deterministic. Keep evidence
local when you can. Tell the truth when you cannot measure something.

The code already knows more than the incident room usually remembers.

The tool's job is to read it back in a form both humans and agents can trust.

Attribution and Disclosure

Written by Don / copyleftdev from the entropyx project.

This article was drafted and edited with AI assistance, then reviewed against
the entropyx source and DEV's current publishing guidance before submission.
