A judge or the math: two trust models for autonomous agent settlement

#mcp #ai #cryptocurrency #blockchain

When an AI agent settles a trade with no human watching, something has to make that trade trustworthy. There are two serious ways to do it, and they are not the same. One puts a judge in the loop. The other replaces the judge with math. Most of the current debate about "trust layers for agents" is really an argument between these two models, often without naming them. This post names them and tries to be fair to both.

The problem, stated narrowly

An autonomous agent composes a trade — pay this, receive that, maybe across two chains — and executes it without a human approving each step. The risk is the obvious one: the agent does its half and the counterparty doesn't do theirs, or does a worse version of it. Something has to guarantee that "I paid" and "I got what I was promised" either both happen or neither does.

The two models diverge on what provides that guarantee.

Model 1: the evaluator (a judge in the loop)

In the evaluator model, a trusted party observes the deal and rules on whether it was performed correctly, then releases the escrowed funds accordingly. Ethereum's ERC-8183 draft formalizes a clean version of this with a Client / Provider / Evaluator job structure: a client commissions work, a provider performs it, and an evaluator attests to the outcome before payment settles. It is designed to pair with agent-identity and authorization work like ERC-8004 and payment rails like x402.
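As a rough illustration of that job structure, here is a minimal in-memory sketch. It is not the ERC-8183 interface: the class name, state names, and method names are all hypothetical, and balances are plain integers rather than on-chain escrow. It only shows that payment direction is decided by the evaluator's ruling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class JobState(Enum):
    FUNDED = "funded"        # client has escrowed the payment
    SUBMITTED = "submitted"  # provider has delivered an artifact
    COMPLETED = "completed"  # evaluator approved; provider is paid
    REJECTED = "rejected"    # evaluator rejected; client is refunded


@dataclass
class EvaluatedJob:
    """Hypothetical Client / Provider / Evaluator job, for illustration only."""
    client: str
    provider: str
    evaluator: str
    amount: int
    state: JobState = JobState.FUNDED
    deliverable: Optional[str] = None

    def submit(self, caller: str, deliverable: str) -> None:
        # Only the provider may deliver, and only while the job is funded.
        if caller != self.provider or self.state is not JobState.FUNDED:
            raise PermissionError("only the provider can submit to a funded job")
        self.deliverable = deliverable
        self.state = JobState.SUBMITTED

    def attest(self, caller: str, approved: bool) -> Tuple[str, int]:
        # Only the evaluator may rule, and only on a submitted deliverable.
        if caller != self.evaluator or self.state is not JobState.SUBMITTED:
            raise PermissionError("only the evaluator can rule on a submitted job")
        self.state = JobState.COMPLETED if approved else JobState.REJECTED
        payee = self.provider if approved else self.client
        return payee, self.amount  # who receives the escrowed funds


job = EvaluatedJob(client="client", provider="provider",
                   evaluator="evaluator", amount=100)
job.submit("provider", "summary.txt")
payee, amount = job.attest("evaluator", approved=True)
```

Every path to payment runs through `attest`, which is exactly where the trust, liveness, and selection questions attach.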

The reason this model exists is that a lot of real agent commerce is subjective. "Did the provider deliver a correct summary, a working integration, a usable dataset?" is not a question a hash function can answer. It needs judgment. An evaluator can read the artifact, apply a rubric, and rule. Pure cryptography is blind to all of that — it can only see whether bytes and balances moved, not whether they were the right bytes.

So credit where due: where the dispute is "was this work done well," the evaluator is doing something cryptography fundamentally cannot. That is not overhead. That is the entire point.

The cost is also real, and worth saying plainly:

Trust assumption. Someone has to be the evaluator. Whoever that is, you now trust them — to be correct, available, and honest.
Liveness dependency. If the evaluator is offline, slow, or captured, settlement stalls or skews. You have added a component that can fail.
Selection and incentives. Who chooses the evaluator? Who pays it? A judge the seller picks and a judge the buyer picks are different judges.

None of these are fatal. They are the price of being able to rule on ambiguity.

Model 2: cryptographic atomicity (delete the judge)

The second model removes the third party entirely. A hash-time-locked contract (HTLC) makes a trade clear as a single unit or not at all. Both sides lock their assets against the same hash H = hash(s), where s is a secret known at first only to the party who generated it. Claiming one side requires revealing s on-chain, and that same reveal lets the counterparty claim the other side. If the secret is never revealed, every lock refunds after its timeout. As long as the timeouts are staggered so that the party who learns s second still has time to claim, neither party can end up having paid without being able to collect.
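The lock, reveal, and refund rules can be sketched in a few lines. This is a toy in-memory model, not contract code: the `HTLC` class and its fields are hypothetical, SHA-256 stands in for whatever hash a given chain uses, and time is a bare integer. It also assumes the usual staggering, where the party who knows s locks with the longer timeout.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class HTLC:
    """Toy hash-time-locked contract; names are illustrative only."""
    sender: str
    recipient: str
    amount: int
    lock_hash: bytes               # H = sha256(s)
    timeout: int                   # refund becomes possible at this time
    paid_to: Optional[str] = None  # set once the lock is claimed or refunded

    def claim(self, preimage: bytes, now: int) -> str:
        """Recipient takes the funds by revealing s before the timeout."""
        if self.paid_to is not None:
            raise ValueError("lock already settled")
        if now >= self.timeout:
            raise ValueError("lock expired; only a refund is possible")
        if hashlib.sha256(preimage).digest() != self.lock_hash:
            raise ValueError("preimage does not match the hashlock")
        self.paid_to = self.recipient
        return self.paid_to

    def refund(self, now: int) -> str:
        """Sender recovers the funds once the timeout has passed."""
        if self.paid_to is not None:
            raise ValueError("lock already settled")
        if now < self.timeout:
            raise ValueError("timeout not reached")
        self.paid_to = self.sender
        return self.paid_to


# A real secret should be random bytes; a fixed value keeps the sketch readable.
secret = b"illustrative secret s"
H = hashlib.sha256(secret).digest()

# Alice knows s, so her lock carries the longer timeout.
alice_lock = HTLC("alice", "bob", 100, H, timeout=200)
bob_lock = HTLC("bob", "alice", 5, H, timeout=100)

bob_lock.claim(secret, now=50)    # Alice reveals s to take Bob's asset
alice_lock.claim(secret, now=60)  # Bob reuses the now-public s
```

If Alice never reveals s, neither `claim` can succeed and both locks become refundable once their timeouts pass.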

Extend the same idea across a multi-leg path: put the same hashlock on every leg. Reveal s once and the whole path opens; never reveal it and the whole path refunds. There is no stranded "leg one done, leg two pending" state to arbitrate, because there is nothing to arbitrate. No evaluator attests anything. No escrow is "released." No one is chosen, paid, or waited on.
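A simplified sketch of that multi-leg behavior follows. It assumes a single shared clock and the same timeout on every leg, which real cross-chain paths do not have, and the `Leg`, `resolve_leg`, and `resolve_path` names are hypothetical. The point is only that one preimage determines the outcome of every leg.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Leg:
    """One hop of a path; every leg carries the same hashlock."""
    chain: str
    sender: str
    recipient: str
    amount: int
    lock_hash: bytes
    timeout: int


def resolve_leg(leg: Leg, revealed: Optional[bytes], now: int) -> str:
    """Return who can take this leg's funds at `now`, or 'locked'."""
    if (revealed is not None and now < leg.timeout
            and hashlib.sha256(revealed).digest() == leg.lock_hash):
        return leg.recipient   # valid reveal before expiry: leg pays out
    if now >= leg.timeout:
        return leg.sender      # expired without a valid claim: leg refunds
    return "locked"            # still waiting for a reveal or the timeout


def resolve_path(legs: List[Leg], revealed: Optional[bytes], now: int) -> List[str]:
    # The all-or-nothing property depends on one hashlock shared by all legs.
    if len({leg.lock_hash for leg in legs}) != 1:
        raise ValueError("every leg must share the same hashlock")
    return [resolve_leg(leg, revealed, now) for leg in legs]


secret = b"illustrative secret s"
H = hashlib.sha256(secret).digest()
path = [
    Leg("ethereum", "alice", "bob", 100, H, timeout=100),
    Leg("bitcoin", "bob", "carol", 1, H, timeout=100),
]

opened = resolve_path(path, secret, now=10)    # every leg pays its recipient
refunded = resolve_path(path, None, now=100)   # every leg returns to its sender
```

Under these assumptions there is no input for which one leg pays out while another refunds.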

What you get in exchange for deleting the judge:

No trust assumption beyond the hash function and the chains' own liveness. The guarantee is a property of the construction, not the goodwill of a party.
No discretionary failure mode. There is no judge to bribe, mis-select, or knock offline.
MEV-resistance by construction. The preimage reveal is atomic; there's no intermediate state a searcher can sit in the middle of.

And the honest limitation, stated unprompted because a settlement claim should come with its boundaries:

Atomicity can't judge. It verifies that assets moved as locked. It cannot tell you whether an off-chain deliverable was any good. For "was the work acceptable," math has nothing to say.
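Capital locks on every leg at once, there is a free-option problem to manage (the party holding the secret can wait until just before the timeout to decide whether completing the trade is still worth it), and someone has to come online to reveal the secret and finish. You get safety, not guaranteed completion: the path clears or refunds, it never half-settles.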

So which one — judge or math?

The useful answer is not "always math." It is: match the model to the dispute.

If the contested question is "did the assets move as agreed?" — a swap, a payment, an asset-for-asset settlement — an evaluator is overhead you can cryptographically delete. You are paying a judge to rule on something a hashlock already makes impossible to get wrong. Here, math is the stronger default precisely because it removes a component rather than adding one.

If the contested question is "was this off-chain work performed acceptably?" — a graded deliverable, a service, anything subjective — atomicity alone can't close the loop. You need someone, or something, to render judgment. The evaluator earns its trust cost.

The interesting design space is where they compose: cryptographic atomicity as the settlement backend, with an evaluator only over the genuinely subjective slice — so the judge rules on the one thing math can't see, and math handles everything else with no judge at all. ERC-8183's authorized intents could, in principle, settle through an atomic backend rather than a custodial one. These are layers, not rivals.
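One way to picture the layering is as a small routing decision. This is a hypothetical sketch, not a description of any existing system: the `Trade` fields, the layer labels, and `settlement_layers` are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trade:
    """Hypothetical description of what a trade contains."""
    asset_legs: int               # number of on-chain asset transfers
    subjective_deliverable: bool  # is there off-chain work someone must grade?


def settlement_layers(trade: Trade) -> List[str]:
    """Pick the trust components a trade needs, and no more."""
    layers: List[str] = []
    if trade.asset_legs > 0:
        # Asset movement is verifiable on-chain, so a hashlock covers it.
        layers.append("atomic-hashlock")
    if trade.subjective_deliverable:
        # Only the judgment call gets a judge.
        layers.append("evaluator")
    if not layers:
        raise ValueError("trade has nothing to settle")
    return layers


swap = settlement_layers(Trade(asset_legs=2, subjective_deliverable=False))
paid_work = settlement_layers(Trade(asset_legs=1, subjective_deliverable=True))
```

A plain swap never touches an evaluator, and a graded deliverable adds one only alongside the atomic payment leg.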

How we think about it

We build the math half. Hashlock fuses sealed-bid RFQ with HTLC atomic settlement so a whole agent trade — including a multi-leg, multi-chain path — clears as one unit or refunds as one, with no custodian and no evaluator in the settlement path. We don't claim it can grade subjective work; it can't, and we'd point you at an evaluator model for that. What it can do is make the asset-movement half of agent commerce trustworthy without adding a party you have to trust.

Stated plainly, because chain status should never be fuzzy: Ethereum mainnet is live end-to-end. Bitcoin HTLCs are signet-validated, mainnet pending. Sui contracts are deployed and CLI-tested, with gateway wiring in progress. We don't call Sui or BTC "live" until they are.

How it works, and the 6 MCP tools: https://hashlock.markets/about/?utm_source=devto&utm_medium=post&utm_campaign=2026-06-04-judge-or-math

The MCP server and docs: https://hashlock.markets/docs/?utm_source=devto&utm_medium=post&utm_campaign=2026-06-04-judge-or-math

The formal version (SSRN): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6712722

Your turn

Draw the line for the trades you're building: where does your agent genuinely need a judge to rule on something subjective, and where is the evaluator just a trust assumption you could cryptographically remove? I'd like to know where you'd put the boundary.