In my last post, I talked about the metadata problem — how deciding what to expose through MCP was harder than the protocol itself. I wrote about the token tax, about how tool definitions eat context before the model does anything useful.
I took that lesson to heart. Our MCP responses became minimal. {"ok": true} with a reference to what was created. Lean. Token-efficient. Exactly what I thought agents needed.
I was wrong about who I was optimising for.
The Trace
We’ve been testing two Python agents against Flowplane’s MCP server — an ops agent for diagnostics and a dev agent for deployments. I gave the dev agent a straightforward task:
Deploy httpbin at localhost:8001 on path /v1/api/users -> /anything at port 10018
Here’s what the agent did:
-> dev_preflight_check({"path":"/v1/api/users","listen_port":10018,"cluster_name":"httpbin-svc"})
OK: port available, no name conflicts
-> cp_create_cluster({"name":"httpbin-svc","endpoints":[{"address":"localhost","port":8001}]})
OK: {"ok":true,"ref":{"type":"cluster","name":"httpbin-svc","id":"a3f1..."}}
-> cp_create_route_config({"name":"httpbin-routes","virtual_hosts":[
{"name":"httpbin-vhost","domains":["*"],"routes":[
{"name":"httpbin-route","match":{"prefix":"/v1/api/users"},"route":{"cluster":"httpbin-svc","prefix_rewrite":"/anything"}}
]}
]})
OK: {"ok":true,"ref":{"type":"route_config","name":"httpbin-routes","id":"b7e2..."}}
Three calls in, the agent had done everything right. Correct creation order, good naming, inline virtual hosts and routes in a single call. The reasoning was flawless.
Then it fell apart.
-> cp_list_virtual_hosts({"route_config_id":"b7e2..."})
OK: [{"name":"httpbin-vhost","id":"c4d8..."}]
-> cp_get_virtual_host({"id":"c4d8..."})
OK: {"name":"httpbin-vhost","domains":["*"],...}
-> cp_list_routes({"virtual_host_id":"c4d8..."})
OK: [{"name":"httpbin-route","id":"d9a1..."}]
-> cp_get_cluster({"name":"httpbin-svc"})
OK: {"name":"httbin-svc","endpoints":[...]}
-> cp_list_listeners({"route_config":"d9a1..."})
OK: []
Five verification calls. The agent was checking its own work — confirming that the virtual hosts, routes, and cluster it just created actually existed. It already knew they did. It created them. But it didn’t trust the response enough to move on.
Worse, look at that last call. It passed a route UUID where a route_config name was expected. Got zero results. And then it stopped. Never created the listener.
End result: cluster exists, route config exists with proper virtual hosts and routes — all sitting in the database, correctly configured. But the agent concluded the deployment failed because it couldn’t verify its own work.
The Wrong Kind of Efficient
When I wrote about the token tax in the first post, I was thinking about tool definitions — the schemas and descriptions that eat context before anything happens. So we made our responses lean too. Create a route config with inline virtual hosts and routes? Here’s your confirmation:
{"ok": true, "ref": {"type": "route_config", "name": "httpbin-routes", "id": "b7e2..."}}
A human developer sees that and knows exactly what happened. They’ve read the docs, they understand the API contract, and they’ve written application code that handles this response based on that understanding. The response doesn’t need to be self-explanatory — the knowledge lives in the consuming application code, not in the payload.
An agent has none of that. It sees ok: true and a reference to the top-level object. Were the inline virtual hosts created? How many routes ended up in the database? Is the route config ready to be attached to a listener, or does it need more configuration? The response doesn’t say.
So the agent does what any reasonable system would do when it lacks confidence: it investigates. It calls cp_list_virtual_hosts to confirm they exist. It calls cp_get_virtual_host to check the details. It calls cp_list_routes to verify the routes landed. Each call burns tokens and introduces another point where things can go sideways — like passing a route UUID to an endpoint that expected a route_config name, a consistency gap in our API surface that deserves its own post.
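That silent empty result could have been a loud, actionable error instead. Here's a minimal sketch of server-side guarding against the ID-where-name-expected mistake — the function name, error shape, and ID heuristic are all assumptions for illustration, not Flowplane's actual API:

```python
import re

# Heuristic: a string of (possibly dash-separated) hex runs, at least
# 8 characters long, looks like a generated ID rather than a
# human-chosen name. It can false-positive on all-hex names, so treat
# it as a hint, not a validator.
UUID_LIKE = re.compile(r"^[0-9a-f]{8,}(-[0-9a-f]+)*$")

def list_listeners(route_config: str, db: dict) -> dict:
    """Stand-in for a cp_list_listeners handler; db maps route_config
    *names* to their listeners."""
    if UUID_LIKE.match(route_config):
        # Refuse explicitly instead of silently returning an empty list.
        return {
            "ok": False,
            "error": (
                f"'{route_config}' looks like an ID, but this endpoint "
                "expects a route_config name. Fetch the object by ID "
                "first to recover its name."
            ),
        }
    return {"ok": True, "listeners": db.get(route_config, [])}
```

An empty list and a type mismatch are indistinguishable to an agent; a targeted error message gives it something to recover from.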
I was optimising for the wrong consumer. Token-efficient responses are great when your consumer already has the mental model. When your consumer is building the mental model from your responses alone, brevity becomes ambiguity.
What the Agent Actually Needed
After studying the trace, the fix was straightforward. Not more data — more relevant data:
{
  "ok": true,
  "ref": {"type": "route_config", "name": "httpbin-routes", "id": "b7e2..."},
  "created": {
    "virtual_hosts": 1,
    "routes": 1
  },
  "next_step": "Create a listener referencing route_config 'httpbin-routes' with cp_create_listener"
}
Three additions:
Confirmation of nested effects. The created field tells the agent that its inline virtual hosts and routes were actually persisted. No verification calls needed.
Next step guidance. The next_step field tells the agent what to do now. This sounds hand-holdy, but agents don’t have muscle memory. A DevOps engineer who’s deployed twenty services knows the listener comes next. An agent running this workflow for the first time — or the hundredth time with a blank context window — doesn’t.
Names, not just IDs. Notice the next step says route_config 'httpbin-routes', not route_config 'b7e2...'. Our agent knew the name "httpbin-routes" because it chose that name. But the response ecosystem kept handing back UUIDs, and the agent started using those instead. When it passed a UUID to an endpoint expecting a name, it got zero results and assumed failure.
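The three additions are cheap to produce because everything they report is already in hand at creation time. A sketch of what a create handler might look like — the function name, store, and ID scheme are hypothetical stand-ins, not Flowplane's implementation:

```python
import uuid

def create_route_config(payload: dict, store: dict) -> dict:
    """Persist a route config with inline virtual hosts and routes,
    then return a response the agent can act on without re-reading."""
    name = payload["name"]
    vhosts = payload.get("virtual_hosts", [])
    routes = [r for vh in vhosts for r in vh.get("routes", [])]

    record = {"id": uuid.uuid4().hex, "name": name, "virtual_hosts": vhosts}
    store[name] = record  # stand-in for the real database write

    return {
        "ok": True,
        "ref": {"type": "route_config", "name": name, "id": record["id"]},
        # Confirmation of nested effects: counts of what was persisted,
        # so the agent never needs to list them back.
        "created": {"virtual_hosts": len(vhosts), "routes": len(routes)},
        # Next-step guidance, phrased with the *name*, not the UUID.
        "next_step": (
            f"Create a listener referencing route_config '{name}' "
            "with cp_create_listener"
        ),
    }
```

The counts and the next step come from data the handler already touched; no extra queries, no extra schema.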
The Verification Loop
There’s a pattern here worth naming: the verification loop. An agent creates a resource, gets a minimal acknowledgment, then spends 3-5 additional calls confirming what it just did. Each call costs tokens. Each call introduces a chance for ID/name confusion or hitting unexpected edge cases. And the information was available at creation time — we just didn’t return it.
In our trace, the verification loop consumed more tokens than the actual deployment. Five GET calls to verify three creates. The agent was doing more reading than writing, and all of it was unnecessary.
The fix isn’t to prevent agents from making verification calls. It’s to make them unnecessary. If your create response confirms what was created, includes the side effects, and points to the next step, the agent has no reason to look back.
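From the agent's side, "no reason to look back" can be an explicit check rather than a vibe. A sketch of that decision, assuming the enriched field names ("ok", "created") — your server's shape may differ:

```python
def needs_verification(response: dict, expected: dict) -> bool:
    """Return True only if the create response leaves the agent unsure.

    `expected` maps nested-resource kinds to the counts the agent
    asked to create, e.g. {"virtual_hosts": 1, "routes": 1}.
    """
    if not response.get("ok"):
        return True  # explicit failure: worth investigating
    created = response.get("created")
    if created is None:
        return True  # bare ack: the agent has to go look
    # Every nested effect the agent requested should be confirmed.
    return any(created.get(kind, 0) < count for kind, count in expected.items())
```

With a bare {"ok": true} acknowledgment this returns True and the five-call loop begins; with the enriched response it returns False and the agent moves straight to the listener.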
The Tension
There’s a real tension here that I don’t think has a clean answer yet.
Token efficiency says: return less. Every byte in the response is a byte the model has to process. Keep it lean.
Agent confidence says: return more. Every ambiguity in the response triggers verification behaviour. The agent will spend those tokens anyway — either reading your response or making follow-up calls. Follow-up calls cost more.
My current thinking: responses should be informationally dense but structurally simple. A flat created object with counts is cheaper than the agent making three list calls. A one-line next_step string is cheaper than the agent reasoning about workflow ordering from scratch. You’re not adding bloat — you’re preventing it downstream.
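The trade is easy to see with back-of-envelope numbers. All three constants below are deliberately made up for illustration; plug in measurements from your own tokenizer and tools:

```python
# Illustrative only: assumed token costs, not measured figures.
LEAN_RESPONSE = 25   # a bare {"ok": true, "ref": ...} acknowledgment
RICH_RESPONSE = 70   # the same plus "created" counts and "next_step"
VERIFY_CALL = 120    # one follow-up call: request, schema, and result

def total_cost(response_tokens: int, verification_calls: int) -> int:
    """Tokens spent on the create response plus any follow-up reads."""
    return response_tokens + verification_calls * VERIFY_CALL

lean_total = total_cost(LEAN_RESPONSE, 5)   # the trace: five GETs follow
rich_total = total_cost(RICH_RESPONSE, 0)   # confident agent: zero GETs
# lean_total = 625, rich_total = 70
```

Even if the real constants are off by a factor of two in either direction, the richer response wins whenever it prevents a single follow-up call.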
What I Didn’t Expect
In the first post, I described a two-layer challenge: tool design (what to expose) and metadata design (how to describe it). I’d now add a third: response design — what you send back after the tool runs.
REST has conventions for status codes and resource representations. GraphQL lets clients specify what they want back. But neither tradition accounts for a consumer that needs to build confidence about what just happened and decide what to do next, all from a single response.
API design for agents will become its own discipline. It borrows from REST, from GraphQL, from conversational UI design — but it’s not quite any of them. The consumer isn’t rendering a page or populating a cache. It’s making a decision. Your response is the input to that decision.
We’re still early in figuring this out. The pattern I’ve landed on — confirmation of effects plus next-step guidance — works for our deployment workflows. Whether it generalizes, I don’t know yet. But the principle feels right: _design your responses for the consumer that has to reason about them, not the one that already knows what they mean._
If you're building MCP servers and running into similar patterns — or if you've found different solutions — I'd love to hear about it. You can find Flowplane at github.com/rajeevramani/flowplane, or connect with me on LinkedIn.