YHH
The boring secret to a cheap AI coding agent — a byte-stable prompt prefix

I tried using Claude Code for a few weeks and quietly stopped. Not because it was bad — it's great — but because I'd hesitate before kicking off any non-trivial task, doing the math on what a 30-minute debugging session would cost. That's the wrong incentive for an agent. The whole pitch of "let it run" depends on the bill not scaring you.

So I went the other direction. I built Reasonix — same shape as Claude Code or Aider, runs in your terminal, plan mode, tool calls, MCP — but it only talks to DeepSeek, and the entire loop is engineered around one invariant:

The prompt prefix must be byte-identical to the previous turn's prefix.

That's it. That's the whole architectural constraint. Everything else falls out of it.

This post is about why that constraint matters, what silently breaks it, and what the loop ends up looking like when you take it seriously.

The mechanic

DeepSeek's API has prefix caching. If the first N tokens of your request match a recent request byte-for-byte, those tokens are billed at roughly 1/10th the normal input price. Cache TTL is generous — minutes — long enough that within a single conversation turn you basically always hit it if you didn't break the prefix.

Most providers have something like this now. What's different is the price ratio and the granularity. On DeepSeek, if your agent is built right, every turn after the first is mostly cached, and a long session costs cents instead of dollars.
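The arithmetic behind "mostly cached" is worth seeing once. A sketch, using the ~1/10th hit/miss ratio from above — the token counts are made up, and real pricing also involves output tokens, which caching doesn't touch:

```typescript
// Illustrative arithmetic only: the 10x hit/miss ratio comes from the
// caching mechanic described above; the token counts are invented.
function relativeInputCost(totalTokens: number, hitRate: number): number {
  const hitTokens = totalTokens * hitRate;    // billed at ~1/10th
  const missTokens = totalTokens - hitTokens; // billed at full price
  return hitTokens / 10 + missTokens;
}

// A 90% cache hit rate cuts input cost to 19% of full price:
console.log(relativeInputCost(100_000, 0.9)); // 19000 (vs 100000 uncached)
```

The curve is steep near the top: going from a 90% to a 99% hit rate nearly halves the bill again, which is why the silent cache-breakers below matter so much.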

The catch: "built right" turns out to be load-bearing.

What silently breaks the cache

These are the things I found that quietly destroy your cache hit rate. None of them throw errors. Your agent works fine. You just pay full price every turn and don't know why.

1. Non-deterministic JSON.stringify of tool schemas

// Looks fine. Is poison.
const toolSchema = JSON.stringify({
  name: "edit_file",
  parameters: someObject,
});

JSON.stringify serializes keys in insertion order, and insertion order is exactly the fragile part: someObject may have been built from an object spread, a merge, or a parsed config whose key order varies run to run. One reordered key — {path, content} vs {content, path} — and the cached prefix is gone.

Fix: serialize tool schemas with a deterministic stringifier (sorted keys), and freeze the output. Once at startup, never re-serialize per turn.
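A minimal sorted-key stringifier looks like this — a sketch, not Reasonix's actual code, and it deliberately skips cycles, Dates, and other exotica:

```typescript
// Recursively serialize with keys in sorted order, so the output bytes
// no longer depend on insertion order. Sketch only: no cycle handling.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k]))
      .join(",");
    return "{" + body + "}";
  }
  return JSON.stringify(value);
}

// Serialize once at startup; never re-serialize per turn.
const toolSchema = stableStringify({
  name: "edit_file",
  parameters: { content: "string", path: "string" },
});
// -> {"name":"edit_file","parameters":{"content":"string","path":"string"}}
```

Two objects built in different insertion orders now produce identical bytes, which is the whole point.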

2. Timestamps or run IDs in the system prompt

const systemPrompt = `You are a coding agent. Session: ${sessionId}. Started: ${now}.`;

Looks harmless. Destroys cache on every single turn because the prefix differs every run.

Fix: nothing variable goes into the system prompt. Session metadata, if you really need the model to see it, goes into the first user message — which is fine because that's already turn-1-only content.
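In code, the split looks something like this — the message shapes and names here are hypothetical, not Reasonix's actual types:

```typescript
// The system prompt is a frozen constant: no template variables, ever.
const SYSTEM_MESSAGE = Object.freeze({
  role: "system" as const,
  content: "You are a coding agent. Use the provided tools to read and edit files.",
});

// Anything variable rides in the first user message, which is
// turn-1-only content and therefore never perturbs later prefixes.
function firstUserMessage(task: string, sessionId: string) {
  return { role: "user" as const, content: `Session: ${sessionId}\n\n${task}` };
}
```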

3. Re-rendering tool results with one whitespace difference

This one is sneaky. You call a tool, get a result, format it into a message, send it back. Next turn, you re-render the same tool result from your event log — but a pretty-printer adds a trailing newline this time, or strips one. Cache gone.

Fix: format tool results once, store the exact rendered string, append-only. Never re-derive past content from upstream sources.
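The discipline in miniature — hypothetical names, not Reasonix's actual internals; the rendered string is computed exactly once, at receipt, and replayed verbatim:

```typescript
// Append-only log of exact rendered strings.
const history: string[] = [];

function recordToolResult(toolName: string, rawResult: unknown): void {
  const rendered = `[${toolName}]\n${JSON.stringify(rawResult)}`; // format once
  history.push(rendered); // store the exact bytes that will be sent
}

function buildPrompt(): string {
  // Replays stored strings; no pretty-printer ever touches past content.
  return history.join("\n");
}
```

The key property: every later prompt starts with the bytes of every earlier one, because past entries are never re-rendered.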

4. In-place edits to message history

Summarization, truncation, "let me clean up this old turn so we fit in the context window" — every one of these mutates the prefix. Even if the new shortened version is what you want the model to see, the cache was built against the old version.

Fix: history is append-only. If you need to compress old context, do it as a new turn ("here's a summary, ignore turns 1–8"), not as an in-place edit.
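Compaction-as-append can be sketched like this — hypothetical shapes, not the actual /compact implementation:

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Old turns stay byte-identical (the cache still covers them); the
// summary rides in as a brand-new turn the model is told to prefer.
function compact(history: Message[], summary: string, upTo: number): Message[] {
  return [
    ...history,
    { role: "user", content: `Summary of turns 1-${upTo}; ignore the originals: ${summary}` },
  ];
}
```

You pay input tokens for the old turns either way, but at the cached rate — which on this pricing is usually cheaper than the full-price miss an in-place rewrite would trigger.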

5. Switching tool definitions mid-session

Adding a tool, removing one, even reordering the tools array — all of these change the system message that includes the tool schemas, which is part of the prefix.

Fix: pin the tool set at session start. If you need dynamic tools, accept the cache miss as a deliberate event and surface it in your cost dashboard.

What doesn't break it

For completeness, things I worried about that turned out to be fine:

  • Streaming vs non-streaming — same prefix, same cache.
  • Function-calling format vs JSON-tool-call format — pick one and stick with it; either works.
  • Adding new turns — obviously, that's the entire point. Append is free.
  • Tool result content changing turn-to-turn — only the formatting of past results matters. Future tool calls returning different data is normal and doesn't touch the prefix.

What the loop looks like when you take this seriously

Reasonix's loop is built backward from the byte-stability requirement:

Turn N+1 prefix = Turn N prefix + (assistant turn N) + (user turn N+1)
                                  ↑                    ↑
                          rendered once,        rendered once,
                          stored as string      stored as string

That's the whole shape. There is no code path anywhere in the loop that re-derives past content from upstream sources at request time. Past content is strings, in an array, appended to. Period.
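If you're building this yourself, the invariant is cheap to enforce at runtime. This guard is my own suggestion, not a documented Reasonix feature: assert that each outgoing prompt starts with the previous one byte-for-byte, so a prefix break surfaces at the source instead of as a mystery bill.

```typescript
let lastPrompt = "";

// Call on every outgoing request body before sending it.
function assertPrefixStable(prompt: string): void {
  if (!prompt.startsWith(lastPrompt)) {
    throw new Error("prompt prefix changed: expect a full-price cache miss");
  }
  lastPrompt = prompt;
}
```

Any of the five breakers above — a reordered key, a fresh timestamp, a stray newline — trips this immediately.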

A few specific design consequences:

  • System prompt is a constant. Compiled at startup, frozen. No template variables.
  • Tool schemas are serialized once with a sorted-key stringifier and concatenated into the system message.
  • Tool results are formatted at the moment of receipt and stored as the exact bytes that will be sent. The loop replays bytes, not objects.
  • There is no summarization step in the main loop. When context gets large, the user can /compact explicitly — which is a new turn containing a summary, not an in-place rewrite.
  • Permissions / plan mode / hooks all operate on what the user sees, never on what gets sent to the model.

This is more restrictive than a generic agent framework wants to be. That's the point. The constraint is the feature.

What this gets you

On a long debugging session — say, an hour of back-and-forth on a real codebase, 50–80 turns, lots of tool calls — Reasonix bills come in around 5–15 cents depending on how much code the model reads. The same session through a non-cache-aware framework on DeepSeek would be roughly $1–$3. Through Claude (Sonnet) it'd be $5–$15.

The cheapness isn't the goal — the goal is changing the posture. When a session costs cents, you stop curating prompts and start delegating real chunks of work. You leave it running while you go to lunch. That's the whole user-experience shift.

Things I'm honest about

  • It's DeepSeek-only. That's a feature, not a bug — every layer is tuned to one provider's cache mechanic. But if DeepSeek goes down, you're down. I think the cost ratio is worth it; you may not.
  • DeepSeek is a Chinese provider. Some companies can't use it. That's a real constraint, not something I can engineer around.
  • R1's quality on agentic tool-use is a notch below Sonnet. Closer than you'd expect, but it's there. The cost ratio still wins for me.
  • The prefix-stability discipline is contagious — once you've enforced it in the loop, you start noticing every place else in your stack that quietly mutates state for no reason.

Try it

npx reasonix code

Repo: github.com/esengine/reasonix

MIT, TypeScript, Node 22+. Works on macOS, Linux, Windows (PowerShell, Git Bash, Windows Terminal).

Architecture writeup with the four-pillar breakdown is in docs/ARCHITECTURE.md.

If you've measured cache hit rates in your own agent setup — generic framework or otherwise — I'd genuinely like to see numbers. The thing I can't tell from the outside is whether everyone is silently eating full-price tokens, or whether some of the popular frameworks have quietly fixed this and I missed it.
