The token tax nobody budgets for — and why it hits the tightest budgets hardest

#ai #productivity #opensource #machinelearning

Most write-ups about AI agents are about prompts, tools, and evals. Almost none are about the line item that quietly dominates a real deployment: the context tokens you pay for on every single turn.

Here is the mechanic. A typical agent loop re-sends the whole conversation so far on each step, so the model can "remember" what happened. Turn 1 sends a little. Turn 20 re-sends everything from turns 1–19 again. Across a session, the input-token cost does not grow linearly with the amount of work done; it grows roughly O(N²) in the number of turns.
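
To make the resend arithmetic concrete, here is a minimal sketch. It assumes every turn adds the same number of tokens to the transcript, which real sessions will not do exactly; the function name and the 500-token figure are illustrative, not taken from the benchmark.

```python
def full_resend_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the whole transcript so far.

    Assumption: each turn adds a constant `tokens_per_turn` to the transcript.
    """
    total = 0
    history = 0  # tokens accumulated in the transcript so far
    for _ in range(turns):
        history += tokens_per_turn  # this turn's new content joins the transcript
        total += history            # the entire transcript is sent as input
    return total


if __name__ == "__main__":
    # Doubling the number of turns roughly quadruples the input-token total.
    for n in (10, 20, 40):
        print(n, full_resend_input_tokens(n, tokens_per_turn=500))
```

Under that assumption the total is tokens_per_turn × N(N+1)/2, which is where the N² comes from.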

On a generous budget you might never notice. If you are a solo builder in Nairobi or a small team in Lagos or Accra paying for every token in hard currency, you notice on day one: the bill tracks the length of the work, not its value. A task that runs for an afternoon can cost more than the feature it shipped — and a product that has to run that loop for thousands of users multiplies the same waste across every one of them.

Measure it before you trust anyone — including this post

There is an open, offline benchmark that counts exactly this. It models a realistic coding-assistant session across three sittings and counts input/context tokens under two strategies — re-sending the full transcript every turn, versus recalling a small, bounded set of memory cells each turn:

git clone https://github.com/citw2/saihm-token-benchmark
cd saihm-token-benchmark && npm install
node benchmark.mjs

It runs fully offline, no API key, tokenizing with gpt-tokenizer (cl100k_base). On the bundled scenario it reports 62.8%–85.9% fewer context tokens with bounded recall, and the gap widens the longer the session runs. Change --recall-cap and watch the trade-off move. The point is not the headline number — it is that you can reproduce it on your own session instead of taking a vendor's word for it.

The fix: recall a bounded set, don't replay the transcript

The expensive habit is treating the whole conversation as the agent's memory. The cheaper design is to keep durable facts — decisions, conventions, file paths, the things you actually need later — as separate memory cells, and recall only a small capped set each turn. That turns the quadratic resend into roughly O(N · cap): cost grows with the work, not with how long the transcript has gotten.
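
A rough cost model shows the difference between the two shapes. This is a sketch under stated assumptions, not the benchmark's method: one durable cell is written per earlier turn, every cell has the same size, and the default numbers (500 new tokens per turn, 60 tokens per cell, a cap of 8) are made up for illustration.

```python
def replay_tokens(turns: int, new_tokens: int) -> int:
    """Input tokens when turn k re-sends all k turns of transcript."""
    return sum(k * new_tokens for k in range(1, turns + 1))


def bounded_recall_tokens(turns: int, new_tokens: int,
                          cell_tokens: int, cap: int) -> int:
    """Input tokens when each turn sends its new content plus at most `cap` cells.

    Assumption: one memory cell exists per earlier turn, each `cell_tokens` long.
    """
    total = 0
    for k in range(1, turns + 1):
        recalled = min(k - 1, cap)  # cannot recall more cells than exist
        total += new_tokens + recalled * cell_tokens
    return total


def savings(turns: int, new_tokens: int = 500,
            cell_tokens: int = 60, cap: int = 8) -> float:
    """Fraction of input tokens saved by bounded recall versus full replay."""
    full = replay_tokens(turns, new_tokens)
    bounded = bounded_recall_tokens(turns, new_tokens, cell_tokens, cap)
    return 1 - bounded / full


if __name__ == "__main__":
    for n in (5, 20, 40):
        print(n, round(savings(n), 3))
```

In this model the bounded total never exceeds N × (new_tokens + cap × cell_tokens), so it stays linear in the number of turns. Raising the cap raises that ceiling, which is the trade-off to tune against your own sessions.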

This is the idea behind SAIHM, a sovereign memory layer any MCP-capable AI client can call. Durable facts live as encrypted cells you hold the keys to; each turn pulls a bounded working set instead of replaying history. Because the memory is addressed through an open protocol, the same store works whether you are calling Claude, GPT, DeepSeek, Qwen, Kimi, or GLM — useful if you switch models to chase a better price-per-token, which on a tight budget you will.
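
What "pulling a bounded working set" might look like in the simplest possible form is sketched below. This is not SAIHM's API or its ranking method: `MemoryCell` and `recall` are hypothetical names, the cells are plain unencrypted text, and the word-overlap scoring is a stand-in for whatever relevance measure a real memory layer uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryCell:
    """Hypothetical durable fact: a short key plus the text to recall."""
    key: str
    text: str


def recall(cells: list[MemoryCell], query: str, cap: int) -> list[MemoryCell]:
    """Return at most `cap` cells, ranked by word overlap with the query.

    Cells that share no words with the query are never returned.
    """
    query_words = set(query.lower().split())
    scored = []
    for cell in cells:
        overlap = len(query_words & set(cell.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, cell))
    # sort() is stable, so ties keep their original insertion order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cell for _, cell in scored[:cap]]


if __name__ == "__main__":
    cells = [
        MemoryCell("db", "we use postgres for the orders service"),
        MemoryCell("style", "python code follows black formatting"),
        MemoryCell("path", "orders service lives in services/orders"),
    ]
    for cell in recall(cells, "where is the orders service", cap=2):
        print(cell.key, "->", cell.text)
```

The cap is what keeps per-turn input bounded: however many cells accumulate over a session, the prompt only ever carries `cap` of them.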

Why the tight-budget case is the strongest case

Two things compound for builders working against hard-currency API costs:

Every token is FX. A cut in context tokens of the size the benchmark reports (62.8%–85.9% on its bundled scenario) is not a rounding error when the bill is denominated in a currency your revenue is not.
You are often building for scale on small margins. The next billion users are coming online on agents that have to be cheap to run per interaction. Re-sending the transcript per user, per turn, is the opposite of cheap.

The same property that makes memory cheaper also makes it portable and erasable: you hold the key, a delete destroys that key and is provable on a public chain, and you can share a single record with another agent and revoke it. But the budget case stands on its own — flatten the O(N²) curve and the rest is upside.

Try it without spending anything to find out

The benchmark above is one asset; the runnable demos are the other. The demos let you ground a memory you own in every major model and then prove you can erase it. Each one runs offline in about a minute, with no account needed:

Demos: https://citw2.github.io/saihm-demos/
Benchmark: https://github.com/citw2/saihm-token-benchmark

SAIHM itself is a paid product with no free tier — stated up front rather than buried behind a trial. But the benchmark and the demos are open source and run locally, so you can verify the claim and try the integration before deciding anything.

Independence notice: SAIHM is an Apache-2.0 protocol authored independently. It is not affiliated with OpenAI, Anthropic, Google, or any AI client vendor. The benchmark is open source and reproducible offline; the figures are produced by the published script and depend on session length and scenario.