DEV Community: Azard Tennant-Hosein

Persistent memory for Ollama, in about five minutes

Azard Tennant-Hosein — Sat, 27 Jun 2026 07:14:26 +0000

Originally published on the Sieve blog. Sieve is an open-source (Apache 2.0) context-reduction proxy — I work on it. This is a how-to, not a pitch; the steps work against any Ollama setup.

Ollama gives you a local LLM endpoint that is fast, private, and completely stateless. Close the chat, and everything you told the model is gone. Keep the chat open, and every turn re-sends a growing history until the context window fills up. Ask a local model about something it was never told, and — depending on the model — it may simply make something up.

This guide adds a persistent, encrypted memory to any Ollama setup using Sieve, without changing your client code beyond one URL.

The shape of the problem

Three separate annoyances show up when you run agents or long-lived chats against a local model, and they have a common root.

Nothing survives the session. Tell your assistant on Monday that you prefer Python and your deploy target is a Raspberry Pi, and on Tuesday it knows neither. The model has no state; the application has to carry all of it, every time.

The payload only grows. The standard workaround is to re-send history: system prompt, tool schemas, every prior turn, on every request. I measured the consequences of that pattern in The hidden cost of context — the short version is that per-turn cost grows with conversation length, and on local hardware that growth comes out of your tokens-per-second.
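To make that growth concrete, here is a small arithmetic sketch. The figures (a 1,500-token fixed block of system prompt and tool schemas, 200 tokens added per turn) are illustrative assumptions, not measurements from the linked post.

```python
def tokens_sent_on_turn(turn: int, fixed: int, per_turn: int) -> int:
    """Input tokens for one request when the full history is re-sent.

    `turn` is 1-based; the text of every earlier turn rides along again.
    """
    return fixed + per_turn * turn


def cumulative_tokens(turns: int, fixed: int, per_turn: int) -> int:
    """Total input tokens over a conversation of `turns` requests."""
    return sum(tokens_sent_on_turn(t, fixed, per_turn) for t in range(1, turns + 1))


# Assumed workload: 1,500 fixed tokens, 200 new tokens per turn.
for n in (1, 10, 50):
    print(n, tokens_sent_on_turn(n, 1500, 200), cumulative_tokens(n, 1500, 200))
```

With these assumed numbers, turn 50 alone sends 11,500 input tokens and the conversation as a whole has sent 330,000. Per-turn cost grows linearly, so the running total grows quadratically.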

Absence becomes fabrication. When a question falls outside the context you did send, smaller models in particular tend to answer anyway. A model that was never told your colleague's name will, often enough, invent one.

The common root: the endpoint is stateless and the burden of memory falls on whatever sits in front of it. Most memory frameworks ask you to adopt an SDK and call add()/search() yourself. The approach here is different — put the memory in the traffic path, so the client stays unchanged. I wrote up why I prefer the proxy shape in Why Sieve.

What you'll end up with

your client ──► Sieve (127.0.0.1:11435) ──► Ollama (127.0.0.1:11434)
                 │
                 └── encrypted store at ~/.sieve/memory.db

Sieve speaks Ollama's native /api/chat as well as the OpenAI-compatible /v1/chat/completions, so anything that can talk to Ollama can talk to Sieve. On each turn it strips repeated instructions, tool schemas, and stale history from the outbound payload; learns durable facts from the conversation; and injects the relevant ones back in when a later turn actually needs them. The reply comes back to your client unchanged.

Step 1 — install Sieve

You need Python 3.11+ and a running Ollama. The recommended installer is pipx:

pipx install llm-sieve
sieve --version   # sieve, version 1.0.0 or later

Then run the guided setup:

sieve-install

If Ollama is running on 127.0.0.1:11434, the installer auto-detects it, shows you the models you already have pulled, downloads a ~50 MB embedding model (one-time), creates the encrypted store, and offers to start the proxy — with optional autostart on reboot. For a scripted, no-prompts install:

sieve-install --no-input \
  --provider http://127.0.0.1:11434 \
  --model qwen3.5:9b
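If you want to confirm by hand which model names are available before passing one to --model, Ollama's GET /api/tags endpoint lists the models you have pulled. The sketch below is a manual check using only the standard library; it is not a description of what sieve-install does internally, and the function names are my own.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body of Ollama's GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_ollama_models(base_url: str = "http://127.0.0.1:11434") -> list[str]:
    """Ask a running Ollama which models are pulled.

    Raises urllib.error.URLError if nothing is listening at base_url.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling list_ollama_models() against a running Ollama returns a list of names such as the qwen3.5:9b used in this guide.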

Step 2 — move one URL

Sieve listens on 11435 — deliberately one port up from Ollama's 11434. Wherever your client points at Ollama, point it at Sieve instead.

Ollama-native clients:

export OLLAMA_HOST=http://127.0.0.1:11435

OpenAI-compatible clients:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11435/v1",  # was: http://127.0.0.1:11434/v1
    api_key="not-used-by-sieve",           # still forwarded upstream
)

Or just curl it:

curl http://127.0.0.1:11435/api/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Hi, my name is Alex and I work on embedded firmware."}],
    "stream": false
  }'

Model names, request shapes, response formats, streaming — all unchanged. The client does not know Sieve exists.
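For scripts that would rather not shell out to curl or depend on a client library, the same request can be sketched with the Python standard library. The function names are hypothetical; the URL, model name, and request body mirror the curl call above.

```python
import json
import urllib.request


def build_chat_request(content: str, model: str = "qwen3.5:9b",
                       base_url: str = "http://127.0.0.1:11435") -> urllib.request.Request:
    """Build a non-streaming POST to the native /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(content, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```

Passing base_url="http://127.0.0.1:11434" sends the identical request straight to Ollama, which is a quick way to compare the two paths.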

Step 3 — prove it's working

Two built-in commands, both sandboxed (they never touch your real store):

sieve demo

runs a short scripted conversation: an identity introduces itself, shares facts, asks for them back, and closes with a question about a person who was never mentioned. What you want to see: recall hits on the seeded facts, and a refusal — not a fabrication — on the trap question.

sieve benchmark

sends the same 15 messages directly to your model and through Sieve, then prints a delta table: tokens in vs out, facts learned, response times, and the trap verdict. Five to ten minutes depending on your hardware, and the numbers are yours rather than mine.

Watching it work

Every response Sieve touches carries diagnostic headers, so you don't have to take the proxy's behaviour on faith:

Header	What it tells you
`X-Sieve-Inbound-Tokens`	Payload size before the trim
`X-Sieve-Outbound-Tokens`	Payload size actually sent to Ollama
`X-Sieve-Phase`	`OBSERVE` / `ACCUMULATE` / `ACTIVATE`
`X-Sieve-Fact-Count`	Facts in the store right now
`X-Sieve-Proxy-Us`	Sieve's own overhead, in microseconds

The inbound/outbound pair is the one to watch first: it's the per-request answer to "is this actually doing anything?" The full list is in the diagnostic headers reference.
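A small helper can turn those headers into a per-request summary. This is a sketch that assumes the token and microsecond headers carry plain integer strings; the header names come from the table above, while the function name and the sample values in the usage line are made up for illustration.

```python
def summarise_sieve_headers(headers: dict[str, str]) -> dict:
    """Summarise Sieve's diagnostic headers for one response.

    Lookup is case-insensitive because HTTP header names are.
    Assumes the numeric headers are plain integer strings.
    """
    h = {k.lower(): v for k, v in headers.items()}
    inbound = int(h["x-sieve-inbound-tokens"])
    outbound = int(h["x-sieve-outbound-tokens"])
    saved = inbound - outbound
    return {
        "inbound": inbound,
        "outbound": outbound,
        "saved": saved,
        "saved_pct": round(100 * saved / inbound, 1) if inbound else 0.0,
        "phase": h.get("x-sieve-phase"),
        "proxy_ms": int(h.get("x-sieve-proxy-us", 0)) / 1000,
    }
```

For example, a response with X-Sieve-Inbound-Tokens: 4000 and X-Sieve-Outbound-Tokens: 1000 (hypothetical values) would report 3,000 tokens saved, or 75.0%.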

One thing to expect: the first few turns feel like pass-through. Sieve activates progressively — it observes before it accumulates, and accumulates before it actively trims and injects. X-Sieve-Phase tells you exactly where it is in that ramp, and sieve status shows the fact count growing.

Where your data lives

Everything stays on your machine. Facts, entities, and episodes land in a SQLCipher-encrypted SQLite database at ~/.sieve/memory.db, with the keyfile alongside it. There is no cloud component, no account, and no telemetry — the proxy talks to exactly one remote party, and it's the LLM endpoint you configured. If that endpoint is Ollama on localhost, nothing leaves the box at all.
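One quick sanity check on the encryption claim: an unencrypted SQLite file always begins with the 16-byte string "SQLite format 3" followed by a NUL byte, whereas SQLCipher in its default configuration encrypts the whole file, header included. The sketch below assumes Sieve uses those defaults. It only tells you the file is not a plain SQLite database; it does not prove anything about key strength.

```python
from pathlib import Path

# First 16 bytes of every unencrypted SQLite database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def is_plaintext_sqlite(path: Path) -> bool:
    """True if the file starts with the standard SQLite header."""
    with open(path, "rb") as f:
        return f.read(16) == SQLITE_MAGIC
```

Run against the store, is_plaintext_sqlite(Path.home() / ".sieve" / "memory.db") should come back False if the database is encrypted as described.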

The store belongs to you, not to the package: upgrades via pipx upgrade llm-sieve never touch ~/.sieve/, and the only command that deletes user data is sieve uninstall --hard, which makes you type DELETE first.

Honest caveats

Small models still have small-model problems. Sieve can put the right facts in front of the model and refuse to let absence turn into invention on the turns it gates, but a 1–3B model under ambiguity is still a 1–3B model. The demo's trap turn is the honest check — run it against the model you actually plan to use. Models in the 8B+ class are where the absence-handling shines.

Cold start is real. A memory layer with nothing in it can't save you tokens yet. Budget a handful of turns before the deltas get interesting.

Port collisions happen. If something already owns 11435, run sieve start --port 11436 and point your client there instead.
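To find out whether something already owns the port before starting the proxy, a TCP connect test is enough. This is a generic standard-library sketch, not a Sieve command; the function name is my own.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising.
        return s.connect_ex((host, port)) == 0
```

If port_in_use(11435) is True before Sieve has started, pick another port as described above.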

Five minutes, summarised

pipx install llm-sieve
sieve-install            # auto-detects Ollama, guided from there
export OLLAMA_HOST=http://127.0.0.1:11435
sieve demo               # watch the recall hits and the trap refusal

One URL changed, no SDK adopted, no client code rewritten — and your Ollama models stop forgetting who you are between sessions.

Sieve is open source under Apache 2.0: github.com/llmsieve/llm-sieve.

The two causes of your token bill

Azard Tennant-Hosein — Wed, 17 Jun 2026 09:49:27 +0000

Originally published on the Sieve blog. Sieve is an open-source (Apache 2.0) context-reduction proxy — I work on it, and I've tried to keep this post about the problem rather than the tool.

If you run an LLM agent for real work, the bill is the part nobody warned you about. It starts small, it grows with use, and the worst of it is invisible — most of what you pay for on any given turn is text the model has already seen, or text you never meant to send.

There's a temptation to treat this as one problem with one fix. It isn't. An agent's token bill has two distinct causes, and they need two genuinely different kinds of tool. This post is about telling them apart — because once you can, the question stops being "which tool wins" and becomes "which of my two problems am I looking at right now."

The bill is mostly things you didn't choose

Start with where the tokens actually go, because it's rarely where people assume.

When your agent calls a tool, you don't just pay for your request; you pay for the machinery of asking. Anthropic's own pricing documentation spells this out: the tools parameter alone adds hundreds of tokens of schema to every request, the bash tool adds a fixed overhead, and a single web fetch pulls the fetched page straight into your context ("Average web page (10 kB): ~2,500 tokens... Research paper PDF (500 kB): ~125,000 tokens"). A tool result you glance at once and never need again can cost more than the entire conversation around it.
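Both quoted figures work out to roughly four bytes per token, which gives a usable rule of thumb for sizing a fetch before you make it. The ratio below is inferred from those two examples and is an approximation, not a tokenizer; it also assumes 1 kB means 1,000 bytes.

```python
def tokens_from_bytes(size_bytes: int, bytes_per_token: int = 4) -> int:
    """Rough token estimate for fetched content of a given size."""
    return size_bytes // bytes_per_token


print(tokens_from_bytes(10_000))    # 2500, the quoted average web page
print(tokens_from_bytes(500_000))   # 125000, the quoted research paper PDF
```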

Now add the part that repeats. On every turn, a typical agent re-sends its system prompt, its full tool catalogue, its persona, and the conversation so far. The variable part of the request — what you actually typed — is often the smallest thing in the payload. The fixed overhead, multiplied across every turn of a long session, is the bill.

So the cost has two shapes, and they're not the same shape:

Verbose machine output — JSON tool results, logs, search dumps, fetched pages, code listings. Big, one-off, and mostly structural noise around a small signal.
Repeated standing context — the system prompt, tool schemas, persona, and history that ride along on every single turn, plus the absence of any memory that would let the agent not re-send it all.
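To see which shape dominates a given request, you can split an OpenAI-style payload into what rides along and what is new this turn. This is a rough sketch: the characters-divided-by-four estimate is a crude stand-in for a real tokenizer, and the function names are hypothetical.

```python
import json


def rough_tokens(obj) -> int:
    """Crude size estimate: serialised characters divided by 4."""
    text = obj if isinstance(obj, str) else json.dumps(obj)
    return len(text) // 4


def split_payload(request: dict) -> dict:
    """Split a chat request into standing tokens and new tokens.

    Standing = tool schemas plus every message except the last one.
    New = the last message, i.e. what was added on this turn.
    """
    messages = request.get("messages", [])
    standing = rough_tokens(request.get("tools", []))
    standing += sum(rough_tokens(m["content"]) for m in messages[:-1])
    new = rough_tokens(messages[-1]["content"]) if messages else 0
    total = standing + new
    return {
        "standing": standing,
        "new": new,
        "standing_share": round(standing / total, 2) if total else 0.0,
    }
```

A high standing_share points at the repeated-context problem; a single huge final message, such as a pasted tool result, points at the verbose-output one.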

These call for different interventions, and conflating them is why "just reduce my tokens" never quite works.

Two different jobs

Compressing verbose output is a content problem. You have a 10,000-token JSON blob; you want the model to get its meaning at a fraction of the size without losing the parts that matter. This is hard in an interesting way: it's about understanding the shape of the content (a deeply nested object, an AST, a log stream) and squeezing it with little enough loss that the answer doesn't change.
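As a toy illustration of the content problem, the sketch below shrinks a pretty-printed JSON tool result by removing whitespace (lossless) and dropping null or empty fields (lossy, and only safe when emptiness carries no meaning). It is deliberately naive and is not how any particular compression tool works.

```python
import json


def prune(value):
    """Recursively drop None, empty strings, empty lists and empty dicts."""
    if isinstance(value, dict):
        pruned = {k: prune(v) for k, v in value.items()}
        return {k: v for k, v in pruned.items() if v not in (None, "", [], {})}
    if isinstance(value, list):
        return [prune(v) for v in value]
    return value


def compact(payload: str) -> str:
    """Re-serialise JSON text without whitespace or empty fields."""
    return json.dumps(prune(json.loads(payload)), separators=(",", ":"))
```

Even this crude pass often removes a noticeable share of a pretty-printed payload; real compressors go much further by reasoning about the structure of the content.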

Reducing repeated context is a traffic problem. The model has already seen your tool schemas and your standing instructions; the fix is to stop re-sending what it's seen, and to remember durable facts so they can be supplied on demand instead of permanently parked in the prompt. This isn't about any single payload's shape — it's about what crosses the wire, turn after turn, and what gets remembered between turns.
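To measure how much of your traffic is repetition, you can fingerprint each block of a request and check it against what earlier turns already carried. The class below is a hypothetical measuring aid, not Sieve's implementation, and it only detects repeats; deciding what can safely be left out of a request is the harder part.

```python
import hashlib
import json


class SeenBlocks:
    """Remembers fingerprints of blocks that earlier requests carried."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    @staticmethod
    def _fingerprint(block) -> str:
        # sort_keys makes the hash independent of dict key order.
        canonical = json.dumps(block, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def split(self, blocks: list) -> tuple[list, list]:
        """Return (new_blocks, repeated_blocks) and record the new ones."""
        new, repeated = [], []
        for block in blocks:
            fp = self._fingerprint(block)
            (repeated if fp in self._seen else new).append(block)
            self._seen.add(fp)
        return new, repeated
```

Feeding each turn's messages and tool schemas through split() shows how quickly the repeated list outgrows the new one in a long session.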

You can have either problem without the other. An agent that does a lot of web research and tool-calling has a verbose-output problem even in a short session. A long-running personal assistant that mostly chats has a repeated-context problem even though no individual message is large. Most real agents have both, in different proportions — which is exactly why one tool rarely covers the whole bill.

Two tools, two halves

This is where it's worth being concrete, and fair to the projects doing this work.

Headroom is, in its own words, "the context compression layer for AI agents", and it targets the first problem. Its job is to shrink verbose content (JSON, code, logs, the bulky machine output that coding agents generate constantly) while preserving accuracy on standard benchmarks. It's Apache 2.0, runs locally, and offers library, proxy, agent-wrap, and MCP modes. If your bill is dominated by tool outputs and search results, that's the shape of problem it's built for.

Sieve — the project I work on — targets the second. It's a proxy that strips the context the model has already seen from every outbound turn, and backs that with an encrypted local store of durable facts it can inject only when a turn needs them, rather than keeping everything in the prompt forever. It also refuses to invent answers about things it was never told. If your bill is dominated by the same standing apparatus re-sent on every turn, and by an agent that forgets you between sessions, that's its half.

Notice these are different halves. One makes a big payload smaller; the other stops a payload from being re-sent and gives the agent a memory so it doesn't have to be. They're not competing for the same job — they're addressing the two causes named above. In principle they compose: compression handling the verbose one-off content, a reduction-and-memory layer handling the repeated standing content.

What this is worth to you

Set the percentages aside for a moment — every tool in this space quotes a big reduction number, and the numbers depend entirely on your workload. The value to you as a user is more concrete than any headline figure:

Sessions that don't fall over. The most common real complaint isn't the monthly invoice — it's hitting a limit or a context wall in the middle of work. Spending fewer tokens per turn is, before anything else, more room to keep going.
A bill you can reason about. Both kinds of tool are observable: you can see what was sent before and after. A cost you can inspect is a cost you can manage, instead of a number that arrives at month-end.
Less re-explaining yourself. For the repeated-context half specifically, the payoff isn't only tokens — it's an agent that remembers your preferences and your project across sessions, so you stop re-establishing the same ground every time you open it.
Privacy you don't have to trade for savings. Both Headroom and Sieve run locally; Sieve additionally keeps its memory store encrypted on your own disk with no telemetry. Cutting your token bill shouldn't mean shipping your context to one more third party.

Honest limits

A few things I won't claim, because they aren't mine to claim yet.

I haven't run the two together. The "they compose" argument above is architectural — it follows from what each tool does, not from a tested pipeline I've measured. Treat it as a sound hypothesis, not a benchmarked result. If you stack them, I'd genuinely like to hear how it goes.

Reduction has a warm-up. A memory-and-reduction layer with an empty store can't save you much on day one; the savings arrive as it learns. Compression, by contrast, helps on the very first verbose payload.

The numbers are yours, not mine. Whatever either project's headline percentage, the only figure that means anything is the one you measure on your own workload. Both projects expose per-request stats precisely so you can check rather than trust. Use them.

The takeaway

The next time the bill jumps, the useful first question isn't "what cuts tokens" — it's "which of my two problems is this." If it's verbose tool output drowning a small signal, you want compression. If it's the same standing context re-sent every turn and an agent with no memory, you want reduction. Most agents need both, and the good news is the tooling for each now exists, is open source, and runs on your own machine. Knowing which half you're looking at is most of the battle.

Sieve is open source under Apache 2.0: github.com/llmsieve/llm-sieve. If I've misrepresented Headroom, open an issue and I'll correct it.