Matt Fitzgerald

Posted on May 4 • Edited on May 16

LLM386: borrowing a 1990s idea for managing LLM context

#llm #agents #ai

Some of us might remember that in the late '90s, MS-DOS had a 640 KB ceiling on conventional memory. Those of us with a PC (side note: I was an Amiga user but 'required' a PC for study) used something like EMM386, which leveraged the 80386 CPU's address-translation hardware to page chunks of a much larger memory space through a small fixed window in the first megabyte of address space that DOS programs could reach. Programs that asked nicely got effectively unlimited memory through a peephole, by paging in only what was relevant for the current operation (I loved the Amiga!).

Working in the agentic AI space recently, I've come to the conclusion that LLMs have the same problem.

The context window is bounded (32K, 128K, 1M tokens), but the data you want to shoehorn into it is bigger. Conversation history, retrieved documents, tool results, and persistent facts will exceed any window worth paying for, which means every call has to choose what gets through.

The common approach (at least from what I've seen, and used in the past) is ad-hoc: keep messages in a list, retrieve "the last N plus a vector hit," concatenate, send. This breaks down once the prompt grows enough that you can't trace what's in it. The model gives an answer, but when two turns produce different responses nobody can explain why, because the reasons aren't recorded anywhere.

Enter LLM386: the kind of runtime EMM386 was, applied to LLM context windows.

The thesis

f(context) → output

The model is a pure function. No memory, no persistence, no cross-call state. All continuity has to be reconstructed every call. Two consequences:

Durable state lives in a store the runtime owns. The model is a stateless consumer.
The prompt for each call is recomputed from that store, with the model's input budget as the constraint.
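To make the two consequences concrete, here is a minimal Python sketch of a stateless call loop. Every name in it (`Store`, `build_prompt`, `run_turn`, `echo_model`) is hypothetical and is not LLM386's API, and character counts stand in for tokens purely to keep the example short.

```python
from typing import Callable

# Hypothetical names throughout; this is an illustration, not LLM386's API.
Store = list[str]  # durable state owned by the runtime, not by the model


def build_prompt(store: Store, budget_chars: int) -> str:
    """Recompute the prompt from the store, newest first, within a budget."""
    picked: list[str] = []
    used = 0
    for item in reversed(store):
        cost = len(item) + 1  # +1 for the newline separator (slightly conservative)
        if used + cost > budget_chars:
            break
        picked.append(item)
        used += cost
    return "\n".join(reversed(picked))


def run_turn(store: Store, user_msg: str, model: Callable[[str], str],
             budget_chars: int = 200) -> str:
    store.append(f"user: {user_msg}")
    prompt = build_prompt(store, budget_chars)  # rebuilt on every call
    reply = model(prompt)                       # f(context) -> output
    store.append(f"assistant: {reply}")         # continuity lives in the store
    return reply


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM: a pure function of its prompt."""
    return f"saw {len(prompt.splitlines())} line(s)"


store: Store = []
first = run_turn(store, "hello", echo_model)
second = run_turn(store, "again", echo_model)
```

The stand-in model only "remembers" the first turn on the second call because the runtime put it back into the prompt.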

What are the building blocks?

A persistent block store (content-addressed, deduped on hash).
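A content-addressed store can be sketched in a few lines of Python. The class and method names below are hypothetical, the store is in-memory rather than persistent, and SHA-256 is an assumption (the post does not say which hash LLM386 uses).

```python
import hashlib


class BlockStore:
    """In-memory stand-in for a persistent, content-addressed store."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, content: str) -> str:
        data = content.encode("utf-8")
        block_id = hashlib.sha256(data).hexdigest()  # assumed hash function
        # Same content hashes to the same id, so a repeat write is a no-op.
        self._blocks.setdefault(block_id, data)
        return block_id

    def get(self, block_id: str) -> str:
        return self._blocks[block_id].decode("utf-8")

    def __len__(self) -> int:
        return len(self._blocks)


store = BlockStore()
a = store.put("user prefers metric units")
b = store.put("user prefers metric units")  # duplicate content
c = store.put("user lives in Lisbon")
```

Because the id is derived from the bytes, writing the same fact twice yields one stored block and one id.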

A pager, the thing that decides which blocks from the store belong in the working set handed to the model. It picks which blocks fit the model's input budget by running configured retrievers in parallel (recency, BM25, embedding ANN, custom), normalizing their scores, merging by max-per-block, and allocating across the following canonical sections: System, Task, State, Plan, Retrieved, Tools, Recent, Background.
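The normalize, merge, and fit steps can be sketched roughly as follows. This is an illustration under assumptions, not LLM386's code: min-max normalization and greedy selection are my choices, the function names are hypothetical, and the sketch leaves out parallel execution and per-section allocation.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale one retriever's scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {block: 1.0 for block in scores}
    return {block: (s - lo) / (hi - lo) for block, s in scores.items()}


def merge_max(results: list[dict[str, float]]) -> dict[str, float]:
    """Keep, for each block, the best normalized score any retriever gave it."""
    merged: dict[str, float] = {}
    for scores in results:
        for block, s in normalize(scores).items():
            merged[block] = max(merged.get(block, 0.0), s)
    return merged


def select(merged: dict[str, float], costs: dict[str, int],
           budget: int) -> list[str]:
    """Greedily take the highest-scoring blocks that still fit the budget."""
    chosen: list[str] = []
    used = 0
    # Sort by score descending, then by id, so ties resolve deterministically.
    for block in sorted(merged, key=lambda b: (-merged[b], b)):
        if used + costs[block] <= budget:
            chosen.append(block)
            used += costs[block]
    return chosen


recency = {"b1": 3.0, "b2": 2.0, "b3": 1.0}       # made-up raw scores
bm25 = {"b3": 12.0, "b4": 4.0}                    # made-up raw scores
merged = merge_max([recency, bm25])
costs = {"b1": 40, "b2": 30, "b3": 50, "b4": 20}  # token cost per block
picked = select(merged, costs, budget=100)
```

Block `b3` ranks last on recency but first on BM25, and max-per-block lets the stronger signal win.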

A packer, the thing that takes the pager's selection (facts that originally came from the user or the LLM) and renders it into a deterministic prompt string or a role-tagged chat message list.
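Deterministic rendering mostly comes down to fixing every ordering decision. The sketch below assumes a selection of `(section, block_id, text)` triples and renders the prompt-string variant only. The section names come from the list above, but the `## ` header format, the sort key, and the function name are all hypothetical.

```python
SECTION_ORDER = ["System", "Task", "State", "Plan",
                 "Retrieved", "Tools", "Recent", "Background"]


def pack_prompt(selection: list[tuple[str, str, str]]) -> str:
    """Render (section, block_id, text) triples into one prompt string.

    Sections appear in a fixed order and blocks are sorted by id within a
    section, so the same selection always produces the same bytes.
    Sections not listed in SECTION_ORDER are ignored in this sketch.
    """
    by_section: dict[str, list[tuple[str, str]]] = {}
    for section, block_id, text in selection:
        by_section.setdefault(section, []).append((block_id, text))

    parts: list[str] = []
    for section in SECTION_ORDER:
        blocks = sorted(by_section.get(section, []))
        if blocks:
            body = "\n".join(text for _, text in blocks)
            parts.append(f"## {section}\n{body}")
    return "\n\n".join(parts)


selection = [
    ("Recent", "b2", "user: what is 2+2?"),
    ("System", "b1", "You are a helpful agent."),
    ("Recent", "b3", "assistant: 4"),
]
prompt = pack_prompt(selection)
```

Feeding the same triples in a different order yields an identical string, which is what makes a byte-level hash of the prompt meaningful.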

A tracer that records what the model saw and why, with byte-level prompt hashes for replay. It is effectively the audit trail for LLM386, which allows for replay, diffing, and so on.
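A trace record along these lines might look like the sketch below. The field names, the `record_turn` and `replay_matches` helpers, and the use of SHA-256 are assumptions for illustration and not LLM386's actual schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    turn: int
    selected: tuple[tuple[str, str], ...]  # (block_id, reason it was chosen)
    prompt_sha256: str                     # hash of the exact prompt bytes


def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def record_turn(turn: int, prompt: str, reasons: dict[str, str]) -> TraceRecord:
    # Sort so the record itself does not depend on dict insertion order.
    selected = tuple(sorted(reasons.items()))
    return TraceRecord(turn, selected, prompt_hash(prompt))


def replay_matches(rec: TraceRecord, rebuilt_prompt: str) -> bool:
    """True only if the rebuilt prompt is byte-identical to the recorded one."""
    return prompt_hash(rebuilt_prompt) == rec.prompt_sha256


prompt = "## System\nYou are a helpful agent."
rec = record_turn(1, prompt, {"b3": "bm25 hit", "b1": "recency"})
```

A single extra space in the rebuilt prompt changes the hash, so a replay either reproduces the original input exactly or is flagged.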

A reducer that turns model output back into committed state via parsed events, a fancy way of saying "just storing LLM output verbatim is messy so we don't do it."
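As a rough illustration of the "parse, then commit" idea, the sketch below invents a tiny line-based event format (`KIND: payload`). The format, the allowed kinds, and all names are hypothetical. LLM386's real event grammar is not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str
    payload: str


ALLOWED_KINDS = {"FACT", "PLAN", "TOOL_CALL"}  # hypothetical event kinds


def parse_events(model_output: str) -> list[Event]:
    """Keep only lines that parse as a known event; drop everything else."""
    events: list[Event] = []
    for line in model_output.splitlines():
        kind, sep, payload = line.partition(":")
        kind, payload = kind.strip(), payload.strip()
        if sep and kind in ALLOWED_KINDS and payload:
            events.append(Event(kind, payload))
    return events


def apply_events(state: dict[str, list[str]],
                 events: list[Event]) -> dict[str, list[str]]:
    """Return a new state with the events committed; the input is not mutated."""
    new_state = {kind: list(items) for kind, items in state.items()}
    for event in events:
        new_state.setdefault(event.kind, []).append(event.payload)
    return new_state


output = (
    "Sure, here is what I found.\n"
    "FACT: user lives in Lisbon\n"
    "PLAN: ask about units\n"
    "NOTE: this line is not a known event"
)
events = parse_events(output)
state = apply_events({}, events)
```

Only the two recognised events reach the state. The chatty preamble and the unknown `NOTE` line are never committed.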

A typed-edge graph that ties dependent blocks together so the pager keeps tool results paired with the assistant message that called them. Blocks are intrinsically related, so we need to make sure they stay tied together, and the graph does that.
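One way to picture the pairing is as a closure over selected edge types: if one end of a followed edge is in the selection, the other end comes along. The edge type names (`result_of`, `mentions`), the block ids, and the function name below are hypothetical.

```python
Edge = tuple[str, str, str]  # (source_id, edge_type, target_id)


def close_over(selected: list[str], edges: list[Edge],
               follow: set[str]) -> list[str]:
    """Expand a selection so both ends of every followed edge are included."""
    neighbours: dict[str, set[str]] = {}
    for src, kind, dst in edges:
        if kind in follow:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)

    result = list(selected)
    seen = set(selected)
    queue = list(selected)
    while queue:
        node = queue.pop(0)
        # Sorted iteration keeps the output order deterministic.
        for other in sorted(neighbours.get(node, ())):
            if other not in seen:
                seen.add(other)
                result.append(other)
                queue.append(other)
    return result


edges: list[Edge] = [
    ("tool_result_7", "result_of", "assistant_call_7"),
    ("note_2", "mentions", "assistant_call_7"),
]
expanded = close_over(["tool_result_7"], edges, follow={"result_of"})
```

Selecting the tool result pulls in the assistant message that made the call, while the loosely related `mentions` edge is not followed.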

A diff layer for comparing two trace records turn-over-turn. Very useful for seeing which blocks were added/removed between turns.
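At its simplest, a turn-over-turn diff is set arithmetic on the block ids of two trace records. The function name and the shape of the result are assumptions for illustration.

```python
def diff_turns(prev_ids: list[str], curr_ids: list[str]) -> dict[str, list[str]]:
    """Compare the block ids seen in two consecutive turns."""
    prev, curr = set(prev_ids), set(curr_ids)
    return {
        "added": sorted(curr - prev),    # new in this turn
        "removed": sorted(prev - curr),  # paged out since last turn
        "kept": sorted(prev & curr),     # present in both
    }


delta = diff_turns(["b1", "b2", "b3"], ["b2", "b3", "b4"])
```

Here `b4` was paged in and `b1` was paged out, which is usually the first thing to check when two turns behave differently.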

Rust library, Python SDK (PyO3 native extension), CLI. Apache-2.0. Alpha (1.0.0-alpha).

What's deliberately not in there:

No chatbot UI... because it isn't a chatbot!
No treating model output as truth.
No learned components anywhere in the hot path. Every retriever, packer, and reducer is deterministic, which is the property that makes the trace replayable. A learned reranker or a trained embedding tweaker would break that, so living without them is a deliberate design constraint.

Try it

git clone https://github.com/fitzee/llm386
cd llm386
export ANTHROPIC_API_KEY=sk-ant-...
docker compose -f examples/langgraph-agent/docker-compose.yml run --rm agent

Five minutes from clone to chatting. It demonstrates a small CLI-based chatbot with two stub tools (a calculator and a fake user-profile lookup), with LLM386 as the memory layer (conversations persist across container restarts because the store is on a Docker volume). You'll notice the model can recall things from prior turns. Leave it for a week or so, chat some more, ask the agent about different things you spoke about, and ask it when those things were discussed. That recall is provided entirely by LLM386, since LangGraph holds no state between turns.

Should you use it?

Have an agent that works in dev but the prompts are a mess and you can't reason about what the model is seeing? Yes, that's the use case it was built for!
Want a quick chatbot demo? Probably not. Use the simplest thing that runs.
Do you want to swap models without rewriting prompt assembly? Then yes. The ModelProfile abstraction carries the context window, tokenizer, and capability flags; the pager and packer respect that contract regardless of which model you swap in; you can even flip models mid-conversation!
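The model-swap idea can be sketched with a small profile object that the prompt-assembly code consults instead of hard-coding limits. The post only says that ModelProfile carries the context window, tokenizer, and capability flags. The field names, the `reserved_for_output` field, and the `fit` helper below are hypothetical, and a whitespace splitter stands in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelProfile:
    # Field names are illustrative, not LLM386's actual definition.
    name: str
    context_window: int                  # total tokens the model accepts
    count_tokens: Callable[[str], int]   # tokenizer stand-in
    supports_tools: bool = False         # example capability flag
    reserved_for_output: int = 0         # hypothetical: room left for the reply

    def input_budget(self) -> int:
        return self.context_window - self.reserved_for_output


def fit(blocks: list[str], profile: ModelProfile) -> list[str]:
    """Keep blocks in order until the profile's input budget is exhausted."""
    kept: list[str] = []
    used = 0
    for text in blocks:
        cost = profile.count_tokens(text)
        if used + cost > profile.input_budget():
            break
        kept.append(text)
        used += cost
    return kept


def whitespace_tokens(text: str) -> int:
    return len(text.split())


small = ModelProfile("small-model", 8, whitespace_tokens, reserved_for_output=2)
large = ModelProfile("large-model", 64, whitespace_tokens,
                     supports_tools=True, reserved_for_output=8)
blocks = ["one two three", "four five six", "seven eight"]
for_small = fit(blocks, small)
for_large = fit(blocks, large)
```

The same `fit` call produces a shorter selection for the small profile and the full one for the large profile, with no change to the assembly code.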

So, as agents get more complex, "what's actually in the prompt right now?" becomes a question most agentic stacks have a hard time answering. LLM386 was designed so it stays cheap to answer that very question.

To conclude, EMM386 worked because a bounded window into a larger memory space was the right abstraction for a structurally constrained system. The same abstraction applies to LLM context windows three decades later, and LLM386 is my attempt at it.

Disclaimer: the post was not written by AI; however, Claude was heavily used in the development of LLM386. The design choices, the invariants, and what goes in (and what doesn't!) are mine. Claude is just a much faster coder than me. Oh, and the 'retro' image above is definitely AI generated!

GitHub: https://github.com/fitzee/llm386
