In 1989, DOS had a 640 KB ceiling on conventional memory. EMM386 used the 80386 CPU's address-translation hardware to map chunks of a much larger memory space through a small fixed page frame in the upper memory area just above that 640 KB. Programs that asked nicely got effectively unlimited memory through a peephole, paging in only what was relevant for the current operation.
LLMs have the same problem.
The context window is bounded: 32K, 128K, 1M tokens. Your data is bigger. Conversation history, retrieved documents, tool results, and persistent facts will exceed any window worth paying for. Every call has to choose what gets through.
The common approach is ad hoc: keep messages in a list, retrieve "the last N plus a vector hit," concatenate, send. This breaks down once the prompt grows enough that you can't trace what's in it. The model gives an answer; nobody can explain why; two turns produce different responses for reasons that aren't recorded anywhere.
LLM386 is that runtime, applied to LLM context windows instead of DOS memory.
The thesis
f(context) → output
The model is a pure function. No memory, no persistence, no cross-call state. All continuity has to be reconstructed every call. Two consequences:
- Durable state lives in a store the runtime owns. The model is a stateless consumer.
- The prompt for each call is recomputed from that store, with the model's input budget as the constraint.
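Concretely, the loop looks like this. A minimal sketch with hypothetical names (Store, assemble, and run_turn are illustrations, not the llm386 API): the store owns all durable state, and the prompt is rebuilt from it on every call.

```python
from dataclasses import dataclass, field

@dataclass
class Store:
    """Owns all durable state. The model never holds any of it."""
    messages: list[str] = field(default_factory=list)

def assemble(store: Store, budget_chars: int) -> str:
    """Recompute the prompt from the store, newest-first, under a budget."""
    picked: list[str] = []
    used = 0
    for msg in reversed(store.messages):
        if used + len(msg) > budget_chars:
            break
        picked.append(msg)
        used += len(msg)
    return "\n".join(reversed(picked))

def run_turn(store: Store, user_msg: str, model, budget_chars: int) -> str:
    store.messages.append(f"user: {user_msg}")
    prompt = assemble(store, budget_chars)          # f(context) -> output
    output = model(prompt)                          # pure: no hidden state
    store.messages.append(f"assistant: {output}")   # continuity is write-back
    return output
```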
What's in the runtime
The runtime has seven pieces:

- A persistent block store (LMDB, content-addressed, deduplicated by hash).
- A pager that picks which blocks fit the model's input budget. It runs configured retrievers in parallel (recency, BM25, embedding ANN, custom), normalizes their scores, merges by max-per-block, and allocates across canonical sections (System, Task, State, Plan, Retrieved, Tools, Recent, Background).
- A packer that renders the selection into a deterministic prompt string or a role-tagged chat message list.
- A tracer that records what the model saw and why, with byte-level prompt hashes for replay.
- A reducer that turns model output back into committed state via parsed events.
- A typed-edge graph that ties dependent blocks together, so the pager keeps tool results paired with the assistant message that called them.
- A diff layer for comparing two trace records turn over turn.
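The pager's merge step is worth spelling out. Here's a sketch of the described pipeline: normalize each retriever's scores, merge by taking the max per block, then fill the budget greedily. Min-max normalization and greedy fill are my assumptions about the details, and none of these names are the llm386 API.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize one retriever's scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {block: 1.0 for block in scores}
    return {block: (s - lo) / (hi - lo) for block, s in scores.items()}

def merge_max(per_retriever: list[dict[str, float]]) -> dict[str, float]:
    """A block keeps its best normalized score across all retrievers."""
    merged: dict[str, float] = {}
    for scores in per_retriever:
        for block, s in normalize(scores).items():
            merged[block] = max(merged.get(block, 0.0), s)
    return merged

def select(merged: dict[str, float], sizes: dict[str, int], budget: int) -> list[str]:
    """Greedy fill: highest score first, skip anything that no longer fits.
    sizes maps each block id to its token count."""
    chosen, used = [], 0
    for block in sorted(merged, key=merged.get, reverse=True):
        if used + sizes[block] <= budget:
            chosen.append(block)
            used += sizes[block]
    return chosen
```

In the runtime proper, the budget is further split across the canonical sections; the sketch collapses that into a single pool.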
Rust library, Python SDK (PyO3 native extension), CLI. Apache-2.0. Alpha (1.0.0-alpha).
What's deliberately not in there:
- No chatbot UI.
- No hidden state inside prompts.
- No treating model output as truth.
- No distributed storage in the initial version.
- No learned components anywhere in the hot path. Every retriever, packer, and reducer is deterministic, which is the property that makes the trace replayable. A learned reranker or a trained embedding stage would break that, so they stay out by design.
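Determinism is what makes those byte-level prompt hashes useful. A sketch of the replay property, with assumed names rather than llm386's actual trace format: if packing is a pure function of the selected blocks, hashing the packed bytes lets a later run prove it rebuilt exactly what the model saw.

```python
import hashlib

def pack(blocks: list[str]) -> str:
    """Deterministic packer: same blocks in, same bytes out."""
    return "\n\n".join(blocks)

def trace_turn(blocks: list[str]) -> dict:
    """Record what the model saw: per-block hashes plus the prompt hash."""
    prompt = pack(blocks)
    return {
        "block_ids": [hashlib.sha256(b.encode()).hexdigest() for b in blocks],
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
    }

def replay_matches(blocks: list[str], record: dict) -> bool:
    """Re-pack the same blocks and compare hashes byte-for-byte."""
    return hashlib.sha256(pack(blocks).encode()).hexdigest() == record["prompt_sha256"]
```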
Try it
```bash
git clone https://github.com/fitzee/llm386
cd llm386
export ANTHROPIC_API_KEY=sk-ant-...
docker compose -f examples/langgraph-agent/docker-compose.yml run --rm agent
```
Five minutes from clone to chatting: a small chatbot with two stub tools (a calculator and a fake user-profile lookup), with LLM386 as the memory layer. The conversation persists across container restarts because the store lives in a Docker volume. The model recalls things from prior turns; that recall is provided entirely by the runtime, since LangGraph holds no state between turns.
Should you use it?
- Have an agent that works in dev but the prompts are a mess and you can't reason about what the model is seeing? Yes, built for that.
- Want a quick chatbot demo? Probably not. Use the simplest thing that runs.
- Want to swap models without rewriting prompt assembly? Yes. The ModelProfile abstraction carries context window, tokenizer, and capability flags; the pager and packer respect that contract regardless of which model you swap in.
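For illustration, roughly what such a profile has to carry. Field names here are my guesses from the description above, not the actual ModelProfile type:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ModelProfile:
    name: str
    context_window: int                    # total input budget, in tokens
    count_tokens: Callable[[str], int]     # tokenizer hook the pager budgets with
    supports_system_role: bool = True      # capability flags the packer branches on
    supports_tool_calls: bool = False

# Swapping models changes the profile, not the assembly code.
# (len(s) // 4 is a crude token estimate, for the sketch only.)
small = ModelProfile("small-32k", 32_000, lambda s: len(s) // 4)
large = ModelProfile("large-200k", 200_000, lambda s: len(s) // 4,
                     supports_tool_calls=True)
```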
As agents get more complex, "what's actually in the prompt right now?" becomes a question most stacks have a hard time answering. The runtime is designed so that answering it stays cheap.
EMM386 worked because a bounded window into a larger memory was the right abstraction for a structurally constrained system. The same abstraction applies to LLM context windows three decades later.
GitHub: https://github.com/fitzee/llm386