<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Fitzgerald</title>
    <description>The latest articles on DEV Community by Matt Fitzgerald (@matt_fitzgerald_e0904a636).</description>
    <link>https://dev.to/matt_fitzgerald_e0904a636</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1871005%2F34a900c6-4876-403b-b3c0-988c23cb7140.jpeg</url>
      <title>DEV Community: Matt Fitzgerald</title>
      <link>https://dev.to/matt_fitzgerald_e0904a636</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/matt_fitzgerald_e0904a636"/>
    <language>en</language>
    <item>
      <title>LLM386: borrowing a 1990s idea for managing LLM context</title>
      <dc:creator>Matt Fitzgerald</dc:creator>
      <pubDate>Mon, 04 May 2026 23:46:33 +0000</pubDate>
      <link>https://dev.to/matt_fitzgerald_e0904a636/llm386-borrowing-a-1990s-idea-for-managing-llm-context-4pga</link>
      <guid>https://dev.to/matt_fitzgerald_e0904a636/llm386-borrowing-a-1990s-idea-for-managing-llm-context-4pga</guid>
      <description>&lt;p&gt;In 1989, DOS had a 640 KB ceiling on conventional memory. EMM386 used the 80386 CPU's address-translation hardware to page chunks of a much larger memory space through a small fixed window inside that 640 KB. Programs that asked nicely got effectively unlimited memory through a peephole, by paging only what was relevant for the current operation.&lt;/p&gt;

&lt;p&gt;LLMs have the same problem.&lt;/p&gt;

&lt;p&gt;The context window is bounded: 32K, 128K, 1M tokens. Your data is bigger. Conversation history, retrieved documents, tool results, and persistent facts will exceed any window worth paying for. Every call has to choose what gets through.&lt;/p&gt;

&lt;p&gt;The common approach is ad hoc: keep messages in a list, retrieve "the last N plus a vector hit," concatenate, send. This breaks down once the prompt grows enough that you can't trace what's in it. The model gives an answer; nobody can explain why; two similar turns produce different responses for reasons that aren't recorded anywhere.&lt;/p&gt;
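
&lt;p&gt;A minimal sketch of that pattern, with illustrative names (no particular framework implied):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The ad-hoc pattern: last N messages plus whatever the vector store returns.
# Nothing records why a chunk got in, and nothing enforces a token budget.
def build_prompt(history, vector_store, query, n_recent=10):
    recent = history[-n_recent:]               # the last N, whatever they cost
    hits = vector_store.search(query, k=3)     # "a vector hit", score discarded
    return "\n".join(hits + recent + [query])  # concatenate, send
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;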

&lt;p&gt;LLM386 is to the LLM context window what EMM386 was to conventional memory: a runtime that pages a larger state space through a bounded window.&lt;/p&gt;

&lt;h2&gt;The thesis&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(context) → output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is a pure function: no memory, no persistence, no cross-call state. All continuity has to be reconstructed on every call. Two consequences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Durable state lives in a store the runtime owns. The model is a stateless consumer.&lt;/li&gt;
&lt;li&gt;The prompt for each call is recomputed from that store, with the model's input budget as the constraint (one turn is sketched below).&lt;/li&gt;
&lt;/ol&gt;
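
&lt;p&gt;Concretely, one turn looks like this. A minimal sketch with illustrative names; the real SDK surface may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One turn: recompute the prompt from the durable store, call the stateless
# model, commit its output back as state. Names are illustrative, not the
# LLM386 API.
def run_turn(store, pager, packer, model, reducer, user_input):
    store.append(role="user", text=user_input)        # durable state first
    blocks = pager.select(store, budget=model.input_budget)
    prompt = packer.render(blocks)                    # deterministic render
    output = model.complete(prompt)                   # f(context) -> output
    reducer.commit(store, output)                     # parsed events -> state
    return output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;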

&lt;h2&gt;What's in the runtime&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A persistent block store (LMDB, content-addressed, deduped on hash).&lt;/li&gt;
&lt;li&gt;A pager that picks which blocks fit the model's input budget: it runs the configured retrievers in parallel (recency, BM25, embedding ANN, custom), normalizes their scores, merges them by max-per-block, and allocates the result across canonical sections (System, Task, State, Plan, Retrieved, Tools, Recent, Background). The merge step is sketched below.&lt;/li&gt;
&lt;li&gt;A packer that renders the selection into a deterministic prompt string or a role-tagged chat message list.&lt;/li&gt;
&lt;li&gt;A tracer that records what the model saw and why, with byte-level prompt hashes for replay.&lt;/li&gt;
&lt;li&gt;A reducer that turns model output back into committed state via parsed events.&lt;/li&gt;
&lt;li&gt;A typed-edge graph that ties dependent blocks together, so the pager keeps tool results paired with the assistant message that called them.&lt;/li&gt;
&lt;li&gt;A diff layer for comparing two trace records turn over turn.&lt;/li&gt;
&lt;/ul&gt;
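
&lt;p&gt;A sketch of the pager's merge step, assuming min-max normalization per retriever and greedy packing within a section budget (illustrative, not the library's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Merge retriever scores by max-per-block, then greedily pack one section.
# Illustrative sketch only; LLM386's actual scoring code may differ.
def merge_scores(retriever_results):
    """retriever_results: list of dicts mapping block_id to raw score."""
    merged = {}
    for results in retriever_results:
        if not results:
            continue
        lo, hi = min(results.values()), max(results.values())
        span = (hi - lo) or 1.0                   # guard against constant scores
        for block_id, raw in results.items():
            norm = (raw - lo) / span              # min-max normalize per retriever
            merged[block_id] = max(merged.get(block_id, 0.0), norm)
    return merged

def pack_section(merged, token_cost, budget):
    """Greedily take the best-scoring blocks that fit the section budget."""
    chosen, spent = [], 0
    for block_id in sorted(merged, key=merged.get, reverse=True):
        cost = token_cost[block_id]
        if spent + cost > budget:
            continue
        chosen.append(block_id)
        spent += cost
    return chosen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Max-per-block means a block needs only one retriever to rank it highly; summing would instead favor blocks that score mildly everywhere.&lt;/p&gt;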

&lt;p&gt;Rust library, Python SDK (PyO3 native extension), CLI. Apache-2.0. Alpha (1.0.0-alpha).&lt;/p&gt;

&lt;p&gt;What's deliberately not in there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No chatbot UI.&lt;/li&gt;
&lt;li&gt;No hidden state inside prompts.&lt;/li&gt;
&lt;li&gt;No treating model output as truth.&lt;/li&gt;
&lt;li&gt;No distributed storage in the initial version.&lt;/li&gt;
&lt;li&gt;No learned components anywhere in the hot path. Every retriever, packer, and reducer is deterministic, and that determinism is what makes the trace replayable (a replay check is sketched after this list). A learned reranker or a trained embedding adjustment would break it, so the design lives without them.&lt;/li&gt;
&lt;/ul&gt;
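
&lt;p&gt;That replay property is checkable. A sketch, assuming the trace stores a SHA-256 of the prompt bytes and the store can be pinned to a turn (both details are assumptions here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

# Determinism check: rebuilding the prompt from the same committed state must
# reproduce the traced prompt byte for byte. The trace_record fields and
# store.at() are assumed for illustration.
def verify_replay(trace_record, pager, packer, store):
    state = store.at(trace_record.turn)            # state as of that turn
    blocks = pager.select(state, budget=trace_record.budget)
    rebuilt = packer.render(blocks).encode("utf-8")
    return hashlib.sha256(rebuilt).hexdigest() == trace_record.prompt_sha256
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;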

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/fitzee/llm386
&lt;span class="nb"&gt;cd &lt;/span&gt;llm386
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; examples/langgraph-agent/docker-compose.yml run &lt;span class="nt"&gt;--rm&lt;/span&gt; agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five minutes from clone to chatting: a small chatbot with two stub tools (a calculator and a fake user-profile lookup), with LLM386 as the memory layer. Conversation persists across container restarts because the store lives on a Docker volume. The model recalls things from prior turns; that recall is provided entirely by the runtime, since LangGraph holds no state between turns.&lt;/p&gt;

&lt;h2&gt;Should you use it?&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Have an agent that works in dev but the prompts are a mess and you can't reason about what the model is seeing? Yes, built for that.&lt;/li&gt;
&lt;li&gt;Want a quick chatbot demo? Probably not. Use the simplest thing that runs.&lt;/li&gt;
&lt;li&gt;Want to swap models without rewriting prompt assembly? Yes. The ModelProfile abstraction carries the context window, tokenizer, and capability flags; the pager and packer respect that contract regardless of which model you swap in (a sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
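
&lt;p&gt;The shape that contract implies, as a sketch (field names are assumptions, not the actual ModelProfile definition):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

# The shape a ModelProfile-style contract implies. Field names are assumed
# for illustration, not taken from LLM386.
@dataclass(frozen=True)
class ModelProfile:
    name: str                    # e.g. "claude-sonnet-4"
    context_window: int          # total tokens the model accepts
    reserved_output: int         # tokens held back for the completion
    tokenizer: str               # tokenizer used for counting block costs
    supports_tools: bool = True  # capability flag the packer can branch on

    @property
    def input_budget(self) -&gt; int:
        # What the pager is allowed to spend on the prompt.
        return self.context_window - self.reserved_output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the budget arithmetic on the profile means the pager never hardcodes a window size; swapping models is swapping profiles.&lt;/p&gt;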

&lt;p&gt;As agents get more complex, "what's actually in the prompt right now?" becomes a question most stacks have a hard time answering. The runtime is designed so that question stays cheap to answer.&lt;/p&gt;

&lt;p&gt;EMM386 worked because a bounded window into a larger memory was the right abstraction for a structurally constrained system. The same abstraction applies to LLM context windows three decades later.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/fitzee/llm386" rel="noopener noreferrer"&gt;https://github.com/fitzee/llm386&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
