A simple file-based memory system for AI coding sessions turned a $45 multi-session rebuild into a single $4.87 conversation. Here's the architecture, the data, and why context management is the most undervalued problem in AI-assisted development.
The Problem Nobody Talks About
Every AI coding assistant has the same dirty secret: context evaporates.
You spend 3 hours in a session with your AI pair programmer. You explore APIs, validate assumptions, make design decisions, discover edge cases. Then the session ends. Tomorrow, you start from zero.
The next session costs just as much — not because the work is hard, but because the AI has to rediscover everything it already knew.
I tracked this across 53 sessions over 6 weeks on a platform engineering project. The waste was staggering.
What Was Measured
The setup: an AI coding assistant used for infrastructure automation — managing on-call schedules, incident response tooling, and platform operations across multiple cloud environments. The work is context-heavy: API integrations, team structures, design decisions, and operational processes that span weeks of iterative design.
One project required 9 sessions over 2 weeks to produce a specification document. Here's what actually happened:
| Session | Focus | Key Outputs |
|---|---|---|
| 1–2 | API discovery, shift structure mapping | 10 engineer IDs resolved, 9 shift UUIDs mapped |
| 3 | Architecture pattern discovery | Fundamental design change (member-swap → scheduled absence) |
| 4 | 22 use cases gathered from stakeholder | Design decisions D-1 through D-9 |
| 5 | 5 integration POCs executed | 3 passed, 2 blocked (enterprise auth) |
| 6 | Auth blocker solved | Novel zero-auth approach discovered |
| 7 | 20-point design review with stakeholder | Corrections on every major section |
| 8 | 6th POC + gap analysis | 17 missing items identified, spec outline approved |
| 9 | Spec writing | 792-line spec + 285-line execution plan |
Session 9 — the one that produced the actual deliverable — consumed all the knowledge from sessions 1–8. Without persistent storage, session 9 would need to:
- Re-discover API endpoints, UUIDs, and shift structures (sessions 1–2)
- Re-learn the scheduled absence pattern (session 3)
- Re-gather 22 use cases (session 4)
- Re-run or re-verify 6 POCs (sessions 5–6)
- Re-apply 20 points of stakeholder feedback (session 7)
- Re-do the gap analysis (session 8)
Conservative estimate: 6–8 sessions at $5–6 each = $35–45 just to rebuild context before writing a single line.
Actual cost of session 9 with local storage: $4.87.
The Architecture: Three Files
The system is embarrassingly simple. Three markdown files per project, stored locally (never in the AI provider's cloud, never in the remote repository):
.local/agent/
├── current.md # Session state (what's active, what's next)
├── praveen-style.md # Operating manual (style, decisions, anti-patterns)
└── projects/
└── <project-name>/
├── log.md # Chronological session history
├── reference.md # Verified facts, API endpoints, IDs
└── open-questions.md # Decision tracker
current.md — The Recovery Point
This is the only file the AI reads at session start. It contains:
- Active projects with one-line status and file pointers
- TODO list with priorities and owners
- Resume instructions — what was done last session, what's next
- File index — when to load each file (on-demand, not preloaded)
562 lines. Updated at the end of every session. If a session crashes, this file is the recovery point.
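For illustration, a current.md skeleton might look like this (the section names and placeholders are mine, not a required format):

```markdown
# current.md (session state)

## Active projects
- <project-name>: one-line status -> projects/<project-name>/

## TODO
- [ ] P1: <next concrete task> (owner: <who>)

## Resume instructions
Last session: <what was finished>.
Next: <first thing to do>, using projects/<project-name>/log.md.

## File index
- reference.md: load before any API call
- log.md: load when design rationale is needed
```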
log.md — The Session History
Chronological log of every session: what was done, what was decided, what was discovered. Each entry has:
- Context (why this session happened)
- Decisions made (with rationale)
- Key findings (especially surprises)
- Open items carried forward
For the project that produced the spec, this file grew to 1,040 lines across 9 sessions. It's the primary source material — the AI reads it when it needs to understand why a decision was made, not just what was decided.
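An entry can be as small as a dated block with those four fields. An illustrative skeleton:

```markdown
## Session <n> (<date>)
Context: why this session happened.
Decisions: what was decided, with rationale (e.g. the member-swap to scheduled-absence change).
Key findings: surprises and dead ends worth recording once.
Carried forward: open items for the next session.
```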
reference.md — The Verified Facts
API endpoints, authentication patterns, ID mappings, integration test results — anything that was verified against a live system. This file exists because LLMs hallucinate, and the most dangerous hallucinations are the ones that look like API documentation.
Every entry in this file was confirmed by an actual API call or system query. When the AI reads this file, it's reading facts, not assumptions.
396 lines for the project in question. Includes: 10 verified API endpoints, 10 engineer identity mappings, 4 placeholder account UUIDs, 9 shift structures, 6 POC results with evidence, and a complete decision register.
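Entries pair the fact with the evidence that verified it. An illustrative shape (the endpoint and values below are placeholders, not from a real system):

```markdown
<!-- placeholder values for illustration only -->
## API endpoints (verified)
- GET /schedules/{id}: returns shift layers (verified: POC call, 200 OK, session 2)

## Identity mappings (verified)
- <engineer-name> -> <provider-user-id> (source: directory query, session 1)

## Decision register
- D-<n>: <decision> (status: accepted/revised, session <n>)
```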
The Rules That Make It Work
The files alone aren't enough. Five rules — discovered through painful trial and error — prevent the system from degrading:
Rule 1: Read current.md First, Everything Else On-Demand
The AI reads current.md at session start. That's it. Every other file is loaded only when the current task requires it. This prevents context window pollution — loading 11,000+ lines of project knowledge into a conversation about a single API endpoint.
Rule 2: Separate State from History from Facts
- current.md = what's happening now (mutable, updated every session)
- log.md = what happened (append-only, never edited retroactively)
- reference.md = what's true (verified facts, updated only when facts change)
This separation means the AI loads only the type of knowledge it needs. Writing a spec? Load log.md for design history. Making an API call? Load reference.md for endpoints. Starting a new session? Just current.md.
Rule 3: Update Before Session End
The AI updates current.md resume instructions before every session close. This is non-negotiable. If the session crashes after the update, the next session can recover. If it crashes before, one session of context is lost — not everything.
Rule 4: Archive When Files Get Large
When log.md exceeds ~500 lines, older sessions are archived to archive/. The active file stays manageable. The archive is there if deep historical context is needed (rare — maybe 5% of sessions).
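A small helper script can enforce this. Here is a minimal sketch; the `## Session` heading convention and the archive path are assumptions, not the author's tooling:

```python
# Illustrative archival helper: when log.md grows past ~500 lines, move the
# oldest session entries into archive/. The heading convention is an assumption.
from pathlib import Path

def archive_old_sessions(log_path: Path, keep_lines: int = 500) -> None:
    lines = log_path.read_text().splitlines(keepends=True)
    if len(lines) <= keep_lines:
        return
    # Cut at the first session heading that keeps the active file under the limit.
    cut = next(
        (i for i in range(len(lines) - keep_lines, len(lines))
         if lines[i].startswith("## Session")),
        len(lines) - keep_lines,
    )
    archive = log_path.parent / "archive" / f"{log_path.stem}-archive.md"
    archive.parent.mkdir(parents=True, exist_ok=True)
    with archive.open("a") as f:
        f.writelines(lines[:cut])          # append the old entries to the archive
    log_path.write_text("".join(lines[cut:]))  # keep only recent entries active
```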
Rule 5: Never Store in Provider Cloud
All files live in .local/ (gitignored). They never go to the AI provider's servers, never go to the remote repository. This is about control: the user owns the context, decides what persists, and can move between AI providers without losing institutional knowledge.
The Numbers
Single Session ROI
| Metric | Without Storage | With Storage |
|---|---|---|
| Context rebuild | 6–8 sessions ($35–45) | 0 sessions ($0) |
| Spec writing | 1 session ($5–6) | 1 session ($4.87) |
| Total | $40–51 | $4.87 |
| Saving | | 86–89% |
Cumulative Impact (53 Sessions, 6 Weeks)
| Metric | Value |
|---|---|
| Active projects | 10 |
| Total project knowledge | 11,812 lines across 36 files |
| Semantic memory chunks | 961 (indexed for search) |
| Files indexed | 51 |
| Estimated sessions saved | 40–60 (context rebuilds avoided) |
What the Storage Contains
| Category | Lines | Examples |
|---|---|---|
| Session histories | ~6,200 | Design decisions, POC results, stakeholder feedback |
| Reference data | ~2,100 | API endpoints, verified IDs, integration patterns |
| Operating manuals | ~760 | Style guides, decision-making patterns, anti-patterns |
| Session state | ~560 | Active projects, TODOs, resume instructions |
| Decision trackers | ~190 | Open questions with status and resolution |
| Total | ~11,812 | |
The Semantic Memory Layer
On top of the three-file system, a lightweight semantic search layer handles cross-project recall:
# 196 lines of Python
# sentence-transformers (all-MiniLM-L6-v2, 384-dim embeddings)
# numpy .npz + chunks.json storage
# Index time: ~12s for 961 chunks
# Search time: <3s
# Cost: $0 (local model, no API calls)
This handles the "I know this was solved in a different project" problem. The AI searches across all project files semantically before starting work that might duplicate past effort.
961 chunks indexed from 51 files. The index is 2.5MB total. Re-indexed after every session save.
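For context, a minimal sketch of what such a layer can look like, assuming sentence-transformers and numpy as described above. The chunking scheme, paths, and function names are illustrative, not the author's 196-line implementation:

```python
# Illustrative local semantic index: embed markdown chunks, store as .npz + JSON,
# search by cosine similarity. Paths and chunk size are assumptions.
import json
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

MODEL = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings, runs locally
INDEX_DIR = Path(".local/agent/index")            # hypothetical location

def chunk_file(path: Path, max_lines: int = 20) -> list[dict]:
    """Split a markdown file into fixed-size line chunks with provenance."""
    lines = path.read_text().splitlines()
    return [
        {"file": str(path), "start": i, "text": "\n".join(lines[i:i + max_lines])}
        for i in range(0, len(lines), max_lines)
    ]

def build_index(files: list[Path]) -> None:
    """Embed every chunk and persist vectors (.npz) plus chunk metadata (JSON)."""
    chunks = [c for f in files for c in chunk_file(f)]
    vectors = MODEL.encode([c["text"] for c in chunks], normalize_embeddings=True)
    INDEX_DIR.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(INDEX_DIR / "vectors.npz", vectors=vectors)
    (INDEX_DIR / "chunks.json").write_text(json.dumps(chunks))

def search(query: str, top_k: int = 5) -> list[dict]:
    """Return the top_k chunks most similar to the query."""
    vectors = np.load(INDEX_DIR / "vectors.npz")["vectors"]
    chunks = json.loads((INDEX_DIR / "chunks.json").read_text())
    q = MODEL.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since embeddings are normalized
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]
```

Re-indexing after every session save keeps the index in sync without any long-running service.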
What Doesn't Work
Stuffing Everything Into the System Prompt
The obvious first attempt: load all project files at session start. The problems:
- Context window waste — 11,000 lines is ~40,000 tokens. That's a significant chunk of the context window consumed before the conversation even starts.
- Attention dilution — LLMs pay less attention to content in the middle of long contexts. Critical facts buried in page 15 of 20 get missed.
- Cost — Every message in the conversation includes the full system prompt. Token costs scale linearly.
On-demand loading (Rule 1) solved all three.
Relying on the AI's "Memory" Features
Some AI providers offer built-in memory or "project knowledge" features. I tried these too. The problems:
- Opacity — You can't see exactly what was stored or how it's retrieved.
- Vendor lock-in — Switch providers, lose everything.
- Granularity — Built-in memory stores summaries. Real work needs verbatim API endpoints, exact UUIDs, precise decision rationale. Summaries lose the details that matter.
- No version control — Local files are in a git-ignored directory, but they could be version-controlled. Built-in memory can't be diffed, branched, or rolled back.
RAG Over Everything
Full RAG (vector database, chunking pipeline, retrieval-augmented generation) is overkill for this use case. The semantic search layer here is 196 lines of Python with a local embedding model. It indexes in 12 seconds and searches in 3. No database server, no embedding API costs, no infrastructure.
The three-file system handles 95% of cases. Semantic search handles the remaining 5% (cross-project recall). A full RAG stack would add complexity without proportional benefit.
The Compression Effect
The most interesting outcome isn't cost savings — it's knowledge compression.
Session 9 consumed 2,221 lines of pre-existing context (across 5 files) and produced 1,077 lines of structured output. That's roughly a 2:1 compression ratio. But the real compression happened across all 9 sessions:
| Input | Lines |
|---|---|
| 8 sessions of iterative design | ~4,000 (including dead ends) |
| API documentation and POC logs | ~1,500 |
| Stakeholder feedback (20+ points) | ~800 |
| Total raw input | ~6,300 |
| Output | Lines |
|---|---|
| Spec (25 sections) | 792 |
| Execution plan (9 tasks) | 285 |
| Total structured output | 1,077 |
5.8:1 compression from scattered session notes to structured specification. The local storage system made this possible because:
- Nothing was lost between sessions (no re-discovery)
- Dead ends were recorded once and never repeated (session 5 documented 10 failed auth approaches — session 9 didn't retry any of them)
- Decisions were recorded with rationale (session 9 didn't re-debate closed questions)
Why This Matters for the Industry
The current AI coding assistant landscape is focused on:
- Model intelligence — bigger models, better reasoning
- Tool use — code execution, file editing, web search
- Context windows — 128K, 200K, 1M tokens
Nobody is seriously working on session-to-session knowledge persistence as a first-class feature. The assumption seems to be that bigger context windows solve the problem. They don't.
A 1M token context window means you can load 11,000 lines of project knowledge. It doesn't mean you should. Attention mechanisms degrade with context length. Cost scales linearly. And the fundamental problem remains: who decides which knowledge to load, when?
The three-file system described here is a manual solution to what should be an automated one. The rules discovered empirically — separate state from history from facts, load on-demand, update before session end, archive when large — should be built into every AI coding tool.
What a Real Solution Looks Like
- Automatic session persistence — Every session's decisions, discoveries, and dead ends are captured without manual effort.
- Typed knowledge stores — Separate "what's true" (facts) from "what happened" (history) from "what's next" (state). Different retrieval strategies for each; a sketch of this idea follows the list.
- On-demand retrieval — Load context based on the current task, not the current project. If I'm writing an API integration, load verified API endpoints. If I'm writing a spec, load design decisions.
- Cross-session deduplication — If a question was asked and answered in session 3, don't let session 7 ask it again.
- Provider-agnostic storage — The knowledge belongs to the user, not the AI provider. Portable, version-controllable, inspectable.
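As a concrete sketch of the typed-stores idea, the interface could be as small as this (names and methods are hypothetical, not an existing tool's API):

```python
# Hypothetical interface: typed knowledge stores with per-type retrieval strategies.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Knowledge:
    kind: str    # "fact" | "history" | "state"
    text: str
    source: str  # session or system that produced or verified it


class KnowledgeStore(Protocol):
    def save(self, item: Knowledge) -> None: ...
    def retrieve(self, task: str) -> list[Knowledge]: ...


class FactStore:
    """What's true: verified endpoints, IDs, test results. Retrieved by topic match."""

    def __init__(self) -> None:
        self._items: list[Knowledge] = []

    def save(self, item: Knowledge) -> None:
        self._items.append(item)

    def retrieve(self, task: str) -> list[Knowledge]:
        # Load facts only when the current task actually touches them.
        return [i for i in self._items if task.lower() in i.text.lower()]
```

A HistoryStore would retrieve by semantic similarity over session logs, and a StateStore would always return the full current state; the point is that each kind of knowledge gets its own retrieval strategy.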
The team that builds this well — not as a feature bolted onto a chat interface, but as a core architectural primitive — wins the AI coding assistant market. The cost of intelligence keeps dropping (model prices roughly halve every 6 months); the cost of context is the durable competitive advantage.
Try It Yourself
You don't need any special tooling. Create three files:
# In your project root (gitignored)
.local/
├── current.md # "Read this at session start"
├── log.md # Append after every session
└── reference.md # Verified facts only
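Keep the directory out of the remote repository with a one-line .gitignore entry:

```gitignore
# local AI session memory, never pushed
.local/
```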
Start every AI session with: "Read .local/current.md first."
End every session with: "Update .local/current.md with resume instructions."
After 5 sessions, measure how often you're re-explaining context. After 10 sessions, calculate the cost of sessions with vs without the files.
The ROI will speak for itself.
Data from 53 sessions across 6 weeks on a platform engineering project. Total local storage: 11,812 lines across 36 files. Measured cost savings: 86–89% per context-heavy session. The entire memory system is 196 lines of Python and three markdown files.