Praveen KG
The $4.87 Spec: How Local Session Storage Cuts AI Costs by 89%

A simple file-based memory system for AI coding sessions turned a $45 multi-session rebuild into a single $4.87 conversation. Here's the architecture, the data, and why context management is the most undervalued problem in AI-assisted development.


The Problem Nobody Talks About

Every AI coding assistant has the same dirty secret: context evaporates.

You spend 3 hours in a session with your AI pair programmer. You explore APIs, validate assumptions, make design decisions, discover edge cases. Then the session ends. Tomorrow, you start from zero.

The next session costs just as much — not because the work is hard, but because the AI has to rediscover everything it already knew.

I tracked costs across 53 sessions over 6 weeks on a platform engineering project. The waste was staggering.


What Was Measured

The setup: an AI coding assistant used for infrastructure automation — managing on-call schedules, incident response tooling, and platform operations across multiple cloud environments. The work is context-heavy: API integrations, team structures, design decisions, and operational processes that span weeks of iterative design.

One project required 9 sessions over 2 weeks to produce a specification document. Here's what actually happened:

| Session | Focus | Key Outputs |
| --- | --- | --- |
| 1–2 | API discovery, shift structure mapping | 10 engineer IDs resolved, 9 shift UUIDs mapped |
| 3 | Architecture pattern discovery | Fundamental design change (member-swap → scheduled absence) |
| 4 | 22 use cases gathered from stakeholder | Design decisions D-1 through D-9 |
| 5 | 5 integration POCs executed | 3 passed, 2 blocked (enterprise auth) |
| 6 | Auth blocker solved | Novel zero-auth approach discovered |
| 7 | 20-point design review with stakeholder | Corrections on every major section |
| 8 | 6th POC + gap analysis | 17 missing items identified, spec outline approved |
| 9 | Spec writing | 792-line spec + 285-line execution plan |

Session 9 — the one that produced the actual deliverable — consumed all the knowledge from sessions 1–8. Without persistent storage, session 9 would need to:

  1. Re-discover API endpoints, UUIDs, and shift structures (sessions 1–2)
  2. Re-learn the scheduled absence pattern (session 3)
  3. Re-gather 22 use cases (session 4)
  4. Re-run or re-verify 6 POCs (sessions 5–6)
  5. Re-apply 20 points of stakeholder feedback (session 7)
  6. Re-do the gap analysis (session 8)

Conservative estimate: 6–8 sessions at $5–6 each = $35–45 just to rebuild context before writing a single line.

Actual cost of session 9 with local storage: $4.87.


The Architecture: Three Files

The system is embarrassingly simple. Three markdown files per project, plus two shared files, stored locally (never in the AI provider's cloud, never in the remote repository):

```text
.local/agent/
├── current.md                    # Session state (what's active, what's next)
├── praveen-style.md              # Operating manual (style, decisions, anti-patterns)
└── projects/
    └── <project-name>/
        ├── log.md                # Chronological session history
        ├── reference.md          # Verified facts, API endpoints, IDs
        └── open-questions.md     # Decision tracker
```

current.md — The Recovery Point

This is the only file the AI reads at session start. It contains:

  • Active projects with one-line status and file pointers
  • TODO list with priorities and owners
  • Resume instructions — what was done last session, what's next
  • File index — when to load each file (on-demand, not preloaded)

562 lines. Updated at the end of every session. If a session crashes, this file is the recovery point.
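To make that concrete, here's a hypothetical skeleton of such a file. The section names and the project name are invented for illustration; the article doesn't reproduce the real file:

```markdown
# current.md — session state

## Active projects
- oncall-automation — spec approved, execution plan pending → projects/oncall-automation/

## TODO
- [ ] P1: start task 1 of the execution plan

## Resume instructions
Last session: wrote the 792-line spec. Next: begin execution plan task 1.

## File index (load on demand)
- projects/<project-name>/reference.md — before any API call
- projects/<project-name>/log.md — before design discussions
```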

log.md — The Session History

Chronological log of every session: what was done, what was decided, what was discovered. Each entry has:

  • Context (why this session happened)
  • Decisions made (with rationale)
  • Key findings (especially surprises)
  • Open items carried forward

For the project that produced the spec, this file grew to 1,040 lines across 9 sessions. It's the primary source material — the AI reads it when it needs to understand why a decision was made, not just what was decided.

reference.md — The Verified Facts

API endpoints, authentication patterns, ID mappings, integration test results — anything that was verified against a live system. This file exists because LLMs hallucinate, and the most dangerous hallucinations are the ones that look like API documentation.

Every entry in this file was confirmed by an actual API call or system query. When the AI reads this file, it's reading facts, not assumptions.

396 lines for the project in question. Includes: 10 verified API endpoints, 10 engineer identity mappings, 4 placeholder account UUIDs, 9 shift structures, 6 POC results with evidence, and a complete decision register.


The Rules That Make It Work

The files alone aren't enough. Five rules — discovered through painful trial and error — prevent the system from degrading:

Rule 1: Read current.md First, Everything Else On-Demand

The AI reads current.md at session start. That's it. Every other file is loaded only when the current task requires it. This prevents context window pollution — loading 11,000+ lines of project knowledge into a conversation about a single API endpoint.

Rule 2: Separate State from History from Facts

  • current.md = what's happening now (mutable, updated every session)
  • log.md = what happened (append-only, never edited retroactively)
  • reference.md = what's true (verified facts, updated only when facts change)

This separation means the AI loads only the type of knowledge it needs. Writing a spec? Load log.md for design history. Making an API call? Load reference.md for endpoints. Starting a new session? Just current.md.
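That routing logic is small enough to sketch directly. The task names and mapping below are hypothetical, not part of the article's system — the point is that `current.md` always loads and everything else is keyed to the task:

```python
# Hypothetical task-to-file routing that mirrors Rule 2's separation:
# state always loads; history and facts load only when the task needs them.
ROUTES = {
    "write_spec": ["log.md"],        # design history
    "api_call":   ["reference.md"],  # verified facts
}

def files_for(task: str) -> list[str]:
    """Return the files to load for a task; current.md is always first."""
    return ["current.md"] + ROUTES.get(task, [])
```

An unknown task falls back to loading just `current.md`, which matches Rule 1's default.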

Rule 3: Update Before Session End

The AI updates current.md resume instructions before every session close. This is non-negotiable. If the session crashes after the update, the next session can recover. If it crashes before, one session of context is lost — not everything.

Rule 4: Archive When Files Get Large

When log.md exceeds ~500 lines, older sessions are archived to archive/. The active file stays manageable. The archive is there if deep historical context is needed (rare — maybe 5% of sessions).
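A minimal sketch of that archive step. It assumes each log entry starts with an `## Session` heading — a convention invented for the example, since the article doesn't specify the log's internal format:

```python
def split_log(log_text: str, max_lines: int = 500, keep_sessions: int = 3):
    """Split a session log into (archived, active) halves once it grows
    past max_lines, keeping the newest keep_sessions entries active."""
    lines = log_text.splitlines()
    if len(lines) <= max_lines:
        return "", log_text                      # under the limit: no-op
    starts = [i for i, line in enumerate(lines)
              if line.startswith("## Session")]  # entry boundaries
    if len(starts) <= keep_sessions:
        return "", log_text                      # nothing safe to archive
    cut = starts[-keep_sessions]                 # first line that stays active
    return "\n".join(lines[:cut]), "\n".join(lines[cut:])
```

The archived half would be appended to `archive/log.md` and the active half written back to `log.md`.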

Rule 5: Never Store in Provider Cloud

All files live in .local/ (gitignored). They never go to the AI provider's servers, never go to the remote repository. This is about control: the user owns the context, decides what persists, and can move between AI providers without losing institutional knowledge.


The Numbers

Single Session ROI

| Metric | Without Storage | With Storage |
| --- | --- | --- |
| Context rebuild | 6–8 sessions ($35–45) | 0 sessions ($0) |
| Spec writing | 1 session ($5–6) | 1 session ($4.87) |
| Total | $40–51 | $4.87 |
| Saving | — | 86–89% |

Cumulative Impact (53 Sessions, 6 Weeks)

| Metric | Value |
| --- | --- |
| Active projects | 10 |
| Total project knowledge | 11,812 lines across 36 files |
| Semantic memory chunks | 961 (indexed for search) |
| Files indexed | 51 |
| Estimated sessions saved | 40–60 (context rebuilds avoided) |

What the Storage Contains

| Category | Lines | Examples |
| --- | --- | --- |
| Session histories | ~6,200 | Design decisions, POC results, stakeholder feedback |
| Reference data | ~2,100 | API endpoints, verified IDs, integration patterns |
| Operating manuals | ~760 | Style guides, decision-making patterns, anti-patterns |
| Session state | ~560 | Active projects, TODOs, resume instructions |
| Decision trackers | ~190 | Open questions with status and resolution |
| Total | ~11,812 | |

The Semantic Memory Layer

On top of the three-file system, a lightweight semantic search layer handles cross-project recall:

```text
# 196 lines of Python
# sentence-transformers (all-MiniLM-L6-v2, 384-dim embeddings)
# numpy .npz + chunks.json storage
# Index time: ~12s for 961 chunks
# Search time: <3s
# Cost: $0 (local model, no API calls)
```

This handles the "I know this was solved in a different project" problem. The AI searches across all project files semantically before starting work that might duplicate past effort.

961 chunks indexed from 51 files. The index is 2.5MB total. Re-indexed after every session save.
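The core of such a layer fits in a few lines. This is a simplified sketch, not the actual 196-line script: only the chunking and cosine-search halves are shown, and the embedding matrix is assumed to come from the all-MiniLM-L6-v2 model (the real system calls sentence-transformers for that step):

```python
import numpy as np

def chunk_markdown(text: str, max_lines: int = 20) -> list[str]:
    """Split a markdown file into roughly heading-aligned chunks."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:   # new section: flush the old one
            chunks.append("\n".join(current))
            current = []
        current.append(line)
        if len(current) >= max_lines:          # oversized section: flush early
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks

def search(query_vec: np.ndarray, matrix: np.ndarray,
           chunks: list[str], top_k: int = 3):
    """Cosine-similarity search over per-chunk embedding rows."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in order]
```

In the described system, the embedding matrix is persisted to a numpy `.npz` and the chunk texts to `chunks.json`, then reloaded at search time.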


What Doesn't Work

Stuffing Everything Into the System Prompt

The obvious first attempt: load all project files at session start. The problems:

  1. Context window waste — 11,000 lines is ~40,000 tokens. That's a significant chunk of the context window consumed before the conversation even starts.
  2. Attention dilution — LLMs pay less attention to content in the middle of long contexts. Critical facts buried in page 15 of 20 get missed.
  3. Cost — Every message in the conversation includes the full system prompt. Token costs scale linearly.

On-demand loading (Rule 1) solved all three.
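To put a number on the cost problem, here's a back-of-envelope sketch. The per-token price and turn count are assumptions for illustration, not figures from my measurements:

```python
# Rough cost of re-sending ~40,000 tokens of static context with every
# message in a session. Price and turn count are assumed values.
PRICE_PER_MTOK = 3.00    # assumed: $3 per million input tokens
CONTEXT_TOKENS = 40_000  # the estimate above for ~11,000 lines
TURNS = 50               # assumed: messages in a long session

cost = CONTEXT_TOKENS * TURNS * PRICE_PER_MTOK / 1_000_000
print(f"${cost:.2f} spent just re-sending static context")  # → $6.00
```

At those assumed rates, the preloaded context alone costs more than session 9's entire $4.87.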

Relying on the AI's "Memory" Features

Some AI providers offer built-in memory or "project knowledge" features. I tried these too. The problems:

  1. Opacity — You can't see exactly what was stored or how it's retrieved.
  2. Vendor lock-in — Switch providers, lose everything.
  3. Granularity — Built-in memory stores summaries. Real work needs verbatim API endpoints, exact UUIDs, precise decision rationale. Summaries lose the details that matter.
  4. No version control — Local files are in a git-ignored directory, but they could be version-controlled. Built-in memory can't be diffed, branched, or rolled back.

RAG Over Everything

Full RAG (vector database, chunking pipeline, retrieval-augmented generation) is overkill for this use case. The semantic search layer here is 196 lines of Python with a local embedding model. It indexes in about 12 seconds and searches in under 3. No database server, no embedding API costs, no infrastructure.

The three-file system handles 95% of cases. Semantic search handles the remaining 5% (cross-project recall). A full RAG stack would add complexity without proportional benefit.


The Compression Effect

The most interesting outcome isn't cost savings — it's knowledge compression.

Session 9 consumed 2,221 lines of pre-existing context (across 5 files) and produced 1,077 lines of structured output. That's roughly a 2.1:1 compression ratio. But the real compression happened across all 9 sessions:

| Input | Lines |
| --- | --- |
| 8 sessions of iterative design | ~4,000 (including dead ends) |
| API documentation and POC logs | ~1,500 |
| Stakeholder feedback (20+ points) | ~800 |
| Total raw input | ~6,300 |

| Output | Lines |
| --- | --- |
| Spec (25 sections) | 792 |
| Execution plan (9 tasks) | 285 |
| Total structured output | 1,077 |

5.8:1 compression from scattered session notes to structured specification. The local storage system made this possible because:

  1. Nothing was lost between sessions (no re-discovery)
  2. Dead ends were recorded once and never repeated (session 5 documented 10 failed auth approaches — session 9 didn't retry any of them)
  3. Decisions were recorded with rationale (session 9 didn't re-debate closed questions)

Why This Matters for the Industry

The current AI coding assistant landscape is focused on:

  • Model intelligence — bigger models, better reasoning
  • Tool use — code execution, file editing, web search
  • Context windows — 128K, 200K, 1M tokens

Nobody is seriously working on session-to-session knowledge persistence as a first-class feature. The assumption seems to be that bigger context windows solve the problem. They don't.

A 1M token context window means you can load 11,000 lines of project knowledge. It doesn't mean you should. Attention mechanisms degrade with context length. Cost scales linearly. And the fundamental problem remains: who decides which knowledge to load, when?

The three-file system described here is a manual solution to what should be an automated one. The rules discovered empirically — separate state from history from facts, load on-demand, update before session end, archive when large — should be built into every AI coding tool.

What a Real Solution Looks Like

  1. Automatic session persistence — Every session's decisions, discoveries, and dead ends are captured without manual effort.
  2. Typed knowledge stores — Separate "what's true" (facts) from "what happened" (history) from "what's next" (state). Different retrieval strategies for each.
  3. On-demand retrieval — Load context based on the current task, not the current project. If I'm writing an API integration, load verified API endpoints. If I'm writing a spec, load design decisions.
  4. Cross-session deduplication — If a question was asked and answered in session 3, don't let session 7 ask it again.
  5. Provider-agnostic storage — The knowledge belongs to the user, not the AI provider. Portable, version-controllable, inspectable.

The team that builds this well — not as a feature bolted onto a chat interface, but as a core architectural primitive — wins the AI coding assistant market. The cost of intelligence is dropping (model prices halve every 6 months); the cost of context is the durable competitive advantage.


Try It Yourself

You don't need any special tooling. Create three files:

```text
# In your project root (gitignored)
.local/
├── current.md      # "Read this at session start"
├── log.md          # Append after every session
└── reference.md    # Verified facts only
```
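If you'd rather script the setup, a minimal bootstrap might look like this. The file names come from the system above; the seed headings are my own suggestion:

```python
from pathlib import Path

# Seed content for the three files. Headings are suggestions only.
FILES = {
    "current.md": "# Current\n\n## Active projects\n\n## Resume instructions\n",
    "log.md": "# Session Log\n",
    "reference.md": "# Verified Facts\n\nOnly entries confirmed against a live system.\n",
}

def bootstrap(root: str = ".local") -> list[str]:
    """Create any missing memory files under root; return the names created."""
    base = Path(root)
    base.mkdir(exist_ok=True)
    created = []
    for name, seed in FILES.items():
        path = base / name
        if not path.exists():        # never clobber an existing file
            path.write_text(seed)
            created.append(name)
    return created
```

Run it once per project; re-running is safe because existing files are left untouched.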

Start every AI session with: "Read .local/current.md first."

End every session with: "Update .local/current.md with resume instructions."

After 5 sessions, measure how often you're re-explaining context. After 10 sessions, calculate the cost of sessions with vs without the files.

The ROI will speak for itself.


Data from 53 sessions across 6 weeks on a platform engineering project. Total local storage: 11,812 lines across 36 files. Measured cost savings: 86–89% per context-heavy session. The entire memory system is 196 lines of Python and three markdown files.
