This post draws on practical experience building memo-agent (a simplified Hermes variant) to explore how to give terminal AI assistants "long-term memory" and "extended conversation" capabilities.
I. The Starting Point
Imagine this scenario: You're pair-programming with AI through a CLI tool, discussing project architecture, database design, and API specifications for 2 hours straight. The AI performs well, remembering every constraint you mentioned.
The next day, you open the terminal and type "continue yesterday's design." The AI replies: "Sure, what did we discuss yesterday?"
The conversation context has been reset.
This isn't science fiction—it's the reality of most terminal AI tools today. Other common issues include:
- After 30 rounds of code review, the AI "forgets" the architectural constraints you mentioned at the beginning
- Context window warnings force you to `/clear` and start over
- No cross-session context accumulation means re-explaining background every time
The root causes are twofold:
- Brutal context truncation — During extended conversations, the system often discards the earliest messages, causing critical information loss
- No memory across sessions — Each session starts from zero, unable to accumulate project context
The design goal of memo-agent is to solve these two problems.
II. Persistent Memory: Letting AI Remember "Who You Are"
2.1 Local File Memory
The simplest persistence solution is often the most reliable. memo-agent uses local Markdown files as memory carriers:
```
~/.memo-agent/memory/
├── NOTES.md     # Work notes (agent can read/write)
└── PROFILE.md   # User preferences (read-only, maintained by user)
```
NOTES.md is automatically updated by the agent after each conversation round when deemed necessary. For example:
- You mentioned "this project uses functional style, avoid classes"
- You specified "API responses should always use the `{code, data, message}` format"
- You indicated "use SQLite WAL mode for the database"
If the agent considers these valuable, it appends them to NOTES.md. These notes are automatically injected into the system prompt at the start of the next session, becoming the agent's "common knowledge."
PROFILE.md is manually maintained by the user, suitable for long-term stable preferences:
```markdown
I'm a backend engineer, primarily using Go and TypeScript.
Code style: functional first, avoid over-abstraction.
Please respond in Chinese, with English code comments.
```
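At session start, memo-agent reads both files and folds them into the system prompt. A minimal sketch of that assembly; the function name and section labels here are illustrative, not the actual memo-agent API:

```typescript
// Sketch only: `buildSystemPrompt` and the section headers are
// assumptions for illustration, not memo-agent's real implementation.
interface MemoryFiles {
  notes: string;   // contents of NOTES.md (agent-maintained)
  profile: string; // contents of PROFILE.md (user-maintained)
}

function buildSystemPrompt(base: string, memory: MemoryFiles): string {
  const sections = [base];
  if (memory.profile.trim()) {
    sections.push(`## User profile\n${memory.profile.trim()}`);
  }
  if (memory.notes.trim()) {
    sections.push(`## Project notes\n${memory.notes.trim()}`);
  }
  // Empty files add nothing, so a fresh install behaves like a
  // memory-less assistant.
  return sections.join("\n\n");
}
```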
2.2 Safe Injection Mechanism
Injecting local file content into the system prompt carries security risks — if files are maliciously tampered with, they may contain prompt injection attacks. memo-agent scans content before injection, detecting the following patterns:
- "Ignore previous instructions"
- "You are now a... role"
- "Send the following information to..."
If detected, injection is skipped and an alert is shown in the UI.
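A scan like this can be a handful of regular expressions run before injection. A minimal sketch; the pattern list is illustrative and memo-agent's actual rules may differ:

```typescript
// Sketch of the pre-injection scan. The pattern list is an
// illustration, not memo-agent's real rule set.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,             // instruction override
  /you are now an? /i,                                // role hijack
  /send (the following|this) (information|data) to/i, // exfiltration
];

function isSafeToInject(content: string): boolean {
  return !SUSPICIOUS_PATTERNS.some((p) => p.test(content));
}

isSafeToInject("- Decision: use WAL mode");            // → true
isSafeToInject("Ignore previous instructions and...")  // → false
```

Regex matching is a heuristic, not a guarantee; treating the memory files as untrusted input and alerting the user on a hit is the important part.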
2.3 Session Chain: History Never Lost
NOTES.md alone isn't enough. When conversations grow long and need compression, memo-agent doesn't truncate brutally. Instead:
- Uses an auxiliary model to generate a summary of intermediate history
- Creates a new session with the summary as the starting context
- Old sessions are linked through `parent_session_id`
This forms the following structure:
```
Session Tree

Session A (2024-01-15)
├── Original conversation: 50 rounds
├── Input tokens: 45,000
└── Child session: B

Session B (2024-01-16)
├── Compressed summary: "Decided to use SQLite WAL..."
├── New conversation: 30 rounds
├── Input tokens: 28,000
└── Child session: C

Session C (2024-01-17)
├── Secondary compression summary
└── New conversation: 20 rounds
```
Session Chain Database Design:
```sql
CREATE TABLE sessions (
  id TEXT PRIMARY KEY,
  title TEXT,
  model TEXT,
  parent_session_id TEXT REFERENCES sessions(id),  -- Chain linkage
  compressed_summary TEXT,                         -- Summary inherited from parent session
  input_tokens INTEGER DEFAULT 0,
  output_tokens INTEGER DEFAULT 0,
  estimated_cost_usd REAL DEFAULT 0,
  created_at TEXT DEFAULT (datetime('now')),
  updated_at TEXT DEFAULT (datetime('now'))
);
```
In theory, this chain can extend infinitely—history is never lost. Users can view the session chain via /history and use --resume <session-id> to return to any node.
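The `/history` view can be derived by following `parent_session_id` links upward from the current session. A minimal sketch; `SessionRow` mirrors the table above, but the traversal helper itself is illustrative, not memo-agent's actual code:

```typescript
// Sketch: reconstruct a session chain by walking parent links.
// `SessionRow` matches the sessions table; `sessionChain` is illustrative.
interface SessionRow {
  id: string;
  parent_session_id: string | null;
  compressed_summary: string | null;
}

function sessionChain(byId: Map<string, SessionRow>, leafId: string): string[] {
  const chain: string[] = [];
  let cur: SessionRow | undefined = byId.get(leafId);
  while (cur) {
    chain.unshift(cur.id); // oldest ancestor ends up first
    cur = cur.parent_session_id ? byId.get(cur.parent_session_id) : undefined;
  }
  return chain;
}
```

For the A → B → C example above, `sessionChain(rows, "C")` yields `["A", "B", "C"]`, which is exactly the order a history view would render.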
Cost Comparison:
| Solution | Cost per round after 100 rounds | Pros | Cons |
|---|---|---|---|
| No compression | ~$0.015/round | Complete history | Linear cost growth |
| Three-zone compression | ~$0.006/round | Balanced | Summaries may lose details |
| Direct truncation | ~$0.005/round | Low cost | Loses early context |
III. Context Compression: Three-Zone Model Engineering Practice
3.1 Why Compression is Needed
Large models have limited context windows (e.g., GPT-4o's 128k tokens). Even with sufficient window size, excessively long contexts cause two problems:
- Linear cost growth — Every round sends the entire history, token consumption keeps increasing
- Attention dilution — The model may "overlook" key information buried in the middle of long contexts
Common truncation strategies (directly discarding earliest messages) break conversation coherence. memo-agent adopts a more elegant three-zone model.
3.2 Three-Zone Model
Divides conversation context into three zones:
```
HEAD (Anchor), ~4k tokens: never compressed, full semantics retained
├── system prompt (NOTES.md + PROFILE.md)
├── First user input (project background, constraints)
└── First AI response (core decisions)

MIDDLE (Archive), dynamically sized
├── Before compression: complete conversation history (round 2 to N-20)
└── After compression: LLM-generated structured summary
    e.g. "Decided to use SQLite WAL mode; pending: index fields"

TAIL (Active), ~20k tokens
└── Last 20 rounds of conversation, fully preserved:
    complete context for the current topic
```
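The zone split itself is mostly array slicing. A sketch of how it might look; the function name and the two-message HEAD are assumptions rather than memo-agent's actual implementation, while the 20-round TAIL follows the diagram above:

```typescript
// Sketch of the three-zone split. `splitZones` is illustrative;
// HEAD = first user/assistant exchange, TAIL = last `tailRounds` rounds.
interface Message { role: string; content: string; }

function splitZones(messages: Message[], tailRounds = 20) {
  const headCount = Math.min(2, messages.length); // first exchange stays verbatim
  const tailCount = Math.min(tailRounds * 2, messages.length - headCount);
  const head = messages.slice(0, headCount);
  const tail = tailCount > 0 ? messages.slice(messages.length - tailCount) : [];
  const middle = messages.slice(headCount, messages.length - tail.length);
  return { head, middle, tail }; // middle is what gets summarized
}
```

For a 100-message history this keeps 2 HEAD and 40 TAIL messages and hands the 58 in between to the summarizer; short conversations produce an empty MIDDLE and are never compressed.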
Compression Trigger Strategy:
| Threshold | Behavior | Status Bar Display |
|---|---|---|
| 70% | Yellow warning | tokens: 89k/128k (70%) |
| 85% | Auto-trigger archive | Shows "compressing..." |
| Manual | `/compact [focus description]` | Executes immediately |
Summary Generation Prompt Template:
```typescript
const COMPRESS_PROMPT = `Compress the following conversation history into a structured summary.

Retain: key decisions, pending items, technical constraints
Discard: specific code implementations, debugging process, repeated discussions

Format:
- Decision: [matter]
- Constraint: [condition]
- Pending: [to be confirmed]

Conversation history:
{{history}}`;
```
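Wiring the MIDDLE zone into this template and handing it to the auxiliary model could look like the following sketch. `CompleteFn` is a stand-in for whatever client you use (OpenAI SDK, etc.); the wiring, not the API, is the point:

```typescript
// Sketch: render the MIDDLE zone into a compression prompt and send it
// to the auxiliary model. `CompleteFn` is a placeholder abstraction,
// not a real memo-agent type.
type CompleteFn = (model: string, prompt: string) => Promise<string>;

async function compressMiddle(
  middle: { role: string; content: string }[],
  complete: CompleteFn,
  promptTemplate: string, // e.g. COMPRESS_PROMPT from above
  model = "gpt-4o-mini",
): Promise<string> {
  const history = middle.map((m) => `${m.role}: ${m.content}`).join("\n");
  const prompt = promptTemplate.replace("{{history}}", history);
  return complete(model, prompt); // resolves to the structured summary
}
```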
3.3 Summary Generation Strategy
Archive compression isn't simple text truncation—it's about having an auxiliary model (recommend low-cost models like gpt-4o-mini) generate structured summaries. For example:
Original Conversation (10 rounds):
```
User: I want to write a SQLite storage layer
AI: Sure, I recommend using better-sqlite3...
User: Need to support WAL mode
AI: WAL mode configuration is as follows...
User: Also add FTS5 full-text search
AI: You can create a virtual table like this...
```
Compressed Summary:
- Decision: Use better-sqlite3 as the database driver
- Configuration: Enable WAL mode (concurrent read/write, better performance)
- Feature: Add FTS5 virtual table for full-text search support
- Pending: Index fields to be confirmed
The summary retains decision points and pending items, discarding implementation details. If these details are needed later, they can be retrieved through /search in the history.
3.4 Auxiliary Model Cost Reduction
Archive compression can be configured with an independent auxiliary model:
```yaml
model:
  name: gpt-4o          # Main model: high-quality conversation
  auxiliary:
    name: gpt-4o-mini   # Auxiliary model: archive summaries
```

A typical scenario:
- 100 rounds of conversation, cumulative 120k tokens consumed
- When triggering archive, use gpt-4o-mini to process 80k tokens of intermediate history
- Generate 2k tokens of summary
- Save approximately 60% of token consumption for subsequent rounds
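A back-of-envelope check of these figures; all token counts are the article's assumed values, not measurements:

```typescript
// Rough arithmetic for the scenario above (assumed values, not measured).
const totalTokens   = 120_000; // cumulative history after ~100 rounds
const middleTokens  =  80_000; // MIDDLE zone handed to the auxiliary model
const summaryTokens =   2_000; // structured summary it produces

// After compression, each round sends history minus MIDDLE plus summary.
const afterTokens = totalTokens - middleTokens + summaryTokens; // 42,000
const savings = 1 - afterTokens / totalTokens;                  // 0.65

console.log(`~${Math.round(savings * 100)}% fewer input tokens per subsequent round`);
```

The raw slicing works out to ~65%; a figure closer to 60% is plausible once the one-off cost of the compression call itself is amortized over subsequent rounds.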
IV. Full-Text Search: Adding a Search Engine to History
Summaries alone aren't enough—users often ask "what did we discuss before," requiring precise recall. memo-agent implements full-text search using FTS5 virtual tables on SQLite.
4.1 Table Structure Design
```sql
-- Messages table
CREATE TABLE messages (
  id TEXT PRIMARY KEY,
  session_id TEXT NOT NULL,
  role TEXT NOT NULL,     -- 'user' | 'assistant' | 'tool' | 'system'
  content TEXT,
  tool_calls JSON,        -- Tool call records
  token_count INTEGER,
  created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 external-content virtual table (automatic tokenization, inverted index).
-- content_rowid must be an integer column, so the TEXT primary key cannot be
-- used; omitting it defaults to the source table's implicit rowid.
CREATE VIRTUAL TABLE messages_fts USING fts5(
  content,            -- Indexed field
  content='messages'  -- External content source table
);
```
4.2 Automatic Synchronization Mechanism
Uses triggers to keep FTS index synchronized with source table:
```sql
-- Automatically sync to FTS on insert
CREATE TRIGGER messages_fts_insert AFTER INSERT ON messages BEGIN
  INSERT INTO messages_fts(rowid, content) VALUES (new.rowid, new.content);
END;

-- External-content FTS5 tables must be updated through the special
-- 'delete' command; a plain DELETE would leave the index inconsistent
CREATE TRIGGER messages_fts_delete AFTER DELETE ON messages BEGIN
  INSERT INTO messages_fts(messages_fts, rowid, content)
  VALUES ('delete', old.rowid, old.content);
END;

-- An update is a delete of the old entry plus an insert of the new one
CREATE TRIGGER messages_fts_update AFTER UPDATE ON messages BEGIN
  INSERT INTO messages_fts(messages_fts, rowid, content)
  VALUES ('delete', old.rowid, old.content);
  INSERT INTO messages_fts(rowid, content) VALUES (new.rowid, new.content);
END;
```
4.3 Query and Security Protection
When a user runs `/search sqlite WAL mode`, the underlying execution is:

```typescript
function searchMessages(db: Database, query: string, limit = 20) {
  // Run the user's input as an FTS5 phrase query: wrap it in double
  // quotes and double any quotes inside it, so FTS5 operators
  // (AND, OR, NEAR, *) in user input cannot inject query syntax
  const safeQuery = `"${query.replace(/"/g, '""')}"`;

  // Join via rowid: messages.id is TEXT, but the FTS index is keyed
  // by the source table's implicit integer rowid
  const sql = `
    SELECT m.*, s.title, messages_fts.rank AS rank
    FROM messages_fts
    JOIN messages m ON messages_fts.rowid = m.rowid
    JOIN sessions s ON m.session_id = s.id
    WHERE messages_fts MATCH ?
    ORDER BY rank
    LIMIT ?
  `;
  return db.prepare(sql).all(safeQuery, limit);
}
```
Query Result Example:
```
> /search sqlite WAL mode

[Session: Database Design Discussion]
User: I want to write a SQLite storage layer
Assistant: Sure, I recommend using better-sqlite3 and enabling WAL mode...

[Session: Performance Optimization]
User: How to handle read-write conflicts in WAL mode?
Assistant: WAL mode supports concurrent read/write, but you need to configure busy_timeout...
```
4.4 Performance Data
- 10,000 messages: Index build time ~200ms, query time < 10ms
- 100,000 messages: Index size approximately 30% of original data, query time < 50ms
V. Summary
Giving terminal AI assistants "long-term memory" and "extended conversation" capabilities centers on three designs:
- Local file memory — Simple and reliable, automatic injection, with security scanning
- Three-zone compression model — HEAD anchor + MIDDLE summary + TAIL active, balancing completeness and cost
- Session chain + full-text search — History never lost, key information retrievable
These designs aren't silver bullets. Summaries lose details, automatic memory may introduce noise, and token counting has errors. But in engineering practice, they strike a good balance between usability and cost.
5.1 Quick Start
If you want to experience these features:
```shell
# Install
npm install -g memo-agent

# Initialize configuration
memo init

# Start a conversation (memory loads automatically)
memo

# View conversation history
memo --history

# Return to a specific session
memo --resume <session-id>
```
5.2 Future Roadmap
- MCP Integration: Connect external data sources (Notion, GitHub Issues, etc.) through Model Context Protocol
- Multimodal Memory: Support OCR indexing and retrieval of images, code screenshots
- Smart Archive Strategy: Automatically determine compression granularity based on conversation importance, rather than simple token thresholds
- Collaborative Memory: Team-shared NOTES.md for unified project standards
If you're building similar terminal AI tools, welcome to discuss and exchange ideas.
Reference Implementation
- Project: github.com/lxfu1/memo-agent
- Core Modules:
  - `src/context/compressor.ts` — Three-zone compression implementation
  - `src/memory/notesManager.ts` — Local file memory management
  - `src/session/db.ts` — SQLite and session chain design
  - `src/model/streaming.ts` — Streaming conversation processing