This post draws on practical experience building memo-agent (a simplified Hermes variant) to explore how to give terminal AI assistants "long-term memory" and "extended conversation" capabilities.
I. The Starting Point
Imagine this scenario: You're pair-programming with AI through a CLI tool, discussing project architecture, database design, and API specifications for 2 hours straight. The AI performs well, remembering every constraint you mentioned.
The next day, you open the terminal and type "continue yesterday's design." The AI replies: "Sure, what did we discuss yesterday?"
The conversation context has been reset.
This isn't science fiction—it's the reality of most terminal AI tools today. Other common issues include:
- After 30 rounds of code review, the AI "forgets" the architectural constraints you mentioned at the beginning
- Context window warnings force you to `/clear` and start over
- No cross-session context accumulation means re-explaining background every time
The root causes are twofold:
- Brutal context truncation — During extended conversations, the system often discards the earliest messages, causing critical information loss
- No memory across sessions — Each session starts from zero, unable to accumulate project context
The design goal of memo-agent is to solve these two problems.
II. Persistent Memory: Letting AI Remember "Who You Are"
2.1 Local File Memory
The simplest persistence solution is often the most reliable. memo-agent uses local Markdown files as memory carriers:
```
~/.memo-agent/memory/
├── NOTES.md     # Work notes (agent can read/write)
└── PROFILE.md   # User preferences (read-only, maintained by user)
```
NOTES.md is automatically updated by the agent after each conversation round when deemed necessary. For example:
- You mentioned "this project uses functional style, avoid classes"
- You specified "API responses should always use the `{code, data, message}` format"
- You indicated "use SQLite WAL mode for the database"
If the agent considers these valuable, it appends them to NOTES.md. These notes are automatically injected into the system prompt at the start of the next session, becoming the agent's "common knowledge."
PROFILE.md is manually maintained by the user, suitable for long-term stable preferences:
```markdown
I'm a backend engineer, primarily using Go and TypeScript.
Code style: functional first, avoid over-abstraction.
Please respond in Chinese, with English code comments.
```
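At session start, memo-agent reads both files and folds them into the system prompt. A minimal sketch of that assembly; the function name and section labels here are illustrative, not the actual memo-agent API:

```typescript
// Sketch only: `buildSystemPrompt` and the section headers are
// assumptions for illustration, not memo-agent's real implementation.
interface MemoryFiles {
  notes: string;   // contents of NOTES.md (agent-maintained)
  profile: string; // contents of PROFILE.md (user-maintained)
}

function buildSystemPrompt(base: string, memory: MemoryFiles): string {
  const sections = [base];
  if (memory.profile.trim()) {
    sections.push(`## User profile\n${memory.profile.trim()}`);
  }
  if (memory.notes.trim()) {
    sections.push(`## Project notes\n${memory.notes.trim()}`);
  }
  // Empty files add nothing, so a fresh install behaves like a
  // memory-less assistant.
  return sections.join("\n\n");
}
```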
2.2 Safe Injection Mechanism
Injecting local file content into the system prompt carries security risks — if files are maliciously tampered with, they may contain prompt injection attacks. memo-agent scans content before injection, detecting the following patterns:
- "Ignore previous instructions"
- "You are now a... role"
- "Send the following information to..."
If detected, injection is skipped and an alert is shown in the UI.
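A scan like this can be a handful of regular expressions run before injection. A minimal sketch; the pattern list is illustrative and memo-agent's actual rules may differ:

```typescript
// Sketch of the pre-injection scan. The pattern list is an
// illustration, not memo-agent's real rule set.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,             // instruction override
  /you are now an? /i,                                // role hijack
  /send (the following|this) (information|data) to/i, // exfiltration
];

function isSafeToInject(content: string): boolean {
  return !SUSPICIOUS_PATTERNS.some((p) => p.test(content));
}

isSafeToInject("- Decision: use WAL mode");            // → true
isSafeToInject("Ignore previous instructions and...")  // → false
```

Regex matching is a heuristic, not a guarantee; treating the memory files as untrusted input and alerting the user on a hit is the important part.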
2.3 Session Chain: History Never Lost
NOTES.md alone isn't enough. When conversations grow long and need compression, memo-agent doesn't truncate brutally. Instead:
- Uses an auxiliary model to generate a summary of intermediate history
- Creates a new session with the summary as the starting context
- Old sessions are linked through `parent_session_id`
This forms the following structure:
```
Session Tree

Session A (2024-01-15)
├── Original conversation: 50 rounds
├── Input tokens: 45,000
└── Child session: B

Session B (2024-01-16)
├── Compressed summary: "Decided to use SQLite WAL..."
├── New conversation: 30 rounds
├── Input tokens: 28,000
└── Child session: C

Session C (2024-01-17)
├── Secondary compression summary
└── New conversation: 20 rounds
```
Session Chain Database Design:
```sql
CREATE TABLE sessions (
  id TEXT PRIMARY KEY,
  title TEXT,
  model TEXT,
  parent_session_id TEXT REFERENCES sessions(id),  -- Chain linkage
  compressed_summary TEXT,                         -- Summary inherited from parent session
  input_tokens INTEGER DEFAULT 0,
  output_tokens INTEGER DEFAULT 0,
  estimated_cost_usd REAL DEFAULT 0,
  created_at TEXT DEFAULT (datetime('now')),
  updated_at TEXT DEFAULT (datetime('now'))
);
```
In theory, this chain can extend infinitely—history is never lost. Users can view the session chain via /history and use --resume <session-id> to return to any node.
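The `/history` view can be derived by following `parent_session_id` links upward from the current session. A minimal sketch; `SessionRow` mirrors the table above, but the traversal helper itself is illustrative, not memo-agent's actual code:

```typescript
// Sketch: reconstruct a session chain by walking parent links.
// `SessionRow` matches the sessions table; `sessionChain` is illustrative.
interface SessionRow {
  id: string;
  parent_session_id: string | null;
  compressed_summary: string | null;
}

function sessionChain(byId: Map<string, SessionRow>, leafId: string): string[] {
  const chain: string[] = [];
  let cur: SessionRow | undefined = byId.get(leafId);
  while (cur) {
    chain.unshift(cur.id); // oldest ancestor ends up first
    cur = cur.parent_session_id ? byId.get(cur.parent_session_id) : undefined;
  }
  return chain;
}
```

For the A → B → C example above, `sessionChain(rows, "C")` yields `["A", "B", "C"]`, which is exactly the order a history view would render.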
Cost Comparison:
| Solution | Cost per round after 100 rounds | Pros | Cons |
|---|---|---|---|
| No compression | ~$0.015/round | Complete history | Linear cost growth |
| Three-zone compression | ~$0.006/round | Balanced | Summaries may lose details |
| Direct truncation | ~$0.005/round | Low cost | Loses early context |
III. Context Compression: Three-Zone Model Engineering Practice
3.1 Why Compression is Needed
Large models have limited context windows (e.g., GPT-4o's 128k tokens). Even with sufficient window size, excessively long contexts cause two problems:
- Linear cost growth — Every round sends the entire history, token consumption keeps increasing
- Attention dilution — The model may "overlook" key information buried in the middle of long contexts
Common truncation strategies (directly discarding earliest messages) break conversation coherence. memo-agent adopts a more elegant three-zone model.
3.2 Three-Zone Model
Divides conversation context into three zones:
```
HEAD (Anchor), ~4k tokens: never compressed, full semantics retained
├── system prompt (NOTES.md + PROFILE.md)
├── First user input (project background, constraints)
└── First AI response (core decisions)

MIDDLE (Archive), dynamically sized
├── Before compression: complete conversation history (round 2 to N-20)
└── After compression: LLM-generated structured summary
    e.g. "Decided to use SQLite WAL mode; pending: index fields"

TAIL (Active), ~20k tokens
└── Last 20 rounds of conversation, fully preserved:
    complete context for the current topic
```
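The zone split itself is mostly array slicing. A sketch of how it might look; the function name and the two-message HEAD are assumptions rather than memo-agent's actual implementation, while the 20-round TAIL follows the diagram above:

```typescript
// Sketch of the three-zone split. `splitZones` is illustrative;
// HEAD = first user/assistant exchange, TAIL = last `tailRounds` rounds.
interface Message { role: string; content: string; }

function splitZones(messages: Message[], tailRounds = 20) {
  const headCount = Math.min(2, messages.length); // first exchange stays verbatim
  const tailCount = Math.min(tailRounds * 2, messages.length - headCount);
  const head = messages.slice(0, headCount);
  const tail = tailCount > 0 ? messages.slice(messages.length - tailCount) : [];
  const middle = messages.slice(headCount, messages.length - tail.length);
  return { head, middle, tail }; // middle is what gets summarized
}
```

For a 100-message history this keeps 2 HEAD and 40 TAIL messages and hands the 58 in between to the summarizer; short conversations produce an empty MIDDLE and are never compressed.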
Compression Trigger Strategy:
| Threshold | Behavior | Status Bar Display |
|---|---|---|
| 70% | Yellow warning | tokens: 89k/128k (70%) |
| 85% | Auto-trigger archive | Shows "compressing..." |
| Manual | `/compact [focus description]` | Executes immediately |
Summary Generation Prompt Template:
```typescript
const COMPRESS_PROMPT = `Compress the following conversation history into a structured summary.

Retain: key decisions, pending items, technical constraints
Discard: specific code implementations, debugging process, repeated discussions

Format:
- Decision: [matter]
- Constraint: [condition]
- Pending: [to be confirmed]

Conversation history:
{{history}}`;
```
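Wiring the MIDDLE zone into this template and handing it to the auxiliary model could look like the following sketch. `CompleteFn` is a stand-in for whatever client you use (OpenAI SDK, etc.); the wiring, not the API, is the point:

```typescript
// Sketch: render the MIDDLE zone into a compression prompt and send it
// to the auxiliary model. `CompleteFn` is a placeholder abstraction,
// not a real memo-agent type.
type CompleteFn = (model: string, prompt: string) => Promise<string>;

async function compressMiddle(
  middle: { role: string; content: string }[],
  complete: CompleteFn,
  promptTemplate: string, // e.g. COMPRESS_PROMPT from above
  model = "gpt-4o-mini",
): Promise<string> {
  const history = middle.map((m) => `${m.role}: ${m.content}`).join("\n");
  const prompt = promptTemplate.replace("{{history}}", history);
  return complete(model, prompt); // resolves to the structured summary
}
```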
3.3 Summary Generation Strategy
Archive compression isn't simple text truncation—it's about having an auxiliary model (recommend low-cost models like gpt-4o-mini) generate structured summaries. For example:
Original Conversation (10 rounds):
```
User: I want to write a SQLite storage layer
AI: Sure, I recommend using better-sqlite3...
User: Need to support WAL mode
AI: WAL mode configuration is as follows...
User: Also add FTS5 full-text search
AI: You can create a virtual table like this...
```
Compressed Summary:
- Decision: Use better-sqlite3 as the database driver
- Configuration: Enable WAL mode (concurrent read/write, better performance)
- Feature: Add FTS5 virtual table for full-text search support
- Pending: Index fields to be confirmed
The summary retains decision points and pending items, discarding implementation details. If these details are needed later, they can be retrieved through /search in the history.
3.4 Auxiliary Model Cost Reduction
Archive compression can be configured with an independent auxiliary model:
```yaml
model:
  name: gpt-4o          # Main model: high-quality conversation
  auxiliary:
    name: gpt-4o-mini   # Auxiliary model: archive summaries
```

A typical scenario:
- 100 rounds of conversation, cumulative 120k tokens consumed
- When triggering archive, use gpt-4o-mini to process 80k tokens of intermediate history
- Generate 2k tokens of summary
- Save approximately 60% of token consumption for subsequent rounds
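A back-of-envelope check of these figures; all token counts are the article's assumed values, not measurements:

```typescript
// Rough arithmetic for the scenario above (assumed values, not measured).
const totalTokens   = 120_000; // cumulative history after ~100 rounds
const middleTokens  =  80_000; // MIDDLE zone handed to the auxiliary model
const summaryTokens =   2_000; // structured summary it produces

// After compression, each round sends history minus MIDDLE plus summary.
const afterTokens = totalTokens - middleTokens + summaryTokens; // 42,000
const savings = 1 - afterTokens / totalTokens;                  // 0.65

console.log(`~${Math.round(savings * 100)}% fewer input tokens per subsequent round`);
```

The raw slicing works out to ~65%; a figure closer to 60% is plausible once the one-off cost of the compression call itself is amortized over subsequent rounds.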
IV. Full-Text Search: Adding a Search Engine to History
Summaries alone aren't enough—users often ask "what did we discuss before," requiring precise recall. memo-agent implements full-text search using FTS5 virtual tables on SQLite.
4.1 Table Structure Design
```sql
-- Messages table
CREATE TABLE messages (
  id TEXT PRIMARY KEY,
  session_id TEXT NOT NULL,
  role TEXT NOT NULL,     -- 'user' | 'assistant' | 'tool' | 'system'
  content TEXT,
  tool_calls JSON,        -- Tool call records
  token_count INTEGER,
  created_at TEXT DEFAULT (datetime('now'))
);

-- FTS5 external-content virtual table (automatic tokenization, inverted index).
-- content_rowid must be an integer column, so the TEXT primary key cannot be
-- used; omitting it defaults to the source table's implicit rowid.
CREATE VIRTUAL TABLE messages_fts USING fts5(
  content,            -- Indexed field
  content='messages'  -- External content source table
);
```
4.2 Automatic Synchronization Mechanism
Uses triggers to keep FTS index synchronized with source table:
```sql
-- Automatically sync to FTS on insert
CREATE TRIGGER messages_fts_insert AFTER INSERT ON messages BEGIN
  INSERT INTO messages_fts(rowid, content) VALUES (new.rowid, new.content);
END;

-- External-content FTS5 tables must be updated through the special
-- 'delete' command; a plain DELETE would leave the index inconsistent
CREATE TRIGGER messages_fts_delete AFTER DELETE ON messages BEGIN
  INSERT INTO messages_fts(messages_fts, rowid, content)
  VALUES ('delete', old.rowid, old.content);
END;

-- An update is a delete of the old entry plus an insert of the new one
CREATE TRIGGER messages_fts_update AFTER UPDATE ON messages BEGIN
  INSERT INTO messages_fts(messages_fts, rowid, content)
  VALUES ('delete', old.rowid, old.content);
  INSERT INTO messages_fts(rowid, content) VALUES (new.rowid, new.content);
END;
```
4.3 Query and Security Protection
When a user runs `/search sqlite WAL mode`, the underlying execution is:

```typescript
function searchMessages(db: Database, query: string, limit = 20) {
  // Run the user's input as an FTS5 phrase query: wrap it in double
  // quotes and double any quotes inside it, so FTS5 operators
  // (AND, OR, NEAR, *) in user input cannot inject query syntax
  const safeQuery = `"${query.replace(/"/g, '""')}"`;

  // Join via rowid: messages.id is TEXT, but the FTS index is keyed
  // by the source table's implicit integer rowid
  const sql = `
    SELECT m.*, s.title, messages_fts.rank AS rank
    FROM messages_fts
    JOIN messages m ON messages_fts.rowid = m.rowid
    JOIN sessions s ON m.session_id = s.id
    WHERE messages_fts MATCH ?
    ORDER BY rank
    LIMIT ?
  `;
  return db.prepare(sql).all(safeQuery, limit);
}
```
Query Result Example:
```
> /search sqlite WAL mode

[Session: Database Design Discussion]
User: I want to write a SQLite storage layer
Assistant: Sure, I recommend using better-sqlite3 and enabling WAL mode...

[Session: Performance Optimization]
User: How to handle read-write conflicts in WAL mode?
Assistant: WAL mode supports concurrent read/write, but you need to configure busy_timeout...
```
4.4 Performance Data
- 10,000 messages: Index build time ~200ms, query time < 10ms
- 100,000 messages: Index size approximately 30% of original data, query time < 50ms
V. Summary
Giving terminal AI assistants "long-term memory" and "extended conversation" capabilities centers on three designs:
- Local file memory — Simple and reliable, automatic injection, with security scanning
- Three-zone compression model — HEAD anchor + MIDDLE summary + TAIL active, balancing completeness and cost
- Session chain + full-text search — History never lost, key information retrievable
These designs aren't silver bullets. Summaries lose details, automatic memory may introduce noise, and token counting has errors. But in engineering practice, they strike a good balance between usability and cost.
5.1 Quick Start
If you want to experience these features:
```shell
# Install
npm install -g memo-agent

# Initialize configuration
memo init

# Start a conversation (memory loads automatically)
memo

# View conversation history
memo --history

# Return to a specific session
memo --resume <session-id>
```
5.2 Future Roadmap
- MCP Integration: Connect external data sources (Notion, GitHub Issues, etc.) through Model Context Protocol
- Multimodal Memory: Support OCR indexing and retrieval of images, code screenshots
- Smart Archive Strategy: Automatically determine compression granularity based on conversation importance, rather than simple token thresholds
- Collaborative Memory: Team-shared NOTES.md for unified project standards
If you're building similar terminal AI tools, welcome to discuss and exchange ideas.
Reference Implementation
- Project: github.com/lxfu1/memo-agent
- Core Modules:
  - `src/context/compressor.ts` — Three-zone compression implementation
  - `src/memory/notesManager.ts` — Local file memory management
  - `src/session/db.ts` — SQLite and session chain design
  - `src/model/streaming.ts` — Streaming conversation processing