Why Default Memory Fails at Scale
OpenClaw's built-in memory is simple: append to MEMORY.md, inject the whole file into every prompt. Works fine at 500 tokens. Falls apart at 5,000.
The problems compound:
Token explosion: Every message pays the full context tax. A 10-token query drags 4,000 tokens of memory. Your $0.01 API call becomes $0.15.
Relevance collapse: The model sees everything, prioritizes nothing. Ask about "database connection pooling" and it weighs your lunch preferences equally.
No semantic understanding: Keyword matching alone misses synonyms. "DB connection" won't find notes about "PostgreSQL pooling" unless you used those exact words.
Cloud dependency: Vector search usually means Pinecone, Weaviate, or some hosted service. Your private notes now live on someone else's servers.
QMD solves all four. It indexes your markdown files locally, runs hybrid retrieval combining three search strategies, and returns only the relevant snippets. 700 characters max per result, 6 results default. Your 10,000-token memory footprint becomes 200 tokens of gold.
What interviewers are actually testing: Can you explain the token economics of context injection? The insight: context length is O(n) cost, but relevance is what matters. Retrieval-augmented generation (RAG) exists because "just include everything" doesn't scale.
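The arithmetic behind that claim is worth sketching directly. A toy cost model in TypeScript, using the token counts from this article and an assumed flat per-token price (not any provider's real rate):

```typescript
// Toy cost model for context injection. The token counts come from this
// article; the per-1K-token price is an assumed flat rate for illustration,
// not any provider's real pricing.
const PRICE_PER_1K_INPUT_TOKENS = 0.03; // USD, assumed

function injectionCost(queryTokens: number, memoryTokens: number): number {
  // Every call pays for the query PLUS whatever memory gets injected:
  // cost is linear (O(n)) in total context length.
  return ((queryTokens + memoryTokens) / 1000) * PRICE_PER_1K_INPUT_TOKENS;
}

// Full-file injection: a 10-token query drags in 4,000 tokens of memory.
const naive = injectionCost(10, 4000);
// Retrieval: the same query plus ~200 tokens of relevant snippets.
const retrieved = injectionCost(10, 200);

console.log(naive.toFixed(4), retrieved.toFixed(4));
```

Whatever rate you plug in, the ratio is what matters: injecting 4,000 tokens instead of 200 makes every call roughly 19× more expensive, and that multiplier applies to every single message.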
The Hybrid Search Pipeline
QMD doesn't pick one search strategy. It runs three and combines the results.
Stage 1: BM25 (Keyword Matching)
Classic information retrieval. Term frequency, inverse document frequency, document length normalization. Fast, deterministic, great for exact matches. When you search "SwiftUI navigation," BM25 finds documents containing those exact terms.
Score(q, d) = Σ IDF(t) × TF(t, d) × (k₁ + 1) / (TF(t, d) + k₁ × (1 − b + b × |d| / avgdl)), summed over terms t in q
Limitation: misses semantic relationships. "iOS routing" won't match "SwiftUI navigation" even though they're related.
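The scoring formula above is short enough to implement directly. A minimal sketch, not QMD's actual code, with k₁ and b set to common defaults and documents pre-tokenized into lowercase term arrays:

```typescript
// Minimal BM25 scorer matching the formula above. A sketch, not QMD's
// implementation; k1 = 1.2 and b = 0.75 are the usual defaults.
const k1 = 1.2;
const b = 0.75;

function idf(term: string, docs: string[][]): number {
  const n = docs.filter((d) => d.includes(term)).length;
  // Standard BM25 IDF with 0.5 smoothing, kept non-negative via the +1.
  return Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
}

function bm25(query: string[], doc: string[], docs: string[][]): number {
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  let score = 0;
  for (const term of query) {
    const tf = doc.filter((t) => t === term).length; // term frequency
    score +=
      (idf(term, docs) * tf * (k1 + 1)) /
      (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}
```

Note that a document containing neither query term scores exactly zero: the term-frequency factor kills every summand, which is the "exact matches only" behavior described above.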
Stage 2: Vector Search (Semantic Matching)
QMD uses Jina v3 embeddings, running locally via a ~1GB GGUF model. Your text becomes a 1024-dimensional vector. Similar meanings cluster together in vector space, so "iOS routing" lands near "SwiftUI navigation."
The embedding model downloads automatically on first run. No API keys. No cloud calls. Your notes never leave your machine.
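Operationally, "similar meanings cluster together" means high cosine similarity between embedding vectors. A sketch using toy 3-dimensional stand-ins for Jina v3's 1024 dimensions (the vectors are made up for illustration):

```typescript
// Cosine similarity: the standard closeness measure over embeddings.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Hypothetical embeddings: "iOS routing" and "SwiftUI navigation" point in
// nearly the same direction; "lunch preferences" points elsewhere.
const iosRouting = [0.9, 0.4, 0.1];
const swiftuiNav = [0.85, 0.5, 0.15];
const lunch = [0.05, 0.1, 0.95];

console.log(cosine(iosRouting, swiftuiNav) > cosine(iosRouting, lunch)); // true
```

This is exactly what BM25 cannot do: no term overlap is required, only directional closeness in the embedding space.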
Stage 3: LLM Reranking (Precision Boost)
Here's where it gets interesting. After BM25 and vector search return candidates, a local LLM reranks them by actual relevance to your query. This catches cases where keyword and semantic matches both miss the point.
The reranker asks: "Given the query 'Ray's SwiftUI style,' which of these snippets actually answers it?" A snippet about Ray's code review preferences beats a snippet mentioning SwiftUI in passing.
```
Query: "Ray's SwiftUI style"
├── BM25 candidates (10)
├── Vector candidates (10)
└── LLM reranker → Top 6 results (700 chars each)
```
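The pipeline in the diagram can be sketched end to end. The `rerank` callback here is a stand-in scoring function; in QMD that step is a local LLM call. Candidate shapes and the `topK`/`maxChars` defaults mirror the numbers above:

```typescript
// End-to-end hybrid retrieval sketch: merge BM25 and vector candidates,
// de-duplicate, rerank, truncate. Illustrative, not QMD's actual code.
interface Candidate {
  id: string;
  text: string;
}

function hybridSearch(
  bm25Hits: Candidate[],
  vectorHits: Candidate[],
  rerank: (query: string, c: Candidate) => number, // stand-in for local LLM
  query: string,
  topK = 6,
  maxChars = 700,
): Candidate[] {
  // Union the two candidate lists, de-duplicating by id.
  const pool = new Map<string, Candidate>();
  for (const c of [...bm25Hits, ...vectorHits]) pool.set(c.id, c);
  // Rerank the merged pool, keep the top K, truncate each snippet.
  return [...pool.values()]
    .map((c) => ({ c, score: rerank(query, c) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ c }) => ({ ...c, text: c.text.slice(0, maxChars) }));
}
```

The design point: BM25 and vector search are cheap recall stages that over-fetch, and the expensive reranker only ever sees the small merged pool.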
What interviewers are actually testing: Hybrid search is the 2026 standard for production RAG. Pure vector search has recall problems (misses keyword matches). Pure BM25 has semantic problems. The combination, plus reranking, is how you build retrieval that actually works.
Local-First Architecture
QMD runs entirely on your machine. No cloud. No API costs. No privacy leakage.
The stack:
- Rust CLI: Fast, single binary, cross-platform
- GGUF models: Quantized for local inference (~1GB total)
- SQLite indexes: BM25 and metadata stored locally
- Jina v3 embeddings: 1024-dim vectors, multilingual
On a Mac Mini M2, embedding 1,000 markdown files takes about 30 seconds. Queries return in under 100ms. The models auto-download on first use, no manual setup required.
Why does this matter? Three reasons:
Cost: Vector search APIs charge per query. At scale, that's real money. QMD is free after the initial model download.
Privacy: Your agent memory contains sensitive context. Project names, credentials patterns, personal preferences. Keeping it local means keeping it private.
Latency: Network round-trips add 50-200ms per query. Local inference is faster, especially when you're running multiple retrievals per agent turn.
The trade-off is compute. You need a machine with enough RAM to load the models (~4GB recommended). Cloud instances work, but you're paying for compute instead of API calls.
What interviewers are actually testing: The build-vs-buy decision for ML infrastructure. Local models trade API costs for compute costs. The break-even depends on query volume, latency requirements, and privacy constraints. Know your numbers.
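A back-of-envelope break-even model makes "know your numbers" concrete. Every number below is an assumption chosen to illustrate the calculation, not a quote from any vendor:

```typescript
// Break-even point for local vs hosted retrieval: how many queries until
// dedicated compute pays for itself? All inputs are illustrative assumptions.
function breakEvenQueries(
  hardwareCost: number,      // one-time cost of compute you dedicate, USD
  localCostPerQuery: number, // marginal local cost (power etc.), near zero
  apiCostPerQuery: number,   // hosted vector-search fee per query, USD
): number {
  return hardwareCost / (apiCostPerQuery - localCostPerQuery);
}

// e.g. $600 of dedicated compute vs a hypothetical $0.002/query hosted fee:
console.log(Math.round(breakEvenQueries(600, 0.0001, 0.002)));
```

At multiple retrievals per agent turn, an always-on agent crosses a threshold like this far sooner than the raw number suggests; the latency and privacy wins come for free on top.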
Integration with OpenClaw
QMD plugs into OpenClaw as a memory backend. Three commands to set it up:
```bash
# Install QMD globally
bun install -g https://github.com/tobi/qmd

# Add memory collection
qmd collection add ~/.openclaw/agents/main/memory --name agent-logs

# Build initial embeddings
qmd embed
```
Then update your OpenClaw config:
```yaml
memory:
  backend: "qmd"
  qmd:
    update:
      interval: "5m"   # Re-index every 5 minutes
    limits:
      maxResults: 6    # Return top 6 snippets
      maxChars: 700    # 700 chars per snippet
```
On agent boot, QMD:
- Syncs indexes (15-second debounce to avoid thrashing)
- Pre-warms embeddings for frequently accessed files
- Registers as the memory provider for all retrieval calls
When the agent needs context, it queries QMD instead of injecting the full MEMORY.md. The Lane Queue serializes these queries to avoid OOM from concurrent embedding operations.
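The serialization pattern is simple to sketch: chain every task onto the previous one's completion, so only one embedding operation is ever in flight. This is an illustrative minimal lane queue, not OpenClaw's actual implementation:

```typescript
// Minimal "lane queue": tasks submitted concurrently run strictly one at a
// time. Prevents concurrent embedding jobs from piling up memory.
class LaneQueue {
  private tail: Promise<unknown> = Promise.resolve();

  // Chain each task onto the previous one. Callers get their own result
  // promise, but execution is serialized.
  run<T>(task: () => Promise<T>): Promise<T> {
    const next = this.tail.then(task, task);
    this.tail = next.catch(() => undefined); // keep the lane alive on error
    return next;
  }
}
```

With this in place, even if ten retrievals fire in the same agent turn, each `lane.run(...)` waits for the previous model invocation to finish before starting its own.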
You can also add custom paths beyond the default memory directory:
```bash
qmd collection add ~/projects/notes --name project-context
qmd collection add ~/.config/snippets --name code-patterns
```
All collections merge into a single search index. Query once, search everything.
What interviewers are actually testing: System integration patterns. How do you replace a component (memory backend) without breaking the rest of the system? The answer involves clean interfaces, configuration-driven switching, and graceful degradation if the new backend fails.
MCP Mode for Advanced Workflows
QMD exposes an MCP (Model Context Protocol) server, letting agents query memory programmatically. This enables self-healing memory workflows.
Example: a compaction skill that prunes outdated entries:
```typescript
// Memory compaction skill
const staleEntries = await qmd.query({
  collection: "agent-logs",
  filter: { olderThan: "30d", accessCount: 0 }
});

for (const entry of staleEntries) {
  if (await confirmDeletion(entry)) {
    await qmd.delete(entry.id);
  }
}

await qmd.reindex();
```
The MCP interface supports:
- query: Hybrid search with filters
- add: Insert new memory entries
- update: Modify existing entries
- delete: Remove stale content
- reindex: Rebuild embeddings after bulk changes
This turns memory from a passive store into an active system. Agents can curate their own context, pruning irrelevant entries and promoting useful ones.
One pattern I've seen work well: a nightly job that analyzes query patterns, identifies entries that never get retrieved, and archives them. Memory stays lean without manual curation.
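The selection step of that nightly job can be sketched as a pure function. The entry shape and the 30-day threshold are assumptions for illustration:

```typescript
// Usage-based archival selection: keep only entries that are both old
// enough to have had a fair chance at retrieval AND never retrieved.
// Entry shape and default threshold are illustrative assumptions.
interface MemoryEntry {
  id: string;
  retrievals: number; // times this entry was returned by a query
  ageDays: number;    // days since the entry was written
}

function selectForArchive(entries: MemoryEntry[], minAgeDays = 30): string[] {
  return entries
    .filter((e) => e.ageDays >= minAgeDays && e.retrievals === 0)
    .map((e) => e.id);
}
```

Keeping selection pure and separate from the archive action makes the job easy to dry-run: log what would be archived for a week before letting it delete anything, which is the "verify" half of observe, analyze, act, verify.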
What interviewers are actually testing: Can you design systems that maintain themselves? Self-healing infrastructure is a senior engineer concern. The specific technique (memory compaction) matters less than the pattern: observe, analyze, act, verify.
Try It Yourself
Want to benchmark QMD against default memory? Here's a comparison test.
Prerequisites
- OpenClaw v2026.2.0+
- Bun or Node 22+
- 4GB available RAM
- ~2GB disk space for models
Step 1: Install QMD
```bash
bun install -g https://github.com/tobi/qmd

# Verify installation
qmd --version
# Expected: qmd 0.4.2 or higher
```
Step 2: Create Test Collection
```bash
# Index your existing memory
qmd collection add ~/.openclaw/agents/main/memory --name test-memory

# Build embeddings (takes 30-60s first time)
qmd embed --collection test-memory
```
Step 3: Run Comparison Queries
```bash
# QMD hybrid search
time qmd query "database connection pooling" --collection test-memory

# Compare token counts
echo "QMD returns ~700 chars × 6 results = 4,200 chars max"
echo "Full MEMORY.md injection = $(wc -c < ~/.openclaw/agents/main/memory/MEMORY.md) chars"
```
Expected Output
```
Query: "database connection pooling"
Results: 6 snippets (4,102 chars total)
Latency: 47ms

Top result (relevance: 0.94):
"PostgreSQL connection pooling config: pool_size=20,
max_overflow=10. Set in database.yml. Learned 2026-01-15
after production OOM incident..."
```
Step 4: Enable in OpenClaw
```bash
# Add to config
openclaw config set memory.backend qmd
openclaw config set memory.qmd.update.interval 5m

# Restart to apply
openclaw restart
```
Troubleshooting
- "Model download failed": Check disk space. Models need ~1.5GB.
- "Collection not found": Run `qmd collection list` to verify paths.
- Slow first query: Normal. Embeddings cache after first run.
- OOM errors: Reduce `maxResults` or increase system RAM.
Key Takeaways
QMD transforms OpenClaw memory from a liability into an asset. Instead of injecting thousands of irrelevant tokens, you get surgical retrieval: BM25 for exact matches, vector search for semantic similarity, LLM reranking for precision. All running locally with zero cloud costs and zero data leakage.
The hybrid search pipeline is the key insight. Neither keyword nor semantic search alone is sufficient. Production RAG systems combine both, then rerank for the final precision boost. QMD packages this pattern into a single tool that integrates cleanly with OpenClaw's memory system.
If your MEMORY.md is past 2,000 tokens and you're paying for every context injection, QMD pays for itself in a week.
👉 Want more AI engineering deep dives? Follow the full OpenClaw Deep Dive series on Upskill.
🚀 Preparing for FAANG interviews? Upskill AI helps IC4-IC6 engineers ace system design and ML interviews.