Why Default Memory Fails at Scale
OpenClaw's built-in memory is simple: append to MEMORY.md, inject the whole file into every prompt. Works fine at 500 tokens. Falls apart at 5,000.
The problems compound:
Token explosion: Every message pays the full context tax. A 10-token query drags 4,000 tokens of memory. Your $0.01 API call becomes $0.15.
Relevance collapse: The model sees everything, prioritizes nothing. Ask about "database connection pooling" and it weighs your lunch preferences equally.
No semantic understanding: Keyword matching alone misses synonyms. "DB connection" won't find notes about "PostgreSQL pooling" unless you used those exact words.
Cloud dependency: Vector search usually means Pinecone, Weaviate, or some hosted service. Your private notes now live on someone else's servers.
QMD solves all four. It indexes your markdown files locally, runs hybrid retrieval combining three search strategies, and returns only the relevant snippets. 700 characters max per result, 6 results default. Your 10,000-token memory footprint becomes 200 tokens of gold.
What interviewers are actually testing: Can you explain the token economics of context injection? The insight: context length is O(n) cost, but relevance is what matters. Retrieval-augmented generation (RAG) exists because "just include everything" doesn't scale.
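The arithmetic behind that claim is worth sketching directly. A toy cost model in TypeScript, using the token counts from this article and an assumed flat per-token price (not any provider's real rate):

```typescript
// Toy cost model for context injection. The token counts come from this
// article; the per-1K-token price is an assumed flat rate for illustration,
// not any provider's real pricing.
const PRICE_PER_1K_INPUT_TOKENS = 0.03; // USD, assumed

function injectionCost(queryTokens: number, memoryTokens: number): number {
  // Every call pays for the query PLUS whatever memory gets injected:
  // cost is linear (O(n)) in total context length.
  return ((queryTokens + memoryTokens) / 1000) * PRICE_PER_1K_INPUT_TOKENS;
}

// Full-file injection: a 10-token query drags in 4,000 tokens of memory.
const naive = injectionCost(10, 4000);
// Retrieval: the same query plus ~200 tokens of relevant snippets.
const retrieved = injectionCost(10, 200);

console.log(naive.toFixed(4), retrieved.toFixed(4));
```

Whatever rate you plug in, the ratio is what matters: injecting 4,000 tokens instead of 200 makes every call roughly 19× more expensive, and that multiplier applies to every single message.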
The Hybrid Search Pipeline
QMD doesn't pick one search strategy. It runs three and combines the results.
Stage 1: BM25 (Keyword Matching)
Classic information retrieval. Term frequency, inverse document frequency, document length normalization. Fast, deterministic, great for exact matches. When you search "SwiftUI navigation," BM25 finds documents containing those exact terms.
Score(q, d) = Σ IDF(t) × TF(t, d) × (k₁ + 1) / (TF(t, d) + k₁ × (1 − b + b × |d| / avgdl)), summed over terms t in q
Limitation: misses semantic relationships. "iOS routing" won't match "SwiftUI navigation" even though they're related.
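The scoring formula above is short enough to implement directly. A minimal sketch, not QMD's actual code, with k₁ and b set to common defaults and documents pre-tokenized into lowercase term arrays:

```typescript
// Minimal BM25 scorer matching the formula above. A sketch, not QMD's
// implementation; k1 = 1.2 and b = 0.75 are the usual defaults.
const k1 = 1.2;
const b = 0.75;

function idf(term: string, docs: string[][]): number {
  const n = docs.filter((d) => d.includes(term)).length;
  // Standard BM25 IDF with 0.5 smoothing, kept non-negative via the +1.
  return Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
}

function bm25(query: string[], doc: string[], docs: string[][]): number {
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  let score = 0;
  for (const term of query) {
    const tf = doc.filter((t) => t === term).length; // term frequency
    score +=
      (idf(term, docs) * tf * (k1 + 1)) /
      (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}
```

Note that a document containing neither query term scores exactly zero: the term-frequency factor kills every summand, which is the "exact matches only" behavior described above.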
Stage 2: Vector Search (Semantic Matching)
QMD uses Jina v3 embeddings, running locally via a ~1GB GGUF model. Your text becomes a 1024-dimensional vector. Similar meanings cluster together in vector space, so "iOS routing" lands near "SwiftUI navigation."
The embedding model downloads automatically on first run. No API keys. No cloud calls. Your notes never leave your machine.
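Operationally, "similar meanings cluster together" means high cosine similarity between embedding vectors. A sketch using toy 3-dimensional stand-ins for Jina v3's 1024 dimensions (the vectors are made up for illustration):

```typescript
// Cosine similarity: the standard closeness measure over embeddings.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Hypothetical embeddings: "iOS routing" and "SwiftUI navigation" point in
// nearly the same direction; "lunch preferences" points elsewhere.
const iosRouting = [0.9, 0.4, 0.1];
const swiftuiNav = [0.85, 0.5, 0.15];
const lunch = [0.05, 0.1, 0.95];

console.log(cosine(iosRouting, swiftuiNav) > cosine(iosRouting, lunch)); // true
```

This is exactly what BM25 cannot do: no term overlap is required, only directional closeness in the embedding space.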
Stage 3: LLM Reranking (Precision Boost)
Here's where it gets interesting. After BM25 and vector search return candidates, a local LLM reranks them by actual relevance to your query. This catches cases where keyword and semantic matches both miss the point.
The reranker asks: "Given the query 'Ray's SwiftUI style,' which of these snippets actually answers it?" A snippet about Ray's code review preferences beats a snippet mentioning SwiftUI in passing.
```
Query: "Ray's SwiftUI style"
├── BM25 candidates (10)
├── Vector candidates (10)
└── LLM reranker → Top 6 results (700 chars each)
```
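The pipeline in the diagram can be sketched end to end. The `rerank` callback here is a stand-in scoring function; in QMD that step is a local LLM call. Candidate shapes and the `topK`/`maxChars` defaults mirror the numbers above:

```typescript
// End-to-end hybrid retrieval sketch: merge BM25 and vector candidates,
// de-duplicate, rerank, truncate. Illustrative, not QMD's actual code.
interface Candidate {
  id: string;
  text: string;
}

function hybridSearch(
  bm25Hits: Candidate[],
  vectorHits: Candidate[],
  rerank: (query: string, c: Candidate) => number, // stand-in for local LLM
  query: string,
  topK = 6,
  maxChars = 700,
): Candidate[] {
  // Union the two candidate lists, de-duplicating by id.
  const pool = new Map<string, Candidate>();
  for (const c of [...bm25Hits, ...vectorHits]) pool.set(c.id, c);
  // Rerank the merged pool, keep the top K, truncate each snippet.
  return [...pool.values()]
    .map((c) => ({ c, score: rerank(query, c) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ c }) => ({ ...c, text: c.text.slice(0, maxChars) }));
}
```

The design point: BM25 and vector search are cheap recall stages that over-fetch, and the expensive reranker only ever sees the small merged pool.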
What interviewers are actually testing: Hybrid search is the 2026 standard for production RAG. Pure vector search has recall problems (misses keyword matches). Pure BM25 has semantic problems. The combination, plus reranking, is how you build retrieval that actually works.
Local-First Architecture
QMD runs entirely on your machine. No cloud. No API costs. No privacy leakage.
The stack:
- Rust CLI: Fast, single binary, cross-platform
- GGUF models: Quantized for local inference (~1GB total)
- SQLite indexes: BM25 and metadata stored locally
- Jina v3 embeddings: 1024-dim vectors, multilingual
On a Mac Mini M2, embedding 1,000 markdown files takes about 30 seconds. Queries return in under 100ms. The models auto-download on first use, no manual setup required.
Why does this matter? Three reasons:
Cost: Vector search APIs charge per query. At scale, that's real money. QMD is free after the initial model download.
Privacy: Your agent memory contains sensitive context. Project names, credentials patterns, personal preferences. Keeping it local means keeping it private.
Latency: Network round-trips add 50-200ms per query. Local inference is faster, especially when you're running multiple retrievals per agent turn.
The trade-off is compute. You need a machine with enough RAM to load the models (~4GB recommended). Cloud instances work, but you're paying for compute instead of API calls.
What interviewers are actually testing: The build-vs-buy decision for ML infrastructure. Local models trade API costs for compute costs. The break-even depends on query volume, latency requirements, and privacy constraints. Know your numbers.
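A back-of-envelope break-even model makes "know your numbers" concrete. Every number below is an assumption chosen to illustrate the calculation, not a quote from any vendor:

```typescript
// Break-even point for local vs hosted retrieval: how many queries until
// dedicated compute pays for itself? All inputs are illustrative assumptions.
function breakEvenQueries(
  hardwareCost: number,      // one-time cost of compute you dedicate, USD
  localCostPerQuery: number, // marginal local cost (power etc.), near zero
  apiCostPerQuery: number,   // hosted vector-search fee per query, USD
): number {
  return hardwareCost / (apiCostPerQuery - localCostPerQuery);
}

// e.g. $600 of dedicated compute vs a hypothetical $0.002/query hosted fee:
console.log(Math.round(breakEvenQueries(600, 0.0001, 0.002)));
```

At multiple retrievals per agent turn, an always-on agent crosses a threshold like this far sooner than the raw number suggests; the latency and privacy wins come for free on top.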
Integration with OpenClaw
QMD plugs into OpenClaw as a memory backend. Three commands to set it up:
```bash
# Install QMD globally
bun install -g https://github.com/tobi/qmd

# Add memory collection
qmd collection add ~/.openclaw/agents/main/memory --name agent-logs

# Build initial embeddings
qmd embed
```
Then update your OpenClaw config:
```yaml
memory:
  backend: "qmd"
  qmd:
    update:
      interval: "5m"   # Re-index every 5 minutes
    limits:
      maxResults: 6    # Return top 6 snippets
      maxChars: 700    # 700 chars per snippet
```
On agent boot, QMD:
- Syncs indexes (15-second debounce to avoid thrashing)
- Pre-warms embeddings for frequently accessed files
- Registers as the memory provider for all retrieval calls
When the agent needs context, it queries QMD instead of injecting the full MEMORY.md. The Lane Queue serializes these queries to avoid OOM from concurrent embedding operations.
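The serialization pattern is simple to sketch: chain every task onto the previous one's completion, so only one embedding operation is ever in flight. This is an illustrative minimal lane queue, not OpenClaw's actual implementation:

```typescript
// Minimal "lane queue": tasks submitted concurrently run strictly one at a
// time. Prevents concurrent embedding jobs from piling up memory.
class LaneQueue {
  private tail: Promise<unknown> = Promise.resolve();

  // Chain each task onto the previous one. Callers get their own result
  // promise, but execution is serialized.
  run<T>(task: () => Promise<T>): Promise<T> {
    const next = this.tail.then(task, task);
    this.tail = next.catch(() => undefined); // keep the lane alive on error
    return next;
  }
}
```

With this in place, even if ten retrievals fire in the same agent turn, each `lane.run(...)` waits for the previous model invocation to finish before starting its own.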
You can also add custom paths beyond the default memory directory:
```bash
qmd collection add ~/projects/notes --name project-context
qmd collection add ~/.config/snippets --name code-patterns
```
All collections merge into a single search index. Query once, search everything.
What interviewers are actually testing: System integration patterns. How do you replace a component (memory backend) without breaking the rest of the system? The answer involves clean interfaces, configuration-driven switching, and graceful degradation if the new backend fails.
MCP Mode for Advanced Workflows
QMD exposes an MCP (Model Context Protocol) server, letting agents query memory programmatically. This enables self-healing memory workflows.
Example: a compaction skill that prunes outdated entries:
```typescript
// Memory compaction skill
const staleEntries = await qmd.query({
  collection: "agent-logs",
  filter: { olderThan: "30d", accessCount: 0 }
});

for (const entry of staleEntries) {
  if (await confirmDeletion(entry)) {
    await qmd.delete(entry.id);
  }
}

await qmd.reindex();
```
The MCP interface supports:
- query: Hybrid search with filters
- add: Insert new memory entries
- update: Modify existing entries
- delete: Remove stale content
- reindex: Rebuild embeddings after bulk changes
This turns memory from a passive store into an active system. Agents can curate their own context, pruning irrelevant entries and promoting useful ones.
One pattern I've seen work well: a nightly job that analyzes query patterns, identifies entries that never get retrieved, and archives them. Memory stays lean without manual curation.
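The selection step of that nightly job can be sketched as a pure function. The entry shape and the 30-day threshold are assumptions for illustration:

```typescript
// Usage-based archival selection: keep only entries that are both old
// enough to have had a fair chance at retrieval AND never retrieved.
// Entry shape and default threshold are illustrative assumptions.
interface MemoryEntry {
  id: string;
  retrievals: number; // times this entry was returned by a query
  ageDays: number;    // days since the entry was written
}

function selectForArchive(entries: MemoryEntry[], minAgeDays = 30): string[] {
  return entries
    .filter((e) => e.ageDays >= minAgeDays && e.retrievals === 0)
    .map((e) => e.id);
}
```

Keeping selection pure and separate from the archive action makes the job easy to dry-run: log what would be archived for a week before letting it delete anything, which is the "verify" half of observe, analyze, act, verify.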
What interviewers are actually testing: Can you design systems that maintain themselves? Self-healing infrastructure is a senior engineer concern. The specific technique (memory compaction) matters less than the pattern: observe, analyze, act, verify.
Try It Yourself
Want to benchmark QMD against default memory? Here's a comparison test.
Prerequisites
- OpenClaw v2026.2.0+
- Bun or Node 22+
- 4GB available RAM
- ~2GB disk space for models
Step 1: Install QMD
```bash
bun install -g https://github.com/tobi/qmd

# Verify installation
qmd --version
# Expected: qmd 0.4.2 or higher
```
Step 2: Create Test Collection
```bash
# Index your existing memory
qmd collection add ~/.openclaw/agents/main/memory --name test-memory

# Build embeddings (takes 30-60s first time)
qmd embed --collection test-memory
```
Step 3: Run Comparison Queries
```bash
# QMD hybrid search
time qmd query "database connection pooling" --collection test-memory

# Compare token counts
echo "QMD returns ~700 chars × 6 results = 4,200 chars max"
echo "Full MEMORY.md injection = $(wc -c < ~/.openclaw/agents/main/memory/MEMORY.md) chars"
```
Expected Output
```
Query: "database connection pooling"
Results: 6 snippets (4,102 chars total)
Latency: 47ms

Top result (relevance: 0.94):
"PostgreSQL connection pooling config: pool_size=20,
max_overflow=10. Set in database.yml. Learned 2026-01-15
after production OOM incident..."
```
Step 4: Enable in OpenClaw
```bash
# Add to config
openclaw config set memory.backend qmd
openclaw config set memory.qmd.update.interval 5m

# Restart to apply
openclaw restart
```
Troubleshooting
- "Model download failed": Check disk space. Models need ~1.5GB.
- "Collection not found": Run `qmd collection list` to verify paths.
- Slow first query: Normal. Embeddings cache after first run.
- OOM errors: Reduce `maxResults` or increase system RAM.
Key Takeaways
QMD transforms OpenClaw memory from a liability into an asset. Instead of injecting thousands of irrelevant tokens, you get surgical retrieval: BM25 for exact matches, vector search for semantic similarity, LLM reranking for precision. All running locally with zero cloud costs and zero data leakage.
The hybrid search pipeline is the key insight. Neither keyword nor semantic search alone is sufficient. Production RAG systems combine both, then rerank for the final precision boost. QMD packages this pattern into a single tool that integrates cleanly with OpenClaw's memory system.
If your MEMORY.md is past 2,000 tokens and you're paying for every context injection, QMD pays for itself in a week.
👉 Want more AI engineering deep dives? Follow the full OpenClaw Deep Dive series on Upskill.
🚀 Preparing for FAANG interviews? Upskill AI helps IC4-IC6 engineers ace system design and ML interviews.