You spend forty minutes configuring your OpenClaw agent. Database credentials, code style preferences, deployment quirks — you walk it through everything. Next session: blank slate. The agent has no idea you prefer bun over npm, that your staging server requires a VPN, or that auth/middleware.ts broke three times last week because of a race condition. OpenClaw's built-in memory system tries to fix this, but its Markdown-file approach has fundamental gaps. The agent decides what to save. The agent decides when to search. And context compaction can silently destroy the memories it did store.
This is the problem that memory-lancedb-pro attacks head-on — a community-built plugin that replaces OpenClaw's memory subsystem with a 7-layer hybrid retrieval pipeline, multi-scope isolation, and automatic noise filtering. I've been running it for weeks. Here's how it works, why it's engineered the way it is, and how to build one yourself.
Why the Built-In Memory System Leaks
OpenClaw's default memory stores information as plain Markdown files in the agent workspace. Daily logs go to memory/YYYY-MM-DD.md, long-term facts live in MEMORY.md. The agent decides what to write, when to search, and whether a retrieved memory is relevant enough to inject into context.
This architecture has three structural problems:
1. The agent controls the save. Memory capture is LLM-driven — the model decides what's worth persisting. In practice, it misses subtle-but-critical details. A throwaway mention of "use port 3001 for the dev server" never makes it to disk because the model didn't flag it as important.
2. Context compaction destroys memories. When the conversation approaches the context window limit, OpenClaw compresses older messages. Any memories injected earlier get summarized, rewritten, or dropped entirely. Your carefully stored facts become lossy approximations.
3. Search is opt-in. The built-in memory_search tool only fires when the agent decides to call it. No automatic recall means relevant context sits in the database unsurfaced while the agent confidently hallucinates from scratch.
┌─────────────────────────────────────────┐
│ Built-in Memory Flow │
│ │
│ Agent decides → maybe save to .md │
│ Agent decides → maybe call search │
│ Compaction → maybe destroys memory │
│ │
│ Three "maybes" = unreliable recall │
└─────────────────────────────────────────┘
What interviewers are actually testing: When candidates describe memory systems, interviewers want to hear about the failure modes. Any system where the LLM decides what to persist is fundamentally unreliable — the model doesn't know what it doesn't know. System-level memory extraction (outside the agent loop) is the architectural pattern that fixes this.
The 7-Layer Hybrid Retrieval Pipeline
Pure vector search gets you 70% of the way. It understands that "the machine running the gateway" and "gateway host" are semantically equivalent. But it completely misses exact matches — error codes, IP addresses, function names, config keys. The memory-lancedb-pro plugin solves this with a retrieval pipeline that fuses multiple signals, then aggressively post-processes to eliminate noise.
Here's the full pipeline:
Layer 1: Vector + BM25 Fusion
Every query runs two parallel searches. Vector search (cosine similarity via LanceDB ANN) captures semantic relationships. BM25 full-text search catches exact keyword matches. Results merge via weighted scoring:
fusedScore = vectorWeight × vectorScore + bm25Weight × bm25Score
Default weights: 70% vector, 30% BM25. This mirrors what production RAG systems at scale have converged on — semantic understanding dominates, but keyword precision fills the gaps that embeddings miss.
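In code, the fusion step is just a weighted sum per candidate. Here's a minimal sketch (the `ScoredHit` shape and `fuseScores` name are illustrative, not the plugin's actual API; it assumes both scores are already normalized to 0..1):

```typescript
// Illustrative Layer 1 fusion: merge vector and BM25 results by weighted sum.
interface ScoredHit {
  id: string;
  vectorScore: number; // cosine similarity from the ANN search, 0..1
  bm25Score: number;   // normalized BM25 score, 0..1
}

function fuseScores(
  hits: ScoredHit[],
  vectorWeight = 0.7,
  bm25Weight = 0.3,
): Map<string, number> {
  const fused = new Map<string, number>();
  for (const h of hits) {
    fused.set(h.id, vectorWeight * h.vectorScore + bm25Weight * h.bm25Score);
  }
  return fused;
}
```

A hit found only by vector search still surfaces (at 70% of its score), and vice versa for keyword-only matches at 30%, which is exactly how the pipeline keeps both signal types alive.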
Layer 2: Cross-Encoder Reranking
Bi-encoders (the models that generate your embeddings) encode query and document separately. They're fast but shallow. Cross-encoders process query and document together through a full transformer pass, capturing token-level interactions that bi-encoders miss.
The plugin sends top candidates to a cross-encoder (Jina Reranker v3 by default) and blends the scores:
rerankedScore = 0.6 × crossEncoderScore + 0.4 × fusedScore
This 60/40 blend is deliberate. The cross-encoder is more accurate but can occasionally hallucinate relevance for tangentially related content. Keeping 40% of the original fused score anchors the ranking in actual keyword and semantic matches.
Layer 3: Recency Boost
recencyBoost = exp(-ageDays / halfLife) × weight
Default half-life: 14 days. A memory from yesterday gets nearly the full boost. One from a month ago gets roughly 12%. This matters because in practice, your most recent debugging session is far more relevant than something you stored six months ago.
Layer 4: Importance Weighting
adjustedScore = score × (0.7 + 0.3 × importance)
Importance is a 0-1 float set at storage time. Critical facts (production credentials location, deployment procedures) get importance=1.0 and a 1.0× multiplier. Casual observations get importance=0.0 and a 0.7× multiplier. The 0.7 floor ensures nothing gets completely buried.
Layer 5: Length Normalization
normalizedScore = score / (1 + 0.5 × log₂(length / anchor))
Anchor: 500 characters. Without this, long entries dominate rankings simply because they contain more matching terms. The logarithmic normalization penalizes verbosity without crushing long-form memories that happen to be genuinely relevant.
Layer 6: Time Decay
decayedScore = 0.5 + 0.5 × exp(-ageDays / halfLife)
Separate from recency boost. Half-life: 60 days. Floor: 0.5×. This is the long-term forgetting curve — even important memories gradually fade unless they keep getting recalled. The 0.5 floor means nothing ever fully disappears.
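Layers 3 through 6 compose into a single post-processing pass over each candidate. The sketch below uses the constants from the formulas above, but two details are my assumptions rather than confirmed plugin defaults: the additive recency weight (0.1 here) and skipping length normalization for entries under the 500-character anchor (applying the formula to short entries would inflate their scores):

```typescript
// Illustrative composition of Layers 3-6 (field and function names assumed).
interface MemoryHit {
  score: number;      // fused + reranked score from Layers 1-2
  ageDays: number;    // age of the memory in days
  importance: number; // 0..1, set at storage time
  length: number;     // character count of the stored text
}

function adjustScore(m: MemoryHit): number {
  let s = m.score;
  // Layer 3: recency boost, half-life 14 days (0.1 weight is an assumed default)
  const recencyWeight = 0.1;
  s += Math.exp(-m.ageDays / 14) * recencyWeight;
  // Layer 4: importance weighting with a 0.7x floor
  s *= 0.7 + 0.3 * m.importance;
  // Layer 5: length normalization against the 500-char anchor
  // (assumption: only penalize entries longer than the anchor)
  if (m.length > 500) {
    s /= 1 + 0.5 * Math.log2(m.length / 500);
  }
  // Layer 6: long-term decay, half-life 60 days, floor 0.5x
  s *= 0.5 + 0.5 * Math.exp(-m.ageDays / 60);
  return s;
}
```

Order matters less than you'd think here, since Layers 4 and 6 are multiplicative, but applying the additive recency boost first means even a boosted stale memory still decays.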
Layer 7: Noise Filter + MMR Diversity
Two final passes. First, any result scoring below hardMinScore (default: 0.35) gets discarded entirely. Then Maximal Marginal Relevance removes near-duplicates — if two memories have cosine similarity > 0.85, the lower-scoring one gets dropped. This prevents your top-3 results from being three slightly different versions of the same daily note.
// Simplified MMR diversity filter: a candidate survives only if no
// already-selected (higher-scoring) memory is a near-duplicate of it
const selected: Candidate[] = [];
for (const candidate of sorted) {
  const dominated = selected.some(
    s => cosineSim(s.embedding, candidate.embedding) > 0.85
  );
  if (!dominated) selected.push(candidate);
}
What interviewers are actually testing: The 7-layer design isn't arbitrary complexity — each layer addresses a specific failure mode. Vector search alone misses exact matches (Layer 1 fix). Bi-encoders miss cross-token interactions (Layer 2 fix). Stale results dominate (Layers 3, 6 fix). Long content gets unfair advantage (Layer 5 fix). Near-duplicates waste limited context (Layer 7 fix). If you can articulate why each component exists, you demonstrate systems thinking.
Multi-Scope Isolation: Agent Privacy Without Silos
In multi-agent setups, you don't want Agent A reading Agent B's private context. But you also don't want total isolation — shared knowledge like coding standards or infrastructure details should be accessible to everyone.
The plugin implements five scope types:
| Scope | Purpose | Example |
|---|---|---|
| global | Shared across all agents | Coding standards, team conventions |
| agent:&lt;id&gt; | Private to one agent | Agent-specific configurations |
| project:&lt;id&gt; | Project-level boundaries | Per-repo architectural decisions |
| user:&lt;id&gt; | User-specific context | Personal preferences |
| custom:&lt;name&gt; | Arbitrary grouping | custom:debugging-tips |
Each agent gets access to global plus its own agent:<id> scope by default. You can expand access via configuration:
{
  "scopes": {
    "default": "global",
    "agentAccess": {
      "code-reviewer": ["global", "agent:code-reviewer", "project:frontend"],
      "devops-agent": ["global", "agent:devops-agent", "project:infra"]
    }
  }
}
This is the same access control pattern you'd see in a multi-tenant SaaS database — row-level security through scope tagging rather than physical table separation. Memories are co-located in a single LanceDB table but filtered at query time.
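A minimal sketch of that query-time filtering (the helper names and the exact default behavior are illustrative, inferred from the config shape above):

```typescript
// Resolve which scopes an agent may read, then filter a shared result set.
type AgentAccess = Record<string, string[]>;

function allowedScopes(agentId: string, access: AgentAccess): string[] {
  // Default: every agent sees `global` plus its own private scope;
  // an explicit agentAccess entry replaces and widens that set.
  return access[agentId] ?? ["global", `agent:${agentId}`];
}

// Applied as a WHERE-style predicate before scoring, e.g.:
//   results.filter(m => allowedScopes(agentId, access).includes(m.scope))
```

The filter runs before the expensive reranking layers, so out-of-scope memories never consume cross-encoder calls.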
What interviewers are actually testing: Scope isolation in a shared vector store is a real-world design problem. The naive approach (separate database per agent) doesn't scale and prevents knowledge sharing. The production approach (tag-based filtering on a shared index) trades some query overhead for dramatically better flexibility. This shows up in any system design interview involving multi-tenancy.
Noise Filtering and Adaptive Retrieval
Not every message deserves memory storage. And not every query deserves a database lookup.
What Gets Filtered Out
The auto-capture system rejects:
- Agent refusals: "I don't have information about that"
- Meta-questions: "Do you remember what we discussed?"
- Greetings: "Hi", "Hello", "HEARTBEAT" keepalive signals
- Confirmation noise: "OK", "Got it", "Thanks"
These patterns generate false positives in retrieval — a stored "hello" matches future greetings and wastes one of your precious top-3 context injection slots.
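A capture-side filter along these lines can be sketched with a handful of patterns (these regexes are examples of the categories above, not the plugin's actual list):

```typescript
// Illustrative noise filter for auto-capture: reject messages that would
// pollute retrieval with false positives.
const NOISE_PATTERNS: RegExp[] = [
  /^(hi|hello|hey|heartbeat)[.!]?$/i,          // greetings / keepalive signals
  /^(ok|okay|got it|thanks|thank you)[.!]?$/i, // confirmation noise
  /^i don't have (any )?information/i,         // agent refusals
  /do you remember/i,                          // meta-questions about memory
];

function isNoise(text: string): boolean {
  const t = text.trim();
  return NOISE_PATTERNS.some(p => p.test(t));
}
```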
When Search Gets Skipped
Adaptive retrieval saves latency and prevents irrelevant context injection:
- Short confirmations under 15 characters (English) or 6 characters (CJK)
- Slash commands like /help, /status
- Single emoji responses
Conversely, queries containing memory-related keywords ("remember", "previously", "last time", "之前", "前回") always trigger a full retrieval regardless of length.
// Adaptive retrieval decision
function shouldRetrieve(query: string): boolean {
  if (MEMORY_KEYWORDS.some(k => query.includes(k))) return true;
  if (query.startsWith('/')) return false;
  const threshold = isCJK(query) ? 6 : 15;
  return query.length >= threshold;
}
Try It Yourself: Installation in 10 Minutes
Prerequisites
- OpenClaw installed and running
- Node.js 18+
- An embedding API key (Jina AI offers a free tier — jina.ai)
Step 1: Clone the Plugin
cd your-workspace/
git clone https://github.com/win4r/memory-lancedb-pro.git plugins/memory-lancedb-pro
cd plugins/memory-lancedb-pro
npm install
Step 2: Configure OpenClaw
Update your openclaw.json:
{
  "plugins": {
    "slots": {
      "memory": "memory-lancedb-pro"
    },
    "memory-lancedb-pro": {
      "embedding": {
        "apiKey": "${JINA_API_KEY}",
        "model": "jina-embeddings-v5-text-small",
        "baseURL": "https://api.jina.ai/v1",
        "dimensions": 1024
      },
      "retrieval": {
        "mode": "hybrid",
        "vectorWeight": 0.7,
        "bm25Weight": 0.3,
        "rerank": "cross-encoder",
        "minScore": 0.3
      },
      "autoCapture": true,
      "autoRecall": true
    }
  }
}
Step 3: Set Your API Key
export JINA_API_KEY="jina_xxxxxxxxxxxxx"
Step 4: Restart and Verify
openclaw gateway restart
openclaw plugins list
# Should show: memory-lancedb-pro (active)
Step 5: Test Memory Storage and Recall
In a new session, tell the agent something specific:
> Remember: our production database is at db-prod-east-2.example.com, port 5432
Start another session and ask:
> What's our production database address?
The plugin should auto-recall the stored fact without you explicitly asking it to search.
Troubleshooting
- "memory unavailable" in status: Check openclaw plugins doctor — usually a missing API key
- Slow first search: LanceDB builds FTS indexes on the first query. Subsequent searches are fast
- No auto-recall: Verify autoRecall: true in config and restart the gateway
Choosing Your Embedding Provider
The plugin supports any OpenAI-compatible embedding API. Your choice affects cost, latency, and retrieval quality:
| Provider | Model | Dimensions | Latency | Cost |
|---|---|---|---|---|
| Jina | jina-embeddings-v5-text-small | 1024 | ~50ms | Free tier available |
| OpenAI | text-embedding-3-small | 1536 | ~80ms | $0.02/1M tokens |
| Google | gemini-embedding-001 | 3072 | ~100ms | Free tier available |
| Ollama | nomic-embed-text | Variable | ~20ms | Free (local) |
For most setups, Jina hits the sweet spot — low latency, generous free tier, and 1024 dimensions is plenty for conversational memory. If you're privacy-conscious or working offline, Ollama with a local model eliminates API calls entirely.
The plugin also supports task-aware embedding via Jina's taskQuery and taskPassage parameters, which optimize the embedding differently depending on whether the text is a search query or a stored passage. This asymmetric embedding is a meaningful accuracy improvement that most vector database tutorials skip.
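A sketch of how those asymmetric requests might be built. The task values follow Jina's embedding API convention (retrieval.query vs. retrieval.passage); how the plugin maps its taskQuery/taskPassage options onto that field is my assumption:

```typescript
// Build an embedding request body whose task hint depends on whether the
// text is a search query or a passage being stored (asymmetric embedding).
function embeddingRequestBody(
  text: string,
  kind: "query" | "passage",
  model = "jina-embeddings-v5-text-small",
) {
  return {
    model,
    input: [text],
    task: kind === "query" ? "retrieval.query" : "retrieval.passage",
  };
}

// The body is POSTed to ${baseURL}/embeddings with the API key as a Bearer token.
```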
What interviewers are actually testing: Embedding dimension selection is a real trade-off. Higher dimensions capture more semantic nuance but cost more storage and compute. For conversational memory (short texts, limited vocabulary), 1024 dimensions is overkill already. For code search over millions of files, 3072 dimensions starts making sense. The right answer is always "it depends on your data distribution."
Key Takeaways
OpenClaw's built-in memory works for simple, single-agent setups where occasional forgetfulness is tolerable. But the moment you need reliable recall across sessions, multi-agent privacy, or noise-free context injection, the architectural limitations become blockers. The memory-lancedb-pro plugin demonstrates that replacing a single subsystem — memory — with a purpose-built retrieval pipeline can transform an agent from "occasionally helpful" to "genuinely learns over time." The 7-layer pipeline isn't academic complexity; each layer exists because pure vector search fails in specific, well-documented ways. And the multi-scope isolation pattern is the same row-level security model that powers every multi-tenant SaaS database — proven at scale, applied to a new domain.
The code is open source on GitHub. Read it. The retrieval pipeline alone is worth studying for anyone building RAG systems.


