Files Are the Source of Truth
Forget embeddings stored in some opaque vector database you'll never inspect. OpenClaw takes a radically transparent approach: Markdown files in your workspace are the memory. The model "remembers" precisely what gets written to disk. Nothing more.
The architecture splits into two layers. Daily logs live at memory/YYYY-MM-DD.md — append-only notes that capture running context, decisions made, and operational details from each session. These get loaded automatically when you reconnect (today's and yesterday's files, specifically). The second layer is MEMORY.md: curated, durable facts. Preferences. Architectural decisions. The stuff that shouldn't decay.
This design has a brutal honesty to it. If the agent "forgets" something, you can open the file and see exactly why — either it never wrote the memory, or the search failed to surface it. No magical retrieval failures hidden behind API abstractions.
# memory/2026-02-22.md
- User prefers bun over npm; always suggest bun commands
- Discovered auth bug in JWTVerifier.validate() line 142
- Production deploys require VPN connection first
# MEMORY.md
## Workspace Conventions
- Test files: *.test.ts (not *.spec.ts)
- Never auto-commit without explicit approval
- Database credentials in ~/.secrets/db.env
What interviewers are actually testing: Can you articulate why filesystem-backed state provides better debuggability than distributed storage? The tradeoff is queryability — you lose native SQL queries but gain `grep`.
The memory layer loads contextually too. MEMORY.md only surfaces in private sessions. Group contexts strip it to prevent leaking personal preferences into shared channels. This scope-aware loading happens at session bootstrap, before the model sees anything.
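The bootstrap behavior described above can be sketched in a few lines. This is a hypothetical helper, not OpenClaw's actual API; only the file layout (MEMORY.md plus dated logs under memory/) comes from the docs:

```python
from datetime import date, timedelta

def bootstrap_memory_files(today: date, is_private_session: bool) -> list[str]:
    """Illustrative sketch: files loaded at session start.

    MEMORY.md is included only in private sessions; today's and
    yesterday's daily logs are always loaded.
    """
    files = []
    if is_private_session:
        files.append("MEMORY.md")  # curated facts stay out of group contexts
    files.append(f"memory/{today:%Y-%m-%d}.md")
    files.append(f"memory/{today - timedelta(days=1):%Y-%m-%d}.md")
    return files
```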
Hybrid Search: BM25 Meets Vector Similarity
Semantic search alone fails spectacularly on code. Ask for "that bug with the auth token" and vector similarity might surface something about OAuth flows instead of the specific JWTVerifier incident you meant. Pure keyword search fails the other direction — querying "Mac Studio gateway host" won't match "machine running gateway" unless the exact tokens appear.
OpenClaw runs both retrieval signals in parallel and merges them. The formula normalizes each score to 1.0, then weights them:
finalScore = (vectorWeight × vectorScore) + (textWeight × bm25Score)
Default configuration sets vector weight at 0.7, BM25 at 0.3. In practice, this means semantic understanding dominates, but exact matches (error strings, function names, UUIDs) still punch through when they appear.
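A minimal sketch of that merge, assuming both retrievers score the same candidate list and that normalization is max-scaling (the function names here are illustrative, not OpenClaw's API):

```python
def normalize(scores: list[float]) -> list[float]:
    """Scale raw scores into [0, 1] by dividing by the maximum."""
    top = max(scores, default=0.0)
    return [s / top if top > 0 else 0.0 for s in scores]

def hybrid_scores(vector_scores: list[float], bm25_scores: list[float],
                  vector_weight: float = 0.7, text_weight: float = 0.3) -> list[float]:
    """finalScore = vectorWeight * vectorScore + textWeight * bm25Score."""
    v, b = normalize(vector_scores), normalize(bm25_scores)
    return [vector_weight * vs + text_weight * bs for vs, bs in zip(v, b)]
```

With the 0.7/0.3 defaults, a document that tops the BM25 list but ranks mid-pack semantically can still beat a weak semantic match — which is how exact tokens "punch through."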
Here's where it gets interesting. After the initial ranking, OpenClaw applies Maximal Marginal Relevance re-ranking to reduce redundancy:
finalScore = λ × relevance − (1−λ) × max_similarity_to_selected
With lambda at 0.7, the system balances relevance against diversity. Three near-identical snippets about the same bug won't dominate your context window — you'll get the most relevant one plus related-but-distinct memories.
The practical effect: searches feel coherent rather than repetitive. You get one answer about the database migration, not five slightly different recollections of the same event.
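The greedy MMR loop is short enough to sketch directly. This assumes a pairwise similarity function from your embedding space; it is a generic MMR implementation, not OpenClaw's code:

```python
def mmr_rerank(candidates, sim, k, lam=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    candidates: list of (item_id, relevance) pairs, relevance in [0, 1]
    sim(a, b):  similarity between two items, in [0, 1]
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            # Penalty is the max similarity to anything already picked.
            penalty = max((sim(c[0], s[0]) for s in selected), default=0.0)
            return lam * c[1] - (1 - lam) * penalty
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note how a near-duplicate of an already-selected item gets penalized out, even if its raw relevance beats a more distinct candidate.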
What interviewers are actually testing: Can you explain MMR without hand-waving? The core insight is that relevance alone creates echo chambers in retrieval. You need a diversity penalty.
Temporal Decay: Recent Memories Win
A memory from three months ago shouldn't rank equally with one from yesterday. OpenClaw applies exponential decay to older memories:
decayedScore = score × e^(-λ × ageInDays)
where λ = ln(2) / halfLifeDays (≈ 0.023 for the default 30-day half-life). Numbers that actually mean something:
| Age | Score Multiplier |
|---|---|
| Today | 100% |
| 7 days | ~85% |
| 30 days | 50% |
| 90 days | 12.5% |
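The table falls straight out of the formula. As a sketch (standard exponential decay with the 30-day default half-life):

```python
import math

def decayed_score(score: float, age_in_days: float,
                  half_life_days: float = 30.0) -> float:
    """Apply exponential decay: score * e^(-lambda * age),
    where lambda = ln(2) / halfLifeDays."""
    lam = math.log(2) / half_life_days
    return score * math.exp(-lam * age_in_days)
```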
But not everything decays. MEMORY.md and non-dated memory files get exempted — they're treated as evergreen. Your preference for tabs over spaces shouldn't fade because you set it three months ago.
The decay calculation happens at query time, not at index time. This matters because you'd otherwise need to re-index constantly. Instead, the system stores raw timestamps and applies decay during scoring. Subtle, but it keeps the indexer simple.
I've seen this bite people in production when they expected old memories to persist at full strength. The docs won't tell you this explicitly, but if you want something truly permanent, it belongs in MEMORY.md, not in a dated log file. The dated logs are inherently ephemeral by design.
// Configuration for temporal decay
{
"memorySearch": {
"query": {
"hybrid": {
"temporalDecay": {
"enabled": true,
"halfLifeDays": 30
}
}
}
}
}
The Gateway and Lane Queues
Memory retrieval doesn't happen in isolation. It sits inside OpenClaw's broader orchestration — and understanding that architecture explains why memory queries never race with active tool execution.
Everything flows through a single daemon called the Gateway. All session state lives there. UI clients query the Gateway; they don't read session files directly. This centralization sounds like a bottleneck, but it enables something subtle: deterministic execution order.
The Lane Queue enforces serial execution per session. One task at a time. One message processed fully before the next begins. Parallelism only happens across different sessions or for operations explicitly marked as idempotent.
Why does this matter for memory? Because memory searches and memory writes both happen inside the agent loop. If you could have concurrent runs within a session, you'd get race conditions — a memory write from turn N could interleave with a memory read from turn N+1, producing inconsistent state. The Lane Queue eliminates this class of bugs by construction.
Message arrives → Gateway assigns to session lane →
Queue ensures serial execution → Agent loop runs →
Context loaded (including memory search) → Model inference →
Tool execution → Memory persistence → Response streamed
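The serial-per-session guarantee reduces to a per-session lock. A toy asyncio sketch (illustrative only — OpenClaw itself runs on Node, and a real lane would also need the idempotency escape hatch mentioned above):

```python
import asyncio

class LaneQueue:
    """Serialize tasks within a session; allow parallelism across sessions."""
    def __init__(self):
        self._locks: dict[str, asyncio.Lock] = {}

    async def run(self, session_id: str, task):
        lock = self._locks.setdefault(session_id, asyncio.Lock())
        async with lock:         # one task at a time per session lane
            return await task()  # different sessions proceed concurrently
```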
The tradeoff is throughput. A single session can't process multiple user messages simultaneously. But for an agent with memory, consistency beats concurrency. You don't want yesterday's corrections overwritten by a stale parallel execution.
What interviewers are actually testing: Race conditions in agent systems aren't edge cases. They're the default failure mode when you accept concurrent input without explicit ordering. Serial execution is the unsexy-but-correct answer.
Memory Flush Before Compaction
Context windows aren't infinite. When you approach the limit, OpenClaw triggers auto-compaction — summarizing earlier turns to free space. But here's the problem: any memories the model was holding in working context (but hadn't persisted) would vanish.
OpenClaw's solution is a pre-compaction memory flush. Before compaction fires, the system injects a silent turn:
{
"compaction": {
"memoryFlush": {
"enabled": true,
"softThresholdTokens": 4000,
"systemPrompt": "Session nearing compaction. Store durable memories now.",
"prompt": "Write any lasting notes to memory/YYYY-MM-DD.md; reply with NO_REPLY if nothing to store."
}
}
}
This gives the model a chance to commit anything worth keeping. The soft threshold triggers when you're within 4000 tokens of compaction. One flush per cycle — it won't spam you.
The practical effect: sessions that run for hours don't lose context silently. You get a reliable commit point. But it requires the model to actually write — if it decides nothing is worth storing, nothing persists. The system can't force good memory hygiene; it can only provide the hook.
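The trigger condition is simple enough to state in a few lines. This is a hypothetical sketch mirroring the softThresholdTokens setting above, not OpenClaw's internals:

```python
def should_flush(used_tokens: int, context_limit: int,
                 soft_threshold: int = 4000,
                 flushed_this_cycle: bool = False) -> bool:
    """Fire the pre-compaction memory flush at most once per cycle,
    once remaining context headroom drops to the soft threshold."""
    return not flushed_this_cycle and (context_limit - used_tokens) <= soft_threshold
```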
I've debugged sessions where users complained about lost context. Nine times out of ten, the memory flush fired correctly, but the model responded with NO_REPLY because it judged the recent context as transient. The fix is usually better system prompts that define what "worth storing" means for your use case.
Try It Yourself
Enough theory. Here's how to actually see OpenClaw's memory system in action.
Prerequisites
- Node.js 20+ (OpenClaw uses modern ES modules)
- An OpenAI API key (for embeddings) or a local GGUF model
- ~10 minutes of setup time
Step 1: Install OpenClaw
npm install -g @openclaw/cli
openclaw init my-agent
cd my-agent
This creates a workspace with the default memory structure:
my-agent/
├── MEMORY.md # Long-term curated facts
├── memory/ # Daily logs go here
├── .openclaw/
│ └── config.json # Memory search settings
└── SOUL.md # Agent personality
Step 2: Configure Memory Search
Edit .openclaw/config.json to enable hybrid search with temporal decay:
{
"memorySearch": {
"provider": "openai",
"model": "text-embedding-3-small",
"query": {
"hybrid": {
"enabled": true,
"vectorWeight": 0.7,
"textWeight": 0.3,
"temporalDecay": {
"enabled": true,
"halfLifeDays": 30
}
}
}
}
}
Set your API key:
export OPENAI_API_KEY="sk-..."
Step 3: Write Some Memories
Start a session and tell the agent something worth remembering:
openclaw chat
You: Remember that I prefer TypeScript over JavaScript, and always use strict mode.
Agent: Got it. I've noted your preference for TypeScript with strict mode.
Check that it actually wrote to disk:
cat memory/$(date +%Y-%m-%d).md
You should see:
- User prefers TypeScript over JavaScript
- Always use strict mode in TypeScript configs
Step 4: Test Memory Retrieval
Start a new session (simulating the next day):
openclaw chat --session new
You: What language should I use for this new project?
Agent: Based on your preferences, I'd recommend TypeScript with strict mode enabled...
The agent retrieved your preference from the daily log. Verify by checking the debug output:
openclaw chat --debug
Look for [memory_search] entries showing which files were queried and their relevance scores.
Step 5: Test Temporal Decay
Create an old memory file to see decay in action:
# Create a memory from 60 days ago (GNU date syntax; on macOS/BSD use: date -v-60d +%Y-%m-%d)
echo "- Old preference: use Webpack for bundling" > memory/$(date -d "60 days ago" +%Y-%m-%d).md
# Create a recent memory
echo "- New preference: use Vite for bundling" > memory/$(date +%Y-%m-%d).md
Now search for bundling preferences:
openclaw memory search "bundling tool preference"
Expected output shows the recent Vite preference scoring higher due to temporal decay:
Results:
1. [0.92] memory/2026-02-22.md:1 - "New preference: use Vite for bundling"
2. [0.23] memory/2025-12-24.md:1 - "Old preference: use Webpack for bundling"
The 60-day-old memory has passed two half-lives, so its score is multiplied by 0.5² = 25%. Even with comparable base relevance, it ranks well below the recent entry.
Troubleshooting
Memory search returns nothing:
- Check that `.openclaw/config.json` has valid embedding provider settings
- Verify `OPENAI_API_KEY` is set (or a local model path exists)
- Run `openclaw memory reindex` to rebuild the index
Embeddings fail with 401:
- Your API key is invalid or expired
- Try `openclaw config set memorySearch.provider local` to use local embeddings instead
Daily logs not loading:
- Filenames must match `YYYY-MM-DD.md` exactly
- Check timezone: OpenClaw uses the system timezone for "today"
What Actually Matters
OpenClaw's memory isn't intelligent. It's plumbing — well-designed plumbing that stays out of your way until you need to debug it. The filesystem-backed approach trades sophistication for transparency. You can cat MEMORY.md and see exactly what your agent "knows." Hybrid search balances semantic understanding with keyword precision. Temporal decay keeps recent context prominent without manual curation. And the Lane Queue ensures none of this races with itself.
The real insight isn't any single component. It's that persistent memory for agents requires coordinating retrieval, persistence, and context management as a unified system. Bolt-on memory layers fail because they don't account for the agent loop's execution model. OpenClaw's architecture assumes memory is load-bearing infrastructure, not an afterthought. That's what makes it work at 3am when your agent needs to remember why it's not supposed to touch the auth directory.
Key Takeaways
- OpenClaw stores memory as plain Markdown files — transparent, debuggable, and `grep`-able
- Hybrid search (BM25 + vector) handles both semantic queries and exact token matches
- Temporal decay with 30-day half-life keeps recent memories prominent; evergreen files exempt
- Lane Queues enforce serial execution to prevent memory race conditions
- Pre-compaction memory flush prevents context loss during long sessions
👉 Want more AI engineering deep dives? Follow the full OpenClaw Deep Dive series on Upskill.
🚀 Preparing for FAANG interviews? Upskill AI helps IC4-IC6 engineers ace system design and ML interviews.