I was mid-conversation with my AI agent on a Tuesday morning. Revenue sprint week. Asked it a simple question:
"Do you remember the memory fix we did a few days back?"
It tried to recall. And instead of an answer, I got an error.
```
Cannot find module '@lancedb/lancedb'
```
Live. In the conversation. My agent's memory had broken — and we'd just confirmed it the worst possible way.
This is the story of what happened, how we found it, how we fixed it, and the uncomfortable lesson about building AI systems that actually hold up in production.
The Setup
I run a 24/7 AI agent called Gandalf, built on OpenClaw. It's not a demo or a prototype — it's a real operational system managing 14 automated cron jobs, multiple sub-agents, and several SaaS products I'm actively developing under Motu Inc.
Gandalf wakes up every morning before I do. There's a Research Synthesizer cron at 7am that pulls overnight findings, summarizes them, and stores key learnings to long-term memory. There are monitoring jobs, content pipeline jobs, customer-facing automations. The agent is effectively a co-founder running on a schedule.
For long-term semantic memory, Gandalf uses LanceDB — a vector database that runs locally and integrates with OpenClaw's memory_recall / memory_store tools. When something important happens — a decision, a lesson, a fix — it stores a vector embedding. When it needs context later, it queries by semantic similarity and retrieves what's relevant.
This is the difference between an agent that remembers and an agent that forgets. Memory transforms a stateless session into something with continuity.
It had been working great. Until it silently stopped.
The Break
On February 14, OpenClaw pushed an update: v2026.2.14 → v2026.2.15.
Routine stuff. I didn't think much of it.
What I didn't realize was that the update ran npm install in the background to rebuild its dependency tree from the core package.json. This is standard Node.js update behavior. Totally reasonable.
The problem: @lancedb/lancedb was not in OpenClaw's core package.json. It was a manually installed dependency, one I'd added by running npm install @lancedb/lancedb from inside OpenClaw's install directory to extend its capabilities.
When npm install ran from the core manifest, it rebuilt node_modules clean. My manual add-on wasn't on the list. It got wiped.
No error during the update. No warning. No broken startup message. The gateway restarted fine. Everything looked fine.
The dependency was just gone. Quietly. And any call to memory_recall or memory_store would now fail with a module-not-found error that wouldn't surface until something actually tried to use memory.
The Discovery
The failure surfaced at 7:04am.
The Research Synthesizer cron woke up as usual, collected its overnight findings, and tried to store them to long-term memory. It hit Cannot find module '@lancedb/lancedb'. The cron logged the error and exited.
Gandalf noticed — it checks its own session logs — but at that point it wasn't certain whether the failure was transient or something deeper. It flagged the issue internally.
Then I opened Telegram and asked a casual question:
"Do you remember the memory fix we did a few days back?"
Gandalf's response was to call memory_recall with a query about the LanceDB fix. The tool returned an error instead of a result. Right there in the conversation thread. That's when we knew it was real.
What made this particularly bad timing: this was a critical revenue sprint week. I'd just made a deliberate decision to step back and let Gandalf's strategy run. The last thing I needed was its memory infrastructure quietly offline while cron jobs were running and decisions were being made without full context.
The Fix
The fix, once identified, was straightforward:
Step 1: Reinstall the missing dependency
```shell
cd /opt/homebrew/lib/node_modules/openclaw
npm install @lancedb/lancedb
```
This puts @lancedb/lancedb back into OpenClaw's node_modules. Same place it was before the update wiped it.
Step 2: Restart the gateway to reload the extension
```shell
openclaw gateway restart
```
The gateway needs to restart to pick up the newly installed module. Without this step, the running process still has the broken module cache.
Step 3: Verify it works
Call memory_recall with a test query. Confirm you get results back instead of an error. Then call memory_store with something benign to confirm writes work too.
Step 4: Store a memory about the fix
```
Fixed LanceDB memory issue — @lancedb/lancedb gets wiped on OpenClaw updates
because it's not in core package.json. Fix: cd /opt/homebrew/lib/node_modules/openclaw
&& npm install @lancedb/lancedb, then openclaw gateway restart.
```
This last step matters more than it sounds.
The Meta Lesson
Here's the part that stuck with me.
This was the second time this happened. First time was February 15. We fixed it, and Gandalf stored a memory about it — the exact commands, the root cause, the lesson.
Then February 17, another update ran the same pattern. Same wipe. Memory gone.
Including the memory about the previous fix.
So when I asked "do you remember the memory fix?" — Gandalf wasn't being forgetful. The memory literally didn't exist anymore. The fix had been stored in the system that was now broken.
Except: Gandalf also writes daily file-based logs. Plain markdown files in its workspace, one per day, logging sessions, decisions, and events.
February 15's log had the exact commands. Timestamped. In plain text. Sitting on disk, completely unaffected by the node_modules wipe.
The file-based memory was the fallback that recovered the semantic memory. The backup system saved the primary system.
I didn't design this as a redundancy strategy. It emerged from the habit of writing things down. But it held up when the fancier system failed, and that's not an accident — it's the nature of layered resilience.
Practical Takeaways
If you're running AI agents in production — or building toward it — here's what this experience taught me:
1. Know your manual dependencies
If you've extended your platform with a dependency that isn't in the core manifest, document it. Somewhere that survives a wipe. A file. A setup script. A checklist the agent can read even when its memory is offline.
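One way to make that documentation executable: keep the manual add-ons in a plain-text list next to the install, and walk it after every update. This is a sketch under assumptions — the `manual-deps.txt` convention and the directory layout are mine, not OpenClaw features, and the demo runs against a throwaway directory rather than the real install path.

```shell
#!/bin/sh
# Sketch: keep manually-added packages in a plain-text list that survives
# a node_modules wipe, then check the list after an update. Dry-run only:
# it prints what it would reinstall instead of calling npm.
# OPENCLAW_DIR and manual-deps.txt are assumptions; demo uses a temp dir.
OPENCLAW_DIR="${OPENCLAW_DIR:-$(mktemp -d)}"
DEPS_FILE="$OPENCLAW_DIR/manual-deps.txt"

# The list you maintain by hand (seeded here so the demo is self-contained).
mkdir -p "$OPENCLAW_DIR/node_modules/@lancedb/lancedb"
printf '%s\n' '# packages not in core package.json' '@lancedb/lancedb' > "$DEPS_FILE"

missing=0
while IFS= read -r pkg; do
  case "$pkg" in ''|'#'*) continue ;; esac    # skip blank lines and comments
  if [ -d "$OPENCLAW_DIR/node_modules/$pkg" ]; then
    echo "present: $pkg"
  else
    echo "missing: $pkg (would run: npm install $pkg)"
    missing=$((missing + 1))
  fi
done < "$DEPS_FILE"
```

The point is that the list lives on disk, outside node_modules, so the wipe that deletes the package can't delete the record of it.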
2. Build layered memory
Semantic vector memory (like LanceDB) is powerful for fuzzy recall and contextual retrieval. But it's a dependency. It has a failure mode.
File-based memory is boring. It's just markdown on disk. But it's resilient in ways vector databases aren't — no module dependencies, no database processes, no embedding models.
Use both. They cover each other's failure modes.
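The file-based layer doesn't need to be clever. Something like the sketch below is enough — one markdown file per day, timestamped bullet per event. The directory and bullet format here are assumptions about how Gandalf's daily logs are laid out, not a documented convention.

```shell
#!/bin/sh
# Sketch of the boring-but-durable memory layer: append a timestamped note
# to a per-day markdown log. LOG_DIR is an assumption; demo uses a temp dir.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"
note="Fixed LanceDB wipe: npm install @lancedb/lancedb, then gateway restart"

mkdir -p "$LOG_DIR"
log_file="$LOG_DIR/$(date +%F).md"          # one file per day, e.g. 2026-02-17.md
printf -- '- %s %s\n' "$(date +%H:%M)" "$note" >> "$log_file"
echo "logged to $log_file"
```

No module dependencies, no database process: `grep` over these files is the recovery path that worked when the vector store didn't.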
3. Let your cron jobs be your canary
The Research Synthesizer failed at 7:04am. That was the first sign. If I hadn't had structured cron jobs probing memory on a schedule, the failure might have gone undetected until something worse broke downstream.
Cron jobs that exercise your memory system aren't just useful — they're tests that run automatically. Treat them that way.
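A minimal canary along those lines might look like this. `PROBE_CMD` is a stand-in here — in real use it would be whatever exercises your memory path, e.g. `node -e "require('@lancedb/lancedb')"` — and the alert-log location is an assumption.

```shell
#!/bin/sh
# Sketch of a scheduled memory canary: run a probe that exercises the
# memory path, and make the failure loud (logged, non-silent) if it breaks.
# PROBE_CMD defaults to `true` so the demo is self-contained.
PROBE_CMD="${PROBE_CMD:-true}"
ALERT_LOG="${ALERT_LOG:-$(mktemp -d)/alerts.log}"

if sh -c "$PROBE_CMD" >/dev/null 2>&1; then
  status="memory probe: ok"
else
  mkdir -p "$(dirname "$ALERT_LOG")"
  echo "$(date -u +%FT%TZ) memory probe FAILED" >> "$ALERT_LOG"
  status="memory probe: FAILED"
fi
echo "$status"
```

Wired into cron on a short interval, this turns "the 7am job happened to touch memory" into a deliberate test that runs on schedule.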
4. After platform updates, check your add-ons
This is now a standing rule: after any OpenClaw update, manually verify that @lancedb/lancedb is still present. Takes 10 seconds:
```shell
ls /opt/homebrew/lib/node_modules/openclaw/node_modules/@lancedb/
```
If the directory doesn't exist, run the fix. Don't wait for the 7am cron to fail.
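The standing rule, written down as a script so it can run unattended. The real path is the one from the fix; the auto-reinstall branch is commented out as a sketch of what you'd enable once you trust it, and the demo points `OPENCLAW_DIR` at a throwaway directory.

```shell
#!/bin/sh
# Post-update add-on check, as a sketch. Seeds a throwaway directory so
# the demo exercises the "present" branch; point OPENCLAW_DIR at the real
# install (/opt/homebrew/lib/node_modules/openclaw) in actual use.
OPENCLAW_DIR="${OPENCLAW_DIR:-$(mktemp -d)}"
mkdir -p "$OPENCLAW_DIR/node_modules/@lancedb/lancedb"   # demo: dep present

if [ -d "$OPENCLAW_DIR/node_modules/@lancedb/lancedb" ]; then
  result="lancedb: present"
else
  result="lancedb: missing"
  # Real fix, once verified by hand:
  #   (cd "$OPENCLAW_DIR" && npm install @lancedb/lancedb)
  #   openclaw gateway restart
fi
echo "$result"
```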
5. Failures are information
The worst version of this story is one where the memory silently fails and the agent just... makes decisions without context, without realizing it's missing anything. That's the scary failure mode.
The version we got — explicit errors, a cron that logged the failure, a live confirmation in conversation — was actually good. It was noisy enough to catch. Build your systems so failures are loud, not silent.
Where We Are Now
Memory is back online. The Research Synthesizer is running clean. Gandalf has a standing reminder to check LanceDB after every OpenClaw update.
The irony is that the whole incident is now documented in both systems — semantic memory and the daily file log. Two copies. Because we learned the hard way that one isn't enough.
That's the thing about building in public and building in production: the stories that feel embarrassing in the moment are the ones that actually teach you something. An agent that lost its memory and found its way back using its notes is a better story than one that never broke.
Build resilient. Write things down. And check your node_modules after updates.
Gandalf is an AI agent built on OpenClaw and runs 24/7 managing operations for Motu Inc. I write about what I'm building at @Tahseen_Rahman.