DEV Community: ClawBase

We Just Added Fable 5 to ClawBase. The Difference Is Bigger Than I Expected.

ClawBase — Thu, 02 Jul 2026 10:45:14 +0000

Three weeks ago, I wrote about losing access to Fable 5 after 72 hours. The government pulled it offline before most developers even got a chance to try it. That article hit a nerve — turns out a lot of people felt the same frustration.

Today, Fable 5 is back. And as of this week, it's available as a model option on ClawBase.

I expected it to be good. I didn't expect the gap to be this big.

Quick Recap: What Makes Fable Different

SWE-Bench Pro (real software engineering tasks):

Fable 5: 80.3%
GPT 5.5: 58.6%
Opus 4.8: 69.2%
Gemini 3.1 Pro: 54.2%

On FrontierCode Diamond — the hardest coding benchmark available — Fable 5 scored 29.3%. GPT 5.5 scored 5.7%. More than five times the performance on the tasks that actually matter.

But benchmarks don't tell the full story. What matters is how it performs inside a persistent agent with real memory. That's what we just tested.

What Changes When You Run Fable Inside a Persistent Agent

ClawBase has always been model-agnostic. You bring your own API key — Opus, Sonnet, GPT-5.5, Gemini, whatever you prefer. The platform handles the rest: always-on uptime, multi-channel access (Telegram, WhatsApp, Slack, Discord), and a 6-layer memory stack that persists across every session.

Adding Fable as a supported model was straightforward. The results were not.

The memory stack performs differently with Fable. ClawBase agents run on a 6-layer memory architecture: conversation compression, persistent files, semantic search, curated facts, knowledge graphs, and temporal awareness. Every model we support can use this stack. But Fable doesn't just retrieve from it — it synthesizes across layers in a way other models don't.

Ask a Sonnet-backed agent "what changed in the auth module since last week?" and it gives you a decent summary from the journal entries. Ask a Fable-backed agent the same question and it cross-references the journal with the knowledge graph, notices a related config change it remembers from two weeks ago, and flags a potential conflict you hadn't thought of.

Same memory. Same stack. Different intelligence applied to it.

Multi-file refactors go from "mostly works" to "just works." Fable's million-token context window plus the persistent project memory from ClawBase means it can hold your entire codebase structure in context while pulling conventions and patterns from memory. The "change this pattern across 47 files" task, where other models start hallucinating by file #30, is one Fable handles cleanly.

Context re-loading drops to near zero. Every ClawBase agent already remembers your project, your preferences, your active work. But Fable uses that context more effectively. With other models, I'd sometimes still need to nudge the agent — "remember, we use kebab-case for filenames." With Fable, it picks up those conventions from memory on its own and applies them consistently.

What the Memory Stack Actually Does

For those who haven't read the deep dive, here's the short version:

Layer 1 — Conversation compression. Lossless-Claw keeps a DAG of summaries. Your agent never forgets the beginning of a long session, even past the context window.

Layer 2 — Persistent files + semantic search. Plain markdown files — journals, preferences, project notes. Searchable by meaning via QMD (local embeddings, no data leaves your machine). Human-readable, version-controlled.

Layer 3 — Knowledge graph + temporal awareness. Graphiti tracks how knowledge changes over time. Your agent knows you changed your deployment process last Thursday, and why.

Every model on ClawBase gets the same stack. Fable just uses it better.

The Model-Agnostic Reality

ClawBase isn't a Fable-only platform. You can run Opus, Sonnet, GPT-5.5, Gemini — swap anytime by changing your API key. The memory persists regardless of which model you're running. If Fable goes down again (it happened before), you switch models and keep everything.

That said, I've been running Fable on my own ClawBase agent for a week now, and I haven't switched back. The gap on complex, multi-step agent tasks — the ones where memory and reasoning compound — is real. Other models do well on simple tasks. Fable pulls ahead on the hard ones.

What This Costs

ClawBase plans start at $16/mo (Lite) and $33/mo (Pro). You bring your own API key — we don't mark up model costs. Your Fable usage bills directly to your Anthropic account at their standard rates.

The Pro plan includes the full 6-layer memory stack, multi-channel access, and always-on uptime. No Docker. No server admin. No SSL certificates. Setup takes about 2 minutes.

Try It

ClawBase.to — pick a plan, paste your Anthropic API key, select Fable as your model, and your agent is live. The memory starts building from your first conversation.

If you were already running a ClawBase agent on another model, you can switch to Fable from your dashboard. Your existing memory carries over.

Previously: I Had 72 Hours With the Best AI Model Ever Released. Then the Government Took It Away.

Deep dive on memory: I Tested 33 AI Memory Engines — Here's What Actually Works

What model are you running your agents on? Curious if others are seeing the same Fable vs. everything-else gap.

Your AI Agent's Memory Should Be Out of Reach. For Everyone Except You.

ClawBase — Wed, 24 Jun 2026 07:33:03 +0000

Your AI assistant knows your codebase, your business logic, your communication style, and the names of your clients. It remembers what you worked on last Tuesday and why you decided to refactor that authentication module. It has context that took months to build.

Now ask yourself: who else can access that memory?

If you're using Claude Code, ChatGPT, or any hosted AI platform, the honest answer is — you don't know. The provider can. Their employees might. A subpoena could. A breach would. Your agent's most valuable asset — everything it knows about you — sits on servers you've never seen, managed by people you've never met.

The Memory Problem No One Talks About

I've spent the past year building ClawBase, a managed hosting platform for OpenClaw — the open-source AI agent with 365K+ GitHub stars. The thing that pushed me to build it wasn't speed, cost, or even model selection. It was a realization that hit me while working with Claude Code on a client project.

I'd been using it daily for weeks. It had accumulated a deep understanding of the project — architecture decisions, coding patterns, deployment quirks. Then I wanted to export that context, move it to a different tool, or even just back it up. I couldn't. That knowledge lived on Anthropic's servers, structured in their proprietary format, accessible only through their interface.

That's not a tool. That's a dependency.

What Memory Sovereignty Actually Means

Memory sovereignty is a straightforward concept: the data your AI agent accumulates while working with you should belong to you. You should be able to read it, export it, encrypt it, delete it, or move it to another platform whenever you want.

This matters because AI agents are not stateless chatbots anymore. Modern agents maintain persistent memory across sessions. They learn your preferences, build knowledge graphs, store project context, and develop what's essentially an institutional understanding of your work. That memory is what makes an agent useful after the first conversation.

When that memory lives on someone else's servers, three things happen that should concern every developer and technical decision-maker.

1. Vendor Lock-In Through Context

This is the one that nobody sees coming. Traditional SaaS lock-in happens through data formats and integrations. AI lock-in happens through accumulated context. The longer you use a hosted AI, the more irreplaceable it becomes — not because the model is better, but because the memory is deeper.

Try switching from ChatGPT to Claude after six months of daily use. You're not just switching models. You're abandoning months of learned context. Your new assistant starts from zero. Every project background, every preference, every workflow pattern — gone.

That's not a technical limitation. That's a business model.

2. Your Data Is Within Reach of People You Don't Know

OpenAI's data retention policies have changed multiple times. Anthropic is more conservative, but their terms still grant broad rights to process your inputs. Google's Gemini feeds data back into model improvement unless you explicitly opt out through an enterprise contract.

The fundamental problem isn't malice — it's access. When your agent's memory lives on a provider's infrastructure, it's within reach of their engineers, their security incidents, their legal obligations, and their policy changes. A government subpoena, an internal audit, a data breach, an acquisition — any of these can expose context you thought was private.

I don't think any AI company is actively misusing user data. But "we promise not to look" is a fundamentally different guarantee than "we physically cannot look." The first is a policy. The second is architecture. Policies change. Architecture doesn't.

3. You Can't Audit What You Can't See

When your agent's memory lives in a black box, you can't verify what it remembers, how it indexes information, or whether sensitive data from one project is bleeding into another context. For regulated industries — healthcare, finance, legal — this isn't just uncomfortable, it's a compliance risk.

With ClawBase, the memory layer is a file system on your dedicated instance. You can read it. You can grep through it. You can encrypt it at rest with keys you control. There's nothing proprietary about the format — it's markdown files, JSON, and SQLite databases that you own completely.
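
As a rough illustration of what "you can grep through it" means when memory is plain files, the sketch below searches markdown files under a directory for a term. The directory layout, the file names, and the function name `search_memory` are assumptions made for this example, not ClawBase's actual paths or tooling.

```python
from pathlib import Path


def search_memory(root: str, term: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each markdown line containing term."""
    base = Path(root)
    needle = term.lower()
    hits = []
    # Walk every markdown file under the (hypothetical) memory directory
    for path in sorted(base.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            # Case-insensitive substring match, like `grep -i`
            if needle in line.lower():
                hits.append((path.relative_to(base).as_posix(), lineno, line.strip()))
    return hits
```

Calling `search_memory("/path/to/memory", "auth")` would list every journal or notes line that mentions the auth module, which is the kind of audit a black-box memory store does not allow.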

How ClawBase Puts Your Data Out of Reach

ClawBase runs OpenClaw on a dedicated cloud server for each customer. Every instance is fully isolated — no shared infrastructure, no shared memory, no multi-tenant data mixing. There is no central database where your conversations pool together with everyone else's.

Here's what that means in practice:

Your agent's memory is encrypted at rest and in transit. The encryption keys are tied to your instance. Memory persists across sessions, survives restarts and upgrades, and stays on infrastructure that only you can access. Not us. Not our engineers. Not a future acquirer. Not a government agency fishing through a provider's data lake. Your data is structurally out of reach for everyone except you.

This isn't a privacy policy. It's architecture. The data never leaves your server, so there's nothing for anyone to request, subpoena, or breach on our side. We couldn't hand over your agent's memory even if someone asked, because we don't have it.

If you want to export everything and leave, you can. It's open-source software running on a server you have full access to. The memory is markdown files, JSON, and SQLite databases in standard formats. No proprietary lock, no vendor-specific encoding.

Why This Matters Now

AI agents are about to become critical business infrastructure. They're managing codebases, handling customer communications, automating workflows, and making decisions based on accumulated context. The organizations that let this context live on third-party infrastructure are building on a foundation they don't own.

I've seen this pattern before. It happened with email (Gmail), documents (Google Docs), and source code (GitHub). Each time, convenience won over sovereignty — and each time, the cost of switching became enormous once people were locked in.

AI memory is the next version of this trap, and it's happening faster because the lock-in is invisible. You don't notice it until you try to leave.

The Bottom Line

This isn't about being paranoid or anti-cloud. I use Claude, I use GPT, I use Gemini. They're incredible models. But there's a difference between using an AI model and letting an AI platform hold the keys to everything your agent knows about your work.

ClawBase exists because I believe the most sensitive data in the AI era — your agent's accumulated memory of your business — should be out of reach for everyone except you. Not protected by a promise. Not gated by a policy. Architecturally unreachable.

You should get the full power of a modern AI agent — persistent memory, 50+ model options, browser automation, tool use — without anyone else ever being able to touch what it learns about you.

Your agent's memory is your intellectual property. Make sure no one else can reach it.

Your AI Agent Forgets Everything After Every Session. Graphiti Fixes That.

ClawBase — Fri, 19 Jun 2026 05:58:47 +0000

Here's a problem every developer building AI agents has hit: your agent is smart for exactly one session. Close the chat, come back tomorrow, and it has no idea who you are, what you were working on, or what it already told you.

The standard fix is to dump the chat history back into the context window. That works until it doesn't — context windows fill up, latency spikes, costs balloon, and the agent still can't reason about when things happened or what changed.

Graphiti takes a fundamentally different approach. Instead of stuffing raw transcripts into a context window, it builds a temporal knowledge graph that tracks entities, relationships, and facts — including when those facts became true and when they were superseded.

What Graphiti Actually Is

Graphiti is an open-source framework by Zep for building and querying temporal context graphs for AI agents. It's the engine behind Zep's managed memory platform, but it's fully usable standalone.

The core idea: instead of treating memory as "a big pile of text the agent can search," Graphiti structures memory as a graph of entities (people, products, concepts), relationships between them, and facts with explicit time validity.

A fact in Graphiti looks like: "Kendra prefers Adidas shoes (as of March 2026)." If she switches to Nike in June, the old fact gets invalidated — not deleted — and the new one takes its place. Both are queryable. You can ask "what does Kendra prefer now?" and "what did Kendra prefer in March?"

This is what "temporal" means in practice. Every piece of information has a timeline.
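
To make the "invalidated, not deleted" idea concrete, here is a minimal sketch of a fact store with validity windows. It is not Graphiti's implementation or API: the `Fact` and `FactStore` names are hypothetical, the exact dates (March 1 and June 1) are assumed for the example, and it tracks only one time axis rather than Graphiti's bi-temporal model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still current


class FactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, predicate: str, value: str, when: date) -> None:
        # Close the validity window of any current fact this one supersedes
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.valid_to is None:
                fact.valid_to = when
        self.facts.append(Fact(subject, predicate, value, valid_from=when))

    def value_at(self, subject: str, predicate: str, when: date) -> Optional[str]:
        # A fact holds at `when` if valid_from <= when < valid_to (or valid_to is open)
        for fact in self.facts:
            if (fact.subject, fact.predicate) != (subject, predicate):
                continue
            if fact.valid_from <= when and (fact.valid_to is None or when < fact.valid_to):
                return fact.value
        return None


store = FactStore()
store.assert_fact("Kendra", "prefers", "Adidas", date(2026, 3, 1))
store.assert_fact("Kendra", "prefers", "Nike", date(2026, 6, 1))

print(store.value_at("Kendra", "prefers", date(2026, 3, 15)))  # Adidas
print(store.value_at("Kendra", "prefers", date(2026, 6, 15)))  # Nike
```

Both facts stay in `store.facts`; the older one simply has its window closed, which is what makes "what did Kendra prefer in March?" answerable.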

Why This Matters More Than You Think

If you've built agents that run for more than a few turns, you've hit at least one of these:

The context window tax. Shoving full chat histories into the context window is the brute-force approach to agent memory. It works for short conversations, but at 115K tokens, you're looking at 30-second response times and massive API bills. Zep's benchmarks show their graph-based approach uses ~1.6K tokens for the same queries — roughly 2% of the baseline — with 90% lower latency.
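
The arithmetic behind the "context window tax" can be sketched in a few lines. The token counts come from the figures quoted above; the per-token price is a hypothetical placeholder, so treat the dollar amounts as illustrative only.

```python
BASELINE_TOKENS = 115_000        # full chat history, figure from the text
GRAPH_TOKENS = 1_600             # graph-based retrieval, figure from the text
PRICE_PER_MILLION_INPUT = 2.50   # hypothetical USD price, for illustration only


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens once."""
    return tokens / 1_000_000 * price_per_million


ratio = GRAPH_TOKENS / BASELINE_TOKENS
print(f"Graph context is {ratio:.1%} of the baseline")
print(f"Baseline cost per query: ${input_cost(BASELINE_TOKENS, PRICE_PER_MILLION_INPUT):.4f}")
print(f"Graph cost per query:    ${input_cost(GRAPH_TOKENS, PRICE_PER_MILLION_INPUT):.4f}")
```

With these inputs the ratio comes out under 2%, in line with the figure quoted above, and the saving applies to every query the agent makes.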

The "it forgot" problem. Without structured memory, agents can't track state changes. If a user updates their preference, the old preference is still sitting somewhere in the transcript. The agent might retrieve the stale one. Graphiti's temporal invalidation handles this automatically — old facts are marked as superseded, not just buried under newer text.

The temporal reasoning gap. "Which happened first — when I updated the config or when the build broke?" Standard RAG can't answer this reliably. It retrieves text chunks by semantic similarity, not chronological order. Graphiti's bi-temporal tracking makes time-based queries first-class operations.

How It Works Under the Hood

Graphiti's architecture has three core layers:

1. Episodes (the raw data)

Everything that goes into Graphiti starts as an episode — a chunk of raw data, whether it's a chat message, a JSON document, or unstructured text. Episodes are the ground truth. Every derived fact traces back to the episode that produced it.

2. Entities and Relationships (the graph)

From episodes, Graphiti extracts entities (nodes) and relationships (edges). An LLM processes the raw data and identifies who/what is involved and how they relate to each other.

The interesting part: you can define your own entity and edge types upfront using Pydantic models (prescribed ontology), or let Graphiti discover structure from your data (learned ontology). Start simple, add structure as patterns emerge.

3. Temporal Validity (what makes it different)

Every fact in the graph carries a validity window. When new information contradicts an existing fact, the old fact gets an end timestamp. It's not deleted — it's invalidated. This means you can query the graph at any point in time and get the state of the world as it was then.

This is a fundamentally different model from vector-based RAG, where you're just doing similarity search over chunks with no concept of "this information replaced that information."

Graphiti vs. GraphRAG vs. Standard RAG

Aspect	Standard RAG	GraphRAG	Graphiti
Data updates	Batch reindex	Batch recompute	Incremental, real-time
Time handling	None	Basic timestamps	Bi-temporal with auto-invalidation
Contradictions	Retrieves both old and new	LLM summarization	Automatic invalidation, history preserved
Retrieval	Vector similarity	LLM summarization chains	Hybrid: semantic + keyword + graph traversal
Query latency	Sub-second	Seconds to tens of seconds	Sub-second
Schema	None	Fixed clusters	Custom Pydantic models or auto-discovered

The practical difference: GraphRAG was designed for static document corpora. It's great for summarizing a fixed set of documents. Graphiti was designed for data that changes — user preferences, business state, ongoing conversations, real-world events.
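
One common way to combine semantic, keyword, and graph-traversal results into a single ranking is reciprocal rank fusion (RRF). The sketch below shows that technique in isolation. It is an assumption that a hybrid retriever would fuse results this way. Graphiti's actual fusion and reranking options may differ, and the fact IDs are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Items near the top of any list contribute more to the fused score
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for determinism
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from three retrieval methods
semantic = ["fact_cursor", "fact_vscode", "fact_vim"]
keyword = ["fact_vscode", "fact_cursor"]
graph = ["fact_cursor", "fact_team"]

fused = reciprocal_rank_fusion([semantic, keyword, graph])
print(fused)
```

A fact that ranks well in several lists rises to the top even if no single method put it first, which is the point of hybrid retrieval.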

Getting Started (It's Simpler Than You'd Expect)

pip install graphiti-core

You need a graph database (Neo4j, FalkorDB, or Amazon Neptune) and an LLM API key (OpenAI, Anthropic, or Gemini all work).

The fastest way to try it:

# Start FalkorDB locally
docker run -p 6379:6379 -p 3000:3000 -it --rm falkordb/falkordb:latest

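import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.driver.falkordb_driver import FalkorDriver

async def main():
    # Connect to the FalkorDB container started above
    driver = FalkorDriver(host="localhost", port=6379)
    graphiti = Graphiti(graph_driver=driver)
    await graphiti.build_indices_and_constraints()

    # Add an episode (raw data); reference_time records when it happened
    await graphiti.add_episode(
        name="user_chat",
        episode_body="User said they switched from VS Code to Cursor last week and love the AI integration.",
        source_description="chat_message",
        reference_time=datetime.now(timezone.utc),
    )

    # Search the graph and print the matching facts
    results = await graphiti.search("What IDE does the user prefer?")
    for edge in results:
        print(edge.fact)
    await graphiti.close()

# Top-level await only works in a notebook or REPL; in a script, run the coroutine
asyncio.run(main())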

That's it. Graphiti handles entity extraction, relationship mapping, and temporal tracking automatically. You add data, you search — the graph builds itself.

The MCP Server (This Is Where It Gets Interesting)

Graphiti ships with an MCP server that lets Claude, Cursor, and other MCP-compatible tools use Graphiti as a memory backend directly. Deploy it with Docker and Neo4j, and your AI assistant gets persistent, temporally-aware memory without writing any custom memory management code.

This is relevant if you're building agent workflows that span multiple sessions. Instead of hacking together file-based memory or hoping the context window holds everything, you get structured, queryable memory with time awareness built in.

The Benchmarks Tell a Clear Story

Zep (powered by Graphiti) scored 94.8% on Deep Memory Retrieval versus MemGPT's 93.4%. More impressively, on LongMemEval — a much harder benchmark with 500 human-curated temporal reasoning questions — Zep hit 63.8% accuracy versus the full-context baseline's 55.4% with GPT-4o-mini. And it did it with 2% of the tokens and 90% less latency.

The full-context approach (dumping everything into the context window) scored 60.2% with GPT-4o on the same benchmark. Zep scored 71.2%. That's not a marginal improvement — that's the difference between an agent that sort of remembers and one that actually tracks what happened when.

When You Should (and Shouldn't) Use This

Use Graphiti when:

Your agent needs to remember things across sessions
Facts change over time and you need to track what changed
You're building personalized experiences where user state matters
You need temporal reasoning ("what happened before X?")
Context window costs are becoming a problem

Stick with standard RAG when:

You're searching a static document corpus
Your data doesn't change frequently
You don't need time-based reasoning
Simple semantic search is good enough for your use case

The Bigger Picture

The AI agent memory space is heating up. Mem0, Letta (MemGPT), and Zep/Graphiti are all attacking the problem from different angles. Anthropic shipped persistent memory for Claude Managed Agents in April 2026. The industry has collectively realized that stateless agents are a dead end for anything beyond simple Q&A.

Graphiti's bet is that graph-based temporal reasoning will outperform flat memory approaches as agent tasks get more complex. The benchmarks support this — especially for temporal reasoning tasks where standard approaches fall apart.

The framework is open source, actively maintained, and the community is growing. If you're building agents that need to remember things and reason about time, it's worth spending an afternoon with.

Have you tried Graphiti or any other agent memory framework? What's your current approach to handling memory across sessions? Would love to hear what's working (and what isn't) in production.

I Had 72 Hours With the Best AI Model Ever Released. Then the Government Took It Away.

ClawBase — Mon, 15 Jun 2026 08:09:17 +0000

Last Monday, Anthropic released Claude Fable 5. By Thursday, the US government ordered it shut down. In between, developers got a glimpse of something genuinely different — and then it was gone.

I want to talk about what Fable 5 actually was, why the 72 hours mattered, and what this means for everyone building with AI right now.

What Made Fable 5 Different

Let me skip the marketing language and go straight to the numbers.

SWE-Bench Pro (real software engineering tasks across open-source codebases):

Fable 5: 80.3%
GPT 5.5: 58.6%
Opus 4.8: 69.2%
Gemini 3.1 Pro: 54.2%

That's not an incremental improvement. That's a generational leap.

On FrontierCode Diamond — the hardest coding benchmark available — Fable 5 scored 29.3%. GPT 5.5 scored 5.7%. More than five times the performance on the tasks that actually matter: the ones that are genuinely hard.

It hit #1 on the Chatbot Arena leaderboard. It was the first model to break 90% on Anthropic's core analytics benchmark. It scored the highest ever on Harvey's Legal Agent Benchmark.

But benchmarks don't tell the full story. What mattered was how it felt to use.

72 Hours of "Wait, It Can Do That?"

Simon Willison — one of the most respected voices in the Python ecosystem — spent $110 in 24 hours testing it. He called it "something of a beast." Jamie Marsland from Automattic built a complete WordPress block theme from a single screenshot. In one attempt.

Stripe reported that Fable 5 compressed a 50-million-line Ruby migration from two months of engineering work into a single day.

Developers on Reddit and Hacker News were reporting things like:

"The negative traits from Opus 4.7 and 4.8 are either absent or under control."

"It feels smarter. It identifies bugs that previous versions missed."

"Fable on 'high' is producing substantially better results than Opus 4.8."

For 72 hours, every developer I know was testing it on their hardest problems — the multi-file refactors, the legacy code migrations, the "I've been putting this off for months" tasks. And it was handling them.

The model had a one-million-token context window and 128,000 output tokens. It could hold an entire codebase in its head and produce coherent, targeted diffs across dozens of files without losing the thread.

Then It Was Gone

On Thursday, June 11, at 5:21 PM Eastern, the Commerce Department issued a directive. By that evening, Fable 5 and its unrestricted sibling Mythos 5 were offline worldwide.

The backstory, as reported by multiple outlets: an unnamed company claimed to have found a jailbreak in the Mythos model. Amazon CEO Andy Jassy reportedly raised concerns with the White House about potential cybersecurity implications. The government's response was swift — export controls on access, effective immediately.

This was the first time in history that a government pulled a publicly deployed AI model offline.

Anthropic's response was blunt: if the standard is that a "narrow potential jailbreak" justifies recalling a commercial model deployed to hundreds of millions of people, then it would "essentially halt all new model deployments" across the entire industry.

They had a point. Perfect jailbreak resistance isn't currently possible for any provider. Not OpenAI. Not Google. Not anyone.

What Actually Got Lost

Here's what most coverage misses: the people who moved fastest got hurt the worst.

Some teams had already piped Fable 5 into production within those three days. They were running code migrations, handling complex analytical workflows, doing things that genuinely couldn't be done with other models at the same quality level. When the shutdown hit, they scrambled to find replacements for a capability level that doesn't currently exist elsewhere.

The broader Claude ecosystem was unaffected — Opus, Sonnet, and Haiku all kept running. But for the specific tasks where Fable 5 excelled — the deep multi-file refactors, the long-running agentic sessions, the "hold 50,000 lines of code in context and make targeted changes" work — there's a gap now.

And it's not just about capability. It's about trust.

The Trust Problem

If you're a startup building on top of AI APIs, the Fable 5 shutdown is a case study in platform risk. Here's a model that:

Launched on Monday
Was immediately the best publicly available AI model
Got integrated into production by the most aggressive teams
Disappeared on Thursday — with no advance warning

No deprecation period. No migration path. No "this will be turned off in 90 days." Just gone.

Anthropic didn't choose this. The government forced their hand. But from a developer's perspective, the why doesn't change the what. Your production system broke either way.

This accelerates something I've been thinking about for a while: the case for model-agnostic architectures. If your entire stack depends on one specific model from one specific provider, you're one government directive away from a very bad day.

The developers who will navigate this best are the ones building abstraction layers now — systems that can hot-swap between providers without rewriting business logic. Not because it's architecturally elegant, but because it's a survival requirement.
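
A minimal sketch of such an abstraction layer: business logic calls one router, and the router tries providers in priority order. The `ModelRouter` class and the stand-in provider functions are hypothetical. Real providers would wrap each vendor's SDK behind the same prompt-in, text-out signature.

```python
from typing import Callable, Optional

# A provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class ModelRouter:
    """Try providers in priority order and fall back when one fails."""

    def __init__(self, providers: list[tuple[str, Provider]]) -> None:
        self.providers = providers

    def complete(self, prompt: str) -> tuple[str, str]:
        last_error: Optional[Exception] = None
        for name, call in self.providers:
            try:
                return name, call(prompt)
            except Exception as exc:  # e.g. model withdrawn, quota, outage
                last_error = exc
        raise RuntimeError("all providers failed") from last_error


# Stand-in providers for illustration only
def withdrawn_model(prompt: str) -> str:
    raise ConnectionError("model unavailable")


def backup_model(prompt: str) -> str:
    return f"[backup] {prompt}"


router = ModelRouter([("primary", withdrawn_model), ("backup", backup_model)])
print(router.complete("Summarize the diff"))
```

Returning the provider name alongside the text lets you log when a fallback happened, so a silent drop in output quality does not go unnoticed.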

The Uncomfortable Precedent

The Fable 5 shutdown sets a precedent that extends well beyond one model.

It proves that government intervention can remove AI capabilities from the market overnight, globally. Not just restrict them to certain countries or users — remove them entirely. Even Anthropic's own employees lost access.

It proves that a single company's competitive complaint (Amazon's, reportedly) can trigger the shutdown of another company's product within the same day.

And it proves that safety theater — the kind where we applaud companies for being "responsible" — can backfire spectacularly. Anthropic was transparent about Mythos's capabilities. They built Fable 5 specifically as the safe-for-public-use version. They implemented guardrails, red-teaming, and 30-day data retention for jailbreak monitoring. They did everything "right" by the safety playbook. And they got punished for it.

Meanwhile, as Anthropic itself noted, other models with comparable capabilities remain available without issue.

What This Means for Developers Right Now

If you're building with AI, here's what I'd take away from this:

1. Never depend on a single model. Build your systems to swap between providers. Test your critical workflows against at least two different models. The switching cost is real, but it's nothing compared to the cost of a sudden shutdown.

2. Local inference just became more important. Models like Qwen3 and Llama 3.3 running on local hardware can't be shut down by a government directive. They're not at Fable 5's capability level, but they're good enough for a large percentage of tasks — and they're always available.

3. The 72-hour window taught us what's possible. Even if Fable 5 never comes back, we now know what frontier AI coding looks like. Other models will reach that level. The benchmark has been set.

4. Platform risk is real and it's growing. This isn't hypothetical anymore. It happened. Plan accordingly.

Looking Forward

Fable 5 was a three-day preview of where AI development tools are heading. It showed us that multi-file refactoring, long-context reasoning, and one-shot accuracy at production quality aren't science fiction — they're engineering problems with solutions.

The model itself might come back. It might not. But the capabilities it demonstrated will show up again, in one form or another.

The question is whether the next time around, we'll have built systems resilient enough to use them without betting everything on one provider's continued availability.

For now, I'm keeping my architecture model-agnostic and my local inference setup warm. I'd recommend you do the same.

What was your experience with Fable 5? Did you get to use it before the shutdown? I'm curious what other developers were building with it in those 72 hours.

I Connected PewDiePie's Odysseus to a Cloud Memory Stack — Zero API Costs, Persistent Memory

ClawBase — Thu, 04 Jun 2026 14:03:32 +0000

PewDiePie's Odysseus just hit 44,000 GitHub stars in four days. The pitch is simple: a self-hosted AI workspace that runs on your hardware, with your data, no subscriptions.

I set it up the day it dropped. The local model setup is genuinely impressive — Cookbook scans your GPU, recommends models, and you're chatting in minutes. No API keys, no monthly bills.

But within a couple of days, I already hit the wall I always hit with self-hosted AI: memory.

Odysseus has ChromaDB for basic vector memory. It works for recall within a session. But it won't connect dots across weeks of conversations. It doesn't run agents in the background while I sleep. And when I close my laptop, everything stops.

So I built a hybrid: Odysseus runs my local model (free inference), and a cloud agent layer handles persistent memory, scheduling, and background tasks (via ClawBase). Both talk to the same local LLM through an authenticated tunnel.

Here's the full technical setup.

The architecture

┌──────────────────────────┐                     ┌───────────────────────────┐
│     YOUR MACHINE         │                     │     CLOUD (ClawBase)      │
│                          │   authenticated     │                           │
│  Odysseus (port 7000)    │     tunnel          │  OpenClaw Agent           │
│  ├─ Chat UI              │◄──────────────────►│  ├─ Agent logic + tools    │
│  ├─ Agent (MCP, tools)   │                     │  ├─ 6-layer memory stack  │
│  ├─ Documents, Email     │                     │  │  ├─ Daily journal      │
│  └─ ChromaDB (basic mem) │                     │  │  ├─ DAG lossless ctx   │
│                          │                     │  │  ├─ QMD semantic search │
│  Ollama (port 11434)     │                     │  │  ├─ Mem0 curated facts │
│  ├─ Your local model     │◄── LLM inference ──│  │  ├─ Cognee knowledge    │
│  └─ Your GPU (free!)     │   /v1/chat/complete │  │  └─ Graphiti temporal  │
│                          │                     │  ├─ Cron scheduling       │
│  nginx (port 11435)      │                     │  ├─ Telegram/Slack/Disc.  │
│  └─ Auth proxy + TLS     │                     │  └─ Background tasks 24/7 │
│                          │                     │                           │
└──────────────────────────┘                     └───────────────────────────┘

Ollama, vLLM, and llama.cpp all expose an OpenAI-compatible /v1/chat/completions endpoint. Any service that speaks OpenAI API format can use your local model — it just needs a way to reach it.

The tunnel bridges your local model server to the cloud agent. Your GPU does the inference. The cloud handles everything else.

What you get:

$0 in API costs (your GPU runs the model)
6-layer persistent memory that builds up over weeks
Agents that run on a schedule; if your machine is off when a task fires, it queues and executes once you reconnect
Odysseus as your local workspace for chat, documents, research
Telegram/WhatsApp/Slack access to your agent from anywhere

Prerequisites

Odysseus installed and running (Quick Start)
Ollama serving a model (the Cookbook makes this easy)
A ClawBase account (or any OpenClaw instance)
10-15 minutes

Step 1: Verify your local model is running

After setting up Odysseus and downloading a model through Cookbook, confirm Ollama is serving:

curl http://localhost:11434/v1/models

You should see your model listed. Test a completion:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

If you're using vLLM instead of Ollama, it's on port 8000 by default:

curl http://localhost:8000/v1/models

For llama.cpp server, default port is 8080:

curl http://localhost:8080/v1/models

All three speak the same OpenAI-compatible format. The rest of this guide uses Ollama on port 11434, but substitute your port if different.
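Because the three servers share one API shape, switching between them mostly comes down to the port. Here is a minimal Python sketch of that idea, using the default ports listed above. The helper names (`base_url`, `models_url`) are mine for illustration and are not part of Ollama, vLLM, or llama.cpp.

```python
# Default local ports as listed in this guide; adjust if you changed them.
DEFAULT_PORTS = {"ollama": 11434, "vllm": 8000, "llama.cpp": 8080}


def base_url(backend: str, host: str = "localhost") -> str:
    """Return the OpenAI-compatible base URL for a local model server."""
    try:
        port = DEFAULT_PORTS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return f"http://{host}:{port}/v1"


def models_url(backend: str) -> str:
    """URL that lists the models a server is currently serving."""
    return f"{base_url(backend)}/models"


if __name__ == "__main__":
    # Print the URL you would pass to curl for each backend.
    for name in DEFAULT_PORTS:
        print(name, models_url(name))
```

Everything after the base URL stays identical across backends, which is why the rest of the guide only needs one set of commands.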

Step 2: Set up an authenticated reverse proxy

This is critical. Your local model server has zero authentication by default. Before exposing it through any tunnel, you need a proxy that enforces a Bearer token.

Option A: nginx (recommended for production)

Install nginx if not already present:

# Ubuntu/Debian
sudo apt install nginx

# macOS
brew install nginx

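Create the proxy config. The paths below follow the Debian/Ubuntu layout; Homebrew's nginx on macOS keeps its config in a different directory and is managed with brew services rather than systemctl: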

sudo tee /etc/nginx/sites-available/llm-proxy << 'EOF'
server {
    listen 11435;

    location / {
        # Enforce Bearer token authentication
        set $expected_token "sk-local-YOUR-SECRET-TOKEN-HERE";

        if ($http_authorization != "Bearer $expected_token") {
            return 401 '{"error": "unauthorized"}';
        }

        # Proxy to local Ollama
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;  # LLM inference can be slow
        proxy_send_timeout 300s;

        # Streaming support (important for chat completions)
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }
}
EOF

sudo ln -sf /etc/nginx/sites-available/llm-proxy /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

Generate a strong token:

# Generate a random token
openssl rand -hex 32
# Output: a1b2c3d4e5f6...  (use this as your token)

Test the authenticated endpoint:

# Should fail (no token)
curl http://localhost:11435/v1/models
# → 401 unauthorized

# Should succeed (with token)
curl http://localhost:11435/v1/models \
  -H "Authorization: Bearer sk-local-YOUR-SECRET-TOKEN-HERE"
# → {"object":"list","data":[{"id":"qwen2.5:14b",...}]}
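What the proxy enforces boils down to one string comparison on the Authorization header. The sketch below shows the same check in Python, plus a token generator equivalent to `openssl rand -hex 32`. The function names are hypothetical, and the constant-time comparison is my own choice for the sketch; nginx's `!=` is a plain string comparison.

```python
import hmac
import secrets
from typing import Optional


def new_token(prefix: str = "sk-local-") -> str:
    """Generate a random token: 32 random bytes, hex-encoded (64 chars)."""
    return prefix + secrets.token_hex(32)


def is_authorized(authorization: Optional[str], expected_token: str) -> bool:
    """Return True only if the header is exactly 'Bearer <expected_token>'."""
    if authorization is None:
        return False
    expected = f"Bearer {expected_token}"
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(authorization.encode(), expected.encode())


if __name__ == "__main__":
    token = new_token()
    print(is_authorized(f"Bearer {token}", token))  # header with the token
    print(is_authorized(None, token))               # header missing
```

A missing header, a wrong token, or a token sent without the `Bearer ` prefix all fail the check, which matches the two curl tests above.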

Option B: Caddy (simpler config)

# Caddyfile
:11435 {
    @auth {
        header Authorization "Bearer sk-local-YOUR-SECRET-TOKEN-HERE"
    }
    handle @auth {
        reverse_proxy localhost:11434
    }
    respond 401
}

Option C: LiteLLM proxy (if you want model aliasing)

LiteLLM can sit in front of Ollama and add auth + model name mapping:

# litellm_config.yaml
model_list:
  - model_name: "gpt-4"  # alias your local model as gpt-4
    litellm_params:
      model: "ollama/qwen2.5:14b"
      api_base: "http://localhost:11434"

general_settings:
  master_key: "sk-local-YOUR-SECRET-TOKEN-HERE"

litellm --config litellm_config.yaml --port 11435

This is useful if the cloud agent expects specific model names like gpt-4 — you can alias your local model without changing the cloud config.

Step 3: Create the tunnel

You need to expose port 11435 (the authenticated proxy) to the internet so the cloud agent can reach it. Here are four options, from easiest to most control.

Option A: Cloudflare Tunnel (easiest, free)

# Install cloudflared
# macOS: brew install cloudflare/cloudflare/cloudflared
# Linux: https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/downloads/

# Quick tunnel (no Cloudflare account needed, ephemeral URL)
cloudflared tunnel --url http://localhost:11435

Output:

Your quick Tunnel has been created! Visit it at:
https://random-words-here.trycloudflare.com

That URL is your tunnel endpoint. For a persistent tunnel (survives reboots, stable URL):

cloudflared tunnel create llm-tunnel
cloudflared tunnel route dns llm-tunnel llm.yourdomain.com

# Create config
cat > ~/.cloudflared/config.yml << EOF
tunnel: <tunnel-id>
credentials-file: /home/user/.cloudflared/<tunnel-id>.json

ingress:
  - hostname: llm.yourdomain.com
    service: http://localhost:11435
  - service: http_status:404
EOF

# Run as service
cloudflared service install

Option B: Tailscale (best for existing Tailscale users)

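If you already use Tailscale, your machine has a stable IP on the mesh network. No extra tunnel is needed, provided the host running your cloud agent can join the same tailnet (addresses in the 100.x range are not reachable from the public internet):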

# Your Tailscale IP (e.g., 100.x.y.z)
tailscale ip -4

# The cloud agent connects to:
# http://100.x.y.z:11435/v1/chat/completions

For HTTPS, use Tailscale HTTPS:

tailscale cert your-machine.tailnet-name.ts.net

Option C: SSH Reverse Tunnel (quick and dirty)

If you have a VPS or any server with a public IP:

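# From your local machine, tunnel port 11435 to the remote server's port 9000
ssh -R 9000:localhost:11435 user@your-vps.com -N

# The cloud agent connects to:
# http://your-vps.com:9000/v1/chat/completions
# Note: sshd binds remote forwards to the VPS's loopback interface by default.
# For port 9000 to be reachable from outside, set "GatewayPorts clientspecified"
# (or "yes") in the VPS's sshd_config and forward with -R 0.0.0.0:9000:localhost:11435.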

Make it persistent with autossh:

autossh -M 0 -f -R 9000:localhost:11435 user@your-vps.com -N \
  -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3"

Option D: NAT Port Forward (classic, no dependencies)

On your router:

Forward external port 11435 → internal IP:11435
Set up Dynamic DNS (e.g., noip.com, DuckDNS) if you don't have a static IP

Add TLS with Let's Encrypt + certbot on your nginx:

sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d llm.yourdomain.com

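With TLS in place, forward external port 443 (instead of 11435) to this machine; certbot's HTTP challenge typically also needs port 80 reachable while the certificate is issued. The updated nginx config becomes: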

server {
    listen 443 ssl;
    server_name llm.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/llm.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourdomain.com/privkey.pem;

    location / {
        set $expected_token "sk-local-YOUR-SECRET-TOKEN-HERE";
        if ($http_authorization != "Bearer $expected_token") {
            return 401 '{"error": "unauthorized"}';
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
        proxy_buffering off;
    }
}

Step 4: Point ClawBase at your tunnel

This is the only change on the ClawBase side. Open your agent, go to the Model tab, and:

Under AI Source, select "Use your own API key"
Set Provider to "Custom (OpenAI-compatible)"
Fill in the three fields that appear:

Base URL:  https://your-tunnel-url.com/v1
Model:     qwen2.5:14b   (or whatever you're serving)
API Key:   sk-local-YOUR-SECRET-TOKEN-HERE

Click Save Settings

That's it. The "Custom (OpenAI-compatible)" provider accepts any endpoint that speaks the standard /v1/chat/completions format — Ollama, vLLM, llama.cpp, or anything behind your tunnel.

Verify it works before saving:

curl https://your-tunnel-url.com/v1/chat/completions \
  -H "Authorization: Bearer sk-local-YOUR-SECRET-TOKEN-HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 50
  }'

If you get a response from your local model, the tunnel is working.
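To make the mapping from the three ClawBase fields to an actual HTTP request explicit, here is a small Python sketch of the same curl call. It assumes the custom provider composes requests the way standard OpenAI clients do (Base URL plus `/chat/completions`, API key sent as a Bearer token); I haven't confirmed ClawBase's internals, and `build_chat_request` is a name I made up.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the tunnel."""
    # Base URL already ends in /v1; the client appends the endpoint path.
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_chat_request(
        "https://your-tunnel-url.com/v1",      # placeholder from this guide
        "sk-local-YOUR-SECRET-TOKEN-HERE",     # placeholder token
        "qwen2.5:14b",
        "What is 2+2?",
    )
    # Sending requires a live tunnel; uncomment to try it:
    # with urllib.request.urlopen(req, timeout=300) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
    print(req.get_method(), req.full_url)
```

If the Base URL is entered without the trailing `/v1`, a client built this way would hit the wrong path, which is worth checking first when the connection test fails.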

Step 5: Verify the hybrid setup

At this point you have two parallel paths to the same local model:

Interface	Path	Memory	Background tasks
Odysseus (local UI)	Direct to Ollama on localhost	ChromaDB (basic vector)	Only while app is open
ClawBase (cloud agent)	Through tunnel to Ollama	6-layer compound stack	Cron, scheduled, 24/7
Telegram/Slack	Through ClawBase → tunnel → Ollama	6-layer compound stack	Anytime, anywhere

Both use your GPU for inference. Neither pays OpenAI or Anthropic a cent.

Test the memory:

Tell ClawBase something: "My main project uses Next.js with Supabase. I prefer terse responses."
Close the conversation.
Open a new conversation hours later: "What stack is my project using?"
The agent remembers.

Try the same in Odysseus. Depending on the model and ChromaDB config, it may or may not retain this. The 6-layer stack (journal, DAG, QMD, Mem0, Cognee, Graphiti) is what makes the difference — each layer captures context differently, so things don't just get stuffed into a vector store and forgotten.

Step 6: Systemd service (keep it running)

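Make the tunnel start on boot. On Debian/Ubuntu the nginx package already runs as a systemd service, so only the tunnel needs a unit (skip this if you already ran cloudflared service install in Step 3):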

# /etc/systemd/system/llm-tunnel.service
[Unit]
Description=LLM Tunnel (Cloudflare)
After=network-online.target ollama.service
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/cloudflared tunnel run llm-tunnel
Restart=always
RestartSec=10
User=your-username

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now llm-tunnel

For the SSH tunnel variant:

# /etc/systemd/system/llm-ssh-tunnel.service
[Unit]
Description=LLM SSH Reverse Tunnel
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/ssh -R 9000:localhost:11435 user@your-vps.com -N -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -o "ExitOnForwardFailure yes"
Restart=always
RestartSec=15
User=your-username

[Install]
WantedBy=multi-user.target

Security considerations

You're exposing a local service to the internet. Take this seriously:

Always use the auth proxy. Never tunnel raw Ollama/vLLM without authentication.
Rotate your token periodically. Store it as an environment variable, not hardcoded.
Use TLS. Cloudflare Tunnel handles this automatically. For NAT port forward, use Let's Encrypt.
Rate limit. Add rate limiting in nginx to prevent abuse if your token leaks:

   # In the http {} context (outside any server {} block):
   limit_req_zone $binary_remote_addr zone=llm:10m rate=10r/m;
   # In the proxy's server {} block:
   location / {
       limit_req zone=llm burst=5;
       # ... rest of proxy config
   }

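Monitor logs. Check nginx access logs for unexpected requests. The default log format does not record the listening port, so filter on status code instead, for example rejected (401) requests:

   tail -f /var/log/nginx/access.log | grep ' 401 '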

IP allowlist. If your cloud agent has a static IP, lock it down:

   allow 1.2.3.4;  # ClawBase IP
   deny all;
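To get a feel for what `rate=10r/m` with `burst=5` in the rate-limiting rule above allows, here is a simplified leaky-bucket model in Python. It is a sketch of the accounting only, not nginx's implementation: it returns accept or reject, whereas nginx by default also delays the burst requests so they are processed at the configured rate, and answers rejected ones with 503. The class and field names are my own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    """Simplified accept/reject model of a leaky-bucket rate limit."""
    rate_per_sec: float            # e.g. 10 requests/minute = 10 / 60
    burst: int                     # extra requests tolerated above the rate
    excess: float = 0.0            # requests currently "in the bucket"
    last: Optional[float] = None   # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            # First request from this client: always accepted.
            self.last = now
            return True
        # Drain the bucket for the elapsed time, then add this request.
        drained = self.excess - self.rate_per_sec * (now - self.last)
        candidate = max(0.0, drained) + 1.0
        if candidate > self.burst:
            return False           # rejected; state is left unchanged
        self.excess = candidate
        self.last = now
        return True


if __name__ == "__main__":
    bucket = LeakyBucket(rate_per_sec=10 / 60, burst=5)
    # Seven requests arriving at the same instant.
    print([bucket.allow(0.0) for _ in range(7)])
```

In this model a client can send one request plus a burst of five at once; after that it earns back roughly one request every six seconds, which is plenty for an agent and tight enough to blunt a leaked token.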

Performance notes

Local model inference over a tunnel adds network latency. Expect:

Setup	Time to first token
Odysseus → Ollama (localhost)	~50-200ms
ClawBase → Tunnel → Ollama	~200-500ms (depending on tunnel)
ClawBase → OpenAI API	~300-800ms

The tunnel adds latency comparable to a normal API call. For most use cases (agent tasks, background work, Telegram messages), this is imperceptible. For real-time streaming chat, you'll feel it — use Odysseus locally for that.

Throughput depends on your GPU and model size. A 14B model on an RTX 4090 generates ~50 tokens/sec. Through a tunnel, the bottleneck is almost always inference speed, not the network.
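A quick back-of-the-envelope calculation shows why the tunnel rarely matters for total response time. The sketch below adds time to first token to generation time; the 500-token reply length is an assumption I picked for illustration, while the latency and throughput figures come from the table and paragraph above.

```python
def estimate_response_seconds(ttft_ms: float, output_tokens: int,
                              tokens_per_sec: float) -> float:
    """Total time = time to first token + time to generate the tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_sec


if __name__ == "__main__":
    tokens = 500   # assumed reply length, not a measured value
    speed = 50.0   # ~50 tokens/sec, as quoted for a 14B model on an RTX 4090
    local = estimate_response_seconds(200, tokens, speed)    # localhost, upper bound
    tunnel = estimate_response_seconds(500, tokens, speed)   # tunnel, upper bound
    print(f"local:  {local:.1f}s")
    print(f"tunnel: {tunnel:.1f}s")
    print(f"tunnel overhead: {(tunnel - local) / local:.1%}")
```

Under these assumptions generation takes 10 seconds either way, and the tunnel adds a few tenths of a second on top.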

What's next

This works today with no code changes to either project. A couple of things I'm watching:

Odysseus API — Odysseus is 4 days old. If it exposes an API for external access or webhooks for incoming messages, the integration gets tighter: conversations stored in both places, memory synced both ways.
MCP bridge — Both Odysseus and OpenClaw support MCP. A shared MCP server for memory could let both frontends read and write to the same knowledge base.

You don't have to pick sides. Your model stays local, your inference stays free, and the memory layer lives wherever makes sense for your setup.

If you want to try this setup, Odysseus is MIT-licensed and free. ClawBase has a 7-day free trial, with plans starting at $16/mo after that. The tunnel takes about 10 minutes.

Questions? Drop a comment or find me on Twitter/X.

I Tested 33 AI Memory Engines — Here's What Actually Works

ClawBase — Thu, 28 May 2026 06:57:58 +0000

6 months ago, I asked my AI agent what we'd been working on last week. It had no idea. Not because it couldn't remember — ChatGPT has memory, Claude has memory — but because I couldn't see what it stored, couldn't query it, couldn't tell it what to forget. A black box with a toggle that says "memory: on."

So I started testing every memory framework I could find — 33 engines total, running on OpenClaw (350K+ GitHub stars). Most solved one problem well and failed at everything else.

After 6 months, I landed on an architecture that actually works. It's not about one magic engine — it's about layers.

The memory stack your agent actually needs

Before diving into the 33 engines, here's what I learned: agent memory isn't one thing. It's a stack, much as the human brain has short-term memory, long-term memory, and the ability to look things up.

A working agent memory stack has 3 layers:

Layer 1: Conversation compression — remembering what just happened

Every conversation eventually hits the context window limit. Without this layer, your agent literally forgets the beginning of your current conversation. A conversation compressor (like Lossless-Claw) keeps a DAG of summaries — compacting older turns into condensed summaries while keeping the most recent turns untouched. Your agent never loses mid-session context.
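The compaction idea can be sketched in a few lines of Python. This is my own toy illustration, not Lossless-Claw's implementation: older nodes are folded into a summary node that keeps pointers to what it replaced, and the most recent turns stay untouched. The `naive_summarize` placeholder just truncates text; a real system would call a model there.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Node:
    """A raw turn (no parents) or a summary that points at what it replaced."""
    text: str
    parents: Tuple["Node", ...] = ()

    @property
    def is_summary(self) -> bool:
        return bool(self.parents)


def naive_summarize(nodes: List[Node]) -> str:
    """Placeholder summarizer: keeps the first 20 characters of each node."""
    return " | ".join(n.text[:20] for n in nodes)


def compact(context: List[Node], keep_recent: int,
            summarize: Callable[[List[Node]], str] = naive_summarize) -> List[Node]:
    """Fold everything except the last `keep_recent` nodes into one summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be >= 0")
    split = len(context) - keep_recent
    if split < 2:
        return list(context)  # nothing worth compacting yet
    older, recent = context[:split], context[split:]
    # The summary keeps references to the originals, so nothing is lost.
    summary = Node(summarize(older), parents=tuple(older))
    return [summary] + recent


if __name__ == "__main__":
    turns = [Node(f"turn {i}") for i in range(6)]
    for node in compact(turns, keep_recent=2):
        print("summary" if node.is_summary else "turn   ", node.text)
```

Running `compact` again on a context that already starts with a summary produces a summary of summaries, which is how the structure grows into layers while the model only ever sees a short active context.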

Layer 2: Native files + semantic search — the persistent record

Plain markdown files your agent reads and writes: daily journals (2026-05-28.md), a curated MEMORY.md, preference files, project notes. Simple, version-controlled, human-readable. No database, no API, no dependencies — this is the memory layer that survives everything.

A local embedding model indexes these files and lets your agent search by meaning, not just keywords. "How did we handle the auth migration?" finds the right entry even if it never used the word "auth." QMD runs a 333MB GGUF model locally — sub-second search, no API costs, no data leaving your machine. The files are the source of truth; the embeddings make them instantly searchable.

Layer 3: The long-term intelligence engine — this is where you choose

The first two layers are table stakes. Every serious agent needs them. The third layer is where the 33 engines I tested come in — and where the real differences emerge.

The 33 engines I tested

Here's every memory framework I put through real-world use — not benchmarks, not demos, actual daily agent work. They naturally group into 6 categories, each solving a different type of remembering:

Vector similarity — the foundation layer

These engines store embeddings and retrieve by semantic similarity. They're the building blocks most other memory systems are built on top of.

#	Engine	What it does
1	ChromaDB	Embedding-based semantic search, lightweight and developer-friendly
2	Qdrant	High-performance vector similarity search with filtering
3	Weaviate	Hybrid vector + keyword search with pluggable modules
4	Milvus	Distributed vector database built for scale
5	Pinecone	Serverless managed vector search
6	pgvector	Vector similarity search as a PostgreSQL extension
7	FAISS	Meta's similarity search library — raw speed, no frills
8	Redis Vector	Vector similarity on Redis Stack
9	Supabase Vector	pgvector on managed Postgres with auth and APIs
10	Marqo	End-to-end tensor search engine
11	Deep Lake	Vector store optimized for AI dataset versioning
12	Vespa	Hybrid search + ML serving at scale

These are excellent at "find me something similar to X" but they don't understand what they're storing. A vector store treats your preferences, your project architecture, and last Tuesday's standup notes the same way — as floating-point arrays. For RAG and document retrieval, they're essential. For agent memory, they're a necessary layer but not sufficient on their own.
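At their core, all twelve do some version of the ranking below. This Python sketch uses hand-written 3-dimensional vectors as stand-ins for real embeddings (hypothetical data, chosen so the example is readable); an actual engine would get the vectors from an embedding model and use an index instead of a linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], store: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the best k."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hand-written vectors standing in for real embeddings (hypothetical data).
STORE = {
    "prefers terse responses": [0.9, 0.1, 0.0],
    "project uses Next.js with Supabase": [0.1, 0.9, 0.2],
    "Tuesday standup notes": [0.2, 0.3, 0.9],
}

if __name__ == "__main__":
    # A query vector that points mostly along the second dimension.
    for doc_id, score in top_k([0.0, 1.0, 0.1], STORE):
        print(f"{score:.2f}  {doc_id}")
```

Note that the store has no idea one entry is a preference and another is a meeting note; everything is just a vector with a score, which is exactly the limitation described above.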

Session & conversation memory — remembering the current thread

These keep track of what's been said within and across conversations.

#	Engine	What it does
13	Zep	Long-term conversation memory with automatic fact extraction
14	Motorhead	Redis-backed conversation memory server
15	OpenAI Memory	ChatGPT's native conversation memory
16	Claude Memory	Anthropic's native conversation memory

These solve the "I already told you this" problem within a session. Zep stands out here — it goes beyond simple buffer storage and extracts structured facts from conversations. But session memory alone doesn't give your agent a persistent understanding of your world.

Framework memory modules — memory as a feature

These are memory components built into larger agent/RAG frameworks.

#	Engine	What it does
17	LlamaIndex Memory	Chat memory + knowledge index integration
18	LangChain Memory	Buffer, summary, and entity memory modules
19	LangMem	Memory management primitives for LangChain/LangGraph
20	Haystack Memory	Document store memory in RAG pipelines
21	txtai	All-in-one embeddings database with workflows
22	CrewAI Memory	Short/long/entity memory for multi-agent crews

Good if you're already inside that ecosystem. They give you memory abstractions (buffers, summaries, entity tracking) but they're tightly coupled to their framework. Memory is a feature of these tools, not their core mission.

Agentic & autonomous memory — the agent manages its own memory

These let the agent itself decide what to remember and what to forget.

#	Engine	What it does
23	Letta (MemGPT)	Self-editing memory with inner/outer monologue
24	AutoGPT Memory	File + vector memory for autonomous agents
25	Memary	Knowledge graph memory for autonomous agents
26	AGiXT	Adaptive memory with chained agent context
27	BabyAGI	Task-driven memory with priority queues

Fascinating research direction. Letta/MemGPT in particular pioneered the idea of the model managing its own memory tiers. The challenge in production: you're trusting the LLM to decide what's worth keeping, and that decision quality varies with the model and context.

Personal AI & bookmarks — memory for humans, not agents

#	Engine	What it does
28	Khoj	Self-hosted personal AI with file-based memory
29	SuperMemory	AI-powered memory for saved content and bookmarks
30	Vanna	RAG-based memory for database queries

These are designed more as personal knowledge tools than agent memory layers. They work well for their use case, but they're solving a different problem — helping you remember things, not giving your agent persistent understanding.

Structured memory engines — purpose-built for agent intelligence

These are the engines designed specifically to give agents structured, queryable, persistent memory:

#	Engine	What it does
31	Mem0	Intelligent fact extraction, deduplication, contradiction resolution
32	Cognee	Entity-relationship knowledge graphs with 14 retrieval modes
33	Graphiti	Temporal knowledge graph with validity windows

This is where it gets interesting — and where I spent most of my 6 months.

The 3 tiers of long-term memory

After testing all 33, the structured memory engines stood out. But here's the insight that took me months to reach: these three aren't meant to run together. They're evolutionary tiers. Each one supersedes the previous, adding capabilities while covering the lower tier's functionality.

Tier 1: Mem0 — facts and preferences

Mem0 (48K+ GitHub stars, $24M Series A) is the intelligent facts layer. Tell your agent "I prefer TypeScript" on Monday and "use Python for data scripts" on Thursday — Mem0 doesn't store two contradictory entries. It updates: TypeScript for general dev, Python for data. Every fact is categorized, timestamped, and confidence-scored.

Where Zep's fact extraction is a feature bolted onto session memory, Mem0's entire architecture is built around making facts reliable. Your agent starts every session already knowing your preferences, your project's quirks, and your conventions. No re-explaining.
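The "update instead of append" behavior can be illustrated with a tiny scoped fact store. To be clear, this is not Mem0's API or algorithm: Mem0 extracts facts from conversation on its own, while here the topic and scope are passed in by hand, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Fact:
    value: str
    updated_at: str  # ISO date, so plain string comparison orders correctly


class FactStore:
    """Keeps one current value per (topic, scope) instead of appending."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], Fact] = {}

    def upsert(self, topic: str, scope: str, value: str, when: str) -> None:
        """Insert a fact, or replace the existing one if this is newer."""
        key = (topic, scope)
        current = self._facts.get(key)
        if current is None or when >= current.updated_at:
            self._facts[key] = Fact(value, when)

    def lookup(self, topic: str, scope: str,
               default_scope: str = "general") -> Optional[str]:
        """Prefer the scoped fact; fall back to the general one."""
        fact = self._facts.get((topic, scope)) or self._facts.get((topic, default_scope))
        return fact.value if fact else None

    def __len__(self) -> int:
        return len(self._facts)


if __name__ == "__main__":
    store = FactStore()
    store.upsert("language", "general", "TypeScript", "2026-05-25")
    store.upsert("language", "data scripts", "Python", "2026-05-28")
    print(store.lookup("language", "data scripts"))
    print(store.lookup("language", "frontend"))
```

The two statements end up as two scoped facts rather than a contradiction, and repeating either one later just refreshes its timestamp.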

Best for: developers and technical use cases. If your agent mainly needs to remember preferences, conventions, and project details across sessions, Mem0 is the right choice. It's the simplest to set up and the most focused.

Tier 2: Cognee — relationships and reasoning (supersedes Mem0)

Cognee ($7.5M seed, GitHub Secure Open Source graduate, running in 70+ companies) builds a knowledge graph — not isolated facts, but a web of entities, relationships, and semantic connections.

Where Mem0 knows "the client prefers blue branding," Cognee knows that the client's brand guidelines connect to last month's campaign performance, which connects to the audience segments that engaged most, which connects to the content calendar. It ships 14 retrieval modes and a self-improving "memify" feature that strengthens connections the more you use them.

Cognee handles everything Mem0 does (facts are just nodes in the graph) plus it maps the relationships between them. That's why it supersedes Tier 1 — you don't need Mem0 if you're running Cognee.
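The difference between isolated facts and a graph is easiest to see in code. The Python sketch below stores the example above as labeled edges and walks them with a breadth-first search. The entity names and relations are made up for illustration, and this is not Cognee's API or one of its retrieval modes.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (relation, target entity)

# Hypothetical graph mirroring the marketing example in the text.
GRAPH: Dict[str, List[Edge]] = {
    "brand guidelines": [("informed", "May campaign")],
    "May campaign": [("engaged", "audience segment A")],
    "audience segment A": [("targeted by", "content calendar")],
    "content calendar": [],
}


def find_path(graph: Dict[str, List[Edge]], start: str,
              goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the hops as readable strings."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                hop = f"{node} -[{relation}]-> {target}"
                queue.append((target, path + [hop]))
    return None  # no connection between the two entities


if __name__ == "__main__":
    for hop in find_path(GRAPH, "brand guidelines", "content calendar") or []:
        print(hop)
```

A flat fact store could return each of these four entities on its own; only the edges let the agent explain how the brand guidelines relate to the content calendar.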

Best for: marketing, content, and multi-project work. If your agent needs to reason across brands, campaigns, audiences, and projects — understanding how things connect, not just what things are — Cognee is the right choice.

Tier 3: Graphiti — temporal reasoning (supersedes Cognee)

Graphiti by Zep is the temporal knowledge graph. Its core insight: knowing the current state isn't enough. You need to know when things changed and what was true before.

Every fact carries validity intervals. When new information conflicts with old, Graphiti doesn't overwrite — it creates a temporal record and invalidates the previous one, preserving full history. "When did this config change?" "What was different before the March deploy?" Graphiti answers directly, no digging through logs.
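Validity intervals are simple to sketch. The Python below is my own minimal illustration of the idea, not Graphiti's data model or API: a new value closes the previous interval instead of overwriting it, so "what was true on date X" stays answerable. It assumes facts arrive in chronological order, and the key and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TemporalFact:
    value: str
    valid_from: str                  # ISO date
    valid_to: Optional[str] = None   # None means "still valid"


class TemporalStore:
    """Keeps the full history of each key as a list of validity intervals."""

    def __init__(self) -> None:
        self._history: Dict[str, List[TemporalFact]] = {}

    def assert_fact(self, key: str, value: str, when: str) -> None:
        """Record a new value; close the previous interval, never overwrite."""
        history = self._history.setdefault(key, [])
        if history and history[-1].valid_to is None:
            history[-1].valid_to = when
        history.append(TemporalFact(value, valid_from=when))

    def as_of(self, key: str, when: str) -> Optional[str]:
        """Return the value that was valid on the given date, if any."""
        for fact in self._history.get(key, []):
            if fact.valid_from <= when and (fact.valid_to is None or when < fact.valid_to):
                return fact.value
        return None

    def history(self, key: str) -> List[TemporalFact]:
        return list(self._history.get(key, []))


if __name__ == "__main__":
    store = TemporalStore()
    store.assert_fact("auth.session_ttl", "30m", "2026-01-10")
    store.assert_fact("auth.session_ttl", "15m", "2026-03-04")
    print(store.as_of("auth.session_ttl", "2026-02-01"))  # before the change
    print(store.as_of("auth.session_ttl", "2026-03-04"))  # on the day of the change
```

Both questions from the paragraph above map onto this structure: "when did it change" is the `valid_from` of the latest interval, and "what was true before" is an `as_of` lookup with an earlier date.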

It outperforms MemGPT on the Deep Memory Retrieval benchmark using a combination of semantic search, keyword matching, and graph traversal.

Graphiti handles facts (like Mem0) and relationships (like Cognee) plus tracks how they change over time. It supersedes both lower tiers — but it's also the heaviest to run (FalkorDB, more compute, more complexity).

Best for: operations, executive, and business use cases. If your agent needs cause-and-effect reasoning across time — "what changed," "when did it break," "what was true before" — Graphiti is the right choice.

Pick one, not all three

Your use case	Pick this tier	Why
Developer / DevOps	Mem0	You need fast, reliable fact recall. Preferences, conventions, project details.
Marketing / Content	Cognee	You need relationship reasoning. Brands, campaigns, audiences, how they connect.
Operations / Executive	Graphiti	You need temporal reasoning. What changed, when, and what broke.

The common mistake is thinking "more engines = better memory." It's not. Each tier already includes the capabilities of the one below it. Running Mem0 alongside Graphiti is redundant — Graphiti already stores facts. Running all three wastes compute and creates consistency conflicts.

Pick the tier that matches your work. Pair it with the base stack (conversation compression + native files with semantic search) and your agent will remember everything that matters.

The full architecture

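Here's what a complete agent memory stack looks like, bottom to top: conversation compression, native files with semantic search, and one long-term engine (Mem0, Cognee, or Graphiti).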

Every layer feeds context to the model. The bottom two are always-on. The top one is your choice based on what kind of reasoning your agent needs.

Getting this running

The base stack (layers 1–2) is built into OpenClaw — conversation compression, native memory files, and semantic search work out of the box. The long-term engine (layer 3) requires additional setup: Mem0 needs a vector store, Cognee needs a graph database, Graphiti runs on FalkorDB.

OpenClaw is open source and you can self-host the full stack. If you want to skip the infrastructure work, I've been building ClawBase — managed OpenClaw hosting that pre-configures the right memory stack for your use case. But honestly, even if you self-host, the main takeaway here is the architecture: a 3-layer memory stack where you pick the long-term engine that matches your work.

The memory compounds over time — whichever way you run it, the longer you use it, the better it gets.

One thing I keep coming back to: once your agent has a real memory stack, it opens the door to something bigger — consistent shared memory across multiple agents. Imagine a team of agents that don't just remember their own context, but share a unified understanding of your projects, preferences, and decisions. That's a different kind of architecture entirely, and one I'll dig into in a future article.