<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ClawBase</title>
    <description>The latest articles on DEV Community by ClawBase (@clawbase).</description>
    <link>https://dev.to/clawbase</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3954750%2F123dad7c-0713-4467-8f29-d931e1434213.jpeg</url>
      <title>DEV Community: ClawBase</title>
      <link>https://dev.to/clawbase</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clawbase"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Forgets Everything After Every Session. Graphiti Fixes That.</title>
      <dc:creator>ClawBase</dc:creator>
      <pubDate>Fri, 19 Jun 2026 05:58:47 +0000</pubDate>
      <link>https://dev.to/clawbase/your-ai-agent-forgets-everything-after-every-session-graphiti-fixes-that-3163</link>
      <guid>https://dev.to/clawbase/your-ai-agent-forgets-everything-after-every-session-graphiti-fixes-that-3163</guid>
      <description>&lt;p&gt;Here's a problem every developer building AI agents has hit: your agent is smart for exactly one session. Close the chat, come back tomorrow, and it has no idea who you are, what you were working on, or what it already told you.&lt;/p&gt;

&lt;p&gt;The standard fix is to dump the chat history back into the context window. That works until it doesn't — context windows fill up, latency spikes, costs balloon, and the agent still can't reason about &lt;em&gt;when&lt;/em&gt; things happened or &lt;em&gt;what changed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Graphiti takes a fundamentally different approach. Instead of stuffing raw transcripts into a context window, it builds a temporal knowledge graph that tracks entities, relationships, and facts — including when those facts became true and when they were superseded.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Graphiti Actually Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt; is an open-source framework by &lt;a href="https://www.getzep.com" rel="noopener noreferrer"&gt;Zep&lt;/a&gt; for building and querying temporal context graphs for AI agents. It's the engine behind Zep's managed memory platform, but it's fully usable standalone.&lt;/p&gt;

&lt;p&gt;The core idea: instead of treating memory as "a big pile of text the agent can search," Graphiti structures memory as a graph of entities (people, products, concepts), relationships between them, and facts with explicit time validity.&lt;/p&gt;

&lt;p&gt;A fact in Graphiti looks like: &lt;strong&gt;"Kendra prefers Adidas shoes (as of March 2026)."&lt;/strong&gt; If she switches to Nike in June, the old fact gets invalidated — not deleted — and the new one takes its place. Both are queryable. You can ask "what does Kendra prefer now?" and "what did Kendra prefer in March?"&lt;/p&gt;

&lt;p&gt;This is what "temporal" means in practice. Every piece of information has a timeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;If you've built agents that run for more than a few turns, you've hit at least one of these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The context window tax.&lt;/strong&gt; Shoving full chat histories into the context window is the brute-force approach to agent memory. It works for short conversations, but at 115K tokens, you're looking at 30-second response times and massive API bills. Zep's benchmarks show their graph-based approach uses ~1.6K tokens for the same queries (under 2% of the baseline) with 90% lower latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "it forgot" problem.&lt;/strong&gt; Without structured memory, agents can't track state changes. If a user updates their preference, the old preference is still sitting somewhere in the transcript. The agent might retrieve the stale one. Graphiti's temporal invalidation handles this automatically — old facts are marked as superseded, not just buried under newer text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The temporal reasoning gap.&lt;/strong&gt; "Which happened first: when I updated the config or when the build broke?" Standard RAG can't answer this reliably. It retrieves text chunks by semantic similarity, not chronological order. Graphiti's bi-temporal tracking, which records both when an event occurred and when it was ingested, makes time-based queries first-class operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works Under the Hood
&lt;/h2&gt;

&lt;p&gt;Graphiti's architecture has three core layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Episodes (the raw data)
&lt;/h3&gt;

&lt;p&gt;Everything that goes into Graphiti starts as an episode — a chunk of raw data, whether it's a chat message, a JSON document, or unstructured text. Episodes are the ground truth. Every derived fact traces back to the episode that produced it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Entities and Relationships (the graph)
&lt;/h3&gt;

&lt;p&gt;From episodes, Graphiti extracts entities (nodes) and relationships (edges). An LLM processes the raw data and identifies who/what is involved and how they relate to each other.&lt;/p&gt;

&lt;p&gt;The interesting part: you can define your own entity and edge types upfront using Pydantic models (prescribed ontology), or let Graphiti discover structure from your data (learned ontology). Start simple, add structure as patterns emerge.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Temporal Validity (what makes it different)
&lt;/h3&gt;

&lt;p&gt;Every fact in the graph carries a validity window. When new information contradicts an existing fact, the old fact gets an end timestamp. It's not deleted — it's invalidated. This means you can query the graph at any point in time and get the state of the world as it was then.&lt;/p&gt;

&lt;p&gt;This is a fundamentally different model from vector-based RAG, where you're just doing similarity search over chunks with no concept of "this information replaced that information."&lt;/p&gt;
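
&lt;p&gt;To make the invalidate-don't-delete idea concrete, here is a minimal sketch in plain Python. It is not Graphiti's API or schema: the &lt;code&gt;Fact&lt;/code&gt; and &lt;code&gt;TemporalFacts&lt;/code&gt; names are made up for illustration, there is no LLM extraction, and it assumes facts arrive in chronological order.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    """One edge in a toy temporal graph (hypothetical shape, not Graphiti's schema)."""
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # when the fact became true
    invalid_at: Optional[datetime] = None  # set when a newer fact supersedes it


class TemporalFacts:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, predicate: str, obj: str, when: datetime) -> None:
        # Close the validity window of any still-open fact this one contradicts.
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.invalid_at is None:
                fact.invalid_at = when
        self.facts.append(Fact(subject, predicate, obj, valid_at=when))

    def as_of(self, subject: str, predicate: str, when: datetime) -> Optional[str]:
        # Return the value whose validity window contains `when`.
        for fact in self.facts:
            started = when >= fact.valid_at
            still_open = fact.invalid_at is None or fact.invalid_at > when
            if (fact.subject, fact.predicate) == (subject, predicate) and started and still_open:
                return fact.obj
        return None


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


store = TemporalFacts()
store.assert_fact("Kendra", "prefers", "Adidas", utc(2026, 3, 1))
store.assert_fact("Kendra", "prefers", "Nike", utc(2026, 6, 1))

print(store.as_of("Kendra", "prefers", utc(2026, 4, 15)))  # Adidas
print(store.as_of("Kendra", "prefers", utc(2026, 7, 1)))   # Nike
print(len(store.facts))                                     # 2, nothing was deleted
```

&lt;p&gt;Both answers stay available because the older fact only gains an end timestamp. A real graph would also have to handle out-of-order arrivals and decide which facts actually contradict each other.&lt;/p&gt;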

&lt;h2&gt;
  
  
  Graphiti vs. GraphRAG vs. Standard RAG
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Standard RAG&lt;/th&gt;
&lt;th&gt;GraphRAG&lt;/th&gt;
&lt;th&gt;Graphiti&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Batch reindex&lt;/td&gt;
&lt;td&gt;Batch recompute&lt;/td&gt;
&lt;td&gt;Incremental, real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Basic timestamps&lt;/td&gt;
&lt;td&gt;Bi-temporal with auto-invalidation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Contradictions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieves both old and new&lt;/td&gt;
&lt;td&gt;LLM summarization&lt;/td&gt;
&lt;td&gt;Automatic invalidation, history preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vector similarity&lt;/td&gt;
&lt;td&gt;LLM summarization chains&lt;/td&gt;
&lt;td&gt;Hybrid: semantic + keyword + graph traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-second&lt;/td&gt;
&lt;td&gt;Seconds to tens of seconds&lt;/td&gt;
&lt;td&gt;Sub-second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Fixed clusters&lt;/td&gt;
&lt;td&gt;Custom Pydantic models or auto-discovered&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The practical difference: GraphRAG was designed for static document corpora. It's great for summarizing a fixed set of documents. Graphiti was designed for data that changes — user preferences, business state, ongoing conversations, real-world events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started (It's Simpler Than You'd Expect)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;graphiti-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need a graph database (Neo4j, FalkorDB, or Amazon Neptune) and an LLM API key (OpenAI, Anthropic, or Gemini all work).&lt;/p&gt;

&lt;p&gt;The fastest way to try it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start FalkorDB locally&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 6379:6379 &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; falkordb/falkordb:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;graphiti_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Graphiti&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;graphiti_core.driver.falkordb_driver&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FalkorDriver&lt;/span&gt;

&lt;span class="c1"&gt;# The await calls below need an async context (an async function or a notebook)
&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FalkorDriver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graphiti&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Graphiti&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph_driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graphiti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build_indices_and_constraints&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add an episode (raw data)
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graphiti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_episode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;episode_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User said they switched from VS Code to Cursor last week and love the AI integration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reference_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# when the episode occurred
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Search the graph
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graphiti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What IDE does the user prefer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Graphiti handles entity extraction, relationship mapping, and temporal tracking automatically. You add data, you search — the graph builds itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MCP Server (This Is Where It Gets Interesting)
&lt;/h2&gt;

&lt;p&gt;Graphiti ships with an &lt;a href="https://github.com/getzep/graphiti/tree/main/mcp_server" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; that lets Claude, Cursor, and other MCP-compatible tools use Graphiti as a memory backend directly. Deploy it with Docker and Neo4j, and your AI assistant gets persistent, temporally aware memory without writing any custom memory management code.&lt;/p&gt;

&lt;p&gt;This is relevant if you're building agent workflows that span multiple sessions. Instead of hacking together file-based memory or hoping the context window holds everything, you get structured, queryable memory with time awareness built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmarks Tell a Clear Story
&lt;/h2&gt;

&lt;p&gt;Zep (powered by Graphiti) scored &lt;strong&gt;94.8%&lt;/strong&gt; on Deep Memory Retrieval versus MemGPT's 93.4%. More impressively, on LongMemEval, a much harder benchmark whose 500 human-curated questions cover temporal reasoning, knowledge updates, and multi-session recall, Zep hit &lt;strong&gt;63.8%&lt;/strong&gt; accuracy versus the full-context baseline's 55.4% with GPT-4o-mini. And it did it with under 2% of the tokens and 90% less latency.&lt;/p&gt;

&lt;p&gt;The full-context approach (dumping everything into the context window) scored 60.2% with GPT-4o on the same benchmark. Zep scored 71.2%. That's not a marginal improvement — that's the difference between an agent that sort of remembers and one that actually tracks what happened when.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Should (and Shouldn't) Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Graphiti when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your agent needs to remember things across sessions&lt;/li&gt;
&lt;li&gt;Facts change over time and you need to track what changed&lt;/li&gt;
&lt;li&gt;You're building personalized experiences where user state matters&lt;/li&gt;
&lt;li&gt;You need temporal reasoning ("what happened before X?")&lt;/li&gt;
&lt;li&gt;Context window costs are becoming a problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stick with standard RAG when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're searching a static document corpus&lt;/li&gt;
&lt;li&gt;Your data doesn't change frequently&lt;/li&gt;
&lt;li&gt;You don't need time-based reasoning&lt;/li&gt;
&lt;li&gt;Simple semantic search is good enough for your use case&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The AI agent memory space is heating up. Mem0, Letta (MemGPT), and Zep/Graphiti are all attacking the problem from different angles. Anthropic shipped persistent memory for Claude Managed Agents in April 2026. The industry has collectively realized that stateless agents are a dead end for anything beyond simple Q&amp;amp;A.&lt;/p&gt;

&lt;p&gt;Graphiti's bet is that graph-based temporal reasoning will outperform flat memory approaches as agent tasks get more complex. The benchmarks support this — especially for temporal reasoning tasks where standard approaches fall apart.&lt;/p&gt;

&lt;p&gt;The framework is open source, actively maintained, and the community is growing. If you're building agents that need to remember things and reason about time, it's worth spending an afternoon with.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you tried Graphiti or any other agent memory framework? What's your current approach to handling memory across sessions? Would love to hear what's working (and what isn't) in production.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Had 72 Hours With the Best AI Model Ever Released. Then the Government Took It Away.</title>
      <dc:creator>ClawBase</dc:creator>
      <pubDate>Mon, 15 Jun 2026 08:09:17 +0000</pubDate>
      <link>https://dev.to/clawbase/i-had-72-hours-with-the-best-ai-model-ever-released-then-the-government-took-it-away-4gda</link>
      <guid>https://dev.to/clawbase/i-had-72-hours-with-the-best-ai-model-ever-released-then-the-government-took-it-away-4gda</guid>
      <description>&lt;p&gt;Last Monday, Anthropic released Claude Fable 5. By Thursday, the US government ordered it shut down. In between, developers got a glimpse of something genuinely different — and then it was gone.&lt;/p&gt;

&lt;p&gt;I want to talk about what Fable 5 actually was, why the 72 hours mattered, and what this means for everyone building with AI right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Made Fable 5 Different
&lt;/h2&gt;

&lt;p&gt;Let me skip the marketing language and go straight to the numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-Bench Pro&lt;/strong&gt; (real software engineering tasks across open-source codebases):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fable 5: &lt;strong&gt;80.3%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT 5.5: 58.6%&lt;/li&gt;
&lt;li&gt;Opus 4.8: 69.2%&lt;/li&gt;
&lt;li&gt;Gemini 3.1 Pro: 54.2%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not an incremental improvement. That's a generational leap.&lt;/p&gt;

&lt;p&gt;On &lt;strong&gt;FrontierCode Diamond&lt;/strong&gt; — the hardest coding benchmark available — Fable 5 scored 29.3%. GPT 5.5 scored 5.7%. More than five times the performance on the tasks that actually matter: the ones that are genuinely hard.&lt;/p&gt;

&lt;p&gt;It hit #1 on the Chatbot Arena leaderboard. It was the first model to break 90% on Anthropic's core analytics benchmark. It scored the highest ever on Harvey's Legal Agent Benchmark.&lt;/p&gt;

&lt;p&gt;But benchmarks don't tell the full story. What mattered was how it &lt;em&gt;felt&lt;/em&gt; to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  72 Hours of "Wait, It Can Do That?"
&lt;/h2&gt;

&lt;p&gt;Simon Willison — one of the most respected voices in the Python ecosystem — spent $110 in 24 hours testing it. He called it "something of a &lt;em&gt;beast&lt;/em&gt;." Jamie Marsland from Automattic built a complete WordPress block theme from a single screenshot. In one attempt.&lt;/p&gt;

&lt;p&gt;Stripe reported that Fable 5 compressed a 50-million-line Ruby migration from two months of engineering work into a single day.&lt;/p&gt;

&lt;p&gt;Developers on Reddit and Hacker News were reporting things like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The negative traits from Opus 4.7 and 4.8 are either absent or under control."&lt;/p&gt;

&lt;p&gt;"It feels smarter. It identifies bugs that previous versions missed."&lt;/p&gt;

&lt;p&gt;"Fable on 'high' is producing substantially better results than Opus 4.8."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For 72 hours, every developer I know was testing it on their hardest problems — the multi-file refactors, the legacy code migrations, the "I've been putting this off for months" tasks. And it was handling them.&lt;/p&gt;

&lt;p&gt;The model had a one-million-token context window and a 128,000-token output limit. It could hold an entire codebase in its head and produce coherent, targeted diffs across dozens of files without losing the thread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then It Was Gone
&lt;/h2&gt;

&lt;p&gt;On Thursday, June 11, at 5:21 PM Eastern, the Commerce Department issued a directive. By that evening, Fable 5 and its unrestricted sibling Mythos 5 were offline worldwide.&lt;/p&gt;

&lt;p&gt;The backstory, as reported by multiple outlets: an unnamed company claimed to have found a jailbreak in the Mythos model. Amazon CEO Andy Jassy reportedly raised concerns with the White House about potential cybersecurity implications. The government's response was swift — export controls on access, effective immediately.&lt;/p&gt;

&lt;p&gt;This was the first time in history that a government pulled a publicly deployed AI model offline.&lt;/p&gt;

&lt;p&gt;Anthropic's response was blunt: if the standard is that a "narrow potential jailbreak" justifies recalling a commercial model deployed to hundreds of millions of people, then it would "essentially halt all new model deployments" across the entire industry.&lt;/p&gt;

&lt;p&gt;They had a point. Perfect jailbreak resistance isn't currently possible for &lt;em&gt;any&lt;/em&gt; provider. Not OpenAI. Not Google. Not anyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Got Lost
&lt;/h2&gt;

&lt;p&gt;Here's what most coverage misses: the people who moved fastest got hurt the worst.&lt;/p&gt;

&lt;p&gt;Some teams had already piped Fable 5 into production within those three days. They were running code migrations, handling complex analytical workflows, doing things that genuinely couldn't be done with other models at the same quality level. When the shutdown hit, they scrambled to find replacements for a capability level that doesn't currently exist elsewhere.&lt;/p&gt;

&lt;p&gt;The broader Claude ecosystem was unaffected — Opus, Sonnet, and Haiku all kept running. But for the specific tasks where Fable 5 excelled — the deep multi-file refactors, the long-running agentic sessions, the "hold 50,000 lines of code in context and make targeted changes" work — there's a gap now.&lt;/p&gt;

&lt;p&gt;And it's not just about capability. It's about trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Problem
&lt;/h2&gt;

&lt;p&gt;If you're a startup building on top of AI APIs, the Fable 5 shutdown is a case study in platform risk. Here's a model that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Launched on Monday&lt;/li&gt;
&lt;li&gt;Was immediately the best publicly available AI model&lt;/li&gt;
&lt;li&gt;Got integrated into production by the most aggressive teams&lt;/li&gt;
&lt;li&gt;Disappeared on Thursday — with no advance warning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No deprecation period. No migration path. No "this will be turned off in 90 days." Just gone.&lt;/p&gt;

&lt;p&gt;Anthropic didn't choose this. The government forced their hand. But from a developer's perspective, the &lt;em&gt;why&lt;/em&gt; doesn't change the &lt;em&gt;what&lt;/em&gt;. Your production system broke either way.&lt;/p&gt;

&lt;p&gt;This accelerates something I've been thinking about for a while: the case for model-agnostic architectures. If your entire stack depends on one specific model from one specific provider, you're one government directive away from a very bad day.&lt;/p&gt;

&lt;p&gt;The developers who will navigate this best are the ones building abstraction layers now — systems that can hot-swap between providers without rewriting business logic. Not because it's architecturally elegant, but because it's a survival requirement.&lt;/p&gt;
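
&lt;p&gt;One way to picture such an abstraction layer is an ordered list of interchangeable providers with automatic fallback. The sketch below is only an illustration under simple assumptions: every name in it is hypothetical, and the two provider functions are stand-ins for real adapters around a vendor SDK or a local server.&lt;/p&gt;

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real adapters would wrap each vendor's SDK or a local server; these are stand-ins.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    pass


def complete_with_fallback(prompt: str, providers: Sequence[tuple[str, Provider]]) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, text) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real adapter would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def hosted_frontier_model(prompt: str) -> str:
    # Hypothetical hosted model that has been switched off.
    raise ConnectionError("model unavailable")


def local_model(prompt: str) -> str:
    # Hypothetical local fallback.
    return f"[local] {prompt}"


used, text = complete_with_fallback(
    "Summarize this diff",
    [("hosted", hosted_frontier_model), ("local", local_model)],
)
print(used, text)  # local [local] Summarize this diff
```

&lt;p&gt;Business logic only ever calls &lt;code&gt;complete_with_fallback&lt;/code&gt;, so swapping or reordering providers is a configuration change rather than a rewrite.&lt;/p&gt;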

&lt;h2&gt;
  
  
  The Uncomfortable Precedent
&lt;/h2&gt;

&lt;p&gt;The Fable 5 shutdown sets a precedent that extends well beyond one model.&lt;/p&gt;

&lt;p&gt;It proves that government intervention can remove AI capabilities from the market overnight, globally. Not just restrict them to certain countries or users — remove them entirely. Even Anthropic's own employees lost access.&lt;/p&gt;

&lt;p&gt;It proves that a single company's competitive complaint (Amazon's, reportedly) can trigger the shutdown of another company's product within the same day.&lt;/p&gt;

&lt;p&gt;And it proves that safety theater — the kind where we applaud companies for being "responsible" — can backfire spectacularly. Anthropic was transparent about Mythos's capabilities. They built Fable 5 specifically as the safe-for-public-use version. They implemented guardrails, red-teaming, and 30-day data retention for jailbreak monitoring. They did everything "right" by the safety playbook. And they got punished for it.&lt;/p&gt;

&lt;p&gt;Meanwhile, as Anthropic itself noted, other models with comparable capabilities remain available without issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Developers Right Now
&lt;/h2&gt;

&lt;p&gt;If you're building with AI, here's what I'd take away from this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Never depend on a single model.&lt;/strong&gt; Build your systems to swap between providers. Test your critical workflows against at least two different models. The switching cost is real, but it's nothing compared to the cost of a sudden shutdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Local inference just became more important.&lt;/strong&gt; Models like Qwen3 and Llama 3.3 running on local hardware can't be shut down by a government directive. They're not at Fable 5's capability level, but they're good enough for a large percentage of tasks — and they're &lt;em&gt;always available&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The 72-hour window taught us what's possible.&lt;/strong&gt; Even if Fable 5 never comes back, we now know what frontier AI coding looks like. Other models will reach that level. The benchmark has been set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Platform risk is real and it's growing.&lt;/strong&gt; This isn't hypothetical anymore. It happened. Plan accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Forward
&lt;/h2&gt;

&lt;p&gt;Fable 5 was a three-day preview of where AI development tools are heading. It showed us that multi-file refactoring, long-context reasoning, and one-shot accuracy at production quality aren't science fiction — they're engineering problems with solutions.&lt;/p&gt;

&lt;p&gt;The model itself might come back. It might not. But the capabilities it demonstrated will show up again, in one form or another.&lt;/p&gt;

&lt;p&gt;The question is whether the next time around, we'll have built systems resilient enough to use them without betting everything on one provider's continued availability.&lt;/p&gt;

&lt;p&gt;For now, I'm keeping my architecture model-agnostic and my local inference setup warm. I'd recommend you do the same.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What was your experience with Fable 5? Did you get to use it before the shutdown? I'm curious what other developers were building with it in those 72 hours.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>claudeai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Connected PewDiePie's Odysseus to a Cloud Memory Stack — Zero API Costs, Persistent Memory</title>
      <dc:creator>ClawBase</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:03:32 +0000</pubDate>
      <link>https://dev.to/clawbase/i-connected-pewdiepies-odysseus-to-a-cloud-memory-stack-zero-api-costs-persistent-memory-4ke5</link>
      <guid>https://dev.to/clawbase/i-connected-pewdiepies-odysseus-to-a-cloud-memory-stack-zero-api-costs-persistent-memory-4ke5</guid>
      <description>&lt;p&gt;PewDiePie's &lt;a href="https://github.com/pewdiepie-archdaemon/odysseus" rel="noopener noreferrer"&gt;Odysseus&lt;/a&gt; just hit 44,000 GitHub stars in four days. The pitch is simple: a self-hosted AI workspace that runs on your hardware, with your data, no subscriptions.&lt;/p&gt;

&lt;p&gt;I set it up the day it dropped. The local model setup is genuinely impressive — Cookbook scans your GPU, recommends models, and you're chatting in minutes. No API keys, no monthly bills.&lt;/p&gt;

&lt;p&gt;But within a couple of days, I already hit the wall I always hit with self-hosted AI: memory.&lt;/p&gt;

&lt;p&gt;Odysseus has ChromaDB for basic vector memory. It works for recall within a session. But it won't connect dots across weeks of conversations. It doesn't run agents in the background while I sleep. And when I close my laptop, everything stops.&lt;/p&gt;

&lt;p&gt;So I built a hybrid: &lt;strong&gt;Odysseus runs my local model (free inference), and a cloud agent layer handles persistent memory, scheduling, and background tasks (via ClawBase).&lt;/strong&gt; Both talk to the same local LLM through an authenticated tunnel.&lt;/p&gt;

&lt;p&gt;Here's the full technical setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────┐                     ┌───────────────────────────┐
│     YOUR MACHINE         │                     │     CLOUD (ClawBase)      │
│                          │   authenticated     │                           │
│  Odysseus (port 7000)    │     tunnel          │  OpenClaw Agent           │
│  ├─ Chat UI              │◄──────────────────►│  ├─ Agent logic + tools    │
│  ├─ Agent (MCP, tools)   │                     │  ├─ 6-layer memory stack  │
│  ├─ Documents, Email     │                     │  │  ├─ Daily journal      │
│  └─ ChromaDB (basic mem) │                     │  │  ├─ DAG lossless ctx   │
│                          │                     │  │  ├─ QMD semantic search │
│  Ollama (port 11434)     │                     │  │  ├─ Mem0 curated facts │
│  ├─ Your local model     │◄── LLM inference ──│  │  ├─ Cognee knowledge    │
│  └─ Your GPU (free!)     │   /v1/chat/complete │  │  └─ Graphiti temporal  │
│                          │                     │  ├─ Cron scheduling       │
│  nginx (port 11435)      │                     │  ├─ Telegram/Slack/Disc.  │
│  └─ Auth proxy + TLS     │                     │  └─ Background tasks 24/7 │
│                          │                     │                           │
└──────────────────────────┘                     └───────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ollama, vLLM, and llama.cpp all expose an OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint. Any service that speaks OpenAI API format can use your local model — it just needs a way to reach it.&lt;/p&gt;

&lt;p&gt;The tunnel bridges your local model server to the cloud agent. Your GPU does the inference. The cloud handles everything else.&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0 in API costs (your GPU runs the model)&lt;/li&gt;
&lt;li&gt;6-layer persistent memory that builds up over weeks&lt;/li&gt;
&lt;li&gt;Agents that run on a schedule, even when your machine is off (they queue and execute when you reconnect)&lt;/li&gt;
&lt;li&gt;Odysseus as your local workspace for chat, documents, research&lt;/li&gt;
&lt;li&gt;Telegram/WhatsApp/Slack access to your agent from anywhere&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Odysseus&lt;/strong&gt; installed and running (&lt;a href="https://github.com/pewdiepie-archdaemon/odysseus#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; serving a model (the Cookbook makes this easy)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;ClawBase&lt;/strong&gt; account (or any OpenClaw instance)&lt;/li&gt;
&lt;li&gt;10-15 minutes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Verify your local model is running
&lt;/h2&gt;

&lt;p&gt;After setting up Odysseus and downloading a model through Cookbook, confirm Ollama is serving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your model listed. Test a completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using &lt;strong&gt;vLLM&lt;/strong&gt; instead of Ollama, it's on port 8000 by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;strong&gt;llama.cpp server&lt;/strong&gt;, the default port is 8080:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three speak the same OpenAI-compatible format. The rest of this guide uses Ollama on port 11434, but substitute your port if different.&lt;/p&gt;
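
&lt;p&gt;If you're not sure which server is actually running, a quick loop over the three default ports can tell you. This is only a sketch: &lt;code&gt;probe_ports&lt;/code&gt; is a made-up helper name, and it assumes you kept the default ports listed above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Report which default port answers the OpenAI-style model listing
probe_ports() {
  for port in 11434 8000 8080; do
    # -f makes curl fail on HTTP errors, --max-time keeps dead ports from hanging
    if curl -s -f -o /dev/null --max-time 2 "http://localhost:$port/v1/models"; then
      echo "port $port: serving /v1/models"
    else
      echo "port $port: no response"
    fi
  done
}

probe_ports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whichever port reports that it is serving is the one to substitute for 11434 in the proxy config below.&lt;/p&gt;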




&lt;h2&gt;
  
  
  Step 2: Set up an authenticated reverse proxy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is critical.&lt;/strong&gt; Your local model server has zero authentication by default. Before exposing it through any tunnel, you need a proxy that enforces a Bearer token.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: nginx (recommended for production)
&lt;/h3&gt;

&lt;p&gt;Install nginx if not already present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ubuntu/Debian&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;nginx

&lt;span class="c"&gt;# macOS&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the proxy config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/nginx/sites-available/llm-proxy &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
server {
    listen 11435;

    location / {
        # Enforce Bearer token authentication
        set $expected_token "sk-local-YOUR-SECRET-TOKEN-HERE";

        if ($http_authorization != "Bearer $expected_token") {
            return 401 '{"error": "unauthorized"}';
        }

        # Proxy to local Ollama
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;  # LLM inference can be slow
        proxy_send_timeout 300s;

        # Streaming support (important for chat completions)
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="nb"&gt;sudo ln&lt;/span&gt; &lt;span class="nt"&gt;-sf&lt;/span&gt; /etc/nginx/sites-available/llm-proxy /etc/nginx/sites-enabled/
&lt;span class="nb"&gt;sudo &lt;/span&gt;nginx &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl reload nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a strong token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate a random token&lt;/span&gt;
openssl rand &lt;span class="nt"&gt;-hex&lt;/span&gt; 32
&lt;span class="c"&gt;# Output: a1b2c3d4e5f6...  (use this as your token)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
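
&lt;p&gt;One way to keep the token out of your shell history is to write it to a file only you can read and load it into an environment variable. A minimal sketch: the path &lt;code&gt;~/.llm-proxy-token&lt;/code&gt; and the variable name &lt;code&gt;LLM_PROXY_TOKEN&lt;/code&gt; are placeholders picked for illustration, not something Odysseus or ClawBase looks for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical location for the token; any private path works
TOKEN_FILE="${TOKEN_FILE:-$HOME/.llm-proxy-token}"

# Generate the token in a subshell so the strict umask does not leak into your session
( umask 077; openssl rand -hex -out "$TOKEN_FILE" 32 )
chmod 600 "$TOKEN_FILE"

# Load it for the current shell, with the same prefix used in this guide
LLM_PROXY_TOKEN="sk-local-$(cat "$TOKEN_FILE")"
export LLM_PROXY_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The curl tests can then send &lt;code&gt;-H "Authorization: Bearer $LLM_PROXY_TOKEN"&lt;/code&gt;. The nginx config still needs the same value in its &lt;code&gt;set&lt;/code&gt; line.&lt;/p&gt;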



&lt;p&gt;Test the authenticated endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Should fail (no token)&lt;/span&gt;
curl http://localhost:11435/v1/models
&lt;span class="c"&gt;# → 401 unauthorized&lt;/span&gt;

&lt;span class="c"&gt;# Should succeed (with token)&lt;/span&gt;
curl http://localhost:11435/v1/models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer sk-local-YOUR-SECRET-TOKEN-HERE"&lt;/span&gt;
&lt;span class="c"&gt;# → {"object":"list","data":[{"id":"qwen2.5:14b",...}]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
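
&lt;p&gt;To make those two checks repeatable, a small helper can print just the HTTP status code. A sketch, assuming the proxy from Option A on port 11435; &lt;code&gt;proxy_status&lt;/code&gt; is a hypothetical name, and curl prints &lt;code&gt;000&lt;/code&gt; when it cannot connect at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print only the HTTP status code for a URL, optionally sending a Bearer token
proxy_status() {
  url="$1"
  token="${2:-}"
  if [ -n "$token" ]; then
    curl -s -o /dev/null -w '%{http_code}\n' -H "Authorization: Bearer $token" "$url"
  else
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
  fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;proxy_status http://localhost:11435/v1/models&lt;/code&gt; and expect &lt;code&gt;401&lt;/code&gt;; pass your token as a second argument and expect &lt;code&gt;200&lt;/code&gt;.&lt;/p&gt;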



&lt;h3&gt;
  
  
  Option B: Caddy (simpler config)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Caddyfile&lt;/span&gt;
:11435 &lt;span class="o"&gt;{&lt;/span&gt;
    @auth &lt;span class="o"&gt;{&lt;/span&gt;
        header Authorization &lt;span class="s2"&gt;"Bearer sk-local-YOUR-SECRET-TOKEN-HERE"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    handle @auth &lt;span class="o"&gt;{&lt;/span&gt;
        reverse_proxy localhost:11434
    &lt;span class="o"&gt;}&lt;/span&gt;
    respond 401
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option C: litellm proxy (if you want model aliasing)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; can sit in front of Ollama and add auth + model name mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# litellm_config.yaml&lt;/span&gt;
&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4"&lt;/span&gt;  &lt;span class="c1"&gt;# alias your local model as gpt-4&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama/qwen2.5:14b"&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434"&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-local-YOUR-SECRET-TOKEN-HERE"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm_config.yaml &lt;span class="nt"&gt;--port&lt;/span&gt; 11435
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful if the cloud agent expects specific model names like &lt;code&gt;gpt-4&lt;/code&gt; — you can alias your local model without changing the cloud config.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Create the tunnel
&lt;/h2&gt;

&lt;p&gt;You need to expose port 11435 (the authenticated proxy) to the internet so the cloud agent can reach it. Here are four options, from easiest to most control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Cloudflare Tunnel (easiest, free)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install cloudflared&lt;/span&gt;
&lt;span class="c"&gt;# macOS: brew install cloudflare/cloudflare/cloudflared&lt;/span&gt;
&lt;span class="c"&gt;# Linux: https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/downloads/&lt;/span&gt;

&lt;span class="c"&gt;# Quick tunnel (no Cloudflare account needed, ephemeral URL)&lt;/span&gt;
cloudflared tunnel &lt;span class="nt"&gt;--url&lt;/span&gt; http://localhost:11435
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your quick Tunnel has been created! Visit it at:
https://random-words-here.trycloudflare.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That URL is your tunnel endpoint. For a &lt;strong&gt;persistent tunnel&lt;/strong&gt; (survives reboots, stable URL):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
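&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Authenticate once (needs a Cloudflare account with your domain on it)&lt;/span&gt;
cloudflared tunnel login
cloudflared tunnel create llm-tunnel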
cloudflared tunnel route dns llm-tunnel llm.yourdomain.com

&lt;span class="c"&gt;# Create config&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.cloudflared/config.yml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
tunnel: &amp;lt;tunnel-id&amp;gt;
credentials-file: /home/user/.cloudflared/&amp;lt;tunnel-id&amp;gt;.json

ingress:
  - hostname: llm.yourdomain.com
    service: http://localhost:11435
  - service: http_status:404
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Run as service&lt;/span&gt;
cloudflared service &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option B: Tailscale (best for existing Tailscale users)
&lt;/h3&gt;

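&lt;p&gt;If you already use Tailscale, your machine has a stable IP on the mesh network. No extra tunnel is needed, provided the host running your agent is on the same tailnet (Tailscale addresses are not reachable from outside it):&lt;br&gt;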
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Your Tailscale IP (e.g., 100.x.y.z)&lt;/span&gt;
tailscale ip &lt;span class="nt"&gt;-4&lt;/span&gt;

&lt;span class="c"&gt;# The cloud agent connects to:&lt;/span&gt;
&lt;span class="c"&gt;# http://100.x.y.z:11435/v1/chat/completions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For HTTPS, use &lt;a href="https://tailscale.com/kb/1153/enabling-https/" rel="noopener noreferrer"&gt;Tailscale HTTPS&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tailscale cert your-machine.tailnet-name.ts.net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option C: SSH Reverse Tunnel (quick and dirty)
&lt;/h3&gt;

&lt;p&gt;If you have a VPS or any server with a public IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From your local machine, tunnel port 11435 to the remote server's port 9000&lt;/span&gt;
ssh &lt;span class="nt"&gt;-R&lt;/span&gt; 9000:localhost:11435 user@your-vps.com &lt;span class="nt"&gt;-N&lt;/span&gt;

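&lt;span class="c"&gt;# The cloud agent connects to the address below. sshd binds remote forwards to&lt;/span&gt;
&lt;span class="c"&gt;# loopback unless GatewayPorts is enabled in the VPS's sshd_config.&lt;/span&gt;
&lt;span class="c"&gt;# http://your-vps.com:9000/v1/chat/completions&lt;/span&gt;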
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it persistent with autossh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;autossh &lt;span class="nt"&gt;-M&lt;/span&gt; 0 &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 9000:localhost:11435 user@your-vps.com &lt;span class="nt"&gt;-N&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"ServerAliveInterval 30"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"ServerAliveCountMax 3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option D: NAT Port Forward (classic, no dependencies)
&lt;/h3&gt;

&lt;p&gt;On your router:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Forward external port 11435 → internal IP:11435 (once you add TLS as shown below, forward 443 instead, plus port 80 for certbot's HTTP challenge)&lt;/li&gt;
&lt;li&gt;Set up Dynamic DNS (e.g., noip.com, DuckDNS) if you don't have a static IP&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Add TLS&lt;/strong&gt; with Let's Encrypt + certbot on your nginx:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;certbot python3-certbot-nginx
&lt;span class="nb"&gt;sudo &lt;/span&gt;certbot &lt;span class="nt"&gt;--nginx&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; llm.yourdomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Updated nginx config becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="s"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;llm.yourdomain.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;ssl_certificate&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/llm.yourdomain.com/fullchain.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_certificate_key&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/llm.yourdomain.com/privkey.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;$expected_token&lt;/span&gt; &lt;span class="s"&gt;"sk-local-YOUR-SECRET-TOKEN-HERE"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;if&lt;/span&gt; &lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$http_authorization&lt;/span&gt; &lt;span class="s"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;"Bearer&lt;/span&gt; &lt;span class="nv"&gt;$expected_token&lt;/span&gt;&lt;span class="s"&gt;")&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;401&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="kn"&gt;"error":&lt;/span&gt; &lt;span class="s"&gt;"unauthorized"&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://127.0.0.1:11434&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_send_timeout&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_buffering&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Point ClawBase at your tunnel
&lt;/h2&gt;

&lt;p&gt;This is the only change on the ClawBase side. Open your agent, go to the &lt;strong&gt;Model&lt;/strong&gt; tab, and:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Under &lt;strong&gt;AI Source&lt;/strong&gt;, select "Use your own API key"&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;Provider&lt;/strong&gt; to "Custom (OpenAI-compatible)"&lt;/li&gt;
&lt;li&gt;Fill in the three fields that appear:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Base URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;https://your-tunnel-url.com/v1&lt;/span&gt;
&lt;span class="na"&gt;Model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;qwen2.5:14b   (or whatever you're serving)&lt;/span&gt;
&lt;span class="na"&gt;API Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;sk-local-YOUR-SECRET-TOKEN-HERE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Click &lt;strong&gt;Save Settings&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The "Custom (OpenAI-compatible)" provider accepts any endpoint that speaks the standard &lt;code&gt;/v1/chat/completions&lt;/code&gt; format — Ollama, vLLM, llama.cpp, or anything behind your tunnel.&lt;/p&gt;

&lt;p&gt;Verify it works before saving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://your-tunnel-url.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer sk-local-YOUR-SECRET-TOKEN-HERE"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 50
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a response from your local model, the tunnel is working.&lt;/p&gt;
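
&lt;p&gt;Agents usually stream their responses, so it's worth checking that tokens arrive incrementally rather than in one buffered burst. The sketch below assumes your server honors the OpenAI-style &lt;code&gt;stream&lt;/code&gt; flag on &lt;code&gt;/v1/chat/completions&lt;/code&gt;; &lt;code&gt;stream_check&lt;/code&gt; is a hypothetical helper name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Send a streaming chat request; -N turns off curl's output buffering
# Arguments: base URL ending in /v1, Bearer token, model name
stream_check() {
  base_url="$1"; token="$2"; model="$3"
  curl -N "$base_url/chat/completions" \
    -H "Authorization: Bearer $token" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Count from 1 to 20.\"}], \"stream\": true}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Call it as &lt;code&gt;stream_check https://your-tunnel-url.com/v1 sk-local-YOUR-SECRET-TOKEN-HERE qwen2.5:14b&lt;/code&gt;. You should see a series of &lt;code&gt;data:&lt;/code&gt; lines trickle in. If everything shows up at once after a long pause, something in the path is buffering.&lt;/p&gt;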




&lt;h2&gt;
  
  
  Step 5: Verify the hybrid setup
&lt;/h2&gt;

&lt;p&gt;At this point you have three interfaces backed by the same local model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Background tasks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Odysseus&lt;/strong&gt; (local UI)&lt;/td&gt;
&lt;td&gt;Direct to Ollama on localhost&lt;/td&gt;
&lt;td&gt;ChromaDB (basic vector)&lt;/td&gt;
&lt;td&gt;Only while app is open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;ClawBase&lt;/strong&gt; (cloud agent)&lt;/td&gt;
&lt;td&gt;Through tunnel to Ollama&lt;/td&gt;
&lt;td&gt;6-layer compound stack&lt;/td&gt;
&lt;td&gt;Cron, scheduled, 24/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telegram/Slack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Through ClawBase → tunnel → Ollama&lt;/td&gt;
&lt;td&gt;6-layer compound stack&lt;/td&gt;
&lt;td&gt;Anytime, anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of them use your GPU for inference. None pays OpenAI or Anthropic a cent.&lt;/p&gt;

&lt;p&gt;Test the memory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tell ClawBase something: &lt;em&gt;"My main project uses Next.js with Supabase. I prefer terse responses."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Close the conversation.&lt;/li&gt;
&lt;li&gt;Open a new conversation hours later: &lt;em&gt;"What stack is my project using?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The agent remembers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Try the same in Odysseus. Depending on the model and ChromaDB config, it may or may not retain this. The 6-layer stack (journal, DAG, QMD, Mem0, Cognee, Graphiti) is what makes the difference — each layer captures context differently, so things don't just get stuffed into a vector store and forgotten.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Systemd service (keep it running)
&lt;/h2&gt;

&lt;p&gt;Make the tunnel start on boot. On Ubuntu/Debian the nginx package already runs as a systemd service, so only the tunnel needs a unit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/llm-tunnel.service&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;Unit]
&lt;span class="nv"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LLM Tunnel &lt;span class="o"&gt;(&lt;/span&gt;Cloudflare&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;After&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;network-online.target ollama.service
&lt;span class="nv"&gt;Wants&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;network-online.target

&lt;span class="o"&gt;[&lt;/span&gt;Service]
&lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;simple
&lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/bin/cloudflared tunnel run llm-tunnel
&lt;span class="nv"&gt;Restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always
&lt;span class="nv"&gt;RestartSec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;span class="nv"&gt;User&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-username

&lt;span class="o"&gt;[&lt;/span&gt;Install]
&lt;span class="nv"&gt;WantedBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; llm-tunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the SSH tunnel variant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/llm-ssh-tunnel.service&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;Unit]
&lt;span class="nv"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LLM SSH Reverse Tunnel
&lt;span class="nv"&gt;After&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;network-online.target

&lt;span class="o"&gt;[&lt;/span&gt;Service]
&lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;simple
&lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/bin/ssh &lt;span class="nt"&gt;-R&lt;/span&gt; 9000:localhost:11435 user@your-vps.com &lt;span class="nt"&gt;-N&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"ServerAliveInterval 30"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"ServerAliveCountMax 3"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"ExitOnForwardFailure yes"&lt;/span&gt;
&lt;span class="nv"&gt;Restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always
&lt;span class="nv"&gt;RestartSec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15
&lt;span class="nv"&gt;User&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-username

&lt;span class="o"&gt;[&lt;/span&gt;Install]
&lt;span class="nv"&gt;WantedBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Security considerations
&lt;/h2&gt;

&lt;p&gt;You're exposing a local service to the internet. Take this seriously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always use the auth proxy.&lt;/strong&gt; Never tunnel raw Ollama/vLLM without authentication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate your token&lt;/strong&gt; periodically. Store it as an environment variable, not hardcoded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use TLS.&lt;/strong&gt; Cloudflare Tunnel handles this automatically. For NAT port forward, use Let's Encrypt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit.&lt;/strong&gt; Add rate limiting in nginx to prevent abuse if your token leaks:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# limit_req_zone must sit in the http {} context, outside any server block&lt;/span&gt;
   &lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=llm:10m&lt;/span&gt; &lt;span class="s"&gt;rate=10r/m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=llm&lt;/span&gt; &lt;span class="s"&gt;burst=5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="c1"&gt;# ... rest of proxy config&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Monitor logs.&lt;/strong&gt; Check nginx access logs for unexpected requests:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
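&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;# Show requests the proxy rejected (status 401)&lt;/span&gt;
   &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/nginx/access.log | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'" 401 '&lt;/span&gt;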
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="6"&gt;
&lt;li&gt;
&lt;strong&gt;IP allowlist.&lt;/strong&gt; If your cloud agent has a static IP, lock it down:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;   &lt;span class="k"&gt;allow&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="s"&gt;.3.4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;# ClawBase IP&lt;/span&gt;
   &lt;span class="k"&gt;deny&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
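
&lt;p&gt;Keeping the token in its own file makes rotation a one-line change. nginx does not expand shell environment variables in its config files, so this sketch writes the &lt;code&gt;set&lt;/code&gt; directive to a separate file that the proxy config would pull in with &lt;code&gt;include&lt;/code&gt;. The path &lt;code&gt;/etc/nginx/llm-token.conf&lt;/code&gt; and the helper name &lt;code&gt;render_token_conf&lt;/code&gt; are assumptions for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print an nginx snippet that defines $expected_token for the given secret
render_token_conf() {
  # Single quotes keep $expected_token literal; %s is replaced by the secret
  printf 'set $expected_token "sk-local-%s";\n' "$1"
}

# Preview the line for a freshly generated token
render_token_conf "$(openssl rand -hex 32)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To rotate, pipe the output into the file with &lt;code&gt;render_token_conf "$(openssl rand -hex 32)" | sudo tee /etc/nginx/llm-token.conf&lt;/code&gt;, replace the &lt;code&gt;set $expected_token ...;&lt;/code&gt; line in the &lt;code&gt;location&lt;/code&gt; block with &lt;code&gt;include /etc/nginx/llm-token.conf;&lt;/code&gt;, run &lt;code&gt;sudo nginx -t &amp;amp;&amp;amp; sudo systemctl reload nginx&lt;/code&gt;, and update the API Key field in ClawBase to match.&lt;/p&gt;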






&lt;h2&gt;
  
  
  Performance notes
&lt;/h2&gt;

&lt;p&gt;Local model inference over a tunnel adds network latency. Expect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Time to first token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Odysseus → Ollama (localhost)&lt;/td&gt;
&lt;td&gt;~50-200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClawBase → Tunnel → Ollama&lt;/td&gt;
&lt;td&gt;~200-500ms (depending on tunnel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClawBase → OpenAI API&lt;/td&gt;
&lt;td&gt;~300-800ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tunnel adds latency comparable to a normal API call. For most use cases (agent tasks, background work, Telegram messages), this is imperceptible. For real-time streaming chat, you'll feel it — use Odysseus locally for that.&lt;/p&gt;

&lt;p&gt;Throughput depends on your GPU and model size. A 14B model on an RTX 4090 generates roughly 50 tokens/sec. Through a tunnel, the bottleneck is almost always inference speed, not the network.&lt;/p&gt;
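
&lt;p&gt;The numbers in the table are rough, so it's worth measuring your own setup. curl's &lt;code&gt;-w&lt;/code&gt; option exposes &lt;code&gt;time_starttransfer&lt;/code&gt;, the time until the first response byte. Treat it as an approximation of time to first token, since some servers send headers before the first token is ready. &lt;code&gt;ttft&lt;/code&gt; is a hypothetical helper name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print seconds until the first response byte of a streaming chat request
# Arguments: base URL ending in /v1, Bearer token, model name
ttft() {
  base_url="$1"; token="$2"; model="$3"
  curl -s -N -o /dev/null -w '%{time_starttransfer}\n' "$base_url/chat/completions" \
    -H "Authorization: Bearer $token" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}], \"stream\": true, \"max_tokens\": 16}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compare &lt;code&gt;ttft http://localhost:11434/v1 unused qwen2.5:14b&lt;/code&gt; with &lt;code&gt;ttft https://your-tunnel-url.com/v1 sk-local-YOUR-SECRET-TOKEN-HERE qwen2.5:14b&lt;/code&gt;. The difference is roughly what the proxy and tunnel add.&lt;/p&gt;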




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This works today with no code changes to either project. A couple of things I'm watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Odysseus API&lt;/strong&gt; — Odysseus is 4 days old. If it exposes an API for external access or webhooks for incoming messages, the integration gets tighter: conversations stored in both places, memory synced both ways.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP bridge&lt;/strong&gt; — Both Odysseus and OpenClaw support MCP. A shared MCP server for memory could let both frontends read and write to the same knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't have to pick sides. Your model stays local, your inference stays free, and the memory layer lives wherever makes sense for your setup.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to try this setup, &lt;a href="https://github.com/pewdiepie-archdaemon/odysseus" rel="noopener noreferrer"&gt;Odysseus&lt;/a&gt; is MIT-licensed and free. &lt;a href="https://clawbase.to" rel="noopener noreferrer"&gt;ClawBase&lt;/a&gt; has a 7-day free trial, with plans starting at $16/mo. The tunnel takes about 10 minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions? Drop a comment or find me on &lt;a href="https://x.com/IosifPeterfi" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Tested 33 AI Memory Engines — Here's What Actually Works</title>
      <dc:creator>ClawBase</dc:creator>
      <pubDate>Thu, 28 May 2026 06:57:58 +0000</pubDate>
      <link>https://dev.to/clawbase/i-tested-33-ai-memory-engines-heres-what-actually-works-3dgg</link>
      <guid>https://dev.to/clawbase/i-tested-33-ai-memory-engines-heres-what-actually-works-3dgg</guid>
      <description>&lt;p&gt;6 months ago, I asked my AI agent what we'd been working on last week. It had no idea. Not because it couldn't remember — ChatGPT has memory, Claude has memory — but because I couldn't see what it stored, couldn't query it, couldn't tell it what to forget. A black box with a toggle that says "memory: on."&lt;/p&gt;

&lt;p&gt;So I started testing every memory framework I could find — 33 engines total, running on &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; (350K+ GitHub stars). Most solved one problem well and failed at everything else.&lt;/p&gt;

&lt;p&gt;After 6 months, I landed on an architecture that actually works. It's not about one magic engine — it's about layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The memory stack your agent actually needs
&lt;/h2&gt;

&lt;p&gt;Before diving into the 33 engines, here's what I learned: agent memory isn't one thing. It's a stack, much as the human brain has short-term memory, long-term memory, and the ability to look things up.&lt;/p&gt;

&lt;p&gt;A working agent memory stack has 3 layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Conversation compression — remembering what just happened
&lt;/h3&gt;

&lt;p&gt;Every conversation eventually hits the context window limit. Without this layer, your agent literally forgets the beginning of your current conversation. A conversation compressor (like &lt;a href="https://github.com/Martian-Engineering/lossless-claw" rel="noopener noreferrer"&gt;Lossless-Claw&lt;/a&gt;) keeps a DAG of summaries — compacting older turns into condensed summaries while keeping the most recent turns untouched. Your agent never loses mid-session context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Native files + semantic search — the persistent record
&lt;/h3&gt;

&lt;p&gt;Plain markdown files your agent reads and writes: daily journals (&lt;code&gt;2026-05-28.md&lt;/code&gt;), a curated &lt;code&gt;MEMORY.md&lt;/code&gt;, preference files, project notes. Simple, version-controlled, human-readable. No database, no API, no dependencies — this is the memory layer that survives everything.&lt;/p&gt;

&lt;p&gt;A local embedding model indexes these files and lets your agent search by meaning, not just keywords. "How did we handle the auth migration?" finds the right entry even if it never used the word "auth." &lt;a href="https://github.com/tobi/qmd" rel="noopener noreferrer"&gt;QMD&lt;/a&gt; runs a 333MB GGUF model locally — sub-second search, no API costs, no data leaving your machine. The files are the source of truth; the embeddings make them instantly searchable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: The long-term intelligence engine — this is where you choose
&lt;/h3&gt;

&lt;p&gt;The first two layers are table stakes. Every serious agent needs them. The third layer is where the 33 engines I tested come in — and where the real differences emerge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 33 engines I tested
&lt;/h2&gt;

&lt;p&gt;Here's every memory framework I put through real-world use — not benchmarks, not demos, actual daily agent work. They naturally group into 6 categories, each solving a different type of remembering:&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector similarity — the foundation layer
&lt;/h3&gt;

&lt;p&gt;These engines store embeddings and retrieve by semantic similarity. They're the building blocks most other memory systems are built on top of.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/chroma-core/chroma" rel="noopener noreferrer"&gt;ChromaDB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Embedding-based semantic search, lightweight and developer-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/qdrant/qdrant" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;High-performance vector similarity search with filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/weaviate/weaviate" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Hybrid vector + keyword search with pluggable modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/milvus-io/milvus" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Distributed vector database built for scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Serverless managed vector search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Vector similarity search as a PostgreSQL extension&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/facebookresearch/faiss" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Meta's similarity search library — raw speed, no frills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/redis/redis" rel="noopener noreferrer"&gt;Redis Vector&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Vector similarity on Redis Stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/supabase/supabase" rel="noopener noreferrer"&gt;Supabase Vector&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;pgvector on managed Postgres with auth and APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/marqo-ai/marqo" rel="noopener noreferrer"&gt;Marqo&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;End-to-end tensor search engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/activeloopai/deeplake" rel="noopener noreferrer"&gt;Deep Lake&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Vector store optimized for AI dataset versioning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/vespa-engine/vespa" rel="noopener noreferrer"&gt;Vespa&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Hybrid search + ML serving at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are excellent at "find me something similar to X" but they don't understand &lt;em&gt;what&lt;/em&gt; they're storing. A vector store treats your preferences, your project architecture, and last Tuesday's standup notes the same way — as floating-point arrays. For RAG and document retrieval, they're essential. For agent memory, they're a necessary layer but not sufficient on their own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session &amp;amp; conversation memory — remembering the current thread
&lt;/h3&gt;

&lt;p&gt;These keep track of what's been said within and across conversations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/getzep/zep" rel="noopener noreferrer"&gt;Zep&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Long-term conversation memory with automatic fact extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/getmetal/motorhead" rel="noopener noreferrer"&gt;Motorhead&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Redis-backed conversation memory server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ChatGPT's native conversation memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;&lt;a href="https://claude.ai/" rel="noopener noreferrer"&gt;Claude Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Anthropic's native conversation memory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These solve the "I already told you this" problem, within a session and to some degree across sessions. Zep stands out here: it goes beyond simple buffer storage and extracts structured facts from conversations. But conversation memory alone doesn't give your agent a persistent understanding of your world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework memory modules — memory as a feature
&lt;/h3&gt;

&lt;p&gt;These are memory components built into larger agent/RAG frameworks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/run-llama/llama_index" rel="noopener noreferrer"&gt;LlamaIndex Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Chat memory + knowledge index integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/langchain-ai/langchain" rel="noopener noreferrer"&gt;LangChain Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Buffer, summary, and entity memory modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/langchain-ai/langmem" rel="noopener noreferrer"&gt;LangMem&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Memory management primitives for LangChain/LangGraph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/deepset-ai/haystack" rel="noopener noreferrer"&gt;Haystack Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Document store memory in RAG pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/neuml/txtai" rel="noopener noreferrer"&gt;txtai&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;All-in-one embeddings database with workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/crewAIInc/crewAI" rel="noopener noreferrer"&gt;CrewAI Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Short/long/entity memory for multi-agent crews&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Good if you're already inside one of these ecosystems. They give you memory abstractions (buffers, summaries, entity tracking), but each is tightly coupled to its own framework. Memory is a feature of these tools, not their core mission.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic &amp;amp; autonomous memory — the agent manages its own memory
&lt;/h3&gt;

&lt;p&gt;These let the agent itself decide what to remember and what to forget.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/letta-ai/letta" rel="noopener noreferrer"&gt;Letta (MemGPT)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Self-editing memory with inner/outer monologue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Significant-Gravitas/AutoGPT" rel="noopener noreferrer"&gt;AutoGPT Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;File + vector memory for autonomous agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/kingjulio8238/memary" rel="noopener noreferrer"&gt;Memary&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Knowledge graph memory for autonomous agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Josh-XT/AGiXT" rel="noopener noreferrer"&gt;AGiXT&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Adaptive memory with chained agent context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/yoheinakajima/babyagi" rel="noopener noreferrer"&gt;BabyAGI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Task-driven memory with priority queues&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fascinating research direction. Letta/MemGPT in particular pioneered the idea of the model managing its own memory tiers. The challenge in production: you're trusting the LLM to decide what's worth keeping, and that decision quality varies with the model and context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personal AI &amp;amp; bookmarks — memory for humans, not agents
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/khoj-ai/khoj" rel="noopener noreferrer"&gt;Khoj&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted personal AI with file-based memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/supermemoryai/supermemory" rel="noopener noreferrer"&gt;SuperMemory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;AI-powered memory for saved content and bookmarks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/vanna-ai/vanna" rel="noopener noreferrer"&gt;Vanna&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;RAG-based memory for database queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are designed more as personal knowledge tools than agent memory layers. They work well for their use case, but they're solving a different problem — helping &lt;em&gt;you&lt;/em&gt; remember things, not giving &lt;em&gt;your agent&lt;/em&gt; persistent understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured memory engines — purpose-built for agent intelligence
&lt;/h3&gt;

&lt;p&gt;These are the engines designed specifically to give agents structured, queryable, persistent memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Intelligent fact extraction, deduplication, contradiction resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/topoteretes/cognee" rel="noopener noreferrer"&gt;Cognee&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Entity-relationship knowledge graphs with 14 retrieval modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Temporal knowledge graph with validity windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is where it gets interesting — and where I spent most of my 6 months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 tiers of long-term memory
&lt;/h2&gt;

&lt;p&gt;After testing all 33, the structured memory engines stood out. But here's the insight that took me months to reach: &lt;strong&gt;these three aren't meant to run together. They're evolutionary tiers.&lt;/strong&gt; Each one supersedes the previous, adding capabilities while covering the lower tier's functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier 1: Mem0 — facts and preferences
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0&lt;/a&gt; (48K+ GitHub stars, &lt;a href="https://mem0.ai/series-a" rel="noopener noreferrer"&gt;$24M Series A&lt;/a&gt;) is the intelligent facts layer. Tell your agent "I prefer TypeScript" on Monday and "use Python for data scripts" on Thursday — Mem0 doesn't store two contradictory entries. It updates: TypeScript for general dev, Python for data. Every fact is categorized, timestamped, and confidence-scored.&lt;/p&gt;
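
&lt;p&gt;The update-instead-of-duplicate behavior can be sketched as a store keyed by topic and scope. This is an illustration under assumed names, not Mem0's API; in a real system an LLM would decide the topic and scope of each statement, and the dates below are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Fact:
    topic: str    # what the fact is about, e.g. "language"
    scope: str    # where it applies, e.g. "general" or "data scripts"
    value: str
    updated: str  # ISO date of the statement


class FactStore:
    """Keeps one current value per (topic, scope) instead of appending duplicates."""

    def __init__(self):
        self._facts = {}

    def remember(self, fact):
        key = (fact.topic, fact.scope)
        current = self._facts.get(key)
        # Same topic and scope: keep whichever statement is newer.
        # Same topic, different scope: both are kept side by side.
        if current is None or fact.updated >= current.updated:
            self._facts[key] = fact

    def recall(self, topic):
        return {f.scope: f.value for f in self._facts.values() if f.topic == topic}


store = FactStore()
store.remember(Fact("language", "general", "TypeScript", "2026-05-25"))
store.remember(Fact("language", "data scripts", "Python", "2026-05-28"))
preferences = store.recall("language")
```

&lt;p&gt;Here &lt;code&gt;preferences&lt;/code&gt; holds TypeScript for general work and Python for data scripts, rather than two entries that contradict each other.&lt;/p&gt;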

&lt;p&gt;Where Zep's fact extraction is a feature bolted onto session memory, Mem0's entire architecture is built around making facts reliable. Your agent starts every session already knowing your preferences, your project's quirks, and your conventions. No re-explaining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for: developers and technical use cases.&lt;/strong&gt; If your agent mainly needs to remember preferences, conventions, and project details across sessions, Mem0 is the right choice. It's the simplest to set up and the most focused.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier 2: Cognee — relationships and reasoning (supersedes Mem0)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/topoteretes/cognee" rel="noopener noreferrer"&gt;Cognee&lt;/a&gt; (&lt;a href="https://www.cognee.ai/blog/cognee-news/cognee-raises-seven-million-five-hundred-thousand-dollars-seed" rel="noopener noreferrer"&gt;$7.5M seed&lt;/a&gt;, &lt;a href="https://www.cognee.ai/blog/cognee-news/cognee-github-secure-open-source-program" rel="noopener noreferrer"&gt;GitHub Secure Open Source graduate&lt;/a&gt;, running in 70+ companies) builds a knowledge graph — not isolated facts, but a web of entities, relationships, and semantic connections.&lt;/p&gt;

&lt;p&gt;Where Mem0 knows "the client prefers blue branding," Cognee knows that the client's brand guidelines connect to last month's campaign performance, which connects to the audience segments that engaged most, which connects to the content calendar. It ships 14 retrieval modes and a self-improving "memify" feature that strengthens connections the more you use them.&lt;/p&gt;

&lt;p&gt;Cognee handles everything Mem0 does (facts are just nodes in the graph) &lt;em&gt;plus&lt;/em&gt; it maps the relationships between them. That's why it supersedes Tier 1 — you don't need Mem0 if you're running Cognee.&lt;/p&gt;
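
&lt;p&gt;As a rough illustration of what "a web of entities and relationships" buys you, the sketch below stores the example chain as directed edges and finds how two entities connect. The edge list and function names are hypothetical, and this is plain breadth-first search, not one of Cognee's retrieval modes.&lt;/p&gt;

```python
from collections import deque

# Hypothetical edges: (source, relation, target)
EDGES = [
    ("client", "has", "brand guidelines"),
    ("brand guidelines", "shaped", "last month's campaign"),
    ("last month's campaign", "engaged", "audience segments"),
    ("audience segments", "inform", "content calendar"),
]


def build_graph(edges):
    """Turn the edge list into an adjacency map."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, []).append((relation, target))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the chain of nodes linking start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _relation, neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

&lt;p&gt;A flat fact store can answer "what does the client prefer?" but has nothing to traverse when the question is how the client relates to the content calendar.&lt;/p&gt;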

&lt;p&gt;&lt;strong&gt;Best for: marketing, content, and multi-project work.&lt;/strong&gt; If your agent needs to reason across brands, campaigns, audiences, and projects — understanding &lt;em&gt;how&lt;/em&gt; things connect, not just &lt;em&gt;what&lt;/em&gt; things are — Cognee is the right choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier 3: Graphiti — temporal reasoning (supersedes Cognee)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt; by Zep is the temporal knowledge graph. Its core insight: knowing the current state isn't enough. You need to know &lt;em&gt;when&lt;/em&gt; things changed and &lt;em&gt;what was true before&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Every fact carries validity intervals. When new information conflicts with old, Graphiti doesn't overwrite — it creates a temporal record and invalidates the previous one, preserving full history. "When did this config change?" "What was different before the March deploy?" Graphiti answers directly, no digging through logs.&lt;/p&gt;
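
&lt;p&gt;The invalidate-instead-of-overwrite idea can be sketched in a few lines. The field names and the &lt;code&gt;db.pool_size&lt;/code&gt; example are assumptions for illustration, not Graphiti's schema or API, and the sketch assumes facts arrive in chronological order.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalFact:
    subject: str
    value: str
    valid_from: str                 # ISO date the fact became true
    valid_to: Optional[str] = None  # None means "still true"


class TemporalStore:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, value, when):
        # Close the currently valid record instead of overwriting it.
        for fact in self.facts:
            if fact.subject == subject and fact.valid_to is None:
                fact.valid_to = when
        self.facts.append(TemporalFact(subject, value, when))

    def as_of(self, subject, when):
        """Return the value that was valid on the given date, if any."""
        for fact in self.facts:
            started = when >= fact.valid_from
            not_ended = fact.valid_to is None or fact.valid_to > when
            if fact.subject == subject and started and not_ended:
                return fact.value
        return None


store = TemporalStore()
store.assert_fact("db.pool_size", "10", "2026-01-10")
store.assert_fact("db.pool_size", "50", "2026-03-15")
before_march_deploy = store.as_of("db.pool_size", "2026-02-01")
```

&lt;p&gt;Both records survive, so "what was true before the March deploy?" is a lookup rather than a log search.&lt;/p&gt;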

&lt;p&gt;In Zep's published evaluation, it &lt;a href="https://arxiv.org/abs/2501.13956" rel="noopener noreferrer"&gt;outperforms MemGPT&lt;/a&gt; on the Deep Memory Retrieval benchmark. Retrieval combines semantic search, keyword matching, and graph traversal.&lt;/p&gt;

&lt;p&gt;Graphiti handles facts (like Mem0) and relationships (like Cognee) &lt;em&gt;plus&lt;/em&gt; tracks how they change over time. It supersedes both lower tiers — but it's also the heaviest to run (FalkorDB, more compute, more complexity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for: operations, executive, and business use cases.&lt;/strong&gt; If your agent needs cause-and-effect reasoning across time — "what changed," "when did it break," "what was true before" — Graphiti is the right choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pick one, not all three
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your use case&lt;/th&gt;
&lt;th&gt;Pick this tier&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Developer / DevOps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mem0&lt;/td&gt;
&lt;td&gt;You need fast, reliable fact recall. Preferences, conventions, project details.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Marketing / Content&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cognee&lt;/td&gt;
&lt;td&gt;You need relationship reasoning. Brands, campaigns, audiences, how they connect.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operations / Executive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graphiti&lt;/td&gt;
&lt;td&gt;You need temporal reasoning. What changed, when, and what broke.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common mistake is thinking that more engines means better memory. It doesn't. Each tier already includes the capabilities of the one below it. Running Mem0 alongside Graphiti is redundant because Graphiti already stores facts. Running all three wastes compute and risks the engines disagreeing about the same fact.&lt;/p&gt;

&lt;p&gt;Pick the tier that matches your work. Pair it with the base stack (conversation compression + native files with semantic search) and your agent will remember everything that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full architecture
&lt;/h2&gt;

&lt;p&gt;Here's what a complete agent memory stack looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqncuszg1vmkgyh17k3u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqncuszg1vmkgyh17k3u8.png" alt="Agent Memory Architecture — 3 layers: conversation compression, native files + semantic search, and a long-term intelligence engine (pick one tier)" width="800" height="970"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every layer feeds context to the model. The bottom two are always-on. The top one is your choice based on what kind of reasoning your agent needs.&lt;/p&gt;
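
&lt;p&gt;One way to picture "every layer feeds context to the model" is a function that merges the three layers into a single prompt preamble. This is a hypothetical sketch, not how OpenClaw assembles its prompts; the section titles and argument names are assumptions.&lt;/p&gt;

```python
def build_context(summary, recent_turns, file_hits, engine_facts):
    """Assemble one prompt preamble from the three memory layers."""
    sections = [
        ("Conversation so far", [summary] + recent_turns),  # layer 1
        ("Relevant notes", file_hits),                       # layer 2
        ("Long-term memory", engine_facts),                  # layer 3
    ]
    lines = []
    for title, items in sections:
        if items:  # skip layers that returned nothing
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)
```

&lt;p&gt;Swapping the long-term engine only changes what is passed as &lt;code&gt;engine_facts&lt;/code&gt;; the other two layers are untouched.&lt;/p&gt;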

&lt;h2&gt;
  
  
  Getting this running
&lt;/h2&gt;

&lt;p&gt;The base stack (layers 1–2) is built into &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; — conversation compression, native memory files, and semantic search work out of the box. The long-term engine (layer 3) requires additional setup: Mem0 needs a vector store, Cognee needs a graph database, Graphiti runs on FalkorDB.&lt;/p&gt;

&lt;p&gt;OpenClaw is open source and you can self-host the full stack. If you want to skip the infrastructure work, I've been building &lt;a href="https://clawbase.to" rel="noopener noreferrer"&gt;ClawBase&lt;/a&gt; — managed OpenClaw hosting that pre-configures the right memory stack for your use case. But honestly, even if you self-host, the main takeaway here is the architecture: &lt;strong&gt;a 3-layer memory stack where you pick the long-term engine that matches your work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whichever way you run it, the memory compounds: the longer you use it, the better it gets.&lt;/p&gt;

&lt;p&gt;One thing I keep coming back to: once your agent has a real memory stack, it opens the door to something bigger — consistent shared memory across multiple agents. Imagine a team of agents that don't just remember their own context, but share a unified understanding of your projects, preferences, and decisions. That's a different kind of architecture entirely, and one I'll dig into in a future article.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>aiagents</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
