Diego Falciola
Every AI Agent Framework Has a Memory Problem. Here's How I Fixed Mine.

If you've built anything with AI agents, you've hit this wall. Your agent works great in a single session. You close the conversation, come back tomorrow, and it has no idea who you are, what you were working on, or why you care.

It's the most discussed unsolved problem in the AI agent community right now. I'm not guessing — I spent weeks reading every thread on r/AI_Agents, r/LangChain, r/LLMDevs, and Hacker News about it. The frustration is everywhere:

"I keep running into the same wall — they forget everything between sessions. I can dump the entire conversation history into every prompt, but that burns through tokens fast and doesn't scale."

"Memory persistence problem in AI agents is worse than I expected."

"The real trick is making the agent decide what's worth persisting vs what's throwaway."

That last quote is the one that matters. Not "how do we store everything" — but how does the agent know what's worth remembering?

What Everyone Tries (And Why It Breaks)

I tried all of these before building my own system. Quick rundown of why each one fails on its own:

Full conversation history dump. You feed the entire chat log into every prompt. Works for 5 messages. By message 50, you're burning $2 per request and the model is drowning in noise. The important stuff from message 3 gets buried under 47 messages of back-and-forth about formatting.

Summarization. Have the LLM summarize older conversations and inject that summary. Better on tokens, but summaries lose the specific details that matter. "User is working on an e-commerce project" is a lot less useful than "User's Shopify store uses custom metafields for inventory and their API key expires March 20."

Vector databases / RAG. Embed everything, retrieve by similarity. This works for knowledge bases — documentation, FAQs, reference material. It doesn't work well for personal context. "What was the user frustrated about last Tuesday?" isn't the kind of query that semantic search handles cleanly. You get adjacent results, not the right ones.

Just accept the reset. Some people give up and treat each session as fresh. Fine for a chatbot. Useless for an agent that's supposed to work on multi-day projects, track your preferences, or manage ongoing tasks.

None of these is wrong. They're all incomplete. The real problem is that memory isn't one thing — it's at least four different things pretending to be one.

Four Layers, Each Doing One Job

When I built AIBot Framework, I stopped trying to find one memory solution and built four:

Layer 1: Conversation logs (the baseline)

Every message, in and out, saved as JSONL. Append-only, timestamped. This is your audit trail, not your memory system. You can search it, but you don't inject it wholesale into prompts.

Nothing special here. Every framework does this.
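In code, the append-only log is only a few lines. This is a minimal sketch, not the framework's actual schema: the file path and the exact record fields (`ts`, `role`, `content`) are assumptions for illustration.

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical log location; the framework's real path layout may differ.
LOG_PATH = Path(tempfile.mkdtemp()) / "conversation.jsonl"

def log_message(role: str, content: str, path: Path = LOG_PATH) -> None:
    """Append one timestamped message as a single JSON line (append-only)."""
    entry = {"ts": time.time(), "role": role, "content": content}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_log(path: Path = LOG_PATH) -> list[dict]:
    """Read the full audit trail back; used for search, never dumped into prompts."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

log_message("user", "my deadline moved to April 20")
log_message("assistant", "Noted, updating the project deadline.")
```

Append-only JSONL means a crash can at worst truncate the last line; every earlier entry stays intact.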

Layer 2: Long-term searchable memory

SQLite with FTS5 (full-text search). When something happens that's worth noting — a decision made, a preference stated, a task completed — the agent calls save_memory with a text note. These notes are timestamped and searchable across sessions.

Before the agent acts on something from a previous conversation, it runs memory_search to pull relevant context. Not the whole history. Just what matches.

The difference from RAG: these aren't chunks of documents. They're the agent's own notes about what happened and why it mattered. Think "journal entries" not "search results."
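A minimal sketch of layer 2 using Python's built-in `sqlite3` module, assuming it was compiled with FTS5 (most builds are). The `save_memory` and `memory_search` names come from the post; the real tools presumably attach more metadata (tags, session IDs) than this:

```python
import sqlite3
import time

# In-memory DB for the demo; the real store would be a file on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(note, ts UNINDEXED)")

def save_memory(note: str) -> None:
    """Persist a timestamped note the agent decided was worth keeping."""
    db.execute("INSERT INTO memories (note, ts) VALUES (?, ?)", (note, time.time()))
    db.commit()

def memory_search(query: str, limit: int = 5) -> list[str]:
    """Full-text match against past notes, best matches first."""
    rows = db.execute(
        "SELECT note FROM memories WHERE memories MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [r[0] for r in rows]

save_memory("User's Shopify store uses custom metafields for inventory")
save_memory("Decided against a vector DB for personal context; FTS5 is enough")
```

`memory_search("metafields")` returns only the first note. Exact keyword match is the point here: "March 20" finds "March 20", not a semantically adjacent date.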

Layer 3: Core memory (the structured model)

This is the one I haven't seen anywhere else.

Core memory is a key-value store organized by category: identity, relationships, preferences, goals, constraints, general. Each entry has a key, a value, and an importance score (1-10).

Examples of what lives here:

  • preferences.language → "User prefers Spanish for casual conversation, English for technical discussion" (importance: 8)
  • goals.current_project → "Building a SaaS for dental clinics, MVP due April 15" (importance: 9)
  • relationships.diego → "Operator/creator. Based in Argentina. Most responsive late morning to evening." (importance: 7)

This isn't conversation history. It's a structured model of what the agent knows. When the agent needs context, it queries core memory first — it's cheap (no LLM call, just a key-value lookup) and precise.

The agent updates this itself. When you tell it you changed your project deadline, it runs core_memory_replace to update the old value. When it learns something new about you, it appends. The categories keep it organized so "what does this user prefer" and "what are this user's goals" are different queries with different answers.
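A sketch of core memory as category-scoped dictionaries. The `Entry` shape and the `core_memory_append` helper are illustrative assumptions; `core_memory_replace` mirrors the tool named above:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    value: str
    importance: int  # 1-10, as in the post's examples

CATEGORIES = {"identity", "relationships", "preferences", "goals", "constraints", "general"}
core: dict[str, dict[str, Entry]] = {c: {} for c in CATEGORIES}

def core_memory_append(category: str, key: str, value: str, importance: int) -> None:
    """Record a newly learned fact under its category."""
    core[category][key] = Entry(value, importance)

def core_memory_replace(category: str, key: str, new_value: str) -> None:
    """Overwrite a stale fact in place, keeping its importance score."""
    old = core[category][key]
    core[category][key] = Entry(new_value, old.importance)

core_memory_append("goals", "current_project",
                   "SaaS for dental clinics, MVP due April 15", importance=9)
# Deadline moved: replace, don't append a second conflicting fact.
core_memory_replace("goals", "current_project",
                    "SaaS for dental clinics, MVP due April 20")
```

No LLM call, no embedding: reading "what are this user's goals" is a plain dictionary lookup on `core["goals"]`.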

Layer 4: Context compaction

When a conversation gets long (happens a lot with autonomous agents running multi-step tasks), the system summarizes older parts of the conversation to keep the context window useful. But — and this is the key part — the important specifics have already been captured in layers 2 and 3. The compaction doesn't lose critical information because the critical information was extracted before compaction happened.

This is what makes the system work as a whole rather than as four disconnected pieces. The layers feed each other: conversation generates notes (layer 2) and structured facts (layer 3), which survive compaction (layer 4) and persist across sessions independently of the conversation log (layer 1).
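The compaction step itself can be sketched like this, with `summarize` standing in for the real LLM call and the thresholds chosen arbitrarily for the demo:

```python
def summarize(messages: list[str]) -> str:
    # Placeholder: the real system would ask the LLM for a summary here.
    # Losing detail is acceptable because layers 2 and 3 already hold
    # the critical specifics extracted before compaction runs.
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages: list[str], keep_recent: int = 4, budget: int = 6) -> list[str]:
    """Leave recent turns verbatim; fold everything older into one summary stub."""
    if len(messages) <= budget:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

history = [f"msg {i}" for i in range(10)]
compacted = compact(history)  # 10 messages -> 1 summary + 4 recent turns
```

The ordering guarantee matters more than the summarizer: extract to layers 2 and 3 first, compact second, so compaction can be lossy without being destructive.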

The Part Nobody Talks About: Who Decides What to Remember?

This is where most memory solutions fall apart. If you make the developer tag everything manually ("this message is important, save it"), nobody does it consistently. If you save everything automatically, you get noise.

The approach that works: the agent decides.

The LLM already understands context. When it sees "my deadline moved to April 20," it knows that's a fact worth persisting. When it sees "ok sounds good," it knows that's not. The agent has instructions in its personality definition (we call them "soul files") about when to save memory, when to update core memory, and what importance level to assign.

Is it perfect? No. Sometimes it saves things that don't matter. Rarely, it misses something it should have caught. But it's dramatically better than any rule-based system I tried, and it improves as the underlying models improve, without me changing any code.

The human stays in the loop too. You can manually save to core memory, and the agent flags when it's uncertain about whether to persist something. But for 90%+ of cases, automatic saves just work.
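One way to hand the decision to the model is to expose the memory operations as tools whose descriptions encode the save criteria. The OpenAI-style function-schema shape below is my assumption about the wire format, not the framework's actual definition; only the tool names come from the post:

```python
# Hypothetical tool declarations: the descriptions do the real work, telling
# the model *when* to persist ("decisions, preferences") and when not to
# ("skip small talk"). The soul file would carry the fuller policy.
MEMORY_TOOLS = [
    {
        "name": "save_memory",
        "description": (
            "Save a note about a decision, stated preference, or completed "
            "task worth recalling in future sessions. Skip small talk and "
            "acknowledgements like 'ok sounds good'."
        ),
        "parameters": {
            "type": "object",
            "properties": {"note": {"type": "string"}},
            "required": ["note"],
        },
    },
    {
        "name": "core_memory_replace",
        "description": "Update a structured fact that changed, e.g. a moved deadline.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["identity", "relationships", "preferences",
                             "goals", "constraints", "general"],
                },
                "key": {"type": "string"},
                "value": {"type": "string"},
                "importance": {"type": "integer", "minimum": 1, "maximum": 10},
            },
            "required": ["category", "key", "value"],
        },
    },
]
```

Because the policy lives in the descriptions rather than in code, swapping in a stronger model upgrades the save/skip judgment for free.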

What This Looks Like in Practice

I've been running this system for weeks. Here's what changed:

Before (session-based memory): Every morning I'd re-explain my project context, goals, and preferences. By Wednesday I'd given up and just accepted that the bot was goldfish-brained.

After (4-layer memory): The bot picks up exactly where we left off. It remembers that I prefer direct communication, that I'm targeting a specific market segment, that last week's experiment didn't work and why. It remembers the names of people I work with, the tools I use, and the constraints I've mentioned once and never repeated.

The difference isn't subtle. It's the difference between talking to a stranger every day and talking to a colleague who's been on your team for months.

How to Try It

The framework is open source and self-hosted. Memory is available on every tier, including free.

  • Free tier: All 4 memory layers, 1 bot, local LLM via Ollama. $0.
  • Pro tier ($79/mo): Multiple bots sharing memory context, cloud LLM access, autonomous agent loop.
  • Founding member price: $49/mo locked for 12 months for the first 50 users.

No token markup. BYO API keys. Self-hosted means your memory data stays on your machine.

👉 Get early access

The code is on GitHub if you want to dig into the implementation before committing. I'm happy to answer architecture questions in the comments — especially if you've tried solving the memory problem yourself and hit walls I haven't thought of.


This is Part 2 of a series on building autonomous AI agents. Part 1 covered dynamic tool creation and multi-agent collaboration.
