This essay documents our evolution through three memory system versions for AI companions. Each iteration addressed limitations of its predecessor, reflecting different conceptual approaches to what "memory" means for conversational AI.
AI agent memory is how a conversational system decides what to remember about a user, stores it, and retrieves it in future conversations. At Emotion Machine, we've built three distinct architectures for this: pgvector with importance scoring, an LLM-managed scratchpad, and a filesystem that agents navigate with bash. Each solves different problems. None of them replaced the others entirely.
The core challenge: what to remember, when to remember it, and how to surface it without making the companion feel like it's reading from a dossier.
What does the AI memory landscape look like?
The field's current state heavily influenced our design decisions.
MemGPT treats the LLM context window like RAM and external storage like disk, using function calls (core_memory_append, archival_memory_search) to page information in and out.
From a cognitive science lens, four memory types map to things you can actually build. Working memory is the context window. Semantic memory is facts. Episodic memory is experiences. Procedural memory is instructions and skills.
Context is a finite resource with diminishing marginal returns. Every token competes for attention. Context rot is real.
There's also a tension in the field between "filesystems are sufficient" advocates and "filesystems are just bad databases" critics. We adopted a pragmatic middle ground: real files for agent navigation, database caching for fast chat access.
For conversational products, V2's simpler scratchpad model covers most needs. V3's filesystem approach suits autonomous agent workflows requiring sandboxed execution.
How does pgvector memory work? (V1)
Our first system was classic RAG with importance weighting for selective retrieval.
An LLM (gpt-4o-mini) scores each piece of information 1-10. Identity statements and deadlines score 9-10. Preferences and goals land at 7-9. Interests and tasks get 5-7. Transient details score 1-4.
We added heuristic floors as safety nets. After normalizing LLM scores to 0-1, pattern-based rules enforce minimums regardless of what the LLM said: "my name is" gets a floor of 0.85, goals 0.75, constraints 0.65, preferences 0.60.
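A minimal sketch of the floor logic, with hypothetical patterns and the function name invented for illustration:

```python
import re

# Hypothetical pattern -> minimum normalized score, mirroring the floors above.
HEURISTIC_FLOORS = [
    (re.compile(r"\bmy name is\b", re.I), 0.85),   # identity
    (re.compile(r"\b(goal|want to|plan to)\b", re.I), 0.75),  # goals
    (re.compile(r"\b(can't|must|never)\b", re.I), 0.65),      # constraints
    (re.compile(r"\b(prefer|like|hate)\b", re.I), 0.60),      # preferences
]

def final_importance(text: str, llm_score_1_to_10: float) -> float:
    """Normalize the LLM's 1-10 score to 0-1, then apply pattern floors."""
    score = llm_score_1_to_10 / 10.0
    for pattern, floor in HEURISTIC_FLOORS:
        if pattern.search(text):
            score = max(score, floor)
    return score
```

The point of the floor is that even if the LLM underrates "my name is Ada" at 3/10, the stored importance never drops below 0.85.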
Retrieval works in two stages: pgvector's HNSW index pulls ~300 candidates, then a re-ranker combines similarity × importance × user_weight × recency_decay.
We gated retrieval to avoid unnecessary latency. It only triggers on keywords like "remember" or "my name," or on a periodic cadence (~2 turns or 30-second gaps).
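The second-stage re-rank can be sketched as follows. Field names, the 30-day half-life, and the exponential decay form are assumptions; only the four-factor product comes from the system described above.

```python
import math
import time

def rerank(candidates, user_weights, now=None, half_life_days=30.0, k=10):
    """Score pgvector candidates: similarity x importance x user_weight x recency_decay.

    `candidates` are dicts with similarity, importance (0-1), category, and
    created_at (unix seconds) -- these field names are illustrative, not the real schema.
    """
    now = now or time.time()

    def score(mem):
        age_days = (now - mem["created_at"]) / 86400.0
        decay = math.exp(-math.log(2) * age_days / half_life_days)  # halves every 30 days
        weight = user_weights.get(mem["category"], 1.0)
        return mem["similarity"] * mem["importance"] * weight * decay

    return sorted(candidates, key=score, reverse=True)[:k]
```

Note how the product lets a highly important but only moderately similar memory beat a fresher, more similar but trivial one.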
Conversation Turn
│
▼
┌─────────────┐ no ┌──────────────────┐
│ Gate check: │───────────│ Skip retrieval, │
│ should we │ │ respond directly │
│ retrieve? │ └──────────────────┘
└─────┬───────┘
│ yes
▼
┌─────────────┐ ┌──────────────────┐
│ Embed │──────────▶│ pgvector HNSW │
│ query │ │ ~300 candidates │
└─────────────┘ └────────┬─────────┘
│
▼
┌──────────────────┐
│ Re-rank: │
│ sim × importance │
│ × weight × decay│
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Top-k memories │
│ → system prompt │
└──────────────────┘
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
Async Ingestion (background, never blocks user)
User/Assistant message
│
▼
┌─────────────┐ ┌──────────────────┐
│ LLM scores │──────────▶│ Heuristic floor │
│ importance │ │ (identity: 0.85 │
│ (1-10) │ │ goals: 0.75...) │
└─────────────┘ └────────┬─────────┘
│ above threshold?
▼
┌──────────────────┐
│ Embed + store │
│ in pgvector │
└──────────────────┘
A separate knowledge base (OpenAI Vector Store) handles document retrieval for PDFs and FAQs. Conceptually distinct from personal memory.
What went wrong: Selective retrieval misses important information when embedding similarity is low. The importance scoring adds latency and cost. The rubric requires adjustment for every new use case.
What is a memory scratchpad? (V2)
The core insight: maintain a small, curated semantic entry list and inject all of it into the system prompt every turn. No selective retrieval.
We also shifted the abstraction from "companions and conversations" to "relationships." A relationship is a user-companion pair, persistent across sessions and devices. Memory belongs to the relationship, not the companion.
Three state buckets:
- Profile: developer-owned, permanent
- Memory: scratchpad entries, permanent
- Session state: temporary, cleared on session end
Entry types are straightforward: identity, preference, goal, event, relationship, other. Each with content and timestamps.
After each turn, an async background worker feeds current entries plus recent messages to an LLM (Gemini 2.0 Flash by default). It returns JSON operations: ADD, UPDATE, or DELETE. This runs in the background and never blocks the user's response.
Retrieval strategy: load all entries, format as a bullet list, inject into the system prompt. Full visibility, no gating, no relevance scoring. Trades scalability for simplicity.
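Applying the worker's operations can be sketched like this. The op schema (`op`/`id`/`type`/`content` keys) and integer entry ids are assumptions for illustration, not the actual API:

```python
import json

def apply_operations(entries, ops_json):
    """Apply LLM-returned scratchpad operations (ADD / UPDATE / DELETE).

    `entries` maps entry id -> {"type": ..., "content": ...}. Returns a new
    dict; the input is left untouched so a failed batch can be retried.
    """
    entries = {k: dict(v) for k, v in entries.items()}  # deep-enough copy
    next_id = max(entries, default=0) + 1
    for op in json.loads(ops_json):
        if op["op"] == "ADD":
            entries[next_id] = {"type": op.get("type", "other"), "content": op["content"]}
            next_id += 1
        elif op["op"] == "UPDATE" and op["id"] in entries:
            entries[op["id"]]["content"] = op["content"]
        elif op["op"] == "DELETE":
            entries.pop(op["id"], None)
    return entries
```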
Every Conversation Turn
│
├───────────────────────────────────────┐
│ │
▼ ▼
┌─────────────┐ ┌──────────────────┐
│ Load full │ │ Async worker │
│ scratchpad │ │ (background) │
│ (cached 30s)│ │ │
└──────┬──────┘ │ Feeds turn to LLM│
│ │ │ │
▼ │ ▼ │
┌──────────────┐ │ ┌──────────────┐ │
│ Format as │ │ │ LLM returns │ │
│ bullet list │ │ │ operations: │ │
│ │ │ │ ADD / UPDATE│ │
│ Inject into │ │ │ / DELETE │ │
│ system prompt│ │ └──────┬───────┘ │
│ (all entries)│ │ │ │
└──────┬──────┘ │ ▼ │
│ │ ┌──────────────┐ │
▼ │ │ Apply ops │ │
┌──────────────┐ │ │ to DB │ │
│ LLM responds │ │ └──────────────┘ │
│ with full │ └──────────────────┘
│ memory │
│ visibility │
└──────────────┘
Scratchpad entries: [ identity | preference | goal | event | relationship | other ]
Sorted by: last modified (newest first)
Developers can override the ingestion prompt to control what gets stored, when, and how. Specify entry types with examples. Users can directly ADD/UPDATE/DELETE via API or UI.
Context assembly is layered: core prompt, behavior injections, memory context, knowledge, profile, session state, recent messages, current input. Each layer is independently pluggable, orchestrated in parallel.
V2 also includes a behavior system (priority behaviors before LLM, async behaviors after), auto-summarization at 200/400/600 message thresholds, and a config cascade (turn > relationship > companion).
V2 vs. V1 tradeoffs: V2 is simpler and gives the model full memory visibility; V1 is better for large memory stores. LLM-managed operations are more intuitive than importance rubrics. Scoping memory per-relationship fixes V1's per-companion mistake. V2 is weaker at scale (hundreds of entries burn tokens) but fits comfortably in typical context windows.
How does filesystem-based memory work? (V3)
Agent mode requires autonomous execution of complex tasks (research, tool use, multi-step workflows) in sandboxed environments. That requirement triggered V3.
The core concept: materialize all context as real files on a Modal Volume at /em/. Agents navigate with bash (ls, grep, cat). LLMs understand file operations natively. Benchmarks show these outperform specialized retrieval tools.
/em/
├── memory/ (hot_context.md, scratchpad.md)
├── knowledge/ (documents/)
├── profile/ (user.yaml)
├── workspace/ (AGENTS.md, outputs/)
├── tools/
├── .claude/skills/
├── .git/
└── .locks/
The key file is hot_context.md. It's an agent-curated relationship summary, roughly 500 words. User profile, recent context, preferences, tasks, facts. After each session, a curation step updates it. This replaces V1's rubric and V2's LLM entry management.
For real-time chat, hot_context syncs to a database cache (relationship_context_cache). Chat reads take ~1ms. The filesystem world and the real-time chat world stay connected through this bridge.
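The post-session sync can be sketched as a versioned write into the cache. The function name, the dict-backed cache, and the version-check policy are all stand-ins for illustration; the real store and conflict handling differ:

```python
from pathlib import Path

def sync_hot_context(volume_root, cache, relationship_id, loaded_version):
    """Push memory/hot_context.md from the volume into a cache row so chat
    reads stay ~1ms. `cache` is a stand-in dict keyed by relationship id.
    """
    text = (Path(volume_root) / "memory" / "hot_context.md").read_text()
    row = cache.get(relationship_id, {"version": 0, "content": ""})
    if row["version"] != loaded_version:
        # Someone else wrote since we hydrated: surface both for curation to merge.
        return {"status": "conflict", "theirs": row["content"], "ours": text}
    cache[relationship_id] = {"version": loaded_version + 1, "content": text}
    return {"status": "synced"}
```

The version captured at pre-hydrate is what makes post-sync conflict detection possible: a stale version number means another session wrote in between.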
┌──────────────────────────────────────────────────────────┐
│ Modal Volume /em/ │
│ │
│ memory/hot_context.md profile/user.yaml tools/ │
│ memory/scratchpad.md workspace/AGENTS.md .claude/ │
└────────────┬──────────────────────┬──────────────────────┘
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ Agent Mode │ │ Chat Mode │
│ │ │ │
│ Agent reads │ │ Reads from │
│ /em/ with │ │ DB cache │
│ bash (ls, │ │ (~1ms) │
│ grep, cat) │ │ │
│ │ │ └───────────────┘
│ ▼ │ ▲
│ Does work, │ │
│ updates files│ │
│ │ │ │
│ ▼ │ │
│ ┌───────────┐ │ ┌────────┴────────┐
│ │ Curation │ │ │ DB cache: │
│ │ step: │─┼────▶│ relationship_ │
│ │ update │ │ │ context_cache │
│ │ hot_ctx │ │ │ (sync on end) │
│ └───────────┘ │ └─────────────────┘
└───────────────┘
Pre-hydrate: DB → Volume (before sandbox)
Sandbox exec: Agent in /em/, tools via Gateway
Post-sync: Volume → DB (after sandbox, conflict detection)
Session lifecycle:
- Pre-hydrate: DB to Volume (load hot_context, profile, AGENTS.md, track versions)
- Execute in sandbox: agent in dedicated directory, no direct DB calls, tools via em-tool CLI
- Post-sync: Volume to DB with conflict detection
Concurrency uses git worktrees. Each session gets its own branch at /em/.worktrees/session-{id}/, merging back to main with conflict resolution. File-based locks prevent race conditions.
If both a chat session and an agent session update hot_context simultaneously, the curation step sees both versions and merges them naturally.
How do these memory approaches compare?
Here's how the three architectures compare across the dimensions that matter in production:
| | V1: pgvector | V2: Scratchpad | V3: Filesystem |
|---|---|---|---|
| Retrieval | Selective (similarity + importance) | Full injection every turn | Agent navigates with bash |
| Latency overhead | Moderate (embedding + re-rank) | None (already in prompt) | None (agent reads what it needs) |
| Best for | Large memory stores (1000+ facts) | Conversational products | Autonomous agent workflows |
| Scaling limit | Embedding quality | ~100 entries (token budget) | Disk space |
| Management | Importance rubric | LLM-managed (ADD/UPDATE/DELETE) | Agent-curated files |
| Blocks user response? | No (gated) | No (async) | No (post-session) |
| Concurrency | N/A | Per-relationship | Git worktrees per session |
What coexists in production?
These systems serve different purposes. They don't replace each other.
- Memory V1/V2: Personal user facts (preferences, goals, events). V1 for large selective stores, V2 for full-visibility scratchpads.
- Knowledge base: Document retrieval (classic RAG).
- Hot context (V3): Agent-curated relationship summary for fast chat access.
- Conversation summaries: Incremental summarization at thresholds (200/400/600 messages) for long relationships.
All ingestion is async, never blocking the user's response. Context assembly is layered and pluggable. Each source (prompt, memory, knowledge, tools, behaviors) runs independently in parallel.
What are the unsolved problems in AI agent memory?
Consolidation and forgetting. Scratchpad entries accumulate. No mechanism for merging related entries or controlled forgetting. Arguably the hardest unsolved challenge in agent memory. The framework is intentionally left extensible for developer-specific policies.
Fuzzy retrieval and temporal reasoning. Simulating imperfect memory ("that rings a bell but...") and temporal reasoning ("we talked about this previously, you might not have reflected yet"). Unimplemented.
Cross-relationship search. V3's per-relationship design protects privacy but prevents pattern discovery across relationships at the companion level.
Checkpoint restore. Modal's snapshot_filesystem() doesn't capture mounted volumes, preventing session restoration to previous states. Git or S3 tarballs are possible but suboptimal.
Which architecture should you use?
For agent mode, the filesystem approach is the correct abstraction. Models understand files natively, developer experience is intuitive, and agents load what they need.
For most companions, V2 is optimal. LLM-managed scratchpad, developer-customizable ingestion prompts, fast, transparent. Works for coaching bots, customer support, tutoring, any regular-returning-user product. No sandbox needed.
V3's hot_context bridges agent and chat modes. Agents curate after sessions. Chat reads from database cache in ~1ms. Combines the rich agent-mode filesystem world with real-time chat latency requirements.
The hardest problem remains: what should a companion remember? What should it surface, and when? Ingestion prompt customization (V2) and AGENTS.md (V3) are our current answers, but there's a lot more to figure out here.
References
- Packer et al. (2023). MemGPT: Towards LLMs as Operating Systems
- Xu et al. Everything is Context: Agentic File System Abstraction for Context Engineering
- Anthropic. Effective Context Engineering for AI Agents (Sep 2025)
- Letta (formerly MemGPT)
- OpenAI Vector Stores API
We're Emotion Machine. We help AI connect with people, across voice, memory, and phone agents.
The memory system described here powers Personality Machine, our infrastructure for AI companions that maintain consistent identity across conversations. You create companions in a builder, deploy them to a custom URL, and debug conversations in a dashboard. Memory, personality, and relationship state are handled for you.
If you're working on something in this space, reach out: hello@emotionmachine.ai