I spent 200 days testing 15 AI companion apps. Paid for every subscription myself. Talked to each one for weeks, sometimes months.
The single biggest thing separating the good ones from the bad ones? Memory.
Not "memory" as in context window size. Actual, persistent, cross-session memory. The kind where you tell your AI companion something in week one and it still knows it in week eight.
Most apps get this wrong. I wrote about the failure patterns elsewhere, but the short version is: apps confuse buffer size with memory, they reset between sessions, or they remember the wrong things.
What I haven't written about is what I'd actually build if I were starting from scratch. So here goes.
The problem in one sentence
LLMs don't remember anything. Every API call is a blank slate. Your entire "memory system" is whatever you manage to stuff into the prompt before the model sees it.
That's it. That's the whole constraint.
The architecture I'd build
Three layers. They each handle a different piece of the puzzle.
Layer 1: The conversation log (raw storage)
Every message goes into a database. User messages, AI responses, timestamps, session IDs. All of it. Never delete, never summarize in place.
This sounds obvious but most apps skip it. They rely on the context window as their only storage, and when the window fills up, old messages just fall off the edge. Gone.
```sql
-- Postgres ENUMs need a named type; everything else is stock DDL.
CREATE TYPE msg_role AS ENUM ('user', 'assistant');

CREATE TABLE conversations (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID NOT NULL,
    user_id    UUID NOT NULL,
    role       msg_role NOT NULL,
    content    TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    metadata   JSONB  -- mood, topic, personal fact flags
);
```
The metadata column matters more than you'd think. If you tag messages at write time (topic, emotional tone, whether a personal fact was shared), retrieval gets way cheaper later.
I'd use Postgres for this. You don't need anything exotic. A companion app with 10,000 daily active users generating 50 messages each is ~500K rows per day. Postgres handles that without breaking a sweat.
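To make the write path concrete, here's a minimal sketch of logging a tagged message. It uses sqlite as an in-memory stand-in for Postgres, and `tag_message` is a placeholder keyword heuristic where a real app would run a classifier:

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Stand-in for the Postgres table; real types would be UUID/JSONB/ENUM.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE conversations (
        id TEXT PRIMARY KEY,
        session_id TEXT,
        user_id TEXT,
        role TEXT CHECK (role IN ('user', 'assistant')),
        content TEXT,
        created_at TEXT,
        metadata TEXT  -- JSON blob: mood, topic, personal fact flags
    )
""")

def tag_message(content):
    # Placeholder for a real write-time classifier.
    lowered = content.lower()
    return {
        "contains_personal_fact": any(w in lowered for w in ("my ", "i'm ", "i am ")),
        "topic": "unknown",
    }

def log_message(session_id, user_id, role, content):
    db.execute(
        "INSERT INTO conversations VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            str(uuid.uuid4()),
            session_id,
            user_id,
            role,
            content,
            datetime.now(timezone.utc).isoformat(),
            json.dumps(tag_message(content)),
        ),
    )

log_message("s1", "u1", "user", "My dog's name is Biscuit")
row = db.execute("SELECT content, metadata FROM conversations").fetchone()
```

The point is that tagging happens in the same transaction as the write, so every row arrives in Layer 1 already annotated.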
Layer 2: The memory index (compressed knowledge)
Raw conversation logs are too large to stuff into a prompt. You need a compressed version, a living document that captures what the AI "knows" about the user.
Two parts to this:
A. Fact store. Structured key-value pairs extracted from conversations.
```json
{
  "user_facts": {
    "name": "Alex",
    "pet": {"type": "dog", "name": "Biscuit", "breed": "corgi"},
    "job": "frontend engineer",
    "mood_pattern": "vents about work on Mondays",
    "important_dates": {"birthday": "March 12"},
    "preferences": {"hates": ["small talk"], "likes": ["space documentaries"]}
  },
  "last_updated": "2026-04-10T14:30:00Z",
  "extraction_count": 847
}
```
Run a small, cheap model (Haiku-tier) after every N messages to extract new facts. Merge them into the existing store. Costs almost nothing and runs async.
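The extraction call itself is just a prompt to a cheap model, so the interesting part is the merge. A sketch of that step, assuming extraction returns a partial dict in the same shape as the store:

```python
def merge_facts(existing, new_facts):
    """Recursively fold newly extracted facts into the store.
    New values win on conflict; nested dicts merge key by key
    so a new 'breed' doesn't clobber the rest of 'pet'."""
    merged = dict(existing)
    for key, value in new_facts.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_facts(merged[key], value)
        else:
            merged[key] = value
    return merged

store = {"name": "Alex", "pet": {"type": "dog", "name": "Biscuit"}}
extracted = {"pet": {"breed": "corgi"}, "job": "frontend engineer"}
store = merge_facts(store, extracted)
```

Deep-merging matters: a naive `dict.update` would replace the whole `pet` object and silently drop the dog's name.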
B. Relationship summary. A paragraph-length narrative of the relationship so far. Updated less frequently, maybe every 50-100 messages.
```
Alex is a frontend engineer who adopted a corgi named Biscuit in January.
They tend to vent about their manager on Monday evenings. They've been
talking about switching jobs since February but haven't applied anywhere
yet. Last week they mentioned feeling burned out. Tone is usually casual
and sarcastic. They don't like when responses are too earnest.
```
This gets injected into the system prompt. It's the AI's "sense of who you are." Without it, every session feels like talking to a stranger.
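The refresh itself is one model call every 50-100 messages. A sketch of the prompt builder, with the word limit and instructions as my assumptions rather than anything canonical:

```python
def build_summary_prompt(old_summary, recent_messages):
    """Prompt for the periodic summary refresh. The model rewrites
    the narrative instead of appending to it, so the summary stays
    paragraph-length no matter how long the relationship runs."""
    transcript = "\n".join(
        f"{m['role']}: {m['content']}" for m in recent_messages
    )
    return (
        "You maintain a running summary of the relationship between an AI "
        "companion and a user. Rewrite the summary to fold in the new "
        "messages. Keep it under 120 words. Preserve long-term facts and "
        "emotional arcs; drop small talk.\n\n"
        f"Current summary:\n{old_summary}\n\n"
        f"New messages:\n{transcript}\n\n"
        "Updated summary:"
    )
```

The rewrite-not-append instruction is the design choice doing the work here: appending grows without bound, while rewriting forces the model to decide what still matters.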
Layer 3: The retrieval engine (dynamic recall)
This is the piece most apps completely miss. Layers 1 and 2 give you persistent storage and compressed knowledge. But they don't help when a user says "remember that restaurant I mentioned last month?"
For that you need retrieval. Here's how I'd wire it up:
- Embed every message at write time (store the vector alongside the raw text in Layer 1)
- When a new message comes in, embed it and do a similarity search against the full conversation history
- Inject the top-K relevant messages into the prompt along with the fact store and relationship summary
```python
import json

def build_prompt(user_message, user_id):
    # Layer 2: compressed knowledge
    facts = get_fact_store(user_id)  # the user_facts dict from Layer 2
    summary = get_relationship_summary(user_id)

    # Layer 3: dynamic retrieval
    query_embedding = embed(user_message)
    relevant_history = vector_search(
        user_id=user_id,
        embedding=query_embedding,
        top_k=10,
        min_similarity=0.7,
    )

    # Assemble the system prompt
    system_prompt = f"""You are talking to {facts['name']}.

What you know about them:
{json.dumps(facts, indent=2)}

Your relationship so far:
{summary}

Relevant past conversations:
{format_messages(relevant_history)}
"""
    return system_prompt
```
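The `vector_search` call above would be a pgvector query in production. As a toy stand-in, here's the same logic over an in-memory list, with hand-made 3-vectors instead of real embedding output:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(store, query_embedding, top_k=10, min_similarity=0.7):
    """store: list of (message, embedding) pairs, written at log time."""
    scored = [
        (msg, cosine_similarity(query_embedding, emb)) for msg, emb in store
    ]
    hits = [(m, s) for m, s in scored if s >= min_similarity]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [m for m, _ in hits[:top_k]]

store = [
    ("that thai place downtown", [0.9, 0.1, 0.0]),
    ("my manager again", [0.0, 0.1, 0.9]),
]
# A query vector near the restaurant message retrieves only that message;
# the similarity floor keeps the unrelated one out of the prompt.
vector_search(store, [1.0, 0.0, 0.0])
```

The `min_similarity` floor is as important as `top_k`: without it, a query with no good matches still drags the ten least-bad messages into the prompt.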
This is basically RAG applied to personal conversation history. The point I keep coming back to: you need BOTH the compressed summary (Layer 2) AND the raw retrieval (Layer 3). I've tested apps that do only one or the other, and both feel broken.
Summary only? The AI "knows" you have a dog named Biscuit but can't recall the specific conversation where you told the funny vet story. Feels hollow.
Retrieval only? The AI can surface old conversations but has no coherent sense of who you are. No arc. No continuity. Just scattered fragments.
The part nobody talks about: salience
Which memories matter? You can't retrieve everything. You can't summarize everything. So you need some kind of policy for what to keep, what to surface, and what to let fade.
The apps that do this well (Nomi AI and Kindroid, from my testing) seem to weight three signals:
- Emotional intensity. Messages where the user expressed strong feelings get higher salience.
- Personal facts. "My dog's name is Biscuit" matters more than "I had pasta for lunch."
- Recency with decay. Recent messages matter more, but high-salience old messages don't fully decay.
I'd implement this as a scoring function on retrieval:
```python
def salience_score(message, query_similarity, age_days):
    base = query_similarity  # 0.0 to 1.0

    # Boost emotional content
    if message.metadata.get('emotional_intensity', 0) > 0.7:
        base *= 1.4

    # Boost personal facts
    if message.metadata.get('contains_personal_fact'):
        base *= 1.3

    # Time decay (half-life of 30 days)
    decay = 0.5 ** (age_days / 30)
    return base * (0.3 + 0.7 * decay)  # floor at 30% to never fully forget
```
That 30% floor matters a lot. Without it, a user's birthday from three months ago decays to near-zero, and the app just forgets it. The floor keeps important stuff around even when it's old.
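To make the floor concrete, here's the decay multiplier at a few ages:

```python
# Decay multiplier at various ages, with and without the 30% floor
for age_days in (0, 30, 90, 365):
    decay = 0.5 ** (age_days / 30)  # half-life of 30 days
    floored = 0.3 + 0.7 * decay     # what the scoring function applies
    print(f"{age_days:>3}d  raw={decay:.3f}  floored={floored:.3f}")
```

At 90 days the raw decay is 0.125, but the floored multiplier is still about 0.39; at a year, raw decay is effectively zero while the floored version holds at 0.30.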
What this costs to run
Rough numbers for a 10K DAU app:
| Component | Service | Monthly cost |
|---|---|---|
| Conversation storage | Postgres (managed) | ~$50 |
| Vector embeddings | OpenAI embeddings API | ~$30 |
| Vector search | pgvector extension | $0 (same Postgres instance) |
| Fact extraction | Haiku-tier model, async | ~$80 |
| Summary updates | Sonnet-tier model, batched | ~$40 |
Total: ~$200/month for 10K DAU. That's $0.02 per user per month for a memory system that actually works.
Most apps don't skip this because of cost. They skip it because the context window approach ships faster, and nobody realizes it's broken until users start complaining that the AI keeps "forgetting" them.
The test I'd run
Before shipping, I'd run this benchmark on my own system:
- Simulate 1,000 messages over 20 sessions with a test user
- Plant specific personal facts at message 50, 200, 500, and 900
- Close all sessions
- Open a new session and ask about each planted fact
- Score: did the system recall correctly, hallucinate, or fail silently?
If it can't pass that test, it's not a memory system. It's a cache with extra steps.
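A sketch of the scoring half of that harness. The simulation driver and the app under test are omitted, and the planted fact values are invented for illustration:

```python
# Facts planted at specific message indices during the simulation
# (values are hypothetical examples, not from any real test run)
PLANTED_FACTS = {
    50: ("sister's name", "Dana"),
    200: ("favorite restaurant", "Lupa"),
    500: ("allergy", "peanuts"),
    900: ("upcoming trip", "Lisbon"),
}

HEDGES = ("i don't remember", "not sure", "remind me")

def score_recall(fact_value, response):
    """Classify one probe: correct recall, honest forgetting,
    or a confident answer that never mentions the planted fact."""
    text = response.lower()
    if fact_value.lower() in text:
        return "recalled"
    if any(h in text for h in HEDGES):
        return "admitted_forgetting"
    return "hallucinated"
```

Separating "admitted forgetting" from "hallucinated" matters: an AI that says "remind me?" is annoying, but one that confidently invents your sister's name is far worse.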
So why don't more apps do this?
The gap between "AI companion that remembers you" and "chatbot you have to re-introduce yourself to every week" is maybe 500 lines of infrastructure code and $0.02/user/month. Postgres, embeddings, a retrieval function. None of this is new.
The real problem is organizational. The context window approach ships in a weekend. Building proper memory takes a few weeks. And product teams usually don't prioritize it until users are already churning.
I test AI companion apps and write about what I find at AI Companion Picker. If you're building in this space, I'm always down to compare notes.