When Your Chatbot Finally Stops Asking Who You Are
The first time a user complained that KAIRO kept forgetting their name between sessions, I brushed it off. The second time, I told myself it was a known limitation of stateless LLM calls. By the fifth time, across five different users in the same week, I stopped pretending it was acceptable.
I'd built KAIRO as a personality-switching conversational assistant — a Streamlit application layered over LangChain and Ollama that could shift between Professional, Friendly, Funny, and Technical personas depending on what the user needed at any given moment. The core idea was simple: same underlying model, radically different behavior based on the system prompt injected at runtime. It worked. Users liked it. But every session was a blank slate. KAIRO had no idea who it was talking to, what they'd discussed last Tuesday, or whether the user had corrected it three times for using jargon they didn't understand. That's a chatbot, not an assistant. There's a difference.
How KAIRO is Structured
The architecture of KAIRO is deliberately minimal. On the frontend, Streamlit handles the UI — personality selector in the sidebar, chat input at the bottom, conversation history rendered with role indicators above it. LangChain handles the chain composition: a ChatPromptTemplate takes the selected personality's system message and injects it alongside the user's query, the Ollama LLM wrapper sends that to the local model, and StrOutputParser converts the response back into plain text.
from langchain_core.prompts import ChatPromptTemplate

personality_prompts = {
    "Professional": "You are KAIRO, a professional, polite, and formal assistant.",
    "Friendly": "You are KAIRO, a friendly, casual, and warm assistant.",
    "Funny": "You are KAIRO, a humorous assistant, always adding light jokes.",
    "Technical": "You are KAIRO, a highly technical, precise, and detailed assistant.",
}

prompt = ChatPromptTemplate.from_messages([
    ("system", personality_prompts[personality]),
    ("user", "{query}"),
])

chain = prompt | llm | output_parser
The chain is re-instantiated on every interaction with the selected personality. That's the flexibility point — KAIRO can be whoever the user needs right now. But notice what's missing: there's no user_id, no retrieval call, no persistent context injected between that system prompt and the user's query. The model starts cold every time. It knows nothing about the person in front of it, and more importantly, it learns nothing from them.
Session state in Streamlit is ephemeral by nature:
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
This tracks messages within a single browser tab session. The moment a user refreshes or returns the next day, it's gone. You could serialize this to disk or a database, but that only solves history — it doesn't solve learning. Replaying 200 previous messages into every context window is expensive and noisy. It doesn't help the agent understand that this particular user prefers terse answers, works in finance, and gets frustrated when KAIRO explains what an API is.
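To make that limitation concrete, here's roughly what disk persistence would look like. The helper names and storage path are my own for illustration, not part of KAIRO:

```python
import json
from pathlib import Path

HISTORY_DIR = Path("chat_histories")  # hypothetical storage location


def save_history(user_id: str, chat_history: list[dict]) -> None:
    """Serialize one user's message list to a per-user JSON file."""
    HISTORY_DIR.mkdir(exist_ok=True)
    (HISTORY_DIR / f"{user_id}.json").write_text(json.dumps(chat_history))


def load_history(user_id: str) -> list[dict]:
    """Restore the message list, or start fresh for an unknown user."""
    path = HISTORY_DIR / f"{user_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    return []
```

This survives a refresh, but it only gives you a transcript to replay — exactly the expensive, noisy approach described above, with none of the distillation.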
The Memory Problem Isn't Storage, It's Distillation
I knew I needed agent memory — not a log, not a chat buffer, but a system that could distill raw interactions into reusable facts and behavioral signals over time. There's a real difference between "store everything and search it" and "understand what matters and surface it at the right moment."
I started reading about the approaches: naive vector databases over raw chat history, knowledge graphs, RAG over session transcripts. Each has real problems in this context. Semantic search over raw messages doesn't capture the evolution of preferences — if a user corrected the assistant twice in February and once in March, retrieving those three individual messages doesn't tell the agent what it should have learned from them. Knowledge graphs are powerful but operationally painful to maintain at scale. And full RAG over chat history burns tokens and returns irrelevant context far too often.
After some research, I decided to try Hindsight for agent memory. It positioned itself differently from the other options I'd looked at: instead of just recalling raw memories, it focused on agents that genuinely learn over time. The distinction mattered.
Integrating Hindsight into the Chain
The Hindsight agent memory system operates on three primitives: retain, recall, and reflect. You push information in with retain, retrieve semantically relevant memories with recall, and synthesize understanding with reflect. The architecture underneath uses a combination of vector similarity, BM25 keyword matching, entity/temporal graph links, and a cross-encoder reranking step — but from the application side, that's abstracted away.
Getting it running locally took about ten minutes with Docker:
export OPENAI_API_KEY=sk-xxx
docker run --rm -it --pull always -p 8888:8888 -p 9999:9999 \
  -e HINDSIGHT_API_LLM_API_KEY=$OPENAI_API_KEY \
  -v $HOME/.hindsight-docker:/home/hindsight/.pg0 \
  ghcr.io/vectorize-io/hindsight:latest
After that, integrating into KAIRO required adding a retain call after each exchange and a recall call before the chain executes. I scoped memory to individual users via Hindsight's bank_id — one bank per user, keyed by a stable user identifier carried through the session. This is exactly the per-user personalization pattern the library is designed for.
from hindsight_client import Hindsight

hindsight = Hindsight(base_url="http://localhost:8888")

# Before generating a response — retrieve relevant context
memories = hindsight.recall(
    bank_id=f"user-{user_id}",
    query=input_txt,
)

# Inject memories into the prompt context
enriched_prompt = ChatPromptTemplate.from_messages([
    ("system", personality_prompts[personality] + memory_context(memories)),
    ("user", "{query}"),
])

# After generating a response — store this exchange
hindsight.retain(
    bank_id=f"user-{user_id}",
    content=f"User asked: {input_txt}. KAIRO responded: {response}",
)
The memory_context() function is a small helper that formats retrieved memories into a concise context block appended to the system prompt. It tells KAIRO what it should already know about this person before saying a word.
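For reference, that helper is just string formatting. Here's a minimal sketch — the exact shape of a recalled memory object is an assumption on my part (I'm treating each result as a dict with a text field; adjust the access to match the actual recall response):

```python
def memory_context(memories) -> str:
    """Format recalled memories into a context block for the system prompt.

    Assumes each memory exposes plain text (dict with a "text" key here);
    the real recall response shape may differ.
    """
    if not memories:
        return ""
    lines = [m["text"] if isinstance(m, dict) else str(m) for m in memories]
    bullets = "\n".join(f"- {line}" for line in lines)
    return "\n\nWhat you already know about this user:\n" + bullets
```

Returning an empty string when there's nothing to recall means a brand-new user gets exactly the original, memory-free prompt.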
What happens inside Hindsight when you call retain is worth understanding. It uses an LLM to extract entities, facts, relationships, and temporal signals from the raw text. Those get normalized and indexed across multiple representations — dense vectors, sparse vectors, entity links. When you later call recall, it runs four retrieval strategies in parallel and merges the results using reciprocal rank fusion before a final reranking pass. The output is relevant context that's been earned through actual signal, not just cosine similarity to the query string.
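Reciprocal rank fusion itself is simple to state: each retrieval strategy contributes 1/(k + rank) for every result it returns, and the per-document sums determine the merged order. A standalone sketch of the idea (my own illustration, not Hindsight's internal code):

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one fused ranking.

    Each input list is ordered best-first; a document's fused score is the
    sum of 1 / (k + rank) over every list it appears in. k=60 is the
    conventional default from the original RRF formulation.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that shows up in the vector results, the BM25 results, and the entity-graph results accumulates score from all three lists, which is why it can outrank something that was merely the top cosine-similarity hit in one of them.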
What Changes in Practice
The behavioral difference shows up immediately in a few specific scenarios that used to generate complaints.
The most obvious: returning users. Before Hindsight, every session opened with KAIRO at zero. After integrating memory, when a user who'd spent three sessions asking about algorithmic trading came back with "what did we cover last time?", KAIRO recalled that they'd discussed order book mechanics, that they preferred the Technical persona, and that they'd asked twice about latency in order execution. That's not magic — it's just memory being used correctly.
The more interesting case is behavioral adaptation. When a user consistently rephrases KAIRO's responses into simpler language, or keeps asking follow-up questions of the form "can you explain that more simply?", those signals get retained. Over time, the memory bank reflects a picture of what works for this person. The reflect operation makes this explicit — it synthesizes across multiple retained memories to generate observations about patterns:
insight = hindsight.reflect(
    bank_id=f"user-{user_id}",
    query="What communication style works best for this user?",
)
I ran this after a week of real usage on a handful of test users. For one user who worked in product management, the reflection came back noting that this person preferred concrete examples over abstractions, avoided finance-specific terminology, and consistently engaged more with responses under 150 words. None of that was explicitly stated — it was inferred from the pattern of exchanges retained over time. The agent built a working model of that user and can act on it.
Lessons Learned
Memory and personality are orthogonal concerns, and that's a feature. The personality system in KAIRO operates at the system prompt level — it shapes voice and tone. Memory operates at the context level — it shapes what the agent knows. Keeping these separate means you can have a user whose memory bank says they prefer terse, direct answers, and still respect their explicit request to talk to the Friendly persona. The layers don't collapse into each other.
Stateless is not the same as simple. The original KAIRO architecture looked clean precisely because it was stateless. No persistence layer, no retrieval step, no async memory operations. Adding memory introduces real complexity: you need to handle memory retrieval latency, decide what to retain and when, and manage the case where retrieved memories are confidently wrong or outdated. None of this is free. Budget time for it.
The reflect operation is where things get interesting. retain and recall handle the mechanics of storage and retrieval. reflect is what makes the system feel like it's doing something more than database lookups. Asking it to synthesize a working model of user communication preferences, and getting back structured insight from that, is where the "learning" part of agent memory becomes tangible rather than theoretical.
LangChain chains are easy to enrich — the injection point matters. Injecting Hindsight memories into the system message rather than the user message kept the recall context in the right position in the conversation structure. Putting it in the user turn confused the model's role separation and degraded response quality noticeably. Small implementation detail with outsized effect.
Per-user memory scoping is not optional in multi-user applications. Hindsight's bank_id parameter makes this straightforward, but it requires discipline in your session management. If you're sloppy about user identification, memories bleed across users and you get an assistant that confidently tells the wrong person what they prefer. Build the user identity layer before you build the memory layer.
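One pattern that enforces this discipline (my own convention, not a Hindsight requirement) is deriving the bank_id deterministically from an authenticated identifier, so the same person always lands in the same bank and raw emails never leak into bank names:

```python
import hashlib


def bank_id_for(account_email: str) -> str:
    """Map an authenticated identity to a stable, opaque memory bank id.

    Normalizing before hashing means casing and stray whitespace in the
    login can't silently split one user across two banks.
    """
    digest = hashlib.sha256(account_email.strip().lower().encode()).hexdigest()
    return f"user-{digest[:16]}"
```

The important property is determinism across sessions: a refreshed browser tab or a new device still resolves to the same bank, and two different users never can.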
KAIRO is still a relatively focused tool — a multi-persona assistant that now actually learns from the people who use it. The personality switching remains useful. But the memory layer is what makes repeat usage feel coherent rather than disjointed. When an assistant knows who you are and adjusts based on what's worked before, it stops feeling like a query/response interface and starts behaving like something with a working model of you. That's the shift worth building for.