I replaced my scratchpad with Hindsight and it worked.
My chatbot had no idea who it was talking to. Every session started cold. Every conversation was disposable. That was the problem I set out to fix when I built Kairo-AI — and the solution ended up being less about the model and more about how memory is structured around it.
What Kairo-AI Does and How It Hangs Together
Kairo-AI is a conversational assistant built on Streamlit, LangChain, and Ollama. On the surface it looks simple: you pick a personality mode — Professional, Friendly, Funny, or Technical — type a message, and get a response. But the interesting engineering is in what happens between sessions and across users, which is where the naive version completely falls apart and where the real work started.
The stack is deliberately local-first. Ollama runs the Llama model on-device, which means no API latency on the inference path and no per-token billing surprise at the end of the month. LangChain's ChatPromptTemplate wires the personality system prompt to each query, and StrOutputParser strips the response back to clean text before it hits the Streamlit layer.
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", personality_prompts[personality]),
    ("user", "{query}")
])
llm = Ollama(model="llama2")
output_parser = StrOutputParser()
chain = prompt | llm | output_parser
The pipe operator here is doing more than it appears to: it's composing a lazily evaluated chain. Nothing runs until .invoke() is called with actual input. That means the personality swap in the sidebar isn't re-instantiating any objects; it just feeds a different string into the same template at call time. Clean.
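The call-time composition idea can be seen without any of LangChain's machinery. The sketch below is a toy stand-in, not LangChain's actual Runnable implementation: each stage is a callable, | builds the pipeline, and nothing executes until .invoke() receives real input.

```python
# Toy stand-in for lazy pipe composition (illustrative only,
# not LangChain's Runnable machinery).
class Stage:
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Composing produces a new Stage; neither side runs yet.
        return Stage(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

personality_prompts = {"Technical": "You are KAIRO, a highly technical assistant."}

prompt = Stage(lambda q: [("system", personality_prompts["Technical"]), ("user", q)])
llm = Stage(lambda msgs: f"<model output for {len(msgs)} messages>")
parse = Stage(lambda text: text.strip())

chain = prompt | llm | parse                      # nothing has run yet
result = chain.invoke("How do goroutines work?")  # execution happens here
```

Because the pipeline is just nested callables, swapping personalities is a matter of feeding a different string at invoke time, not rebuilding the chain.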
But the session state management exposed the first real gap. Streamlit's st.session_state gives you in-memory persistence for the lifetime of a browser session. Close the tab, lose the history. Refresh, lose the history. This is fine for a demo. It is not fine for a system where the assistant is supposed to learn anything about its users over time.
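The failure mode is easy to reproduce without a browser. The snippet below emulates st.session_state semantics with a plain dict scoped to one session (a simplification of Streamlit's behavior, not its actual implementation): a refresh hands you a fresh, empty state, so any history kept only there is gone.

```python
# Emulating st.session_state's lifetime: one dict per browser session.
def fresh_session_state():
    return {}  # what a brand-new session starts with

# Session one: the user builds up chat history.
state = fresh_session_state()
state.setdefault("history", []).append({"role": "user", "content": "I write Go"})

# The user refreshes the tab: Streamlit starts a fresh session.
state = fresh_session_state()
history_survived = "history" in state  # False: the history is gone
```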
The Core Technical Problem: Context Without Memory Is Theater
Here's what the early version did in practice. A user would tell the assistant they're a backend engineer working in Go, ask a few questions about goroutines, get useful answers, and then come back the next day. Day two: the assistant had no idea about Go, no idea about the engineer, no idea the conversation had ever happened. The "personality" system was cosmetic — Professional vs Friendly changes the tone, not the knowledge. The assistant was performing memory, not having it.
I knew I needed agent memory that persisted across sessions and could surface relevant context without me manually reconstructing conversation history in the prompt.
The usual approach here is to dump conversation logs into a vector store and retrieve them at query time. This works up to a point. The problem is that raw conversation text is noisy. A user mentioning "I hate Python" in the context of a joke doesn't mean Python should be excluded from every recommendation forever. What you actually want is derived knowledge — distilled facts and preferences extracted from conversation, weighted by recency and reliability.
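To make the "weighted by recency and reliability" idea concrete, here is one illustrative way to rank derived facts. The formula and weights are hypothetical, mine rather than Hindsight's actual algorithm: relevance times reliability, decayed exponentially by age.

```python
import math

def score_fact(relevance, age_days, reliability, half_life_days=30.0):
    """Hypothetical scoring: relevance * reliability, halved every
    half_life_days. Not Hindsight's real algorithm."""
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return relevance * reliability * recency

facts = [
    {"text": "uses Go professionally",  "relevance": 0.9, "age_days": 2,  "reliability": 0.95},
    {"text": "'I hate Python' (a joke)", "relevance": 0.8, "age_days": 40, "reliability": 0.3},
]
ranked = sorted(
    facts,
    key=lambda f: score_fact(f["relevance"], f["age_days"], f["reliability"]),
    reverse=True,
)
```

Under any scheme like this, the offhand joke decays and discounts while the durable professional fact stays near the top, which is exactly the behavior raw log retrieval can't give you.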
That's where Hindsight came in.
Integrating Hindsight: From Logs to Learned Context
Hindsight agent memory works differently from a simple RAG pipeline over chat history. Instead of storing raw message text and hoping the retriever finds the right chunk, Hindsight processes conversations and extracts structured facts — things the agent should remember about the user, their preferences, their context. It maintains those facts over time and handles conflicts: if a user's stated preferences contradict each other across sessions, Hindsight reconciles rather than concatenates.
The integration point in Kairo-AI is at the chain boundary. Before the ChatPromptTemplate is assembled, a memory fetch happens:
import os
from hindsight import HindsightClient

hindsight = HindsightClient(api_key=os.environ["HINDSIGHT_API_KEY"])

def build_prompt_with_memory(user_id: str, query: str, personality: str) -> str:
    memory_context = hindsight.recall(user_id=user_id, query=query, top_k=5)
    system_prompt = personality_prompts[personality]
    if memory_context:
        system_prompt += f"\n\nWhat you know about this user:\n{memory_context}"
    return system_prompt
And after each response is generated, the exchange gets committed back:
def commit_exchange(user_id: str, query: str, response: str):
    hindsight.remember(
        user_id=user_id,
        messages=[
            {"role": "user", "content": query},
            {"role": "assistant", "content": response}
        ]
    )
Two things stand out here that took iteration to get right. First, top_k=5 matters. Retrieving too much context bloats the prompt and degrades response quality — the model starts trying to reconcile every recalled fact rather than answering the question. Five facts is usually enough to personalize the response without overwhelming the context window. Second, recall is query-aware, not user-aware. It doesn't dump everything Hindsight knows about the user; it surfaces the facts most relevant to the current question. That's the part that raw session logs can't replicate without significant engineering.
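Putting the two halves together, the per-message cycle looks roughly like this. The Hindsight client and the LangChain chain are injected as plain callables so the flow can be read and run without Ollama or an API key; in Kairo-AI the real arguments would be hindsight.recall, chain.invoke, and hindsight.remember.

```python
personality_prompts = {
    "Technical": "You are KAIRO, a highly technical, precise, and detailed assistant.",
}

def handle_message(user_id, query, personality, recall, generate, remember):
    """One request cycle: recall -> prompt assembly -> generate -> remember."""
    facts = recall(user_id=user_id, query=query, top_k=5)
    system_prompt = personality_prompts[personality]
    if facts:
        system_prompt += "\n\nWhat you know about this user:\n" + "\n".join(facts)
    response = generate(system_prompt, query)
    remember(user_id=user_id, messages=[
        {"role": "user", "content": query},
        {"role": "assistant", "content": response},
    ])
    return response
```

Keeping the memory calls at the boundary of the chain, rather than inside it, is what lets the personality system and the memory system stay independent.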
Personality + Memory: The Interaction That Actually Matters
The personality system and the memory system are independent by design, but they interact in a way that wasn't obvious upfront. A user who prefers the "Technical" personality is implicitly signaling something about how they want to receive information. Hindsight captures that preference as a learned fact — not because the system was explicitly programmed to detect it, but because consistent personality selection correlates with the kinds of responses the user validates through continued engagement.
By session three or four with an active user, the recalled context starts including entries like "prefers technical depth over simplicity, responds well to code examples, uses Go professionally." The assistant doesn't need to ask. It doesn't need the user to re-select their personality every time. The memory has already tuned the context.
Here's what that looks like in a concrete interaction:
Session 1:
User: "How do I handle errors in Go?"
Kairo: [Technical mode, generic goroutine error explanation]
Session 4:
User: "What's the right pattern for retrying failed API calls?"
Kairo: [Technical mode, with recalled context about Go preference] "Given you're in Go, here's an exponential backoff pattern using context.Context for cancellation..."
The model didn't get smarter. The context got better. That distinction matters for how you think about improving these systems — you're not always chasing a better model. Sometimes you're chasing better memory architecture.
What the Personality System Gets Right
The four-mode personality system — Professional, Friendly, Funny, Technical — is simpler than most production assistants end up using, but the simplicity is a feature. Users don't want to configure their assistant; they want to signal intent quickly and get an appropriate response. A dropdown with four options is a three-second decision. A multi-axis preference matrix is a five-minute setup nobody completes.
personality_prompts = {
    "Professional": "You are KAIRO, a professional, polite, and formal assistant.",
    "Friendly": "You are KAIRO, a friendly, casual, and warm assistant.",
    "Funny": "You are KAIRO, a humorous assistant, always adding light jokes.",
    "Technical": "You are KAIRO, a highly technical, precise, and detailed assistant.",
}
These are short. Short system prompts leave more of the context window for memory and for the actual query. I tried longer, more elaborately specified personality prompts in earlier iterations, and the responses were more "in character" but less accurate. The model was allocating attention to style compliance at the expense of factual correctness. Four sentences is enough.
Lessons Learned
1. Session state is not memory. Stop pretending it is. Streamlit's st.session_state is a convenience, not an architecture. If you're building anything that users are expected to return to, you need persistent memory from day one. Retrofitting it is painful.
2. Query-aware recall beats full-user-dump. Loading everything known about a user into the prompt sounds thorough. It's actually counterproductive. Relevant retrieval — surface the facts germane to this question — produces better responses than comprehensive retrieval.
3. Short system prompts outperform elaborate ones. Every token you spend on stylistic instruction is a token not spent on context. Keep personality directives concise and let memory carry the personalization weight.
4. The local model creates real constraints you have to engineer around. Running Llama locally through Ollama means no API latency, but it also means a smaller context window than you'd get from a hosted frontier model. Memory systems that are selective about what gets injected into the prompt matter more, not less, in this environment.
5. Hindsight handles the hard part. Conflict resolution across sessions, recency weighting, relevance scoring — these are genuinely hard problems if you build them yourself. Offloading them to a purpose-built agent memory layer let me focus on the application logic rather than the memory infrastructure. I'd wasted two weeks building a naive vector-log retriever before switching. The two weeks weren't wasted — I understand the problem better for having attempted it — but the switch was the right call.
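The selectivity that lesson 4 calls for can be sketched as a simple budget trim over the ranked facts before they're appended to the system prompt. The function name and the words-times-1.3 token estimate are mine, not Kairo-AI's; a real implementation would count tokens with the model's own tokenizer.

```python
def fit_to_budget(facts, max_tokens=300, tokens_per_word=1.3):
    """Greedily keep the highest-ranked facts until an estimated token
    budget is spent. The per-word token estimate is a rough heuristic."""
    kept, used = [], 0.0
    for fact in facts:  # facts assumed pre-sorted, best first
        cost = len(fact.split()) * tokens_per_word
        if used + cost > max_tokens:
            break
        kept.append(fact)
        used += cost
    return kept
```

On a small local context window, a cap like this is the difference between memory that personalizes the answer and memory that crowds out the question.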
The underlying insight behind Kairo-AI is that personality and memory are orthogonal dimensions of a good assistant. Personality sets the communication style. Memory provides the context. You need both, and they need to be composable. Getting that separation right, and using Hindsight to handle the memory side properly, is what turned a stateless chat widget into something that actually improves with use.