Why stateless LLMs can't make consistent decisions, and how Hindsight fixed that

#agents #ai #programming #python

I spend most of my time in the frontend. I build dashboards, wire up routes, push the thing to a CDN, and obsess over the half-second between a click and something useful appearing on screen. So when we set out to build ExceptionOS — a platform that helps companies make consistent, explainable decisions about business exceptions like refunds, discount approvals, and SLA compensation — the part I owned was the surface: the React app, the deploy pipeline, and a chat-plus-voice assistant that anyone could talk to.

The interesting problem turned out not to be the UI at all. It was what sat behind it: a memory layer that remembers every decision an organization has ever made, and an assistant that answers questions by recalling from it. This is the story of building that assistant, the dumb mistake I made that made it feel slow and weird, and how a memory system called Hindsight ended up shaping the whole product.

What the system actually does

ExceptionOS captures a business exception — say, a customer asking for a refund outside policy — and runs it through a debate. Ten specialized agents look at the case from different angles: one finds the applicable policy, one estimates the financial hit, one assesses churn risk, one digs up similar past cases, and one plays critic and pokes holes in the emerging recommendation. The output is a structured recommendation with reasoning a human can read and override.

That debate is only as good as its memory. An agent that finds "similar past cases" needs somewhere those cases live. We use Hindsight Cloud for that — agent memory as a managed service, with three operations we lean on constantly: retain (store a decision), recall (find relevant ones), and reflect (surface patterns over time). Every organization gets its own memory bank, created on first use:

bank = await self.hindsight.create_bank(
    name=f"exceptionos-{org_name}-{organization_id}",
    description=f"Memory bank for organisation '{org_name}' on ExceptionOS",
)

One bank per org means recall is naturally scoped — Acme's assistant never sees Globex's decisions. That property mattered a lot once I started building the front-facing assistant.

The assistant: a chat orb that knows your history

The feature I'm proudest of is a floating orb that lives in the corner of every screen. You can type at it or talk to it. Ask "what's our approval rate for contractor exceptions?" and it answers from your organization's actual decision history, then reads the answer aloud.

The frontend side is deliberately thin. The browser doesn't talk to the memory layer or the LLM directly — it posts a message, an optional bank ID, and the current page context to one backend endpoint, and gets back an answer plus the sources it used:

export async function askAssistant(
  message: string,
  opts: { bankId?: string; context?: string } = {},
): Promise<AssistantReply> {
  const res = await fetch(`${API}/api/v1/assistant/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', ...authHeader() },
    body: JSON.stringify({ message, bank_id: opts.bankId, context: opts.context }),
  })
  const json = await res.json()
  const data = json.data || json
  return { answer: data.answer, sources: data.sources || [], provider: data.provider }
}

Passing context — the case or page the user is currently looking at — is what makes the assistant feel like it's there with you. Ask "is this one risky?" while staring at a specific case and it knows what "this" means. Passing bankId is what makes it org-aware. Two small fields, most of the perceived intelligence.

On the backend, the endpoint recalls grounding memories from Hindsight and feeds them to the LLM as context. The recall itself is one HTTP call:

async def recall_all(self, bank_id: str, query: str, top_k: int = 5):
    return await self.hindsight.recall(bank_id, query, top_k=top_k)

Then a system prompt tells the model to use those memories only when the question actually calls for them, answer in one to three sentences, and keep a natural spoken tone — because that same text gets sent to ElevenLabs for voice synthesis and read back to the user.

The mistake: recalling on "hi"

Here's where I got it wrong. My first version was clean and uniform: every message went through the same path. User says something, we recall from memory, we hand the memories to the model, we answer. Symmetry felt right.

It was terrible.

You'd open the orb, type "hi", and wait. Behind that one word the system was doing a full vector recall against the org's entire decision history, pulling five "relevant" memories about refunds and NDAs, and stuffing them into the prompt. The model, dutifully handed a pile of past cases, would respond to "hi" by listing refund precedents. It was slow — a network round-trip to the memory layer before any greeting — and it was unsettling, like saying hello to someone who immediately recites your file.

The fix was to admit that not every message is a query. Small talk shouldn't touch memory at all:

normalized = message.lower().strip(" .!?")
is_smalltalk = normalized in GREETINGS or len(normalized) <= 3

memories: list[dict] = []
if not is_smalltalk:
    memories = await RecallService(get_hindsight_client()).recall_all(bank_id, message, top_k=5)

Before: "hi" → recall five memories → 2-second pause → an awkward dump of past refund cases.
After: "hi" → no recall → instant, warm one-liner inviting you to ask about a case.

The system prompt reinforces it: greetings get a warm sentence and never enumerate cases; memories get referenced only when the user asks about a case, refund, discount, policy, or decision. The lesson generalizes well beyond greetings. Recall is not free — it costs a round-trip and it costs prompt space — and a memory system is most impressive when it stays quiet until it has something worth saying.

The gotcha that cost me an afternoon: metadata is strings only

A quieter lesson lived at the boundary between our data model and Hindsight's. Hindsight's memory metadata accepts string values only, so anything structured — case IDs, financial figures, nested objects — has to be coerced or JSON-encoded before a retain call, or it silently fails to stick. The fix was a small normalizer that every retain passes through. The takeaway: when you adopt a managed memory layer, learn its type contract early — the constraints are usually there for good reasons, and guessing from the client side just wastes an afternoon.

What I'd tell someone starting this

Shipping the frontend taught me that a memory layer doesn't live in the backend — it leaks into every product decision you make. Whether to recall, when to recall, how much to show, what to read aloud: those are UX calls as much as infrastructure calls. Two small request fields (bank_id, context) carried most of the assistant's apparent intelligence. One conditional (is_smalltalk) carried most of its perceived speed and warmth.

If you're building something similar, start with the operations — retain, recall, reflect — and resist the urge to apply them uniformly. The product feels smart not when it remembers everything, but when it knows the difference between a question and a hello.

You can see the project at github.com/Avi36005/ExceptionOS, read more about the memory layer in the Hindsight docs and its GitHub repo, or dig into the concept of agent memory itself.