itzsam10

How I Used Hindsight to Build an Unforgiving Accountability Agent

"Did it seriously just remember that?" My teammate stared at the screen as AXIOM cited an excuse from three days ago — without us feeding it any chat history. That's when we knew Hindsight was doing something different from RAG.

What We Built

AXIOM is a personal discipline agent for engineering students. It's not a to-do list or a reminder app. It's an AI that holds you accountable across days and sessions — because it actually remembers what you said yesterday, last week, and the week before that.

Every session starts with a structured check-in: did you work on your capstone project? Did you go to the gym? Did you complete your coursework? You answer honestly. AXIOM scores you out of 1000, updates a 30-day activity heatmap, and stores everything in Hindsight — a persistent memory layer built specifically for AI agents. Tomorrow, when you open it again, it knows exactly what you did and what you avoided.
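Under the hood, the check-in itself is simple bookkeeping. Here's a rough sketch of the scoring and 30-day heatmap logic (the task names and point weights are illustrative, not AXIOM's actual rubric):

```python
from datetime import date

# Illustrative weights -- AXIOM's real rubric may differ.
WEIGHTS = {"capstone": 400, "gym": 300, "coursework": 300}

def score_checkin(answers: dict[str, bool]) -> int:
    """Score a daily check-in out of 1000 from yes/no answers."""
    return sum(WEIGHTS[task] for task, done in answers.items() if done)

def update_heatmap(heatmap: dict[str, int], score: int) -> dict[str, int]:
    """Record today's score, keeping only the most recent 30 days."""
    heatmap[date.today().isoformat()] = score
    recent = sorted(heatmap)[-30:]  # ISO dates sort chronologically
    return {day: heatmap[day] for day in recent}

score = score_checkin({"capstone": True, "gym": False, "coursework": True})
# capstone (400) + coursework (300) = 700
```

The interesting part isn't the arithmetic; it's that the answers behind each score get written to memory, not just the number.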

The stack is deliberately simple: Streamlit for the UI, Groq (Llama 3.1) for fast streaming inference, and Hindsight Cloud for per-user persistent memory. The entire application lives in a single app.py file. No database setup, no complex infrastructure. Just three APIs connected together.

The Memory Problem I Didn't Know I Had

My first version used st.session_state — a Python dictionary that lives only while the browser tab is open. It worked fine within a single session. Close the tab, come back tomorrow — the agent had no idea who you were. It gave the same generic advice every single day. That's not accountability. That's autocomplete with a friendly tone.
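Stripped of Streamlit, the failure mode is easy to demonstrate. `st.session_state` amounts to a dict that is rebuilt empty for every new browser session (a minimal illustration, not AXIOM's actual code):

```python
def new_session() -> dict:
    """What st.session_state amounts to: a dict born empty each session."""
    return {}

# Day 1: the user confesses to skipping the gym.
day1 = new_session()
day1["history"] = ["skipped gym: too tired"]

# Day 2: the browser tab is reopened and state is rebuilt from scratch.
day2 = new_session()
assert "history" not in day2  # yesterday is simply gone
```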

My second attempt was saving raw conversation text to a simple list and doing keyword search over it. This felt like progress until I actually tested it. If someone typed "I skipped the gym because I was exhausted and had a headache after the lab session," a keyword search for "gym" the next day would return it — but as a raw blob of text with zero structure. The agent would hallucinate context, misattribute timing, and occasionally reference things that had nothing to do with the current conversation.
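That second attempt boiled down to something like this (a reconstruction of the approach described above, not the original code):

```python
memory_log: list[str] = []

def remember(text: str) -> None:
    memory_log.append(text)

def keyword_search(keyword: str) -> list[str]:
    """Naive retrieval: case-insensitive substring match over raw blobs."""
    return [entry for entry in memory_log if keyword.lower() in entry.lower()]

remember("I skipped the gym because I was exhausted and had a headache after the lab session")
remember("Gym session done, 45 minutes of cardio")

# Both entries match "gym" -- no timestamps, no structure, no way to tell
# an excuse from a completed workout without re-reading the raw text.
matches = keyword_search("gym")
```

The matches come back, but everything that makes them useful to an accountability agent — when it happened, why, whether it was an excuse or a win — is buried in the prose.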

The core problem wasn't storage. It was retrieval quality. I needed something that could answer "what excuses has this user made about gym this week?" and return structured, contextual, temporally-aware results — not raw text matches.

That's what Hindsight's agent memory system actually does. When you call retain(), it doesn't store your raw text. An LLM extracts structured facts — who said what, when, in what context, with what relationships between entities. When you call recall(), four search strategies run in parallel: semantic similarity, BM25 keyword matching, knowledge graph traversal, and temporal reasoning.

How the Memory Integration Works

At the end of each check-in, app.py stores the session with a single retain() call.

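A sketch of what that call can look like. The SDK import, client name, bank naming, and parameter names below are assumptions for illustration, not Hindsight's confirmed API — only `retain()` itself is taken from the docs:

```python
from datetime import datetime, timezone

# Assumed client import -- Hindsight's actual SDK module may differ.
# from hindsight import HindsightClient
# client = HindsightClient()

def build_retain_payload(user_id: str, checkin: dict[str, bool], score: int) -> dict:
    """Shape one check-in into the text Hindsight will extract facts from."""
    summary = ", ".join(
        f"{task}: {'done' if done else 'skipped'}" for task, done in checkin.items()
    )
    return {
        "bank_id": f"axiom-{user_id}",  # per-user memory bank (assumed naming)
        "content": (
            f"Daily check-in ({datetime.now(timezone.utc).date()}): "
            f"{summary}. Score: {score}/1000."
        ),
    }

payload = build_retain_payload("sam", {"capstone": True, "gym": False}, 700)
# client.retain(**payload)  # actual call -- signature assumed
```

The key design choice is sending a plain natural-language sentence rather than pre-structured JSON: Hindsight's own LLM extraction is what turns "gym: skipped" into a fact it can later surface when asked "what excuses has this user made about gym this week?"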
