Every bad decision I've ever made had one thing in common: I'd made a version of it before and forgotten.
That's the problem Retrospect is built to solve. It's a personal decision memory agent — you log real decisions you've made, with outcomes, and the agent retains every one of them. When you're about to make a new call, you ask the agent. It recalls the most semantically relevant decisions from your own history and reasons over them using Gemini 2.5 Flash. The advice isn't generic. It's built entirely from your own patterns.
The core of the system is two things working together: Hindsight for persistent agent memory, and Gemini for pattern reasoning over what Hindsight recalls. Neither piece is interesting alone. Together, they produce something that gets meaningfully better the more you use it.
How the System Hangs Together
The stack is Next.js 14 App Router, SQLite via better-sqlite3, JWT auth, and Tailwind CSS. The interesting work lives in three files: src/lib/hindsight.ts, api/decisions/log/route.ts, and api/decisions/ask/route.ts.
Every decision a user logs gets written to SQLite — that's the source of truth for the UI, the dashboard, and the insights page. Simultaneously, it gets retained in Hindsight memory — that's the source of truth for the agent. When the agent is asked a question, it recalls the top 5 semantically relevant memories before Gemini sees a single token. Gemini never operates on a blank slate. It always reasons over the user's actual history.
The UI has a split layout on the agent page: chat on the left, a live memory feed on the right. When the agent recalls specific memories, those cards glow in the UI. You can watch Hindsight working in real time rather than trusting it's happening somewhere in the backend.
The Problem With Recency-Based Context
The first version I shipped had no persistent memory at all. I was pulling the user's last 5 decisions from SQLite and injecting them into the Gemini prompt. It worked well enough to demo, but the advice was consistently off.
Ask it "should I run Instagram ads?" and it would pull the 5 most recently logged decisions regardless of relevance — maybe 3 finance decisions and a hiring call — and try to reason from there. Gemini would hedge. The output felt generic because the input was wrong.
The shift happened when I stopped thinking about this as a "give the LLM more context" problem and started thinking about it as a retrieval problem. The question isn't "what did this user do recently?" It's "what did this user do that's most similar to what they're asking about now?"
That reframe is what pushed me toward Hindsight agent memory. Instead of recency-based retrieval, Hindsight does semantic recall. The query "should I run Instagram ads?" now pulls the 5 decisions most semantically related to that question — past marketing experiments, ad spend outcomes, content channel tests. Gemini gets the right context every time.
The Hindsight Wrapper
I wrapped Hindsight in a small module that handles lazy initialisation and graceful fallback:
// src/lib/hindsight.ts
let hindsightClient: any = null;

async function getHindsight() {
  if (hindsightClient) return hindsightClient;

  const apiKey = process.env.HINDSIGHT_API_KEY;
  if (!apiKey) {
    console.warn("HINDSIGHT_API_KEY not set — using mock memory store");
    return null;
  }

  try {
    const { Hindsight } = await import("@vectorize-io/hindsight");
    hindsightClient = new Hindsight({ apiKey });
    return hindsightClient;
  } catch (e) {
    console.warn("Hindsight SDK not available — using mock memory store", e);
    return null;
  }
}
The lazy initialisation matters more than it looks. Early in development I kept hitting cold-start failures where the SDK would fail silently and the entire agent flow would break with no useful error. Building in a null-return path meant I could fall back to an in-memory store with keyword matching during development — not accurate enough for production, but accurate enough to build and test the full UI flow without a live API key. The fallback cost me almost nothing to write and saved hours of blocked work.
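The fallback store itself can stay tiny. A minimal sketch of what a keyword-matching mock looks like — the names here (`mockRetain`, `mockRecall`) are illustrative, not the actual implementation:

```typescript
// Hypothetical in-memory fallback: keyword overlap instead of embeddings.
type MockMemory = { content: string; metadata: Record<string, string> };

const mockStore: MockMemory[] = [];

function mockRetain(memory: MockMemory): void {
  mockStore.push(memory);
}

function mockRecall(query: string, topK: number): MockMemory[] {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  return mockStore
    .map((m) => ({
      memory: m,
      // Score = number of query terms that appear in the memory content.
      score: terms.filter((t) => m.content.toLowerCase().includes(t)).length,
    }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.memory);
}
```

Crude scoring, no stopword handling, no real semantics — but it returns plausible-looking memories, which is all the UI flow needs during development.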
Retaining a Decision
When a user submits a decision, two writes happen in sequence:
// api/decisions/log/route.ts

// 1. Write to SQLite — source of truth for UI
const dbResult = db
  .prepare(
    "INSERT INTO decisions (user_id, decision, category, outcome, result, date) VALUES (?, ?, ?, ?, ?, ?)"
  )
  .run(payload.userId, decision, category, outcome, result, date);

// 2. Retain in Hindsight — source of truth for agent
await retainMemory({
  content: `Decision: ${decision}. Category: ${category}. Outcome: ${outcome}. Result: ${result}. Date: ${date}`,
  metadata: {
    userId: payload.userId,
    category,
    outcome,
    date,
  },
});
The content string format took more iteration than I expected. I originally retained only the decision text itself. Recall quality improved significantly when I added category, outcome, and result into the content string — because now the embedding captures not just what the decision was, but what kind it was and what actually happened afterward. Treat the content string like a schema, not a log message. It's the most important thing you'll write in the entire system.
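One way to keep that schema discipline is to funnel every retain through a single typed formatter. A hypothetical helper — not the actual codebase's shape, just the idea:

```typescript
// Hypothetical helper: treat the retained content string as a schema.
interface DecisionRecord {
  decision: string;
  category: string;
  outcome: "success" | "failure" | "mixed";
  result: string;
  date: string; // ISO date, e.g. "2025-01-15"
}

function buildMemoryContent(d: DecisionRecord): string {
  // Every field lands in the embedded text, so recall can match on
  // what the decision was, what kind it was, and how it turned out.
  return `Decision: ${d.decision}. Category: ${d.category}. Outcome: ${d.outcome}. Result: ${d.result}. Date: ${d.date}`;
}
```

The type forces every caller to supply every field, which is exactly the property you want: no decision gets retained with half its context missing.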
Recalling and Reasoning
The ask route is where everything comes together:
// api/decisions/ask/route.ts

// 1. Semantic recall from Hindsight
const memories = await recallMemories({
  query: question,
  topK: 5,
  filter: { userId: payload.userId },
});

// 2. Build context string
const context = memories
  .map((m, i) => `Memory ${i + 1}: ${m.content}`)
  .join("\n\n");

// 3. Log the recall event
db.prepare(
  "INSERT INTO recall_events (user_id, query, recalled_count) VALUES (?, ?, ?)"
).run(payload.userId, question, memories.length);

// 4. Pass to Gemini
const answer = await askGemini(question, context);
The recall event logging was something I nearly skipped. It powers the "Patterns Found" stat on the dashboard and gives me data on which queries trigger the most recalls — useful for understanding where embedding quality is holding up. Log your recall events from day one. It's the only way to know whether the memory layer is doing useful work or just adding latency.
The confidence score is deliberately simple:
confidence: Math.min(95, Math.round((memories.length / 5) * 82))
Five recalled memories return 82% confidence. Fewer return proportionally less. It never hits 100% — the agent shouldn't be that certain. The score surfaces as a visible meter in the UI that grows as you log more decisions, communicating one clear thing: the more you put in, the smarter it gets.
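Pulled out as a standalone function — a sketch, not the exact dashboard code — the heuristic and a few sample values look like this:

```typescript
// The confidence heuristic: scales linearly with recall count,
// capped below certainty.
function confidence(recalledCount: number): number {
  return Math.min(95, Math.round((recalledCount / 5) * 82));
}

// confidence(5) → 82, confidence(3) → 49, confidence(1) → 16
```

The cap at 95 only matters if recall ever returns more than five memories; with topK fixed at 5, the practical ceiling is 82.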
What the Insights Page Actually Shows
After 17 decisions, the insights page surfaces three patterns in plain English: Operations decisions succeed 60% of the time, Marketing is the best performing category, and Finance is where mistakes cluster. These aren't generated by Gemini on demand — they're calculated from the SQLite decisions table and updated every time a new decision is logged.
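The math behind those patterns is plain aggregation over the decisions table. A pure-function sketch of the per-category success rate, assuming outcome is stored as a "success" / "failure" string (the row shape here is illustrative):

```typescript
interface DecisionRow {
  category: string;
  outcome: string;
}

// Success rate per category, as a whole-number percentage.
function successRates(rows: DecisionRow[]): Record<string, number> {
  const totals: Record<string, { wins: number; total: number }> = {};
  for (const r of rows) {
    const t = (totals[r.category] ??= { wins: 0, total: 0 });
    t.total += 1;
    if (r.outcome === "success") t.wins += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([cat, t]) => [
      cat,
      Math.round((t.wins / t.total) * 100),
    ])
  );
}
```

Running this on every log keeps the insights page cheap: no model call, no latency, just arithmetic over rows you already have.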
The memory confidence bar sits at 55% here. At 17 decisions logged it hasn't peaked — the system is honest that it needs more signal. That bar is the clearest UI expression of something true about memory-powered agents: they're not useful immediately. They're useful after you've given them enough to work with.
What I'd Do Differently
Make the memory visible first. The split layout with the live memory feed — showing which memories were recalled and highlighting them when the agent uses them — was the last UI feature I built. It should have been the first. Without it, the memory layer is invisible, and an invisible memory layer is indistinguishable from no memory layer at all. Users, collaborators, and anyone evaluating the system need to see it working. Don't make them take it on faith.
Build the fallback before you need it. The mock memory store and SQLite fallback in the ask route are not production code — they exist to keep the system running when Hindsight is unavailable or misconfigured. I needed them more than once during development and during a live demo. Write them first.
Be honest about the cold start. The agent isn't useful on day one. With 2 or 3 decisions logged, recall is shallow and advice reflects that. With 15 or 20, patterns emerge that the user couldn't have articulated themselves. The confidence meter is one way to communicate this — it starts low and grows. Set the expectation early.
The content string is more important than the prompt. I spent most of my tuning time on the Gemini system prompt and almost none on what I was retaining. It should have been the opposite. Structured, specific retained content produces precise recall. Loose content produces noise. The retrieval quality ceiling is set by what you put into Hindsight, not by how clever your prompt is.
Most bad decisions aren't original. They're reruns. The reason people keep making the same mistakes is that nothing in their environment remembers the first time.
Retrospect does. And the longer you use it, the more it knows about the specific ways you tend to go wrong — and right. That's what persistent agent memory makes possible — not a smarter model, but a system that accumulates knowledge about you specifically, and uses it every time you ask.