How Hindsight Turned My Credit Flow Into Something With a Paper Trail

#gautham #fardeenkhan #ai #webdev

The first time I asked my credit decision agent the same question twice, it gave me two different reasons for the same approval. Same income, same liabilities, same repayment history — different reasoning, different confidence score. That's the moment I stopped trusting a stateless LLM call to make lending decisions, and started building a memory layer underneath it.

This is the story of why I put Hindsight in front of every financial flow in FinShield AI, what it actually changed about how the system behaves, and where I'm routing model calls through cascadeflow so the expensive reasoning only happens when it needs to.

What the system does

FinShield AI is a unified financial intelligence platform: credit decisioning, fraud monitoring, AML risk detection, expense optimization, market intelligence, and a conversational financial advisor, all sitting behind one dashboard.

Under the hood, every one of those modules is a Google Genkit flow — a typed input schema, a typed output schema, and a prompt that asks a model to fill in the schema. That part is standard. Plenty of teams wire an LLM to a Zod schema and call it a day.

What's not standard, at least not in most of the credit-scoring code I've read, is that none of these flows call the model directly. Every request goes through a wrapper called runWithHindsight, which does three things around the actual model call: it retrieves relevant memory before the call, it runs the call, and it writes a reflection after the call. The model never sees a blank slate twice.

The core problem: stateless agents don't learn from their own mistakes

Here's the thing about wrapping an LLM in a nice Zod schema: it makes the output structured, but it says nothing about consistency. Ask a credit model about the same borrower profile on Monday and Friday, and if your prompt and temperature aren't pinned down exactly, you can get a different risk category. In a credit context, that's not a curiosity; it's an audit problem. If a regulator asks "why did you approve this loan," the honest answer can't be "the model felt like it that day."

I wanted every credit decision to be able to point at what came before it: prior decisions for that user, prior categories, prior confidence levels, and — critically — a record of whether earlier decisions turned out to be shaky. That's what pushed me toward giving the system persistent, retrievable memory instead of relying on model determinism.

I looked at Vectorize's writeup on agent memory early on, and the framing that stuck with me was the distinction between memory as a chat log and memory as retrieval — you don't want to replay the whole conversation history into every prompt, you want to pull the handful of prior interactions that are actually relevant to the current request. That's the model I built around: a memory store you query, not a transcript you dump.

How the credit flow actually uses Hindsight

Every domain flow — credit, fraud, AML, advisory, expenses — calls the same runWithHindsight wrapper before touching the model:

export async function creditDecisionAnalysis(
  input: CreditDecisionInput
): Promise<CreditDecisionOutput> {
  const { result } = await runWithHindsight(
    {
      category: 'credit',
      request: `Credit decision analysis for income ${input.income}`,
      metadata: { source: 'credit-decision' },
    },
    async (context) => creditDecisionAnalysisFlow({ ...input, hindsight_context: context }),
    (result) => ({
      response: JSON.stringify(result),
      confidenceScore: result.confidence_score,
      metadata: { approvalDecision: result.approval_decision },
    })
  );
  return result;
}

Three things happen here, in order:

1. Retrieve. Before the model runs, Hindsight pulls up to four prior memories tagged category: 'credit' that are relevant to this request.
2. Execute with context. The retrieved memories get flattened into a hindsight_context string and injected straight into the prompt, alongside income, liabilities, and repayment history.
3. Finalize and reflect. After the model responds, the result — including its own self-reported confidence_score — gets written back as a new memory, and a reflection gets generated against it.

That orchestration lives in one function, and it's the same function every flow in the codebase calls:

export async function runWithHindsight<T>(
  options: HindsightFlowOptions,
  executor: (context: string) => Promise<T>,
  finalize?: (result: T, context: string) => { response: string; confidenceScore: number; metadata?: Record<string, unknown> }
): Promise<HindsightFlowResult<T>> {
  // 1. Retrieve: pull up to four prior memories for this user and category.
  const memories = await retrieveRelevantMemories({
    userId: options.userId,
    category: options.category,
    request: options.request,
    limit: 4,
  });
  // Flatten the retrieved memories into a single string for the prompt.
  const context = buildMemoryContext(memories);
  // 2. Execute: run the flow with the memory context injected.
  const result = await executor(context);
  // 3. Finalize and reflect: store the outcome as a new memory and generate a reflection.
  const resolved = finalize ? finalize(result, context) : { /* default response shape */ };
  const { reflection } = await finalizeHindsightInteraction({
    userId: options.userId,
    request: options.request,
    response: resolved.response,
    confidenceScore: resolved.confidenceScore,
    category: options.category,
    metadata: { ...options.metadata, ...resolved.metadata, memoryContext: context },
  });
  return { result, memories, reflection, context };
}
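
The post does not show buildMemoryContext itself. A minimal sketch of how the flattening step might look, assuming a hypothetical MemoryRecord shape (the field names and the output format here are illustrative, not taken from the FinShield codebase):

```typescript
// Hypothetical memory record shape; the real type is not shown in the post.
interface MemoryRecord {
  request: string;
  response: string;
  confidenceScore: number; // expected range 0..1
  issueNote?: string;      // improvement note from an earlier reflection, if any
}

// Flatten retrieved memories into one short, prompt-ready string.
function buildMemoryContext(memories: MemoryRecord[], maxResponseChars = 200): string {
  if (memories.length === 0) {
    return 'No prior interactions found for this category.';
  }
  return memories
    .map((m, i) => {
      // Truncate long responses so the context stays small.
      const response =
        m.response.length > maxResponseChars
          ? m.response.slice(0, maxResponseChars) + '...'
          : m.response;
      const note = m.issueNote ? ` | note: ${m.issueNote}` : '';
      const pct = Math.round(m.confidenceScore * 100);
      return `${i + 1}. request: ${m.request} | response: ${response} | confidence: ${pct}%${note}`;
    })
    .join('\n');
}
```

Capping each response and the number of memories is what keeps the injected context bounded no matter how much history the store holds.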

If a decision comes back with a confidence score under 0.55, Hindsight doesn't just log it — it flags the interaction as an issue and records an explicit improvement note ("add more supporting financial context before answering"). If the response text contains hedge language like "unclear" or "insufficient," that gets flagged too. None of this changes the credit decision after the fact. It builds a running quality signal that the next request in that category gets to read.
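
The flagging rule described above can be sketched roughly as follows. The 0.55 threshold, the two hedge terms, and the improvement note come from the text; the function name, the Reflection shape, and the reason labels are assumptions made for illustration:

```typescript
// Hypothetical reflection shape; only the flagging rules come from the post.
interface Reflection {
  flagged: boolean;
  reasons: string[];
  improvementNote?: string;
}

const CONFIDENCE_THRESHOLD = 0.55;               // threshold named in the post
const HEDGE_TERMS = ['unclear', 'insufficient']; // the two examples named in the post

function reflectOnInteraction(response: string, confidenceScore: number): Reflection {
  const reasons: string[] = [];

  // Rule 1: self-reported confidence strictly under the threshold.
  if (confidenceScore < CONFIDENCE_THRESHOLD) {
    reasons.push('low-confidence');
  }

  // Rule 2: hedge language anywhere in the response text (case-insensitive).
  const lower = response.toLowerCase();
  if (HEDGE_TERMS.some((term) => lower.includes(term))) {
    reasons.push('hedge-language');
  }

  return {
    flagged: reasons.length > 0,
    reasons,
    // The note is attached only for low-confidence results in this sketch.
    improvementNote: reasons.includes('low-confidence')
      ? 'add more supporting financial context before answering'
      : undefined,
  };
}
```

Note that the function only annotates the interaction. It returns a quality signal and never touches the decision itself, which matches the behavior described above.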

What this looks like from the outside

On the actual credit decision screen, a borrower with $95,000 income, $12,000 in liabilities, and five years of clean repayment history comes back with a 95/100 risk score, a "Low" risk category, and an "Approve" decision, with the specific factors the model weighted listed right next to it.

That output isn't just persisted for the dashboard; it's the same object Hindsight stores and later retrieves as context for the next applicant with a similar profile. And it's visible. The Hindsight dashboard itself is a real, queryable view into the memory store: total memories, how many are recent, how many reflections have accumulated, and the average confidence across everything the system has decided.

You can filter by category, search by request text, and see the exact confidence score attached to each stored decision — a credit memory sitting at 98%, a fraud memory at 88%, a stock sentiment call at 93%. That table is the audit log I wanted from the start: not a black box that outputs "Approve," but a system where every decision is attached to a request, a confidence number, and a category, and where you can trace what the model saw before it answered.
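
The headline numbers on a dashboard like this reduce to a few aggregations over the stored rows. A rough sketch, assuming a hypothetical MemoryRow shape and a seven-day definition of "recent" (the post does not say what window the real dashboard uses):

```typescript
// Hypothetical row shape for a stored memory.
interface MemoryRow {
  category: string;
  confidenceScore: number; // 0..1
  createdAt: number;       // epoch milliseconds
  hasReflection: boolean;
}

interface MemoryStats {
  total: number;
  recent: number;
  reflections: number;
  averageConfidence: number;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000; // assumed "recent" window

function summarizeMemories(
  rows: MemoryRow[],
  now: number,
  recentWindowMs: number = SEVEN_DAYS_MS
): MemoryStats {
  const total = rows.length;
  const recent = rows.filter((r) => now - r.createdAt <= recentWindowMs).length;
  const reflections = rows.filter((r) => r.hasReflection).length;
  // Guard against division by zero on an empty store.
  const averageConfidence =
    total === 0 ? 0 : rows.reduce((sum, r) => sum + r.confidenceScore, 0) / total;
  return { total, recent, reflections, averageConfidence };
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the summary deterministic and easy to test.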

Where cascadeflow fits

Hindsight solves the "does this system remember anything" problem. It doesn't solve the "why is every single call, including trivial ones, hitting the most expensive model available" problem. That's where cascadeflow comes in: routing simple queries to fast, cheap inference (Groq's free tier, local Ollama models) and escalating only the sensitive or ambiguous cases, the ones actually touching credit or fraud decisions, to a paid frontier model.

End to end, the architecture has three parts.

Genkit handles flow orchestration and schema validation. Hindsight sits on one side, reading and writing memory for every flow. Cascadeflow sits on the other side, deciding which model tier actually needs to run. A stock sentiment check for a casual question doesn't need the same model budget as a credit approval — and it shouldn't cost the same to answer. Routing that decision at the infrastructure layer, instead of baking it into every individual flow, is what makes cost control a property of the system rather than something each flow author has to remember to implement.
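
To make the routing idea concrete, here is a small sketch of a tier decision made outside the flows. This is not cascadeflow's API; chooseTier, the tier names, and the use of prior confidence as an "ambiguity" signal are all assumptions made for illustration:

```typescript
// Hypothetical tiers: 'cheap' stands for fast, low-cost inference,
// 'frontier' for the paid model reserved for sensitive decisions.
type ModelTier = 'cheap' | 'frontier';

interface RoutingRequest {
  category: string;
  // Assumed signal: average confidence of the memories retrieved for this request.
  priorConfidence?: number;
}

// Categories the post describes as sensitive.
const SENSITIVE_CATEGORIES = new Set<string>(['credit', 'fraud']);

// Assumed cutoff for "ambiguous"; reuses the 0.55 reflection threshold for illustration.
const ESCALATION_THRESHOLD = 0.55;

function chooseTier(req: RoutingRequest): ModelTier {
  // Sensitive categories always get the stronger model.
  if (SENSITIVE_CATEGORIES.has(req.category)) {
    return 'frontier';
  }
  // Escalate when earlier answers in this area were shaky.
  if (req.priorConfidence !== undefined && req.priorConfidence < ESCALATION_THRESHOLD) {
    return 'frontier';
  }
  return 'cheap';
}
```

Because the decision takes only a category and a signal, a new flow gets a model budget by declaring its category rather than by hardcoding a model name.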

Lessons learned

Memory as retrieval, not replay, keeps prompts small. I don't append the whole interaction history to every request — I pull the four most relevant prior memories and summarize them into a short context string. That keeps token cost predictable regardless of how long a user has been using the system.

A confidence score is only useful if something reads it later. Plenty of LLM responses include a self-reported confidence field that nobody ever checks. Wiring that number into an automatic reflection step that flags anything under a threshold turned a decorative field into something that actually changes future behavior.

Categorizing memory by domain matters more than I expected. Tagging every memory with a category (credit, fraud, aml, chat) means retrieval doesn't just find "similar text," it finds similar decisions. A fraud memory should never leak into a credit context, even if the wording happens to overlap.
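
A sketch of category-scoped retrieval, under stated assumptions: the post does not say how relevance is scored, so plain word overlap stands in here for whatever similarity measure the real store uses, and the StoredMemory shape and function names are hypothetical:

```typescript
type Category = 'credit' | 'fraud' | 'aml' | 'chat';

// Hypothetical stored-memory shape.
interface StoredMemory {
  category: Category;
  request: string;
  response: string;
}

// Lowercase, split on non-alphanumerics, and de-duplicate.
function tokenize(text: string): string[] {
  return Array.from(new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)));
}

function retrieveByCategory(
  store: StoredMemory[],
  category: Category,
  request: string,
  limit = 4
): StoredMemory[] {
  const queryTokens = new Set(tokenize(request));
  return store
    // Hard filter first: other categories are never candidates.
    .filter((m) => m.category === category)
    // Score by shared words (a stand-in for real similarity search).
    .map((m) => ({
      memory: m,
      overlap: tokenize(m.request).filter((t) => queryTokens.has(t)).length,
    }))
    .filter((scored) => scored.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, limit)
    .map((scored) => scored.memory);
}
```

The ordering matters: filtering by category before scoring means a fraud memory with identical wording can never outrank a credit memory, because it is never scored at all.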

Persistence has to survive infrastructure failure. I didn't want a Firestore outage to mean the system silently forgets everything mid-session, so the store falls back to a local file when the primary database is unreachable, and reads/writes go through the same interface either way. The flows calling runWithHindsight never know which backend actually served the request.
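
The fallback pattern can be sketched as a wrapper that exposes the same interface as the backends it wraps. Everything below is illustrative: MemoryStore, FallbackStore, and the in-memory stand-in are hypothetical names, and no real Firestore or file-system calls are shown:

```typescript
// Hypothetical storage interface; the post only says both backends share one.
interface MemoryStore {
  save(id: string, value: string): Promise<void>;
  load(id: string): Promise<string | undefined>;
}

// In-memory stand-in for a real backend, with a switch to simulate an outage.
class InMemoryStore implements MemoryStore {
  private data = new Map<string, string>();
  private available = true;

  setAvailable(available: boolean): void {
    this.available = available;
  }

  async save(id: string, value: string): Promise<void> {
    if (!this.available) throw new Error('backend unreachable');
    this.data.set(id, value);
  }

  async load(id: string): Promise<string | undefined> {
    if (!this.available) throw new Error('backend unreachable');
    return this.data.get(id);
  }
}

// Tries the primary backend first and falls back when it fails.
class FallbackStore implements MemoryStore {
  private primary: MemoryStore;
  private fallback: MemoryStore;

  constructor(primary: MemoryStore, fallback: MemoryStore) {
    this.primary = primary;
    this.fallback = fallback;
  }

  async save(id: string, value: string): Promise<void> {
    try {
      await this.primary.save(id, value);
    } catch {
      await this.fallback.save(id, value);
    }
  }

  async load(id: string): Promise<string | undefined> {
    try {
      const value = await this.primary.load(id);
      if (value !== undefined) return value;
    } catch {
      // Primary unreachable: fall through to the fallback.
    }
    return this.fallback.load(id);
  }
}
```

Callers depend only on MemoryStore, so they cannot tell which backend served a request. One thing this sketch leaves out is syncing fallback writes back to the primary once it recovers.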

Routing cost decisions away from individual flows scales better. Once you have more than a handful of flows, hardcoding "use the expensive model" into every prompt file becomes something you have to audit by hand. Pushing that decision into a runtime layer — cascadeflow's docs describe this as cost-aware routing — means adding a new flow doesn't mean re-deciding your model budget from scratch.

If you're building anything that makes a repeated, categorized decision — credit, fraud, compliance, anything where "why did it decide that" is a real question someone will ask — the pattern I'd take away from this isn't "add an LLM." It's "add a memory layer that the LLM has to read from and write back to, and make the read/write cheap enough that you can afford to do it on every single call." That's the difference between a model that answers and a system that can explain itself.

Published by Gautham.B & Fardeen Khan.F