Most AI agents are stateless. They process a request, return a result, and forget everything. For most tasks that's fine. For accounts payable, it's a disaster.
I spent two days building Finley, an invoice intelligence agent for SMBs, and the most interesting engineering decision wasn't the LLM prompt design or the extraction logic. It was what happened when I wired in Hindsight as a persistent memory layer. The before and after is stark enough that I want to document exactly how it works.
## What the System Does
Finley processes invoices through an 8-step pipeline:
Upload → LLM Extraction → Memory Retrieval → Analysis → Decision → Output → Feedback → Memory Update
A user uploads a PDF invoice. The backend extracts structured fields using Claude, queries a vendor-scoped memory namespace for historical context, runs analysis, and returns a verdict: approve, flag, or reject. After the user acts on that verdict, their action gets written back to memory for next time.
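The core of that flow can be sketched as one async function. This is a minimal illustration, not Finley's actual code: `processInvoice` and the injected step functions are hypothetical names, with the real extraction, retrieval, and analysis implementations passed in as dependencies.

```javascript
// Sketch of the pipeline's synchronous core (upload → extraction → memory
// retrieval → analysis → verdict). The `steps` object carries stand-ins for
// the real implementations; feedback and the memory write-back happen later,
// after the user acts on the verdict, so they live outside this function.
async function processInvoice(pdfBuffer, steps) {
  const extracted = await steps.extract(pdfBuffer);                // LLM extraction
  const memory = await steps.retrieveMemory(extracted.vendorName); // vendor-scoped recall
  const analysis = await steps.analyze(extracted, memory);         // approve / flag / reject
  return { extracted, memoryCount: memory.length, verdict: analysis.verdict };
}
```

Injecting the steps keeps the orchestration testable without a live LLM or memory backend.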
The stack is straightforward: Node.js/Express backend, React/Vite frontend, Claude via the Anthropic SDK, and Hindsight agent memory for persistence.
## The Memory Architecture
Every vendor gets its own namespace in Hindsight:
```javascript
function vendorKey(name) {
  return `vendor:${name.toLowerCase().replace(/\s+/g, "_")}`;
}
```
So `Prakash Office Supplies Pvt. Ltd.` becomes `vendor:prakash_office_supplies_pvt._ltd.`. When a new invoice arrives, we query that namespace for the top 20 most relevant past interactions:
```javascript
export async function retrieveVendorMemory(vendorName, _currentInvoiceId) {
  const key = vendorKey(vendorName);
  const data = await hindsightFetch(
    `/v1/memories/query?namespace=${encodeURIComponent(key)}&limit=20`
  );
  if (data?.memories) return data.memories;
  // Fallback: newest-first copy of the local in-process store
  return localGet(key).slice().reverse();
}
```
After processing, the user's feedback — approved, rejected, or overridden — gets written back:
```javascript
const payload = {
  namespace: key,
  content: JSON.stringify(entry),
  metadata: {
    vendorName,
    invoiceId: entry.invoiceId,
    date: entry.date,
    verdict: entry.agentDecision,
    userAction: entry.userAction,
  },
};
await hindsightFetch("/v1/memories", "POST", payload);
```
The metadata fields let Hindsight perform semantic retrieval — it doesn't just do exact key lookup, it finds contextually relevant memories. That matters when you're asking "has this vendor submitted duplicate invoices before?" against a set of unstructured memory entries.
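That same metadata also supports cheap local checks before the LLM ever sees the data. A sketch, assuming each retrieved memory carries the metadata object from the payload above (`findDuplicateInvoices` is an illustrative name, not Finley's API):

```javascript
// Filter retrieved memories down to exact invoiceId collisions with the
// current invoice, using the metadata written back after each interaction.
function findDuplicateInvoices(memories, currentInvoice) {
  return memories.filter(
    (m) => m.metadata && m.metadata.invoiceId === currentInvoice.invoiceId
  );
}
```

A deterministic pre-filter like this complements semantic retrieval: the LLM handles fuzzy patterns, while exact matches get caught without spending tokens.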
## The Analyzer Is Where Memory Actually Earns Its Keep
The real work happens in the analyzer. It receives both the extracted invoice fields and the retrieved memory array, and passes them to Claude together:
```javascript
const analysis = await analyzeInvoice(extracted, memory);
```
Inside analyzeInvoice, the memory gets serialized into the prompt. The LLM can then reason over patterns: "Invoice #INV-2025-0009 for ₹47,500 — I've seen this vendor submit an invoice with this exact amount and number before. That's a duplicate."
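The serialization step might look like the sketch below; `formatMemoryForPrompt` is an illustrative name and the exact framing is an assumption, but the idea is to flatten the memory array into labeled plain-text blocks the model can reason over.

```javascript
// Turn retrieved memory entries into a prompt section. Entries from Hindsight
// carry a string `content` field (JSON written at feedback time); anything
// else gets stringified as a fallback.
function formatMemoryForPrompt(memories) {
  if (!memories || memories.length === 0) {
    return "No prior history for this vendor.";
  }
  return memories
    .map((m, i) => {
      const body = typeof m.content === "string" ? m.content : JSON.stringify(m);
      return `--- Prior interaction ${i + 1} ---\n${body}`;
    })
    .join("\n");
}
```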
Without memory, the analyzer produces generic checks: amount format, required fields, basic plausibility. With 9 prior interactions in context, it starts surfacing things like:
- "This vendor consistently invoices on Net-30 despite a Net-45 contract"
- "Rounding errors of ₹0.50–₹2.00 are a documented pattern on this vendor's billing system"
- "3 duplicate invoices in 6 months — hold for review"
These aren't hard-coded rules. They emerge from the LLM reasoning over memory the agent has accumulated through normal use.
## The Fallback Is Non-Negotiable
I designed the memory layer to degrade gracefully. If the Hindsight API key isn't set, everything falls through to an in-process Map:
```javascript
if (!HINDSIGHT_API_KEY) {
  console.warn("[memory] HINDSIGHT_API_KEY not set — using local fallback");
  return null;
}
```
The local store works for demos and development. You lose cross-session persistence, but the pipeline itself stays intact. This mattered during building — I could test the full 8-step flow without a live Hindsight connection.
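The post doesn't show the local store itself; one possible shape for the Map-backed fallback behind `localGet` (the names match the retrieval snippet above, but this is a guess at the internals):

```javascript
// In-process fallback store: one array of entries per vendor namespace.
// Entries append in arrival order; readers call .slice().reverse() to get
// newest-first, matching the retrieval code above.
const localStore = new Map();

function localSet(key, entry) {
  const list = localStore.get(key) || [];
  list.push(entry);
  localStore.set(key, list);
}

function localGet(key) {
  return localStore.get(key) || [];
}
```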
## What "Invoice #1 vs Invoice #10" Actually Looks Like
The demo script for Finley has two acts for a reason.
Act 1: Select a new vendor with no prior history. The agent approves with 0 memory recalls. The checks are generic. It has no basis for anything else.
Act 2: Select Prakash Office Supplies on their 10th invoice. The agent retrieves 9 prior interactions, flags the duplicate, auto-corrects the payment terms to Net-45 (learned from previous corrections), and holds for review. Same pipeline, same code, completely different outcome.
That's not a prompt trick. It's what accumulated agent memory looks like in practice.
## Lessons
**Store structured metadata alongside content.** Hindsight accepts a metadata object with each memory entry. Using it to index verdict, userAction, and date makes retrieval more precise. Don't skip it.
**Namespace granularity matters.** We went vendor-scoped. You could also namespace by vendor+invoice-type or vendor+region. The right granularity depends on how you want memory to cluster.
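A finer-grained key is a one-line change on top of `vendorKey`. Here's an illustrative vendor+invoice-type variant (`vendorTypeKey` is hypothetical, not part of Finley):

```javascript
// Finer-grained namespace: cluster memories per vendor AND invoice type,
// so recurring subscriptions and one-off purchases don't share history.
const slug = (s) => s.toLowerCase().replace(/\s+/g, "_");

function vendorTypeKey(name, invoiceType) {
  return `vendor:${slug(name)}:type:${slug(invoiceType)}`;
}
```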
**The fallback determines your dev experience.** If the memory layer has no local substitute, your development loop gets slow. Build the fallback first.
**Feedback is how the agent learns.** The initial seeded memories are for demos. The real value builds up through actual user corrections over time. Design the feedback loop early — it shapes the whole memory schema.
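A feedback writer built around that loop might look like this sketch. `recordFeedback` is hypothetical; the payload shape mirrors the write-back code earlier, and `post` is injected so the sketch stays transport-agnostic (in Finley it would wrap `hindsightFetch`).

```javascript
// Write one user action back into the vendor's namespace. `entry` holds
// whatever the feedback step captured (invoiceId, userAction, etc.).
async function recordFeedback(vendorName, entry, post) {
  const key = `vendor:${vendorName.toLowerCase().replace(/\s+/g, "_")}`;
  return post("/v1/memories", {
    namespace: key,
    content: JSON.stringify(entry),
    metadata: {
      vendorName,
      invoiceId: entry.invoiceId,
      userAction: entry.userAction,
    },
  });
}
```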
The code for Finley is at finley-rho.vercel.app. The full memory integration uses Hindsight — worth reading their docs if you're building anything that needs an agent to remember context across sessions.