Every invoice processing system has rules. "Flag amounts over $50,000 for manual review." "Reject invoices missing a vendor registration number." These are clear, manageable, and easy to apply.
The problem is that most cases of invoice fraud, duplicate submissions, and billing mistakes don't trigger these rules. They look like ordinary invoices. A vendor submitting a slightly varied duplicate invoice, a matching amount but a different invoice number, passes every field-level check. The pattern only becomes visible when you know the vendor's history.
Building Finley's decision engine taught me how to blend rule-based checks with pattern detection that comes from experience. Here's how the two layers work together.
The Decision Engine Structure
Finley follows two steps before making a decision: an analyzer that generates flags and checks, and a decision builder that interprets them.
```javascript
// Step 4: Contextual analysis
const analysis = await analyzeInvoice(extracted, memory);

// Step 5: Decision engine
const decision = buildDecision(analysis);
```
The analyzer examines both the current invoice and the retrieved memories. The decision builder only receives the analysis output. This separation matters: the analyzer interprets, while the decision builder applies the logic. The decision builder itself is deterministic: given the same analysis, it produces the same decision every time.
Layer 1: Deterministic Checks
Some issues don't require complex reasoning. A missing invoice number is always a problem. An amount that doesn't match the sum of line items is always a problem. These are field-level checks that run before any LLM calls.
```javascript
const checks = [
  {
    name: "Invoice ID present",
    pass: !!extracted.invoiceId,
    severity: "error"
  },
  {
    name: "Amount matches line items",
    pass: Math.abs(lineItemSum - extracted.totalAmount) < 0.01,
    severity: "warning"
  },
  // ...
];
```
These checks run quickly, yield clear results, and catch the obvious issues without using up API credits.
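To make that concrete, here is a minimal sketch of a check runner that short-circuits before the LLM layer. The `runDeterministicChecks` helper and its return shape are assumptions for illustration, not Finley's actual code:

```javascript
// Evaluate field-level checks and decide whether to continue to the LLM layer.
// Each check follows the { name, pass, severity } shape shown above.
function runDeterministicChecks(checks) {
  const failures = checks.filter((c) => !c.pass);
  return {
    failures,
    // Any failed "error"-severity check is fatal: skip the expensive LLM call.
    proceedToLLM: !failures.some((c) => c.severity === "error"),
  };
}

// Example: a missing invoice ID is an error, so the pipeline stops early.
const result = runDeterministicChecks([
  { name: "Invoice ID present", pass: false, severity: "error" },
  { name: "Amount matches line items", pass: true, severity: "warning" },
]);
```

Failing fast here means a malformed invoice never reaches the more expensive analysis step.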
Layer 2: Memory-Backed Pattern Detection
The more intriguing layer involves what the LLM does with vendor memory. When Finley retrieves 9 previous interactions from Hindsight for a vendor, these memories join the current invoice fields in the analyzer prompt.
The analyzer can then identify patterns that no static rule would catch:
- Duplicate detection with variation: "INV-2025-0009 for ₹47,500; vendor submitted INV-2025-0007 for the same amount 3 weeks ago. Similar amounts from this vendor: 3 in the last 6 months, 2 with identical totals."
- Payment terms drift: "Invoice states Net-30. Memory shows user has corrected this to Net-45 twice in the past. Vendor consistently invoices on incorrect terms."
- Rounding pattern: "Amount is ₹47,500.00. Historical pattern for this vendor shows rounding errors of ₹0.50–₹2.00. This amount is clean, no flags."
None of these patterns are hard-coded. They develop from LLM reasoning based on the memory the agent has built up over time. This is the key benefit of agent memory in a business workflow: the agent improves at spotting vendor-specific issues without anyone needing to write new rules.
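A sketch of how retrieved memories might be joined with the invoice fields into the analyzer prompt. The memory shape (`date`, `summary`) and the prompt wording here are assumptions, not Finley's actual format:

```javascript
// Build an analyzer prompt from the extracted invoice and retrieved memories.
// Memory entries are assumed to carry { date, summary } fields.
function buildAnalyzerPrompt(extracted, memories) {
  const history = memories
    .map((m, i) => `${i + 1}. [${m.date}] ${m.summary}`)
    .join("\n");
  return [
    `Current invoice: ${JSON.stringify(extracted)}`,
    `Vendor history (${memories.length} prior interactions):`,
    history,
    "Identify duplicates, payment-terms drift, or amount anomalies.",
  ].join("\n\n");
}

const prompt = buildAnalyzerPrompt(
  { invoiceId: "INV-2025-0009", totalAmount: 47500 },
  [{ date: "2025-01-10", summary: "INV-2025-0007 for ₹47,500 approved" }]
);
```

The key point is that the history goes in as plain context; the patterns listed above fall out of the model's reasoning over it, not out of any rule.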
The Flag/Check Distinction
The analysis output produces two separate lists: flags and checks. Flags indicate problems. Checks confirm details.
```javascript
{
  flags: [
    {
      type: "potential_duplicate",
      message: "Similar invoice amount submitted 3 weeks ago (INV-2025-0007)",
      severity: "high",
      memoryBacked: true
    }
  ],
  checks: [
    { name: "Vendor registered", pass: true },
    { name: "Invoice date valid", pass: true },
    // ...
  ],
  confidence: 87
}
```
The memoryBacked field on flags is a significant design choice. It tells the decision builder, and the user, whether a flag comes from field-level validation (which is always dependable) or from memory-based pattern detection (which depends on the quality of the memory). A flag backed by 9 high-quality previous interactions is more trustworthy than a flag backed by only 1.
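One way to act on that distinction, sketched here as an assumption rather than Finley's implementation, is to discount memory-backed flags by how much history supports them:

```javascript
// Weight a flag's trustworthiness by its evidence source.
// Field-level flags count fully; memory-backed flags scale with the
// number of supporting prior interactions, capped at 1.0.
function flagWeight(flag, supportingMemories) {
  if (!flag.memoryBacked) return 1.0;
  return Math.min(supportingMemories / 5, 1.0);
}

const strong = flagWeight({ memoryBacked: true }, 9); // many interactions
const weak = flagWeight({ memoryBacked: true }, 1);   // a single interaction
```

The divisor of 5 is an arbitrary illustration; the point is that a memory-backed flag earns full weight only once enough history exists behind it.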
The Verdict Logic
buildDecision translates the analysis output into a verdict based on clear thresholds:
- Any severity: "error" flag → reject
- Any severity: "high" flag → flag (i.e., hold for review)
- Multiple severity: "medium" flags → flag
- Clean checks with no significant flags → approve
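The thresholds above can be sketched as a small, deterministic buildDecision. The field names are taken from the analysis shape shown earlier; the exact structure is an assumption:

```javascript
// Map an analysis { flags, checks, confidence } to a verdict.
// Deterministic: the same analysis always yields the same decision.
function buildDecision(analysis) {
  const { flags, checks } = analysis;
  if (flags.some((f) => f.severity === "error")) return { verdict: "reject" };
  if (flags.some((f) => f.severity === "high")) return { verdict: "flag" };
  const mediums = flags.filter((f) => f.severity === "medium").length;
  if (mediums >= 2) return { verdict: "flag" };
  // Any failed check also holds the invoice for review.
  if (checks.some((c) => !c.pass)) return { verdict: "flag" };
  return { verdict: "approve" };
}
```

Because the thresholds are plain conditionals over the analysis, the interesting judgment stays in the analyzer; this layer is trivially testable.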
The confidence score from the analyzer feeds into the result but doesn't override the decision logic. A 90% confidence duplicate flag still results in a hold; the confidence is informational, not a deciding factor.
What Doesn't Work
The current design has a real flaw: memory quality can decline if users consistently approve items that should be flagged. If an accountant approves duplicate invoices for months, the agent's memory fills with "approved" actions for duplicates. Future pattern detection will weaken because the historical signal becomes confusing.
The solution is tracking feedback quality: detecting when user actions repeatedly contradict agent recommendations and surfacing that to reviewers. We haven't built this yet, but it's the logical next step.
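That next step could be as simple as measuring the rate at which the user's final action contradicts the agent's verdict. A hypothetical sketch, not shipped code:

```javascript
// Track how often a user's final action contradicts the agent's verdict.
// A sustained high contradiction rate suggests memory quality is degrading.
function contradictionRate(history) {
  if (history.length === 0) return 0;
  const contradictions = history.filter(
    (h) => h.agentVerdict !== "approve" && h.userAction === "approved"
  ).length;
  return contradictions / history.length;
}

const rate = contradictionRate([
  { agentVerdict: "flag", userAction: "approved" },
  { agentVerdict: "flag", userAction: "approved" },
  { agentVerdict: "approve", userAction: "approved" },
  { agentVerdict: "reject", userAction: "rejected" },
]);
// A rate above some threshold (say 0.3) could trigger a reviewer warning.
```

Surfacing this number to a reviewer turns silent memory decay into a visible signal before the pattern detection quietly weakens.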
Another limitation is that memory retrieval from Hindsight returns the top 20 most relevant entries. For vendors with many invoices, those 20 might not include the specific previous duplicate that matters most. Better retrieval query design, like filtering by invoice amount range, would help.
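One way to sharpen that, sketched as a hypothetical post-filter since I haven't verified Hindsight's query options, is to re-rank the returned entries by how close their amount is to the current invoice:

```javascript
// Re-rank retrieved memory entries so those with amounts closest to the
// current invoice come first. Entries are assumed to carry an `amount` field.
function rankByAmountProximity(entries, targetAmount) {
  return [...entries].sort(
    (a, b) =>
      Math.abs(a.amount - targetAmount) - Math.abs(b.amount - targetAmount)
  );
}

const ranked = rankByAmountProximity(
  [
    { id: "A", amount: 12000 },
    { id: "B", amount: 47500 },
    { id: "C", amount: 50000 },
  ],
  47500
);
```

Even as a client-side re-rank over the top 20, this raises the odds that the one exact-amount duplicate makes it into the analyzer prompt.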
The Takeaway
Rules are necessary and straightforward. Pattern detection from memory is what truly makes the agent useful. The effective structure: run deterministic checks first, then provide the LLM with memory context to identify patterns that rules won’t catch. Keep the decision logic straightforward on both types. Also, monitor whether user feedback strengthens or harms the memory the agent relies on.
Finley is at finley-rho.vercel.app.