billkhiz
I built an AI bookkeeping agent that reached the AWS semifinals from 10,000+ entries


Every month, I sit down with bank statements from multiple clients and manually assign each transaction to the correct nominal code — a process called transaction categorisation.

It takes hours. There are 166 standard UK nominal codes, five VAT rate categories, and endless edge cases. "AMAZON MARKETPLACE" could be office supplies, stock purchases, or a personal expense depending on the client. Multiply that across hundreds of transactions per client, per month, and you start to understand why 75% of CPAs are expected to retire in the next decade with fewer graduates replacing them.

So I built LedgerAgent - an AI-powered bookkeeping agent that categorises bank transactions automatically using Amazon Bedrock. It reached the semifinals of the AWS 10,000 AIdeas competition (top ~1,000 from over 10,000 entries) in the EMEA Commercial Solutions category.

Here's how it works under the hood.

The architecture


Stack: React 19 + Express + 8 AWS services (Bedrock, DynamoDB, S3, SQS, Lambda, API Gateway, EventBridge, Cognito)

LedgerAgent uses 8 AWS services working together:

Browser (React 19)
    │
    ├── Cognito JWT auth
    │
Express Server (port 3001)
    │
    ├── Amazon Bedrock ──── Claude 3.5 Haiku (categorisation)
    │                       Claude 3.5 Sonnet (receipt OCR)
    ├── DynamoDB ─────────── Client vault (transactions, learned patterns)
    ├── S3 ───────────────── File storage (uploads, receipts, backups)
    ├── SQS ──────────────── Async job queue (large batches)
    │     │
    │     └── Lambda ─────── Serverless batch processor
    │
    ├── API Gateway ──────── REST endpoint for job status
    └── EventBridge ──────── Daily DynamoDB → S3 backup

The frontend is React 19 with Vite and Tailwind. The backend is Express running on Node.js 20. All AI inference runs through Amazon Bedrock - Claude 3.5 Haiku for transaction categorisation (fast and cheap) and Claude 3.5 Sonnet for receipt OCR (multimodal image understanding).

The key design decision was using DynamoDB as a persistent "vault" for each client. Every accounting practice manages multiple clients, and each client has their own transaction history, confirmed categorisations, and learned patterns. DynamoDB's pay-per-request billing made this economical - so I'm not paying for idle capacity between categorisation runs.
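As a rough sketch, a per-client vault item might be keyed like this (the key layout and field names are my assumption for illustration, not the actual LedgerAgent schema):

```javascript
// Hypothetical shape of a client vault item — illustrative only.
function makeClientVaultItem(clientId) {
  return {
    pk: `CLIENT#${clientId}`, // partition key: one vault per client
    sk: 'VAULT',              // sort key: single vault record
    transactions: [],         // categorised transactions
    confirmedExamples: [],    // user-confirmed categorisations (few-shot pool)
    updatedAt: new Date().toISOString(),
  };
}

const vault = makeClientVaultItem('acme-retail');
```

With pay-per-request billing, reads and writes against items like this cost fractions of a penny per categorisation run, and nothing at all between runs.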

The categorisation engine

The core of LedgerAgent is the chartOfAccounts.mjs service. It loads two data files at startup:

  • nominal_codes.json — 166 UK standard accounting codes (from 1001 Fixed Assets through to 9999 Suspense)
  • global_rules.json — 365 vendor-to-category mapping rules built from my experience coding thousands of real transactions

The system prompt establishes a UK bookkeeper persona with the full code reference. When a transaction comes in, the buildUserMessage function constructs the prompt:

// Conceptual flow — simplified
function buildUserMessage(transaction, confirmedExamples) {
  // 1. Transaction details (date, description, amount)
  // 2. Any previously confirmed categorisations for this client
  //    injected as few-shot examples
  // 3. Request structured JSON response with
  //    account_code, account_name, confidence, reasoning

  // The full prompt includes the 166 UK nominal codes
  // and 365 vendor-to-category rules as system context
}
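Made concrete, that flow might look something like this — a sketch under assumed field names, not the actual LedgerAgent implementation:

```javascript
// Hypothetical concretisation of buildUserMessage — field names
// (description, account_code, etc.) are assumptions for illustration.
function buildUserMessage(transaction, confirmedExamples) {
  const examples = confirmedExamples
    .map(ex => `"${ex.description}" → ${ex.account_code} ${ex.account_name}`)
    .join('\n');

  return [
    `Transaction: ${transaction.date} | ${transaction.description} | £${transaction.amount}`,
    confirmedExamples.length
      ? `Previously confirmed for this client:\n${examples}`
      : 'No confirmed examples for this client yet.',
    'Respond with JSON: {"account_code", "account_name", "confidence", "reasoning"}',
  ].join('\n\n');
}

const msg = buildUserMessage(
  { date: '2024-03-01', description: 'AMAZON MARKETPLACE', amount: 23.99 },
  [{ description: 'AMAZON MARKETPLACE', account_code: '7502', account_name: 'Stationery & Printing' }]
);
```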

Bedrock returns a structured JSON response with the nominal code, account name, a confidence level (high, medium, or low), and a reasoning string explaining the decision. The confidence scoring was essential - it tells me which transactions I can trust and which need manual review.
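Parsing that response is mostly trivial, but it pays to be defensive about the confidence field — a sketch (not the exact LedgerAgent code):

```javascript
// Illustrative parse of the structured response — shape as described
// in the article; the fallback behaviour is my own assumption.
function parseCategorisation(raw) {
  const result = JSON.parse(raw);
  const levels = ['high', 'medium', 'low'];
  if (!levels.includes(result.confidence)) {
    // anything unexpected is routed to manual review
    result.confidence = 'low';
  }
  return result;
}

const parsed = parseCategorisation(
  '{"account_code":"7502","account_name":"Stationery & Printing",' +
  '"confidence":"high","reasoning":"Known office-supplies vendor."}'
);
```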

Few-shot learning that actually improves over time

This is the part I'm most proud of. When I review a categorisation and confirm it's correct (or manually correct it), that decision gets saved to the client's confirmedExamples array in DynamoDB:

// Conceptual flow — the key insight is per-client learning
// When a user confirms "AMAZON MARKETPLACE → 7502 Stationery",
// that decision is stored against the client in DynamoDB.
//
// Next time we categorise for that client, confirmed examples
// are injected into the prompt as few-shot context.
//
// Max 50 examples per client, deduplicated by description.
// This means a retail client and a tech consultancy categorise
// the same vendor differently — because their confirmed
// examples are different.

The next time I categorise transactions for that same client, those confirmed examples are injected into the Bedrock prompt as few-shot context. The model sees: "Last time you saw AMAZON MARKETPLACE for this client, it was coded to 7502 Stationery & Printing."

This creates a per-client learning loop. A retail client's Amazon purchases get categorised differently from a tech consultancy's Amazon purchases - because the confirmed examples are client-specific. After confirming 20-30 transactions, accuracy jumps noticeably because the model has real context about how this particular business operates.
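The update logic behind this loop can be sketched as a pure function — deduplicate by description, cap at 50, newest first (my own minimal sketch of the behaviour described above):

```javascript
// Sketch of the per-client learning update — dedupe by description,
// cap at 50 examples. Not the exact LedgerAgent code.
function addConfirmedExample(examples, confirmed, max = 50) {
  // a fresh confirmation for the same description replaces the old one
  const filtered = examples.filter(
    ex => ex.description !== confirmed.description
  );
  // newest first, so the cap drops the oldest examples
  return [confirmed, ...filtered].slice(0, max);
}
```

Replacing rather than appending matters: if I recode a vendor, the model should only ever see the latest decision for that description.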

Handling real-world bank statements

UK bank CSVs are a mess. Every bank uses different column names, different date formats, and different ways of representing debits and credits. The csvParser.mjs service handles this with intelligent column detection:

// Simplified from csvParser.mjs
function detectColumns(headers) {
  const map = {};
  headers.forEach((h, i) => {
    const lower = h.toLowerCase().trim();
    if (/date|trans.*date|posted|value.*date/.test(lower)) map.date = i;
    if (/description|narrative|details|memo|payee/.test(lower)) map.desc = i;
    if (/^amount$|^value$|^sum$|^total$/.test(lower)) map.amount = i;
    // \b guards stop bare "dr"/"cr" matching inside words like "description"
    if (/debit|\bdr\b|money.*out|paid.*out/.test(lower)) map.debit = i;
    if (/credit|\bcr\b|money.*in|paid.*in/.test(lower)) map.credit = i;
  });
  return map;
}

It handles three different amount formats: a single amount column (negative for debits), separate debit and credit columns, and amounts with pound signs and comma formatting. This means I can upload a Lloyds statement, a Barclays statement, and an HSBC statement without any manual configuration.
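The normalisation step for those three formats can be sketched as a single helper (an assumed helper for illustration, not the actual csvParser.mjs code):

```javascript
// Sketch of amount normalisation for the three formats described above.
// Convention assumed here: money out is negative, money in is positive.
function parseAmount(row, map) {
  // strip pound signs, commas, and whitespace before parsing
  const clean = s => parseFloat(String(s).replace(/[£,\s]/g, '')) || 0;
  if (map.debit !== undefined && map.credit !== undefined) {
    // separate debit/credit columns: debits become negative
    return clean(row[map.credit]) - clean(row[map.debit]);
  }
  // single signed amount column
  return clean(row[map.amount]);
}
```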

Batched processing with concurrency control

For large bank statements (100+ transactions), hitting Bedrock sequentially would take minutes. LedgerAgent uses a parallel worker pool with concurrency of 3:

// Conceptual flow — concurrency-controlled batch processing
// Transactions are processed in parallel chunks (concurrency of 3)
// to balance speed against Bedrock rate limits.
//
// For batches over 100 transactions, the async pipeline kicks in:
// Express → SQS queue → Lambda picks up job → Bedrock AI → DynamoDB
// Frontend polls API Gateway for completion status.

For even larger batches, the async pipeline kicks in - transactions get sent to SQS, picked up by a Lambda function, processed against Bedrock, and results are written back to DynamoDB. The frontend polls for completion via API Gateway.
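A minimal version of the concurrency-controlled pool looks like this — groups of three run in parallel, groups run sequentially (a sketch of the approach, not the actual LedgerAgent code):

```javascript
// Split items into groups of `size`.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Run `worker` over all items with at most `concurrency` in flight
// per group — balancing throughput against Bedrock rate limits.
async function mapWithConcurrency(items, worker, concurrency = 3) {
  const results = [];
  for (const group of chunk(items, concurrency)) {
    results.push(...await Promise.all(group.map(worker)));
  }
  return results;
}
```

For 100 transactions at concurrency 3 this means ~34 sequential rounds instead of 100 — roughly a 3x wall-clock improvement over fully sequential calls.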

Double-entry done right

One thing that surprised me during development: most "AI bookkeeping" demos I've seen online produce a single-entry list of categorised transactions. That's not bookkeeping - it's just labelling. Real bookkeeping requires double-entry, where every transaction creates two ledger entries that must balance.

In LedgerAgent, the bank account (nominal code 1200) acts as the contra account for every transaction:

| Transaction type | Bank account (1200) | Categorised account |
| --- | --- | --- |
| Money out | Credit | Debit |
| Money in | Debit | Credit |

The trial balance splits automatically at the code 4000 boundary - codes below 4000 go on the Balance Sheet (assets, liabilities, equity), codes 4000 and above go on the Profit & Loss (income, expenses). Total debits must always equal total credits.

This sounds basic to anyone with accounting training, but getting an AI system to consistently produce balanced double-entry output required careful prompt engineering and validation logic.
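The validation side of this is straightforward to sketch: generate both legs with the bank as contra account, check that debits equal credits, and split at the 4000 boundary (a minimal sketch of the rules above, not the actual LedgerAgent code):

```javascript
// Double-entry posting generation: bank (1200) is the contra account.
// Convention assumed: negative amount = money out.
function toPostings(tx) {
  const amount = Math.abs(tx.amount);
  const moneyOut = tx.amount < 0;
  return [
    { code: '1200', debit: moneyOut ? 0 : amount, credit: moneyOut ? amount : 0 },
    { code: tx.account_code, debit: moneyOut ? amount : 0, credit: moneyOut ? 0 : amount },
  ];
}

// Trial balance check: total debits must equal total credits.
function trialBalanceBalances(postings) {
  const debits = postings.reduce((s, p) => s + p.debit, 0);
  const credits = postings.reduce((s, p) => s + p.credit, 0);
  return Math.abs(debits - credits) < 0.005; // tolerate float rounding
}

// Codes 4000+ go to P&L; below 4000 go to the Balance Sheet.
const isPnL = code => parseInt(code, 10) >= 4000;
```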

What I learned


Key takeaways:
  • Domain knowledge is the moat - not the AI wrapper
  • Few-shot learning beats fine-tuning when per-client variation is high
  • Confidence scoring changes the entire review workflow

Domain knowledge is the moat. The 166 nominal codes, 365 vendor rules, VAT rate handling, and double-entry logic aren't things you can prompt-engineer from scratch. They come from years of sitting with bank statements. Any developer can connect to Bedrock — few can tell you that a Deliveroo transaction for a sole trader should be coded to 7901 (Staff Welfare) not 7400 (Travel & Subsistence) unless it was a client entertainment expense, in which case it's 7601 (Entertaining).

Few-shot learning beats fine-tuning for this use case. I considered fine-tuning a model on accounting data, but the per-client variation is too high. A retail business and a tech consultancy categorise the same vendors completely differently. Dynamic few-shot context from confirmed examples handles this naturally.

Confidence scoring changes the workflow. Without confidence scores, you'd have to review every single categorisation. With them, I can filter to "low confidence" transactions and review only the 10-15% that genuinely need human judgement. The rest can be confirmed in bulk.

The numbers

  • 166 UK nominal codes mapped
  • 365 vendor-to-category rules
  • 5,860 lines of code across 39 source files
  • 8 AWS services integrated
  • Top ~1,000 from 10,000+ entries in AWS AIdeas

LedgerAgent is currently a tool I use for my own practice, but I'm planning to open it up to other small accountancy firms. If you're an accountant drowning in manual transaction categorisation, or a developer building fintech tools, I'd like to hear from you.

Connect with me on X/Twitter to discuss AI in Fintech!

If you're interested in the code or want to connect, check out the repository and my profile:

Check out my GitHub Profile


Built with React 19, Express, Amazon Bedrock (Claude 3.5 Haiku + Sonnet), DynamoDB, S3, SQS, Lambda, API Gateway, EventBridge, and Cognito.
