I wanted to know whether an AI coding agent could build something real instead of a prototype, so I gave it the least forgiving problem I could find. Accounting. Turning Stripe's webhook events into proper double-entry bookkeeping you could hand to QuickBooks or Xero.
Accounting is a good test because there's nowhere to hide. The debits equal the credits or the entry is wrong. An accountant either recognizes what you booked or they don't. There is a right answer and you can't argue your way to it.
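The balance rule is mechanical enough to check in a few lines. Here is a minimal sketch in TypeScript, assuming a made-up `JournalLine` shape and made-up amounts; none of these names come from Ledgerly itself.

```typescript
// Hypothetical types for illustration; not Ledgerly's actual data model.
type Side = "debit" | "credit";

interface JournalLine {
  account: string;
  side: Side;
  amountMinor: number; // integer minor units (e.g. cents) to avoid float drift
}

// Returns true when total debits equal total credits.
function isBalanced(lines: JournalLine[]): boolean {
  let debits = 0;
  let credits = 0;
  for (const line of lines) {
    if (!Number.isInteger(line.amountMinor) || line.amountMinor < 0) {
      throw new Error(`Invalid amount on ${line.account}: ${line.amountMinor}`);
    }
    if (line.side === "debit") {
      debits += line.amountMinor;
    } else {
      credits += line.amountMinor;
    }
  }
  return debits === credits;
}

// A 100.00 charge with a 3.20 processing fee (illustrative numbers only).
const chargeEntry: JournalLine[] = [
  { account: "Stripe Balance", side: "debit", amountMinor: 9680 },
  { account: "Processing Fees", side: "debit", amountMinor: 320 },
  { account: "Revenue", side: "credit", amountMinor: 10000 },
];
```

A check like this only proves the entry balances. It says nothing about whether the right accounts were used, which is the part that needs an accountant.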
Two weeks later I have Ledgerly. It's on npm, it's Apache-2.0, and I'm not fully sure every entry is correct, which is the reason I'm writing this.
Most days I gave Claude a single instruction and let it pick the most useful next thing to build. It read what already existed, chose something, and implemented it. I made the calls at the real forks, approved or rejected the accounting decisions, and pushed back when an entry looked off. I didn't write the code. My job was deciding what to build and checking whether it was right.
Fifteen releases later it does more than I expected going in:
- 13 Stripe event types across charges, refunds, disputes, payouts, and invoices.
- Revenue recognition. Pay a year up front and it books the cash to deferred revenue, then recognizes it across twelve months. The monthly pieces reconcile back to the original amount exactly.
- Sales tax kept as its own liability instead of mixed into revenue, and refunds draw it back down proportionally.
- Currency handling. Entries post in whatever currency your Stripe balance settled in, and it records a realized gain or loss when the rate moved between a charge and a later refund.
- Exporters for QuickBooks Online and Xero.
- A webhook server with signature verification, dedup, SQLite storage, a retry queue that dead-letters what keeps failing, and OAuth for both platforms.
- A Docker image and an npm package, both with signed build provenance.
- 589 tests.
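To make the revenue recognition item concrete: the reason the monthly pieces can reconcile exactly is that the split is done in integer minor units, with the leftover cents handed out one at a time. The sketch below is one common way to do that, under the assumption of equal monthly periods. The function name is hypothetical, and Ledgerly may allocate the remainder differently (for example, by days in each month).

```typescript
// Hypothetical helper: split a prepaid amount into equal monthly pieces
// that sum back to the original total exactly.
function recognitionSchedule(totalMinor: number, months: number): number[] {
  if (!Number.isInteger(totalMinor) || totalMinor < 0) {
    throw new Error(`totalMinor must be a non-negative integer, got ${totalMinor}`);
  }
  if (!Number.isInteger(months) || months <= 0) {
    throw new Error(`months must be a positive integer, got ${months}`);
  }
  const base = Math.floor(totalMinor / months);
  const remainder = totalMinor % months;
  // The first `remainder` months each carry one extra minor unit,
  // so no cent is lost to rounding.
  return Array.from({ length: months }, (_, i) => base + (i < remainder ? 1 : 0));
}

// 100.00 paid up front for a year: four months of 8.34, then eight of 8.33.
const schedule = recognitionSchedule(10000, 12);
const recognizedTotal = schedule.reduce((sum, piece) => sum + piece, 0);
```

Each monthly piece would move that amount from deferred revenue to revenue, and `recognizedTotal` equals the original 10000.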
The moment that stuck with me was about payouts. When Stripe pays you into a bank account held in a different currency than your balance, it converts the money and takes a fee, and the bookkeeping for that fee is genuinely fiddly. I expected the agent to wing it, since making things up is the failure everyone warns about. It didn't. It told me it couldn't model that case correctly without real example payloads, wrote up what it would need to see, and left the code refusing those payouts with a clear error rather than posting a number it couldn't defend. I've worked with people who would have shipped the guess.
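The shape of that refusal is simple to sketch. The version below is an illustration only: `PayoutLike`, its `balanceCurrency` field, and the account names are assumptions made for this example, not Stripe's payout object or Ledgerly's code. The point is that the unsupported case fails loudly instead of posting a guessed number.

```typescript
// Hypothetical, simplified payout shape. Real Stripe payout objects carry
// more fields, and `balanceCurrency` is an assumed field for illustration.
interface PayoutLike {
  id: string;
  amountMinor: number;     // integer minor units
  currency: string;        // currency the bank account receives
  balanceCurrency: string; // currency of the Stripe balance paid out from
}

interface PayoutLine {
  account: string;
  side: "debit" | "credit";
  amountMinor: number;
}

function payoutToLines(payout: PayoutLike): PayoutLine[] {
  if (payout.currency.toLowerCase() !== payout.balanceCurrency.toLowerCase()) {
    // Refuse rather than guess at the conversion fee.
    throw new Error(
      `Payout ${payout.id} converts ${payout.balanceCurrency} to ${payout.currency}; ` +
        "cross-currency payouts are not supported without real example payloads",
    );
  }
  // Same-currency payout: cash moves from the Stripe balance to the bank.
  return [
    { account: "Bank", side: "debit", amountMinor: payout.amountMinor },
    { account: "Stripe Balance", side: "credit", amountMinor: payout.amountMinor },
  ];
}
```

A caller can catch the error and route the event to manual review, so nothing unverified reaches the books.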
Is the accounting right? I think it mostly is. The entries balance, the recognition schedule reconciles, there's a test behind every case, and there's a document in the repo that explains the reasoning for each entry so you can check the logic without reading the code. But me being fairly sure is not the same as an accountant confirming it, and I won't pretend it is.
That's the ask. If you keep books for a SaaS, or you know the corners of the Stripe API, or you just like finding the thing that's broken, go break it. It's `npm install ledgerly`, and the core is small enough to read in one sitting.
What surprised me most had nothing to do with the code. Writing it was never the constraint. The constraint was knowing what to ask for, and knowing what correct looked like in a field where I'm not the expert. The agent produced every line, but I still had to decide what a refund does to a tax liability and whether revenue recognizes monthly, and some of the most useful moments were when the right answer was to admit I didn't know yet. I'd rather you check it than take my word for it.
Top comments (2)
The constraint being domain knowledge rather than code is the real observation here. I've been building a personal finance app with AI as the coding engine for the past year and the pattern holds exactly - Claude can implement anything you can describe with precision. The bottleneck is whether you can describe it.
Financial calculations have a similar nowhere-to-hide property. Either the available balance calculation is right or someone wonders where their money went. The moments where the agent pushed back or asked for more specification before committing turned out to be the most valuable signals. An agent that confidently produces a wrong number is far more dangerous than one that says it needs real examples first.
Good luck with the audit - the currency conversion edge cases sound like the right thing to stress test.
I completely agree on the pushback being the signal. The refusals actually raised my confidence over time, and the entries that scared me were the ones that looked clean and were quietly wrong, with no error to catch them.
The year on a finance app is what I'd want to hear more about. How do you keep the numbers honest over time? I could lean on Stripe's fixtures and a test per case, but a personal finance app doesn't have that same ground truth to check against.