arun rajkumar

Posted on May 28

AI Agents Are Great at 80% of Our Code. The Other 20% Is Why We Still Need Seniors.

#ai #startup #webdev #discuss

Fintech state logic and engineering judgment

We let AI agents loose on a payment platform. They crushed the boring stuff. Then they silently broke the stuff that matters.

A survey came out last week reporting that 54% of all code is now AI-generated, up from 28% last year.

I read that number and thought: yeah, that tracks. We're probably in that range too.

But here's the thing nobody's asking — which 54%?

Not all code carries equal weight. A CRUD endpoint for fetching merchant details? Low risk. The webhook handler that transitions a payment from pending to complete? That's someone's rent. Someone's payroll. Get that wrong and money moves where it shouldn't, or worse, money doesn't move at all.

I'm the CTO of a payment platform. FCA-authorised, processing real money, real merchants, real consequences. We run NestJS microservices, Docker, Traefik — the usual stack. And we've been using AI agents aggressively for over a year now.

I'm not here to tell you AI is dangerous. It's not.

I'm here to tell you it's dangerous when you forget what it's actually good at.

The 80% Where AI Agents Are Genuinely Brilliant

Let me give credit where it's due. AI agents have made our team faster in ways that would have seemed absurd two years ago.

API scaffolding. Generating service boilerplate. Writing Zod validation schemas. Spinning up new endpoints. Creating test stubs. Refactoring imports. Migrating patterns across repos.

We run multiple microservices. When we need a new service, an agent can scaffold the entire thing — module structure, base configuration, Docker setup, Traefik labels — in minutes. What used to be a half-day of copy-paste-and-tweak is now a conversation.

When we overhauled our env management across all repos, AI agents did the grunt work. They mapped every .env file, found naming conflicts, identified common variables, and generated a unified Zod schema. What would have taken a team days of grep-and-spreadsheet work took hours.

For this 80% of the codebase — the predictable, pattern-following, structurally repetitive code — AI agents are the best junior developers money can buy. Tireless. Cheap. No ego. Almost never make a mistake on the stuff they're good at.

An army of juniors sitting at your terminal.

Then You Hit the Other 20%

Here's where it gets interesting.

We had an agent build out a webhook handler. Webhooks in payments are critical — they're how you know a payment succeeded, failed, or needs attention. The agent wrote the handler. It looked clean. Tests passed.

But it silently ignored the edge cases.

Status transitions have rules. A payment can go from pending to complete. It cannot go from complete back to pending. When a human developer builds this, they think about the illegal transitions because they've seen what happens when money moves backwards. They build the guard because they've felt the pain of not having it.

The agent didn't care about that. It built the happy path beautifully and treated the edge cases like they didn't exist.

When we do this work manually, this type of error never happens. A senior developer who has worked in payments for years doesn't forget the impossible transitions. It's not in their code — it's in their bones.
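To make the guard concrete, here is a minimal sketch of a transition table, not our production handler. The `failed` status and the names `ALLOWED_TRANSITIONS` and `transition` are assumptions for illustration; the only rule taken from above is that pending can move to complete and complete can never move back.

```typescript
// Hypothetical status names: the post only names "pending" and "complete".
type PaymentStatus = "pending" | "complete" | "failed";

// Every status lists the statuses it may move to. Terminal statuses list none.
const ALLOWED_TRANSITIONS: Record<PaymentStatus, readonly PaymentStatus[]> = {
  pending: ["complete", "failed"],
  complete: [],
  failed: [],
};

// Returns the new status, or throws if the move is not in the table.
function transition(from: PaymentStatus, to: PaymentStatus): PaymentStatus {
  if (!ALLOWED_TRANSITIONS[from].includes(to)) {
    throw new Error(`Illegal payment transition: ${from} -> ${to}`);
  }
  return to;
}
```

The point of the table is that the illegal moves are rejected by default. Anything not listed is refused, so a forgotten case fails closed instead of silently moving money backwards.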

The Pattern I Keep Seeing

This isn't a one-off. After months of working with AI agents on a regulated payment stack, one pattern is consistent:

AI agents optimise for completion, not correctness.

They want to finish the feature. Get to the green checkmark. And to get there efficiently, they take shortcuts that look reasonable on the surface.

The agent builds what should happen. It rarely builds what should not happen. In payments, the negative cases are where all the real risk lives. What happens when a webhook arrives twice? What happens when a refund is requested on an already-refunded transaction? What happens when the bank returns an unexpected status code? The agent doesn't think about any of that unless you explicitly tell it to.
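Two of those questions, the duplicate webhook and the unexpected status, can be sketched in a few lines. This is an illustrative in-memory version under stated assumptions: the event shape, the `eventId` field, the status names and the `WebhookProcessor` class are all hypothetical, and a real handler would need a durable store with a uniqueness constraint instead of a `Set`.

```typescript
// Hypothetical event shape; a provider-supplied unique eventId is an assumption.
interface WebhookEvent {
  eventId: string;
  paymentId: string;
  status: string; // raw status string as received
}

type Outcome = "applied" | "duplicate" | "unknown_status";

// Assumed set of statuses the handler understands.
const KNOWN_STATUSES = new Set(["pending", "complete", "failed"]);

class WebhookProcessor {
  private readonly seenEventIds = new Set<string>();
  readonly applied: WebhookEvent[] = [];

  handle(event: WebhookEvent): Outcome {
    // Negative case 1: the same notification delivered more than once.
    if (this.seenEventIds.has(event.eventId)) return "duplicate";
    // Negative case 2: a status we do not recognise is parked, never guessed at.
    if (!KNOWN_STATUSES.has(event.status)) return "unknown_status";
    this.seenEventIds.add(event.eventId);
    this.applied.push(event);
    return "applied";
  }
}
```

Delivering the same event twice returns `"applied"` and then `"duplicate"`, and the payment state is only touched once.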

Then there's the reusability problem. We have shared utility packages. Helper functions. Common patterns that the team has standardised on over years. The agent doesn't care. It writes its own version from scratch. It works, but now you have two implementations of the same logic — one tested and trusted in production, one freshly generated and untested. The agent is focused on completing this feature, not maintaining the architecture.

And the subtlest one — agents seem to optimise for fewer back-and-forth turns. It looks like they're saving cost, saving context. Complex validation? Skip it, the basic case works. Error handling for a rare edge case? Not worth the tokens. The result is code that passes every test you wrote but fails on the scenarios you didn't think to test — because those are exactly the scenarios the agent also didn't think about.

Juniors Don't Ship Products. They Write Code.

Here's the frame that made this click for me.

Claude — or any coding agent — is the best junior developer money can buy. An army of juniors. Tireless, cheap, no ego, near-zero error rate on routine work.

But juniors don't ship products. They write code.

The difference between code and a product is judgment. Knowing which transitions are illegal. Knowing that the retry logic has a specific backoff curve because you've been burned by what happens when it doesn't. Knowing that the webhook handler needs idempotency because banks sometimes send the same notification three times.
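For readers who haven't written one, a backoff curve is small enough to show. This is a generic capped exponential backoff, not our actual retry policy: the `BackoffPolicy` shape and every number in the usage note are placeholders chosen for illustration.

```typescript
// Placeholder policy shape; the real values are a judgment call someone has to own.
interface BackoffPolicy {
  baseMs: number;      // delay before the first retry
  factor: number;      // multiplier applied on each further retry
  maxMs: number;       // ceiling on any single delay
  maxAttempts: number; // retries allowed before giving up
}

// attempt is 1-based. Returns the delay in ms, or null when retries are exhausted.
function backoffDelayMs(attempt: number, policy: BackoffPolicy): number | null {
  if (attempt < 1 || attempt > policy.maxAttempts) return null;
  const uncapped = policy.baseMs * Math.pow(policy.factor, attempt - 1);
  return Math.min(uncapped, policy.maxMs);
}
```

With `{ baseMs: 1000, factor: 2, maxMs: 30000, maxAttempts: 6 }`, the delays run 1000, 2000, 4000, 8000, 16000 and then 30000 (capped), after which the function returns `null`. Production retry logic commonly adds random jitter on top; it is left out here to keep the curve readable.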

That knowledge doesn't come from training data. It comes from years of operating a system, debugging at 2am, explaining to a merchant why their settlement was delayed.

The most dangerous mistake a CTO can make in 2026 is buying AI to replace senior engineers. The right move is buying AI to enable them.

Replace your senior with AI? You get speed plus silent disasters.

Enable your senior with AI? You get an architect with an army.

What We Actually Do About It

I'm not writing this to complain about AI. I'm writing this because we've built a system that works, and it might help you too.

The first thing we did was make our architecture machine-readable. We extract design patterns and architecture rules into formats that agents can consume. When an agent works on our codebase, it doesn't just see code — it sees boundaries, patterns, rules about what belongs where. Not documentation nobody reads. Lints and constraints that the agent can't ignore.
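As a rough idea of what "machine-readable" can mean, here is a sketch of a boundary rule expressed as data plus a checker. The rule format, the paths and the `findViolations` function are hypothetical and are not the format we use; the import edges would come from whatever tool parses your code.

```typescript
// Hypothetical rule format and paths, for illustration only.
interface BoundaryRule {
  importer: string;  // path prefix of the importing code
  forbidden: string; // path prefix it must not import from
  reason: string;    // the "why", kept next to the rule
}

interface ImportEdge {
  file: string;   // file containing the import
  target: string; // resolved path of the imported module
}

const RULES: readonly BoundaryRule[] = [
  {
    importer: "src/api/",
    forbidden: "src/ledger/internal/",
    reason: "API handlers must go through the ledger's public service",
  },
];

// Returns one message per import that crosses a forbidden boundary.
function findViolations(
  edges: readonly ImportEdge[],
  rules: readonly BoundaryRule[],
): string[] {
  const messages: string[] = [];
  for (const edge of edges) {
    for (const rule of rules) {
      if (edge.file.startsWith(rule.importer) && edge.target.startsWith(rule.forbidden)) {
        messages.push(`${edge.file} imports ${edge.target}: ${rule.reason}`);
      }
    }
  }
  return messages;
}
```

Because the rule is data, the same file can fail a CI check and be handed to the agent as context, and the reason travels with the rule.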

Then we invested heavily in testing the negative cases. Every PR — human or AI — runs through the same suite. But we specifically built tests for the stuff agents skip: illegal state transitions, duplicate webhook handling, idempotency checks. If the agent silently drops a negative case, the tests catch it before it ships.
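A negative-case suite can be as plain as a table of inputs that must be rejected. The sketch below uses refunds as the example; the `Payment` shape, `applyRefund` and the case list are assumptions made up for illustration, not our test suite.

```typescript
// Hypothetical payment shape; amounts are whole pence to avoid float rounding.
interface Payment {
  id: string;
  amountPence: number;
  refundedPence: number;
}

function applyRefund(payment: Payment, refundPence: number): Payment {
  if (!Number.isInteger(refundPence) || refundPence <= 0) {
    throw new Error("Refund must be a positive whole number of pence");
  }
  if (payment.refundedPence + refundPence > payment.amountPence) {
    throw new Error("Refund exceeds the amount still refundable");
  }
  return { ...payment, refundedPence: payment.refundedPence + refundPence };
}

interface NegativeCase {
  name: string;
  payment: Payment;
  refundPence: number;
}

// Each entry describes something that must NOT be allowed to happen.
const NEGATIVE_CASES: readonly NegativeCase[] = [
  { name: "refund on an already-refunded payment", payment: { id: "p1", amountPence: 500, refundedPence: 500 }, refundPence: 1 },
  { name: "refund larger than the original amount", payment: { id: "p2", amountPence: 500, refundedPence: 0 }, refundPence: 501 },
  { name: "zero refund", payment: { id: "p3", amountPence: 500, refundedPence: 0 }, refundPence: 0 },
  { name: "negative refund", payment: { id: "p4", amountPence: 500, refundedPence: 0 }, refundPence: -100 },
];

// Returns the names of cases that were wrongly accepted; empty means every guard held.
function wronglyAccepted(cases: readonly NegativeCase[]): string[] {
  const accepted: string[] = [];
  for (const c of cases) {
    try {
      applyRefund(c.payment, c.refundPence);
      accepted.push(c.name);
    } catch (e) {
      // Expected: the guard rejected the refund.
    }
  }
  return accepted;
}
```

If an agent rewrites `applyRefund` and drops a guard, the name of the skipped case shows up in the returned list, which is a far better signal than a happy-path test that still passes.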

And seniors still review everything that touches money. No AI-generated payment logic ships without a senior looking at it. Not because we don't trust AI — because we know exactly where it's blind. The review isn't checking syntax. It's checking judgment. Did the agent handle the ambiguous bank status? Did it respect our existing retry logic? Did it use the shared utility or reinvent the wheel?
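The "touches money" rule is easier to hold when it is mechanical. Below is a minimal sketch of path-based review routing; the path prefixes and the `reviewTierFor` function are hypothetical, and a real setup would likely live in CI or code-owner configuration.

```typescript
type ReviewTier = "senior" | "standard";

// Hypothetical path prefixes for code that can move money or change payment state.
const MONEY_PATH_PREFIXES: readonly string[] = [
  "src/payments/",
  "src/webhooks/",
  "src/settlement/",
  "src/refunds/",
];

// Diff size is deliberately not an input: a three-line change in a money path
// still routes to a senior.
function reviewTierFor(changedPaths: readonly string[]): ReviewTier {
  const touchesMoney = changedPaths.some((path) =>
    MONEY_PATH_PREFIXES.some((prefix) => path.startsWith(prefix)),
  );
  return touchesMoney ? "senior" : "standard";
}
```

A change to `src/merchants/merchant.controller.ts` alone routes to `"standard"`, while any change set that includes a file under `src/webhooks/` routes to `"senior"`, whoever or whatever wrote it.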

This problem bothered me enough that I started building Bodhi Orchard — an open-source agentic development framework. The core idea: don't just let agents write code. Feed them the full context — architecture, design patterns, test plans, existing utilities — so they stop making the same blind-spot mistakes. Human decisions over human busywork, with guardrails that actually enforce quality.

The Real Question for 2026

The survey says 54% of code is AI-generated. I believe it.

But here's my question: what percentage of bugs in 2026 will be AI-generated?

And more importantly — who's going to find them?

Not the agents. They wrote the bugs in the first place. Not the juniors — they won't know enough to spot what's missing.

It's going to be the seniors. The architects. The people who've operated these systems long enough to know where the bodies are buried.

The 80% is solved. AI won. Celebrate that.

Now invest in the humans who understand the other 20%. Because that's where your product lives or dies.

I'm Arun, CTO & Co-Founder of Atoa — a UK open banking payment platform. I write about what it's actually like to build fintech with AI, not what the conference slides say it's like. If this resonated, follow me here or on X @mickyarun.

And if you're curious about building AI-native development with proper guardrails, check out Bodhi Orchard.

Top comments (61)

Mykola Kondratiuk • May 29

the right split isn't complexity - it's blast radius. AI fails on the paths where wrong code has externally visible consequences. your webhook handler nails it: same to write, completely different stakes if broken.

arun rajkumar • Jun 3

Blast radius is the better framing, you're right. We've actually started using exactly that language internally when routing work — not "is this complex?" but "what breaks if this is wrong?" A CRUD endpoint and a webhook handler are the same complexity to write. The difference is that one quietly corrupts payment state and the other doesn't. That asymmetry is what makes the 80/20 split so deceptive.

Mykola Kondratiuk • Jun 3

the CRUD-vs-webhook example is exactly it — same complexity, different blast radius. once you start routing by what breaks externally, you also notice that AI failures cluster on those external-consequence paths, not the complex internal ones. that asymmetry is worth building into your review criteria explicitly.

arun rajkumar • Jun 6

Yes — and the clustering is the useful bit: AI failures don't spread evenly, they pile up on the external-consequence paths, exactly where you can least afford them. That's the argument for routing review by blast radius instead of diff size. Anything that touches money gets a senior's eyes regardless of how "small" the change looks. Good addition.

Mykola Kondratiuk • Jun 6

the clustering pattern is what finally convinced me to retire the 'review every AI change' rule - if failures aren't random, blanket review is the wrong tool. route to where the risk actually pools.

Andrii Krugliak • May 29

"Which 54%?" is the question the headline number always hides. A CRUD endpoint and a payment-state webhook are not the same risk, but the stat treats them as one. The 20% that needs a senior is exactly the part where a confident wrong answer moves money the wrong way.

arun rajkumar • Jun 3

Exactly. The headline number is seductive but meaningless without weighting by consequence. We could probably get to 90% AI-generated if we counted by lines. But the 10% that handles payment state transitions, retry logic, and settlement timing is worth more than the other 90% combined. The stat treats a login form and a refund handler as equal. They're not.

Andrii Krugliak • Jun 4

Weighting by consequence is the only honest way to read that number. Lines of code makes a settlement webhook look the same as a tooltip, and that webhook is the part you can't hand off. I'd rather see it reported as percent of risk automated than percent of code.

arun rajkumar • Jun 6

"Percent of risk automated" instead of "percent of code" — I'm stealing that. A settlement webhook and a tooltip are one line each and worlds apart in blast radius, and every "54% of code is now AI" headline flattens exactly that distinction. The number that would actually mean something is how much of the risky surface you've automated and still sleep at night. Spot on.

Andrii Krugliak • Jun 9

Risk-weighted is the only honest read. A settlement webhook and a tooltip are one line each on the diff and worlds apart at 2am when one of them is down. The number I actually trust is how much of the scary surface you handed off and can still sleep through.

arun rajkumar • Jun 10

"Surface you handed off and can still sleep through" — that's the metric. We talk about it as blast radius, not line count: a diff that can't move money or leak data can ship on a junior's say-so; a diff that touches settlement gets a senior even if it's three characters. The honest org chart isn't seniority by years, it's who's allowed near the scary surface. The trap is teams that measure AI adoption by % of code merged and never look at which 20% it was.

Andrii Krugliak • Jun 10

Blast radius over line count is exactly right. We ended up baking it in: anything that can move money or touch user data goes to a stricter agent tier even when the diff is three lines. Percent-merged is a vanity number that hides which 20% actually shipped.

arun rajkumar • Jun 15

Baking it into the agent tier is exactly the right move — and the fact that the diff can be three lines makes it more important, not less. Small diffs in the wrong place are the ones that slip through review because they look harmless. We do something similar: the routing decision isn't based on file size or complexity, it's based on whether the change touches a state transition or a financial record. Those paths have their own review gate regardless of how many characters changed.

Andrii Krugliak • Jun 17

The state-transition trigger is sharper than mine. I gate on "money or user data," which is really a proxy for the same thing, but yours catches the quiet state bug that touches neither and still breaks everything downstream. Stealing that.

arun rajkumar • Jun 2

A lot of you asked the same question in the comments: how do you actually measure that 20% when you're hiring?

I wrote the sequel. It covers how we flipped our interview, why we stopped asking candidates to write code from scratch, and a design thinking challenge I'd love your take on.

How We Hire for the 20% AI Can't Do (And Why We Stopped Asking Candidates to Code From Scratch)

Varsha Ojha • May 28

That 20% is where the real engineering judgment sits. AI can generate a lot of code, but seniors are still needed for tradeoffs, architecture, edge cases, security, and knowing when the “working” solution will become a future problem.

arun rajkumar • May 29

Spot on. The part that catches most teams off guard is your last point — knowing when a working solution becomes a future problem. AI agents will happily generate a solution that passes every test today but creates a coupling that makes the next feature impossible. That's the judgment call that still needs a human with context.

Varsha Ojha • May 29

Exactly. Technical debt rarely looks like debt when it's created. Most of the time it looks like a fast win, which is why experience matters. Someone has to think about the second and third order effects, not just whether the code works today.

Adam Lewis • May 28

The line about illegal transitions sitting in the senior's bones is the one I keep coming back to. What's worked for us is treating those exact rules as the highest-value tests - the failing case that proves the impossible transition still throws, the contract test that catches the duplicate webhook. The senior still reviews, but the same blind spot doesn't slip past twice. The catch is that negative cases catch nothing day-to-day, so you only find out the agent skipped them when something goes wrong in prod, which on a payments stack is too late.

arun rajkumar • May 29

This is exactly the approach we've landed on too. We call them "scar tests" internally — every time a senior catches something an agent missed, that specific scenario becomes a permanent test. The agent still does the bulk work, but the test suite encodes the team's institutional memory. Over time, the blind spots shrink. Not because the agent gets smarter, but because the guardrails get sharper.

Adam Lewis • May 29

"Scar tests" - I might steal that :)

The human would still check and find issues, but the agent would catch the regression the next time around. Over time you'd end up with a test suite that's basically a record of every mistake the team has ever had to fix, which is one of the best things you can hand a new agent or a new joiner.

prickles.org/tenet/living-document...

Scarab Systems • May 30

“Scar tests” is a great phrase, but I wonder if the unit should be a little broader than tests.

Every scar probably needs to become part of the repo’s memory, but not every scar should become another test. Some mistakes should become tests, yes. Others are better captured as boundary rules, diagnostic checks, ownership constraints, repair patterns, or notes about what the agent must not normalize as baseline.

Otherwise the test suite itself can become a drift surface: every past mistake gets encoded as another assertion, the agent starts optimizing around the tests, and the repo slowly accumulates verification bloat.

The deeper idea, to me, is that scars should become governed signals. The repo should remember what hurt it before, but it should choose the right enforcement surface instead of turning every wound into another test.

Adam Lewis • May 31

Fair point. A test is the easiest thing to add so it ends up doing too much of the work. A lint rule for the kind of thing the agent keeps proposing does the same job without making the suite bigger. The bit where you catch it is the same either way, someone spots it and the team agrees it shouldn't happen again, but the fix doesn't have to be a test.

prickles.org/tenet/linter-as-law/TA1

xulingfeng • May 30

The 80/20 split is real — and the hard part isn't the 20%, it's knowing which 20% you're in before you ship. We've started routing every AI-generated diff through a cheap local model review gate that flags "suspicious confidence" (clean code that subtly breaks edge cases). Caught 3 leaks and 2 race conditions last sprint alone. Do you run any automated review on the AI-generated parts or just eyeball them?

arun rajkumar • Jun 2

We do both. Automated: every PR runs through our standard test suite plus what we call "scar tests" — specific edge cases we've caught before. But we also have architecture lints that check whether the agent used existing shared utilities or reinvented them, and schema validation that catches impossible state transitions at compile time. Manual: any code that touches money movement gets a senior review, non-negotiable. The automated layer catches about 80% of agent mistakes. The senior review catches the 20% that requires judgment about intent, not just correctness.

xulingfeng • Jun 2

scar tests + architecture lints is a solid combo — especially catching when the agent reinvents existing shared utilities instead of reusing them. We tried something similar internally and it worked well. And the non-negotiable senior review for money-touching code is something we've been sticking to as well.

Valentin Monteiro • May 30

The 20% is defined by consequence, not difficulty, which is exactly why it doesn't shrink as the models get better. You're FCA-authorised, so you live this: the risky code isn't the hard code, it's the code nobody can explain. AI output that works but that no one can defend to an auditor is still a liability, correct or not. So the senior's real job there isn't writing that 20%, it's being able to stand behind it when someone asks why it made the call it did.

arun rajkumar • Jun 3

This is the FCA angle that doesn't get enough airtime. "The risky code isn't the hard code, it's the code nobody can explain" — that's exactly it. We've had auditors ask why a specific retry backoff was chosen, and the answer can't be "the AI picked it." Someone has to own the reasoning. AI-generated code that works but has no defensible rationale is a compliance risk in regulated fintech, full stop. The senior's real value isn't writing that 20% — it's being the person who can explain it under questioning.

Valentin Monteiro • Jun 4

"The AI picked it" as the answer to an auditor. That image should scare every team shipping AI-generated code in regulated environments. The senior's value isn't the code. It's the defensible rationale attached to it.

arun rajkumar • Jun 6

You said it better than my whole article did — the senior's value is the defensible rationale, not the code. An auditor won't accept "the AI picked it," and neither should a CTO. The code is cheap now; the why behind it is the thing you're actually paying a senior for. Thanks for reading.

Valentin Monteiro • Jun 6

Your article framed the problem, I just sharpened one edge. Most people still think the gap is complexity. It's not. It's who signs off on this when a regulator asks why.

BlackwellJohnL • Jun 2

It's a popular moot meme .... In answer to your Title I would say "This year yeh. But next year? I'm not so sure."
Instead what would be more accurate would be to say that "We will always need Experienced." That is just about the only future for everyone herein.
Architecting and Directing and Problem Solving.
Anything else is just circle jerk.

arun rajkumar • Jun 2

Fair pushback. I'd reframe it slightly: the title of "senior" might become less meaningful. What stays permanently valuable is the judgment that comes from operating a system under real constraints — regulatory, financial, human. Even if models get dramatically better at code generation, someone still needs to decide what to build, what not to build, and what the system should refuse to do. In payments, that's not a coding problem. It's a domain judgment problem. And I'd bet that's still human territory in 2030.

Harjot Singh • May 30

The 80/20 split is the most useful frame for this whole debate. Agents crush the well-trodden 80% (CRUD, boilerplate, glue, standard patterns) because that's where training data is dense, and they faceplant on the 20% that needs system-level judgment, novel tradeoffs, and knowing what NOT to build.

The practical consequence people miss: you should route by that split. The 80% genuinely doesn't need your most expensive model or your most senior human - cheap model, light review. The 20% is where you spend both the premium model AND the senior's attention. Treating all code as equally hard is what makes AI coding feel either too expensive or too risky depending on which half you're looking at. Really well-argued piece - the "why we still need seniors" conclusion is the honest one.

arun rajkumar • Jun 2

This is exactly our approach. We use Sonnet for the routine 80% — API scaffolding, test stubs, boilerplate — and escalate to Opus with full architectural context for anything that touches payment logic. The cost difference is significant but the reliability difference is bigger. The routing isn't just about model choice though — it's about context. The 20% needs structured context files that describe service boundaries, shared schemas, and constraint rules. Without that, even the best model drifts.

Daniel Stolf • May 29

The "scar tests" frame from your reply to @adam_lewis_427616cbc93f0b is the right destination. The underrated part is when in the lifecycle you get there.

Most teams pick up the negative cases reactively: the agent ships the happy path, the senior reviews, finds the missing impossible-transition guard, the test gets added after the fix. That works, but it puts the senior in the role of "the thing that catches what the agent skipped". That doesn't scale, and burns the most expensive person on the team on deterministic checks.

The shift that's worked for us: list the negative cases before any code exists. A spec for "webhook handler" doesn't reach implementation until someone has answered, in writing, what transitions are illegal, what's the behaviour on duplicate delivery, what happens when the bank returns an unknown status.

Each answer becomes a failing test before the agent is prompted. Then the agent has to satisfy them, and the senior reviews the spec (ten minutes, scan-level) instead of hunting omissions in the diff.

The 20% doesn't disappear. It just stops being something a senior discovers after the fact and becomes something the team commits to before the keystrokes happen. Same judgment, applied earlier, where it's cheaper to enforce and harder for the agent to route around.

Scar tests still matter, they're the upgrade path. The first time a negative case bites in prod, it goes into the spec template for that class of feature, and the next webhook handler is born with the guard already required. The institutional memory compounds at the spec layer, not just the test suite.

arun rajkumar • Jun 2

This is one of the sharpest comments on this thread. The insight about when in the lifecycle you capture the negative cases changes everything. We've been moving towards exactly this — writing the impossible transitions, the idempotency requirements, and the failure modes into a spec before the agent gets prompted. The spec becomes the acceptance criteria. The agent has to satisfy it. The senior reviews ten lines of spec instead of hunting through hundreds of lines of diff. You're right that this doesn't make the 20% disappear — it makes it cheaper to enforce.

Adam Lewis • May 31

Daniel, the ordering is right. I really like the idea of a spec template per type of feature. Writing the acceptance criteria up front means the agent has something to check itself against, and a senior can review the spec instead of looking for what's missing in the diff. The other thing is that the spec ends up being what the agent has to satisfy. If it's in the repo the agent reads it each time, and the same thing stops getting missed.

prickles.org/tenet/spec-first-exec...
