A few weeks ago I published a piece called "AI Agents Are Great at 80% of Our Code. The Other 20% Is Why We Still Need Seniors."
It got 25 reactions and 34 comments. Several of those comments asked the same question in different words:
"How do you actually measure that 20% when you're hiring?"
Fair question. I dodged it in the first article because I didn't have a clean answer yet. Now I do. Or at least, we have a process that's working better than anything we tried before.
This is that answer.
The old interview was broken before AI made it obvious
For years, we ran the standard playbook. Whiteboard problem. Timed coding exercise. "Build a REST endpoint in 45 minutes." You know the drill.
Here's what that interview actually tested: can this person write syntactically correct code under pressure, from memory, with someone watching?
That's a real skill. It's just not the skill that matters anymore.
AI handles the code-writing part. I don't mean it handles it perfectly — I wrote a whole article about the 20% it gets wrong. But the 80% that is boilerplate, CRUD, API wrappers, standard patterns? An agent will generate that in seconds. Clean. Typed. Probably with better variable names than I'd pick.
So if I'm hiring someone and my interview tests whether they can do what an agent already does faster — what exactly am I learning?
That they can type under pressure. Great. So can the agent. And it doesn't get nervous.
The interview we actually run now
We stopped asking candidates to write code from scratch. Instead, we hand them code an AI agent already wrote.
The code looks fine. It passes the tests we included. The variable names are clean. The types are correct. A junior looking at it would say "ship it."
But it's wrong.
Not wrong in a way that crashes. Wrong in a way that costs money three weeks later. Wrong in a way that only someone who thinks about consequences would catch.
Here's the shape of it. We give candidates a webhook handler for processing payment confirmations. The handler works. It receives the event, updates the database, returns a 200. Clean code.
What's missing: idempotency. If the bank retries the webhook — and banks always retry — the handler processes the payment twice. The customer gets charged twice. We get an FCA complaint. The code is correct. The system is broken.
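To make the gap concrete, here is a minimal sketch of what an idempotent version could look like. This is not our production handler. The `PaymentStore` class, the `event_id`, `payment_id` and `amount` field names, and the in-memory set are all assumptions for illustration, and it assumes the sender reuses the same event id when it retries.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentStore:
    """In-memory stand-in for a database (hypothetical, for illustration only)."""
    processed_event_ids: set[str] = field(default_factory=set)
    settled_amounts: dict[str, int] = field(default_factory=dict)  # payment_id -> pence


def handle_payment_webhook(store: PaymentStore, event: dict) -> int:
    """Apply a payment confirmation at most once and return an HTTP status code."""
    event_id = event["event_id"]  # assumed: the sender reuses this id on every retry
    if event_id in store.processed_event_ids:
        # Duplicate delivery: acknowledge so the sender stops retrying, change nothing.
        return 200
    payment_id = event["payment_id"]
    store.settled_amounts[payment_id] = (
        store.settled_amounts.get(payment_id, 0) + event["amount"]
    )
    store.processed_event_ids.add(event_id)
    return 200
```

Delivering the same event twice leaves the stored amount unchanged after the first call. In a real database, the duplicate check and the write would need to happen atomically, for example with a unique constraint on the event id inside the same transaction. Otherwise two concurrent retries can both pass the check.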
Or we show them a payment flow with state transitions: pending to authorised to settled. Looks right. But there's a path where a payment can go from settled back to pending. That's an illegal state transition. In our domain, that means money that was already in a merchant's account could theoretically get pulled back without a refund record. No test catches it because no test was written for a transition that shouldn't exist.
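One way to close that gap, sketched under assumptions: keep an explicit allow-list of transitions and reject everything else, so the illegal paths become something you can enumerate and test. The three states come from the example above. The names `PaymentState`, `ALLOWED_TRANSITIONS`, `IllegalTransition`, `transition` and `illegal_pairs` are hypothetical, and a real payment flow would have more states (failures, refunds) than this.

```python
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    AUTHORISED = "authorised"
    SETTLED = "settled"


# Allow-list: any pair not listed here is illegal, including settled -> pending.
ALLOWED_TRANSITIONS = {
    (PaymentState.PENDING, PaymentState.AUTHORISED),
    (PaymentState.AUTHORISED, PaymentState.SETTLED),
}


class IllegalTransition(Exception):
    """Raised when code attempts a transition that is not on the allow-list."""


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the new state, or raise if the move is not explicitly allowed."""
    if (current, target) not in ALLOWED_TRANSITIONS:
        raise IllegalTransition(f"{current.value} -> {target.value} is not allowed")
    return target


def illegal_pairs() -> list[tuple[PaymentState, PaymentState]]:
    """Every (from, to) pair the allow-list rejects, for use in tests."""
    return [
        (a, b)
        for a in PaymentState
        for b in PaymentState
        if (a, b) not in ALLOWED_TRANSITIONS
    ]
```

The point of `illegal_pairs()` is that a test can loop over it and assert that every one of those moves raises. That gives you a test for the transition that shouldn't exist, without anyone having to think of it by name.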
We ask candidates to review this code. Not write it. Review it.
The ones who have the 20% find these things. Not always immediately. Sometimes they stare at it for five minutes and then say "wait — what happens if this gets called twice?" That moment is worth more than any algorithm they could whiteboard.
It's not what they'd change. It's why.
We added a second part to the interview. Once a candidate identifies issues in the AI-generated code, we ask them to walk us through a PR rejection.
Not "what would you change." We already know what needs to change. We want to hear why they'd reject it.
This is where you separate pattern-matchers from engineers.
A pattern-matcher says: "There's no idempotency key. You should add one." Correct. Also surface-level. They've seen the pattern before and recognized its absence. That's good, but it's not enough.
An engineer says: "There's no idempotency key, which means a network retry from the bank will double-process the payment. The customer sees two debits. Your support team gets a ticket. You file a dispute with the acquiring bank. The refund takes 5-7 business days. And if this happens at volume, you've got a regulatory reporting obligation."
Same observation. Completely different depth. The first person knows the pattern. The second person knows what happens downstream when the pattern is missing.
That downstream awareness — the ability to trace a bug forward through the business — is the 20%.
Hire for intent, not resumes
Our interview process changed because our hiring philosophy changed first.
We don't hire resumes. We hire intent. Let me give you three examples.
One developer's resume listed one skill under "Technical Proficiency": Googling. I'm not paraphrasing. That's what it said. B.Sc. No fancy internships. No side projects on GitHub. Just someone who was honest about what they knew and relentless about learning what they didn't. Today they own our merchant-facing app. The whole thing.
Another cold-messaged us asking for a job. No referral. No warm intro. Just a direct message. In the interview, they were quiet. Not shy — quiet. Listened more than they talked. When they did talk, they went straight to the solution. No preamble, no hedging, no "well it depends." Just: here's the problem, here's how I'd fix it, here's what could go wrong.
A third started as an intern. They're now building our Open Banking integration end-to-end. Not assisting. Not maintaining. Building.
The common thread isn't a degree or a tech stack or years of experience. It's three things: curiosity, ownership, and willingness to be wrong.
The first didn't pretend they knew things they didn't. The second didn't try to impress with volume — they impressed with clarity. The third didn't wait for someone to assign harder problems — they grew into them because the problems were there and they weren't afraid to try.
None of them would have passed the old coding interview particularly well. All of them are exactly the kind of engineer you want reviewing an AI agent's output.
The 20% isn't just code — it's design thinking
Here's something most "AI and hiring" articles miss: the 20% that matters isn't only about catching bugs in payment logic. It's about knowing how to think about problems that don't have a spec yet.
Before Atoa, I spent years in design thinking — working with clients who were showcasing products at CES. One project sticks with me. A world-leading chocolate manufacturer wanted to launch a series of chocolates based on human emotions — Anger, Disgust, Sad, Happy, Wimpy. The brief: build software that captures a person's emotion in real-time, recommends the matching chocolate, and makes it go viral on social media.
Now imagine the marketing manager walks into your office and drops this on your desk. The brief is: make it viral. Which platform do you build on? Where does the experience live? What's the feature that makes someone want to share it?
Assume technically anything is possible. The technology isn't the constraint. Your thinking is.
Here's the filter: if your first answer is "I'll build a mobile app and a web app" — that's a straight reject. Not because mobile and web are wrong technologies. But because you jumped to how before you thought about why. You're solving for delivery before you've solved for virality. You're thinking like a developer when the brief asked you to think like a designer.
The interesting answers start with questions. Who's the audience? Where do they already spend time? What makes someone stop scrolling and share something? What's the 3-second hook? How does the chocolate brand benefit from every share? What's the mechanic that makes this grow without paid media?
Now here's my challenge to you: how would you approach this? Drop it in the comments. Not the tech stack — the thinking. How do you decompose this brief into something that actually goes viral?
There's no single right answer. That's the point. This is a design thinking exercise — the kind of problem where the 20% lives. The brief is intentionally vague. The constraints are real. And the interesting part isn't the technology you pick. It's how you think about a problem before you write a single line of code.
No AI agent is writing a spec for that. No benchmark captures the ability to look at a brief like "emotion-based chocolate recommendation engine for CES" and turn it into a system. That's design thinking. The ability to hold a vague, human problem in your head and translate it into technical architecture — while keeping the user experience front and centre.
I look for this in interviews too. Not the ability to solve a well-defined problem. The ability to define the problem in the first place. When I ask a candidate "how would you approach this?" and they immediately start writing code — that tells me something. When they first ask "who's using this, where, and what does success look like?" — that tells me something very different.
The 20% is judgment about code. But it's also judgment about products, users, and what should exist in the world. AI can generate solutions. It can't ask the right question.
What the 20% actually looks like in an interview
Here's what I'm watching for when a candidate reviews code. It's not a checklist — it's a set of signals.
Do they think about what shouldn't happen?
Most engineers think about the happy path. The payment goes through. The webhook fires. The database updates. Done.
The 20% engineers think about the unhappy path first. What happens when the webhook fires twice? What happens when the database write succeeds but the response times out? What happens when the bank says "yes" and our system says "no" and now the money exists in a state neither side agrees on?
If a candidate's first instinct is "how does this work?" — that's fine. If their first instinct is "how does this break?" — that's the signal.
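The "bank says yes, our system says no" case above is the kind of disagreement a reconciliation check is meant to surface. Here is a minimal sketch, assuming both sides can be reduced to a mapping of payment id to status. The function name, the dictionary shape and the report keys are illustrative assumptions, not a description of any real reconciliation job.

```python
def reconcile(
    our_records: dict[str, str], bank_records: dict[str, str]
) -> dict[str, list[str]]:
    """Compare payment statuses keyed by payment id and report every disagreement."""
    ours, theirs = set(our_records), set(bank_records)
    return {
        # We recorded a payment the bank has no record of.
        "missing_at_bank": sorted(ours - theirs),
        # The bank has a payment we never recorded.
        "missing_in_ours": sorted(theirs - ours),
        # Both sides know the payment but disagree on its status.
        "status_mismatch": sorted(
            pid for pid in ours & theirs if our_records[pid] != bank_records[pid]
        ),
    }
```

The sketch only detects disagreements. Deciding which side is right, and what to do about it, is the judgment part.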
Do they ask about failure modes before writing anything?
We've started noticing this in walkthroughs. Some candidates immediately start typing fixes. Others ask questions first. "What's the retry policy on these webhooks?" "Is there a dead letter queue?" "What happens to in-flight payments if this service goes down?"
The ones who ask first are almost always better engineers. Not because asking is inherently better than doing. But because in the 20% territory — the code that handles edge cases, race conditions, regulatory requirements — the cost of building the wrong thing is higher than the cost of asking one more question.
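For readers who haven't met the terms in those questions, here is a rough sketch of what a retry policy with a dead letter queue means in its simplest form: try a bounded number of times, then park the event for a human instead of dropping it. The function name, the plain list standing in for a queue, and the default of three attempts are all assumptions for illustration.

```python
from typing import Callable


def deliver_with_retries(
    event: dict,
    handler: Callable[[dict], None],
    dead_letters: list[dict],
    max_attempts: int = 3,
) -> bool:
    """Try the handler up to max_attempts times; park the event if every attempt fails."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            # A real system would log the error and back off before the next attempt.
            continue
    # Nothing worked: keep the event for inspection instead of losing it.
    dead_letters.append(event)
    return False
```

Note that this only makes sense if the handler is idempotent. A handler that half-succeeds and then raises will be called again with the same event.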
Can they explain a tradeoff they made, not just what they chose?
This is the question I ask every candidate, regardless of seniority: "Tell me about a technical decision where you chose the worse option on purpose."
The interesting candidates have an answer. "We chose synchronous calls between two services because the audit trail was easier to reason about, even though async would have been more resilient." "We kept a manual process instead of automating it because the edge cases weren't well understood yet and we didn't want to automate the wrong thing."
The 20% is full of decisions like this. The right answer isn't always the technically superior one. Sometimes the right answer is the one that's easier to debug at 2am, or the one that produces a cleaner audit trail, or the one that a new engineer can understand without reading three pages of context.
The junior training pipeline problem
Here's the question that kept me up after the first article: if AI handles 80% of the code, how do juniors ever build the judgment that makes seniors valuable?
The 80% used to be the training ground. You learn to write CRUD endpoints. You learn to wire up a database. You learn to handle HTTP errors. You make mistakes in the boring code, you get them caught in review, and slowly you develop an instinct for the less boring code.
If an agent writes all of that for you on day one, what are you actually learning?
This is a real problem. And "just let them use AI" isn't the answer, because using AI well requires the judgment you're supposed to be building.
I'll be honest — I've had to let someone go because of this exact gap. They were using AI for everything. But they were using the default model in Cursor while the rest of the team had moved to Opus for anything that touched critical code. They weren't thinking about which tool to use when. They were just pressing tab and shipping. The code looked fine. The judgment wasn't there. And in a payment system, that's not a skill gap you can coach around — it's a risk.
At Atoa, we pair juniors with seniors on the hard problems. Not the 80% problems. The 20% ones. The payment state machine that handles twelve edge cases. The webhook handler that has to be idempotent across retries, timeouts, and partial failures. The reconciliation logic where our system says one thing and the bank says another.
The senior doesn't watch the output. They watch the process. They're looking for two things.
First: "What did you skip?" Not what did you get wrong — what did you not even consider? That gap is where the learning lives. A junior who writes a webhook handler and doesn't think about idempotency hasn't made a mistake. They have a blind spot. Mistakes you can catch in tests. Blind spots you can only catch by asking the right question at the right time. That's what the senior is there for.
Second: "What happens when this fails?" Not "did you handle the error." Did you think about what the system does when this component fails? Does the rest of the pipeline stall? Does the customer see a broken state? Does the merchant lose money? The junior doesn't need to have the answer. They need to have the habit of asking the question.
The painful lessons still happen. They just happen faster because the senior is there to compress the feedback loop from "you'll figure this out in three years" to "let me show you why this matters right now."
The best hire isn't the best coder anymore
Three years ago I'd have hired the candidate who wrote the cleanest code the fastest. That person is still good. They're just not rare anymore. An AI agent writes clean code fast. That's table stakes.
The hire I'm looking for now is the person who reads an AI agent's clean, well-typed, properly structured code — and says "this will break in production, and here's exactly how."
That person can tell an agent what it got wrong. More importantly, they can explain why it matters. Not just "add an idempotency key" but "add an idempotency key because the bank will retry, and without it, this elegant code will charge a customer twice."
The 20% was never about writing harder code. It's about knowing which code is dangerous.
We changed our interview because the job changed. The job isn't writing code anymore. The job is judgment.
And judgment is the one thing you can't generate with a prompt.
This is a sequel to "AI Agents Are Great at 80% of Our Code. The Other 20% Is Why We Still Need Seniors." If you're building a team that works with AI agents, I'd love to hear how your hiring process has changed. Drop a comment or find me on X @mickyarun.
Top comments (1)
Strong follow-up to the 80/20 piece. The thing I'd push back on: the new interview still measures something, and the failure mode of "we removed the wrong filter" is sneakier than the failure mode of the old whiteboarding.
Two concrete things we'd add to the "20% interview" you described:
1) Make the 20% exercise multi-session on purpose. Hand the candidate a repo + a frozen bug ticket; ask them to plan, walk away, come back tomorrow, and resume. The signal isn't the fix — it's whether they can recover their own context the next morning. AI handles the typing, but it does not (yet) handle the "I remember why I rejected this approach at 11pm" recovery. A senior's value shows up across sessions, not within one.
2) Pair-review a real PR from the team's recent history. Not "would you have caught this bug" — that rewards cynicism. Better: "explain to the next agent, in writing, what you would do differently." That isolates the senior's judgment from their typing speed, and the writing is a thing the AI can then act on. It's a real artifact that lives in the repo.
The danger of "stopped asking candidates to code from scratch" is that you accidentally re-introduce the whiteboarding tax in a different costume (live architecture design with someone watching). The cleanest way to avoid it is to make the artifact — not the live performance — the deliverable.