DEV Community

Aaron Rose

The Secret Life of Claude Code: When Claude Code Gets It Wrong

Three ways Claude Code gets it wrong — and the discipline that catches all of them before they ship


Margaret is a senior software engineer. Timothy is her junior colleague. They work in a grand Victorian library in London — the kind of place where precision matters and confidence is not the same thing as correctness. Timothy has arrived today in unusually good spirits. This, Margaret has learned, is sometimes cause for concern.


What Timothy Was Proud Of

He came through the door with the particular energy of someone who had solved something.

"I fixed it," he said, settling into his chair with the satisfaction of a man who had earned his tea. "The user authentication bug. The one that's been sitting in the backlog for two weeks."

"Tell me," Margaret said.

"I described the problem to Claude Code — properly this time, the way we talked about. Context, specific behaviour, constraints. It gave me a solution, I reviewed it, it looked right, I tested it, it passed. Deployed this morning."

"And it worked?"

"Perfectly. Users can log in, sessions are handled correctly, no errors in the logs." He opened his laptop. "I wanted to show you the code."

Margaret looked at it. She read slowly, as she always did — not skimming, not nodding along, but actually reading. Timothy had learned not to fill this silence.

After a moment she said, "Walk me through the session expiry logic."

"It resets the session token on each login. Clears the old one, generates a new one, stores it with a timestamp."

"And when does the session expire?"

"Twenty-four hours after the timestamp."

"After which timestamp?"

Timothy looked at the screen. Then he looked at it more carefully.

"After the..." He stopped.

Margaret waited.

"After the creation timestamp," he said slowly. "Not the last activity timestamp."

"No."

The good spirits left the room quietly, the way warmth does when a window is opened in winter.

"So a user who logs in and stays active," Timothy said, "will still be logged out after twenty-four hours regardless."

"Yes."

"Even if they were active thirty seconds ago."

"Yes."

He sat back. "That's going to be a terrible experience."

"It is going to be a very frustrating experience," Margaret agreed, "for any user who is in the middle of something when the clock runs out." She looked at him steadily. "The code is not broken, Timothy. It does exactly what it was written to do. It simply does not do what you needed it to do."


The Shape of the Problem

He stared at the screen for a moment. "But I tested it."

"What did you test?"

"I logged in. The session worked. I logged out, logged back in. New token, no errors."

"Did you test a session that had been active for twenty-three hours and fifty-nine minutes?"

"No," he admitted. "Obviously not."

"Did you test what happens to an active user when their session expires mid-task?"

"No."

"Did you consider that the requirement might be sliding expiry — resetting the clock on activity — rather than fixed expiry from login?"

A pause. "I didn't think about it in those terms."

"Claude Code did not think about it in those terms either," Margaret said. "Because you did not ask it to. You described a bug — sessions not being cleared correctly — and it fixed that bug. Cleanly and correctly. But the behaviour you actually needed was never specified, so it was never built."

She picked up her pen.

"This is the shape of the problem. Claude Code gave you a confident, well-structured, functional solution to the problem you described. It was not the solution to the problem you had."


Confidence Is Not a Signal

"That's what bothers me most," Timothy said. "The code looked right. It read like something I would have written on a good day — clean, commented, sensible variable names. There was nothing in it that said this might not be what you need."

"There never is," Margaret said. "This is the important thing to understand. Claude Code cannot reliably flag gaps in your requirements. When something is ambiguous or unspecified, it does not stop and ask — it fills the gap with a reasonable assumption and continues. It will tell you if a function does not exist, or if a pattern is deprecated, or if there is a type mismatch. But it cannot tell you if the behaviour it has implemented matches the behaviour your users actually need. That knowledge lives only in your head. And in the heads of your users."

"So the confidence of the output is meaningless."

"The confidence of the output is a measure of technical coherence," Margaret said carefully. "The code is consistent. The logic follows. The syntax is correct. Those things Claude Code can assess, and it is very good at them. But correctness in the deeper sense — does this do what the business needs, does this serve the user well, does this handle the edge cases that matter — that is yours to verify. Always."

Timothy looked at the code again with different eyes. It had not changed. But he had.

"It's like getting directions to the wrong address," he said. "Delivered with complete confidence."

"That is an excellent description," Margaret said. "And the directions may be perfect. Every turn correct. Arrive precisely where you were told to go. The question is whether that was where you needed to be."


The Three Failure Modes

She drew three short lines on her notepad, the way she sometimes did when organising a thought.

"When Claude Code gets it wrong," she said, "it tends to do so in one of three ways. It is worth knowing all three."

First — *the wrong problem.* This is what happened today. You described a problem, Claude Code solved that problem, but the problem you described was not the whole problem. The output is technically correct and functionally incomplete. This is the most common failure mode, and the most insidious, because the code works. It just does not work enough.

Timothy was writing.

Second — *the outdated answer.* Claude Code's knowledge has a boundary in time. For well-established patterns it is reliable. But for libraries that move quickly, for APIs that change, for security practices that have evolved — it may give you an answer that was correct some time ago and is subtly wrong today. A hashing algorithm that was once acceptable. An authentication library that has since been superseded. A configuration pattern that newer versions have deprecated. The code will look reasonable. It will often run. And it will carry a flaw you did not know to look for. Always check the version. Always verify against current documentation for anything where recency matters.
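To make this failure mode concrete, here is an illustrative Python sketch (not from the dialogue) of the kind of drift Margaret means: MD5 was once a common choice for password hashing and has long been considered unsafe for that purpose, while the standard library's PBKDF2 with a deliberately high iteration count reflects current practice. The exact iteration count below is an assumption, chosen to be in the range commonly recommended today:

```python
import hashlib
import os

password = b"correct horse battery staple"

# The outdated pattern an older answer might still suggest:
# fast, unsalted, and unsuitable for password storage.
legacy = hashlib.md5(password).hexdigest()

# A current stdlib approach: a random salt plus a slow,
# salted key-derivation function.
salt = os.urandom(16)
modern = hashlib.pbkdf2_hmac("sha256", password, salt, 600_000)
```

Both snippets run without error and both "hash a password" — which is exactly why the outdated answer is hard to spot by reading alone.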

"And third?"

Third — *the plausible fabrication.* Occasionally — not often, but occasionally — Claude Code will produce something that looks authoritative but is simply invented. A function that does not exist. A parameter that was never part of the API. A configuration option that sounds reasonable but cannot be found in any documentation. It will not announce this. It will present the fabrication with the same tone as everything else.

Timothy looked up. "How do you catch that one?"

"You run the code," Margaret said. "And you read the error message carefully. A plausible fabrication almost always fails at runtime, because the thing it invented does not exist in the actual system. The code looks right. Then it breaks in a very specific way — a missing attribute, an unknown method, a module that cannot be found." She looked at him. "When you see an error that suggests something simply does not exist — resist the urge to assume you have made a mistake. Consider that the solution may have invented something."


What Review Actually Means

"I did review the code," Timothy said. There was a defensiveness in it — not hostile, but genuine. He wanted to understand where he had gone wrong.

"I know you did," Margaret said, without any edge. "And the review was not useless. You would have caught a syntax error, a missing import, an obvious logic flaw. Code review has real value." She paused. "But there is a difference between reviewing code for technical correctness and reviewing code for behavioural completeness. Most developers only do the first. It is what we were trained to do — read the code, understand the code, verify the code does what it appears to do."

"But not whether it does what it should do."

"Exactly. That second question requires you to step outside the code entirely and ask: what are the real requirements here? Not what did I ask Claude Code to build. What does this feature actually need to do for a real user in a real situation?" She set down her pen. "Write those requirements down before you look at the solution. Even briefly. Even just three or four bullet points. Then compare what you wrote to what was built."

"I didn't write any requirements."

"No. You had them in your head — loosely — and you assumed the output would match them. This is a very human thing to do. We are pattern-matchers. The code looked like the right shape, so we accepted it as the right answer."

Timothy was quiet for a moment. Outside, the sounds of the city continued without interest in the conversation.

"If I had written down that sessions should expire on inactivity," he said, "I would have caught it immediately."

"Before you sent the prompt," Margaret said. "And the prompt itself would have been better — because it would have contained the actual requirement." She allowed a small pause. "The preparation we discussed last time protects you here as well. A precise prompt is harder to misanswer. A vague prompt can be answered correctly in ways that do not help you at all."


Before He Closed the Laptop

"I need to fix this today," Timothy said.

"You do," Margaret agreed. "It is a minor fix — update the expiry logic to track last activity rather than creation time. Claude Code will handle it well if you describe the requirement precisely."

"And this time I know what the requirement actually is."

"Yes. That is the difference." She looked at him. "Do not be discouraged by this, Timothy. What happened today is not a failure of your abilities. It is a failure of a habit you have not yet built — the habit of separating what you asked for from what you needed. That habit takes time. You are building it now."

He nodded slowly. "The code was good. The requirement was incomplete."

"And that distinction matters enormously," Margaret said. "Because it tells you where to look. Not at the tool. At the thinking before the tool."

She picked up her tea.

"Claude Code did not deceive you. It answered the question you asked. The discipline — and it is a discipline — is learning to ask questions complete enough that the answer cannot mislead you, even when it is technically correct."

He closed the laptop.

Outside, London was carrying on in its unhurried way. Inside the library, a developer had just learned something that could not be learned from documentation — the difference between code that works and code that is right.

They are not always the same thing.

And knowing that is half the work.


Next episode: Margaret and Timothy explore the art of the follow-up — what to do after Claude Code gives you something that is almost right, and how to guide the conversation toward what you actually need.

If this series is useful to you, share it with a developer who needs to hear it.


Aaron Rose is a software engineer and technology writer at tech-reader.blog. For explainer videos and podcasts, check out the Tech-Reader YouTube channel.

Top comments (19)

alptekin I.

Nice post again. Thanks... While reading it I found myself thinking "one should write down the requirements" just a few lines before Margaret actually said it :)

In the systems engineering approach, software tests are run against software requirements and system tests against system requirements (for big systems including software and hardware).

This scenario shows exactly the benefits of these systematic approaches (or the possible pitfalls when they are lacking), regardless of whether whoever writes the code is human or not.
Thanks again

Aaron Rose

Hi alptekin,

That moment where you thought it before Margaret said it — I love it! 💯
You already have the discipline, and you and Margaret are on the same wavelength!

And you've put your finger on something important: the requirement-first principle isn't new wisdom invented for the AI age. It's systems engineering doing what it has always done — insisting that you define what success looks like before you build toward it.

The pitfall Timothy fell into has existed as long as software has.
Claude Code just made it easier to arrive there faster and with more confidence.

Margaret might say the tools change. The discipline doesn't. 🌹

Cheers buddy. Thanks for reading. ❤🙏✨

Aaron Rose

🙏🌹✨

Swift

I've been using the research -> plan -> execute loop more and more to solve for the "solving the wrong problem gap". It definitely produces more accurate/better results, but the tradeoff is speed. I'm still struggling to find the right balance of planning/specificity and openness/speed.

Aaron Rose

Hi Swift!

You know, Margaret would probably say that the planning phase isn't a tax on your speed — it's an investment that prevents the most expensive thing of all: arriving at the right answer to the wrong question.

But she'd also say the balance you're describing is real, and it takes time to develop. It sounds like you're developing good judgment in coding with AI. 💯

Cheers! 🌹🙏❤

Apex Stack

The "wrong problem" failure mode really resonates. I run a multilingual programmatic SEO site with 100k+ pages, and when you're using AI agents to manage content at that scale, this exact failure mode compounds. An agent generates content that's technically correct for the prompt but misses the actual search intent — and suddenly you have thousands of pages with the same subtle gap.

What helped me was building explicit requirement checklists into the agent's workflow itself — essentially automating Margaret's discipline. Before generating any content, the agent has to match against a spec: target keyword, search intent type, required data points, edge cases to cover. It's the 3-4 line spec that Mihir mentioned above, but baked into the pipeline so it can't be skipped.

The "plausible fabrication" one is especially dangerous in data-heavy contexts. When an AI confidently returns a financial metric that looks reasonable but is slightly wrong, you don't catch it by reading the output — you catch it by validating against the source. Trust but verify, at scale.

Aaron Rose

Hi Apex,

Thanks for reading — and impressive work running a site at that scale.

Love how you've taken the 3-4 line spec and baked it into the pipeline itself so the discipline can't be skipped. That's a great architectural move.

The plausible fabrication point in data-heavy contexts is one I want to return to in a later episode. You've put it precisely: you don't catch it by reading, you catch it by validating against the source. At scale, "trust but verify" quietly becomes just "trust" if the verification isn't automated and mandatory. Thanks for laying out that point so clearly! 💯

Cheers! ✨🙏❤

Apex Stack

Really appreciate you engaging with that point, Aaron. You nailed the core tension — at scale, the verification layer isn't optional, it's load-bearing infrastructure. We learned this the hard way with financial data specifically. A P/E ratio that's off by 15% looks perfectly reasonable to a human reviewer, but multiply that across thousands of ticker pages and you've built a site that's confidently wrong at scale.

The approach that finally worked for us was treating source validation as a pipeline stage, not a post-generation audit. Every data point gets checked against the API response before it hits the template. It adds latency to the build, but it's the only way to maintain trust when you can't manually review even 1% of the output.

Would love to read that future episode on data fabrication — it's one of those problems that gets worse the better the model gets at sounding right. Following the series.

Mihir kanzariya

The "confidence is not a signal" point really hits home. I've been burned by this exact pattern so many times now. The code looks clean, passes linting, even has reasonable variable names. But it's solving a slightly different problem than what you actually need.

What helped me was getting into the habit of writing a quick 3-4 line spec before prompting. Not a full design doc, just "here's the exact behavior I expect, here are the edge cases." When the output comes back I check against that instead of just reading the code for correctness.

The session expiry example in the article is perfect because it's exactly the kind of subtle logic bug that looks right during code review but breaks in production. Creation timestamp vs last activity timestamp is such a common trap.

Aaron Rose

Hi Mihir,

'The "confidence is not a signal" point really hits home. I've been burned by this exact pattern so many times now' - welcome to my world! 🤣

That 3-4 line spec habit you described is really powerful. Thanks for sharing that.

The check-against-spec instinct is also great. Reading code for correctness and checking code against expected behavior are two completely different acts. Nice one! 💯

And yes — creation timestamp vs last activity timestamp. It really is a common trap!

Thanks for reading! ❤🙏✨

Anthony

Nice article... the danger of tools like Claude Code is that false sense of something being coded correctly, when a lot of the time it's us who have left out some (or many!) parts of the requirements. I know the headline says "When Claude Code Gets It Wrong" but in the article I just read, I'm pretty sure it wasn't Claude Code that got it wrong... ;)

Aaron Rose

Hi Anthony, you're right: it probably wasn't Claude that got it wrong! 🤣 The tool just answers the question we ask. The real work for us is asking the right question, isn't it? Cheers! ✨💯

xh1m

This piece provides a nice exploration of the "Senior vs. Junior" mentality when working with AI tools. The difference between "the code isn't broken" and "the code isn't what you needed" is a critical piece of advice. It reminds us that the most important work happens before the first prompt is sent. Do you find that taking the time to physically write down requirements before prompting significantly reduces the "Wrong Problem" failure mode in your day-to-day?

Aaron Rose

Hi xh1m!

Yes, for me, the act of writing requirements down, even just three or four bullet points, does something that thinking alone doesn't. It forces me to be more precise. Ideas that feel vague in my head become more clear the moment I try to write them in a sentence.

And to your point about senior vs. junior mentality — that is a major theme in this series. I think the senior developer's advantage is knowing what questions to answer before the tool ever gets involved.

Thanks for reading! ❤✨🙏

Santhosh Balasa

A movie plot 👏🏾

Aaron Rose

🙏❤✨

SoftwareDevs mvpfactory.io

Thanks for the content! I really find that AI is quite hard to explore :)

Aaron Rose

Hi SoftwareDevs, I feel the same way! It's a challenge to code with AI for sure. Thanks for reading. Cheers! ❤🙏✨
