BaoDev Studio

Posted on May 21 • Originally published at baodev.studio

an honest list of what AI agents cant do in 2026

#agents #ai #claude #softwaredevelopment

i run 35 specialized claude code agents across my projects. most of whats written about AI agents in 2026 is either marketing (look how much they can do) or doom (look how much theyll replace). both miss the practical layer: where do these agents consistently fail, even with the best prompts, the best context, the best tools?

this is that list. drawn from running these agents across 3 production codebases for the last 6 months. specific failures, not abstract concerns.

judgment under partial information

biggest single category. AI agents fail when the right action requires waiting, choosing not to act, or saying "i need more info."

client message: "can you make the dashboard faster?" agent reads the request, looks at the dashboard code, identifies three optimization opportunities, starts implementing. senior reads the same message, asks: "faster for whom? on what data volume? slow on initial load or on filter operations? whats the SLA?"

the agent's confident execution costs hours of work that might solve the wrong problem. the senior's pause costs 5 minutes of clarification. in my experience the pause is the right move roughly 80% of the time.

ive tried building this into agent prompts ("ask 3 clarifying questions before starting"). it works sometimes. but agents ask FORMULAIC questions, not THE question that disambiguates. knowing which question to ask is itself judgment under partial information.

this manifests as:

deciding when a feature is done vs needs another iteration
picking which 1-2 of 5 AI-generated draft replies are worth sending
STOPPING the addition of flags/options to a configurable system
subtractive thinking — "remove this rather than build around it"

concrete failure case: building a multi-tenant data isolation layer for a saas project last quarter. agent kept adding configuration flags for edge cases ("what if a tenant wants flag A but not B?"). by flag #7 the system was unmaintainable. i deleted 5 flags and replaced them with a default-secure single mode. config space went from 128 combinations to 4. senior judgment was "stop adding, start removing."
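the 128 to 4 numbers work out if each flag is an independent on/off switch, which is an assumption here (the post does not say what the flags actually were). under that assumption 7 flags give 2^7 = 128 combinations, and the 2 left after deleting 5 give 2^2 = 4. a minimal python sketch with made-up flag names:

```python
from itertools import product


def config_space(flag_names):
    """Return every on/off combination for a list of independent boolean flags."""
    return [
        dict(zip(flag_names, values))
        for values in product([False, True], repeat=len(flag_names))
    ]


# hypothetical flag names; only the counts (7 before, 2 after) come from the post
before = [f"flag_{letter}" for letter in "ABCDEFG"]
after = before[:2]

print(len(config_space(before)))  # 128
print(len(config_space(after)))   # 4
```

every one of those combinations is a state someone has to test and support, which is why removing flags shrinks the problem much faster than adding them grows the feature list.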

common thread: right move is restraint. agents are calibrated for action.

reading the codebase context thats not in the prompt

agents are good at search. agents are bad at synthesis from large context.

concrete: asked an agent to refactor a slow 40-line function. the rewrite was technically correct. but the original contained try/catch with comment // don't remove — handles malformed JSON from legacy webhook v1. the rewrite "cleaned up" that try/catch.

agent saw 40 lines. actual scope was the whole webhook chain, the legacy contract, the production data that occasionally hits the malformed path. none of that was in the prompt.

deployed the rewrite to staging. crashed within 6 hours when the daily webhook v1 batch fired. rolled back, restored the original try/catch, added a regression test that explicitly fires malformed JSON. lesson cost ~3 hours and a degraded staging window.
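a rough idea of what that regression test can look like. this is a python sketch, not the real handler: the original was try/catch, so it probably was not python, and `parse_webhook_payload` plus the payload shape are hypothetical names for illustration.

```python
import json


def parse_webhook_payload(raw: str) -> dict:
    """Parse a webhook body, tolerating malformed JSON instead of crashing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # load-bearing: legacy webhook v1 sometimes sends malformed JSON
        return {"malformed": True, "raw": raw}
    return {"malformed": False, "data": data}


def test_malformed_v1_payload_is_handled():
    # truncated body, the kind of input the "cleaned up" rewrite crashed on
    result = parse_webhook_payload('{"order_id": 42, "items": [')
    assert result["malformed"] is True


def test_valid_payload_still_parses():
    result = parse_webhook_payload('{"order_id": 42}')
    assert result == {"malformed": False, "data": {"order_id": 42}}
```

the point of the test is that the "why" behind the try/catch now lives in something that fails loudly, instead of in a comment an agent can delete.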

this isnt fixed by more context tokens. the context that matters is implicit — "this comment was load-bearing", "this duplication was intentional", "this naming convention was chosen for a reason". agent reads the lines but doesnt have the memory of why theyre there.

related: agents over-abstract. asked one to extract a pattern shared by 3 functions. it produced a beautiful generic helper that the 4th similar function — written 2 weeks later — could never quite fit. the 3 specific implementations were better than the 1 generic abstraction. agent has no "predict the 4th case" capability.

reading PEOPLE

this one i underestimated. agents are bad at reading tone in human messages.

specifics:

client going silent for 3 days is a strong signal: possibly losing interest, possibly stuck on an internal decision, possibly got a competing quote. agents read silence as "no update yet" and continue per plan.
"can we add X?" (genuine question) vs "can we add X?" (testing whether u'll push back on scope creep) is invisible to agents. senior knows from timing, prior conversations, how it was phrased.
tone for difficult convos — scope-creep pushback, missed-deadline notes, refund discussions — agent versions are either too soft (gets walked over) or too corporate (loses earned trust).

specific exchange from last month. a client asked for a "small change" 6 weeks into a project. agent drafted a polite, structured reply explaining the change-request process. i sent something different: "sure, let me think about whether this needs a CR or if it fits the current scope — give me 24 hours."

agent reply was formally correct. actual right reply was warmer + bought thinking time. the relationship needed the warmth more than it needed the formality.

i now never let agents send client comms without human review. tone-reading is unreliable enough that the risk isnt worth it.

eval / judgment about correctness

this is where i most expected agents to excel + where theyre most disappointing.

building LLM-based products requires evals. what does "good" mean? what threshold do we ship at? which test cases matter? upstream of implementation, heavily judgment-based.

agents do badly:

generate exhaustive test cases but cant tell me which 5 matter most for product viability
measure whats measurable (BLEU, semantic similarity, response length) instead of what matters (does the customer find this helpful)
cant design human-in-the-loop eval samples — recommend either fully-automated or fully-manual, never the right hybrid

specific case: building the support agent eval harness for the e-commerce project (last quarter). agent suggested measuring response accuracy via semantic similarity to a "golden answer" set. that would have been wrong in 2 ways. first, the golden answers themselves were judgment calls. second, the actual metric that mattered was "did the customer ask a follow-up that suggests they were confused." the eval design needed real customer conversation data + human classification of "this was helpful" / "this missed." cant be done from training data.
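one possible shape for that hybrid, as a python sketch. the field names, the labels and the sample rows are all made up for illustration; the only parts taken from the post are the follow-up signal and the human "helpful" / "missed" classification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    conversation_id: str
    had_follow_up: bool          # customer wrote again after the agent's answer
    human_label: Optional[str]   # "helpful", "missed", or None if not reviewed yet


def needs_review(conversations):
    """Automated step: queue unlabeled conversations that had a follow-up."""
    return [c for c in conversations if c.had_follow_up and c.human_label is None]


def missed_rate(conversations):
    """Share of human-labeled conversations marked "missed"; None if no labels yet."""
    labeled = [c for c in conversations if c.human_label is not None]
    if not labeled:
        return None
    return sum(c.human_label == "missed" for c in labeled) / len(labeled)


# made-up sample rows, not real customer data
sample = [
    Conversation("c1", had_follow_up=True, human_label="missed"),
    Conversation("c2", had_follow_up=True, human_label="helpful"),
    Conversation("c3", had_follow_up=False, human_label=None),
    Conversation("c4", had_follow_up=True, human_label=None),
]

print([c.conversation_id for c in needs_review(sample)])  # ['c4']
print(missed_rate(sample))                                # 0.5
```

the automated part only decides what a human looks at. the judgment ("was this helpful?") stays with the human, and the number that gets tracked is built from those labels rather than from similarity to a golden answer.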

eval-design is the failure i expect least progress on in 2026. requires judgment about what humans value. not in training data.

bonus failure: estimating real-world performance

agent says "this query should be fast" based on indexed columns. in production with cold cache + network jitter + concurrent load, it takes 800ms. agents are bad at production reality because they reason from the code, not from operational behavior.

ive seen agents recommend caching strategies that look correct on paper, but ignore the cache invalidation cost when the cached data changes 50x/day. or recommend "just add an index" without thinking about write amplification on a write-heavy table.
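a back-of-envelope way to check the caching case before building anything. this is a deliberately crude model and every number except the 50 changes/day is hypothetical: it assumes a single cached entry, that every change invalidates it, and that each invalidation can cost at most one miss (plus one cold-start miss).

```python
def worst_case_hit_ratio(reads_per_day: int, invalidations_per_day: int) -> float:
    """Lower bound on cache hit ratio for one entry under the assumptions above."""
    if reads_per_day <= 0:
        return 0.0
    # each invalidation forces at most one miss, plus the initial cold miss,
    # and there can never be more misses than reads
    misses = min(reads_per_day, invalidations_per_day + 1)
    return 1 - misses / reads_per_day


# hypothetical read volumes; 50 invalidations/day is the figure from the post
print(round(worst_case_hit_ratio(10_000, 50), 4))  # 0.9949
print(round(worst_case_hit_ratio(60, 50), 2))      # 0.15
```

same cache, same invalidation rate, very different outcome depending on read volume. it says nothing about the cost of doing the invalidation itself, which is the other half of what the "looks correct on paper" version ignores.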

senior knows "this looks fast but will be slow in production for THESE reasons" because senior has seen production reality. agent has seen the docs.

what this means in practice

i keep the 35 agents because the 70% they do well saves real time. but i architect the workflow so the 30% they cant do has explicit human handoffs:

"should we start?" decision (judgment under partial info): human only
cross-codebase refactors where load-bearing weirdness lives: human-driven, agents as implementation tools
client-facing communication: human review minimum, often human-authored
eval design and threshold-setting: human-authored, agents run the harness
production-readiness assessments: human walks through the operational model, agent helps document it

the hype frames this as "AI will do everything." the doom frames this as "AI will replace everything." the practical layer is neither: AI handles the roughly 70% of a workflow that doesnt require judgment under uncertainty. the 30% that does is exactly where senior engineers earn their living.

if ur building agent systems in 2026, plan the workflow around what they cant do, not what they can. the cant-do list is more load-bearing than the can-do list.

Top comments (1)

Ken Imoto • May 21

The "70% capable, 30% requires human handoff" framing maps cleanly to what I see running 6+ autonomous publishing pipelines. The non-obvious takeaway is that the remaining 30% is becoming the entire job. My harness ships the labor parts fine; what I actually spend time on now is shaping the gates, picking what's worth publishing, and reading the room on which markets care. Agree on "context for why the code exists" being uncrossable in the near term - that's not a memory problem, it's a stakeholder-history problem and there's no API for it.