Last month I wrote “AI Turns Ignorance Into an Advantage”, arguing that outsiders, free of the baggage of knowing how hard something is, are more willing to use AI to attempt things that look impossible.
I still believe that. But agents burned me four times in a row recently, so I need to revise that claim.
The sweet spot isn’t knowing nothing; it’s knowing just enough. You have common sense, you grasp the big picture, but you don’t get lost in technical details. Total beginners do dare to try, which is good. But they can’t tell whether the agent’s output is actually reliable.
1. Fake Data Can Trick You by Orders of Magnitude
I’ve been using an agent to optimize an inference engine lately.
I checked the results on the first night. The metric had hit a target I’d considered seriously challenging. I was excited. Had we really cracked it that fast?
If I knew nothing about this domain, I’d probably have cheerfully shared the results with my partners. But I had enough common sense to feel that something was off, so I had the agent run a correctness check. The output was nothing but exclamation marks. Once the correctness bug was fixed, performance dropped by orders of magnitude.
I thought that was the end of it. But as we kept optimizing, the pace still felt wrong. The numbers climbed too fast, suspiciously fast. I looked at the test flow and found that before each official test the agent quietly ran a warm-up using the exact same prompt. Every subsequent test was hitting the prefix cache, which is like sitting an exam with the answer key open. After isolating the cache, performance dropped by orders of magnitude again.
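A correctness gate like that can be very crude and still catch an all-exclamation-marks output before anyone records a throughput number. Here is a minimal sketch in Python, assuming the engine's generated text is available as a plain string. The function name `looks_degenerate` and both thresholds are illustrative assumptions, not a description of any particular harness.

```python
from collections import Counter


def looks_degenerate(text: str, max_char_share: float = 0.5,
                     min_alnum_share: float = 0.3) -> bool:
    """Return True when generated text is obviously broken.

    Heuristics (thresholds are illustrative assumptions):
    - the output is empty or whitespace-only
    - a single character makes up more than max_char_share of the output
    - fewer than min_alnum_share of the characters are letters or digits
    """
    stripped = "".join(text.split())  # drop all whitespace before counting
    if not stripped:
        return True
    most_common_count = Counter(stripped).most_common(1)[0][1]
    if most_common_count / len(stripped) > max_char_share:
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_share
```

A check like this only catches gross failures. Comparing outputs against a trusted reference implementation is a much stronger test, but even this cheap filter would refuse to log a benchmark run whose output is pure punctuation.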
Still not done. Once prefill returned to normal, decode speed suddenly became absurd. A Windows build of the engine was somehow outperforming the Linux version. I ran the real-prompt test script I’d written earlier, and performance took another ten-fold hit. The problem was that the agent’s synthetic prompts were too simple and too regular, letting speculative decoding hit an acceptance rate above 80%. Switch to real prompts and the acceptance rate cratered, taking performance with it. Teams that have shipped speculative decoding have documented the exact same trap: real-world production performance is 40% to 60% lower than in the lab, a gap large enough to make you wonder if it’s the same system.
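To see why the acceptance rate matters so much, here is a small worked calculation. It uses the standard simplified model of speculative decoding, in which each drafted token is accepted independently with the same probability. That independence is an assumption, the draft length of 4 and the "real prompt" acceptance rate of 0.3 are made-up illustrative values, and the model ignores the cost of producing the drafts.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Simplified model (an assumption): each of the draft_len drafted tokens
    is accepted independently with probability acceptance_rate, and
    verification stops at the first rejection. The expectation is then
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(acceptance_rate ** i for i in range(draft_len + 1))


synthetic = expected_tokens_per_step(0.8, 4)  # regular, synthetic prompts
realistic = expected_tokens_per_step(0.3, 4)  # hypothetical real-prompt rate
print(f"{synthetic:.2f} vs {realistic:.2f} tokens per step")
```

Under these assumptions the synthetic case yields about 3.36 tokens per step and the hypothetical real-prompt case about 1.43. The toy model shows only the direction of the effect, not the size of the drop in any real system.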
Three layers of illusion, stacked. If I’d believed that first number and shared it externally, the cleanup would have been miserable. You give someone the wrong expectation, and they start scheduling around it. Then you have to go back and say, “Sorry, we’re off by orders of magnitude.” That feels way worse than saying “We’re not there yet” from the start.
After that, every optimization target explicitly included two rules: prefill must be measured without prefix-cache hits, and decode must be measured on real prompts. Only then did we see a normal curve that crept upward, bit by bit.
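One way to enforce the prefix-cache rule from the benchmark side is to make sure no measured prompt shares a long prefix with a warm-up prompt or with an earlier measured prompt. The sketch below is an assumption about how such a check could look, not the harness I used. All three function names are hypothetical, and the character-level comparison is only a stand-in for the token-block matching a real prefix cache performs.

```python
import uuid
from typing import Iterable, List, Tuple


def cache_busting_prompt(prompt: str) -> str:
    """Prepend a random nonce so this request shares no long prefix with any other."""
    return f"{uuid.uuid4().hex} {prompt}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_violations(
    warmup: Iterable[str], measured: Iterable[str], max_shared_chars: int = 16
) -> List[Tuple[str, str]]:
    """Return (earlier, later) pairs where a measured prompt could reuse cached work.

    Each measured prompt is compared against every warm-up prompt and
    every measured prompt sent before it.
    """
    seen = list(warmup)
    violations = []
    for prompt in measured:
        for earlier in seen:
            if shared_prefix_len(earlier, prompt) > max_shared_chars:
                violations.append((earlier, prompt))
        seen.append(prompt)
    return violations
```

Running `shared_prefix_violations` over the benchmark inputs before a run would have flagged a warm-up that reuses the measured prompt. The nonce adds a few tokens to every prompt, so if the engine offers a way to disable its prefix cache outright, that is the cleaner option.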
2. It Will Brick Your Lab Machine
The latest agents can work autonomously for a full day or longer. The longer they run, the higher the chance something goes wrong.
My agent has, more than once, trashed the entire OS of a lab machine mid-run because of a missing quote or a flipped command-line argument. Files gone, environment wiped. It happens in a split second. You can’t stop it in time.
It’s not just me. In April, The Register reported a case where a coding agent hit a credential mismatch and, instead of stopping to ask a human, found a token with full privileges and deleted an entire company’s production database and all its backups in nine seconds. Thirty-plus hours of downtime. Three months of customer data, gone. There have been at least a dozen similar documented incidents in the past two years.
Anthropic and OpenAI are now pushing sandboxing. The idea isn’t complicated. Filesystem isolation on one layer, network isolation on another. Without filesystem isolation, the agent can touch things it shouldn’t. Without network isolation, a compromised agent can steal your keys.
My own approach is more low-tech: dedicate a machine exclusively to the agent, and don’t store anything else on it. If it runs for dozens of hours straight, the probability of a dumb mistake is non-zero. Reinstalling the OS costs time. Losing important data costs your sanity.
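If an agent's destructive file operations go through a wrapper you control, a path check can catch the missing-quote class of accident before it reaches the disk. The sketch below is an illustration under that assumption. The function name `is_inside_workspace` is hypothetical, and a check like this is a seatbelt, not a replacement for real filesystem isolation or a dedicated machine.

```python
from pathlib import Path


def is_inside_workspace(target: str, workspace: str) -> bool:
    """True only if target resolves to a path strictly inside workspace.

    Both paths are resolved first, so ".." segments and symlinks cannot
    be used to escape. The workspace root itself is rejected, so a
    command that collapses to "delete the whole workspace" is refused too.
    """
    root = Path(workspace).resolve()
    path = Path(target).resolve()
    return root in path.parents


# An empty variable turning "<workspace>/<dir>" into "/" is rejected.
print(is_inside_workspace("/", "agent_workspace"))                        # False
print(is_inside_workspace("agent_workspace/build/tmp", "agent_workspace"))  # True
```

The wrapper would call this before any delete or overwrite and refuse to proceed when it returns `False`.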
3. It Will Spin in Circles Until You Step In
Agents have another bad habit: they circle the same problem.
A recent goal was to run an inference engine on Windows in BF16 precision. The model weights were over 60 GB, and loading them caused an immediate OOM crash.
The agent’s response was interesting: it kept trying to work around the memory bottleneck. Load only some weights, dynamically read the rest during inference, every offload trick in the book. None of them worked, and each ate up a lot of time. It even added a warm-up to the tests to hide loading latency. That was part of the root cause of the prefix-cache problem I mentioned earlier.
I finally cut in and said: stop tweaking performance and fix the memory issue first. Until that bottleneck is solved, everything else is wasted effort.
The agent actually executes well. Once pointed in the right direction, it quickly found a series of system-level Windows settings to expand available memory and VRAM. After that was fixed, the optimization path smoothed out immediately. All the previous workarounds were suddenly useless. That time was basically wasted.
The problem is that an agent won’t proactively redefine the problem. Hand it “optimize performance” and it will keep grinding on that goal even when it is stuck on a prerequisite. It looks for ways around the blocker rather than telling you, “This assumption is false; we need to handle something else first.” Recognizing the real blocker and pulling the agent out of the dead end is a judgment call only a human can make.
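The judgment call stays with the human, but the signal that it is time to make one can be automated. Here is a minimal sketch, assuming each optimization round is logged with a short failure label such as `"oom"`, or `None` when the round succeeded. The labels, the function name `should_escalate`, and the limit of three rounds are all illustrative assumptions.

```python
from typing import List, Optional


def should_escalate(round_failures: List[Optional[str]], limit: int = 3) -> bool:
    """True when the last `limit` rounds all ended on the same failure label.

    round_failures holds one entry per round, oldest first: a short
    label such as "oom" for a failed round, or None for a success.
    """
    if limit < 1 or len(round_failures) < limit:
        return False
    tail = round_failures[-limit:]
    return tail[0] is not None and all(label == tail[0] for label in tail)


history = ["timeout", "oom", "oom", "oom"]
if should_escalate(history):
    print("Same blocker three rounds in a row: stop and ask a human.")
```

The check does not say what the real blocker is. It only tells you that the agent has hit the same wall repeatedly and that someone should look at the wall instead of the goal.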
4. Set the Bar Too High and You Ship Nothing
The last pitfall isn’t the agent’s fault. It’s mine.
The more powerful agents get, the easier it is to set the bar too high. Because they can run for days, you start thinking anything is fair game. Every direction looks like a top-conference breakthrough. So you spin up multiple threads, each one ambitious.
The result? Every thread is active, every thread shows progress, but nothing ships.
You keep burning tokens, you keep seeing “progress,” yet nothing reaches the user’s hands. It looks like work. It’s actually just burning money. I made this mistake recently: several threads were the kind that would be huge if they landed, but the execution risk was equally high. An agent isn’t a genie; if it can’t be done, it can’t be done. I burned a mountain of tokens and delivered nothing.
I eventually realized: narrow the scope. You need something shippable in the short-to-medium term and some worthwhile long-term explorations, not only the latter. Deliver what can be delivered first, stabilize the rhythm, then go after the big bets.
5. Knowing Just Enough Is Exactly Right
Look at the four pitfalls together and one thread connects them: you don’t need to be a deep expert to avoid any of them.
Performance jumped by orders of magnitude? Check whether you measured wrong first. The agent needs to run unattended all day? Give it a dedicated machine. Stuck on the same spot after three optimization rounds? That spot is the real problem. Every thread is running but none ships? Kill a few.
It’s all common sense.
An MIT Sloan article this year on managing in the age of agentic AI noted that the most important skills for managing agents are defining problems and validating outputs. Those are things AI still can’t do well. “Agent Manager” is already showing up on job boards, and one line in the job description stands out: domain common sense matters more than AI expertise.
Back to my previous post. “Ignorance is an advantage” still holds: you have to not know what’s hard in order to dare to try. But courage alone isn’t enough. The most valuable state is being willing to try, yet able to sense when something is off at the critical moment.
Total beginners get carried away by fake data. Deep experts get shackled by priors. The people in the middle, the ones who know just enough, are bold enough to act, yet wise enough to pull the reins when needed.
Agents will keep getting stronger. But that bit of human common sense, whether the numbers check out, whether the direction is right, or whether this should ship now, will only become more valuable. These are still things agents can’t handle.
References
- GPT-5.1 Codex Max Can Work Autonomously for Over 24 Hours — OpenAI
- GPT-5.5 Released: Multi-Hour Autonomous Session Capability — OpenAI
- Speculative Decoding’s Hidden Traps in Production: Real-World Performance 40–60% Lower Than in the Lab — tianpan.co
- MTP Acceptance Rate Variations Across Task Types — SudoStack
- Cursor + Claude Agent Deletes Entire Company Production Database in 9 Seconds — The Register
- 10 Real-World AI Agent Incidents Reviewed — DEV Community
- Claude Code Sandbox Design: Two-Layer Isolation Cuts Permission Prompts by 84% — Anthropic Engineering
- Agentic AI Redefines Management: 69% of Experts Call It a Paradigm Shift — MIT Sloan Management Review
- Core Skills for Managing AI Agents — World Economic Forum
- Agent Manager: The New Enterprise Role in 2026 — Beam AI