This is aggregated experience - mine personally, and that of public engineers and current and former colleagues who code daily in enterprise, build their own products, and crank out countless prototypes: Viktor Tulskyi, ThePrimeagen, Theo (t3.chat), Peter Steinberger, Gergely Orosz, and many other pragmatic engineers.
This isn't a research paper with metrics. These are practical conclusions from those who tried everything and came back to simplicity.
AI Can Be Exciting and Useful, But There Will Be "Buts"
I enjoy AI. I use it daily. I can now quickly do things I never understood before - at the level of PoC, MVP, internal utilities that turn a five-minute task into a second.
I built internal extensions for infrastructure, a console utility for transcribing voice to text via the clipboard, and a macOS utility that does the same thing more conveniently - this has become genuinely exciting for me.
Every day, I follow what's happening in the AI world. Every day I try something new:
- Tool updates, services like Supabase, Vertex, Convex
- New bullshit benchmarks from providers
- New papers on approaches, prompt combinations, contexts, models, and tokenization
So much to learn - and at the same time, so much that doesn't meet expectations. The useful part of everything I follow is ideas already worked out by other engineers, companies, and researchers.
And also - so many promotional, sales, and marketing videos. How someone built multi-agent orchestration, and the agents solve tasks on their own, practically without human involvement. Demos of full autonomy. Beautiful, polished, convincing.
I tried this too. Experimented. Temperatures, system prompts, fallbacks, and playgrounds with different models. Multi-model setups, multi-agent systems, sub-agents, RAG.
Everything, literally! And it's pouring from one empty vessel into another - busywork that goes nowhere. Everything falls apart in the details when it meets reality.
My Prompting Journey Looked Like This:
Stage 1: "Fix this, here's the stacktrace"
Stage 2: Multi-model, multi-agent, orchestration, RAG, complex system prompts, temperatures, fallbacks...
Stage 3: "Fix this, here's the stacktrace, here's when it happens, probably the problem is here"
The difference between the first and third stages: I now know exactly what needs to be fixed - where to look, what context to give, how to phrase the request.
Narrowing context always works better than 1 MILLION input tokens for writing code.
All that stuff in the middle - multi-agents, orchestration, RAG - was marketing I swallowed. And you can try it too. Try it! Seriously. So you understand firsthand HOW it doesn't work. And when it potentially could work.
But don't spend too much time on it - in the long run, for MVPs and large projects alike, it produces poorly controlled changes, each with its own unique consequences.
Accuracy
One of the key parameters for evaluating LLMs for code generation is accuracy.
More specifically:
- Pass@k - did at least 1 of k attempts pass tests
- Pass@1 - did the code complete the task on the first try
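For the record, Pass@k is usually estimated with the unbiased formula from the HumanEval paper (Chen et al., 2021). A minimal Python sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = total samples generated, c = samples that passed the tests,
    k = attempt budget being evaluated."""
    if n - c < k:
        return 1.0  # too few failing samples to fill k picks without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples generated, 4 passed the tests
print(pass_at_k(10, 4, 1))  # 0.4    -> pass@1
print(pass_at_k(10, 4, 5))  # ~0.976 -> pass@5
```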
Not "almost works", not "needs tweaking". Works or doesn't.
Not "how nicely it sounds". Not "how confidently written". But how repeatable and correct.
Accuracy isn't the only metric. For brainstorming, analysis, and review, variability can be useful. But for code that must compile, pass tests, and work in production, accuracy is more critical.
Code either works or it doesn't.
The interpreter and the compiler aren't creative personalities holding a dialogue with the author, and a "500 error" isn't abstract art in production.
The Agent/LLM Doesn't "Think"
Sorry to repeat myself, but an LLM is a probability generator for the next token. And when you make two token generators talk to each other, you don't get a "team". You get hallucinations that accumulate with each step.
What happens to "Accuracy" when agent A passes results to agent B, which passes to agent C? It drops. Exponentially. Each step adds variability. Each step moves further from the expected result.
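A back-of-the-envelope model makes the point. This is a toy calculation, assuming each handoff independently preserves the original intent with some probability p - real pipelines are messier, but the direction is the same:

```python
# Toy model, not a benchmark: assume each agent handoff preserves the
# original intent with probability p, independently of the others.
def chain_accuracy(p: float, handoffs: int) -> float:
    return p ** handoffs

print(chain_accuracy(0.90, 1))  # 0.90   single agent
print(chain_accuracy(0.90, 3))  # ~0.73  A -> B -> C
print(chain_accuracy(0.90, 5))  # ~0.59  full "orchestration"
```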
Multi-agent is a marketing term for "I wrote several prompts, and they call each other".
Multi-agent isn't an architecture. It's hoping that an LLM will magically understand the context you haven't formalized yourself.
"But Devin, Cursor, Windsurf background agents - they work!"
They work. In specific cases. On particular workflows. But compare: how much time will you spend configuring Cursor rules, custom agents, your own RAG system vs. understanding the project architecture, knowing where to look for the problem, and asking AI to brainstorm solution options?
A person who understands the system + a simple conversation with the "model" = (more often than not) a faster and more accurate solution than juggling multi-agent setups you spent a week configuring.
Sub-agents? A charade - as Peter Steinberger aptly called them in his blog post about how to work with "agents" more simply.
What others do through sub-agents, you can do through separate terminal windows. Full control. Full context visibility. Less exponential growth of hallucinations. And most importantly - the ability to verify results at each step!
MCP? Marketing Token Burner and Hallucination Igniter
Most of them should just be CLI or API clients.
GitHub MCP eats 23k context tokens from the start. gh CLI does the same - "for free".
Or generate a simple script that calls GitHub API for a specific task - you'll get predictable results without the magic of "now the agent will figure it out thanks to MCP".
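For example, something along these lines - a sketch only, with a placeholder repo name and a GITHUB_TOKEN assumed to be set in the environment - does exactly one thing, returns a predictable shape, and costs zero agent context until you actually run it:

```python
# Single-purpose script instead of an always-on MCP server.
# "owner"/"repo" are placeholders; GITHUB_TOKEN must be exported.
import json
import os
import urllib.request

def list_open_issues(owner: str, repo: str) -> list[dict]:
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues?state=open",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for issue in list_open_issues("owner", "repo"):
    print(issue["number"], issue["title"])
```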
MCP provides structure? So does an API. So does a CLI with the proper output format. The difference is that you fully control a script, while MCP is a black box, often limited, that can return anything, accidentally call the wrong tool, and eat context just because you initialized it in the agent.
Note: MCP for Figma is a separate story - for frontend work, it's genuinely useful. I'm more skeptical about MCP for browser/database - the value there is questionable.
RAG for Code?
In an enterprise, RAG makes sense for searching documentation, regulatory requirements, and internal wikis - if you have a DEDICATED team maintaining it. If that's you, write in the comments which cases it actually helps with, and that you're from a "wealthy family".
But if you're a developer who wants to set up RAG for your own codebase so the "model better understands the project," it's overkill without value. Modern models already search code well when given the right context.
Your time is better spent on understanding the architecture/structure than on configuring vector indexes.
A Separate Illusion - Context Window
"Gemini has a million tokens! You can throw in the entire codebase!"
There's the "Lost in the Middle" research - the model loses information in the middle of a large context. Yes, the research is old, and newer models show better results on synthetic tests like "needle in a haystack".
But even Google says the same about Gemini 3.0 Pro - the more facts you have to retrieve simultaneously (as in real work), the more sharply accuracy drops!
In practice - mine and my enterprise colleagues' - a large context still works worse. Not because the model "forgets", but because more context = more noise = more response variability = lower "Accuracy".
Compare:
- Throw 500k tokens of code and say, "find and fix the problem"
- Know where the problem likely is, give the relevant file and context for the fix
For simple things - find a function, update implementation, see where it's used - agents handle it fairly well. But for more complex changes or refactoring, you often get "shrapnel": changes scattered across the project, extra code created that then needs cleanup.
When you give more hints about the structure, it works more reliably. But for that, you need to understand the architecture and codebase. A million context tokens won't replace that.
Again:
LLM is a probability generation machine, not a logical processor.
More noise in - more noise out.
Autonomy
Agent autonomy works exactly until the task goes beyond the demo scenario. First edge case - and the whole "orchestration" falls apart.
Why don't those selling courses and tools talk about this? Because "it works for 70% of cases, and the rest needs manual work" - doesn't sell. "Full autonomy" - sells.
What Actually Works?
Iterative work!
Just talk to the model, literally. Check the result. Adjust. Repeat. Stop when something goes wrong.
Ask: "let's come up with different solutions", "what best practices exist for working with these specific problem domains", "now let's make a prompt for the next iteration" (APE/APRO)
Instead of blindly approving every action.
"Accuracy" grows not from a larger context or a larger number of agents. It grows through iterations under the control of someone who understands the task.
Most likely:
- You don't need overkill frameworks with tons of abstractions and configurations. Don't need sub-agents.
- Don't need MCP for every service - write a script that does a specific thing, or just use SDK or API.
- Don't need a million input context tokens - need the right context.
You need your head. Your technical expertise. Your understanding of the task. And a model you talk to like an engineer - not a magical oracle that will understand, find, and solve everything for you.
Neither expanded context, nor multi-agent, nor MCP, nor RAG - none of these increases accuracy/stability by itself. These are all tools that amplify an expert. They don't replace one.
Everything else is silver bullets for investors, content for LinkedIn, and sales presentations of yet another charade of autonomy.
P.S. This post wasn't written by an LLM. It was transcribed from monologues with my PoC "Dictate to Buffer"; the links and proofs come from my bookmarks stash, and certain parts with counterexamples were researched with Claude's help.
P.P.S. Most of my conclusions I reached myself based on my own trial and error, and then it turned out others had already tried and done this at a larger scale 😅