AI Bug Slayer 🐞

Posted on Jun 8

What Happens When You Run 10 AI Agents at Once in a Real Codebase

#aiagents #ai #llm #webdev

I spend a lot of time in the AI space -- reading papers, building things, talking to engineers who are actually shipping. There is a gap between what the demos show and what production systems look like, and almost nobody is being fully honest about it.

So here is my honest take on where things actually are.

The Problem With How We Talk About AI Agents

Everyone is calling everything an "agent" right now. A function that calls a tool? Agent. A chatbot with memory? Agent. A script with a loop? Agent.

This dilution is not just semantic. It is causing real engineering mistakes.

When you do not have a precise definition for what you are building, you end up over-engineering simple pipelines and under-engineering genuinely complex ones. I have seen teams spend weeks adding "agentic" orchestration to workflows that would have been fine as a single well-structured prompt.

Here is the definition I keep coming back to: an agent is a system that has an objective, not just an instruction. It decides what to do next. It handles failure. It knows when it is done.

Everything else is just a fancy function call.

🟢 If your system needs a human to tell it each step, it is not an agent. It is a chat interface.

🔵 If your system can recover from a failed tool call and try a different approach, you are getting somewhere.

✅ If your system can decompose a goal into subtasks and delegate them, that is the real thing.
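To make that definition concrete, here is a minimal sketch of the loop I have in mind. Everything in it is made up for illustration: `AgentRun`, `run_agent`, and the two toy tools are hypothetical names, and the "decide what to do next" step is just "try the next tool in the list" standing in for a real model call. The point is only the shape: an objective, a choice of next action, recovery from a failed tool call, and an explicit done condition.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentRun:
    objective: str
    attempts: list = field(default_factory=list)  # (tool name, outcome) pairs
    result: Optional[str] = None

    @property
    def done(self) -> bool:
        # The agent "knows when it is done": it has a result for the objective.
        return self.result is not None


def run_agent(objective: str, tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> AgentRun:
    """Try tools in preference order until one succeeds or the step budget runs out."""
    run = AgentRun(objective)
    remaining = list(tools)  # candidate approaches, in preference order
    for _ in range(max_steps):
        if run.done or not remaining:
            break
        name = remaining.pop(0)  # stand-in for a model deciding the next action
        try:
            run.result = tools[name](objective)
            run.attempts.append((name, "ok"))
        except Exception as exc:
            # Failure handling: record what went wrong, then try another approach.
            run.attempts.append((name, f"failed: {exc}"))
    return run


# Hypothetical tools: one that always fails, one that works.
def flaky_search(query: str) -> str:
    raise TimeoutError("search backend unavailable")


def local_index(query: str) -> str:
    return f"answer for {query!r} from local index"


run = run_agent("find the retry policy", {"search": flaky_search, "index": local_index})
print(run.attempts)
```

If you take away the `except` branch and the `done` check, what is left is a function call in a loop, which is roughly where I would draw the line.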

What Is Actually Happening in Production Right Now

The honest picture from teams I follow and talk to:

Most real agent deployments are narrow. They do one thing well. Customer support triage. Document extraction. Code review on a specific codebase. They are not general-purpose reasoning engines. They are purpose-built pipelines with some intelligence in the decision layer.

The teams getting good results are not chasing the latest model release. They are obsessing over:

☑️ Tool design -- what can the agent actually call, and how clean is the interface

☑️ Failure handling -- what happens when a tool returns nothing useful

☑️ Observability -- can you trace exactly why the agent made the decision it made
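Those three concerns fit in surprisingly little code. The sketch below is one way to do it, not a reference implementation: `ToolResult`, `call_tool`, `TRACE`, and `lookup_order` are names I invented for the example, and the in-memory list stands in for whatever log store you actually use. Every tool call goes through one wrapper that returns a structured result, treats an empty answer as a failure the agent can react to, and records a trace entry.

```python
from __future__ import annotations

import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    tool: str
    ok: bool
    value: Any = None
    error: Optional[str] = None
    elapsed_ms: float = 0.0


TRACE: list[dict] = []  # in-memory trace; assumed to be a real log store in production


def call_tool(name: str, fn: Callable[..., Any], **kwargs: Any) -> ToolResult:
    """Run a tool, normalise empty or failing output, and record a trace entry."""
    start = time.perf_counter()
    try:
        value = fn(**kwargs)
        if value in (None, "", [], {}):
            # "Nothing useful" is an explicit outcome, not a silent success.
            result = ToolResult(name, ok=False, error="empty result")
        else:
            result = ToolResult(name, ok=True, value=value)
    except Exception as exc:
        result = ToolResult(name, ok=False, error=f"{type(exc).__name__}: {exc}")
    result.elapsed_ms = (time.perf_counter() - start) * 1000
    TRACE.append({"args": kwargs, **asdict(result)})
    return result


# Hypothetical tool that returns nothing useful.
def lookup_order(order_id: str) -> list:
    return []


result = call_tool("lookup_order", lookup_order, order_id="A-17")
print(json.dumps(TRACE[-1]))
```

The design choice that matters is that the agent never sees a raw return value or a raw exception. It sees `ok`, `value`, and `error`, and the trace shows which arguments produced which outcome.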

The teams getting bad results are the ones that swapped out GPT-4 for the latest frontier model and expected different behavior without changing anything else.

Something I kept seeing pop up recently: Google just redesigned the search box for the first time in 25 years — here’s why it matters more than you think. from VentureBeat AI. For a quarter century, the Google search box has been one of the most recognizable interfaces in computing: a thin white rectangle, a blinking cursor, a few typed words, and a list...

Worth reading if you have not yet: https://venturebeat.com/technology/google-just-redesigned-the-search-box-for-the-first-time-in-25-years-heres-why-it-matters-more-than-you-think

Something I kept seeing pop up recently: Railway secures $100 million to challenge AWS with AI-native cloud infrastructure from VentureBeat AI. Railway, a San Francisco-based cloud platform that has quietly amassed two million developers without spending a dollar on marketing, announced Thursday that it raised $100 million...

Worth reading if you have not yet: https://venturebeat.com/infrastructure/railway-secures-usd100-million-to-challenge-aws-with-ai-native-cloud

Something I kept seeing pop up recently: Claude Code costs up to $200 a month. Goose does the same thing for free. from VentureBeat AI. The artificial intelligence coding revolution comes with a catch: it's expensive. Claude Code, Anthropic's terminal-based AI agent that can write, debug, and deploy code a...

Worth reading if you have not yet: https://venturebeat.com/infrastructure/claude-code-costs-up-to-usd200-a-month-goose-does-the-same-thing-for-free

The Framework Wars Are a Distraction

LangChain. LangGraph. CrewAI. AutoGen. Semantic Kernel. Every month there is a new one and someone is writing a post about why the old one is dead.

Here is what I actually think: the framework matters less than the patterns.

The patterns that keep working regardless of what framework you use:

✔️ Plan-then-execute. Have one reasoning step that produces a plan, and a separate execution step that follows it. Do not mix them.

✔️ Separate retrieval from reasoning. Fetching context and using context are different jobs. Systems that conflate them get confused.

✔️ Explicit handoffs. When one agent passes work to another, the handoff should be structured and logged. Not a string passed through a prompt.
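Here is a rough sketch of plan-then-execute with an explicit handoff, framework-free. The names (`Step`, `Handoff`, `plan`, `execute`, `HANDOFF_LOG`) are mine, the planner is hard-coded where a real one would call a model, and the tools are toy lambdas. What it shows is the separation: the planner produces the whole plan once, the handoff is a typed object that gets logged as JSON, and the executor only follows it.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    tool: str
    arg: str


@dataclass(frozen=True)
class Handoff:
    sender: str
    receiver: str
    goal: str
    steps: tuple[Step, ...]


HANDOFF_LOG: list[str] = []  # assumed stand-in for a persistent audit log


def plan(goal: str) -> Handoff:
    """Planner: produce the full plan up front (a real planner would call a model)."""
    steps = (Step("fetch", goal), Step("summarize", goal))
    handoff = Handoff("planner", "executor", goal, steps)
    HANDOFF_LOG.append(json.dumps(asdict(handoff)))  # structured and logged
    return handoff


def execute(handoff: Handoff, tools: dict[str, Callable[[str], str]]) -> list[str]:
    """Executor: follow the plan as given, with no re-planning mid-run."""
    return [tools[step.tool](step.arg) for step in handoff.steps]


# Hypothetical tools for the example.
tools = {
    "fetch": lambda q: f"fetched docs for {q}",
    "summarize": lambda q: f"summary of {q}",
}
outputs = execute(plan("billing errors"), tools)
print(outputs)
```

Because the handoff is data rather than a string buried in a prompt, you can diff two plans, replay one, or reject one before anything runs.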

I have rebuilt the same architecture in three different frameworks and the results were similar each time. The framework is scaffolding. The architecture is the building.

The Retrieval Problem Nobody Has Solved

RAG is standard now. Almost every production AI system that touches proprietary data uses some form of it. But there is a problem that the tutorials do not cover well.

The chunk boundaries are wrong.

When you split a document into chunks and embed them, you are making assumptions about what pieces of context belong together. Those assumptions are often wrong. A paragraph that only makes sense in light of the one before it is retrieved in isolation, and the model hallucinates the missing context.

🟢 Better chunking strategies help. Overlapping windows, semantic chunking, parent-document retrieval.

🔵 But the real fix is rethinking what you are storing. Sometimes the right thing to store is not the raw text but a structured representation of the information.

✅ If your RAG pipeline is returning technically correct but contextually useless results, the problem is almost certainly in the chunking or the metadata, not the embedding model.
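Two of the chunking ideas above, overlapping windows and parent-document retrieval, can be sketched in a few lines. This is a toy under loud assumptions: `Chunk`, `chunk_paragraphs`, and `retrieve_parent` are hypothetical names, paragraphs are split on blank lines, and plain keyword overlap stands in for embedding similarity. The idea it illustrates is that each chunk carries its `doc_id` as metadata, so you can match on a small chunk but hand the model the full parent document.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # metadata linking the chunk back to its parent document
    index: int
    text: str


def chunk_paragraphs(doc_id: str, text: str, window: int = 2, overlap: int = 1) -> list[Chunk]:
    """Emit sliding windows of `window` paragraphs that share `overlap` paragraphs."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    step = window - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(paragraphs), step):
        body = "\n\n".join(paragraphs[start:start + window])
        chunks.append(Chunk(doc_id, len(chunks), body))
        if start + window >= len(paragraphs):
            break  # the last window already reaches the end of the document
    return chunks


def retrieve_parent(query: str, chunks: list[Chunk], docs: dict[str, str]) -> str | None:
    """Match on small chunks, return the whole parent document.

    Keyword overlap is a stand-in for embedding similarity.
    """
    terms = set(query.lower().split())
    best = max(chunks, key=lambda c: len(terms & set(c.text.lower().split())), default=None)
    return docs[best.doc_id] if best else None


docs = {
    "policy": "Refunds take five days.\n\nThis applies only to card payments.\n\nBank transfers take ten days."
}
chunks = chunk_paragraphs("policy", docs["policy"])
context = retrieve_parent("card payments", chunks, docs)
print(len(chunks), context == docs["policy"])
```

Notice that the middle paragraph ("This applies only to card payments.") is meaningless on its own. With a one-paragraph overlap it always travels with a neighbor, and with the parent lookup the model gets the sentence it refers to.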

Where I Think This Is All Going

The models are going to keep getting better. Context windows are going to keep expanding. The cost per token is going to keep dropping.

None of that changes the fundamental engineering challenge: building systems you can trust to behave correctly when you are not watching.

That is the problem worth solving. Governance, observability, and reliable tool use. Not chasing benchmarks.

The engineers who are going to matter in two years are the ones who can build AI systems that other engineers can maintain and trust. That is a different skill set than fine-tuning or prompt engineering.

It is closer to systems design than it is to model research.

If any of this resonates with what you are building, or if you have a completely different take, I want to hear it. Drop your experience in the comments. The interesting conversations in this space are not in the keynotes -- they are in the threads where people are actually honest about what works.