TL;DR
- AI agents in real products fall into 4 levels: LLM wrapper → intent classifier → context-aware → agent loop.
- Most "AI agents" you meet in production are stuck at level 1 or 2, which is why they feel dumb on top of very smart models.
- The gap between levels is rarely the model. It's context management and the agent loop. Part 2 covers how to climb the levels — this post is about what the levels are and why most products stall.
Last month I tried to cancel an order through a SaaS company's shiny new "AI agent."
Me: "Cancel my latest order if it hasn't shipped yet."
Agent: "Here is our refund policy: [link]. Is there anything else I can help with?"
Then I asked a customer-support bot a follow-up referencing my previous message. It had already forgotten.
We are in the middle of an AI boom. Frontier models can write production code and reason through multi-step problems. And yet the AI agents shipped inside real products often feel like 2018-era chatbots in a new costume.
Here is the part I find funny. After two messages with one of these things, you can usually guess the architecture from the outside. The second message reveals the architecture, every time.
Why? Because most of what people call an "AI agent" is not one. It is an LLM API call wearing the word as a marketing label.
A note before we start: classifications of agents already exist — Anthropic's workflows vs. agents, Harrison Chase's cognitive architectures, Lilian Weng's planning / memory / tool use decomposition, and a handful of "SAE-style autonomy levels" posts. The four levels below are not new theory. They are the four shapes I keep seeing when I poke at real production agents from the outside.
Level 1 — The LLM API Chatbot
The most basic form.
User input → LLM API call → Response
System prompt, maybe. No tools. No memory. No retrieval. No state.
I am honestly not sure this should be called an agent at all. It is an LLM API wrapper with a friendly UI on top. But at the product level, plenty of teams still call this "our AI agent."
It can handle FAQ-style questions. The moment a user says something like:
"Can you do the same thing with the settings I mentioned earlier?"
…or:
"Check my latest order and cancel it if it hasn't shipped yet."
…the seams show immediately. Nothing is connected to anything. The model is guessing.
This is the level most "AI features bolted onto an existing SaaS" sit at. And it is exactly why users walk away thinking AI is overhyped.
Level 2 — Intent Classification Agent
Probably the most common level in production today.
User input
→ Intent classification
→ Intent-specific handler
→ Response
For a customer-support bot the intents might look like:
- Refund request
- Shipping question
- Payment issue
- Account problem
- Escalate to human
Within a tightly scoped domain this works surprisingly well. If you know your user requests fall into a small number of buckets, intent classification is cheap, fast, and easy to monitor.
The weakness shows up the moment users do what users actually do: combine intents.
"I want to cancel the thing I paid for yesterday, but I think it may have already shipped."
That is payment, cancellation, refund, and shipping in one sentence. A classic single-intent classifier routes this to one handler and ignores the rest. The user gets half an answer and gives up.
Modern multi-intent classifiers help. They do not fix the ceiling: the agent is only as good as the intents you predefined. Anything outside the schema falls off a cliff.
If you ask Claude or Codex to "build me an AI agent," there is a good chance you will end up here. It is a fine starting point. It is not a high-level agent.
Level 3 — Context-Aware Agent
Level 3 is the first level where the user stops feeling like they are filling out a form.
User input
+ Previous conversation context
+ Stored facts / preferences
→ Reasoning
→ Response
By "context" here I mean conversational memory across turns. What the user said two messages ago. Their stated preferences. The entities they already referred to. Not the runtime working context an agent loop carries between tool calls. (That shows up in Level 4.)
The agent maintains state across the conversation. "Use the option I mentioned earlier" actually works. The user does not have to repeat themselves every turn.
The prompt is the easy part. Context management is where it gets ugly. LLMs do not have infinite memory, and naively stuffing the full history into every call breaks cost and quality at the same time.
The usual strategies:
- Keep the most recent N messages verbatim
- Summarize older messages into a compact form
- Extract durable facts ("user prefers email over Slack", "ordered SKU 1042") and store them separately
- Slide the context window: keep the head (system + key facts) and the tail (recent turns), compress the middle
A Level 3 agent feels like it remembers you. A Level 2 agent feels like every message is its first day on the job.
Level 4 — Agent Loop
This is where an agent becomes an actual agent.
The model does not just generate a reply. It decides what action to take, executes a tool, observes the result, and decides again — until the task is done or a budget is hit.
Say the user asks:
"Find and fix the login bug in this project."
A Level 1 chatbot guesses. A Level 4 agent does something like:
- Inspect the project structure
- Search for login-related files
- Read the relevant code
- Check error logs or failing tests
- Hypothesize the root cause
- Edit the code
- Run tests
- If tests fail, loop back to step 3
- Report
At this point the agent is no longer answering questions. It is doing work.
The part nobody warns you about when you start building one: the model matters, but tools matter at least as much. A great model with badly designed tools will pick the wrong one, pass the wrong arguments, or loop forever. A merely-good model with well-designed tools punches far above its weight.
That is the topic of Part 2.
A Note on the "Levels"
These are not strict maturity levels. Memory (Level 3) and the agent loop (Level 4) are independent axes, not stacked floors. A stateless coding agent can run a strong Level 4 loop inside a single task with almost no conversational memory. A customer support assistant can have rich user memory and no autonomous loop at all. I am using "levels" as shorthand for product behavior the user perceives, not a formal architecture ladder where each rung technically depends on the one below.
If you take only the ladder away from this post, you took the wrong thing. Take the four shapes.
Why Most Products Stall at Level 1–2
If Level 4 is so clearly better, why are most shipped "AI agents" stuck on the bottom two rungs?
A few honest reasons:
-
Level 1 is one weekend of work. A system prompt and a
chat.completionscall. Demos beautifully. Falls over the moment a real user shows up. - Level 2 fits how PMs already think. Intents map cleanly onto support tickets, KPIs, and existing decision trees. The org chart pulls toward Level 2.
- Level 3 requires unsexy infra. Context summarization, fact extraction, durable per-user state. None of it is one prompt away. All of it is operationally annoying. (Note: this is conversational memory, not RAG. Retrieval is a separate axis, not a prerequisite.)
- Level 4 requires real tools. A loop is meaningless without things to call. Building, scoping, and securing tools that touch your production systems is the part that scares teams — so they ship the chatbot version and call it a day.
The result is a market full of "AI agents" that share the badge but not the behavior. The badge is cheap. The badge has been cheap for two years. The behavior is what users are still waiting for.
Part 2 — "How to actually design a Level 4 agent" — covers tool design, a reasonable starting architecture, and the mistakes that make agents loop forever. Coming next.
If this matched your experience building or using AI agents, I'd love to hear which level your current project is at — and what broke when you tried to climb to the next one.

Top comments (0)