I've watched teams spend weeks rewriting the same system prompt.
Different phrasings. More examples. Clearer instructions. The agent still picks the wrong tool. Still hallucinates. Still feels broken.
Then they rename six functions and accuracy jumps 30%.
This pattern shows up constantly. The model doesn't care how clever your prompt is. It cares about what it can see.
The problem I see everywhere
Teams treat prompts like magic spells. Say the right words, get the right output.
But agents aren't following instructions. They're making predictions based on everything in context. The tool names. The API responses. The error messages. The structure of your data.
That's perception. And it matters way more than your system prompt.
Most teams optimize the wrong layer. They iterate on prompts for weeks while their tool names are handleData and processRequest. The model has no chance.
Here are 10 patterns I've seen work across the past two years of helping teams build production agents 💪
1. Tool names are the real prompt
Bad tool names are invisible to the model.
I audit client codebases and find this constantly:
// ❌ The model has no idea what this does
async function handleRequest(data: unknown) { }
// ✅ Now it knows exactly when to use this
async function createShipmentFromOrder(orderId: string) { }
One client had 47 tools. Half had names like processData or executeAction. The model was guessing.
We renamed 12 functions. Tool selection accuracy went from 60% to 87%. No prompt changes.
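A quick way to catch this in your own codebase: scan tool names for vague verbs. This is only a sketch; the verb list below is a starting point I made up, not a standard.

```typescript
// Flag tool names built from vague verbs like "handle" or "process".
// The verb list is an illustrative assumption; extend it for your codebase.
const VAGUE_VERBS = ["handle", "process", "execute", "manage", "do", "run"];

function isVagueToolName(name: string): boolean {
  const lower = name.toLowerCase();
  return VAGUE_VERBS.some((verb) => lower.startsWith(verb));
}

// Returns the subset of tool names worth renaming.
function auditToolNames(names: string[]): string[] {
  return names.filter(isVagueToolName);
}
```

Run it over your tool registry before you touch a single prompt.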
2. Tool descriptions matter more than you think
The model reads descriptions to decide which tool to pick.
I tell clients: write descriptions like you're onboarding a new developer. Because you are.
// ❌ Vague description
const tool = {
  name: "searchRecords",
  description: "Search for records in the system"
}
// ✅ Specific description with constraints
const tool = {
  name: "searchShipments",
  description: "Search shipments by tracking number, origin, destination, or date range. Returns max 50 results. Use filters to narrow results before searching."
}
Specific descriptions reduce wrong tool selection by 30-40% in my experience.
3. Passing everything into context is lazy
I've reviewed architectures where teams dump entire conversation histories into context. 20 turns. 50 tool results. Everything.
The model drowns.
What works:
- Last 3 turns by default
- Relevant retrieved docs only
- Structured summaries instead of raw data
Less context. Better decisions. Faster responses.
One team cut their context by 60% and saw answer quality improve. Counter-intuitive until you realize the model was distracted by noise.
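The trimming itself is simple. Here's a minimal sketch assuming a flat message array; the turn count and the summary step are illustrative defaults, not a fixed recipe.

```typescript
type Message = { role: "user" | "assistant" | "tool"; content: string };

// Build a lean context: last N turns verbatim, older turns folded into
// one summary line, and only a handful of retrieved docs.
function buildContext(
  history: Message[],
  retrievedDocs: string[],
  turnsToKeep = 3
): Message[] {
  // One turn = one user message + one assistant message.
  const recent = history.slice(-turnsToKeep * 2);
  const older = history.slice(0, -turnsToKeep * 2);
  // In production you'd generate a real structured summary here.
  const summary: Message[] =
    older.length > 0
      ? [{ role: "assistant", content: `Summary of ${older.length} earlier messages` }]
      : [];
  const docs: Message[] = retrievedDocs
    .slice(0, 5)
    .map((d) => ({ role: "tool", content: d }));
  return [...summary, ...docs, ...recent];
}
```

The model sees a summary plus the recent exchange instead of 20 raw turns.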
4. Scoped retrieval beats broad retrieval
Early RAG implementations pull from everywhere. The whole knowledge base. 200+ docs. The model has no idea which ones matter.
I push clients toward module-level filtering. If someone asks about shipments, only retrieve shipment docs.
// ❌ Retrieve from everything
const docs = await retriever.search(query);
// ✅ Scope to relevant module
const docs = await retriever.search(query, {
  module: detectModule(query),
  maxResults: 5
});
Recall goes up. Hallucinations go down. Should be the default from day one.
5. Structured outputs prevent downstream chaos
If another agent or system consumes your output, structure it.
// ❌ Free text response
"I found 3 shipments that match. The first one is #12345 going to Chicago..."
// ✅ Structured response
{
  "shipments": [
    { "id": "12345", "destination": "Chicago", "status": "in_transit" },
    { "id": "12346", "destination": "Denver", "status": "delivered" }
  ],
  "total": 3,
  "hasMore": true
}
Unstructured responses compound errors. Each downstream consumer has to parse and guess. I've seen entire pipelines break because one agent returned prose instead of JSON.
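One cheap defense is to validate the agent's output before anything downstream touches it. This sketch assumes the shipment shape above; a schema library would do the same job more thoroughly.

```typescript
type Shipment = { id: string; destination: string; status: string };
type ShipmentResult = { shipments: Shipment[]; total: number; hasMore: boolean };

// Parse and shape-check the agent's raw output. Prose instead of JSON
// fails immediately here, not three consumers downstream.
function parseShipmentResult(raw: string): ShipmentResult {
  const data = JSON.parse(raw); // throws loudly on non-JSON
  if (
    !Array.isArray(data.shipments) ||
    typeof data.total !== "number" ||
    typeof data.hasMore !== "boolean"
  ) {
    throw new Error("Agent returned JSON with an unexpected shape");
  }
  return data as ShipmentResult;
}
```

Fail at the boundary, where the error message still points at the agent that caused it.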
6. Silent failures are invisible failures
The model can't fix what it can't see.
I audit error handling in every client codebase. Same pattern:
// ❌ Silent failure
if (!hasPermission) {
  return null
}
// ✅ Loud failure
if (!hasPermission) {
  return {
    error: "PERMISSION_DENIED",
    message: "User lacks 'shipments.create' permission",
    requiredPermission: "shipments.create",
    suggestedAction: "Request access from workspace admin"
  }
}
Explicit errors let the agent reason about what went wrong. And let you debug faster.
7. Real system state beats assumed state
I watched an agent confidently tell a user their shipment was delivered.
It wasn't. The agent assumed based on typical timelines. It never checked the actual record.
This happens when teams don't pass real state:
// ❌ Agent has to guess
const context = {
  orderId: "12345"
}
// ✅ Agent knows the truth
const context = {
  shipment: {
    id: "12345",
    status: "in_transit", // actual current status
    lastUpdate: "2024-01-15T10:30:00Z",
    currentLocation: "Memphis hub"
  }
}
Agents will make up state if you don't give them real state. Always.
8. Specialized agents beat one generalist
I've seen teams try to build one agent that handles everything. Customer questions. Data entry. Workflow automation. Reports.
It's mediocre at all of them.
The pattern that works:
- One agent for Q&A using org context
- One agent for record operations with strict schemas
- One agent for document extraction with specialized prompts
Each one is easier to eval. Easier to constrain. Easier to improve.
Generalist agents are harder to debug and harder to trust. I push clients toward decomposition early.
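The decomposition can start with a plain router in front of the specialized agents. The keyword matching below is a stand-in; in practice a classifier, or the model itself, picks the agent.

```typescript
type AgentKind = "qa" | "records" | "extraction";

// Route a request to one specialized agent. The patterns are
// illustrative assumptions, not production intent detection.
function routeRequest(input: string): AgentKind {
  if (/extract|parse this document/i.test(input)) return "extraction";
  if (/create|update|delete/i.test(input)) return "records";
  return "qa"; // default: answer from org context
}
```

Each branch now gets its own prompt, its own schema, and its own eval set.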
9. Guardrails should block bad things, not useful things
I've seen guardrails so aggressive they blocked legitimate business operations.
"Can you help me set up a webhook?" → BLOCKED (mentions code execution)
"What's the API endpoint for shipments?" → BLOCKED (mentions API)
The users stopped trusting the product. Not because the AI was bad. Because the guardrails were dumb.
Narrow guardrails work better. Be specific about what's actually dangerous. Allow everything else.
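In code, narrow means matching specific dangerous actions instead of broad topics. The patterns below are illustrative assumptions, not a complete list.

```typescript
// Block concrete dangerous actions, not whole topics like "API" or "code".
const BLOCKED_ACTIONS: RegExp[] = [
  /drop\s+table/i,        // destructive SQL
  /rm\s+-rf/i,            // destructive shell command
  /api[_-]?key\s*[:=]/i,  // credential exfiltration
];

function isBlocked(message: string): boolean {
  return BLOCKED_ACTIONS.some((pattern) => pattern.test(message));
}
```

Both questions from the example above pass through, while genuinely destructive requests still get stopped.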
10. Audit perception before rewriting prompts
When a client tells me their agent is underperforming, I ask these questions first:
- Can it see the right tools? Are names and descriptions clear?
- Can it see the right context? Or is it drowning in noise?
- Can it see real state? Or is it guessing?
- Can it see errors? Or do failures happen silently?
Nine times out of ten, the problem is perception. Not the prompt.
The outcome when you get this right
Teams that engineer perception instead of prompts:
- Stop the endless prompt iteration cycle
- Get measurable accuracy improvements in days, not months
- Build agents that actually work in production
- Have clear debugging paths when things break
The teams that keep tweaking prompts stay stuck. I've seen it enough times to know.
The mental model shift
Prompt engineering asks: "How do I word this better?"
Perception engineering asks: "What does the agent need to see to make a good decision?"
One has diminishing returns after a few iterations.
The other compounds as your system improves.
Stop rewriting prompts. Start auditing what your agent can perceive.
- Rename tools for clarity
- Scope your context
- Pass real state
- Make errors loud
- Use specialized agents
Your agent is only as good as what it can see 👀
If you're building agents and want a second set of eyes on your architecture, I help teams get this right. DM me on X or LinkedIn.