<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Craig Tracey</title>
    <description>The latest articles on DEV Community by Craig Tracey (@craigtracey).</description>
    <link>https://dev.to/craigtracey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817570%2F9a4d722d-9d46-4da2-a2e5-6eba1cfaa00c.jpeg</url>
      <title>DEV Community: Craig Tracey</title>
      <link>https://dev.to/craigtracey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/craigtracey"/>
    <language>en</language>
    <item>
      <title>Building AI Agents: The Fundamentals</title>
      <dc:creator>Craig Tracey</dc:creator>
      <pubDate>Sun, 29 Mar 2026 17:31:46 +0000</pubDate>
      <link>https://dev.to/craigtracey/building-ai-agents-the-fundamentals-260h</link>
      <guid>https://dev.to/craigtracey/building-ai-agents-the-fundamentals-260h</guid>
      <description>&lt;p&gt;Everyone's building agents. Most are building them wrong. Not because they lack skill, but because they lack the right mental models. Before you write a line of code, you need to understand what agents actually are and how they differ from everything else you've built.&lt;/p&gt;

&lt;p&gt;Here are twelve rules that will save you from the mistakes we made.&lt;/p&gt;

&lt;h2&gt;1. Understand the Loop&lt;/h2&gt;

&lt;p&gt;An agent is not a chatbot with tools. It's not RAG with extra steps. It's a system that perceives, reasons, and acts in a loop until a goal is achieved.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chatbot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Question &amp;gt; Response&lt;/td&gt;
&lt;td&gt;Single answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Question &amp;gt; Retrieve &amp;gt; Response&lt;/td&gt;
&lt;td&gt;Answer with context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Goal &amp;gt; Reason &amp;gt; Act &amp;gt; Observe &amp;gt; Repeat&lt;/td&gt;
&lt;td&gt;Task accomplished&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference is autonomy. A chatbot answers. An agent accomplishes.&lt;/p&gt;

&lt;p&gt;Years ago, our first "agent" was basically a chatbot with a for-loop. It would answer, we'd manually feed the answer back in, repeat. It took us three weeks to realize we'd reinvented the agentic loop badly.&lt;/p&gt;

&lt;p&gt;This loop has a name: the &lt;strong&gt;agentic loop&lt;/strong&gt;. Every agent framework implements some version of it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;perceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thought&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thought&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;available_tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM is the reasoning engine. Everything else (tools, memory, evaluation) is scaffolding you build around it.&lt;/p&gt;

&lt;p&gt;Here's what most tutorials miss: &lt;strong&gt;agents can spawn other agents&lt;/strong&gt;. A complex task decomposes into subtasks, each with its own loop. The outer agent orchestrates while inner agents execute. This is how you build systems that tackle problems too large for a single context window. The orchestrator maintains high-level state while delegating details to specialists that run, complete, and return results.&lt;/p&gt;

&lt;p&gt;Common orchestration patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical:&lt;/strong&gt; A supervisor routes tasks to specialized sub-agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential:&lt;/strong&gt; Agents hand off to each other in a pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel:&lt;/strong&gt; Multiple agents work simultaneously, results are merged&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hub-and-spoke:&lt;/strong&gt; A central coordinator fans out and collects results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: inner agents should be narrow and specialized. The orchestrator handles routing and state. Poor orchestration causes cascading failures (especially in hierarchical setups where routing decisions compound errors downstream). One confused sub-agent can poison the entire task.&lt;/p&gt;
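&lt;p&gt;The hierarchical pattern can be sketched in a few lines. This is a toy sketch, not a framework: the specialist functions stand in for full sub-agent loops, and the routing table is hypothetical.&lt;/p&gt;

```python
# Toy hierarchical orchestration: a supervisor routes subtasks to
# narrow specialists and keeps the high-level state itself.
def research_agent(task):
    return f"notes on {task}"

def writing_agent(task):
    return f"draft for {task}"

SPECIALISTS = {"research": research_agent, "write": writing_agent}

def supervisor(subtasks):
    state = {}  # cross-task state stays with the orchestrator
    for kind, task in subtasks:
        state[task] = SPECIALISTS[kind](task)  # route to one specialist
    return state

results = supervisor([("research", "caching"), ("write", "summary")])
```

&lt;p&gt;The point of the shape: specialists know nothing about each other, and all cross-task state lives with the supervisor.&lt;/p&gt;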

&lt;h2&gt;2. Context Is Working Memory&lt;/h2&gt;

&lt;p&gt;LLMs have a context window. Some frontier models now offer 1M+ tokens. Sounds like a lot. It isn't.&lt;/p&gt;

&lt;p&gt;Every turn of the agentic loop adds to context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user's original goal&lt;/li&gt;
&lt;li&gt;Every tool call and its result&lt;/li&gt;
&lt;li&gt;Every reasoning step&lt;/li&gt;
&lt;li&gt;Every error and retry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A complex task might take 20 tool calls. Each tool result might be 500 tokens. That's 10K tokens just in tool results. Add system prompts, conversation history, and reasoning traces and you're at 50K tokens before you've done anything interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context is not free storage.&lt;/strong&gt; It's working memory. The more you stuff in, the worse the LLM reasons. Studies show performance degrades well before you hit the limit due to the "lost in the middle" effect, where models increasingly ignore information in the middle of long contexts. Even 1M+ token windows don't solve this. Bigger windows don't fix reasoning quality without better context engineering.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarize tool results aggressively&lt;/li&gt;
&lt;li&gt;Don't keep full history, keep relevant history&lt;/li&gt;
&lt;li&gt;Design tools that return focused data, not everything&lt;/li&gt;
&lt;li&gt;Use external stores (vector databases, knowledge graphs) for long-term facts&lt;/li&gt;
&lt;li&gt;Consider summarization chains for very long tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best agents use the least context to accomplish the goal.&lt;/p&gt;
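&lt;p&gt;In practice, "keep relevant history" can start as a sliding window plus aggressive truncation of tool results. A rough sketch; word counts stand in for a real tokenizer, and the budgets are illustrative:&lt;/p&gt;

```python
# Sketch: cap context by truncating tool results and dropping the
# oldest turns first. Word count stands in for a real tokenizer.
MAX_RESULT_WORDS = 50
MAX_CONTEXT_WORDS = 400

def truncate_result(text, limit=MAX_RESULT_WORDS):
    words = text.split()
    if len(words) > limit:
        return " ".join(words[:limit]) + " [truncated]"
    return text

def trim_history(turns, budget=MAX_CONTEXT_WORDS):
    # Walk backwards so the most recent turns survive.
    kept, used = [], 0
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```

&lt;p&gt;Real systems layer summarization on top of this, but even a dumb window keeps the loop from drowning in stale tool output.&lt;/p&gt;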

&lt;h2&gt;3. Tools Are Your Interface&lt;/h2&gt;

&lt;p&gt;An LLM can only think. It can't do. Tools bridge that gap.&lt;/p&gt;

&lt;p&gt;A tool is a function the LLM can call. It has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;name&lt;/strong&gt; the LLM uses to invoke it&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;description&lt;/strong&gt; that tells the LLM when to use it&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;schema&lt;/strong&gt; that defines what parameters it accepts&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;implementation&lt;/strong&gt; that actually does the work
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_services"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Search for services by name, owner, or tag. Use this when you need to find services matching certain criteria."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Search term to match against service names and descriptions"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Filter by team that owns the service"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools are how you expose capabilities to the agent. Get them wrong, and the agent can't do its job no matter how good the model is. Get them right, and even a weaker model can accomplish complex tasks.&lt;/p&gt;

&lt;p&gt;The temptation is to give agents every tool they might need. Access to GitHub? Add the GitHub tools. AWS? Load those too. Slack, Jira, databases... pile them on. This is a mistake. Every tool is a decision point. Every decision point is a chance for the LLM to choose wrong. We've seen noticeable degradation often starting around 10-25 tools, with severe hallucination issues at 100+. Not because the tools were bad, but because the model couldn't reason over that many options. Start minimal. Add tools only when you hit a wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security matters from day one.&lt;/strong&gt; Tools are capabilities, and capabilities can be abused. Start with minimal permissions (least-privilege). Use scoped tokens rather than broad API keys. Consider policy gates that require approval for high-risk actions. Standards like Model Context Protocol (MCP) are emerging to provide safer, standardized tool and data access.&lt;/p&gt;

&lt;h2&gt;4. Tool Descriptions Are Prompts&lt;/h2&gt;

&lt;p&gt;The description matters more than you think. The LLM decides which tool to use based on the description. A vague description leads to wrong tool choices. A precise description leads to correct ones.&lt;/p&gt;

&lt;p&gt;Bad: &lt;code&gt;"description": "Gets service information"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Good: &lt;code&gt;"description": "Search for services by name, owner, or tag. Returns a list of matching services with their IDs, names, and basic metadata. Use this when you need to find services. Do not use this to get detailed information about a specific service you already know."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The description is a prompt. Write it like one.&lt;/p&gt;

&lt;h2&gt;5. Agents Thrive with Structure&lt;/h2&gt;

&lt;p&gt;LLMs generate text. Agents need structured data. Bridging this gap eliminates many common failures.&lt;/p&gt;

&lt;p&gt;When an LLM calls a tool, it must output valid JSON matching the schema. When it reasons, you often want that reasoning in a parseable format. When it decides it's done, you need to know what it concluded.&lt;/p&gt;

&lt;p&gt;Two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Constrained Decoding:&lt;/strong&gt; Force the LLM to output valid JSON at the token level. OpenAI and Anthropic both support this. The LLM literally cannot generate invalid JSON because the sampling is constrained to valid tokens only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Schema Validation with Retry:&lt;/strong&gt; Let the LLM generate freely, validate the output, retry if invalid. Works but wastes tokens and time.&lt;/p&gt;

&lt;p&gt;Use constrained decoding when available. It's not just more reliable, it's faster because you never retry.&lt;/p&gt;

&lt;p&gt;Tool calls are already constrained by the API. But if you're parsing custom output (reasoning traces, final answers, intermediate state), enforce structure.&lt;/p&gt;
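&lt;p&gt;When constrained decoding isn't available, the validate-and-retry approach looks roughly like this. A sketch using only the standard library; &lt;code&gt;call_llm&lt;/code&gt; is a hypothetical stand-in for your model client:&lt;/p&gt;

```python
import json

def parse_with_retry(call_llm, prompt, required_keys, max_attempts=3):
    """Ask for JSON, validate it, and feed failures back into the prompt."""
    for attempt in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Not JSON at all: tell the model what went wrong and retry.
            prompt = f"{prompt}\n\nYour last output was invalid JSON ({exc}). Output only JSON."
            continue
        missing = [k for k in required_keys if k not in data]
        if missing:
            # Valid JSON, wrong shape: name the missing keys and retry.
            prompt = f"{prompt}\n\nYour last output was missing keys {missing}. Output only JSON."
            continue
        return data
    raise ValueError("no valid structured output after retries")
```

&lt;p&gt;This is the fallback path; constrained decoding makes every one of these retries unnecessary.&lt;/p&gt;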

&lt;h2&gt;6. Plan for Hallucination&lt;/h2&gt;

&lt;p&gt;Hallucination isn't a bug you can fix. It's a property of how LLMs work. They predict likely tokens. Sometimes likely tokens are wrong.&lt;/p&gt;

&lt;p&gt;In agent systems, hallucination shows up as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Invented tool names:&lt;/strong&gt; The LLM calls a tool that doesn't exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fabricated parameters:&lt;/strong&gt; The LLM passes an ID it made up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False confidence:&lt;/strong&gt; The LLM claims to have done something it didn't&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imagined results:&lt;/strong&gt; The LLM describes tool output that didn't happen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We had a case where our prompt included an example UUID with the explicit instruction: "THIS IS AN EXAMPLE UUID. DO NOT USE THIS VALUE." The agent used it anyway. Repeatedly. Prompts don't override pattern matching.&lt;/p&gt;

&lt;p&gt;You can't prompt your way out of this. You engineer around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate everything.&lt;/strong&gt; Don't trust tool parameters. Check that IDs exist before using them. Don't trust completion claims. Verify the goal was actually achieved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail gracefully.&lt;/strong&gt; When hallucination happens (it will), the system should recover. Return clear errors. Allow retries. Log what happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduce opportunity.&lt;/strong&gt; The fewer choices an LLM has, the less it hallucinates. Fewer tools. Shorter context. More specific prompts. Every reduction in complexity is a reduction in hallucination surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add reflection.&lt;/strong&gt; Have the agent critique its own outputs before acting. A self-check step catches many errors before they cause harm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalate when stakes are high.&lt;/strong&gt; For irreversible actions like deleting data, sending emails, or deploying code, require human confirmation. The agent proposes; the human approves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use verification agents.&lt;/strong&gt; For critical tasks, a separate agent can validate outputs before execution. Two models are less likely to make the same mistake.&lt;/p&gt;
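&lt;p&gt;Several of these mitigations reduce to a single guard between the model's proposed tool call and execution. A minimal sketch; the &lt;code&gt;HIGH_RISK&lt;/code&gt; set and the &lt;code&gt;known_ids&lt;/code&gt; lookup are illustrative:&lt;/p&gt;

```python
HIGH_RISK = {"delete_service", "send_email"}  # illustrative set

def guard_tool_call(name, params, registry, known_ids):
    """Reject hallucinated tools and IDs; escalate risky calls for approval."""
    if name not in registry:
        return ("error", f"unknown tool: {name}")  # invented tool name
    service_id = params.get("service_id")
    if service_id is not None and service_id not in known_ids:
        return ("error", f"unknown id: {service_id}")  # fabricated parameter
    if name in HIGH_RISK:
        return ("needs_approval", name)  # irreversible: human confirms first
    return ("ok", name)
```

&lt;p&gt;Everything the model proposes passes through this gate before anything real happens.&lt;/p&gt;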

&lt;h2&gt;7. Prompts Are Code&lt;/h2&gt;

&lt;p&gt;The system prompt is the most important code you write. It's also the least tested.&lt;/p&gt;

&lt;p&gt;A system prompt for an agent typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity:&lt;/strong&gt; Who the agent is and what it does&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; What it must never do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructions:&lt;/strong&gt; How it should approach tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool guidance:&lt;/strong&gt; When to use which tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output format:&lt;/strong&gt; How to structure responses
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an infrastructure assistant that helps engineers understand and operate their systems.

CONSTRAINTS:
- Never modify production systems without explicit confirmation
- Never expose secrets, tokens, or credentials in responses
- If uncertain, ask for clarification rather than guessing

APPROACH:
- Start by understanding what the user is trying to accomplish
- Search for relevant context before taking action
- Explain what you're doing and why
- If a tool call fails, explain the error and suggest alternatives

TOOL USAGE:
- Use search_services to find services before operating on them
- Always use entity IDs from search results, never construct IDs yourself
- Use get_service_details only after you have a valid service ID from search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Treat this like code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version control it alongside your tools and schemas&lt;/li&gt;
&lt;li&gt;Review changes with the same rigor as code review&lt;/li&gt;
&lt;li&gt;Test prompts in isolation before integrating into the full loop&lt;/li&gt;
&lt;li&gt;Run regression tests when you change anything&lt;/li&gt;
&lt;li&gt;Iterate based on failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt engineering isn't magic. It's specifying behavior in natural language. The same engineering rigor applies. A prompt change can break your agent just as easily as a code change.&lt;/p&gt;
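&lt;p&gt;Even a crude regression check catches accidental deletion of critical prompt sections. A sketch, assuming the system prompt lives in a version-controlled file; the required strings mirror the example prompt above:&lt;/p&gt;

```python
# Sketch: verify critical sections and rules survive prompt edits.
REQUIRED_SECTIONS = ["CONSTRAINTS:", "APPROACH:", "TOOL USAGE:"]
REQUIRED_RULES = [
    "Never modify production systems",
    "never construct IDs yourself",
]

def check_prompt(prompt_text):
    """Return a list of problems; empty list means the prompt passes."""
    problems = []
    for needle in REQUIRED_SECTIONS + REQUIRED_RULES:
        if needle not in prompt_text:
            problems.append(f"missing: {needle!r}")
    return problems
```

&lt;p&gt;Run it in CI next to your unit tests; a deleted constraint then fails the build instead of failing in production.&lt;/p&gt;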

&lt;h2&gt;8. Curate Memory&lt;/h2&gt;

&lt;p&gt;Chat history is a log of what was said. Memory is what the agent knows and can use.&lt;/p&gt;

&lt;p&gt;These are different. Chat history grows linearly with conversation length. It includes irrelevant small talk, failed attempts, and superseded information. Memory should be curated to include only what's useful for future reasoning.&lt;/p&gt;

&lt;p&gt;Agent memory typically has layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working memory:&lt;/strong&gt; The current context. What's happening now. Tool results from this task. Usually just the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory:&lt;/strong&gt; What happened before. Previous tasks, their outcomes, what worked. Stored externally, retrieved when relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic memory:&lt;/strong&gt; What's true about the world. Facts about the system, relationships between entities, organizational knowledge. This is your knowledge graph.&lt;/p&gt;

&lt;p&gt;Most agents only implement working memory (the context window). That's fine for simple tasks. Complex agents need episodic memory to learn from experience and semantic memory to reason about relationships. In multi-agent setups, ensure the orchestrator maintains high-level state while delegating memory needs to specialists or shared stores.&lt;/p&gt;
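&lt;p&gt;The three layers can be made explicit in code. A toy sketch: real systems back episodic and semantic memory with a vector store or knowledge graph, and the word-overlap recall here is a placeholder for embedding search:&lt;/p&gt;

```python
class AgentMemory:
    def __init__(self):
        self.working = []   # current task's context (lives in the window)
        self.episodic = []  # past tasks and outcomes, stored externally
        self.semantic = {}  # durable facts about the world

    def finish_task(self, goal, outcome):
        # Working memory is distilled into an episode, not copied wholesale.
        self.episodic.append({"goal": goal, "outcome": outcome})
        self.working.clear()

    def recall(self, goal):
        # Naive relevance via shared words; real systems use embeddings.
        words = set(goal.split())
        return [e for e in self.episodic
                if words.intersection(e["goal"].split())]
```

&lt;p&gt;The separation matters more than the storage: curation happens at the boundary between layers, not inside the context window.&lt;/p&gt;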

&lt;h2&gt;9. Evaluate Continuously&lt;/h2&gt;

&lt;p&gt;You built an agent. Does it work? How would you know?&lt;/p&gt;

&lt;p&gt;"It seems to work" is not evaluation. Agents are probabilistic. They might work 80% of the time. You need to know that number.&lt;/p&gt;

&lt;p&gt;Evaluation requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A test set:&lt;/strong&gt; Real tasks with known correct outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A metric:&lt;/strong&gt; How you measure success (task completion, accuracy, efficiency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A baseline:&lt;/strong&gt; What you're comparing against&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For simple agents, manual evaluation works. Run 50 tasks, count successes. But this doesn't scale, and humans are inconsistent.&lt;/p&gt;

&lt;p&gt;LLM-as-judge is the emerging pattern: use an LLM to evaluate whether the agent accomplished the goal. This scales, but has biases. The judge tends to favor verbose responses and can miss subtle errors. Combine it with objective metrics: task completion rate, cost per task, latency, human override rate, and average steps-to-completion. In production, track cost-per-successful-task as your north-star metric.&lt;/p&gt;

&lt;p&gt;Evaluate trajectories, not just final outputs. An agent might reach the right answer through a terrible path with 47 tool calls when 3 would do. The path matters for cost and reliability.&lt;/p&gt;
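&lt;p&gt;Trajectory-level metrics are cheap to compute once runs are logged. A sketch over hypothetical run records with &lt;code&gt;success&lt;/code&gt;, &lt;code&gt;steps&lt;/code&gt;, and &lt;code&gt;cost&lt;/code&gt; fields:&lt;/p&gt;

```python
def summarize_runs(runs):
    """Aggregate completion rate, path length, and cost-per-successful-task."""
    successes = [r for r in runs if r["success"]]
    total_cost = sum(r["cost"] for r in runs)
    return {
        "completion_rate": len(successes) / len(runs),
        "avg_steps": sum(r["steps"] for r in runs) / len(runs),
        # Failed runs still cost money, so they count in the numerator.
        "cost_per_success": total_cost / max(len(successes), 1),
    }

runs = [
    {"success": True, "steps": 3, "cost": 0.02},
    {"success": True, "steps": 47, "cost": 0.31},  # right answer, bad path
    {"success": False, "steps": 12, "cost": 0.09},
]
stats = summarize_runs(runs)
```

&lt;p&gt;Note how the second run drags &lt;code&gt;avg_steps&lt;/code&gt; and cost upward even though it "succeeded": that's the trajectory signal a final-output-only metric misses.&lt;/p&gt;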

&lt;p&gt;The critical insight: evaluation isn't something you do once. It's continuous. Every prompt change, every tool change, every model upgrade requires re-evaluation. Agents are systems with many interacting parts. Change one, and you might break another.&lt;/p&gt;

&lt;h2&gt;10. Design for Security&lt;/h2&gt;

&lt;p&gt;Agents that can act in the real world amplify risks. Prompt injection, unauthorized tool use, data leakage, "vibe hacking" (manipulating the agent via clever inputs)... these aren't theoretical. They happen.&lt;/p&gt;

&lt;p&gt;Treat every agent as potentially adversarial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Least-privilege everything.&lt;/strong&gt; Tools should have minimal permissions. Scoped tokens, not admin keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate outputs.&lt;/strong&gt; Sanitize before any external action. Never let raw LLM output hit a database or API without checking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate high-risk actions.&lt;/strong&gt; A policy layer or separate approval agent should review destructive operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log everything.&lt;/strong&gt; Every decision, tool call, and rationale. You'll need this for debugging and compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never handle secrets directly.&lt;/strong&gt; Agents shouldn't see credentials. Use service accounts and secure vaults.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Robust observability (Rule 11) is your best friend here. Comprehensive logs of thoughts, actions, and rationales make post-incident analysis and compliance far easier.&lt;/p&gt;

&lt;p&gt;Over-privileged agents are the top reason enterprise pilots fail. Security isn't an afterthought. It's core architecture.&lt;/p&gt;

&lt;h2&gt;11. Instrument for Observability&lt;/h2&gt;

&lt;p&gt;Agents are black boxes by nature. Without traces, you can't diagnose why a loop failed, which tool choice was wrong, or where hallucinations compounded.&lt;/p&gt;

&lt;p&gt;Implement from day one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full trajectory logging:&lt;/strong&gt; Every observation, thought, action, and result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured traces:&lt;/strong&gt; Timestamps, token usage, error context, parent-child relationships for sub-agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics dashboards:&lt;/strong&gt; Success rate, average steps, cost per task, latency distributions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay capabilities:&lt;/strong&gt; Re-run failed traces against your eval set&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns "it sometimes works" into actionable insights. Change a prompt? Re-run your test trajectories. Many frameworks support this out of the box, but you must enable and use it.&lt;/p&gt;

&lt;h2&gt;12. Optimize for Cost and Latency&lt;/h2&gt;

&lt;p&gt;Agents are inherently slower and more expensive than scripts or single LLM calls. They multiply inferences. In production, these factors often determine viability.&lt;/p&gt;

&lt;p&gt;We &lt;a href="https://sixdegree.ai/blog/mcp-tool-overload" rel="noopener noreferrer"&gt;benchmarked agent performance&lt;/a&gt; across models and tasks. The results were stark: a "smarter" model that costs 10x more per token often isn't 10x better at the task. Sometimes it's worse because it overthinks. We watched one run where token usage grew to double what we expected while simultaneously producing worse results. The model was second-guessing itself into failure.&lt;/p&gt;

&lt;p&gt;Best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use cheaper models for simple steps.&lt;/strong&gt; Routing, summarization, and validation don't need frontier models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reserve expensive models for complex reasoning.&lt;/strong&gt; Know which steps actually benefit from capability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement early stopping.&lt;/strong&gt; If the agent is looping, cut it off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively.&lt;/strong&gt; Common sub-tasks, tool results, embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallelize where possible.&lt;/strong&gt; Independent tool calls should run concurrently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set budgets.&lt;/strong&gt; Token limits per task. Kill runaway agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track efficiency metrics alongside accuracy. A slightly less "smart" but 5x cheaper agent often wins in real deployments.&lt;/p&gt;
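&lt;p&gt;Budgets and early stopping are a few lines wrapped around the agentic loop. A sketch with illustrative limits; &lt;code&gt;step_fn&lt;/code&gt; stands in for one reason/act/observe turn:&lt;/p&gt;

```python
MAX_STEPS = 15        # illustrative limits; tune per task class
MAX_TOKENS = 50_000

def run_with_budget(step_fn, goal):
    """Run the loop, but kill runaway agents on step or token budget."""
    tokens_used, history = 0, []
    for step in range(MAX_STEPS):
        result = step_fn(goal, history)  # one reason/act/observe turn
        tokens_used += result["tokens"]
        history.append(result)
        if result["done"]:
            return {"status": "done", "steps": step + 1}
        if tokens_used > MAX_TOKENS:
            return {"status": "killed", "reason": "token budget exceeded"}
    return {"status": "killed", "reason": "step budget exceeded"}
```

&lt;p&gt;A killed run is a data point, not a failure mode to hide: log it and feed it back into your eval set.&lt;/p&gt;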

&lt;h2&gt;When to Break the Rules&lt;/h2&gt;

&lt;p&gt;Agents are not always the answer. They're slow (multiple LLM calls). They're expensive (tokens add up fast). They're unpredictable (hallucination, wrong tool choices).&lt;/p&gt;

&lt;p&gt;Use an agent when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task requires multiple steps that depend on each other&lt;/li&gt;
&lt;li&gt;The path to the goal isn't known in advance&lt;/li&gt;
&lt;li&gt;Human-like reasoning adds value&lt;/li&gt;
&lt;li&gt;Exploration and adaptation matter more than speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't use an agent when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A deterministic script would work&lt;/li&gt;
&lt;li&gt;The task is a single question-answer&lt;/li&gt;
&lt;li&gt;Latency is critical (sub-second responses)&lt;/li&gt;
&lt;li&gt;Cost must stay low (agents compound expense quickly)&lt;/li&gt;
&lt;li&gt;The cost of failure is high and you can't verify correctness&lt;/li&gt;
&lt;li&gt;You need guaranteed reproducibility&lt;/li&gt;
&lt;li&gt;You lack foundational observability, governance, and security controls (build those first)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A well-designed API call beats an agent for predictable tasks. A simple chain (LLM &amp;gt; tool &amp;gt; LLM) beats a full agent when the path is mostly known. An agent beats both when you genuinely don't know what you need until you start exploring.&lt;/p&gt;




&lt;p&gt;These twelve rules aren't exciting. They're not the cool demos you see on Twitter. But they're what separates agents that work from agents that almost work.&lt;/p&gt;

&lt;p&gt;Master the loop. Respect context. Secure your tools. Plan for hallucination. Treat prompts as code. Curate memory. Evaluate continuously. Instrument everything. Watch your costs.&lt;/p&gt;

&lt;p&gt;Start simple, instrument everything, and iterate ruthlessly. The agents that deliver real value are the ones built with engineering discipline, not just clever prompts.&lt;/p&gt;




&lt;p&gt;Building agents? We're working with teams bringing AI agents into their infrastructure. Let's talk about how to &lt;a href="https://sixdegree.ai" rel="noopener noreferrer"&gt;bring them real-time, structured, and reliable context.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>We Gave LLMs 150 Tools: Here's What Broke.</title>
      <dc:creator>Craig Tracey</dc:creator>
      <pubDate>Thu, 26 Mar 2026 16:43:58 +0000</pubDate>
      <link>https://dev.to/craigtracey/we-gave-llms-150-tools-heres-what-broke-2jaj</link>
      <guid>https://dev.to/craigtracey/we-gave-llms-150-tools-heres-what-broke-2jaj</guid>
      <description>&lt;p&gt;There's a hypothesis that most people building AI agents have encountered but few have measured: &lt;strong&gt;the more tools you give an LLM, the worse it gets at picking the right one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's intuitive. Connect a few MCP servers to your agent, and suddenly it's choosing from 60, 80, 100+ tools. GitHub tools, GitLab tools, Kubernetes, Slack, Jira, PagerDuty, Terraform, Grafana, all loaded into the context window, all the time. The model has to read every tool definition, understand the distinctions between them, and pick the right one. That's a lot of signal to sift through.&lt;/p&gt;

&lt;p&gt;But intuition isn't data. So we built &lt;a href="https://sixdegree.ai/labs/boundary" rel="noopener noreferrer"&gt;Boundary&lt;/a&gt;, an open-source framework for finding where LLM context breaks, and ran the numbers.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;We assembled 150 tool definitions based on real schemas from production agent systems across 16 services: GitHub, GitLab, Jira, Confluence, Kubernetes, AWS, Datadog, Slack, PagerDuty, Okta, Snyk, Grafana, Terraform Cloud, Docker, Linear, and Notion. The tools are synthetic (no-op for benchmarking) but the schemas, parameter structures, and descriptions mirror what you'd find in a production MCP environment.&lt;/p&gt;

&lt;p&gt;We tested six models across three providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt; and &lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt; (Anthropic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt; and &lt;strong&gt;GPT-5.4 Mini&lt;/strong&gt; (OpenAI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grok 4&lt;/strong&gt; and &lt;strong&gt;Grok 4.1 Fast Reasoning&lt;/strong&gt; (xAI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each model received 60 prompts (both direct requests and ambiguous ones) at five toolset sizes: 25, 50, 75, 100, and 150 tools. At each size, the available tools were randomly selected but always included the correct one. The question: does the model pick the right tool?&lt;/p&gt;

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;Every model that completed the test degraded. Two didn't finish at all.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;25 tools&lt;/th&gt;
&lt;th&gt;50 tools&lt;/th&gt;
&lt;th&gt;75 tools&lt;/th&gt;
&lt;th&gt;100 tools&lt;/th&gt;
&lt;th&gt;150 tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;86.7%&lt;/td&gt;
&lt;td&gt;83.3%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;83.3%&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;85.0%&lt;/td&gt;
&lt;td&gt;85.0%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;83.3%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;failed&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;td&gt;78.3%&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;failed&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;78.3%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;78.3%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;71.7%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;78.3%&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;73.3%&lt;/td&gt;
&lt;td&gt;76.7%&lt;/td&gt;
&lt;td&gt;75.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F138v8b8ggq7lr0f8lklo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F138v8b8ggq7lr0f8lklo.png" alt="Accuracy vs toolset size across 6 LLMs" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.4 Mini was the most surprising result.&lt;/strong&gt; At 85% accuracy through 50 tools, 92% on ambiguous prompts, sub-1-second latency, and $0.002 per call, it was arguably the best overall performer for small-to-medium toolsets. Then it hit the same 128-tool wall as GPT-4o and failed completely at 150.&lt;/p&gt;

&lt;p&gt;Grok 4.1 Fast Reasoning was the only model that combined top-tier accuracy with the ability to handle 150 tools. It degraded steadily from 86.7% to 76.7%, but it never broke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both OpenAI models failed at 150 tools.&lt;/strong&gt; OpenAI's API has a hard limit of 128 tools per request. This isn't a degradation curve. It's a wall. If your agent connects enough MCP servers to exceed 128 tools, no OpenAI model works.&lt;/p&gt;
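&lt;p&gt;If your agent's toolset can grow dynamically, it's worth failing fast in your own code rather than discovering the limit as an API error mid-run. A minimal guard, as an illustrative sketch (the constant reflects the 128-tool limit we hit):&lt;/p&gt;

```python
OPENAI_MAX_TOOLS = 128  # hard per-request limit on the OpenAI API

def check_toolset(tools, provider):
    """Fail fast before the request instead of letting the API reject it."""
    if provider == "openai" and len(tools) > OPENAI_MAX_TOOLS:
        raise ValueError(
            "%d tools exceeds OpenAI's %d-tool limit; scope the toolset "
            "down before making the request" % (len(tools), OPENAI_MAX_TOOLS)
        )
    return tools
```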

&lt;p&gt;Claude Sonnet 4.6, the most expensive model in the test ($0.028/call), was the least accurate at 25 tools and never recovered. Claude Haiku outperformed it at every size while costing 3x less.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-service confusion scales with tools
&lt;/h2&gt;

&lt;p&gt;Cross-service confusion, where a model picks a tool from the wrong service entirely, was the most dangerous failure mode.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;25 tools&lt;/th&gt;
&lt;th&gt;50 tools&lt;/th&gt;
&lt;th&gt;75 tools&lt;/th&gt;
&lt;th&gt;100 tools&lt;/th&gt;
&lt;th&gt;150 tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Grok 4 had cross-service errors even at 25 tools. Claude Haiku was clean through 50 tools but escalated to 4 errors at 150, the worst of any model at that size.&lt;/p&gt;

&lt;p&gt;The most common cross-service confusions across all models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datadog vs Grafana&lt;/strong&gt;: "Check the monitoring alerts" consistently routed to the wrong observability platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notion vs Confluence&lt;/strong&gt;: "Search for documentation" split between the two&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear vs Jira&lt;/strong&gt;: "Add a comment to the tracking issue" picked the wrong project tracker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub vs GitLab&lt;/strong&gt;: "Show me the open issues" confused the two at higher tool counts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Direct vs. ambiguous prompts
&lt;/h2&gt;

&lt;p&gt;A "direct" prompt names the service: &lt;em&gt;"List all Terraform Cloud workspaces."&lt;/em&gt; An "ambiguous" prompt doesn't: &lt;em&gt;"Add a comment saying 'Resolved' to the tracking issue."&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;25t (ambig)&lt;/th&gt;
&lt;th&gt;50t (ambig)&lt;/th&gt;
&lt;th&gt;75t (ambig)&lt;/th&gt;
&lt;th&gt;100t (ambig)&lt;/th&gt;
&lt;th&gt;150t (ambig)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GPT-5.4 Mini led on ambiguous prompts, scoring 92% at 25, 50, and 100 tools (with a dip to 67% at 75). It handled disambiguation better than any other model. GPT-4o collapsed to 58% at 100 tools. Grok 4 hit 50%, a coin flip.&lt;/p&gt;

&lt;p&gt;Claude Sonnet was the most stable, staying between 75% and 83% regardless of toolset size. Consistent, but never great.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where models get confused
&lt;/h2&gt;

&lt;p&gt;The errors tell a story. Some patterns appeared across all six models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform is hard.&lt;/strong&gt; All models consistently confused &lt;code&gt;terraform_create_run&lt;/code&gt; with &lt;code&gt;terraform_list_workspaces&lt;/code&gt;, and &lt;code&gt;terraform_lock_workspace&lt;/code&gt; with &lt;code&gt;terraform_get_workspace&lt;/code&gt;. The tool names are semantically close, and the models default to "list" or "get" operations when the toolset is crowded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snyk is a trap.&lt;/strong&gt; &lt;code&gt;snyk_get_remediation&lt;/code&gt;, &lt;code&gt;snyk_list_container_projects&lt;/code&gt;, and &lt;code&gt;snyk_list_projects&lt;/code&gt; all got misrouted to &lt;code&gt;snyk_list_organizations&lt;/code&gt;. When Snyk tools are buried among 100+ others, the models default to the most generic-sounding option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluence updates fail.&lt;/strong&gt; All models picked &lt;code&gt;confluence_search&lt;/code&gt; when asked to update a page. The prompt said &lt;em&gt;"Update the runbook page"&lt;/em&gt;, but with 75+ tools in context, the model reached for search instead of the update operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring platform confusion.&lt;/strong&gt; Datadog and Grafana both have alerting, dashboards, and metrics tools. The prompt &lt;em&gt;"Check the monitoring alerts for the API server"&lt;/em&gt; got routed to Grafana instead of Datadog by every model at some toolset size. Adding two similar services to the toolset creates permanent ambiguity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The latency story
&lt;/h2&gt;

&lt;p&gt;Accuracy isn't the only cost.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;25 tools&lt;/th&gt;
&lt;th&gt;50 tools&lt;/th&gt;
&lt;th&gt;75 tools&lt;/th&gt;
&lt;th&gt;100 tools&lt;/th&gt;
&lt;th&gt;150 tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;739ms&lt;/td&gt;
&lt;td&gt;754ms&lt;/td&gt;
&lt;td&gt;849ms&lt;/td&gt;
&lt;td&gt;976ms&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;1,170ms&lt;/td&gt;
&lt;td&gt;4,035ms&lt;/td&gt;
&lt;td&gt;6,213ms&lt;/td&gt;
&lt;td&gt;7,657ms&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;2,463ms&lt;/td&gt;
&lt;td&gt;6,157ms&lt;/td&gt;
&lt;td&gt;8,765ms&lt;/td&gt;
&lt;td&gt;11,473ms&lt;/td&gt;
&lt;td&gt;16,749ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;4,728ms&lt;/td&gt;
&lt;td&gt;10,308ms&lt;/td&gt;
&lt;td&gt;14,579ms&lt;/td&gt;
&lt;td&gt;19,120ms&lt;/td&gt;
&lt;td&gt;27,935ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;6,448ms&lt;/td&gt;
&lt;td&gt;7,042ms&lt;/td&gt;
&lt;td&gt;6,930ms&lt;/td&gt;
&lt;td&gt;7,349ms&lt;/td&gt;
&lt;td&gt;7,533ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;7,706ms&lt;/td&gt;
&lt;td&gt;7,945ms&lt;/td&gt;
&lt;td&gt;8,133ms&lt;/td&gt;
&lt;td&gt;8,418ms&lt;/td&gt;
&lt;td&gt;9,552ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh24ktfpsvldwtmirco9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh24ktfpsvldwtmirco9.png" alt="Latency vs toolset size" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPT-5.4 Mini was the latency champion: sub-1-second at every toolset size it completed. The Anthropic models scaled linearly, with Sonnet reaching 28 seconds at 150 tools. The xAI models barely changed, staying in the 6-10 second range regardless of tool count.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means
&lt;/h2&gt;

&lt;p&gt;The pattern is consistent across six models from three providers: &lt;strong&gt;more tools means worse accuracy&lt;/strong&gt;, and the degradation starts between 25 and 50 tools.&lt;/p&gt;

&lt;p&gt;The implications for anyone building agents with MCP:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't load everything.&lt;/strong&gt; If your agent has access to 10+ services, that's easily 80-150 tools. Loading them all upfront is a measurable tax on accuracy, starting at 25 tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI has a hard wall at 128 tools.&lt;/strong&gt; Both GPT-4o and GPT-5.4 Mini failed at 150. This isn't a model quality issue. It's a platform constraint. If your agent might exceed 128 tools, OpenAI models are not an option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ambiguous prompts are the danger zone.&lt;/strong&gt; Grok 4 hit 50% accuracy on ambiguous prompts at 100 tools. GPT-4o dropped to 58%. When users don't name the service explicitly, the model has to disambiguate, and every added tool makes that disambiguation harder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Similar services compound the problem.&lt;/strong&gt; Datadog and Grafana. Notion and Confluence. Linear and Jira. GitHub and GitLab. Every pair of similar services in the toolset creates a permanent source of confusion that scales with tool count.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency compounds.&lt;/strong&gt; Even if accuracy were flat, the latency cost matters. Claude Sonnet at 28 seconds per call is unusable for interactive workloads. GPT-5.4 Mini at sub-1-second is a different product entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Price does not predict performance.&lt;/strong&gt; Claude Sonnet 4.6 costs 28x more per call than Grok 4.1 Fast and is less accurate. Claude Haiku outperforms Claude Sonnet at 3x lower cost. The most expensive model lost.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The cost equation
&lt;/h2&gt;

&lt;p&gt;What you pay per call versus what you get in accuracy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Total cost&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Cost/call&lt;/th&gt;
&lt;th&gt;Best accuracy&lt;/th&gt;
&lt;th&gt;Worst accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;$0.31&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;$0.0010&lt;/td&gt;
&lt;td&gt;86.7% (25t)&lt;/td&gt;
&lt;td&gt;76.7% (150t)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;240*&lt;/td&gt;
&lt;td&gt;$0.0021&lt;/td&gt;
&lt;td&gt;85.0% (25t)&lt;/td&gt;
&lt;td&gt;failed (150t)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$1.57&lt;/td&gt;
&lt;td&gt;240*&lt;/td&gt;
&lt;td&gt;$0.0065&lt;/td&gt;
&lt;td&gt;81.7% (25t)&lt;/td&gt;
&lt;td&gt;failed (150t)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$2.83&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;$0.0094&lt;/td&gt;
&lt;td&gt;81.7% (25t)&lt;/td&gt;
&lt;td&gt;76.7% (150t)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4&lt;/td&gt;
&lt;td&gt;$3.85&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;$0.013&lt;/td&gt;
&lt;td&gt;80.0% (25t)&lt;/td&gt;
&lt;td&gt;71.7% (100t)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$8.51&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;$0.028&lt;/td&gt;
&lt;td&gt;78.3% (25t)&lt;/td&gt;
&lt;td&gt;73.3% (50t)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*OpenAI models completed 240 of 300 calls. All calls at 150 tools failed due to the 128-tool API limit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb06v67y5kz75f4a5lp84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb06v67y5kz75f4a5lp84.png" alt="Cost vs accuracy tradeoff" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two cheapest models (Grok 4.1 Fast at $0.001/call and GPT-5.4 Mini at $0.002/call) were also the two most accurate. The most expensive model (Claude Sonnet at $0.028/call) was the least accurate. The correlation between price and tool-calling performance is not just weak. It's inverted.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of tradeoff Boundary is designed to surface. Without benchmark data, you'd likely pick Claude Sonnet or GPT-4o. The data says they're among the worst choices for tool-calling workloads. A team running fewer than 128 tools should seriously consider GPT-5.4 Mini for its combination of accuracy, speed, and cost. A team that might exceed 128 needs Grok 4.1 Fast or an Anthropic model.&lt;/p&gt;

&lt;p&gt;Running these benchmarks costs almost nothing. This entire run across six models cost $17. That's less than a single hour of engineer time debugging a misrouted tool call in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this shaped our architecture
&lt;/h2&gt;

&lt;p&gt;This data isn't theoretical for us. It directly informed how we built &lt;a href="https://sixdegree.ai/blog/progressive-disclosure" rel="noopener noreferrer"&gt;progressive disclosure&lt;/a&gt; in SixDegree.&lt;/p&gt;

&lt;p&gt;The core insight: if accuracy degrades between 25 and 50 tools, then the goal isn't to find a smarter model. It's to never present more than 25 tools in the first place. Not by hardcoding a curated list, but by letting the agent's context determine which tools are relevant at each step.&lt;/p&gt;

&lt;p&gt;In SixDegree, when an agent queries the ontology and discovers a GitHub repository, only the GitHub tools become available. When a Kubernetes deployment surfaces through a relationship, the Kubernetes tools appear. The agent never sees all 150 tools at once because it never needs to. The toolset at any given turn is scoped to the entities the agent has actually encountered.&lt;/p&gt;
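&lt;p&gt;The mechanism can be sketched in a few lines. This is an illustrative toy, not SixDegree's real implementation (all names here are invented): tools are grouped by service, and a service's tools become visible only after one of its entities has surfaced:&lt;/p&gt;

```python
class ProgressiveToolbox:
    """Expose only tools for services the agent has actually encountered."""

    def __init__(self, tools_by_service):
        self.tools_by_service = tools_by_service  # e.g. {"github": [...], ...}
        self.active = set()

    def discover(self, service):
        """Called when an entity from `service` surfaces in the ontology."""
        self.active.add(service)

    def visible_tools(self):
        """The toolset for the current turn: scoped, not everything."""
        return [tool
                for service in sorted(self.active)
                for tool in self.tools_by_service.get(service, [])]
```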

&lt;p&gt;The Boundary data validates this approach quantitatively. At 25 tools (roughly the size of two or three services' worth of tools), accuracy is in the mid-to-high 80s. That's the operating range progressive disclosure keeps you in, regardless of how many total services are connected. You can have 16 integrations and 150 tools installed, and the agent still only sees the 10-20 that matter for the current conversation.&lt;/p&gt;

&lt;p&gt;The alternative, loading everything and hoping the model figures it out, costs you 5-10 percentage points of accuracy, up to 28x the latency, and for OpenAI models, a hard failure at 128 tools. Progressive disclosure isn't a nice-to-have. It's a requirement for agents that work at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and what we'd like to improve
&lt;/h2&gt;

&lt;p&gt;This benchmark is a starting point, not a definitive answer. There are real limitations to what it measures and how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-turn only.&lt;/strong&gt; Each prompt gets one shot at picking a tool. Real agents chain tool calls, use results from previous calls to inform the next one, and recover from mistakes. A model that picks the wrong tool on the first try might self-correct on a second turn. This benchmark doesn't capture that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random tool subsets.&lt;/strong&gt; At each toolset size, the available tools are randomly selected (with the correct one always included). In production, the tools in context aren't random. They're usually grouped by service or use case. Random selection may overstate or understate confusion depending on which tools end up adjacent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No parameter validation.&lt;/strong&gt; We check whether the model picked the right tool, but not whether it filled in the parameters correctly. A model that picks &lt;code&gt;github_create_issue&lt;/code&gt; but hallucinates the owner field is still counted as correct. Parameter accuracy is a whole separate dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt quality varies.&lt;/strong&gt; Some of the ambiguous prompts have debatable expected answers. "Check the monitoring alerts" could reasonably map to either Datadog or Grafana depending on the organization. We picked one, but reasonable people would disagree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single trial.&lt;/strong&gt; Each prompt runs once per toolset size. With 60 prompts per size, the results are directional but individual percentage points could shift with more trials.&lt;/p&gt;

&lt;p&gt;We'd like to add multi-turn evaluation, parameter accuracy checking, configurable prompt difficulty levels, and more models. If you have ideas for how to make this benchmark better, if you disagree with our methodology, or if you've run Boundary against a model we haven't tested yet, &lt;a href="https://github.com/sixdegree-ai/boundary/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; or submit a PR. This is an open source project and we want the community to help shape it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://sixdegree.ai/labs/boundary/results/index.html" rel="noopener noreferrer"&gt;full interactive results&lt;/a&gt; from this run are available on our site. The framework is open source. &lt;a href="https://github.com/sixdegree-ai/boundary" rel="noopener noreferrer"&gt;Run it yourself&lt;/a&gt; and see how your preferred models handle tool overload.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://sixdegree.ai/labs/boundary" rel="noopener noreferrer"&gt;Boundary&lt;/a&gt; is an open-source framework for finding where LLM context breaks. &lt;a href="https://sixdegree.ai/platform" rel="noopener noreferrer"&gt;See how SixDegree solves tool overload&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>benchmarks</category>
      <category>discuss</category>
      <category>ai</category>
    </item>
    <item>
      <title>First Principles of AI Context</title>
      <dc:creator>Craig Tracey</dc:creator>
      <pubDate>Sun, 15 Mar 2026 01:49:01 +0000</pubDate>
      <link>https://dev.to/craigtracey/first-principles-of-ai-context-4cmn</link>
      <guid>https://dev.to/craigtracey/first-principles-of-ai-context-4cmn</guid>
      <description>&lt;p&gt;Every few weeks someone publishes a benchmark showing that the latest model is smarter, faster, more capable. Context windows are getting massive. A million tokens, two million, more on the horizon. And that’s genuinely impressive.&lt;/p&gt;

&lt;p&gt;But it raises a question nobody seems to be asking: what are we filling those windows with?&lt;/p&gt;

&lt;p&gt;Right now, the answer is mostly everything. Dump in the docs. Stuff in the chat history. Append the tool definitions. Hope the model figures out what matters.&lt;/p&gt;

&lt;p&gt;Bigger windows don’t solve the context problem. They just give you more room to be wrong. A million tokens of unfocused, unstructured context isn’t better than ten thousand tokens of the right context. It’s worse, because the model has to work harder to find the signal in the noise, and you’re paying for every token of that noise.&lt;/p&gt;

&lt;p&gt;I’ve spent the last year building agent infrastructure, and I keep landing on the same conclusion: the bottleneck isn’t the model and it isn’t the window size. It’s the quality and structure of what goes into the window. Until we treat context as an engineering problem, not just a capacity problem, we’re going to keep building impressive demos that fall apart in production.&lt;/p&gt;

&lt;p&gt;Here are the first principles I keep coming back to.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Context Exists. The Relations Don't.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There’s a reason AI coding tools are so far ahead of everything else. Code has explicit structure: dependencies, type systems, call graphs. The model can follow the relationships. It can reason about how things connect.&lt;/p&gt;

&lt;p&gt;Now think about everything else we’re trying to point AI at. Your operations. Your organization. Your business processes. There’s no relationship graph. No map connecting a customer complaint to the team responsible to the system that caused it.&lt;/p&gt;

&lt;p&gt;Without structure, the model guesses. A bigger window just means it has more room to guess in.&lt;/p&gt;

&lt;p&gt;The structure already exists inside your systems. Before you can get real value from AI, you need to connect it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Semantics are probability, not truth.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the thing that’s easy to forget when a model gives you a confident, well-formatted answer: it doesn’t &lt;em&gt;know&lt;/em&gt; anything. It’s predicting the most likely next token. When you ask it to interpret your data, it’s giving you the most probable interpretation, not necessarily the correct one.&lt;/p&gt;

&lt;p&gt;That distinction doesn’t matter much when you’re generating a summary or drafting an email. It matters enormously when an agent is deciding which team to page at 3am, or which customer account is affected by an outage, or whether a support ticket is related to a known incident.&lt;/p&gt;

&lt;p&gt;You can see this play out in real time with tool calls. An agent without enough context doesn’t just pick the wrong tool. It tries one, fails, tries another, fails again, and loops. It’s not being stupid. It’s doing exactly what you’d expect from a system that’s navigating by probability without a map. It doesn’t have the connective tissue to know that &lt;em&gt;this&lt;/em&gt; entity means &lt;em&gt;that&lt;/em&gt; tool, so it guesses, checks the result, and guesses again. It’s brute-forcing a path through a graph it can’t see.&lt;/p&gt;

&lt;p&gt;Probability is useful. But decisions need ground truth. And ground truth comes from structure: explicit relationships that say &lt;em&gt;this&lt;/em&gt; is connected to &lt;em&gt;that&lt;/em&gt;, defined by rules, not inferred by a model.&lt;/p&gt;

&lt;p&gt;The more we rely on agents to take real action, the less we can afford to let them operate on vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Facts without relationships are a dead end.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;RAG was supposed to solve the context problem. Ground the model in your data. Retrieve relevant chunks. It works for question answering.&lt;/p&gt;

&lt;p&gt;And even that takes a surprising amount of effort. Chunking strategies, embedding model selection, reranking, relevance tuning, keeping the index fresh as your data changes. RAG pipelines are deceptively expensive to build well and even harder to maintain. That’s a lot of investment for a system that tops out at retrieval.&lt;/p&gt;

&lt;p&gt;And when teams hit the ceiling of what vanilla RAG could do, where did they turn to improve it? You guessed it. Graphs. GraphRAG exists because people kept running into the same wall: retrieval without relationships isn’t enough.&lt;/p&gt;

&lt;p&gt;But the moment you want an agent to &lt;em&gt;do&lt;/em&gt; something, retrieval isn’t enough. Knowing “there was an incident last Tuesday” is a fact. Knowing that the incident affected three customers, was caused by a change made by a specific team, and is related to two open support tickets? That’s a graph. That’s the difference between an agent that can answer questions and one that can actually reason about what to do next.&lt;/p&gt;

&lt;p&gt;We keep trying to solve a graph problem with a search engine. Vector similarity tells you what’s textually related. It can’t tell you what’s causally connected, what depends on what, or what breaks if something changes. And because similarity is probabilistic, it’ll happily surface content that &lt;em&gt;looks&lt;/em&gt; related but isn’t, with no way to tell the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Context has to discover itself.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s where it gets hard. You can’t manually build and maintain a map of how everything in your world connects. But look at what we’re doing today to try.&lt;/p&gt;

&lt;p&gt;We write longer prompts. We craft system instructions. We maintain AGENTS.md and CLAUDE.md files. We build onboarding documents that try to explain our world to the model in prose. We hand-author tool descriptions and few-shot examples. We create elaborate prompt chains that try to steer the model toward the right context at the right time.&lt;/p&gt;

&lt;p&gt;All of these are manual. All of them go stale. And all of them are fundamentally trying to solve the same problem: teaching the model what it should already be able to see.&lt;/p&gt;

&lt;p&gt;And here’s the kicker. What are we writing all of this context in? Natural language. Prose. The very thing we just established is interpreted probabilistically, not precisely. We’re using semantics to provide context to a system that processes semantics as probability. We’re bootstrapping truth from a medium that doesn’t guarantee it.&lt;/p&gt;

&lt;p&gt;It works at small scale. When you have five tools and one domain, you can write enough context by hand to get by. But it breaks the moment your environment grows. More tools, more systems, more relationships, more change. The environment shifts faster than any manual process can keep up with.&lt;/p&gt;

&lt;p&gt;The only context that stays accurate is context that builds itself, continuously, from the systems that are already running. The relationships already exist inside your tools and platforms. They’re just not structured in a way that AI can use.&lt;/p&gt;

&lt;p&gt;The job isn’t data entry. The job is discovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Structure needs rules, not just data.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This one took me a while to internalize. You can ingest every piece of data from every system you touch and still have nothing useful. Data without interpretation is noise, and a model will happily interpret that noise for you. Confidently, probabilistically, and sometimes wrong.&lt;/p&gt;

&lt;p&gt;Structure emerges from rules. A project &lt;em&gt;is owned by&lt;/em&gt; a team. A customer &lt;em&gt;is served by&lt;/em&gt; a product. An alert &lt;em&gt;relates to&lt;/em&gt; an incident. These aren’t things you discover statistically. They’re things you define. And once defined, they make relationships queryable, composable, and trustworthy. Not probable. True.&lt;/p&gt;

&lt;p&gt;Without rules, you have data. With rules, you have structure an agent can trust.&lt;/p&gt;
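&lt;p&gt;A sketch of what that looks like in practice (all names are illustrative): the schema declares which relationships are allowed, and lookups become deterministic queries rather than probabilistic guesses:&lt;/p&gt;

```python
# Relationship types are declared up front; the graph refuses anything else.
SCHEMA = {
    ("project", "owned_by", "team"),
    ("customer", "served_by", "product"),
    ("alert", "relates_to", "incident"),
}

class RuleGraph:
    def __init__(self, schema):
        self.schema = schema
        self.edges = []

    def relate(self, src_type, src, rel, dst_type, dst):
        """Admit an edge only if a rule sanctions this relationship."""
        if (src_type, rel, dst_type) not in self.schema:
            raise ValueError("no rule permits %s -%s- %s" % (src_type, rel, dst_type))
        self.edges.append((src, rel, dst))

    def query(self, src, rel):
        """Deterministic lookup: no model, no similarity score."""
        return [dst for s, r, dst in self.edges if s == src and r == rel]
```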

&lt;h2&gt;
  
  
  &lt;strong&gt;Agents need context before tools.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MCP gave agents a standard way to call tools. That was a genuine breakthrough. But tools without context are blind.&lt;/p&gt;

&lt;p&gt;Think about how an agent actually decides which tool to call. It reads the tool’s name and description and picks the one that seems most relevant. Semantics again. The entire tool selection process is probabilistic. The agent isn’t matching against a schema or following a rule. It’s making its best guess.&lt;/p&gt;

&lt;p&gt;Give an agent access to hundreds of tools and watch what happens. It picks the wrong ones. It hallucinates capabilities. It takes action without understanding what it’s acting on. And every one of those irrelevant tool definitions is eating up your context window, crowding out the information the agent actually needs. Each failed tool call burns tokens, adds latency, and pushes useful context further out of reach.&lt;/p&gt;

&lt;p&gt;The fix isn’t better prompting. The fix is context first, tools second. The agent needs to understand what’s relevant to the current task before it gets access to the tools that apply.&lt;/p&gt;

&lt;p&gt;This is the order of operations that most agent architectures get backwards.&lt;/p&gt;
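&lt;p&gt;One way to picture the right order, as a hypothetical sketch (&lt;code&gt;resolve_entities&lt;/code&gt; and the tool registry shape are invented for illustration):&lt;/p&gt;

```python
def run_agent_step(task, resolve_entities, all_tools, call_model):
    """Context first, tools second: work out what the task is about,
    then hand the model only the tools for those services."""
    entities = resolve_entities(task)                   # 1. context
    services = {e["service"] for e in entities}
    tools = [t for t in all_tools if t["service"] in services]  # 2. scope
    return call_model(task, entities, tools)            # 3. act
```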




&lt;h2&gt;
  
  
  &lt;strong&gt;Why this matters now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re about to get 10 million token context windows. The temptation will be to treat that as a solution. Just throw everything in and let the model sort it out.&lt;/p&gt;

&lt;p&gt;That won’t work. It’ll just be expensive, slow, and probabilistically wrong in ways that are hard to debug. The context problem isn’t about capacity. It’s about knowing what matters, how things connect, and what’s relevant right now. With certainty, not just likelihood.&lt;/p&gt;

&lt;p&gt;MCP is taking off. Agent frameworks are proliferating. Everyone is building tool integrations. But almost nobody is building the context layer underneath: the thing that decides what goes into the window and why.&lt;/p&gt;

&lt;p&gt;That’s the gap. And it’s the gap that will determine whether AI agents become genuinely useful or remain expensive toys that work great in demos.&lt;/p&gt;

&lt;p&gt;I started this newsletter because I think the people building in this space need a place to think through these problems together. Not hype. Not product announcements. Just the hard, specific questions that come with making AI systems work for real.&lt;/p&gt;




&lt;p&gt;This is the problem I'm building toward solving with &lt;a href="https://sixdegree.ai" rel="noopener noreferrer"&gt;sixdegree.ai&lt;/a&gt;. More on that soon - and more on the specific patterns that actually work in production.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
