Kiran Manne

Originally published at claudecertifiedarchitect.dev

Designing tools so an LLM actually calls them correctly: 5 patterns from the CCA-F blueprint

The first time I shipped a tool-using agent, it kept calling the wrong tool. Not occasionally — often enough that I couldn't ship it. The tools worked. The schemas were valid. The descriptions were technically accurate. The model still picked the wrong one, over and over.

What I learned, painfully, is that designing a tool for an LLM caller is closer to writing technical documentation than writing a function signature. The model is reading your tool definition as part of its decision about whether and how to invoke it. If the description is ambiguous, the model is going to guess — and you don't want that.

This post walks through five patterns I now use by default, all of which map directly to material in the Tool Design & MCP domain of the Claude Certified Architect Foundations exam (which is, not coincidentally, the domain where a lot of strong backend engineers lose points).

Pattern 1: Treat the tool description as a prompt, not a docstring

The most common anti-pattern: writing the tool description the way you'd write a Python docstring.

{
  "name": "get_user",
  "description": "Returns the user object for a given user ID.",
  "input_schema": {
    "type": "object",
    "properties": {
      "user_id": { "type": "string" }
    },
    "required": ["user_id"]
  }
}

That description tells the model what the tool does. It does not tell the model when to call it, when not to call it, or what "user object" actually contains. So the model has to guess.

The rewrite:

{
  "name": "get_user",
  "description": "Fetch a user's profile (name, email, plan tier, signup date) by their internal user ID. Use this when the user is asking about a specific account and you have a user ID. Do NOT use this for email-based lookups — use lookup_user_by_email instead. Returns null if the user does not exist; does not raise.",
  "input_schema": { ... }
}

The rewrite answers four questions the model would otherwise guess at: what's in the response, when to use it, when not to use it, and what happens on the unhappy path. That's the bar.

Heuristic: if your tool description doesn't contain the word "when", it's probably not finished.
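
To make that heuristic mechanical, here is a minimal sketch of a description linter. The function name lint_tool_description and the second check (looking for a "do not" style instruction) are my own assumptions for illustration, not part of any SDK.

```python
import re

def lint_tool_description(tool: dict) -> list[str]:
    """Return warnings for a tool definition (hypothetical checks, not an SDK API)."""
    description = tool.get("description", "")
    warnings = []
    # Heuristic from the text: a finished description says *when* to call the tool.
    if not re.search(r"\bwhen\b", description, re.IGNORECASE):
        warnings.append("description never says when to use the tool")
    # Assumed extra check: look for a negative instruction that rules out misuse.
    if not re.search(r"\b(do not|don't)\b", description, re.IGNORECASE):
        warnings.append("description never says when NOT to use the tool")
    return warnings

docstring_style = {
    "name": "get_user",
    "description": "Returns the user object for a given user ID.",
}
prompt_style = {
    "name": "get_user",
    "description": (
        "Fetch a user's profile (name, email, plan tier, signup date) by their "
        "internal user ID. Use this when the user is asking about a specific "
        "account and you have a user ID. Do NOT use this for email-based lookups; "
        "use lookup_user_by_email instead. Returns null if the user does not exist."
    ),
}

print(lint_tool_description(docstring_style))  # two warnings
print(lint_tool_description(prompt_style))     # no warnings
```

A check like this catches only the crudest omissions, so treat it as a CI nudge rather than a substitute for reading the description the way the model would.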

Pattern 2: Tighten the schema until ambiguity is impossible

LLMs are excellent at producing plausible-looking arguments. They are less excellent at producing correct arguments when the schema admits ambiguity.

A loose schema:

{
  "properties": {
    "status": { "type": "string" },
    "priority": { "type": "string" },
    "date": { "type": "string" }
  }
}

The tightened version:

{
  "properties": {
    "status": {
      "type": "string",
      "enum": ["open", "in_progress", "closed"],
      "description": "Current status of the ticket."
    },
    "priority": {
      "type": "integer",
      "minimum": 1,
      "maximum": 4,
      "description": "1 = urgent, 4 = low. Default to 3 if unspecified."
    },
    "date": {
      "type": "string",
      "format": "date",
      "description": "ISO 8601 date (YYYY-MM-DD), in UTC."
    }
  }
}

Three changes, each of which removes a class of failure:

  1. status is now an enum. The model can't invent "pending" or "new".
  2. priority has a documented scale and a documented default. The model isn't guessing whether 1 or 4 is higher priority.
  3. date has a format and a timezone convention. The model isn't deciding whether to write "tomorrow" or "2026-05-30".

Every time you let the model improvise on a string field, you pay for that improvisation in failed calls, retries, and bug reports.
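
A tight schema helps the model, but it is still worth checking the arguments on the way in, since many JSON Schema validators treat "format" as an annotation unless configured otherwise. Below is a minimal stdlib-only sketch of that check for the three fields above. The function name validate_ticket_args is hypothetical, and a real project would more likely use a schema validation library.

```python
import re
from datetime import date

ALLOWED_STATUS = {"open", "in_progress", "closed"}

def validate_ticket_args(args: dict) -> list[str]:
    """Check model-supplied arguments against the tightened schema; return problems."""
    problems = []

    status = args.get("status")
    if status is not None and (not isinstance(status, str) or status not in ALLOWED_STATUS):
        problems.append(f"status must be one of {sorted(ALLOWED_STATUS)}, got {status!r}")

    priority = args.get("priority", 3)  # documented default when unspecified
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int) or not 1 <= priority <= 4:
        problems.append(f"priority must be an integer from 1 (urgent) to 4 (low), got {priority!r}")

    raw_date = args.get("date")
    if raw_date is not None:
        if not isinstance(raw_date, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw_date):
            problems.append(f"date must look like YYYY-MM-DD, got {raw_date!r}")
        else:
            try:
                date.fromisoformat(raw_date)  # rejects impossible dates such as 2026-02-30
            except ValueError:
                problems.append(f"date is not a real calendar date: {raw_date!r}")
    return problems

print(validate_ticket_args({"status": "open", "priority": 1, "date": "2026-05-30"}))
print(validate_ticket_args({"status": "pending", "priority": "high", "date": "tomorrow"}))
```

The problem strings are written so they can be passed straight back to the model as the tool result.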

Pattern 3: Make errors teach

This is the single pattern that made the biggest difference to my own agent's reliability, and I think it's badly underrated.

When a tool fails, the model will see the result as part of the next turn's context. If your error is uninformative, the model has nothing to act on. If your error is structured and instructive, the model can recover on its own — no retry logic, no human in the loop.

Bad:

{ "error": "bad input" }

Better:

{
  "error": {
    "code": "USER_NOT_FOUND",
    "message": "No user exists with id 'usr_9k2x'.",
    "hint": "If you only have an email address, call lookup_user_by_email instead."
  }
}

The hint field is the one that earns its keep. It turns the tool's error path into a redirect. The model reads the hint, calls the right tool on the next turn, and the agent loop closes successfully without the user ever seeing the failure.

A good rule: every error your tool returns should answer the question "what should the model do differently next time?"
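
As a rough sketch of how a handler might produce the "Better" payload above: the helper name tool_error and the in-memory USERS store are assumptions made up for this example, standing in for whatever your backend actually does.

```python
import json

# Hypothetical in-memory store standing in for a real user database.
USERS = {"usr_1a2b": {"name": "Ada", "email": "ada@example.com"}}

def tool_error(code: str, message: str, hint: str) -> dict:
    """Build a structured error: a stable code, a specific message, and a next step."""
    return {"error": {"code": code, "message": message, "hint": hint}}

def get_user(user_id: str) -> dict:
    """Look up a user; on a miss, return an error that redirects the model."""
    user = USERS.get(user_id)
    if user is None:
        return tool_error(
            "USER_NOT_FOUND",
            f"No user exists with id '{user_id}'.",
            "If you only have an email address, call lookup_user_by_email instead.",
        )
    return {"user": user}

# The serialized result is what the model would read on its next turn.
print(json.dumps(get_user("usr_9k2x"), indent=2))
```

Routing every failure through one helper makes it hard to ship an error without a hint, which is the point of the pattern.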

Pattern 4: One tool, one job

There is a strong temptation, especially if you've written REST APIs, to design a single tool with a mode or action parameter that switches behavior:

{
  "name": "manage_ticket",
  "description": "Create, update, close, or reopen a ticket.",
  "input_schema": {
    "type": "object",
    "properties": {
      "action": { "type": "string", "enum": ["create", "update", "close", "reopen"] },
      "ticket_id": { "type": "string" },
      "title": { "type": "string" },
      "body": { "type": "string" }
    },
    "required": ["action"]
  }
}

This looks elegant. It is, in practice, worse than four separate tools.

Reasons:

  • The model has to reason about which fields are required for which action. ticket_id is required for update/close/reopen but not create. The schema can't enforce that without conditional constructs such as oneOf or if/then, and even with them, the model's adherence tends to drop.
  • The description has to cover four behaviors, which dilutes the "when to use" signal.
  • Error messages become generic. "missing required field" is less useful than "close_ticket requires ticket_id".

The rewrite is four tools: create_ticket, update_ticket, close_ticket, reopen_ticket. Each has a focused description, a tight schema, and instructive errors. The model picks correctly more often, and when it doesn't, the failure is local to one tool, not the whole multiplexer.
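
Here is one way the split could look, sketched as Python dicts. The descriptions, the choice of required fields for create_ticket, and the helpers make_tool and missing_field_error are my assumptions for illustration.

```python
from typing import Optional

def make_tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Assemble a tool definition with an object schema (hypothetical helper)."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": properties, "required": required},
    }

TICKET_ID = {"type": "string", "description": "ID of an existing ticket."}
TITLE = {"type": "string", "description": "Short summary of the issue."}
BODY = {"type": "string", "description": "Full description of the issue."}

TOOLS = [
    make_tool("create_ticket", "Open a new ticket. Use this when no ticket exists yet for the issue.",
              {"title": TITLE, "body": BODY}, ["title"]),
    make_tool("update_ticket", "Edit the title or body of an existing ticket. Use this when the ticket ID is known.",
              {"ticket_id": TICKET_ID, "title": TITLE, "body": BODY}, ["ticket_id"]),
    make_tool("close_ticket", "Close an open ticket. Use this when the issue is resolved.",
              {"ticket_id": TICKET_ID}, ["ticket_id"]),
    make_tool("reopen_ticket", "Reopen a closed ticket. Use this when a resolved issue has come back.",
              {"ticket_id": TICKET_ID}, ["ticket_id"]),
]

def missing_field_error(tool: dict, args: dict) -> Optional[str]:
    """Name the tool and the field, so the error is specific rather than generic."""
    for field in tool["input_schema"]["required"]:
        if field not in args:
            return f"{tool['name']} requires {field}"
    return None

by_name = {tool["name"]: tool for tool in TOOLS}
print(missing_field_error(by_name["close_ticket"], {}))
```

Because each tool has a flat required list, no conditional schema is needed, and the error message can name the exact tool that was misused.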

The exception: if you have genuinely dozens of near-identical operations (CRUD over hundreds of resource types), at some point a parameterized tool wins on context window cost. That tradeoff is real, but it shows up later than people think.

Pattern 5: Design for parallel calls, not just sequential ones

When the model emits multiple tool_use blocks in a single turn, the runtime can execute them in parallel and return all of the results together before the next model call. Whether that actually happens is up to the code running the loop, but many agent runtimes do it. This is the agent loop's most underrated feature, and it's the one most tool designs accidentally break.

A tool is parallel-safe when:

  • It is idempotent, or at least non-destructive when called more than once.
  • It does not depend on state mutated by another tool call in the same turn.
  • Its result is interpretable without reference to other concurrent results.

A tool is not parallel-safe when:

  • It mutates shared state (counters, queues, locks).
  • It depends on a previous call's output as input.
  • It returns results that are only meaningful in a specific ordering.

If you have tools that aren't parallel-safe, document that explicitly in the description: "This tool must be called after fetch_session and not in parallel with other write operations." The model will usually respect that.

The deeper point: when you design a new tool, ask yourself "what happens if the model calls this twice in the same turn?" If the answer is "undefined behavior", you have more work to do.
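
For a concrete picture of what the runtime side can look like, here is a minimal asyncio sketch. The two tools, the HANDLERS table, and the function names are hypothetical. The tool_use and tool_result shapes are modeled loosely on the Anthropic Messages API, so check them against the API reference before relying on them.

```python
import asyncio
import json

# Hypothetical read-only tools; the sleep stands in for network I/O.
async def fetch_weather(city: str) -> dict:
    await asyncio.sleep(0.01)
    return {"city": city, "forecast": "sunny"}

async def fetch_local_time(city: str) -> dict:
    await asyncio.sleep(0.01)
    return {"city": city, "time": "12:00"}

HANDLERS = {"fetch_weather": fetch_weather, "fetch_local_time": fetch_local_time}

async def run_one(call: dict) -> dict:
    """Execute a single tool_use block and wrap the outcome as a tool_result."""
    try:
        payload = await HANDLERS[call["name"]](**call["input"])
    except Exception as exc:  # a failure in one call must not sink the others
        payload = {"error": {"code": "TOOL_FAILED", "message": str(exc)}}
    return {"type": "tool_result", "tool_use_id": call["id"], "content": json.dumps(payload)}

async def run_tool_calls(calls: list) -> list:
    """Run all calls from one turn concurrently; results keep the request order."""
    return await asyncio.gather(*(run_one(call) for call in calls))

calls = [
    {"type": "tool_use", "id": "toolu_1", "name": "fetch_weather", "input": {"city": "Oslo"}},
    {"type": "tool_use", "id": "toolu_2", "name": "fetch_local_time", "input": {"city": "Oslo"}},
]
results = asyncio.run(run_tool_calls(calls))
for result in results:
    print(result["tool_use_id"], result["content"])
```

Both tools here are read-only, which is what makes running them through asyncio.gather safe. A tool that mutates shared state would need to be serialized by the runtime instead.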

A short MCP note

If you're exposing tools through the Model Context Protocol (MCP), all five patterns still apply — the protocol just standardizes how the host discovers and invokes them. MCP adds two adjacent primitives worth knowing:

  • Resources are read-only, addressable content (a file, a database row, a URL). Each one is identified by a URI, and the host application decides which ones to put in front of the model.
  • Prompts are reusable, parameterized templates the host can offer to the user.

The practical implication: if your tool is fundamentally "give me the content of X", it might want to be a resource, not a tool. Resources are cheaper to expose, easier to cache, and don't burn an agent-loop turn to read.

Closing

Most "the model called the wrong tool" bugs are not model bugs. They're tool design bugs surfaced by the model's willingness to guess when the design is ambiguous. The five patterns above — descriptions that say when, tight schemas, instructive errors, narrow tools, and parallel-safety — close the ambiguity gaps that LLMs are most likely to fall into.

If you're studying for the Claude Certified Architect Foundations exam, the Tool Design & MCP domain (18% of the blueprint) is heavy on scenario questions that test exactly these tradeoffs. The independent practice platform I run at claudecertifiedarchitect.dev has a free 15-question set — a few of them in this domain — if you want to calibrate where you currently stand. It emails you a diagnostic at the end; no payment to try it.

Not affiliated with Anthropic; just a thing I built because I wanted blueprint-weighted practice that didn't exist yet.
