Tool Documentation Is Written for the LLM, Not for Humans
Have you ever written a tool like this?
@lc_tool
def get_data(query: str) -> str:
    """Get data."""
    ...
That's bad documentation for a human. For an LLM it's worse — it doesn't know what this tool does, when to call it, or what to pass as the parameter.
Tool design has three core dimensions: description quality (whether the LLM selects your tool), error handling (whether the agent crashes on failure), and granularity (whether parameters are easy to extract). This article tests each one with a small experiment and reports what actually happened.
Demo 1: Description Quality — When It Actually Matters
Two versions of the same weather tool:
# Version A: vague
@lc_tool
def weather_vague(city: str) -> str:
    """Get data."""
    ...

# Version B: precise
@lc_tool
def weather_precise(city: str) -> str:
    """Get current weather for a city.

    Returns temperature (Celsius) and condition (sunny / cloudy / rainy / unknown).
    Use this whenever the user asks about weather, temperature, or sky conditions
    for a specific city. Pass the city name as a plain string, e.g. 'Beijing'.
    """
    ...
Five weather queries tested against both agents:
Query                                             Vague       Precise
------------------------------------------------  ----------  ----------
What's the weather in Beijing today?              ✓ called    ✓ called
Is it raining in Shanghai right now?              ✓ called    ✓ called
What temperature should I expect in Shenzhen?     ✓ called    ✓ called
Should I bring an umbrella to Beijing?            ✓ called    ✓ called
How's the sky in Shanghai?                        ✓ called    ✓ called
Tool call rate — Vague: 5/5 Precise: 5/5
Both scored 5/5.
This is a counter-intuitive result with an important prerequisite: when an Agent has only one tool, the LLM has no choice but to use it regardless of the description. Description quality matters when the LLM must choose among multiple tools — and that's the norm in production.
An Agent with 10 tools receives "check the weather in Beijing." The LLM reads all 10 docstrings to find the best match. A well-documented tool wins that competition; a vague one gets overlooked in favor of tools with clearer purpose statements.
The golden docstring format:
"""<One sentence describing what it does>
Returns: <format and meaning of the return value>
Use when: <what type of user question should trigger this tool>
Parameters: <param name + format example>
"""
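As a rough illustration, here is the template filled in for a product lookup tool, next to a small checker that flags missing sections. The function body, the return shape, the example product name, and the `missing_sections` helper are all assumptions made for this sketch; they are not taken from the demo code.

```python
import inspect

# Section labels from the template above.
REQUIRED_SECTIONS = ("Returns:", "Use when:", "Parameters:")


def get_product_info(name: str) -> str:
    """Look up the monthly price of a product by name.

    Returns: JSON string such as {"monthly_price_usd": 299} (hypothetical shape),
        or an "Error: ..." string if the product is unknown.
    Use when: the user asks what a product costs.
    Parameters: name - product name as a plain string, e.g. 'pro plan'.
    """
    return "Error: not implemented in this sketch."


def get_data(query: str) -> str:
    """Get data."""
    return "Error: not implemented in this sketch."


def missing_sections(fn) -> list[str]:
    """Return the template sections that are absent from fn's docstring."""
    doc = inspect.getdoc(fn) or ""
    return [section for section in REQUIRED_SECTIONS if section not in doc]


print(missing_sections(get_product_info))  # []
print(missing_sections(get_data))          # ['Returns:', 'Use when:', 'Parameters:']
```

A check like this could run in a unit test so that a vague docstring fails the build before it ever reaches the model.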
Demo 2: Error Handling — Raise or Return?
Two tools with identical logic but different failure behavior:
# Raises on unknown city ← dangerous
@lc_tool
def weather_raises(city: str) -> str:
    """Get current weather for a city."""
    if city.lower() not in MOCK_WEATHER:
        raise ValueError(f"City '{city}' not found in database.")
    ...

# Returns a helpful error string ← safe
@lc_tool
def weather_returns_error(city: str) -> str:
    """Get current weather for a city. Returns error message if city not found."""
    data = MOCK_WEATHER.get(city.lower())
    if data is None:
        return (f"City '{city}' not found. "
                f"Available cities: {list(MOCK_WEATHER.keys())}. "
                f"Please ask the user to confirm the city name.")
    ...
Three test cases:
Known city (Beijing): Both work identically — no observable difference.
Unknown city (Atlantis):
raises : [CRASHED] ValueError: City 'Atlantis' not found in database.
returns: I'm sorry, but I couldn't find the weather information for Atlantis.
Please make sure the city name is correct...
weather_raises crashes the entire agent run; weather_returns_error lets the LLM read the error string and compose a friendly response.
Typo city (Shanghia):
raises : The current weather in Shanghai is cloudy with a temperature of 22°C.
returns: The current weather in Shanghai is 22°C with a cloudy condition.
Both answered correctly — because the LLM corrected "Shanghia" to "Shanghai" before calling the tool. The tool received the right city name and never reached the error path.
This demonstrates the LLM's self-healing input capability, but you can't rely on it.
Rule: tools should only return, never raise. Exceptions escape the Agent's control flow — the LLM has no opportunity to handle them. Error strings can be read, understood, and acted on: the LLM can retry with a corrected parameter, tell the user what's wrong, or try a different approach.
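One way to enforce this rule without rewriting every tool is a small wrapper that turns any exception into an error string. The sketch below uses only the standard library; the `returns_errors` decorator and the stand-in weather data are hypothetical and not part of the demo code.

```python
import functools

# Stand-in data for the sketch; the real MOCK_WEATHER is not shown in the article.
MOCK_WEATHER = {"beijing": "sunny, 25 C", "shanghai": "cloudy, 22 C"}


def returns_errors(fn):
    """Wrap a tool so that any exception comes back as an error string."""
    @functools.wraps(fn)  # keep the original name and docstring visible
    def wrapper(*args, **kwargs) -> str:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # deliberately broad: nothing should escape the tool
            reason = str(exc).rstrip(".")
            return (f"Error: {type(exc).__name__}: {reason}. "
                    "Check the arguments and retry, or ask the user to clarify.")
    return wrapper


@returns_errors
def weather_raises(city: str) -> str:
    """Get current weather for a city."""
    if city.lower() not in MOCK_WEATHER:
        raise ValueError(f"City '{city}' not found in database.")
    return MOCK_WEATHER[city.lower()]


print(weather_raises("Beijing"))
print(weather_raises("Atlantis"))
```

Because `functools.wraps` preserves the docstring, the wrapped function still describes itself the same way. How the wrapper should be stacked with `@lc_tool` is not covered by the demo and would need to be checked against the framework in use.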
Demo 3: Granularity — Fat Tool vs Fine-grained Tools
Fat tool: handles everything, accepts free-text input.
@lc_tool
def omnibus_lookup(query: str) -> str:
    """Look up weather, product info, or evaluate math. Pass the full user question."""
    q = query.lower()
    for city in MOCK_WEATHER:
        if city in q:
            return json.dumps(MOCK_WEATHER[city])
    for name in MOCK_PRODUCTS:
        if name in q:
            return json.dumps(MOCK_PRODUCTS[name])
    # try math...
Fine-grained tools: three separate tools with typed parameters.
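The article does not list the fine-grained tools, so the following is only a plausible sketch of what they might look like. The tool names come from the test output below and the temperatures match it; the product name, the JSON field names, and the restricted arithmetic evaluator are assumptions. The `@lc_tool` decorator is omitted so the snippet runs with the standard library alone.

```python
import ast
import json
import operator

# Mock data: temperatures match the demo output; the product name is made up.
MOCK_WEATHER = {
    "beijing": {"temp_c": 25, "condition": "sunny"},
    "shanghai": {"temp_c": 22, "condition": "cloudy"},
}
MOCK_PRODUCTS = {"pro plan": {"monthly_price_usd": 299}}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def get_weather(city: str) -> str:
    """Get current weather for a city, e.g. 'Beijing'. Returns a JSON string."""
    data = MOCK_WEATHER.get(city.strip().lower())
    if data is None:
        return f"Error: city '{city}' not found. Available: {sorted(MOCK_WEATHER)}."
    return json.dumps(data)


def get_product_info(name: str) -> str:
    """Get pricing for a product, e.g. 'pro plan'. Returns a JSON string."""
    data = MOCK_PRODUCTS.get(name.strip().lower())
    if data is None:
        return f"Error: product '{name}' not found. Available: {sorted(MOCK_PRODUCTS)}."
    return json.dumps(data)


def _eval(node: ast.AST):
    """Evaluate a parsed expression made only of numbers and + - * /."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("only numbers and + - * / are supported")


def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression, e.g. '299 * 12'. Returns the result as text."""
    try:
        return str(_eval(ast.parse(expression, mode="eval").body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return f"Error: could not evaluate '{expression}' ({exc}). Use a form like '299 * 12'."


print(get_weather("Beijing"))   # {"temp_c": 25, "condition": "sunny"}
print(calculator("25 - 22"))    # 3
print(calculator("299 * 12"))   # 3588
```

Each tool takes one semantically clear parameter, and each can be unit-tested on its own, which is the observability argument made later in this section.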
Four test cases:
Single queries (weather, product): both approaches work; no meaningful difference.
Multi-step — weather + temperature difference:
Fat tools=['omnibus_lookup', 'omnibus_lookup']
→ The temperature in Beijing is 25°C and Shanghai is 22°C. The difference is 3°C.
Fine tools=['get_weather', 'get_weather', 'calculator']
→ The difference is 3°C. (3 explicit calls)
Multi-step — product price + annual calculation:
Fat tools=['omnibus_lookup'] ← only one call!
→ The monthly price is $299. The annual cost is $3588.
Fine tools=['get_product_info', 'calculator'] ← two calls
→ The monthly price is $299. The annual cost is $3588.
This last result is the most interesting: the fat tool was called only once and the agent still got the right answer. The LLM found the $299 price in omnibus_lookup's response, then did the arithmetic itself (299 × 12 = 3588) without triggering a separate calculator call.
The fat tool isn't always worse — sometimes it accomplishes a task in fewer calls. But the execution path is opaque, untestable, and hard to maintain.
When to use fine-grained, when merging is acceptable:
Use fine-grained when:
- Different tools are triggered by different query types
- Parameters have clear semantic types (city: str, amount: float)
- You need observability (per-tool timing, input logging)
Merging is acceptable when:
- Two operations always appear together, never used separately
- The merged parameter is still structured (not free-text)
- Example: get_weather_with_unit(city: str, unit: Literal["C","F"])
Never merge when:
- The combined parameter degrades to free text (query: str)
- Tool description needs "and/or" to cover multiple domains
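The acceptable-merge example above, `get_weather_with_unit`, could look roughly like this. Only the signature comes from the article; the mock temperatures, the JSON field names, and the default unit are assumptions for the sketch.

```python
import json
from typing import Literal

# Celsius readings matching the demo output; hypothetical storage format.
MOCK_WEATHER_C = {"beijing": 25.0, "shanghai": 22.0}


def get_weather_with_unit(city: str, unit: Literal["C", "F"] = "C") -> str:
    """Get the current temperature for a city in Celsius or Fahrenheit.

    Returns: JSON string with the fields city, temperature, and unit.
    Use when: the user asks for a temperature and may name a unit.
    Parameters: city - plain city name, e.g. 'Beijing'; unit - 'C' or 'F'.
    """
    key = city.strip().lower()
    if key not in MOCK_WEATHER_C:
        return f"Error: city '{city}' not found. Available: {sorted(MOCK_WEATHER_C)}."
    if unit not in ("C", "F"):
        # Type hints are not enforced at runtime, so validate explicitly.
        return f"Error: unit '{unit}' is invalid. Use 'C' or 'F'."
    celsius = MOCK_WEATHER_C[key]
    value = celsius if unit == "C" else celsius * 9 / 5 + 32
    return json.dumps({"city": key, "temperature": value, "unit": unit})


print(get_weather_with_unit("Beijing", "F"))
```

Both parameters stay structured: the model picks a city name and one of two literal values, so nothing degrades into free text.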
Five Golden Rules for Tool Design
Principle        Bad                      Good
──────────────────────────────────────────────────────────────────────────
Description      "Get data."              What + When + How + param example
Error handling   raise ValueError(...)    return "Error: ... Available: [...]"
Granularity      omnibus(query: str)      get_weather(city: str)
Parameter name   lookup(q: str)           get_weather(city: str)
Return format    raw dict / None          JSON string or error string
Golden rule: design tools for the LLM, not for humans.
The LLM uses three pieces of information to decide how to call a tool:
- Docstring: decides whether to select this tool
- Parameter types and names: decides what value to pass
- Return value: decides what to do next
Get these three right, and the tool will be used correctly without extra prompting.
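To make the first two of these concrete, the sketch below uses `inspect` to collect roughly what a framework extracts from a function before handing it to the model. It is a simplified approximation: real frameworks typically build a fuller JSON schema, and `describe_tool` is a hypothetical helper, not a LangChain API.

```python
import inspect
import json


def get_weather(city: str) -> str:
    """Get current weather for a city. Use when the user asks about weather."""
    return json.dumps({"city": city, "condition": "unknown"})


def lookup(q: str) -> str:
    """Get data."""
    return json.dumps({"q": q})


def describe_tool(fn) -> dict:
    """Collect the name, docstring, and typed parameters of a tool function."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            name: getattr(p.annotation, "__name__", str(p.annotation))
            for name, p in sig.parameters.items()
        },
    }


for tool in (get_weather, lookup):
    print(json.dumps(describe_tool(tool), indent=2))
```

Printed side by side, `lookup` with `q` and "Get data." gives the model almost nothing to go on, while `get_weather` with `city` says what to pass and when.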
Design Checklist
Docstring
- [ ] First sentence states what the tool does (start with a verb)
- [ ] Describe the return value format (JSON / plain text / error string)
- [ ] State when to use it ("use this when the user asks about...")
- [ ] Give a parameter example (e.g. 'Beijing' or '299 * 12')
Error Handling
- [ ] Tools only return, never raise
- [ ] Error messages include actionable guidance ("not found. Available: [...]")
- [ ] Distinguish "data doesn't exist" from "input format wrong" — give different hints
Granularity
- [ ] Parameters are structured types (semantically clear str, int, float), not free text
- [ ] One tool does one thing; if the description needs "and" or "or", consider splitting
- [ ] Tools are mutually exclusive: different queries trigger different tools
Return Format
- [ ] Success: JSON string (easy for the LLM to parse fields)
- [ ] Failure: "Error: <reason>. <suggested action>" format
- [ ] Never return None or an empty string; the LLM doesn't know what to do with them
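The return-format items can be captured in a few helpers. This is a minimal sketch; `ok`, `fail`, and `normalize` are hypothetical names introduced here, not functions from the demo code.

```python
import json


def ok(payload: dict) -> str:
    """Serialize a successful result as a JSON string."""
    return json.dumps(payload, ensure_ascii=False)


def fail(reason: str, action: str) -> str:
    """Format a failure as 'Error: <reason>. <suggested action>'."""
    return f"Error: {reason.rstrip('.')}. {action}"


def normalize(result) -> str:
    """Guarantee the model never sees None or an empty string."""
    if result is None or result == "":
        return fail("the tool returned no data",
                    "Try different arguments or tell the user nothing was found.")
    return result if isinstance(result, str) else ok(result)


print(normalize({"temp_c": 25}))
print(normalize(None))
print(fail("city 'Atlantis' not found", "Available: ['beijing', 'shanghai']."))
```

Passing every tool result through something like `normalize` just before returning keeps the success and failure shapes consistent across tools.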
Summary
Five core takeaways:
- Description quality matters in multi-tool competition: with one tool the LLM has no choice; with many tools a well-documented tool wins the selection
- Tools return strings, never raise exceptions: raise crashes the agent; returning an error string gives the LLM a chance to recover
- The LLM has self-healing input capability, but don't rely on it: "Shanghia" was auto-corrected to "Shanghai," but this isn't a reliable defense layer
- Fat tools aren't always worse, but they're opaque: real benchmarks showed the fat tool completing a two-step task in one call — but the path is untraceable and untestable
- Parameter naming and typing determine parameter quality: city: str (clear semantics) beats q: str (free text); the clearer the parameter's name and type, the more accurately the LLM extracts its value
Up next: Advanced Context Engineering — how to precisely control what information gets sent to the LLM: system prompt optimization, few-shot example selection, and dynamic context injection.
References
- LangChain Tools documentation
- LangGraph ReAct Agent reference
- Full demo code for this series: agent-15-tool-design