WonderLab

Posted on May 25 • Edited on May 27

Agent Series (4): Deep Dive into Tool Calling — The Agent's Hands and Eyes

#ai #langchain #agents #toolcall

Tools Are the Agent's Hands and Eyes

The previous three articles covered the thinking frameworks — how ReAct reasons while acting, and how Plan-and-Solve plans before executing. But even the best reasoning framework is nearly useless if the Agent can only talk to itself.

Tools are what let Agents break out of the language model's boundaries. With tools, an Agent can:

Query real-time data (stock prices, weather, news)
Operate on file systems
Call external APIs
Execute code and calculations

But the quality of your tool design directly determines the Agent's reliability. A poorly designed tool can leave an Agent spinning in error loops or, worse, open up security vulnerabilities.

This article breaks down tool calling along five dimensions: design, validation, security, parallel calls, and error handling, each grounded in real execution results.

Good Tool vs. Bad Tool: Same Task, Completely Different Agent Behavior

Let's start with a comparison experiment to make the conclusion concrete.

Same "stock price query" functionality, two implementations:

Bad tool (three classic problems):

@tool
def bad_stock_tool(x: str) -> str:
    """Get stock info."""   # ← Thin docs: no param description, no return format, no examples
    _MOCK_STOCKS = {"AAPL": 189.5, "GOOGL": 175.2, "MSFT": 420.3}
    price = _MOCK_STOCKS[x]      # ← KeyError crashes without handling
    return f"{price}"            # ← Just a number, no currency, no context

Good tool (three corresponding improvements):

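import re
from langchain_core.tools import tool

# Assumes _MOCK_STOCKS maps each ticker to a dict with "name", "price", "currency", and "change_pct" keys.
@tool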
def get_stock_price(symbol: str) -> str:
    """Query the current price and daily change of a stock.

    Parameters:
      symbol: Stock ticker symbol, uppercase letters, e.g. "AAPL", "GOOGL", "MSFT"

    Returns:
      A string with the company name, current price (USD), and daily change percent.
      Returns an error description if the symbol is not found.

    Examples:
      get_stock_price("AAPL")    → "Apple Inc. (AAPL): $189.50 USD, today +1.23%"
      get_stock_price("ZZZZ")    → "Symbol ZZZZ not found. Supported tickers: ..."
    """
    symbol = symbol.strip().upper()
    if not re.match(r"^[A-Z]{1,5}$", symbol):
        return f"Invalid ticker format: {symbol!r}. Tickers must be 1-5 uppercase letters."

    info = _MOCK_STOCKS.get(symbol)
    if info is None:
        supported = ", ".join(_MOCK_STOCKS.keys())
        return f"Symbol {symbol} not found. Supported tickers: {supported}"

    sign = "+" if info["change_pct"] >= 0 else ""
    return (
        f"{info['name']} ({symbol}): "
        f"${info['price']:.2f} {info['currency']}, "
        f"today {sign}{info['change_pct']:.2f}%"
    )

Testing both with the same query: "Please check the price of AAPL and a nonexistent stock XYZ999"

Bad tool execution trace:

[Tool Call]    bad_stock_tool(x='AAPL')
[Tool Return]  189.5

[Tool Call]    bad_stock_tool(x='XYZ999')
[Tool Return]  Error: KeyError('XYZ999')
               Please fix your mistakes.

[Final Answer]
  The price of AAPL is 189.5 USD, but XYZ999 does not exist and cannot be queried.

Good tool execution trace:

[Tool Call]    get_stock_price(symbol='AAPL')
[Tool Return]  Apple Inc. (AAPL): $189.50 USD, today +1.23%

[Tool Call]    get_stock_price(symbol='XYZ999')
[Tool Return]  Invalid ticker format: 'XYZ999'. Tickers must be 1-5 uppercase letters.

[Final Answer]
  AAPL is currently trading at $189.50, up 1.23% today.
  XYZ999 is not a valid ticker — please verify the symbol.

Two observations worth noting:

Observation 1: The bad tool's KeyError didn't crash the Agent. In this setup, LangGraph caught the exception and handed it back to the model as the tool result: Error: KeyError('XYZ999'), followed by "Please fix your mistakes." The Agent still produced a final answer, so "tool crash = Agent crash" is a myth. The framework has fault tolerance built in.

Observation 2: But the output quality gap is obvious. The good tool's error message explains why the input was invalid (format problem), letting the Agent give a more helpful response. The bad tool's error is just an exception name, so the Agent can only say vaguely "it doesn't exist."

Conclusion: A tool that doesn't crash isn't automatically a well-designed tool. The quality of error messages directly determines the quality of the Agent's answer to the user.

The Three Pillars of Tool Design

The comparison above maps to three core dimensions of tool design:

Pillar 1: Interface — Documentation Is the Contract

LLMs understand tools through their docstrings. Unclear docs → LLM guesses → bugs.

A complete tool docstring should include:

@tool
def get_stock_price(symbol: str) -> str:
    """[What it does] Query the current price and daily change of a stock.

    Parameters:
      symbol: [meaning + format constraint] Uppercase ticker, e.g. "AAPL", "GOOGL"

    Returns:
      [success case] A string with company name, price (USD), and daily change.
      [error case] An error description if the symbol is not found.

    Examples:
      [success] get_stock_price("AAPL") → "Apple Inc. (AAPL): $189.50 USD"
      [failure] get_stock_price("ZZZZ") → "Symbol ZZZZ not found"
    """

Key point: Examples must include both success and failure cases — the LLM needs to know what failure looks like in order to handle error branches correctly.
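One way to enforce this is a small pre-flight check that looks for the expected sections in a function's docstring before it is registered as a tool. The sketch below is a hypothetical helper: `missing_doc_sections` and the section names are assumptions modeled on the layout above, not part of LangChain, and it is meant to run on the plain function before any decorator is applied.

```python
import inspect

# Section headers the docstring layout above relies on (an assumption of this sketch).
REQUIRED_SECTIONS = ("Parameters:", "Returns:", "Examples:")


def missing_doc_sections(func) -> list[str]:
    """Return the required section headers that are absent from func's docstring."""
    doc = inspect.getdoc(func) or ""
    return [section for section in REQUIRED_SECTIONS if section not in doc]


def thin_tool(x: str) -> str:
    """Get stock info."""
    return x


def documented_tool(symbol: str) -> str:
    """Query the current price of a stock.

    Parameters:
      symbol: Uppercase ticker, e.g. "AAPL"

    Returns:
      A price string, or an error description if the symbol is not found.

    Examples:
      documented_tool("AAPL") -> "Apple Inc. (AAPL): $189.50 USD"
      documented_tool("ZZZZ") -> "Symbol ZZZZ not found"
    """
    return symbol


print(missing_doc_sections(thin_tool))        # all three sections are missing
print(missing_doc_sections(documented_tool))  # empty list
```

A check like this only confirms that the sections exist; whether the examples are accurate still needs a human (or a test) to verify.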

Pillar 2: Validation — Pydantic Is Your Gatekeeper

For single-parameter tools, in-function validation works fine. For multi-parameter tools with complex constraints, Pydantic's BaseModel is the right choice:

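from langchain_core.tools import tool
from pydantic import BaseModel, Field, field_validator

# _EXCHANGE_RATES (code → rate) and SUPPORTED_CURRENCIES (list of codes) are assumed to be defined at module level.
class CurrencyConvertInput(BaseModel):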
    amount: float = Field(..., gt=0, le=1_000_000_000)
    from_currency: str = Field(...)
    to_currency: str = Field(...)

    @field_validator("from_currency", "to_currency")
    @classmethod
    def validate_currency(cls, v: str) -> str:
        code = v.strip().upper()
        if code not in _EXCHANGE_RATES:
            raise ValueError(
                f"Unsupported currency code: {code!r}. "
                f"Supported: {SUPPORTED_CURRENCIES}"
            )
        return code

@tool(args_schema=CurrencyConvertInput)
def convert_currency(amount: float, from_currency: str, to_currency: str) -> str:
    ...

Three test cases showing Pydantic's interception:

# Normal request
[Tool Call]    convert_currency(amount=1000, from_currency='USD', to_currency='CNY')
[Tool Return]  1,000.00 USD = 7,250.00 CNY (rate: 1 USD ≈ 7.2500 CNY)

# Negative amount
[Tool Call]    convert_currency(amount=-500, from_currency='USD', to_currency='CNY')
[Tool Return]  Error: 1 validation error for CurrencyConvertInput
               amount
                 Input should be greater than 0 [type=greater_than, input_value=-500]

[Final Answer] The amount must be positive. Please enter a positive number.

# Unsupported currency
[Tool Call]    convert_currency(amount=100, from_currency='USD', to_currency='BTC')
[Tool Return]  Error: Unsupported currency code: 'BTC'.
               Supported: ['USD', 'CNY', 'EUR', 'JPY', 'GBP', 'HKD']

[Final Answer] Converting to Bitcoin is not supported.
               Supported currencies: USD, CNY, EUR, JPY, GBP, HKD.

Pydantic's advantages:

Human-readable errors: Input should be greater than 0 is far clearer than ValueError: invalid amount
Automatic type coercion: If the LLM passes string "1000", Pydantic converts it to float
Separation of concerns: Validation rules live separate from business logic
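The separation-of-concerns point does not depend on Pydantic specifically. As a rough illustration using only the standard library (a stand-in for the idea, not a replacement for `args_schema`), a dataclass can own the rules while the business function assumes clean input. `ConvertRequest`, `parse_request`, and the hard-coded currency tuple are hypothetical names for this sketch.

```python
from dataclasses import dataclass

# Currency codes taken from the supported list shown in the trace above.
SUPPORTED = ("USD", "CNY", "EUR", "JPY", "GBP", "HKD")


@dataclass(frozen=True)
class ConvertRequest:
    amount: float
    from_currency: str
    to_currency: str

    def __post_init__(self):
        # Coerce first (so "1000" becomes 1000.0), then apply the range rule.
        amount = float(self.amount)
        if not 0 < amount <= 1_000_000_000:
            raise ValueError(f"amount must be > 0 and <= 1e9, got {amount}")
        object.__setattr__(self, "amount", amount)

        # Normalize and check both currency codes with the same rule.
        for name in ("from_currency", "to_currency"):
            code = str(getattr(self, name)).strip().upper()
            if code not in SUPPORTED:
                raise ValueError(
                    f"Unsupported currency code: {code!r}. Supported: {list(SUPPORTED)}"
                )
            object.__setattr__(self, name, code)


def parse_request(raw: dict):
    """Return a validated ConvertRequest, or an error string the Agent can read."""
    try:
        return ConvertRequest(**raw)
    except (TypeError, ValueError) as exc:
        return f"Error: {exc}"


print(parse_request({"amount": "1000", "from_currency": " usd ", "to_currency": "cny"}))
print(parse_request({"amount": -500, "from_currency": "USD", "to_currency": "CNY"}))
print(parse_request({"amount": 100, "from_currency": "USD", "to_currency": "BTC"}))
```

The conversion function itself never sees an invalid amount or an unknown code, which is the same division of labor the Pydantic schema provides with far less hand-written code.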

Pillar 3: Security — Trust Nothing from Input

This is the most commonly overlooked dimension. Tools bridge the Agent and external systems — without proper boundary checks, an Agent can be manipulated into doing dangerous things.

Three Security Threats and How to Defend Against Them

Threat 1: Path Traversal

A user (or malicious prompt) asks the Agent to read ../../../etc/passwd:

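import re
from pathlib import Path
from langchain_core.tools import tool

# _SANDBOX_DIR is assumed to be a pathlib.Path for the only directory this tool may read.
@tool
def read_file(filename: str) -> str:
    # Security check 1: reject path traversal sequences and path separators
    if any(token in filename for token in ("..", "/", "\\")):
        return f"Security denied: path characters not allowed in filename ({filename!r})"

    # Security check 2: whitelist of ASCII letters, digits, dots, underscores, hyphens.
    # fullmatch (unlike match with "$") also rejects a trailing newline.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", filename):
        return f"Security denied: invalid filename format ({filename!r})"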

    target = _SANDBOX_DIR / filename

    # Security check 3: physical path must be inside the sandbox (blocks symlink attacks)
    try:
        target.resolve().relative_to(_SANDBOX_DIR.resolve())
    except ValueError:
        return "Security denied: path resolves outside sandbox"
    ...

Each of the three layers has a distinct job:

Layer 1: Fast string rejection (catches the most common attacks)
Layer 2: Whitelist format check (catches attempts to bypass layer 1)
Layer 3: Path.resolve() physical path verification (blocks symlink-based bypasses)

Real test results:

# Normal read
[Tool Call]    read_file("report.txt")
[Tool Return]  Q1 Sales Report: Total revenue 12M, up 15% YoY.

# Path traversal attempt
[Tool Call]    read_file("../../../etc/passwd")
[Tool Return]  Security denied: path characters not allowed in filename ('../../../etc/passwd')
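
To see why the third layer matters on its own, the sketch below builds a throwaway sandbox in a temporary directory and plants a symlink whose name would pass any filename check but whose target lies outside. The helper name `is_inside_sandbox` and the file names are hypothetical, and the symlink part assumes a platform where unprivileged symlinks are allowed.

```python
import tempfile
from pathlib import Path


def is_inside_sandbox(sandbox: Path, filename: str) -> bool:
    """True if sandbox/filename, after resolving symlinks and '..', stays in sandbox."""
    try:
        (sandbox / filename).resolve().relative_to(sandbox.resolve())
        return True
    except ValueError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sandbox = root / "sandbox"
    sandbox.mkdir()
    (sandbox / "report.txt").write_text("Q1 Sales Report")
    (root / "secret.txt").write_text("outside the sandbox")

    results = {
        "plain file": is_inside_sandbox(sandbox, "report.txt"),
        "dot-dot path": is_inside_sandbox(sandbox, "../secret.txt"),
    }

    try:
        # The name "notes.txt" looks harmless, but the link target is outside.
        (sandbox / "notes.txt").symlink_to(root / "secret.txt")
        results["symlink"] = is_inside_sandbox(sandbox, "notes.txt")
    except OSError:
        results["symlink"] = None  # symlinks unavailable on this platform

    print(results)
```

The symlink case is the one that string checks alone cannot catch, since "notes.txt" contains no suspicious characters at all.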

Threat 2: SQL/Command Injection

An attacker crafts input designed to break query logic:

@tool
def lookup_user(user_id: str) -> str:
    # Strict whitelist: 1-10 ASCII digits only (fullmatch also rejects a trailing newline)
    if not re.fullmatch(r"[0-9]{1,10}", user_id):
        return (
            f"Security denied: user_id must be 1-10 digits, "
            f"received: {user_id!r}"
        )
    # Never concatenate user input into SQL strings
    user = _MOCK_USERS.get(user_id)  # dict lookup simulates parameterized query
    ...

Real test results:

# Normal query
Input: "10001"
[Tool Return]  User 10001: Zhang San, role: admin, dept: Engineering

# SQL injection attempt
Input: "1 OR 1=1; DROP TABLE users--"
[Tool Return]  Security denied: user_id must be 1-10 digits,
               received: '1 OR 1=1; DROP TABLE users--'

[Agent Reply]  The user ID format is invalid. Please use a 1-10 digit numeric ID.

Core principle: Validate at the tool layer; don't trust the Agent to "rationally" pass only valid inputs. Prompt injection can hijack an Agent into passing malicious parameters.
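
The dict lookup above only simulates a database. With a real driver, the same two-step defense (format whitelist, then a parameterized query) might look like the following sketch, which uses the standard library's `sqlite3` with an in-memory table. The schema, the `lookup_user_db` name, and the "not found" message are assumptions for illustration; the sample row is copied from the test output above.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, role TEXT, dept TEXT)")
conn.execute("INSERT INTO users VALUES ('10001', 'Zhang San', 'admin', 'Engineering')")


def lookup_user_db(user_id: str) -> str:
    # Layer 1: strict whitelist. fullmatch also rejects a trailing newline.
    if not re.fullmatch(r"[0-9]{1,10}", user_id):
        return f"Security denied: user_id must be 1-10 digits, received: {user_id!r}"

    # Layer 2: the "?" placeholder sends user_id as data, never as SQL text.
    row = conn.execute(
        "SELECT name, role, dept FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return f"User {user_id} not found"
    name, role, dept = row
    return f"User {user_id}: {name}, role: {role}, dept: {dept}"


print(lookup_user_db("10001"))
print(lookup_user_db("1 OR 1=1; DROP TABLE users--"))
```

Either layer alone would stop this particular payload; keeping both means a future change to the regex does not silently reopen the injection path.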

Threat 3: Rate Limit Abuse

A simple sliding-window limiter (a lighter-weight cousin of the token bucket) prevents a tool from being called too frequently:

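import time

class _RateLimiter:
    def __init__(self, max_calls: int, window_seconds: int = 60):
        self._max = max_calls
        self._window = window_seconds
        self._calls: list[float] = []   # timestamps of calls inside the current window

    def allow(self) -> bool:
        now = time.time()
        # Drop timestamps that have aged out of the window
        self._calls = [t for t in self._calls if now - t < self._window]
        if len(self._calls) >= self._max:
            return False
        self._calls.append(now)
        return True

    def wait_seconds(self) -> float:
        # Time until the oldest recorded call leaves the window and frees a slot
        if not self._calls:
            return 0.0
        return max(0.0, self._window - (time.time() - self._calls[0]))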

_search_limiter = _RateLimiter(max_calls=10, window_seconds=60)

@tool
def rate_limited_search(query: str) -> str:
    if not _search_limiter.allow():
        wait = _search_limiter.wait_seconds()
        return f"Rate limit exceeded (max 10 calls/min). Wait ~{wait:.0f}s before retrying."
    ...
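
Strictly speaking, a token bucket keeps a refillable budget of tokens rather than a list of call timestamps, which allows short bursts while capping the long-run rate. A minimal sketch for comparison, with hypothetical names and an injectable clock so the behavior can be checked without sleeping:

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` calls, refilled at `refill_per_sec` tokens/second."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self._capacity = capacity
        self._rate = refill_per_sec
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now

    def allow(self) -> bool:
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    def wait_seconds(self) -> float:
        """Seconds until at least one token is available (0 if one already is)."""
        self._refill()
        return max(0.0, (1 - self._tokens) / self._rate)


# Fake clock for a deterministic demo: 2-call burst, one new token every 10 seconds.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=0.1, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
wait = bucket.wait_seconds()
fake_now[0] += 10.0
after_wait = bucket.allow()
print(burst, wait, after_wait)
```

The third call in the burst is refused, and the bucket accepts a call again once the fake clock has advanced by the reported wait.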

Parallel Tool Calls: Theory vs. Reality

LangGraph supports parallel tool calls — when the LLM returns multiple tool_calls in a single response, LangGraph executes them simultaneously, significantly reducing latency.
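
The latency benefit comes from overlapping I/O waits. The sketch below is not LangGraph's implementation; it uses a plain `ThreadPoolExecutor` and a stand-in `fake_tool` that sleeps to mimic a network call, just to show why one concurrent batch finishes in roughly the time of the slowest call rather than the sum of all calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_tool(name: str, arg: str, delay: float = 0.1) -> str:
    """Stand-in for an I/O-bound tool: sleeps, then returns a canned string."""
    time.sleep(delay)
    return f"{name}({arg}) ok"


# One batch of tool calls, as a model with parallel tool calling might emit them.
batch = [
    ("get_weather", "Beijing"),
    ("get_weather", "Shanghai"),
    ("get_weather", "Chengdu"),
    ("get_air_quality", "Beijing"),
    ("get_air_quality", "Shanghai"),
    ("get_air_quality", "Chengdu"),
]


def run_sequential(calls):
    return [fake_tool(name, arg) for name, arg in calls]


def run_concurrent(calls):
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(fake_tool, name, arg) for name, arg in calls]
        return [f.result() for f in futures]  # results in submission order


start = time.perf_counter()
seq_results = run_sequential(batch)
seq_elapsed = time.perf_counter() - start

start = time.perf_counter()
con_results = run_concurrent(batch)
con_elapsed = time.perf_counter() - start

print(f"sequential: {seq_elapsed:.2f}s, concurrent: {con_elapsed:.2f}s")
```

Both runs return the same six results; only the wall-clock time differs, which is the gap a model that emits one call per turn leaves on the table.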

For querying weather + air quality in 3 cities, the ideal flow would be:

LLM response:
  → Parallel call [get_weather("Beijing"), get_weather("Shanghai"), get_weather("Chengdu"),
                   get_air_quality("Beijing"), get_air_quality("Shanghai"), get_air_quality("Chengdu")]

All 6 tools run simultaneously → done in 1 round

But in practice, GLM-4-Flash did not produce parallel tool calls in this test, even when explicitly told to query cities "simultaneously":

[Tool Call]  get_weather(city='Beijing')
[Tool Call]  get_weather(city='Shanghai')
[Tool Call]  get_weather(city='Chengdu')
[Tool Call]  get_air_quality(city='Beijing')
[Tool Call]  get_air_quality(city='Shanghai')
[Tool Call]  get_air_quality(city='Chengdu')

Stats: 6 tool calls total, 0 parallel batches

All 6 calls ran sequentially, each as a separate AIMessage.

This is an important real-world constraint: parallel tool calling depends on the model itself, not just the framework. OpenAI's GPT-4o supports parallel calls, but not every model does, so test your model before assuming it will batch calls.

How to detect whether parallel calls actually happened:

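from langchain_core.messages import AIMessage

parallel_batches = 0
for msg in result["messages"]:   # result is assumed to be the dict returned by the agent run
    if isinstance(msg, AIMessage) and msg.tool_calls:
        if len(msg.tool_calls) > 1:
            # Multiple tool_calls in one AIMessage → truly parallel
            parallel_batches += 1
print(f"parallel batches: {parallel_batches}")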

Error Classification: Retryable vs. Non-Retryable

Not all tool errors are equal. Some errors make retrying pointless (bad format, permission denied). Others just need a moment (network timeout, service restart).

Signal retry intent through return value prefixes:

@tool
def fetch_report(report_id: str, retry_simulation: bool = False) -> str:
    if not re.match(r"^RPT-\d{4}$", report_id):
        return f"ERROR: Invalid report ID ({report_id!r}), expected RPT-XXXX format"
    #       ↑ Non-retryable: the parameter itself is wrong

    if retry_simulation:
        return "RETRY: Service temporarily unavailable (HTTP 503), please try again later"
    #       ↑ Retryable: transient failure, Agent can wait and try again
    ...

Real test results showing how prefix affects Agent behavior:

# Format error (ERROR prefix)
[Tool Return]  ERROR: Invalid report ID ('REPORT-001'), expected RPT-XXXX format
[Agent Reply]  The report ID format is invalid. Please use the RPT-XXXX format.
               (Agent explains directly, does not retry)

# Service unavailable (RETRY prefix)
[Tool Return]  RETRY: Service temporarily unavailable (HTTP 503), please try again later
[Tool Return]  RETRY: Service temporarily unavailable (HTTP 503), please try again later
               ← Agent retried once
[Agent Reply]  I'm unable to fetch the report — the service is temporarily unavailable.
               Please try again later.
               (Agent retried, then gave up and suggested waiting)

The results matched the intent: ERROR: → Agent explains and asks the user to correct; RETRY: → Agent retries, then suggests waiting.

Note: This retry behavior is GLM-4-Flash inferring from context semantics, not framework-level automatic retry. For production-grade retry reliability, implement it inside the tool or at the orchestration layer (e.g., with the tenacity library).
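
As one possible shape for orchestration-level retries, the sketch below wraps a tool function and retries with exponential backoff only when the result starts with `RETRY:`. It uses the standard library rather than tenacity; `call_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical.

```python
import time


def call_with_retry(tool_fn, *args, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call tool_fn; retry with exponential backoff while it returns a RETRY: result."""
    result = tool_fn(*args)
    for attempt in range(1, max_attempts):
        if not result.startswith("RETRY:"):
            break  # success, or a non-retryable ERROR: result
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        result = tool_fn(*args)
    return result


# Stand-in tool: fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_fetch(report_id: str) -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        return "RETRY: Service temporarily unavailable (HTTP 503), please try again later"
    return f"Report {report_id}: ready"


delays = []  # record the backoff delays instead of actually sleeping
outcome = call_with_retry(flaky_fetch, "RPT-0001", sleep=delays.append)
print(outcome, delays)
```

If every attempt still returns `RETRY:`, the wrapper hands the last message back unchanged, so the Agent can tell the user to try again later instead of looping on its own.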

Tool Design Checklist

Before handing a tool to an Agent, run through this checklist:

Interface

[ ] Docstring clearly describes parameter meaning and format constraints
[ ] Examples include both success and failure cases
[ ] Return format is consistent (strings for both success and failure, no type mixing)

Validation

[ ] Single parameter: validate with regex or conditionals inside the function
[ ] Multiple params / complex constraints: use @tool(args_schema=XxxInput) + Pydantic
[ ] Edge case coverage: empty strings, oversized input, special characters

Security

[ ] File operations: character whitelist + Path.resolve() sandbox check
[ ] Database queries: strict format validation + parameterized queries (no string concatenation)
[ ] High-frequency tools: rate limiting (sliding window or token bucket)

Error Handling

[ ] Retryable errors: RETRY: prefix + reason
[ ] Non-retryable errors: ERROR: prefix + correct format hint
[ ] Never let exceptions bubble up to the Agent (catch and return strings)
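
The last item can be applied uniformly with a decorator. A minimal sketch, assuming the `ERROR:` prefix convention from this article; `returns_errors_as_text` is a hypothetical helper that, in a LangChain setup, would presumably sit between `@tool` and the function.

```python
import functools


def returns_errors_as_text(func):
    """Convert any exception raised by func into an ERROR: string result."""

    @functools.wraps(func)  # keep the name and docstring the LLM will read
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except KeyError as exc:
            # str(KeyError) is the repr of the missing key, quotes included
            return f"ERROR: key {exc} not found"
        except Exception as exc:
            return f"ERROR: {type(exc).__name__}: {exc}"

    return wrapper


@returns_errors_as_text
def bad_stock_tool(x: str) -> str:
    """Get stock info."""
    prices = {"AAPL": 189.5, "GOOGL": 175.2, "MSFT": 420.3}
    return f"{prices[x]}"


print(bad_stock_tool("AAPL"))    # 189.5
print(bad_stock_tool("XYZ999"))  # ERROR: key 'XYZ999' not found
```

A blanket wrapper is a safety net, not a substitute for the specific messages shown earlier: it guarantees a string comes back, but only the tool author can say which inputs would have been valid.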

Summary

Tool calling looks like a technical detail, but it's the foundation of Agent reliability. Key takeaways:

Tool crash ≠ Agent crash: The framework catches exceptions, but the quality of the error message determines the quality of the Agent's final answer
Documentation is the contract: The LLM understands tools through docstrings — write them well
Never trust inputs: Validate at the tool layer, not in the Agent's "rational" reasoning — prompt injection can hijack the Agent
Parallel calls depend on the model: LangGraph supports parallel execution at the framework level, but the model must support it too — verify before assuming
Classify errors: RETRY: vs ERROR: lets the Agent make the right next decision

Next: Intent Recognition and Routing — when an Agent faces different types of user requests, how to identify intent and dispatch tasks to the right specialized tool or sub-Agent.

References

LangChain Tool Documentation
Pydantic Field Validators
LangGraph Parallel Tool Calls
OWASP LLM Top 10
Full demo code for this series: agent-03-tool-calling
