DEV Community

Mattias chaw

Building AI Agents That Don't Hallucinate: A Practical Guide to Function Calling in 2026

If you've built anything with LLMs in the last year, you've probably hit the same wall everyone does: the model confidently invents a function signature, hallucinates parameter values, or calls the wrong tool entirely.

Function calling was supposed to fix this. In practice, it often makes things worse, because now your agent is confidently wrong at scale.

Let's fix that.

Why Function Calling Still Breaks in Production

Most implementations look something like this:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools_schema,
    tool_choice="auto"
)

This works fine for demos. It falls apart in production for three reasons:

  1. Schema bloat: You pass 15 tools, the model picks the wrong one
  2. Parameter hallucination: The model invents values that match the type but not the intent
  3. Cascading errors: One bad tool call leads to a chain of incorrect reasoning

The fix isn't bigger models. It's better architecture.

Pattern 1: Narrow the Tool Space

Never pass all available tools in every turn. Instead, use a two-stage router:

import json

# Stage 1: Intent classification with a cheap, fast model
intent_response = client.chat.completions.create(
    model="deepseek-chat",  # fast and cheap
    messages=[
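        # JSON mode needs the prompt to ask for JSON and to name the key parsed below
        {"role": "system", "content": 'Classify the user intent into one of: search, calculate, fetch_data, general. Respond with a JSON object like {"intent": "search"}.'},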
        {"role": "user", "content": user_message}
    ],
    response_format={"type": "json_object"}
)

intent = json.loads(intent_response.choices[0].message.content)["intent"]

# Stage 2: Only expose relevant tools
tool_map = {
    "search": [search_tool],
    "calculate": [calculator_tool],
    "fetch_data": [db_query_tool],
    "general": []
}

relevant_tools = tool_map.get(intent, [])

This single pattern reduces wrong-tool errors by 60-70% in testing. You're not asking the model to choose from 15 tools; you're asking it to use the 1-2 tools that actually matter.
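
The router is itself a model call, so its output deserves the same suspicion as any other: it can return malformed JSON, omit the key, or invent a label. Here is a minimal sketch of a guard, assuming the four intent labels from the system prompt above. `ALLOWED_INTENTS` and `parse_intent` are hypothetical names for illustration, not part of any SDK.

```python
import json

# Assumed label set, copied from the router's system prompt
ALLOWED_INTENTS = {"search", "calculate", "fetch_data", "general"}

def parse_intent(raw_content, default="general"):
    """Return a known intent label, falling back to `default` on any surprise."""
    try:
        payload = json.loads(raw_content)
    except (json.JSONDecodeError, TypeError):
        # Not JSON at all (or None): treat as the safe default
        return default
    if not isinstance(payload, dict):
        return default
    intent = payload.get("intent")
    if not isinstance(intent, str):
        return default
    intent = intent.strip().lower()
    # An unknown label falls back too, so the tool set never widens by accident
    return intent if intent in ALLOWED_INTENTS else default
```

With this in place, `parse_intent(intent_response.choices[0].message.content)` could replace the bare `json.loads(...)["intent"]` lookup, so a bad classification degrades to the "general" path with no tools exposed instead of raising.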

Pattern 2: Structured Outputs as a Hard Constraint

Stop relying on the model to "mostly" return valid JSON. Use structured outputs enforced at the API level:

from pydantic import BaseModel, Field
from typing import Literal

class SearchQuery(BaseModel):
    query: str = Field(description="The search query, max 100 characters")
    filters: Literal["all", "recent", "popular"] = Field(default="all")
    max_results: int = Field(default=10, ge=1, le=50)

# The API now GUARANTEES this schema: no hallucinated fields
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    response_format=SearchQuery,
)

query = response.choices[0].message.parsed  # typed, validated, guaranteed

The key insight: constraints reduce hallucination more than prompt engineering does. You can write a 500-word system prompt about returning valid JSON, or you can use a schema. The schema wins every time.
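
The same constraints can also be written as a plain JSON Schema, which is roughly what ends up in a tool definition. The sketch below is a hand-rolled, stdlib-only checker for the handful of keywords it uses, meant only to show what "closed" constraints look like. `SEARCH_SCHEMA` and `violations` are illustrative names, the 100-character limit is promoted from a description to a real constraint as an assumption, and a real project would more likely rely on Pydantic or a full validator library.

```python
# JSON Schema in the spirit of the SearchQuery model above (illustrative)
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "maxLength": 100},
        "filters": {"type": "string", "enum": ["all", "recent", "popular"]},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 50},
    },
    "required": ["query"],
    "additionalProperties": False,
}

_PY_TYPES = {"string": str, "integer": int}

def violations(schema, args):
    """List every way `args` breaks `schema` (supports only the keywords used above)."""
    problems = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"unknown field: {name}")
            continue
        rule = props[name]
        expected = _PY_TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it explicitly
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{name}: expected {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            problems.append(f"{name}: must be one of {rule['enum']}")
        if "maxLength" in rule and len(value) > rule["maxLength"]:
            problems.append(f"{name}: longer than {rule['maxLength']} characters")
        if "minimum" in rule and value < rule["minimum"]:
            problems.append(f"{name}: below minimum {rule['minimum']}")
        if "maximum" in rule and value > rule["maximum"]:
            problems.append(f"{name}: above maximum {rule['maximum']}")
    return problems
```

A value like `max_results=500` has the right type but the wrong intent. A closed range or enum is what turns that from a silent bug into a rejectable error.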

Pattern 3: The Validation Sandwich

Every tool call should go through three layers:

User Input → Pre-validation → Model → Post-validation → Execution

def safe_tool_call(tool_func, params_schema):
    def wrapper(model_output):
        # Post-validation: check model output against schema
        try:
            validated = params_schema(**model_output)
        except Exception as e:
            # Don't execute; return the error to the model for self-correction
            return {
                "error": f"Invalid parameters: {e}",
                "received": model_output
            }

        # Execute only if valid
        result = tool_func(**validated.model_dump())

        # Post-execution sanity check
        if not result or len(str(result)) > 10000:
            return {"error": "Tool returned unexpected output"}

        return result
    return wrapper

This pattern lets the model self-correct. When validation fails, you return the error back to the model as a tool response. In testing, models fix their own parameter errors 80% of the time on the second attempt.
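
The retry step can be sketched without any API calls. In the example below, `fake_model` is a stand-in for a real chat completion call and simply corrects its arguments once it sees an error. `validate`, `call_with_retries`, and the two-attempt cap are assumptions made for illustration.

```python
from typing import Optional

def validate(args: dict) -> Optional[str]:
    """Return an error string, or None if the arguments are acceptable."""
    value = args.get("max_results")
    if not isinstance(value, int):
        return "max_results must be an integer"
    if not 1 <= value <= 50:
        return "max_results must be between 1 and 50"
    return None

def call_with_retries(model, tool, max_attempts: int = 2) -> dict:
    """Ask `model` for tool arguments, feeding validation errors back to it."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        args = model(feedback)
        error = validate(args)
        if error is None:
            return {"ok": True, "attempts": attempt, "result": tool(**args)}
        # Tool is not executed: the error becomes the next "tool response" the model sees
        feedback = {"error": f"Invalid parameters: {error}", "received": args}
    return {"ok": False, "attempts": max_attempts, "error": feedback["error"]}

def fake_model(feedback):
    # Stand-in for an LLM: wrong on the first try, corrected after feedback
    return {"max_results": 500} if feedback is None else {"max_results": 50}

def search(max_results: int) -> list:
    # Hypothetical tool that returns placeholder hits
    return [f"hit-{i}" for i in range(max_results)]

outcome = call_with_retries(fake_model, search)
```

Capping the attempts matters as much as the feedback itself. A model that cannot fix its arguments in a couple of tries is unlikely to get there on the tenth.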

Pattern 4: Token Budgeting for Agent Loops

The #1 production failure mode for agents is infinite loops. The model calls a tool, gets a result, decides to call another tool, gets a result, and so on, burning tokens until it hits a limit or times out.

class AgentLoop:
    def __init__(self, max_iterations=5, max_tokens_per_call=2000):
        self.max_iterations = max_iterations
        self.max_tokens_per_call = max_tokens_per_call
        self.total_tokens = 0

    def run(self, user_query):
        messages = [{"role": "user", "content": user_query}]

        for i in range(self.max_iterations):
            response = self.call_model(messages)
            self.total_tokens += response.usage.total_tokens
            message = response.choices[0].message

            if not message.tool_calls:
                return message.content

            # The assistant turn that requested the tools must precede their results
            messages.append(message)

            # Execute tool calls and add results
            for tool_call in message.tool_calls:
                result = self.execute_tool(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result)
                })

        return "I wasn't able to complete this task within the iteration limit."

Hard limits are not a hack; they're a requirement. Any agent system without a maximum iteration count will eventually loop forever on some edge case.
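
The loop above caps iterations and tracks `total_tokens`, but it does not stop on token spend. One way to add a cumulative budget is sketched here with a stubbed model so it runs offline. `run_with_budget`, `max_total_tokens`, and `fake_step` are hypothetical names, and `fake_step` stands in for a model call that reports its own token usage.

```python
from typing import Callable, Optional, Tuple

# A step returns (final_answer_or_None, tokens_used); None means "wants another tool call"
Step = Callable[[int], Tuple[Optional[str], int]]

def run_with_budget(step: Step, max_iterations: int = 5, max_total_tokens: int = 4000) -> dict:
    """Run `step` until it answers, the iteration cap hits, or the budget is spent."""
    total_tokens = 0
    for i in range(max_iterations):
        answer, used = step(i)
        total_tokens += used
        if answer is not None:
            return {"status": "done", "answer": answer, "tokens": total_tokens}
        # Check the budget before paying for another round trip
        if total_tokens >= max_total_tokens:
            return {"status": "budget_exceeded", "answer": None, "tokens": total_tokens}
    return {"status": "iteration_limit", "answer": None, "tokens": total_tokens}

def fake_step(i: int):
    # Stand-in for a model call: never finishes, costs 1500 tokens per turn
    return None, 1500
```

Returning a status alongside the answer lets the caller distinguish "ran out of budget" from "ran out of iterations", which are usually different bugs.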

Pattern 5: Multi-Model Orchestration

Different models have different strengths. A practical agent system uses multiple models:

| Task | Model | Why |
| --- | --- | --- |
| Intent routing | Small/fast model | Low latency, simple classification |
| Tool selection | Mid-tier model | Good reasoning, reasonable cost |
| Complex planning | Frontier model | Best reasoning, highest cost |
| Output formatting | Small/fast model | Structured task, deterministic |

async def smart_agent(user_query):
    # Step 1: Cheap model for intent
    intent = await route_intent(user_query)  # deepseek-chat: $0.27/M tokens

    if intent.complexity == "simple":
        return await simple_agent(user_query)  # mid-tier model

    # Step 2: Frontier model for complex reasoning
    plan = await plan_with_frontier(user_query, intent)  # gpt-4o: $2.50/M tokens

    # Step 3: Cheap model for execution
    results = await execute_plan(plan)  # back to fast model
    return results

This architecture cuts costs by 10-15x compared to running a frontier model for every step, with negligible quality loss.
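
How much routing saves depends on how much traffic stays on the cheap tier. Below is a rough estimator that uses the per-million-token prices quoted in the code comments above and a made-up traffic split. The token counts are hypothetical, so substitute your own, and keep in mind that real pricing usually differs for input and output tokens.

```python
# Prices per million tokens, taken from the code comments above
PRICE_PER_MILLION = {"fast": 0.27, "frontier": 2.50}

def cost_usd(tokens_by_tier: dict) -> float:
    """Total cost in dollars for a mapping of tier name -> tokens used."""
    total = sum(tokens * PRICE_PER_MILLION[tier] for tier, tokens in tokens_by_tier.items())
    return total / 1_000_000

# Hypothetical 1M-token workload: all on the frontier model vs. 90% routed to the fast tier
all_frontier = cost_usd({"frontier": 1_000_000})
routed = cost_usd({"fast": 900_000, "frontier": 100_000})
savings_factor = all_frontier / routed
```

The savings factor can never exceed the price ratio between the tiers involved, so the choice of tiers and the share of requests that genuinely need the frontier model matter more than the routing code itself.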

Common Pitfalls to Avoid

1. Don't trust tool descriptions alone. The model reads them, but it also ignores them when it has a strong prior. Add examples to tool descriptions:

{
    "name": "search_database",
    "description": "Search the product database. Example: search_database(query='wireless mouse', category='electronics') returns matching products.",
    "parameters": {}
}

2. Don't return raw API responses as tool results. The model has to parse them, and it will hallucinate fields that don't exist. Always transform tool output into a clean, predictable format.

3. Don't chain agents without checkpoints. If Agent A passes output to Agent B, validate the output at the boundary. A bad output from Agent A will cascade through the entire chain.
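
Pitfalls 2 and 3 both come down to controlling what crosses a boundary. This is a minimal sketch, assuming a product-search response shape invented for illustration. `raw_response`, `PRODUCT_FIELDS`, `clean_products`, and `check_handoff` are all hypothetical.

```python
PRODUCT_FIELDS = ("id", "name", "price")  # the only fields the model is allowed to see

def clean_products(raw_response: dict, limit: int = 5) -> dict:
    """Project a raw API payload onto a small, predictable structure."""
    items = raw_response.get("data", {}).get("items", [])
    products = [
        {field: item.get(field) for field in PRODUCT_FIELDS}
        for item in items[:limit]
    ]
    return {"count": len(products), "products": products}

def check_handoff(payload: dict) -> dict:
    """Checkpoint between agents: refuse to pass on incomplete records."""
    for product in payload["products"]:
        missing = [f for f in PRODUCT_FIELDS if product.get(f) is None]
        if missing:
            raise ValueError(f"product {product.get('id')!r} is missing {missing}")
    return payload

# Made-up raw response with transport noise and internal fields mixed in
raw_response = {
    "status": 200,
    "headers": {"x-request-id": "abc123"},
    "data": {"items": [
        {"id": 1, "name": "Wireless mouse", "price": 19.99, "_internal_rank": 0.93},
        {"id": 2, "name": "USB hub", "price": 24.50, "warehouse": "B7"},
    ]},
}
tool_result = check_handoff(clean_products(raw_response))
```

Only `tool_result` would be serialized into the tool message, so the model never sees headers, status codes, or internal fields it might otherwise reason about.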

Measuring Success: The Three Metrics That Matter

  1. Tool Selection Accuracy: Did the model call the right tool? (Measure: % of calls where the tool matches human annotation)
  2. Parameter Validity Rate: Were the parameters valid? (Measure: % of calls that pass schema validation)
  3. Task Completion Rate: Did the agent actually solve the problem? (Measure: % of tasks completed without human intervention)

Track these three numbers. If any one drops below 90%, you have a production problem.
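
All three rates are straightforward to compute from a log of agent runs. The record format below (`expected_tool`, `called_tool`, `params_valid`, `completed`) is an assumed logging schema with one tool call per task for simplicity, and the sample rows are made up.

```python
def agent_metrics(records: list) -> dict:
    """Compute the three rates as fractions in [0, 1]."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    return {
        "tool_selection_accuracy": sum(r["called_tool"] == r["expected_tool"] for r in records) / n,
        "parameter_validity_rate": sum(r["params_valid"] for r in records) / n,
        "task_completion_rate": sum(r["completed"] for r in records) / n,
    }

def below_threshold(metrics: dict, threshold: float = 0.90) -> list:
    """Names of metrics that have dropped under the alert threshold."""
    return sorted(name for name, value in metrics.items() if value < threshold)

# Made-up sample log: one wrong tool choice, one task that needed a human
sample = [
    {"expected_tool": "search", "called_tool": "search", "params_valid": True, "completed": True},
    {"expected_tool": "calculate", "called_tool": "calculate", "params_valid": True, "completed": True},
    {"expected_tool": "fetch_data", "called_tool": "search", "params_valid": True, "completed": False},
    {"expected_tool": "search", "called_tool": "search", "params_valid": True, "completed": True},
]
metrics = agent_metrics(sample)
alerts = below_threshold(metrics)
```

In a real system, tool selection and parameter validity would be counted per call and completion per task, so the denominators differ. The threshold check stays the same.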

Conclusion

Building reliable AI agents isn't about finding the smartest model; it's about building guardrails that make any model more reliable. The patterns above work across GPT-4o, DeepSeek, GLM, Claude, and any model that supports function calling.

The future of AI development isn't prompt engineering. It's system design: constraints, validation, fallbacks, and smart orchestration. The teams that understand this will build agents that work. The teams that don't will keep debugging hallucinated function calls.

Start with narrow tool spaces. Add structured outputs. Build validation layers. Set hard limits. Orchestrate multiple models. Your agents will be dramatically more reliable starting today.


What patterns have you found effective for building AI agents? Share your experience in the comments.
