Building AI Agents That Don't Hallucinate: A Practical Guide to Function Calling in 2026
If you've built anything with LLMs in the last year, you've probably hit the same wall everyone does: the model confidently invents a function signature, hallucinates parameter values, or calls the wrong tool entirely.
Function calling was supposed to fix this. In practice, it often makes things worse, because now your agent is confidently wrong at scale.
Let's fix that.
Why Function Calling Still Breaks in Production
Most implementations look something like this:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools_schema,
    tool_choice="auto",
)
This works fine for demos. It falls apart in production for three reasons:
- Schema bloat: You pass 15 tools, the model picks the wrong one
- Parameter hallucination: The model invents values that match the type but not the intent
- Cascading errors: One bad tool call leads to a chain of incorrect reasoning
The fix isn't bigger models. It's better architecture.
Pattern 1: Narrow the Tool Space
Never pass all available tools in every turn. Instead, use a two-stage router:
import json

# Stage 1: Intent classification with a cheap, fast model
intent_response = client.chat.completions.create(
    model="deepseek-chat",  # fast and cheap
    messages=[
        {
            "role": "system",
            "content": (
                "Classify the user intent into one of: search, calculate, "
                "fetch_data, general. Respond with a JSON object of the form "
                '{"intent": "<label>"}.'
            ),
        },
        {"role": "user", "content": user_message},
    ],
    response_format={"type": "json_object"},
)
intent = json.loads(intent_response.choices[0].message.content)["intent"]

# Stage 2: Only expose relevant tools
tool_map = {
    "search": [search_tool],
    "calculate": [calculator_tool],
    "fetch_data": [db_query_tool],
    "general": [],
}
relevant_tools = tool_map.get(intent, [])
This single pattern reduces wrong-tool errors by 60-70% in testing. You're not asking the model to choose from 15 tools; you're asking it to use the 1-2 tools that actually matter.
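The router is itself a model call, so its output deserves the same suspicion as any other. Below is a minimal sketch of a defensive lookup, assuming the classifier returns a JSON object with an `intent` key. The function name `select_tools` and the string tool names are hypothetical stand-ins for the real tool definitions.

```python
import json

# Hypothetical stand-ins for the real tool definitions in tool_map.
TOOL_MAP = {
    "search": ["search_tool"],
    "calculate": ["calculator_tool"],
    "fetch_data": ["db_query_tool"],
    "general": [],
}


def select_tools(raw_router_output, tool_map=TOOL_MAP, fallback="general"):
    """Map the router's raw text output to a tool list, failing closed."""
    try:
        payload = json.loads(raw_router_output)
    except (TypeError, ValueError):
        # Not parseable as JSON at all: fall back to the no-tool intent.
        return fallback, tool_map[fallback]
    intent = payload.get("intent") if isinstance(payload, dict) else None
    if not isinstance(intent, str) or intent not in tool_map:
        # Unknown or missing label: expose no tools rather than all of them.
        return fallback, tool_map[fallback]
    return intent, tool_map[intent]
```

Failing closed matters here: a router that falls back to "all tools" on a bad label quietly reintroduces the schema bloat this pattern is meant to remove.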
Pattern 2: Structured Outputs as a Hard Constraint
Stop relying on the model to "mostly" return valid JSON. Use structured outputs enforced at the API level:
from pydantic import BaseModel, Field
from typing import Literal

class SearchQuery(BaseModel):
    query: str = Field(description="The search query, max 100 characters")
    filters: Literal["all", "recent", "popular"] = Field(default="all")
    max_results: int = Field(default=10, ge=1, le=50)

# The API now GUARANTEES this schema: no hallucinated fields
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    response_format=SearchQuery,
)
query = response.choices[0].message.parsed  # typed, validated, guaranteed
The key insight: constraints reduce hallucination more than prompt engineering does. You can write a 500-word system prompt about returning valid JSON, or you can use a schema. The schema wins every time.
Pattern 3: The Validation Sandwich
Every tool call should go through three layers:
User Input -> Pre-validation -> Model -> Post-validation -> Execution
def safe_tool_call(tool_func, params_schema):
    def wrapper(model_output):
        # Post-validation: check model output against schema
        try:
            validated = params_schema(**model_output)
        except Exception as e:
            # Don't execute; return error to model for self-correction
            return {
                "error": f"Invalid parameters: {e}",
                "received": model_output,
            }
        # Execute only if valid
        result = tool_func(**validated.model_dump())
        # Post-execution sanity check
        if not result or len(str(result)) > 10000:
            return {"error": "Tool returned unexpected output"}
        return result
    return wrapper
This pattern lets the model self-correct. When validation fails, you return the error back to the model as a tool response. In testing, models fix their own parameter errors 80% of the time on the second attempt.
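The wrapper above handles a single attempt. Here is a sketch of the retry loop around it, using two hypothetical callables: `ask_model(error)`, which returns a parameter dict, and `validate(params)`, which raises `ValueError` on bad input. In a real system `ask_model` would send the error back as a tool response and parse the model's next tool call. `validate_search` is a hand-rolled stand-in for the `SearchQuery` schema so the example runs without extra dependencies.

```python
def call_with_self_correction(ask_model, validate, tool_func, max_attempts=2):
    """Ask for parameters, validate them, and feed errors back until valid.

    ask_model(error) -> dict   hypothetical: error is None on the first attempt
    validate(params) -> dict   hypothetical: raises ValueError when invalid
    """
    error = None
    for attempt in range(1, max_attempts + 1):
        params = ask_model(error)
        try:
            clean = validate(params)
        except ValueError as exc:
            # Do not execute; hand the message back for the next attempt.
            error = f"Invalid parameters: {exc}"
            continue
        return {"ok": True, "attempts": attempt, "result": tool_func(**clean)}
    return {"ok": False, "attempts": max_attempts, "error": error}


def validate_search(params):
    """Hand-rolled stand-in for the SearchQuery schema (illustration only)."""
    max_results = params.get("max_results", 10)
    if not isinstance(max_results, int) or not 1 <= max_results <= 50:
        raise ValueError("max_results must be an integer between 1 and 50")
    query = params.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    return {"query": query.strip(), "max_results": max_results}
```

Capping `max_attempts` keeps self-correction from becoming its own loop: after the last failed attempt the caller gets the final error instead of another retry.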
Pattern 4: Token Budgeting for Agent Loops
The #1 production failure mode for agents is infinite loops. The model calls a tool, gets a result, decides to call another tool, gets a result, and so on, burning tokens until it hits a limit or times out.
import json

class AgentLoop:
    def __init__(self, max_iterations=5, max_tokens_per_call=2000):
        self.max_iterations = max_iterations
        self.max_tokens_per_call = max_tokens_per_call
        self.total_tokens = 0

    def run(self, user_query):
        messages = [{"role": "user", "content": user_query}]
        for i in range(self.max_iterations):
            response = self.call_model(messages)
            self.total_tokens += response.usage.total_tokens
            message = response.choices[0].message
            if not message.tool_calls:
                return message.content
            # The assistant turn that requested the tools must precede their results
            messages.append(message)
            # Execute tool calls and add results
            for tool_call in message.tool_calls:
                result = self.execute_tool(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result),
                })
        return "I wasn't able to complete this task within the iteration limit."
Hard limits are not a hack; they're a requirement. Any agent system without a maximum iteration count will eventually loop forever on some edge case.
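The loop above caps iterations but only records token usage. A sketch of enforcing a cumulative token ceiling as well follows, assuming a hypothetical `step(messages)` callable that returns `(tokens_used, answer_or_None)`. `TokenBudget`, `run_with_budget`, and `max_total_tokens` are illustrative names, not part of the class above.

```python
class TokenBudget:
    """Tracks model calls and tokens for one agent run."""

    def __init__(self, max_total_tokens=10_000, max_iterations=5):
        self.max_total_tokens = max_total_tokens
        self.max_iterations = max_iterations
        self.total_tokens = 0
        self.iterations = 0

    def can_continue(self):
        """True while both the call count and the token total are under limit."""
        return (
            self.iterations < self.max_iterations
            and self.total_tokens < self.max_total_tokens
        )

    def charge(self, tokens_used):
        """Record one completed model call."""
        self.iterations += 1
        self.total_tokens += tokens_used


def run_with_budget(step, budget):
    """Call step() until it returns an answer or the budget runs out."""
    messages = []
    while budget.can_continue():
        tokens_used, answer = step(messages)
        budget.charge(tokens_used)
        if answer is not None:
            return answer
    return (
        f"Stopped after {budget.iterations} calls "
        f"and {budget.total_tokens} tokens."
    )
```

The check runs before each call, so a run can overshoot the token ceiling by at most one call's usage. If that matters, set the ceiling below the real limit by the size of the largest expected call.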
Pattern 5: Multi-Model Orchestration
Different models have different strengths. A practical agent system uses multiple models:
| Task | Model | Why |
|---|---|---|
| Intent routing | Small/fast model | Low latency, simple classification |
| Tool selection | Mid-tier model | Good reasoning, reasonable cost |
| Complex planning | Frontier model | Best reasoning, highest cost |
| Output formatting | Small/fast model | Structured task, deterministic |
async def smart_agent(user_query):
    # Step 1: Cheap model for intent
    intent = await route_intent(user_query)  # deepseek-chat: $0.27/M tokens
    if intent.complexity == "simple":
        return await simple_agent(user_query)  # mid-tier model
    # Step 2: Frontier model for complex reasoning
    plan = await plan_with_frontier(user_query, intent)  # gpt-4o: $2.50/M tokens
    # Step 3: Cheap model for execution
    results = await execute_plan(plan)  # back to fast model
    return results
This architecture cuts costs by 10-15x compared to running a frontier model for every step, with negligible quality loss.
Common Pitfalls to Avoid
1. Don't trust tool descriptions alone. The model reads them, but it also ignores them when it has a strong prior. Add examples to tool descriptions:
{
    "name": "search_database",
    "description": "Search the product database. Example: search_database(query='wireless mouse', category='electronics') returns matching products.",
    "parameters": {}
}
2. Don't return raw API responses as tool results. The model has to parse them, and it will hallucinate fields that don't exist. Always transform tool output into a clean, predictable format.
3. Don't chain agents without checkpoints. If Agent A passes output to Agent B, validate the output at the boundary. A bad output from Agent A will cascade through the entire chain.
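Pitfalls 2 and 3 come down to the same move: reduce a payload to a small, known shape before anything downstream sees it. The sketch below assumes a hypothetical raw product-search response. The field names (`items`, `sku`, `title`, `price_cents`) are invented for illustration and do not come from any real API.

```python
def clean_search_result(raw, max_items=5):
    """Reduce a raw API payload to the fields the model actually needs."""
    items = raw.get("items") if isinstance(raw, dict) else None
    if not isinstance(items, list):
        # Checkpoint: refuse to pass a malformed payload down the chain.
        return {"status": "error", "message": "upstream response had no item list"}
    products = []
    for item in items[:max_items]:
        if not isinstance(item, dict) or "sku" not in item:
            continue  # skip entries missing the one required field
        cents = item.get("price_cents")
        products.append({
            "sku": str(item["sku"]),
            "title": str(item.get("title", "")),
            # Price in currency units, or None when the source value is unusable.
            "price": round(cents / 100, 2) if isinstance(cents, (int, float)) else None,
        })
    return {"status": "ok", "count": len(products), "products": products}
```

Every key in the output is one the function chose to emit, so the model never sees debug metadata or internal ranking fields, and the next agent in a chain can check `status` before trusting the rest.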
Measuring Success: The Three Metrics That Matter
- Tool Selection Accuracy: Did the model call the right tool? (Measure: % of calls where the tool matches human annotation)
- Parameter Validity Rate: Were the parameters valid? (Measure: % of calls that pass schema validation)
- Task Completion Rate: Did the agent actually solve the problem? (Measure: % of tasks completed without human intervention)
Track these three numbers. If any one drops below 90%, you have a production problem.
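One way to compute these numbers from logged runs is sketched below. It makes a simplifying assumption: one log record per task, with a single tool call each, stored as a dict with the hypothetical keys `expected_tool` (the human annotation), `called_tool`, `params_valid`, and `completed`.

```python
def agent_metrics(records, threshold=0.90):
    """Compute the three rates and list any that fall below the threshold."""
    if not records:
        raise ValueError("need at least one logged run")
    n = len(records)
    rates = {
        "tool_selection_accuracy": sum(
            r["called_tool"] == r["expected_tool"] for r in records
        ) / n,
        "parameter_validity_rate": sum(bool(r["params_valid"]) for r in records) / n,
        "task_completion_rate": sum(bool(r["completed"]) for r in records) / n,
    }
    # Names of metrics strictly below the threshold, in alphabetical order.
    failing = sorted(name for name, rate in rates.items() if rate < threshold)
    return rates, failing
```

Running this on a rolling window of recent traffic, rather than on the full history, makes a regression show up as soon as it starts instead of being averaged away.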
Conclusion
Building reliable AI agents isn't about finding the smartest model; it's about building guardrails that make any model more reliable. The patterns above work across GPT-4o, DeepSeek, GLM, Claude, and any model that supports function calling.
The future of AI development isn't prompt engineering. It's system design: constraints, validation, fallbacks, and smart orchestration. The teams that understand this will build agents that work. The teams that don't will keep debugging hallucinated function calls.
Start with narrow tool spaces. Add structured outputs. Build validation layers. Set hard limits. Orchestrate multiple models. Your agents will be dramatically more reliable starting today.
What patterns have you found effective for building AI agents? Share your experience in the comments.