Table of Contents
- The $117 Bug That Made Me Rethink Architecture
- The Reliability Gap
- Why LLMs Fail at Critical Tasks
- The Multi-Layer Defense Strategy
- Real-World Impact
- What's Next
- Resources
The $117 Bug That Made Me Rethink Architecture
Picture this: A customer requests a shipping quote. Your AI agent calculates the price: $1,320.56. The agent stores it correctly in the database. The backend logs show the right number. Everything looks perfect.
But when the customer sees the response, it says: $117.00.
That's a 91% error. One hallucination away from massive financial loss or customer distrust.
This actually happened in production. And it's not an edge case; it's a fundamental characteristic of Large Language Models.
The Reliability Gap
When we talk about "AI agents," we're usually talking about LLMs with access to tools (functions they can call). The promise is compelling: natural language interfaces that can handle complex business logic.
The reality? LLMs are very advanced prediction engines, not calculators.
What LLMs Are Good At
- Understanding natural language and intent
- Generating human-like, contextual responses
- Recognizing patterns and relationships
- Adapting to conversational flow
What LLMs Struggle With
- Exact calculations and number precision
- Following strict business rules consistently
- Deterministic behavior across requests
- Security boundaries and access control
- Critical decision-making without oversight
The Developer's Dilemma
Think of it like hiring someone who's brilliant at customer service but occasionally forgets basic math:
# This is what you write
def calculate_price(user_type, base_price):
    if user_type == "wholesale":
        return base_price * 0.85  # 15% discount
    else:
        return base_price * 1.12  # 12% markup
# This is what the LLM might do
"Based on the pricing, I calculate roughly $117 for this service..."
The LLM might:
- Use the wrong pricing tier (wholesale vs retail)
- Misread numbers during token processing (1320.56 → $117.00)
- Hallucinate calculations instead of using tool results
- Ignore access controls when constructing responses
- Mix data from different contexts
You can't fix this with better prompts alone. This requires architectural solutions.
Why LLMs Fail at Critical Tasks
1. Token-Based Processing
LLMs don't process numbers as mathematical values; they process everything as tokens (text chunks).
The number 1320.56 might be tokenized as:
- ["1320", ".", "56"]
- ["13", "20", ".56"]
- ["1", "320", ".", "56"]
depending on the tokenizer. When generating output, the model predicts the most statistically likely next token, not the mathematically correct one.
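You can see this for yourself. The sketch below assumes the tiktoken library is installed (pip install tiktoken); the exact split will differ for whichever tokenizer your model actually uses.
# Inspect how a price string is split into tokens.
# Assumes `pip install tiktoken`; your model's tokenizer may split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("1320.56")
print([enc.decode([t]) for t in tokens])
# Example output (varies by encoding): ['132', '0', '.', '56']
Either way, the model never "sees" 1320.56 as a single numeric value.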
2. Probabilistic, Not Deterministic
# Deterministic code - same input = same output
def check_permission(user):
    return user.role == "admin"  # Always consistent

# LLM reasoning - same input ≠ guaranteed same output
"""
The user appears to have administrative permissions based on
their previous actions... (might vary across calls)
"""
With temperature > 0, the same prompt can yield different outputs. Even at temperature 0, subtle prompt changes can shift behavior.
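A tiny harness makes this easy to check against your own setup. In the sketch below, generate is a placeholder for whatever client call you actually use; the point is simply to count distinct answers to an identical prompt.
# Probe for nondeterminism by repeating the exact same request.
# `generate` is a placeholder for your real model client call.
from collections import Counter

def generate(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def probe_determinism(prompt: str, runs: int = 5) -> Counter:
    return Counter(generate(prompt) for _ in range(runs))

# More than one distinct answer means identical inputs are producing
# different outputs - something deterministic business logic never does.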
3. Context Window Limitations
Your agent might have the correct data in its context, but:
- As conversations grow, early information gets less attention
- Critical details might be overshadowed by recent messages
- The model might "forget" information from 50+ messages ago
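One mitigation is to re-inject the facts that must not drift into every turn, instead of trusting the model to remember them. The sketch below is illustrative only; the state shape, key names, and prompt-assembly step are stand-ins for whatever your framework provides.
# Re-pin critical session facts into every prompt turn rather than relying
# on the model to retain them across a long conversation.
# The state shape and key names here are illustrative, not a framework API.

CRITICAL_KEYS = ["expected_price", "user_role", "booking_id"]

def build_turn_prompt(system_instructions: str, state: dict, user_message: str) -> str:
    pinned = "\n".join(f"- {key}: {state[key]}" for key in CRITICAL_KEYS if key in state)
    return (
        f"{system_instructions}\n\n"
        "Authoritative facts (always trust these over earlier conversation):\n"
        f"{pinned}\n\n"
        f"User: {user_message}"
    )

print(build_turn_prompt(
    "You are a shipping assistant.",
    {"expected_price": 1320.56, "user_role": "retail"},
    "How much was that quote again?",
))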
4. No Inherent Security Model
LLMs don't understand:
- Role-based access control (RBAC)
- Authentication vs Authorization
- Privilege escalation risks
- Data isolation requirements
They might cheerfully expose wholesale pricing to retail customers if the context suggests it, or process admin commands from regular users.
5. Instruction Drift
Even with perfect instructions, LLMs can:
- Misinterpret edge cases
- Prioritize recent conversation over system instructions
- Follow user instructions that contradict system rules
- Generate plausible-sounding but incorrect responses
The Multi-Layer Defense Strategy
Here's what I learned building production AI agents: Don't trust the LLM for business-critical logic. Use it for what it does best, and enforce everything else deterministically.
Think of it like airport security—multiple checkpoints, each catching different issues:
The Architecture
                 User Request
                      ↓
┌─────────────────────────────────────────────┐
│ Layer 1: Tool Filtering                     │ ← "What can this user access?"
│ (Before agent even sees the tools)          │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ Layer 2: Instruction Engineering            │ ← "How should you behave?"
│ (Guide LLM with clear rules)                │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ Layer 3: Pre-Execution Callbacks            │ ← "Is this allowed right now?"
│ (before_tool_callback - Business rules)     │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ Layer 4: Tool Execution                     │ ← "Execute with ground truth"
│ (Deterministic business logic)              │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│ Layer 5: Output Validation                  │ ← "Is the response accurate?"
│ (after_model_callback - Final check)        │
└─────────────────────────────────────────────┘
                      ↓
                 User Response
Each layer provides a different type of safety:
- Layer 1 (Tool Filtering): "What tools can this user even see?"
- Layer 2 (Instructions): "How should the LLM behave?"
- Layer 3 (Pre-Execution): "Is this specific tool call allowed? Should we route it elsewhere?"
- Layer 4 (Execution): "Calculate with code, not with tokens"
- Layer 5 (Validation): "Is the output what we expected? Correct if needed"
Example: Catching the $117 Bug
Here's how the layers work together:
# Layer 4: Tool execution - Calculate with code (ground truth)
def calculate_shipping_price(origin, destination, vehicle):
    price = pricing_api.get_quote(origin, destination, vehicle)
    # price = 1320.56 (calculated correctly)

    # Store ground truth in session state
    session.state['expected_price'] = price

    return {
        "total_price": price,
        "currency": "USD",
        "breakdown": {...}
    }
# Layer 5: Validation - Catch LLM errors before user sees them
def validate_response(context, llm_response):
    """
    This runs AFTER the LLM generates a response but BEFORE
    the user sees it. It's our last line of defense.
    """
    import re

    # What did the LLM say?
    response_text = llm_response.content.parts[0].text
    mentioned_prices = re.findall(r'\$\s*([0-9,]+\.?\d*)', response_text)

    if not mentioned_prices:
        return None  # No prices mentioned, all good

    # What should it have said?
    expected_price = context.state.get('expected_price')  # 1320.56
    if expected_price is None:
        return None  # No ground truth stored, nothing to compare against

    mentioned_price = float(mentioned_prices[0].replace(',', ''))  # 117.00

    # Validate within tolerance (2% for rounding)
    difference_pct = abs(mentioned_price - expected_price) / expected_price

    if difference_pct > 0.02:  # More than 2% off
        logger.error(
            f"🚨 PRICE ERROR: LLM showed ${mentioned_price}, "
            f"expected ${expected_price} (Δ {difference_pct*100:.1f}%)"
        )
        # Correct it silently - user never knows
        return LlmResponse(
            content=Content(
                role="model",
                parts=[Part(text=f"Total shipping cost: ${expected_price:.2f} 🚚")]
            )
        )

    return None  # Price is correct, use original response
What happened in my case:
- ✅ Tool calculated the correct price: $1,320.56
- ✅ Stored it as ground truth in session state
- ❌ LLM generated a response with the wrong price: $117.00 (91% error!)
- ✅ Validation caught the discrepancy
- ✅ Response corrected to $1,320.56 before the user saw it
- ✅ Error logged for monitoring and debugging
Customer saw: "Total shipping cost: $1,320.56" ✅
Logs showed: "🚨 PRICE ERROR #1 DETECTED" 📊
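The comparison logic itself is small enough to unit test in isolation. Here's a distilled, dependency-free version of that Layer 5 check; the helper name and tolerance constant are mine, not part of any framework.
# Distilled version of the Layer 5 price check, stripped of framework types
# so the logic can be tested on its own. Names here are illustrative.
import re

PRICE_TOLERANCE = 0.02  # 2% allowance for rounding

def check_price(response_text, expected_price):
    """Return the corrected price if the text disagrees with ground truth, else None."""
    prices = re.findall(r'\$\s*([0-9,]+\.?\d*)', response_text)
    if not prices or expected_price is None:
        return None
    mentioned = float(prices[0].replace(',', ''))
    if abs(mentioned - expected_price) / expected_price > PRICE_TOLERANCE:
        return expected_price  # caller should rewrite the response
    return None

# The $117 scenario: the tool computed 1320.56, the LLM said $117.00
assert check_price("I calculate roughly $117.00 for this service", 1320.56) == 1320.56
# A correct response passes through untouched
assert check_price("Total shipping cost: $1,320.56", 1320.56) is None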
Real-World Impact
Let's compare approaches:
Before: Hope and Pray 🙏
# Traditional approach - trust the LLM completely
agent = LLM(
    tools=[calculate_price, charge_card, book_shipment],
    instructions="""
    You are a helpful shipping assistant.
    Always use the calculate_price tool for quotes.
    Be accurate with prices.
    Don't charge admin users.
    """
)
# Cross fingers and hope it follows instructions...
Observed failure rates:
- Wrong pricing tier: ~15-20% of requests
- Hallucinated numbers: ~5-10% of price displays
- Admin privilege leaks: ~3-5% of admin sessions
- Inconsistent behavior: Varies with prompt changes
After: Defense in Depth 🛡️
# Production-ready approach - enforce with code
agent = LLM(
    # Layer 1: Filter tools by user role
    tools=get_allowed_tools_for_user(user),

    # Layer 2: Context-aware instructions
    instructions=build_instructions(user_context),

    # Layer 3: Pre-execution enforcement
    before_tool_callback=enforce_business_rules,

    # Layer 5: Post-generation validation
    after_model_callback=validate_output
)

def enforce_business_rules(tool, args, context):
    """Runs before EVERY tool call"""
    user_role = context.state.get('user_role')

    # Block unauthorized actions
    if tool.name == 'charge_card' and user_role == 'admin':
        logger.info("Admin users don't pay - blocking payment tool")
        return {
            "success": True,
            "message": "Admin booking confirmed - no payment required",
            "skipped": True
        }

    # Route to correct implementation
    if tool.name == 'calculate_price':
        if user_role == 'admin':
            # Route to wholesale pricing
            return calculate_wholesale_price(args, context)
        else:
            # Route to retail pricing
            return calculate_retail_price(args, context)

    return None  # Proceed with normal execution

def validate_output(context, response):
    """Runs after LLM generates response"""
    # Check prices match ground truth
    # Check no sensitive data leaked
    # Check response format is correct
    # Return corrected response if needed
    pass
Results after implementation:
- ✅ Wrong pricing tier: 0% (blocked by callback)
- ✅ Hallucinated numbers: 0% (caught by validation)
- ✅ Admin privilege leaks: 0% (filtered at tool level)
- ✅ Inconsistent behavior: Eliminated (enforced by code)
- ⚠️ False positives: <0.1% (logged, reviewed, tuned)
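One piece referenced in the code above but not shown is the Layer 1 helper, get_allowed_tools_for_user. A minimal version might look like the sketch below; the role names and the mapping are illustrative, and it assumes the tool functions from the earlier examples are in scope.
# Minimal sketch of Layer 1 tool filtering by role.
# The role-to-tool mapping is illustrative; adapt it to your own tool registry.

TOOLS_BY_ROLE = {
    "retail":    [calculate_price, book_shipment, charge_card],
    "wholesale": [calculate_price, book_shipment, charge_card],
    "admin":     [calculate_price, book_shipment],  # admins never see the payment tool
}

def get_allowed_tools_for_user(user):
    """Return only the tools this user's role is allowed to even see."""
    return TOOLS_BY_ROLE.get(user.role, [calculate_price])  # safe default: quotes only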
The Reliability Equation
Agent Reliability =
    MIN(
        LLM Accuracy,
        Instruction Following Rate,
        Context Retention
    )
    + Callback Coverage × Enforcement Quality
    + Validation Coverage × Error Detection Rate
The first term is unpredictable and varies with:
- Model updates
- Prompt changes
- Conversation length
- User phrasing
The second and third terms are deterministic: they're code you control.
Don't bet your business on the first term alone.
What's Next
Building reliable AI agents isn't about making the LLM perfect; it's about designing systems that deliver reliable results despite LLM imperfections.
In this series, we'll explore each defensive layer:
Article 2: "Callback the Police" 👮‍♂️
How to use callbacks to enforce business rules like law enforcement—stopping bad behavior before it happens.
Article 3: "The Ground Truth Principle" 📊
Why session state is your source of truth and how to use it for validation.
Article 4: "Explicit Contracts Save Lives" ⚖️
Making function parameters explicit so callbacks can inject and enforce them.
Article 5: "Event-Driven Automation" 🔄
Using webhooks and background tasks to make agents proactive and reliable.
Try It Yourself
Quick reliability audit for your agent:
Question 1: If the LLM calls the wrong tool, do you detect and block it?
- ❌ No → Implement before_tool_callback for routing
- ⚠️ Sometimes → Ensure callbacks cover all tools
- ✅ Yes → Great! Is your coverage monitored?
Question 2: If the LLM displays incorrect data, do you catch it?
- ❌ No → Implement after_model_callback for validation
- ⚠️ Log only → Add correction logic
- ✅ Yes and correct → Excellent!
Question 3: Can the LLM bypass access controls?
- ❌ Yes → Filter tools by user role
- ⚠️ Depends on prompts → Make it code-enforced
- ✅ No → Verify with penetration testing
Question 4: Do you validate critical outputs?
- ❌ No → Start with financial/legal data
- ⚠️ Some → Expand to all critical outputs
- ✅ All → Document validation coverage
Question 5: Can you debug what went wrong?
- ❌ No logs → Add structured logging
- ⚠️ Basic logs → Add context and tracing
- ✅ Full tracing → Can you alert on patterns?
If you answered ❌ or ⚠️ to any question, you need deterministic layers. Your agent is relying too much on LLM behavior.
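If logging is your gap (Question 5), structured events are the cheapest upgrade: they turn one-off errors into something you can aggregate and alert on. A minimal sketch, with field names and the session id invented for illustration:
# Minimal structured logging for validation events so failures can be
# aggregated and alerted on. Field names and the session id are illustrative.
import json
import logging

logger = logging.getLogger("agent.validation")

def log_validation_error(kind, expected, actual, session_id):
    logger.error(json.dumps({
        "event": "validation_error",
        "kind": kind,              # e.g. "price_mismatch"
        "expected": expected,
        "actual": actual,
        "session_id": session_id,
    }))

log_validation_error("price_mismatch", 1320.56, 117.00, "sess_abc123")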
Key Takeaways
- LLMs are amazing at conversation, terrible at guarantees - Use them for what they do best
- Multiple defense layers - Like airport security, assume each layer might miss something
- Ground truth in code - Store correct values, validate against them
- Callbacks are enforcement - They're your business logic police
- Monitor everything - You can't improve what you don't measure
Resources
Next Article: "Callback the Police: Enforcing Business Rules in AI Agents" →
Have you caught reliability issues in your AI agents? What strategies worked for you? Share in the comments!