How Reliable Are Your AI Agents?

Claret Ibeawuchi

The $117 Bug That Made Me Rethink Architecture

Picture this: A customer requests a shipping quote. Your AI agent calculates the price: $1,320.56. The agent stores it correctly in the database. The backend logs show the right number. Everything looks perfect.

But when the customer sees the response, it says: $117.00.

That's a 91% error ($117 is barely 9% of $1,320.56): one hallucination away from massive financial loss or customer distrust.

This actually happened in production. And it's not an edge case; it's a fundamental characteristic of Large Language Models.


The Reliability Gap

When we talk about "AI agents," we're usually talking about LLMs with access to tools (functions they can call). The promise is compelling: natural language interfaces that can handle complex business logic.

The reality? LLMs are very advanced prediction engines, not calculators.

What LLMs Are Good At

  • Understanding natural language and intent
  • Generating human-like, contextual responses
  • Recognizing patterns and relationships
  • Adapting to conversational flow

What LLMs Struggle With

  • Exact calculations and number precision
  • Following strict business rules consistently
  • Deterministic behavior across requests
  • Security boundaries and access control
  • Critical decision-making without oversight

The Developer's Dilemma

Think of it like hiring someone who's brilliant at customer service but occasionally forgets basic math:

# This is what you write
def calculate_price(user_type, base_price):
    if user_type == "wholesale":
        return base_price * 0.85  # 15% discount
    else:
        return base_price * 1.12  # 12% markup

# This is what the LLM might do
"Based on the pricing, I calculate roughly $117 for this service..."

The LLM might:

  • Use the wrong pricing tier (wholesale vs retail)
  • Misread numbers during token processing (1320.56 → $117.00)
  • Hallucinate calculations instead of using tool results
  • Ignore access controls when constructing responses
  • Mix data from different contexts

You can't fix this with better prompts alone. This requires architectural solutions.


Why LLMs Fail at Critical Tasks

1. Token-Based Processing

LLMs don't process numbers as mathematical values; they process everything as tokens (text chunks).

The number 1320.56 might be tokenized as:

  • ["1320", ".", "56"]
  • ["13", "20", ".56"]
  • ["1", "320", ".", "56"]

Exactly how it splits depends on the tokenizer. When generating output, the model predicts the most statistically likely next token, not the mathematically correct one.
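
You can see this for yourself. Here's a minimal sketch, assuming the tiktoken package is installed (your model's tokenizer may split things differently):

# A minimal sketch: inspect how a price string gets split into tokens.
# Assumes the tiktoken package; the exact splits are tokenizer-dependent.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The total is $1320.56")
print([enc.decode([t]) for t in tokens])
# Example output (tokenizer-dependent): ['The', ' total', ' is', ' $', '132', '0', '.', '56']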

2. Probabilistic, Not Deterministic

# Deterministic code - same input = same output
def check_permission(user):
    return user.role == "admin"  # Always consistent

# LLM reasoning - same input ≠ guaranteed same output
"""
The user appears to have administrative permissions based on 
their previous actions... (might vary across calls)
"""

With temperature > 0, the same prompt can yield different outputs. Even at temperature 0, subtle prompt changes can shift behavior.
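
A quick way to observe this is to send the identical prompt several times and count the distinct answers. A minimal sketch, assuming the openai package and a configured API key (any chat-completion API shows the same effect; the model name is illustrative):

# A minimal sketch: same prompt, several calls, how many distinct answers?
# Assumes the openai package and an API key; model name is illustrative.
from openai import OpenAI

client = OpenAI()
prompt = (
    "A wholesale order has a base price of $1320.56. "
    "What is the total after a 15% discount? Reply with the number only."
)

answers = set()
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    answers.add(resp.choices[0].message.content.strip())

print(answers)  # At temperature > 0, expect more than one distinct answer on occasion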

3. Context Window Limitations

Your agent might have the correct data in its context, but:

  • As conversations grow, early information gets less attention
  • Critical details might be overshadowed by recent messages
  • The model might "forget" information from 50+ messages ago
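
One practical mitigation, and a preview of the ground-truth idea later in this article, is to re-inject critical values into every model call instead of trusting the conversation history to carry them. A minimal sketch, where session.state and the prompt shape are assumptions rather than a specific framework's API:

# A minimal sketch: pin critical facts into every request so they never
# scroll out of the model's effective attention. `session.state` and
# `build_pinned_prompt` are illustrative names, not a framework API.

def build_pinned_prompt(session, user_message):
    pinned_facts = (
        "GROUND TRUTH (do not contradict):\n"
        f"- Quoted price: ${session.state['expected_price']:.2f}\n"
        f"- User role: {session.state['user_role']}\n"
    )
    return [
        {"role": "system", "content": pinned_facts},
        {"role": "user", "content": user_message},
    ]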

4. No Inherent Security Model

LLMs don't understand:

  • Role-based access control (RBAC)
  • Authentication vs Authorization
  • Privilege escalation risks
  • Data isolation requirements

They might cheerfully expose wholesale pricing to retail customers if the context suggests it, or process admin commands from regular users.
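
The fix is to keep access control out of the model entirely, which is exactly what Layer 1 in the architecture below does. A minimal sketch of role-based tool filtering, where the role map and tool names are hypothetical examples:

# A minimal sketch of role-based tool filtering (Layer 1 below).
# The role map and tool names are hypothetical examples.

ROLE_TOOLS = {
    "admin":  {"calculate_price", "book_shipment", "view_wholesale_rates"},
    "retail": {"calculate_price", "book_shipment"},
}

def filter_tools_by_role(user_role, all_tools):
    """Return only the tools this role may see; the LLM can't call
    a tool it was never handed."""
    allowed = ROLE_TOOLS.get(user_role, set())
    return [tool for tool in all_tools if tool.name in allowed]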

5. Instruction Drift

Even with perfect instructions, LLMs can:

  • Misinterpret edge cases
  • Prioritize recent conversation over system instructions
  • Follow user instructions that contradict system rules
  • Generate plausible-sounding but incorrect responses

The Multi-Layer Defense Strategy

Here's what I learned building production AI agents: Don't trust the LLM for business-critical logic. Use it for what it does best, and enforce everything else deterministically.

Think of it like airport security—multiple checkpoints, each catching different issues:

The Architecture

User Request
     ↓
┌─────────────────────────────────────────────┐
│  Layer 1: Tool Filtering                    │  ← "What can this user access?"
│  (Before agent even sees the tools)         │
└─────────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────────┐
│  Layer 2: Instruction Engineering           │  ← "How should you behave?"
│  (Guide LLM with clear rules)               │
└─────────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────────┐
│  Layer 3: Pre-Execution Callbacks           │  ← "Is this allowed right now?"
│  (before_tool_callback - Business rules)    │
└─────────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────────┐
│  Layer 4: Tool Execution                    │  ← "Execute with ground truth"
│  (Deterministic business logic)             │
└─────────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────────┐
│  Layer 5: Output Validation                 │  ← "Is the response accurate?"
│  (after_model_callback - Final check)       │
└─────────────────────────────────────────────┘
     ↓
User Response

Each layer provides a different type of safety:

  • Layer 1 (Tool Filtering): "What tools can this user even see?"
  • Layer 2 (Instructions): "How should the LLM behave?"
  • Layer 3 (Pre-Execution): "Is this specific tool call allowed? Should we route it elsewhere?"
  • Layer 4 (Execution): "Calculate with code, not with tokens"
  • Layer 5 (Validation): "Is the output what we expected? Correct if needed"

Example: Catching the $117 Bug

Here's how the layers work together:

# Snippet assumptions: `pricing_api` is your quoting client, `session`/`context`
# expose shared state, and LlmResponse/Content/Part come from your agent framework.
import logging
import re

logger = logging.getLogger(__name__)

# Layer 4: Tool execution - Calculate with code (ground truth)
def calculate_shipping_price(origin, destination, vehicle):
    price = pricing_api.get_quote(origin, destination, vehicle)
    # price = 1320.56 (calculated correctly)

    # Store ground truth in session state
    session.state['expected_price'] = price

    return {
        "total_price": price,
        "currency": "USD",
        "breakdown": {...}
    }

# Layer 5: Validation - Catch LLM errors before user sees them
def validate_response(context, llm_response):
    """
    This runs AFTER the LLM generates a response but BEFORE 
    the user sees it. It's our last line of defense.
    """

    # What did the LLM say?
    response_text = llm_response.content.parts[0].text
    mentioned_prices = re.findall(r'\$\s*([0-9,]+\.?\d*)', response_text)

    if not mentioned_prices:
        return None  # No prices mentioned, all good

    # What should it have said?
    expected_price = context.state.get('expected_price')  # 1320.56
    if expected_price is None:
        return None  # No ground truth stored, nothing to compare against

    mentioned_price = float(mentioned_prices[0].replace(',', ''))  # 117.00

    # Validate within tolerance (2% for rounding)
    difference_pct = abs(mentioned_price - expected_price) / expected_price

    if difference_pct > 0.02:  # More than 2% off
        logger.error(
            f"🚨 PRICE ERROR: LLM showed ${mentioned_price}, "
            f"expected ${expected_price} ({difference_pct*100:.1f}% off)"
        )

        # Correct it silently - user never knows
        return LlmResponse(
            content=Content(
                role="model",
                parts=[Part(text=f"Total shipping cost: ${expected_price:.2f} 🚚")]
            )
        )

    return None  # Price is correct, use original response

What happened in my case:

  1. ✅ Tool calculated correct price: $1,320.56
  2. ✅ Stored as ground truth in session
  3. ❌ LLM generated response with wrong price: $117.00 (91% error!)
  4. ✅ Validation caught the discrepancy
  5. ✅ Response corrected to $1,320.56 before user saw it
  6. ✅ Error logged for monitoring and debugging

Customer saw: "Total shipping cost: $1,320.56" ✅

Logs showed: "🚨 PRICE ERROR #1 DETECTED" 📊


Real-World Impact

Let's compare approaches:

Before: Hope and Pray 🙏

# Traditional approach - trust the LLM completely
agent = LLM(
    tools=[calculate_price, charge_card, book_shipment],
    instructions="""
    You are a helpful shipping assistant. 
    Always use the calculate_price tool for quotes.
    Be accurate with prices.
    Don't charge admin users.
    """
)

# Cross fingers and hope it follows instructions...

Observed failure rates:

  • Wrong pricing tier: ~15-20% of requests
  • Hallucinated numbers: ~5-10% of price displays
  • Admin privilege leaks: ~3-5% of admin sessions
  • Inconsistent behavior: Varies with prompt changes

After: Defense in Depth 🛡️

# Production-ready approach - enforce with code
agent = LLM(
    # Layer 1: Filter tools by user role
    tools=get_allowed_tools_for_user(user),

    # Layer 2: Context-aware instructions
    instructions=build_instructions(user_context),

    # Layer 3: Pre-execution enforcement
    before_tool_callback=enforce_business_rules,

    # Layer 5: Post-generation validation
    after_model_callback=validate_output
)

def enforce_business_rules(tool, args, context):
    """Runs before EVERY tool call"""
    user_role = context.state.get('user_role')

    # Block unauthorized actions
    if tool.name == 'charge_card' and user_role == 'admin':
        logger.info("Admin users don't pay - blocking payment tool")
        return {
            "success": True,
            "message": "Admin booking confirmed - no payment required",
            "skipped": True
        }

    # Route to correct implementation
    if tool.name == 'calculate_price':
        if user_role == 'admin':
            # Route to wholesale pricing
            return calculate_wholesale_price(args, context)
        else:
            # Route to retail pricing
            return calculate_retail_price(args, context)

    return None  # Proceed with normal execution

def validate_output(context, response):
    """Runs after LLM generates response"""
    # Check prices match ground truth
    # Check no sensitive data leaked
    # Check response format is correct
    # Return corrected response if needed
    pass
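
The helper functions referenced above aren't shown here. get_allowed_tools_for_user would follow the same shape as the role-based filter sketched earlier, and build_instructions (Layer 2) might look something like this minimal sketch, where the wording and the user_context fields are illustrative assumptions:

# A minimal sketch of Layer 2: context-aware instructions.
# The exact wording and the user_context fields are illustrative assumptions.

def build_instructions(user_context):
    base = (
        "You are a shipping assistant. Always quote prices exactly as returned "
        "by the calculate_price tool; never estimate or recompute them yourself."
    )
    if user_context.get("role") == "admin":
        return base + " This user is internal staff: bookings are never charged."
    return base + " This user is a retail customer: never mention wholesale rates."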

Results after implementation:

  • ✅ Wrong pricing tier: 0% (blocked by callback)
  • ✅ Hallucinated numbers: 0% (caught by validation)
  • ✅ Admin privilege leaks: 0% (filtered at tool level)
  • ✅ Inconsistent behavior: Eliminated (enforced by code)
  • ⚠️ False positives: <0.1% (logged, reviewed, tuned)

The Reliability Equation

Agent Reliability = 
    MIN(
        LLM Accuracy,
        Instruction Following Rate,
        Context Retention
    )
    +
    Callback Coverage × Enforcement Quality
    +
    Validation Coverage × Error Detection Rate

The first term is unpredictable and varies with:

  • Model updates
  • Prompt changes
  • Conversation length
  • User phrasing

The second and third terms are deterministic: they're code you control.

Don't bet your business on the first term alone.


What's Next

Building reliable AI agents isn't about making the LLM perfect; it's about designing systems that deliver reliable results despite LLM imperfections.

In this series, we'll explore each defensive layer:

Article 2: "Callback the Police" 👮‍♂️

How to use callbacks to enforce business rules like law enforcement—stopping bad behavior before it happens.

Article 3: "The Ground Truth Principle" 📊

Why session state is your source of truth and how to use it for validation.

Article 4: "Explicit Contracts Save Lives" ⚖️

Making function parameters explicit so callbacks can inject and enforce them.

Article 5: "Event-Driven Automation" 🔄

Using webhooks and background tasks to make agents proactive and reliable.


Try It Yourself

Quick reliability audit for your agent:

Question 1: If the LLM calls the wrong tool, do you detect and block it?

  • ❌ No → Implement before_tool_callback for routing
  • ⚠️ Sometimes → Ensure callbacks cover all tools
  • ✅ Yes → Great! Is your coverage monitored?

Question 2: If the LLM displays incorrect data, do you catch it?

  • ❌ No → Implement after_model_callback for validation
  • ⚠️ Log only → Add correction logic
  • ✅ Yes and correct → Excellent!

Question 3: Can the LLM bypass access controls?

  • ❌ Yes → Filter tools by user role
  • ⚠️ Depends on prompts → Make it code-enforced
  • ✅ No → Verify with penetration testing

Question 4: Do you validate critical outputs?

  • ❌ No → Start with financial/legal data
  • ⚠️ Some → Expand to all critical outputs
  • ✅ All → Document validation coverage

Question 5: Can you debug what went wrong?

  • ❌ No logs → Add structured logging
  • ⚠️ Basic logs → Add context and tracing
  • ✅ Full tracing → Can you alert on patterns?

If you answered ❌ or ⚠️ to any question, you need deterministic layers. Your agent is relying too much on LLM behavior.


Key Takeaways

  1. LLMs are amazing at conversation, terrible at guarantees - Use them for what they do best
  2. Multiple defense layers - Like airport security, assume each layer might miss something
  3. Ground truth in code - Store correct values, validate against them
  4. Callbacks are enforcement - They're your business logic police
  5. Monitor everything - You can't improve what you don't measure


Next Article: "Callback the Police: Enforcing Business Rules in AI Agents"


Have you caught reliability issues in your AI agents? What strategies worked for you? Share in the comments!
