AI Agent Guardrails: Rules That LLMs Cannot Bypass

Elizabeth Fuentes L for AWS

AI agents can hallucinate operation success even when they violate business rules. They confirm bookings without payment verification, accept invalid parameters such as 15 guests when the maximum is 10, or ignore required prerequisites. Prompt engineering alone cannot prevent these errors.

Diagram showing LLM flexibility combined with symbolic rule enforcement

Consider a travel booking agent that receives the query: "Confirm my hotel booking." The agent calls confirm_booking(booking_id="BK001") and returns "SUCCESS: Confirmed BK001" — even though no payment was ever verified. The docstring says "Payment must be verified first." The LLM read it and ignored it.

This is the hallucination pattern that symbolic guardrails solve. Using Strands Agents hooks, you can enforce business rules at the framework level — before the tool executes. The result: every invalid operation in the demo blocked, with zero changes to the tools or prompts, by adding a single hook.

Research from ATA: Autonomous Trustworthy Agents (2024) calls this the neurosymbolic approach: combining neural LLM reasoning with deterministic symbolic rules that cannot be overridden.

Note: This post uses Strands Agents to demonstrate the pattern. Similar hook-based interception exists in LangGraph (node guards), AutoGen (reply functions), and other agent frameworks.

This Series: 4 Production Techniques

Part 1: RAG vs GraphRAG: When Agents Hallucinate Answers — Relationship-aware knowledge graphs preventing hallucinations in aggregations and precise queries

Part 2: Reduce Agent Errors and Token Costs with Semantic Tool Selection — Vector-based tool filtering for accurate tool selection

Part 3 (This Post): AI Agent Guardrails — Symbolic rules for verifiable decisions that LLMs cannot bypass

Part 4: Multi-Agent Validation — Agent teams detecting hallucinations before damage

Code repository: sample-why-agents-fail

In Part 2, semantic tool selection reduced tool confusion by filtering tools before the LLM sees them. But agents can still hallucinate operation success — confirming bookings without payment, ignoring guest limits, or bypassing required validation steps. Filtering tools doesn't stop the LLM from misusing the ones it receives.

The Problem: Prompts Are Suggestions, Not Constraints

Research from ATA (2024) identifies three hallucination patterns that prompt engineering cannot prevent:

  • Parameter errors: The agent calls book_hotel(guests=15) despite "Maximum 10 guests" in the docstring
  • Completeness errors: The agent executes bookings without required payment verification
  • Tool bypass behavior: The agent confirms success without calling mandatory validation tools

The root cause is architectural: prompts are text that the large language model (LLM) interprets. Business rules embedded in docstrings or system prompts become suggestions, not constraints. The model decides whether to follow them on every call.

Prerequisites

This post assumes familiarity with Python and LLM agent tool use. If you are new to Strands Agents, start with the Strands Agents documentation.

cd 04-neurosymbolic-demo
pip install -r requirements.txt

The demo uses Strands Agents with OpenAI GPT-4o-mini by default. You can configure any model provider that Strands supports — see Strands Model Providers.


The Solution: Neurosymbolic Validation with Strands Hooks

Strands Agents provides BeforeToolCallEvent — a hook that intercepts every tool call before execution. You attach a HookProvider to the agent that validates symbolic rules and cancels the call if any rule fails.

Symbolic Guardrails at the Tool Level

The key line in the hook is event.cancel_tool. When set, Strands replaces the tool result with that message before the LLM sees anything. The tool never executes. The LLM receives a cancellation it cannot override.

Neural + Symbolic: The LLM handles natural language understanding and tool selection. The hook handles deterministic constraint enforcement. Neither replaces the other.


The Demo: Two Agents, Same Code

We run the same 3 scenarios on two agents with identical tools, identical model, and identical prompts. The only difference is one line: hooks=[hook].

Agent            Hook               Guardrails
baseline_agent   None               ❌ No validation
guarded_agent    NeurosymbolicHook  ✅ Rules enforced before every tool call

Full notebook: test_neurosymbolic_hooks.ipynb

Setup: Rules, Hook, and Two Agents

Step 1 — Define symbolic rules (rules.py):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    message: str

BOOKING_RULES = [
    Rule("max_guests",   lambda ctx: ctx.get("guests", 1) <= 10,     "Maximum 10 guests per booking"),
    Rule("valid_dates",  lambda ctx: ctx["check_in"] < ctx["check_out"], "Check-in must be before check-out"),
]

CONFIRMATION_RULES = [
    Rule("payment_before_confirm", lambda ctx: ctx.get("payment_verified", False),
         "Payment must be verified before confirmation"),
]

def validate(rules: list, context: dict):
    """Evaluate a rule set against a context; return (passed, violation messages)."""
    violations = [r.message for r in rules if not r.condition(context)]
    return len(violations) == 0, violations

Rules are plain Python functions — deterministic, testable, and auditable independently of any agent.
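Because the rules carry no agent or LLM dependencies, they can be unit-tested on their own. A minimal standalone sketch (the `validate` helper here mirrors what the hook calls; the repo's version may differ):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    message: str

def validate(rules: list, context: dict):
    """Return (passed, violation messages) for a rule set."""
    violations = [r.message for r in rules if not r.condition(context)]
    return len(violations) == 0, violations

max_guests = Rule("max_guests", lambda ctx: ctx.get("guests", 1) <= 10,
                  "Maximum 10 guests per booking")

# An invalid context fails with an explicit reason; a valid one passes.
passed, why = validate([max_guests], {"guests": 15})
assert not passed and why == ["Maximum 10 guests per booking"]

passed, why = validate([max_guests], {"guests": 5})
assert passed and why == []
```

This is the auditability payoff: each rule's behavior can be pinned down in a test suite before any agent ever runs.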

Step 2 — Create the validation hook (test_neurosymbolic_hooks.py):

from strands.hooks import HookProvider, HookRegistry, BeforeToolCallEvent
from rules import BOOKING_RULES, CONFIRMATION_RULES, validate

class NeurosymbolicHook(HookProvider):

    def __init__(self, state: dict):
        self.state = state  # shared booking/payment state the rules read
        self.rules = {
            "book_hotel": BOOKING_RULES,
            "confirm_booking": CONFIRMATION_RULES,
        }

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.validate)

    def validate(self, event: BeforeToolCallEvent) -> None:
        tool_name = event.tool_use["name"]
        if tool_name not in self.rules:
            return  # no rules registered for this tool; allow the call
        # _build_context merges the tool input with shared state (full version in the repo)
        context = self._build_context(tool_name, event.tool_use["input"])
        passed, violations = validate(self.rules[tool_name], context)
        if not passed:
            event.cancel_tool = f"BLOCKED: {', '.join(violations)}"

Step 3 — Define clean tools (no validation logic mixed in):

from strands import tool

@tool
def book_hotel(hotel: str, check_in: str, check_out: str, guests: int = 1) -> str:
    """Book a hotel room."""
    return f"SUCCESS: Booked {hotel} for {guests} guests, {check_in} to {check_out}"

@tool
def confirm_booking(booking_id: str) -> str:
    """Confirm a booking."""
    return f"SUCCESS: Confirmed {booking_id}"

Step 4 — Create both agents:

# Baseline: no hook, no validation
# (process_payment is a third @tool defined like the others; STATE is the
# shared booking/payment state the hook reads)
baseline_agent = Agent(tools=[book_hotel, process_payment, confirm_booking], model=MODEL)

# Guarded: hook intercepts every tool call
hook = NeurosymbolicHook(STATE)
guarded_agent = Agent(tools=[book_hotel, process_payment, confirm_booking], hooks=[hook], model=MODEL)

Same tools, same model, same prompts — the only difference is hooks=[hook].


Test 1: Confirm Booking Without Payment

Query: "Confirm booking BK001"

The booking exists. The payment does not. The confirm_booking tool docstring says "Payment must be verified first."

Baseline agent calls the tool and returns success. The docstring was context, not a constraint.

Guarded agent — the hook evaluates the CONFIRMATION_RULES before the tool executes, finds payment_verified = False, and cancels the call.

The LLM received BLOCKED: Payment must be verified before confirmation as the tool result. Retrying with different parameters won't help — the rule is enforced at the framework level on every call.


Test 2: Book Hotel Exceeding Guest Limit

Query: "Book Grand Hotel for 15 people from 2026-03-20 to 2026-03-25"

The BOOKING_RULES set a maximum of 10 guests. The agent extracts guests=15 from the query.

Baseline agent passes 15 guests to the tool, which returns success. The maximum in the docstring was ignored.

Guarded agent — the hook evaluates max_guests (15 <= 10 → False) and cancels before the tool runs.

baseline executes, guarded blocks

The validation happens before execution. There is no booking to roll back, no compensating transaction needed — the invalid operation never occurred.


Test 3: Valid Booking

Query: "Book Grand Hotel for 5 guests from 2026-03-20 to 2026-03-25"

All rules pass: guests=5 <= 10 and check-in precedes check-out.

Both agents execute the booking successfully. The hook adds no friction to valid operations.

both agents execute


Results: Every Invalid Operation Blocked

Scenario                         Baseline Agent               Guarded Agent
Confirm booking without payment  ❌ Executes — hallucination   ✅ Blocked before execution
Book 15 guests (max 10)          ❌ Executes — rule violated   ✅ Blocked before execution
Valid booking (5 guests)         ✅ Executes                   ✅ Executes

The guarded agent blocked both invalid operations (2/2) and allowed the valid one (1/1) — zero false positives, zero false negatives.

The baseline agent has no mechanism to detect that it violated a business rule. It returns success with full confidence. Without the hook, the only thing standing between the agent and the invalid operation is the LLM's interpretation of a docstring.


Key Insight: Where Enforcement Happens

Prompt engineering — the LLM can ignore it:

system_prompt = """
IMPORTANT: Never confirm bookings without payment verification.
CRITICAL: Maximum 10 guests per booking.
"""
# The LLM reads this as context. It can hallucinate compliance.

❌ The LLM decides whether to follow this on every single call.

Strands Hook — enforced before the tool runs:

def validate(self, event: BeforeToolCallEvent) -> None:
    passed, violations = validate(self.rules[tool_name], context)
    if not passed:
        event.cancel_tool = f"BLOCKED: {', '.join(violations)}"
        # Tool never executes. LLM receives the cancellation.
        # There is no path to override this.

✅ The hook runs outside the LLM. The decision is not the LLM's to make.

The difference is architectural. Prompts are input to the LLM. Hooks are framework-level interceptors that run before the LLM sees the tool result.
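The interception pattern itself is framework-agnostic. A plain-Python sketch of the same idea — a decorator that checks rules before the wrapped function body can run (illustrative names, not the Strands API):

```python
from functools import wraps

def guarded(rules):
    """Wrap a tool so every rule is checked before the tool body executes."""
    def decorator(tool_fn):
        @wraps(tool_fn)
        def wrapper(**kwargs):
            violations = [msg for check, msg in rules if not check(kwargs)]
            if violations:
                # The tool body never runs; the caller only sees the block.
                return "BLOCKED: " + ", ".join(violations)
            return tool_fn(**kwargs)
        return wrapper
    return decorator

@guarded([(lambda p: p.get("guests", 1) <= 10, "Maximum 10 guests per booking")])
def book_hotel(hotel: str, guests: int = 1) -> str:
    return f"SUCCESS: Booked {hotel} for {guests} guests"

print(book_hotel(hotel="Grand Hotel", guests=15))  # BLOCKED: Maximum 10 guests per booking
print(book_hotel(hotel="Grand Hotel", guests=5))   # SUCCESS: Booked Grand Hotel for 5 guests
```

Frameworks like Strands, LangGraph, and AutoGen expose the same enforcement point as a first-class hook rather than a decorator, but the guarantee is identical: the check runs outside the model.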

Production Considerations

Advantages:

  • Verifiable constraints — rules are code, not instructions
  • Centralized — one hook validates all tools; no validation logic scattered across tool definitions
  • You can test rules independently of any agent or LLM call
  • Rule violations produce explicit, loggable events with tool name, parameters, and reason

Challenges:

  • Rules must be explicitly defined for each operation you want to protect
  • Does not handle fuzzy or probabilistic logic — rules are boolean
  • Edge cases require explicit handling in rule conditions
  • Rules need maintenance as business logic evolves

Best practices:

  • Define hooks for critical, high-stakes operations such as bookings, payments, and cancellations
  • Log all rule violations with tool name, parameters, and reason for auditing
  • Test rules thoroughly and independently of the agent
  • Combine guardrails with semantic tool selection (Part 2) and multi-agent validation (Part 4) for layered protection
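For the audit-logging practice, a hedged sketch of a structured violation record (the `log_violation` helper and its field names are illustrative, not part of the repo):

```python
import json
import logging

logger = logging.getLogger("guardrails")

def log_violation(tool_name: str, params: dict, violations: list) -> str:
    """Emit a machine-parseable record of a blocked call; returns the line for testing."""
    record = {"event": "tool_blocked", "tool": tool_name,
              "params": params, "reasons": violations}
    line = json.dumps(record, sort_keys=True)
    logger.warning(line)
    return line

entry = log_violation("book_hotel", {"guests": 15},
                      ["Maximum 10 guests per booking"])
```

Calling a helper like this from the hook just before setting event.cancel_tool gives every blocked call an auditable trail with tool name, parameters, and reason.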

Troubleshooting

  • OPENAI_API_KEY not set: run export OPENAI_API_KEY=your-key or add it to a .env file
  • Hook not blocking calls: verify the tool name in self.rules matches the @tool function name exactly
  • BeforeToolCallEvent import error: use from strands.hooks import HookProvider, HookRegistry, BeforeToolCallEvent
  • Strands version mismatch: run pip install --upgrade strands-agents

For additional help, open an issue in the sample-why-agents-fail repository.

What's Next

Symbolic guardrails block rule violations at the tool level. But a single agent still has no check on its own reasoning across multiple steps — it can hallucinate that a previous step succeeded, misinterpret a tool result, or reach a confident wrong conclusion from a chain of correct tool calls.

Part 4: Multi-Agent Validation shows how a Swarm of specialized agents (Executor → Validator → Critic) provides explicit verdicts on every response — catching hallucinations that no tool-level guardrail can see.

Key Takeaways

  • Prompts are suggestions: The LLM interprets docstrings and system prompts — it can hallucinate compliance with any instruction
  • Hooks are enforcement: BeforeToolCallEvent intercepts tool calls before execution at the framework level — the LLM cannot override a cancelled tool
  • All invalid operations blocked: zero changes to tools or prompts, one hook added
  • Clean separation: Tools handle business operations; hooks handle constraint enforcement
  • Auditable by design: Rule violations are explicit Python conditions — testable, loggable, and traceable
  • One hook, all tools: A single NeurosymbolicHook validates every tool call in one place

References


Code: GitHub


Thank you!
