Composable Output Guardrails: Filter Agent Responses Before They Reach Users

#hermeschallenge #ai #python #agents

The model returned a response that contained a phone number. Someone's real phone number, lifted from the training data. It went straight to the user.

The model returned a response that started with "As an AI language model, I..." — a generic hedge that made the product look unpolished.

The model returned a response with an HTML injection attempt embedded in user-provided text that it had reflected back verbatim.

All of these are output problems. agent-guard-rails is a composable pipeline of output filters that catch them before responses reach users.

The Shape of the Fix

from agent_guard_rails import GuardRails, RuleResult

rails = GuardRails([
    "strip_ai_preamble",          # remove "As an AI..." opener
    "no_pii",                     # flag SSNs, card numbers, phone numbers
    "no_html_injection",          # strip <script>, <iframe>, <img src=...>, javascript: URIs
    "max_length:2000",            # cap response length
    "no_empty",                   # flag empty responses
])

# call_llm and fallback_response are your own application code
def respond(prompt: str) -> str:
    llm_response = call_llm(prompt)
    result: RuleResult = rails.apply(llm_response.content)

    if result.blocked:
        return fallback_response(result.rule_name)

    # Safe to send
    return result.cleaned_text

Each rule either passes, optionally cleaning the text, or blocks. The first blocking rule stops the pipeline, and result.rule_name identifies it. When nothing blocks, result.cleaned_text carries the cleaning applied by every passing rule.
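To make the short-circuit behaviour concrete, here is a minimal sketch of how such a pipeline could be written. It is not the library's source: RuleResult below is a stand-in dataclass using the field names from this article, and apply_rules, trim_whitespace, and reject_empty are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class RuleResult:
    """Stand-in for the library's RuleResult; field names follow the article."""
    passed: bool
    blocked: bool
    cleaned_text: str
    rule_name: Optional[str] = None


Rule = Callable[[str], RuleResult]


def apply_rules(rules: Iterable[Rule], text: str) -> RuleResult:
    """Run rules in order, feeding each rule the previous rule's cleaned text."""
    current = text
    for rule in rules:
        result = rule(current)
        if result.blocked:
            return result  # first blocking rule short-circuits the pipeline
        current = result.cleaned_text
    return RuleResult(passed=True, blocked=False, cleaned_text=current)


# Two hypothetical rules, for illustration only.
def trim_whitespace(text: str) -> RuleResult:
    return RuleResult(True, False, text.strip(), "trim_whitespace")


def reject_empty(text: str) -> RuleResult:
    is_empty = text == ""
    return RuleResult(not is_empty, is_empty, text, "reject_empty")
```

Because each rule receives the output of the one before it, order matters: in this sketch a whitespace-only response is trimmed first and then caught by the empty check.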

What It Does NOT Do

agent-guard-rails does not use an LLM to evaluate the response. All rules are regex and string-operation based. For semantic guardrails (is this response harmful?), you need a separate classifier.

It does not guarantee safety. Rules catch known patterns. Novel jailbreak outputs or novel PII formats may not be caught. Defense in depth applies here: guardrails are one layer, not the only layer.

It does not block at the request level. It filters responses. For request-level filtering (blocking certain user inputs), you need a separate pre-call filter.

Inside the Library

Rules are composed in a pipeline. Each rule is a function (text: str) -> RuleResult where RuleResult has passed, blocked, cleaned_text, and rule_name.

Built-in rules:

from typing import Callable

BUILTIN_RULES = {
    "strip_ai_preamble": strip_ai_preamble_rule,
    "no_pii": no_pii_rule,
    "no_html_injection": no_html_injection_rule,
    "no_empty": no_empty_rule,
    "no_repetition": no_repetition_rule,  # detect looping/stuck output
}

# Parameterized rules use "rule_name:param" syntax
def parse_rule(spec: str) -> Callable:
    if ":" in spec:
        # Split once so the parameter itself may contain ":" (e.g. a regex)
        name, param = spec.split(":", 1)
        if name == "max_length":
            return max_length_rule(int(param))
        if name == "forbidden_pattern":
            return forbidden_pattern_rule(param)
    # Unknown names fall through to this lookup and raise KeyError
    return BUILTIN_RULES[spec]

Custom rules: pass callables in the rule list alongside the string specs, for example GuardRails(["no_pii", my_custom_rule]), where my_custom_rule(text) -> RuleResult.
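As an illustration of what a custom rule might look like, the sketch below blocks responses that leak an internal hostname. The RuleResult dataclass is a stand-in with the four fields named above (the real constructor may differ), and both the no_internal_hosts rule and the internal.example.com domain are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleResult:
    """Stand-in with the four fields named above; the real constructor may differ."""
    passed: bool
    blocked: bool
    cleaned_text: str
    rule_name: Optional[str] = None


# Hypothetical pattern: hostnames under an internal-only domain.
INTERNAL_HOST = re.compile(r"\b[\w.-]+\.internal\.example\.com\b", re.IGNORECASE)


def no_internal_hosts(text: str) -> RuleResult:
    """Block any response that leaks an internal hostname."""
    leaked = INTERNAL_HOST.search(text) is not None
    return RuleResult(
        passed=not leaked,
        blocked=leaked,
        cleaned_text=text,  # this rule only blocks, it never rewrites
        rule_name="no_internal_hosts",
    )
```

A rule like this keeps its own name in rule_name, so a blocked response can be traced back to it in logs or fallback handling.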

strip_ai_preamble: removes "As an AI language model,", "I'm an AI assistant,", "As a large language model," and similar openers. These make products look unpolished and add no value.
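A rough sketch of the cleaning step, assuming a single anchored regex over the three openers listed above. The library's actual pattern list is likely longer, and the re-capitalization of the remaining text is an addition of this sketch, not documented behaviour.

```python
import re

# Illustrative opener list taken from the phrases above; the real list may be longer.
_PREAMBLE = re.compile(
    r"^\s*(?:as an ai language model|i'm an ai assistant|as a large language model)\s*,\s*",
    re.IGNORECASE,
)


def strip_ai_preamble(text: str) -> str:
    """Drop a leading AI-disclaimer clause and re-capitalize what remains."""
    stripped = _PREAMBLE.sub("", text, count=1)
    if stripped != text and stripped:
        stripped = stripped[0].upper() + stripped[1:]
    return stripped
```

Anchoring the pattern at the start of the text means a response that merely quotes the phrase mid-sentence is left untouched.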

no_html_injection: removes <script>, <iframe>, <img src=...>, and javascript: URIs. Does not strip all HTML (some responses legitimately use markdown-style HTML). Targets injection patterns specifically.
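The targeted approach could look roughly like the sketch below. The patterns are assumptions written for this article, not the library's own, and regex-based HTML filtering can be bypassed by malformed markup, so output escaping at render time is still needed.

```python
import re

_INJECTION_PATTERNS = [
    # <script>...</script> and <iframe>...</iframe>, including their contents
    re.compile(r"<(script|iframe)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL),
    # stray opening or closing <script>/<iframe> tags without a matching pair
    re.compile(r"</?(?:script|iframe)\b[^>]*>", re.IGNORECASE),
    # <img> tags that carry a src attribute
    re.compile(r"<img\b[^>]*\bsrc\s*=[^>]*>", re.IGNORECASE),
    # javascript: URI scheme
    re.compile(r"javascript\s*:", re.IGNORECASE),
]


def strip_html_injection(text: str) -> str:
    """Remove known injection patterns while leaving other markup alone."""
    for pattern in _INJECTION_PATTERNS:
        text = pattern.sub("", text)
    return text
```

Benign tags such as <b> pass through unchanged, which matches the stated goal of targeting injection patterns rather than stripping all HTML.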

no_repetition: flags output where the same sentence appears 3+ times consecutively. This catches stuck-generation loops where the model starts repeating itself.
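One simple way to detect this, sketched under the assumption that sentences are split naively on ., !, or ? followed by whitespace. The function name and the threshold parameter are hypothetical; the library may split and compare sentences differently.

```python
import re


def has_repeated_sentence(text: str, threshold: int = 3) -> bool:
    """Return True if one sentence occurs `threshold` or more times in a row."""
    # Naive sentence split on ., !, ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [s.strip().lower() for s in parts if s.strip()]
    run = 1
    for prev, curr in zip(sentences, sentences[1:]):
        run = run + 1 if curr == prev else 1
        if run >= threshold:
            return True
    return False
```

Only consecutive repeats count here, so a response that alternates between two sentences is not flagged.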

When to Use It

Use it for any user-facing agent response. Customer support bots, document Q&A, coding assistants. Anywhere a response goes directly from the LLM to a user without intermediate editing.

The preamble rule is low-risk and high-value. Strip it everywhere. "As an AI language model" adds nothing and signals a lack of product polish.

The PII rule requires thought: do you want to block responses with PII (return a fallback) or clean them (strip the PII from the text)? The blocking approach is safer for compliance. The cleaning approach is better for user experience. Configure per use case.
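The trade-off can be sketched with a single function that supports both behaviours. Everything here is an assumption for illustration: the mode parameter, the check_pii name, and the two deliberately simple US-style patterns are not taken from the library, and real PII detection needs far broader coverage.

```python
import re
from typing import Tuple

# Deliberately simple US-centric patterns for illustration only.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def check_pii(text: str, mode: str = "block") -> Tuple[bool, str]:
    """Return (blocked, text). mode="block" rejects; mode="clean" redacts in place."""
    found = any(p.search(text) for p in PII_PATTERNS.values())
    if not found:
        return False, text
    if mode == "block":
        return True, text
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} removed]", text)
    return False, text
```

In block mode the caller returns a fallback and the original text never leaves the service; in clean mode the user still gets an answer, at the cost of trusting the patterns to have found every match.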

Skip it for intermediate LLM calls (tool planning, internal reasoning steps). Guardrails are for the final response, not internal agent state.

Install

pip install git+https://github.com/MukundaKatta/agent-guard-rails

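import logging
from functools import lru_cache

from agent_guard_rails import GuardRails

logger = logging.getLogger(__name__)

# Build each channel's pipeline once and reuse it
@lru_cache(maxsize=None)
def make_rails_for_context(channel: str) -> GuardRails:
    base_rules = [
        "strip_ai_preamble",
        "no_empty",
        "no_repetition",
        "max_length:3000",
    ]

    # HTML injection only matters where the response is rendered as HTML
    if channel == "web":
        base_rules.append("no_html_injection")

    if channel in ("public", "web", "mobile"):
        base_rules.append("no_pii")

    return GuardRails(base_rules)

def respond_to_user(user_message: str, channel: str = "web") -> str:
    # call_llm is your own client wrapper, assumed here to return the response text
    llm_text = call_llm(user_message)
    # Use the pipeline for the caller's channel rather than a hard-coded web instance
    result = make_rails_for_context(channel).apply(llm_text)

    if result.blocked:
        # Stdlib logging takes %-style arguments, not arbitrary keyword fields
        logger.warning("response_blocked rule=%s channel=%s", result.rule_name, channel)
        return "I couldn't generate a complete response. Please try again."

    return result.cleaned_text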

Sibling Libraries

Library	What it solves
`llm-output-validator`	Rule-based validation of structured output shape
`llm-pii-redact`	Reversible PII redaction with restore support
`tool-secret-scrubber`	Redact API keys from tool output
`prompt-shield`	Pattern-based prompt-injection detection on inputs
`agentvet`	Validate tool call arguments before execution

The combined pipeline: prompt-shield on user inputs (block injection attempts), agentvet on tool calls (validate args), agent-guard-rails on LLM outputs (clean and filter responses).

What's Next

Configurable action per rule: right now rules either block or clean. A warn action that logs but passes would be useful for auditing without blocking production traffic. This would make it easier to tune rules on real traffic before enabling blocking.
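Since this feature does not exist yet, the following is only a sketch of one possible shape for it. The (name, check, action) tuple format and the run_with_actions function are hypothetical, and checks are simplified to functions that return True on a violation.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("guardrails.audit")

# A check returns True when the text violates the rule (hypothetical shape).
Check = Callable[[str], bool]


def run_with_actions(
    text: str, rules: List[Tuple[str, Check, str]]
) -> Tuple[bool, List[str]]:
    """Return (blocked, warnings); each rule is (name, check, "block" or "warn")."""
    warnings: List[str] = []
    for name, check, action in rules:
        if not check(text):
            continue
        if action == "block":
            return True, warnings
        # "warn": record the hit for auditing, but let the response through
        logger.warning("rule_fired rule=%s action=warn", name)
        warnings.append(name)
    return False, warnings
```

Running a new rule in warn mode against real traffic shows how often it would have fired before it is allowed to block anything.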

Rule composition operators: GuardRails(["no_pii AND no_html_injection", "max_length:2000"]) where AND means both must fail to block. Right now every rule failure blocks independently.
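The proposed AND semantics could be prototyped as a plain combinator before committing to a string syntax. This is a hypothetical sketch: all_of is not part of the library, and checks are again simplified to functions that return True when a rule fails.

```python
from typing import Callable

Check = Callable[[str], bool]  # True means the rule fails on this text


def all_of(*checks: Check) -> Check:
    """Combine checks so the group fails only when every member fails."""
    def combined(text: str) -> bool:
        return all(check(text) for check in checks)
    return combined


# Two hypothetical checks, for illustration only.
def has_digit(text: str) -> bool:
    return any(ch.isdigit() for ch in text)


def has_at_sign(text: str) -> bool:
    return "@" in text
```

With this shape, all_of(has_digit, has_at_sign) blocks only text that trips both checks, while listing the two checks separately keeps today's behaviour of blocking on either one.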

Telemetry integration: emit a structured event every time a rule fires with rule name, input length, and outcome. Feed this into your observability stack to understand how often guardrails trigger in production.
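A minimal sketch of what such an event might contain, using the three fields named above and the standard logging module as a stand-in transport. The event name, logger name, and function names are assumptions; a real integration would more likely target whatever metrics or tracing client the application already uses.

```python
import json
import logging

logger = logging.getLogger("guardrails.telemetry")


def build_rule_event(rule_name: str, text: str, outcome: str) -> dict:
    """Build the structured payload; only the length of the text is recorded."""
    return {
        "event": "guardrail_rule_fired",
        "rule": rule_name,
        "input_length": len(text),
        "outcome": outcome,
    }


def emit_rule_event(rule_name: str, text: str, outcome: str) -> dict:
    """Log the event as one JSON line and return it for further handling."""
    event = build_rule_event(rule_name, text, outcome)
    logger.info(json.dumps(event, sort_keys=True))
    return event
```

Recording the input length rather than the text itself keeps blocked content, which may contain the very PII the rule caught, out of the telemetry stream.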

Built as part of the agent-stack family: composable Python primitives for production LLM agents.