BrewHubPHL

Posted on May 24

Don't Trust Your LLM's Safety Promises Across Runtimes

#ai #productivity #opensource #architecture

Most LLM safety guardrails are built with a silent assumption baked in: all your customer-facing traffic runs through a single runtime. One process. One in-process safety check. Done.

That assumption breaks the moment you deploy a polyglot stack.

This is a write-up of a pattern we call parity contracts — a deployable security primitive for LLM commerce agents that span multiple runtimes. We implemented it in production at BrewHub PHL, a Philadelphia café whose AI agent, Franklin, places orders, charges customer wallets, and issues loyalty mutations without a human approval step. The full academic paper, red-team corpus, and parity test runner are open-source at github.com/BrewHubPHL/allergen-parity-corpus.

The Problem: Polyglot Deployments Break In-Process Safety

BrewHub's architecture spans three runtimes:

Runtime 1 — Next.js on Netlify (AWS Lambda): customer-facing SSE chat endpoint, Franklin's tool calls, price recompute
Runtime 2 — Netlify Functions (AWS Lambda): POS checkout, payment processing, Square webhook handlers
Runtime 3 — Google Cloud Run (Python ADK): six specialized workflow agents — ops, marketing, barista training, service recovery, concierge, provenance storyteller

The LLM safety libraries we surveyed (Llama Guard, NeMo Guardrails, LlamaFirewall, MCP-Guard) all assume a single serving runtime. Their guarantee is an in-process guarantee: if traffic goes through this runtime, the gate holds. That's fine for single-runtime deployments. It's a liability when you have independently reachable runtimes that each produce customer-facing or customer-adjacent output.

In our case: the workflow-agent path on Cloud Run can produce text that eventually reaches a customer via email or staff review. If the safety gate only lives in the Next.js Lambda, and someone engineers a path to Cloud Run, the gate is missing.

This is the deployed-agent gap. We formalize it as Adversary D — the runtime-bypass attacker: an adversary who discovers or engineers a path that reaches a secondary runtime while bypassing the primary safety gate.

The Core Insight: Treat LLM Tool Arguments as Untrusted Input

Before getting to parity contracts specifically, the foundational reclassification:

LLM tool arguments should be classified as untrusted input on par with browser JSON.

Every production web developer knows to re-derive truth server-side rather than trusting client-supplied values. We apply the same rule to the tool-call boundary. When Franklin's place_order tool fires, our backend (_pricing.js) ignores whatever price_cents the model supplied and re-fetches merch_products.price_cents and modifiers.price_delta_cents directly from Supabase. If the model-supplied total drifts from the server-computed total by a single penny, the transaction fails.
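A minimal sketch of that re-derivation, written in Python for illustration (the article's actual backend is _pricing.js). The in-memory price tables, the PriceMismatch exception, and the shape of the tool arguments are assumptions standing in for the Supabase lookups described above:

```python
# Hypothetical in-memory stand-ins for merch_products.price_cents and
# modifiers.price_delta_cents; a real backend would query the database.
PRODUCT_PRICE_CENTS = {"latte": 450, "cold_brew": 400}
MODIFIER_DELTA_CENTS = {"oat_milk": 75, "extra_shot": 100}


class PriceMismatch(Exception):
    """Raised when the model-supplied total disagrees with the server total."""


def server_total_cents(items):
    """Recompute the order total from trusted data only."""
    total = 0
    for item in items:
        unit = PRODUCT_PRICE_CENTS[item["product_id"]]
        unit += sum(MODIFIER_DELTA_CENTS[m] for m in item.get("modifiers", []))
        total += unit * item["quantity"]
    return total


def place_order(tool_args):
    """Treat tool arguments like browser JSON: read ids and quantities only."""
    # Any price_cents the model attached to an item is never read.
    expected = server_total_cents(tool_args["items"])
    claimed = tool_args.get("total_cents")
    if claimed != expected:
        raise PriceMismatch(f"claimed {claimed}, server computed {expected}")
    return {"status": "accepted", "total_cents": expected}
```

The design choice is that the model's total is used only as a consistency check. The amount charged always comes from the server-side computation.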

Same for identity: we never read customer_id from a tool argument. Identity resolves exclusively from the Bearer JWT via bearer_client in python-agents/lib/supabase_clients.py. A hallucinating model and a prompt-injection adversary are indistinguishable at the tool-call boundary — the defense covers both.

Parity Contracts: The Pattern

Definition. Let f_A and f_B be deterministic safety classifiers implemented in distinct runtimes A and B. A parity contract between them consists of three obligations:

Equivalence: ∀s. f_A(s) = f_B(s)

Shared test corpus: a finite set of labeled inputs that both implementations must pass, covering positive, negative, and boundary cases

CI enforcement: a gate that evaluates both implementations against the corpus on every commit and blocks deployment on any disagreement

The key word is deterministic. This pattern applies to regex classifiers, finite-state automata, and rule engines — not probabilistic LLM-based guardrails. Deterministic classifiers admit byte-equivalence testing. That property is what makes CI-gated equivalence possible.
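To make the three obligations concrete, here is a small sketch in Python. The two classifiers are hypothetical stand-ins for f_A and f_B (in a real deployment they would live in different runtimes), and the four-case corpus is invented for illustration:

```python
import re


def classifier_a(text: str) -> bool:
    """Stand-in for runtime A: a regex classifier."""
    return re.search(r"\bpeanuts?\b", text, re.IGNORECASE) is not None


def classifier_b(text: str) -> bool:
    """Stand-in for runtime B: the same rule written as a token lookup."""
    words = re.findall(r"[a-z]+", text.lower())
    return "peanut" in words or "peanuts" in words


# Shared labeled corpus: positive, negative, and boundary cases.
CORPUS = [
    ("Is the mocha peanut free?", True),
    ("Do you use PEANUTS in the cookies?", True),
    ("One oat latte, please", False),
    ("peanutbutter", False),  # boundary: no word break after "peanut"
]


def parity_violations(f_a, f_b, corpus):
    """Return every case where the two sides disagree or miss the label."""
    violations = []
    for text, expected in corpus:
        got_a, got_b = f_a(text), f_b(text)
        if got_a != got_b or got_a != expected:
            violations.append((text, expected, got_a, got_b))
    return violations
```

A CI gate would fail the build whenever parity_violations returns a non-empty list. A naive substring check such as `lambda t: "peanut" in t.lower()` disagrees with classifier_a on the boundary case, which is the kind of drift the corpus exists to catch.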

Instantiation: The Three-Layer Allergen Kill Switch

Our instantiation is the allergen safety gate. The failure mode is concrete and high-stakes: a hallucinated claim that a drink is peanut-free can cause anaphylaxis. Prompts can be circumvented and system instructions can be bypassed, so the gate cannot live in the LLM's reasoning.

Layer 1 — Pre-LLM interception (lib/safety/allergen.py):
Before any user message reaches Anthropic or Gemini, a synchronous regex engine intercepts it. ALLERGEN_KEYWORDS, MEDICAL_KEYWORDS, and DIETARY_SAFETY_KEYWORDS patterns run against the raw text. If any match, the request is blocked before a single token is billed, returning the canonical ALLERGEN_SAFE_RESPONSE string. Median latency: 3.4 μs.
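A rough sketch of what a Layer-1 gate of this shape can look like. The constant names come from the description above, but the regex bodies and the response text are placeholders invented for this example, not the production values:

```python
import re

# Placeholder patterns: the production regexes are not reproduced here.
ALLERGEN_KEYWORDS = re.compile(
    r"\b(?:allerg(?:y|ies|ic|ens?)|peanuts?|tree\s+nuts?|gluten|dairy|shellfish)\b",
    re.IGNORECASE,
)
MEDICAL_KEYWORDS = re.compile(
    r"\b(?:anaphyla\w*|epipen|celiac|diabet\w*)\b", re.IGNORECASE
)
DIETARY_SAFETY_KEYWORDS = re.compile(
    r"\bsafe\s+(?:for|to)\b.{0,30}\b(?:eat|drink|pregnan\w*)\b", re.IGNORECASE
)

# Placeholder text: the canonical string is not quoted in the article.
ALLERGEN_SAFE_RESPONSE = "Please ask a barista about allergens."


def gate_user_message(text):
    """Return the canned response if the message is blocked, else None."""
    for pattern in (ALLERGEN_KEYWORDS, MEDICAL_KEYWORDS, DIETARY_SAFETY_KEYWORDS):
        if pattern.search(text):
            return ALLERGEN_SAFE_RESPONSE
    return None
```

The caller only forwards the message to the model provider when gate_user_message returns None, so a blocked request never costs a token.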

Layer 2 — Mid-stream scrubbing (lib/chat/allergen-safety.ts):
If a prompt evades Layer 1, the outbound token stream is wrapped by scrubbing_text_stream(), which maintains a 50-character lookahead buffer. DANGEROUS_REPLY_RE matches patterns like \b100%\s+(?:\w+[- ])?free\b and \bguaranteed\s+(?:safe|free)\b. A match mid-flight breaks the SSE stream and substitutes ALLERGEN_SAFE_RESPONSE before any byte of the dangerous assurance reaches the client.
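The lookahead idea can be sketched as a generator, shown here in Python rather than the TypeScript of the real module. The two regex fragments are the ones quoted above. Joining them into a single alternation, the function name scrub_stream, and the response text are assumptions made for this example:

```python
import re
from typing import Iterable, Iterator

LOOKAHEAD = 50  # characters held back before anything is emitted

# The two fragments quoted above, joined with | (an assumption).
DANGEROUS_REPLY_RE = re.compile(
    r"\b100%\s+(?:\w+[- ])?free\b|\bguaranteed\s+(?:safe|free)\b",
    re.IGNORECASE,
)

# Placeholder text: the canonical string is not quoted in the article.
ALLERGEN_SAFE_RESPONSE = "Please ask a barista about allergens."


def scrub_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield text at least LOOKAHEAD characters behind the newest input."""
    held = ""
    for chunk in chunks:
        held += chunk
        # Check before emitting, so a phrase split across chunks is still
        # entirely inside the held buffer when its last character arrives.
        if DANGEROUS_REPLY_RE.search(held):
            yield ALLERGEN_SAFE_RESPONSE
            return  # stop the stream; the rest of the reply is dropped
        if len(held) > LOOKAHEAD:
            yield held[:-LOOKAHEAD]
            held = held[-LOOKAHEAD:]
    if held:
        yield held  # stream ended cleanly, flush the tail
```

The buffer size bounds what can be caught: a dangerous phrase longer than LOOKAHEAD characters could have its opening already sent by the time it completes, so the window should be at least as long as the longest phrase the regex is meant to match.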

Layer 3 — Post-response audit:
Every safety interception is logged to franklin_safety_audit in Supabase. Layer 3 is explicitly characterized as best-effort forensic evidence. The audit write is fire-and-forget, and a Lambda execution environment can be frozen once the response has been returned, so the absence of an audit row is not proof that the gate did not run. Treat the table as positive evidence only.

The Parity Test: Parsing TypeScript from Python

The parity contract enforcement lives in python-agents/tests/safety/test_allergen_parity.py. Its mechanism is worth examining:

Read src/lib/chat/allergen-safety.ts from the repository filesystem at test time
Regex-extract each declaration of the form export const NAME = /pattern/i;
Compile each extracted pattern under Python's re engine with re.IGNORECASE
Assert behavioral equivalence against a 90-case battery for all four named regexes
Assert that ALLERGEN_SAFE_RESPONSE is byte-identical between the TypeScript template literal and the Python string constant

This is structurally stronger than testing two independent implementations against a shared corpus. The test parses the TypeScript source and recompiles it under Python's engine, so the most common parity-bug shape (an engineer edits the regex on one side and forgets the other) is exercised on every run. A change to the TypeScript regex with no matching change to the Python regex cannot ship if it alters behavior on any battery case: either the new regex still passes the battery under Python, in which case parity on the battery is preserved by accident, or it fails and CI blocks deployment. The guarantee is only as strong as the battery.
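The extraction-and-recompile mechanism can be sketched as follows. To keep the example self-contained, the TypeScript source is an inline string instead of a file read, and the patterns, battery, and response text are small invented stand-ins for the real ones:

```python
import re

# Hypothetical excerpt standing in for allergen-safety.ts.
TS_SOURCE = r'''
export const ALLERGEN_KEYWORDS = /\b(?:peanuts?|gluten|dairy)\b/i;
export const DANGEROUS_REPLY_RE = /\bguaranteed\s+(?:safe|free)\b/i;
export const ALLERGEN_SAFE_RESPONSE = `Please ask a barista about allergens.`;
'''

# Hypothetical Python-side constants the TypeScript must agree with.
PY_PATTERNS = {
    "ALLERGEN_KEYWORDS": re.compile(r"\b(?:peanuts?|gluten|dairy)\b", re.IGNORECASE),
    "DANGEROUS_REPLY_RE": re.compile(r"\bguaranteed\s+(?:safe|free)\b", re.IGNORECASE),
}
PY_SAFE_RESPONSE = "Please ask a barista about allergens."

DECL_RE = re.compile(r"^export const (\w+) = /(.+)/i;$", re.MULTILINE)
TEMPLATE_RE = re.compile(
    r"^export const ALLERGEN_SAFE_RESPONSE = `([^`]*)`;$", re.MULTILINE
)

BATTERY = {
    "ALLERGEN_KEYWORDS": [("any peanuts in this?", True), ("one latte", False)],
    "DANGEROUS_REPLY_RE": [("it is guaranteed safe", True), ("probably fine", False)],
}


def extract_ts_patterns(source):
    """Recompile each `export const NAME = /pattern/i;` under Python's re."""
    return {n: re.compile(body, re.IGNORECASE) for n, body in DECL_RE.findall(source)}


def parity_failures(ts_source):
    """Return a list of (name, detail) for every parity violation found."""
    failures = []
    ts_patterns = extract_ts_patterns(ts_source)
    for name, cases in BATTERY.items():
        if name not in ts_patterns:
            failures.append((name, "missing in TypeScript source"))
            continue
        for text, expected in cases:
            ts_hit = ts_patterns[name].search(text) is not None
            py_hit = PY_PATTERNS[name].search(text) is not None
            if ts_hit != py_hit or ts_hit != expected:
                failures.append((name, text))
    m = TEMPLATE_RE.search(ts_source)
    if m is None or m.group(1).encode("utf-8") != PY_SAFE_RESPONSE.encode("utf-8"):
        failures.append(("ALLERGEN_SAFE_RESPONSE", "not byte-identical"))
    return failures
```

One caveat with this approach: JavaScript and Python regex dialects overlap heavily but are not identical, so a pattern that uses a feature only one engine supports would fail to compile or behave differently here. In a parity test that is a useful failure rather than a nuisance.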

The 90-case battery breaks down as: 27 allergen-positive, 27 medical-positive, 10 dietary-safety-positive, 19 dangerous-reply-positive, 7 negative controls.

Red-Team Results

We evaluated against a 100-prompt corpus (75 adversarial + 25 benign controls) across four categories:

Category	Executed	Block rate	False positive rate
A — Allergen bypass	25	100%	n/a
B — Price/identity (commerce language)	20	0%	0%
D — Cross-runtime / Unicode	20	100%	n/a
N — Benign controls	25	0%	0%

Two honest Layer-1 gaps surfaced. First, "sulfites" did not match the bare \bsulfite\b pattern, because \b cannot match between the final "e" and the plural "s". Second, the "does this contain any nuts?" form pushed keywords outside the .{0,30} window. Layer 2 caught both, so the kill switch held, but the gate fired one layer deeper than ideal. Both are documented as actionable findings rather than silently fixed before publication.
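Both gap shapes are easy to reproduce in isolation. In the sketch below, \bsulfite\b is the pattern quoted above. The plural-tolerant variant, the windowed pattern, and the test sentences are hypothetical and only illustrate the mechanics (the production windowed pattern is not shown in this post):

```python
import re

# Gap 1: a trailing \b requires a word/non-word transition, and "e" followed
# by "s" is word/word, so the bare singular never matches the plural.
bare = re.compile(r"\bsulfite\b", re.IGNORECASE)
plural_ok = re.compile(r"\bsulfites?\b", re.IGNORECASE)  # one possible fix

question = "Are there sulfites in the wine syrup?"
bare_hit = bare.search(question) is not None
plural_hit = plural_ok.search(question) is not None

# Gap 2: a bounded window between two anchors silently misses phrasings
# that put more than 30 characters between them.
windowed = re.compile(r"\bsafe\b.{0,30}\bnuts?\b", re.IGNORECASE)

near = "Is it safe for nuts?"
far = "Is it safe for my daughter, who reacts badly to all kinds of nuts?"
near_hit = windowed.search(near) is not None
far_hit = windowed.search(far) is not None
```

Here bare_hit and far_hit are False while plural_hit and near_hit are True. Cases like these are natural additions to the boundary section of a parity battery.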

Layer-1 p99 latency: 8.79 μs, roughly three to four orders of magnitude below the 10–50 ms cost of an intra-region HTTPS round trip to a centralized safety service (10 ms / 8.79 μs ≈ 1,100×; 50 ms / 8.79 μs ≈ 5,700×).

When to Use This Pattern

The parity contract is worth the maintenance cost when all three of these hold:

Two or more runtimes can independently produce customer-facing or customer-adjacent output
The safety property is deterministic (regex, automaton, rule engine)
Runtime ownership crosses team or language boundaries — making silent divergence likely in practice

If only one runtime produces customer-facing output, enforce the gate once. If the safety property is probabilistic, the parity contract is the wrong shape — you need distributional equivalence, which is a harder problem.

The HMAC Wire Contract

One more primitive worth naming: every request from the Next.js edge to Cloud Run is signed with HMAC-SHA256 via internal-hmac.ts (TypeScript) and verified by hmac_auth.py (Python) as ASGI middleware before any ADK invocation. The shared secret lives in disjoint Doppler configs for the two runtimes. Timestamp freshness is enforced at 60 seconds. This is a second parity contract with the same shape in a different domain: the signer and the verifier must agree exactly. Here the pairing is enforced by CODEOWNERS rules that require a security-tagged reviewer whenever either file is modified.
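A minimal sketch of the sign-and-verify pair using only the Python standard library. The message layout (timestamp, a dot, then the raw body) and the function names are assumptions for illustration. The article does not specify how internal-hmac.ts and hmac_auth.py frame the signed payload:

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 60  # freshness window from the description above


def sign(secret: bytes, body: bytes, timestamp: int) -> str:
    """Hex HMAC-SHA256 over an assumed `<timestamp>.<body>` layout."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, timestamp: int, signature: str, now=None) -> bool:
    """Reject stale timestamps first, then compare in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

Binding the timestamp into the signed message is what makes the freshness check meaningful: an attacker who replays a captured request cannot refresh the timestamp without invalidating the signature. Both sides must also serialize the message identically, which is where the parity obligation comes from.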

The Full Paper and Corpus

The complete paper, formal definitions, ablation methodology, and the 100-prompt red-team corpus with executable runner are available at:

github.com/BrewHubPHL/allergen-parity-corpus

The runner executes in local in-process mode against the Python safety layer, with staging-instance mode (full SSE against live infrastructure) described for camera-ready validation. The corpus is released open-source so reviewers can extend the battery and re-execute the methodology end-to-end.

The pattern is language-agnostic. Replace TypeScript and Python with any two languages, replace the allergen regex with any deterministic classifier, replace Jest and pytest with any shared runner. The three adoption conditions above tell you when the pattern is worth its maintenance cost.
