Guatu

Posted on Jun 10 • Originally published at guatulabs.dev

Agent Glass-Break Patterns: Controlled Escalation for Production

#aiagents #security #mcpservers #kubernetes

I watched an autonomous ops agent attempt to "fix" a failing deployment by recursively deleting pods in a loop because it misinterpreted a CrashLoopBackOff as a transient networking glitch. The agent had the permissions to do it, the logic to justify it, and absolutely no circuit breaker to stop it from taking down the entire namespace. It was a classic case of giving a tool a hammer and watching it treat the entire infrastructure like a nail.

If you're running agents in production, you've probably realized that the standard "system prompt" safety is a joke. Telling an LLM "please be careful with the production database" is not a security boundary. You need a glass-break pattern: a way for agents to operate within a strict sandbox, but with a controlled, audited path to escalate privileges when a human approves it or a specific condition is met.

What I tried first

My first instinct was to lean on centralized identity. I tried routing every agent tool call through an Authentik-protected gateway. The idea was simple: the agent requests a tool, the gateway checks the session, and the action is authorized.

It was a nightmare. The latency added by the OIDC handshake for every single tool call made the agent feel sluggish, and the integration overhead for low-sensitivity observability tools was absurd. I spent more time debugging JWT expiration and redirect loops than actually building agent capabilities. I was treating a low-sensitivity internal tool like a public-facing enterprise application.

Then I tried the "Super-User" approach. I gave the agent a high-privilege service account but wrapped it in a complex set of Python decorators that checked for "safe" keywords in the arguments. This failed immediately. LLMs are too easily steered by prompt injection and too inventive with parameters. A simple --force flag or a clever string concatenation bypassed my "safety" filters in minutes.
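A minimal sketch of why this kind of filter loses, assuming it scans the raw command string (the function name and the banned list are hypothetical, not the original decorators). The filter sees one spelling, while shell-style parsing, modelled here with Python's shlex, hands the binary another:

```python
import shlex

# Hypothetical denylist, standing in for the "safe keyword" decorators.
BANNED_KEYWORDS = ["--force", "--delete-all"]


def naive_is_safe(command: str) -> bool:
    """Return True when no banned keyword appears in the raw command string."""
    return not any(keyword in command for keyword in BANNED_KEYWORDS)


# Empty quotes split the keyword in the raw string but vanish during parsing.
sneaky = 'kubectl delete pod web-0 --fo""rce'

passes_filter = naive_is_safe(sneaky)  # the substring "--force" never appears
real_argv = shlex.split(sneaky)        # POSIX-style quote removal rejoins the flag
```

Checking the parsed argument list instead of the raw string closes this particular hole, but a denylist still only blocks the spellings someone thought of in advance.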

The Actual Solution: Controlled Escalation

The fix was to move the security boundary from the application layer to the infrastructure and execution layer. I implemented a three-pronged approach: Network-level isolation for internal tools, safeBins for execution control, and a manual escalation trigger.

1. Infrastructure-Level Isolation

Instead of forcing every internal tool through a heavy auth layer, I shifted to a LAN-only access model using Kubernetes NetworkPolicy. This ensures that only the agent orchestrator can talk to the tool, and only from a specific subnet.

For a tool like Agent Quest, I stripped out the Authentik dependency and locked it down at the pod level:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: traefik-allow-egress-to-agentquest
spec:
  podSelector:
    matchLabels:
      app: traefik
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.140/32 # The specific IP of the Agent Quest service
    ports:
    - protocol: TCP
      port: 4444

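This removes the auth overhead while keeping the tool reachable only over a known internal path, which aligns with the privacy-routed inference pattern of keeping sensitive traffic off the open wire. One caveat: an Egress policy on the Traefik pods only limits where Traefik can send traffic. To stop other pods or namespaces from calling the tool directly, the tool's side needs its own Ingress policy (or a host firewall, if 10.0.0.140 sits outside the cluster).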

2. Execution Control with safeBins

For tools that actually execute shell commands, like mcporter, I stopped relying on regex filters. I implemented a safeBins pattern. This is essentially an allowlist of binaries and the specific flags they are permitted to use.

If the agent tries to pass a flag not in the allowedValueFlags list, the execution engine rejects the call before anything reaches the shell.

{
  "safeBins": {
    "mcporter": {
      "allowedValueFlags": ["--config", "--timeout"],
      "forbiddenFlags": ["--force", "--recursive", "--delete-all"]
    },
    "kubectl": {
      "allowedValueFlags": ["--dry-run=client", "-n"],
      "restrictedCommands": ["delete", "patch"]
    }
  }
}

This forces the agent to operate in a "read-only" or "safe-write" mode by default. If the agent needs to do something destructive, it cannot simply "decide" to do it; it must trigger the glass-break.
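A check along these lines could look like the Python sketch below. The policy mirrors the JSON above, but the matching rules are my assumptions, not a documented safeBins spec: a binary must be listed, every flag must match an allowed entry either whole or by its name before "=", and any positional token found in restrictedCommands is refused. check_command and Verdict are hypothetical names.

```python
import shlex
from typing import NamedTuple

SAFE_BINS = {
    "mcporter": {
        "allowedValueFlags": ["--config", "--timeout"],
        "forbiddenFlags": ["--force", "--recursive", "--delete-all"],
    },
    "kubectl": {
        "allowedValueFlags": ["--dry-run=client", "-n"],
        "restrictedCommands": ["delete", "patch"],
    },
}


class Verdict(NamedTuple):
    allowed: bool
    reason: str


def check_command(command: str, safe_bins: dict = SAFE_BINS) -> Verdict:
    """Validate a command against the allowlist before anything is executed."""
    argv = shlex.split(command)  # check what the binary would actually receive
    if not argv:
        return Verdict(False, "empty command")
    binary, args = argv[0], argv[1:]
    policy = safe_bins.get(binary)
    if policy is None:
        return Verdict(False, f"binary not allowlisted: {binary}")

    allowed = set(policy.get("allowedValueFlags", []))
    forbidden = set(policy.get("forbiddenFlags", []))
    restricted = set(policy.get("restrictedCommands", []))

    for token in args:
        if token.startswith("-"):
            name = token.split("=", 1)[0]  # "--timeout=30" -> "--timeout"
            if token in forbidden or name in forbidden:
                return Verdict(False, f"forbidden flag: {name}")
            if token not in allowed and name not in allowed:
                return Verdict(False, f"flag not allowlisted: {name}")
        elif token in restricted:
            return Verdict(False, f"restricted command: {token}")
    return Verdict(True, "ok")
```

The explicit reason string is what gets handed back to the agent and forwarded with the escalation request. If the command passes, it should be run from the parsed argument list without a shell, so the thing that was checked is the thing that runs.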

3. The Glass-Break Escalation

When the agent hits a safeBins restriction or a NetworkPolicy block, it triggers an escalation event. I integrated this with an n8n workflow that sends a Slack notification to me with the exact command the agent wants to run and the reasoning behind it.

The workflow looks like this:

1. Agent fails a safeBins check.
2. The error is caught by the orchestrator and pushed to an n8n webhook.
3. n8n sends a message: "Agent X wants to run mcporter --force. Reason: 'Pod is stuck in Terminating'. Approve?"
4. I click "Approve," which updates a temporary Redis key granting the agent a 5-minute window of escalated privileges.
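The approval window in the final step can be sketched as follows. This is an in-memory stand-in for the Redis key, with an injectable clock so the expiry can be tested without waiting; GrantStore and its method names are hypothetical. With a real Redis, a key written with an expiry (for example SET with the EX option) would play the same role.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class GrantStore:
    """In-memory stand-in for a Redis key with a TTL (illustrative only)."""

    clock: Callable[[], float] = time.monotonic
    _expiry: Dict[str, float] = field(default_factory=dict)

    def approve(self, agent_id: str, ttl_seconds: float = 300.0) -> None:
        """Record a human approval that lapses after ttl_seconds."""
        self._expiry[agent_id] = self.clock() + ttl_seconds

    def is_escalated(self, agent_id: str) -> bool:
        """Return True only while an unexpired grant exists for this agent."""
        expires_at = self._expiry.get(agent_id)
        if expires_at is None:
            return False
        if self.clock() >= expires_at:
            del self._expiry[agent_id]  # drop the stale grant
            return False
        return True


# Usage with a fake clock, so the 5-minute window is visible without sleeping.
now = [0.0]
store = GrantStore(clock=lambda: now[0])
before_approval = store.is_escalated("agent-x")
store.approve("agent-x")
now[0] = 299.0
inside_window = store.is_escalated("agent-x")
now[0] = 300.0
after_window = store.is_escalated("agent-x")
```

The grant here is scoped to the agent, matching the description above. Scoping it to the exact approved command would be a tighter variant.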

Why it works

This works because it acknowledges that the LLM is an unreliable narrator. You cannot trust the agent to follow safety guidelines, but you can trust the Linux kernel and the Kubernetes API.

By moving the constraints to the binary level (safeBins) and the network level (NetworkPolicy), we create a hard boundary. The agent can hallucinate all it wants, but it cannot execute a --force flag if the execution wrapper doesn't allow it.

Combining this with the two-tier service account model ensures that even if the agent escalates, it's using a token with a strictly defined TTL. The "glass-break" isn't just a permission change; it's a temporary shift in the security posture of the system.

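For the MSAM (Model State Management) integration, I had to rewrite the server-side tools using FastMCP to support this. I put the service behind a dedicated IngressRoute so that escalation triggers arrive through one known entry point. The route itself only matches on host; restricting callers to trusted internal IPs still depends on an IP allowlist middleware or a NetworkPolicy attached alongside it: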

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: msam-ingress
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`msam.example.com`)
      kind: Rule
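      services:
        - name: msam-service
          port: 443
          scheme: https
          # insecureSkipVerify is not a field on an IngressRoute service. It lives on a
          # ServersTransport resource, referenced here by name (the name is an assumption).
          serversTransport: msam-transport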

Lessons Learned

The biggest surprise was how much the agent actually prefers these constraints. When the agent knows exactly what the boundaries are (because the error messages from safeBins are explicit), it stops trying to guess and starts asking for help. It turns a "failure" into a "collaboration."

If I were doing this again, I'd automate the memory index rebuilds more aggressively. I found that when I escalated an agent to fix a model registry mismatch (like the OpenClaw v2026.3.12 issue where codex-5.4 wasn't recognized), the agent often forgot that it had already tried a specific fix. I had to implement a rebuild-memory-index.py script to ensure the agent's long-term memory was synced with the actual state of the registry after a glass-break event.

A few caveats:

Latency: The human-in-the-loop part of the glass-break is a bottleneck. If you're in a high-availability environment, you'll need to define "Auto-Escalation" rules for low-risk tasks.
Complexity: You're adding a layer of middleware between the agent and the tool. If your middleware crashes, your agent is blind. I run my orchestration layer with a strict Recreate strategy on Kubernetes to avoid the split-brain issues I've seen with Ollama deployments.
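One way to take the edge off the latency caveat is a small decision function. The sketch below assumes a hypothetical set of low-risk command prefixes that auto-escalate, and a deny-by-default timeout for everything else; the names, the prefix list, and the 10-minute default are all illustrative.

```python
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    APPROVED = "approved"
    DENIED = "denied"
    PENDING = "pending"


# Hypothetical low-risk command prefixes that may escalate without a human.
AUTO_ESCALATE_PREFIXES = [("kubectl", "rollout", "restart")]


def is_low_risk(argv: List[str]) -> bool:
    """True when the command starts with one of the auto-escalation prefixes."""
    return any(tuple(argv[: len(p)]) == p for p in AUTO_ESCALATE_PREFIXES)


def resolve(
    argv: List[str],
    human_answer: Optional[bool],
    waited_seconds: float,
    timeout_seconds: float = 600.0,
) -> Decision:
    """Decide an escalation request; silence past the timeout means deny."""
    if is_low_risk(argv):
        return Decision.APPROVED
    if human_answer is not None:
        return Decision.APPROVED if human_answer else Decision.DENIED
    if waited_seconds >= timeout_seconds:
        return Decision.DENIED
    return Decision.PENDING
```

While a request is PENDING or DENIED, the agent can keep doing non-escalated work; only the destructive action is held back.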

Ultimately, production AI isn't about building the smartest agent. It's about building the most reliable cage for that agent to live in. The glass-break pattern allows the agent to be useful without giving it the keys to the kingdom.

Top comments (1)

Max Quimby • Jun 10

The "system prompt safety is a joke" line is the whole post, honestly. We learned the same thing the hard way — every guardrail that lives in the prompt or in argument-string regex eventually loses to the model's creativity. Pushing the boundary into NetworkPolicy + a binary/flag allowlist is right because it's enforced by something the LLM can't talk its way past.

One caution on safeBins: I'd lean even harder on the allowlist and treat forbiddenFlags as belt-and-suspenders, not the primary control. Denylists fail on the flag you didn't think of: kubectl delete has aliases, and you can still run kubectl replace --force or apply a manifest that deletes. Allowlisting the handful of safe verbs ages a lot better.

Curious how you handle the human-in-the-loop latency for unattended runs: if a glass-break fires at 3am while you're asleep, does the agent block, fail the task, or drop to a read-only summary? We ended up with a tiered timeout — auto-deny the destructive action after N minutes but let the agent keep doing non-escalated work — and it genuinely changed how "autonomous" the thing felt overnight.