OpenClaw Isn't the Problem. Your Agent Architecture Is.

Frank Steegmans, Co-founder of Lambda-Phi


As of today, OpenClaw's creator Peter Steinberger has joined OpenAI. Sam Altman says Steinberger will "drive the next generation of personal agents" and that OpenClaw will "live in a foundation as an open source project that OpenAI will continue to support." Steinberger confirmed the project will stay open and independent.

Which means the architectural decisions baked into OpenClaw are no longer a hobbyist's concern. They are about to scale to every org that adopts OpenAI-backed agents.

And those architectural decisions have consequences. 135,000+ exposed instances. One-click remote code execution through a WebSocket hijack. Hundreds of malicious skills in the ClawHub registry. Cisco called it "groundbreaking" from a capability perspective and "an absolute nightmare" from a security perspective.

The security reports are right about the symptoms. The deeper issue is the trust boundaries: OpenClaw gives an LLM process direct access to secrets, broad shell execution, and full filesystem scope, all within a single boundary. If you keep that shape, you keep the same class of failures even after every known CVE is patched. The next vulnerability is a consequence of the architecture, not a gap in the test suite.

I spend my time building agent runtime infrastructure at Lambda-Phi. We hit the same fundamental tensions OpenClaw is now dealing with in public: how do you give an agent enough access to be useful without giving it enough access to be dangerous? We tried the naive approach first (give the LLM access to everything, put a sandbox around it). It broke in the same predictable ways OpenClaw is breaking now. What follows are the five architectural patterns we found to be non-negotiable, and the threat model behind them.

If you've evaluated OpenClaw, you probably hit the same wall: "This is powerful, but there's no way security signs off on an agent with shell access and API keys in env vars." This article is the engineering conversation that comes after that moment.


The threat model in one paragraph

An agent runtime is a privileged orchestrator. It ingests untrusted inputs (user messages, web content, documents, third-party skills). It reasons about those inputs using a probabilistic model that cannot distinguish between a helpful instruction and an attacker instruction embedded in a webpage. It then executes tools with real side effects: filesystem writes, shell commands, API calls, database queries. The agent often ends up with the access of a senior engineer and the obedience of a process that will faithfully follow instructions from whatever source is most convincing. Untrusted input must never become privileged action without a policy decision in between. Any architecture that doesn't enforce that boundary will eventually produce security failures.


The architecture at a glance

Before diving into individual patterns, here is how they fit together as a system. The goal is to prevent any single compromise from turning into total compromise, and to make the remaining failure modes visible and containable.

Architecture diagram


Security invariants

These are the rules that hold across all five patterns. If your agent architecture violates any of them, you have a structural gap.

  • Secrets never enter model context or process environment.
  • All side effects pass through a policy gate before execution.
  • Default-deny egress. The broker is the only outbound path.
  • Workspace is the maximum read/write scope by default.
  • Every side effect produces a structured, hash-linked trace entry.
  • Autonomy expands only on measured outcomes, and contracts automatically on measured failures.

These are checkable. You can audit any agent runtime against this list in an afternoon.


1. Secret firewalling: the LLM never touches your keys

SecurityScorecard's scan found 40,000+ exposed OpenClaw instances, and other researchers report 135,000+ overall, many leaking API keys, OAuth tokens, and credentials through publicly accessible control panels. This happens because OpenClaw puts secrets and the LLM process in the same environment. A prompt injection attack, a malicious skill, or a simple misconfiguration can exfiltrate anything the agent can see.

The architectural fix is a hard boundary: secret values never enter the LLM process environment. Not in environment variables. Not in the prompt context. Not in configuration files the agent can read.

Instead, the agent talks to a tool broker. When the agent decides "send a Slack message," it issues a structured action request. The broker receives that request, validates it against policy (is this agent allowed to send Slack messages? in this workspace? at this rate?), injects the credential at execution time, makes the call, and returns a sanitized result. The agent sees "message sent" or "permission denied." It never sees the token. The broker never returns secrets in responses or logs.

The broker does not accept free-form instructions. It accepts typed actions with a fixed schema, and every field is policy-checked before execution. This is not "ask the broker nicely." It is a constrained API with a small surface area.
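To make the shape concrete, here is a minimal sketch of a typed-action broker. The names (`ActionRequest`, `POLICY`, `SECRETS`) and the Slack integration are illustrative assumptions, not OpenClaw or Lambda-Phi APIs:

```python
# Sketch of a typed-action tool broker. All names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionRequest:
    agent_id: str
    integration: str   # e.g. "slack"
    action: str        # e.g. "post_message"
    params: dict       # schema-validated fields, never free-form text

# Policy: which agent may call which action, and within what scope.
POLICY = {
    ("agent-1", "slack", "post_message"): {"channels": {"#general"}},
}

# Secrets live only in the broker process, never in agent context.
SECRETS = {"slack": "xoxb-REDACTED"}

def execute(req: ActionRequest) -> dict:
    rule = POLICY.get((req.agent_id, req.integration, req.action))
    if rule is None:
        return {"status": "denied", "reason": "no policy for action"}
    if req.params.get("channel") not in rule["channels"]:
        return {"status": "denied", "reason": "channel out of scope"}
    token = SECRETS[req.integration]  # injected at execution time only
    # ... make the real API call with `token` here ...
    return {"status": "ok"}  # sanitized: no token in response or logs

print(execute(ActionRequest("agent-1", "slack", "post_message",
                            {"channel": "#general", "text": "hi"})))
print(execute(ActionRequest("agent-1", "slack", "post_message",
                            {"channel": "#random", "text": "hi"})))
```

The agent sees `{"status": "ok"}` or a denial reason, nothing more. Note that "#random" is denied by the same code path as a malicious request: the broker doesn't need to detect intent, only enforce scope.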

Firewalling diagram

The skeptical question is: "Can't the agent coerce the broker into doing bad things?" It can try. That's what the command gate (pattern 2) and trust governor (pattern 5) are for. The broker also enforces its own constraints: each integration is scoped by audience, workspace, rate limit, and cost ceiling. A request to "send a Slack message to #general" is a different policy decision than "send a Slack message to every channel in the workspace." The broker distinguishes between them.

Compromising the model does not automatically compromise your secrets. You've decoupled the two highest-risk components. This is the same principle behind every serious secrets management system. We just apply it to the one context where the industry somehow decided it was fine to skip.

2. Command gating: shell access is not a binary

CVE-2026-25253 showed that a single malicious link could disable OpenClaw's sandbox, modify safety configuration, and execute arbitrary commands. But even without that specific bug, OpenClaw's model has a structural problem: once you're past the security boundary, the agent has broad shell access. The sandbox is a perimeter. Inside it, everything is permitted.

Command gating replaces that binary with a policy layer. Every command the agent wants to execute passes through a filter before it reaches the operating system.

A naive approach is a blocklist: block rm -rf /, block curl, block wget. Security engineers know this fails immediately. An attacker (or an LLM following injected instructions) can trivially obfuscate: echo cm0gLXJmIC8= | base64 -d | sh bypasses any string-matching blocklist.

The approach that survives contact with reality combines several constraints:

Allowlisted binaries only. The agent can invoke cargo, grep, cat, mkdir. Everything else is denied by default. This is an allowlist, not a blocklist. For compilers and build tools, assume the build itself is untrusted: run them only inside the sandboxed workspace with no raw egress.

Structured argv execution, never sh -c. Commands execute via exec-style invocation with a validated argument vector. No shell interpretation layer. This eliminates an entire class of injection through metacharacters, pipes, redirects, backticks, subshells, and globbing.

Argument and path validation. Even allowed binaries are constrained. cat can read files inside the workspace directory. Paths are resolved via realpath and verified to remain under the workspace root before execution. No traversal to /etc/passwd or symlink escapes.

Per-task profiles. A documentation task gets read-only filesystem access and markdown tooling. A build task gets compiler access and test runners. A deployment task gets a different profile with network access explicitly enabled. No task gets everything.

Escalation path. When the agent requests a command outside its profile, the request doesn't silently fail. It gets logged and escalated: either to a broader profile (if policy allows) or to a human for review.

Composition without shell pipelines. If a task needs to chain operations, the runtime composes them explicitly through intermediate files or controlled streaming, not through arbitrary shell pipelines that reintroduce the injection surface.
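The first three constraints can be sketched in a few lines. This is a minimal illustration, with hypothetical profile contents, not a complete gate (flag validation and per-task profiles need far more care in practice):

```python
# Sketch of a command gate: allowlist + argv execution + path containment.
# Profile contents and the workspace path are illustrative assumptions.
import os

WORKSPACE = "/srv/agent/workspace"
PROFILES = {
    "docs":  {"binaries": {"cat", "grep", "mkdir"}},
    "build": {"binaries": {"cat", "grep", "mkdir", "cargo"}},
}

def gate(profile: str, argv: list[str]) -> tuple[bool, str]:
    """Return (allowed, reason). Execution would use exec-style argv,
    never `sh -c`, so there is no shell interpretation layer."""
    if not argv:
        return False, "empty argv"
    if argv[0] not in PROFILES[profile]["binaries"]:
        return False, f"binary {argv[0]!r} not in profile {profile!r}"
    for arg in argv[1:]:
        if arg.startswith("-"):
            continue  # flags need their own validation in a real gate
        # Resolve the path and verify it stays under the workspace root.
        resolved = os.path.realpath(os.path.join(WORKSPACE, arg))
        if not resolved.startswith(WORKSPACE + os.sep):
            return False, f"path escapes workspace: {arg}"
    return True, "ok"

print(gate("docs", ["cat", "notes.md"]))          # allowed
print(gate("docs", ["curl", "http://evil"]))      # denied: not allowlisted
print(gate("docs", ["cat", "../../etc/passwd"]))  # denied: path escape
```

Note that the base64 obfuscation trick from above never gets a chance to run: `sh` and `base64 -d | sh` pipelines simply don't exist in this model, because there is no shell to pipe through.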

We codified this when building the command gating layer in our agent runtime. Every blocked command gets logged to the trace ledger. This means you can see exactly what an agent tried to do and was prevented from doing, which is often more informative than what it successfully did.

3. Tamper-evident trace logging: prove what happened

When something goes wrong with an OpenClaw deployment, and the data says things go wrong regularly, the audit trail is conversation history. Chat logs are mutable, unstructured, and trivially editable. They cannot answer "did this agent take this specific action at this specific time?" with any verifiable confidence.

The alternative is an append-only action ledger with tamper evidence. Every action the agent takes (command execution, file read, file write, API call, tool invocation) produces a log entry. Each entry includes a cryptographic hash of the previous entry. The chain uses the same hash-linking found in certificate transparency logs and git commit chains.

An important caveat upfront: tamper evidence proves integrity of what was captured. It cannot prove you captured everything unless the runtime is the only path to side effects. That is why workspace isolation (pattern 4) and broker-only egress (pattern 1) exist. The patterns reinforce each other. Verification without containment is incomplete. Containment without verification is unauditable.
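The hash-linking itself is simple. Here is a minimal sketch (entry fields are illustrative; a production ledger would also sign entries and persist them append-only):

```python
# Sketch of a hash-linked action ledger, using the same chaining idea
# as git commits and certificate transparency logs.
import hashlib
import json

class Ledger:
    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis hash

    def append(self, action: dict) -> dict:
        body = {"action": action, "prev": self._prev}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        entry = {**body, "hash": digest}
        self.entries.append(entry)
        self._prev = digest
        return entry

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = {"action": e["action"], "prev": prev}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True

ledger = Ledger()
ledger.append({"type": "exec", "argv": ["cargo", "test"]})
ledger.append({"type": "write", "path": "src/lib.rs"})
print(ledger.verify())   # True
ledger.entries[0]["action"]["argv"] = ["rm", "-rf", "/"]  # tamper
print(ledger.verify())   # False: the chain breaks at the edited entry
```

Editing any past entry invalidates every hash after it, which is exactly the property chat logs lack.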

We ended up needing this just to debug our own long-running agent workflows. When an agent task produces a wrong result three steps into a ten-step process, you need to trace back through the decision chain. An unstructured chat log makes this painful. A structured, indexed action ledger makes it a query.

Why this matters beyond debugging:

Automated reliability scoring. A verifiable record of agent behavior across hundreds of tasks gives you measured error rates: how often does it produce code that fails tests? How often does it request commands outside its profile? How often does it need human intervention? These metrics feed directly into the earned autonomy model in pattern 5.

Compliance tractability. Hand an auditor a verifiable log with tamper evidence instead of a folder of chat transcripts. For regulated environments, this is the difference between "we think the agent did X" and "we can cryptographically demonstrate the log has not been altered since recording."
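Once traces are structured, the reliability metrics fall out of a query. A sketch, with an assumed entry shape (a real ledger would be indexed and queried, not scanned in memory):

```python
# Sketch: deriving reliability metrics from structured trace entries.
# The entry fields are illustrative assumptions.
trace = [
    {"task": "build", "tests_passed": True,  "gate_rejections": 0},
    {"task": "build", "tests_passed": True,  "gate_rejections": 1},
    {"task": "build", "tests_passed": False, "gate_rejections": 0},
]

def reliability(entries: list[dict], task: str) -> dict:
    runs = [e for e in entries if e["task"] == task]
    if not runs:
        return {"runs": 0}
    return {
        "runs": len(runs),
        "test_pass_rate": sum(e["tests_passed"] for e in runs) / len(runs),
        "gate_rejection_rate":
            sum(e["gate_rejections"] > 0 for e in runs) / len(runs),
    }

print(reliability(trace, "build"))
```

These are the numbers that drive promotion and demotion in pattern 5, which is why they must come from a tamper-evident source rather than the agent's own self-reporting.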

4. Workspace isolation: contain the blast radius

Bitsight found over 30,000 exposed instances in a two-week scan, many on corporate infrastructure. SecurityScorecard correlated 549 of those with prior breach activity. When OpenClaw runs with full access to the host machine, a single compromise gives the attacker everything the machine can touch: credential stores, browser sessions, messaging accounts, connected services.

Workspace isolation is the structural countermeasure. Each task executes inside a bounded filesystem scope: its own directory, its own set of permitted files. Task A cannot read Task B's workspace. Neither can access system-level resources outside their boundary.

In a container deployment, the isolation extends further: the agent process runs with no host network access by default, no visibility into other tasks, and no access to the host filesystem outside the mounted workspace. If the task is compromised, the attacker gets that workspace's contents and nothing else.

To be clear: this is about blast radius, not perfect containment. A determined attacker with kernel exploits can escape containers. The goal is that one task compromise does not become an organizational compromise. That's a meaningful improvement even when the isolation isn't theoretically perfect.

The piece that ties this to the broker architecture (pattern 1): the container has no outbound network access except through the tool broker. The agent cannot exfiltrate data by opening a raw socket to an external server. If it wants to interact with the outside world, it goes through the broker, which enforces policy and logs every action.

"But then the broker is your exfiltration path." Yes. That is exactly why it rate-limits, scopes to specific integrations, logs every call, and can be configured to require human approval for high-risk actions. A controlled, auditable egress point is strictly better than an unrestricted network socket.

5. Earned autonomy: trust is a gradient, not a switch

OpenClaw's permission model is binary. Sandbox on or off. Safety prompts enabled or disabled. CVE-2026-25253 showed that an attacker could flip these switches remotely, but the deeper problem is that on/off is the wrong abstraction for agent trust.

Autonomy should be a gradient with explicit levels:

Level 1: Propose and wait. The agent plans an action and presents it for human approval before executing. Every command, every file write, every API call gets reviewed. This is the right default for new agents, untested task types, or high-risk operations.

Level 2: Act and report. The agent executes autonomously and produces a structured report for async human review. Appropriate for agents that have demonstrated reliability on a specific task class, measured by the trace data from pattern 3.

Level 3: Act and alert on exceptions. The agent runs autonomously and only surfaces issues when something unexpected happens: a test failure, a cost spike, a policy violation, a command gate rejection. The human is a safety net, not an active reviewer.

Earned Autonomy diagram

The critical mechanism is that promotion and demotion between levels is driven by policy tied to measured outcomes:

  • An agent that passes 50 consecutive build-and-test cycles with no regressions earns Level 2 for build tasks.
  • An agent that causes a test regression gets demoted back to Level 1 for that task class. Automatically. Not after a human notices.
  • An agent that triggers a cost anomaly (say, token spend 3x above the rolling average) gets frozen. Not demoted. Frozen. A human must explicitly unfreeze it after review.
  • An agent that violates a command gate policy gets quarantined and its trace log flagged for forensic review.

Quarantine triggers can also include: repeated gate rejections, unusual file access patterns, or an unexpected tool call sequence for a given task class. You are effectively building anomaly detection into the trust model.
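The policy above is small enough to sketch as a state machine. Thresholds follow the examples in the text; the class and its names are illustrative, and a real governor would track per-task-class state rather than one global level:

```python
# Sketch of the promotion/demotion policy as a small state machine.
# Thresholds mirror the article's examples; names are illustrative.
PROMOTE_AFTER = 50   # consecutive clean cycles to earn Level 2
COST_ANOMALY = 3.0   # token spend multiple of the rolling average

class TrustGovernor:
    def __init__(self):
        self.level = 1
        self.state = "active"    # active | frozen | quarantined
        self.clean_streak = 0

    def record(self, *, regression=False, cost_ratio=1.0,
               gate_violation=False):
        if gate_violation:
            self.state = "quarantined"  # flag trace for forensic review
            return
        if cost_ratio >= COST_ANOMALY:
            self.state = "frozen"       # human must explicitly unfreeze
            return
        if regression:
            self.level = 1              # automatic demotion
            self.clean_streak = 0
            return
        self.clean_streak += 1
        if self.level == 1 and self.clean_streak >= PROMOTE_AFTER:
            self.level = 2              # earned, not granted
            # Level 3 promotion would follow the same evidence-based rule.

gov = TrustGovernor()
for _ in range(50):
    gov.record()
print(gov.level)                # 2: promoted after 50 clean cycles
gov.record(regression=True)
print(gov.level)                # 1: demoted automatically
gov.record(cost_ratio=3.5)
print(gov.state)                # frozen: safety chosen over throughput
```

The point of writing it this way is that the trust decision is code reviewing data, not a human reviewing vibes.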

Freeze and quarantine are first-class states, not error conditions. They exist because the system detected something outside expected parameters and chose safety over throughput. That's the behavior you want from infrastructure you're trusting with real work.

Without earned autonomy, you get one of two failure modes: either you constrain agents so tightly they can't do useful work (the "demo that never graduates to production" problem), or you give them full autonomy from day one and hope nothing goes wrong (the OpenClaw model). The gradient is what lets you start safe and expand trust based on evidence.


What to do right now if you're running OpenClaw

If you've evaluated these patterns and decided you need to build or adopt a different agent runtime, that's a longer conversation. But if you're running OpenClaw today and need to reduce risk immediately, here are the steps that matter most:

Verify the gateway binds to loopback and don't widen it. OpenClaw's own security docs default to loopback, but many of the 135,000+ exposed instances got there by changing the bind mode, port-forwarding the gateway, or deploying on a VPS without a firewall. Check your gateway.bind setting. If you've changed it to 0.0.0.0 or exposed the port through a reverse proxy, undo that unless you also have authentication and a real firewall allowlist in place.
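This check is easy to automate. A sketch, assuming a hypothetical JSON config file with a gateway.bind key; adapt the parsing to whatever format your deployment actually uses:

```python
# Sketch of a bind-mode audit, assuming a hypothetical JSON config
# shape with a gateway.bind key. Adapt to the real config format.
import json

def check_bind(config_text: str) -> str:
    cfg = json.loads(config_text)
    bind = cfg.get("gateway", {}).get("bind", "loopback")
    if bind in ("loopback", "127.0.0.1", "localhost"):
        return "ok: gateway is loopback-only"
    return f"WARNING: gateway bound to {bind!r}; require auth + firewall"

print(check_bind('{"gateway": {"bind": "loopback"}}'))
print(check_bind('{"gateway": {"bind": "0.0.0.0"}}'))
```

Run it in CI or a cron job so a config drift toward 0.0.0.0 gets caught before a scanner catches it for you.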

Run it inside a container with no host network. Use Docker with restricted capabilities. Mount only the workspace directory. Drop NET_RAW and other unnecessary capabilities. Don't use --network host. Treat the agent like you'd treat any code you downloaded from a stranger's GitHub repo, because that's what third-party skills are.

Assume third-party skills are untrusted code. If you can't audit a skill, don't run it. Use Cisco's open-source Skill Scanner or Bitdefender's AI Skills Checker if you must evaluate one.

Create a dedicated service account. Don't run OpenClaw as your primary user. Create a restricted account with access only to the directories and services the agent actually needs. If it gets compromised, the attacker inherits a limited account instead of your full identity.

Rotate every credential the agent could have accessed. If you've been running OpenClaw with API keys in its environment, assume those keys have been exposed. Rotate them now. Then set up a rotation schedule going forward.

These are harm reduction measures. They meaningfully shrink the blast radius of a compromise, but they don't change the underlying architecture. For that, you need the patterns described above, whether you build them yourself, adopt a framework that implements them, or wait for OpenClaw to evolve in that direction under its new OpenAI stewardship.


Where this is going

Steinberger joining OpenAI validates what 180,000 GitHub stars already told us: autonomous personal agents are a real category, not a fad. OpenAI backing the project as an open-source foundation means more resources, more adoption, and more organizations discovering the gap between "agents that work" and "agents that are safe to trust with real work."

The industry is learning, in public and at speed, that autonomous agents require fundamentally different infrastructure than chatbots. The security invariants listed above are not opinions. They are engineering requirements that any team building production agent systems will arrive at eventually. The question is whether you arrive there before or after the incident.

That's the question we work on at Lambda-Phi. We happen to be implementing this stack (we call the layers Argos and Daedalus), but the checklist and the patterns stand on their own. If they help you build better agent infrastructure without ever talking to us, that's a good outcome.

I've been drafting a practical threat model template and a starter command-gating policy based on what we've built. If you want a copy to adapt for your own stack, shoot me an email and I'll send it over. Or if you want a threat-model review of your specific agent deployment, that's the kind of work we do.

Reach me at frank+openclaw+dev.to@lambda-phi.com or LinkedIn.


Frank Steegmans is co-founder of Lambda-Phi, where he builds secure agent runtime and governance infrastructure. He has spent the last several months designing and implementing agent systems with secret firewalling, command gating, tamper-evident tracing, and earned autonomy frameworks.
