云微

Posted on Jun 2

Runtime Observability and Enforcement for Opaque AI Agents with eBPF: Beyond Sandboxes and Approvals

#ebpf #ai #security #observability

AI coding agents now run for hours, complete entire features end-to-end,
optimize production GPU kernels, and merge thousands of pull requests
autonomously. Meanwhile, most agent security still relies on human-in-the-loop
approval, and Anthropic's own data shows Claude Code users approve 93% of
permission prompts, a rate that suggests rubber-stamping rather than
meaningful review. The result is predictable: products add bypass modes, users
disable permission gates, and 65% of firms report agent security incidents.

But the deeper problem is not approval fatigue. It is that the agent harness
(the prompt loop, tool routing, permission logic, and sandbox defaults) is
increasingly a third-party product the platform team did not write, running in a
sandbox the platform team may not own. The harness is not a trusted security
boundary. This post argues for separating agent security into three layers with
three different owners: intent authorization (harness-owned), execution
isolation (ownership contested), and side-effect verification (must be
platform-owned). When the layers agree, you have confidence. When they
disagree, you need independent observability and enforcement at the OS level to
detect it, and that is exactly the layer most agent platforms are missing. We
are building two projects in this direction: AgentSight for runtime
observation and ActPlane for runtime harness enforcement, both using eBPF to
provide independent observability and enforcement below the agent harness.

Why Now: Complexity Up, Guardrails Behind

The important change in 2026 is not that agents exist. It is the scale and
duration of what they do.

A year ago, the typical agent task was "fix this bug" or "write this function."
In 2026, agents routinely run for hours on complex, multi-step work. OpenAI
documented a Codex session that ran for 25 hours uninterrupted,
consuming 13 million tokens and producing 30,000 lines of code from a blank
repository. Anthropic's agentic coding report cites a 12.5-million-line
codebase change completed in a single 7-hour run. Meta's
KernelEvolve uses multi-agent coordination to write and optimize
production GPU kernels, compressing work that previously required weeks of
expert systems engineering into hours. On SWE-bench Verified, top agents now
resolve 60–70% of real GitHub issues, up from under 30% in early

Devin has merged hundreds of thousands of pull requests across enterprise customers with a 67% merge rate. Goldman Sachs deployed hundreds of Devin instances across a 12,000-person engineering team.

Beyond coding, general-purpose autonomous agents have gone mainstream.
OpenClaw, an open-source agent with
over 300,000 GitHub stars, connects to LLMs and executes shell commands,
browser automation, email, calendar, and file operations on the user's machine.
CrowdStrike called it "the AI Super Agent" security teams need to worry
about: between January and April 2026, 470 security advisories were filed
against it across three disclosure waves.

These are not research demos. They are production workflows: background tasks,
parallel execution, multi-hour sessions, end-to-end feature development, kernel
optimization, and enterprise-scale code changes.

Meanwhile, the guardrails designed to keep agents safe have not kept pace.

Most agent security still relies on human-in-the-loop approval: a prompt asks
the user to approve or deny each action before it executes. This works for short
sessions with a few tool calls. It does not work when an agent makes hundreds of
decisions over hours of autonomous operation.

The evidence suggests that approval-based control is already failing in
practice. Anthropic's own data shows that Claude Code users approve 93% of
permission prompts, a rate consistent with rubber-stamping
rather than meaningful review. An independent stress test of Claude Code's auto
mode found an 81% false negative rate on ambiguous
state-changing actions, meaning the classifier allowed 4 out of 5 actions that
should have required human review. Real incidents have followed: in documented
cases, users running agents without permission gates had their home directories
deleted by rm -rf commands the agent generated. A 2026 industry survey found
that 65% of firms reported AI agent security incidents, primarily unauthorized
data access, credential exposure, and exfiltration to external endpoints, with
most involving organizations lacking proper agent access controls.

Products have responded by adding bypass mechanisms. Claude Code offers
--dangerously-skip-permissions. Windsurf's Cascade agent proceeds
autonomously where Cursor stops to ask. Community guides now
focus on "how to safely use YOLO mode." Anthropic researcher Nicholas Carlini
ran 16 parallel Claude agents with permissions bypassed, with the
caveat: "Run this in a container, not your actual machine."

This is the tension: the more capable agents become, the more users want to
let them run uninterrupted, and the less effective human-in-the-loop becomes as
the primary security boundary.

That tension is what creates the need for a different security model.

The Accountability Gap

The deeper issue is not just that agents are more capable. It is that the agent
harness, the component that decides what the agent does, is increasingly a
third-party product the platform team did not write.

A modern agent harness is not a thin wrapper around a model. It includes a
prompt loop, planning and retry logic, tool routing, MCP clients, permission
modes, approval gates, hooks, memory, logs, credential handling, and sometimes
sandbox defaults. In many deployments, that harness comes from a hosted
coding-agent service or an open-source framework the platform team does not
control.

This is already visible across the ecosystem. GitHub Copilot's coding
agent runs autonomously in GitHub Actions, researching
repositories, creating plans, making changes, and opening pull requests. OpenAI
Codex runs background tasks in sandboxed cloud environments with
controlled network access. Claude Code runs cloud sessions in Anthropic-managed
VMs with scoped credentials. A Kubernetes SIG is defining Agent
Sandbox for isolated, stateful agent workloads. Recent research
datasets show agent-authored pull requests at scale across real
repositories.

The ownership split is now explicit in major platforms. Anthropic's shared
responsibility framework divides agent security into four
layers (Model, Harness, Tools, Environment) and
stresses that an agent's behavior depends on all four working together, so the
harness, tools, and environment, the layers shaped by the deploying party, are
as decisive as the model itself. Anthropic itself notes that even together,
these layered safeguards are not a guarantee. The question the framework
leaves open is what happens when a failure crosses these layers, and whether
the deployer has independent observability to detect it. In cloud infrastructure,
the analogous gap in shared responsibility led to independent observability
and audit services (CloudTrail, Config, GuardDuty) controlled by the
customer, not the provider. Agent infrastructure has no equivalent yet: the
deployer is told it owns harness, tools, and environment, but often has no
independent way to verify what those layers actually did at runtime.

GitHub's agentic workflow architecture starts from the premise that "agents
cannot be trusted by default, especially in the presence of untrusted inputs",
using kernel-enforced communication boundaries that hold even if the agent
container is compromised. OpenAI's Codex documentation acknowledges that
"devcontainers provide substantial protection, but they do not prevent every
attack."

The platform team still owns the repository, the CI runner, the Kubernetes
cluster, the service accounts, the secrets, and the internal network. But the
runtime acting on those assets may be opaque.

There is also a second split that matters even more for platform teams: the
sandbox may not be controlled by the environment owner either. If the agent
runs in a provider-managed cloud (Claude Code on the web runs in
Anthropic-managed isolated VMs with scoped credential
proxies; Codex runs in OpenAI-managed containers), the
platform team cannot attach its own monitoring, modify isolation policy, or
inspect the sandbox internals. Even Anthropic's own managed agent architecture
explicitly decouples the "brain" (Claude + harness) from the "hands"
(sandboxes), treating containers as disposable and ensuring tokens are never
reachable from the sandbox where generated code runs. This is good
architecture, but it is the provider's architecture, not the platform team's.

When agents run locally or on self-hosted infrastructure (GitHub now supports
self-hosted runners for its coding agent, and Kubernetes
Agent Sandbox provides gVisor/Kata-backed isolation under the
platform operator's control), the environment owner can wrap the agent in its
own sandbox and observability. When agents run in provider-managed
environments, independent observability and enforcement must move to the
boundaries the platform team does control.

This creates the accountability gap: the platform team is responsible for
production impact from a workload it cannot fully inspect, running in a sandbox
it may not own.

The old mental model was simple: the agent is risky, so put it in a sandbox.
The new reality has a different trust boundary: the agent and its harness are
part of the workload, and the environment owner needs independent runtime observability.

Three Layers, Three Questions

MCP, sandboxes, and OS-level observability are all necessary for agent security.
They are not interchangeable. Each answers a fundamentally different question,
and each has a different owner.

Intent authorization (MCP, tool gateways, approval prompts) answers: what
is the agent supposed to do? Which tools may it call, under which identity,
with which scopes? This is the right place to enforce access control before a
dangerous action happens. But a tool approval is not proof of side effects. A
framework log saying "run tests" does not prove that the process tree only ran
tests. An MCP server can be well-authenticated and still be part of a workflow
that causes unexpected local effects. This layer is typically owned or mediated
by the agent harness.

Execution isolation (containers, VMs, network policy, namespaces) answers:
what can the agent reach? Which files, network endpoints, credentials, and
syscalls are available? This is the right place to limit blast radius. But a
sandbox does not automatically record what the agent attempted within its
constraints: which process read a secret, which subprocess opened a network
connection, whether the sandbox policy matched the approved intent. This layer's
ownership is contested: it may belong to the agent provider, the platform team,
or both.

Side-effect verification (OS/runtime observability) answers: what actually
happened? Which processes ran, which files were read, which network connections
were opened, which credentials were accessed? This layer provides facts about
execution, independent of what the framework reported or the sandbox intended.
This layer must be owned by the environment operator. Otherwise there is no
independent source of truth.

The security model is the combination:

authorize intent  →  isolate execution  →  verify side effects
(harness-owned)      (ownership contested)  (must be platform-owned)

When all three layers agree, you have confidence. When they disagree, you need
OS-level observability and controls, independent of the harness, to detect the
mismatch, contain the damage, and reconstruct what happened.

Why Independence Matters

The reason to keep these layers independent follows from the trends above, but
also from a deeper structural argument about ownership and trust.

Approval fatigue

When approvals are relaxed (as the evidence above shows they routinely are),
the other two layers must compensate. If you auto-approve routine actions, you
need an independent way to verify what those actions actually did. If you
bypass permissions for speed, you need stronger containment and stronger observability.

Harness opacity

When the harness is opaque, application-level telemetry cannot be the sole
source of truth. OpenTelemetry GenAI conventions and framework-level tracing are
valuable when you own the framework. But opaque agent apps, closed-source
runtimes, hosted execution, stripped binaries, and arbitrary subprocess trees
can all break the assumption that the framework trace is complete. OpenClaw
illustrates this directly: its behavior is non-deterministic across runs,
producing different tool-calling sequences for the same input, which makes
static code review inadequate and drove multiple teams to build dedicated
runtime observability tools for it (OneClaw, ClawTrace).
Security researchers have already found 30+ vulnerabilities across all major AI
IDEs (Cursor, Copilot, Windsurf, Claude Code), enabling data theft
and remote code execution through prompt injection into agent tool chains.

The MCP layer records intended tool calls. The OS layer records actual side
effects. When the harness is opaque, the gap between these two is exactly where
security incidents live.

The trust boundary is an ownership boundary

The deepest reason for independence is that the three layers serve different
owners with different incentives.

The harness provider's goal is to complete the user's task: maximize
autonomous coding productivity, reduce permission friction, deliver results.
The platform team's goal is to protect the repository, secrets, cluster,
CI runner, internal network, and production APIs. These goals are not opposed,
but they are not identical. When they conflict, when the fastest path to task
completion involves reading credentials, opening network connections, or
modifying files outside the workspace, the harness will optimize for
completion unless an independent boundary stops it.

This is why Bhattarai and Vu argue that
"probabilistic compliance is not compliance": training-based and
classifier-based defenses may reduce empirical attack rates, but cannot provide
deterministic guarantees under adversarial conditions. Only architectural
enforcement can. Red Hat's experience deploying multi-agent systems on Kagenti
frames the same insight differently: this is "a multi-tenancy problem disguised
as an AI problem". The agent is an untrusted tenant. The
platform needs the same kind of isolation, identity, and audit controls it would
apply to any untrusted workload.

The OWASP Top 10 for Agentic Applications reinforces this
framing. Its top risk (ASI01, Agent Goal Hijacking) is that "agents cannot
reliably distinguish instructions from data," and a single malicious input from a
repository, issue, MCP response, or web page can redirect the agent to perform
harmful actions using its legitimate tools. This is not a hypothetical:
Bishop Fox demonstrated confused deputy attacks where
instructions embedded in support tickets caused agents to exfiltrate data using
authorized tools, with "the user's name on every audit log entry." Docker
documented a GitHub prompt injection chain where a
malicious issue hijacked an MCP-connected agent to steal confidential data from
private repositories.

The threat model for platform teams therefore has three adversary categories:

Threat	Which layer fails	Runtime observability detects
Compromised agent (prompt injection, malicious repo/issue/MCP response)	Intent layer: agent is tricked into unintended actions	Actual side effects diverge from stated intent
Untrusted harness (opaque permission logic, incomplete logs, unauditable internal state)	Cannot verify harness completeness	OS-level facts independent of harness reporting
Sandbox escape or policy gap (container breakout, mounted credentials, network bypass)	Isolation layer fails or is misconfigured	Detects behavior outside expected sandbox boundary

AISI's SandboxEscapeBench makes the third category concrete:
frontier models can reliably escape container sandboxes under
misconfigurations that plausibly occur in real systems, and the researchers
discovered four unintended escape paths the benchmark designers had missed.
Their recommendation: "treat plain Docker isolation as insufficient by
default."

In all three cases, OS/runtime observability is the independent control
that lets the platform team detect the problem, regardless of which other layer
failed.

What OS-Level Monitoring Captures

At the OS/runtime layer, observability captures:

Process lineage: the full tree from agent to subprocess to network call
File access: which paths were read or written, including credential paths
Network behavior: connections, destinations, timing, data volume
Container metadata: namespace, cgroup, pod identity, service account
Subprocess behavior: commands that bypass framework instrumentation

This data is collected below the application layer, typically via eBPF,
audit subsystems, or kernel instrumentation. It does not require modifying the
agent app. Its key property is independence: the observability is owned and
operated by the environment operator, not by the agent provider.
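
As a rough illustration of the process-lineage item above, the sketch below rebuilds an ancestry chain from exec events. The ExecEvent record and its fields are assumptions made for this example, not the output format of AgentSight or any specific eBPF tool. A real collector would also need to handle PID reuse and events that arrive out of order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecEvent:
    """Hypothetical exec record, as a kernel-level collector might emit it."""
    pid: int
    ppid: int
    comm: str


def lineage(events, pid):
    """Return command names from the oldest known ancestor down to `pid`."""
    # If a PID exec'd more than once, the last event for that PID wins.
    by_pid = {e.pid: e for e in events}
    chain = []
    seen = set()  # guards against cycles in malformed input
    while pid in by_pid and pid not in seen:
        seen.add(pid)
        event = by_pid[pid]
        chain.append(event.comm)
        pid = event.ppid
    return list(reversed(chain))


events = [
    ExecEvent(100, 1, "agent"),
    ExecEvent(101, 100, "shell"),
    ExecEvent(102, 101, "python"),
    ExecEvent(103, 102, "curl"),
]
print(" → ".join(lineage(events, 103)))  # agent → shell → python → curl
```

The walk stops at the first parent the collector never saw (PID 1 here), so the chain starts at the agent process rather than at init.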

This makes cross-layer comparison possible:

Framework report:    run tests
Sandbox policy:      workspace mounted, registry allowed, SA token mounted
OS observability:    agent → shell → python → curl
                     read: /var/run/secrets/.../token
                     connect: unknown external host

Each layer saw a different part of the event. Without the OS layer, this is an
undetected credential theft: a service account token read and exfiltrated while
the framework logged only "running tests." The platform team discovers the
breach days later, if at all. OS-level observability is what turns an
invisible data leak into a real-time detection.
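
The comparison above can be sketched as a simple check of observed side effects against what a declared intent is expected to produce. Everything in the EXPECTED table is hypothetical (the intent string, the allowed commands, the registry hostname), and 203.0.113.7 is a documentation-range address standing in for the unknown external host. The elided token path is kept exactly as it appears in the trace.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Observed:
    kind: str    # "exec", "read", or "connect"
    target: str  # command name, file path, or destination host


# Hypothetical mapping: declared intent -> side effects it may produce.
EXPECTED = {
    "run tests": {
        "exec": ["shell", "python", "pytest"],
        "read": ["/workspace/*"],
        "connect": ["registry.internal.example"],
    }
}


def mismatches(intent, observed):
    """Return observed events that the declared intent does not explain."""
    allowed = EXPECTED.get(intent, {})
    return [
        ev for ev in observed
        if not any(fnmatchcase(ev.target, pat) for pat in allowed.get(ev.kind, []))
    ]


observed = [
    Observed("exec", "shell"),
    Observed("exec", "python"),
    Observed("exec", "curl"),
    Observed("read", "/workspace/tests/test_app.py"),
    Observed("read", "/var/run/secrets/.../token"),
    Observed("connect", "203.0.113.7"),
]

for ev in mismatches("run tests", observed):
    print(f"unexplained {ev.kind}: {ev.target}")
```

With this input, the three unexplained events are the curl exec, the token read, and the external connection. An intent with no entry in the table explains nothing, so every event is flagged.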

Deployment Reality

OS-level observability is strongest when you control the host, node, or VM where the
agent executes. If the agent runs entirely in a provider-managed environment,
you may not be able to attach eBPF inside it.

In that case, the same model applies, but observability shifts to the boundaries you do control:

Repository permissions and branch protection
Scoped credentials with minimal lifetime
CI/CD and GitHub audit logs
Network proxies and webhook events
Artifact access logs
Provider-supplied session logs

This observability is weaker than owning the runtime boundary, but it is still better
than treating the agent transcript as the only source of truth.

The design question for platform teams is:

Where is the lowest layer I actually control?
That is where independent observability should live.

AgentSight and ActPlane: Observe, Then Enforce

We are building open-source tools that implement the verification layer
described above, each addressing a different half of the problem.

AgentSight is a zero-instrumentation observability tool for
AI agents. It uses eBPF to intercept SSL/TLS traffic and monitor process
behavior at the system boundary, with no code changes, no SDKs, and no
framework integration required. Point it at any agent process (Claude Code,
Codex, a custom Python agent) and it captures the full picture: process
lineage, LLM API calls (prompts and completions), file access, network
connections, and tool invocations, all correlated into a live timeline. This is
the "see what actually happened" layer. Because it operates below the
application, it works even when the agent runtime is opaque, closed-source, or
running arbitrary subprocesses that bypass framework-level tracing. In
practice, this means detecting credential access, data exfiltration attempts,
and unauthorized network connections as they happen, not days later when an
external party reports the breach.

ActPlane is an OS-level harness for AI agents. Where AgentSight
observes, ActPlane enforces. You write behavioral contracts in a YAML-based
rule language (labeled information-flow control, not static allow-lists), and
ActPlane compiles them into an eBPF program that enforces constraints at the
kernel level: every exec, file open, and network connect in the agent's
entire process tree is checked against the policy. When a rule is violated,
ActPlane blocks the action and feeds a human-readable reason back to the agent
through its hook system, so the agent self-corrects rather than failing
silently. The rule language supports data-flow tracking across fork/exec
chains, causal ordering ("run tests before committing"), and staleness
invalidation, going well beyond what sandboxes or tool-layer guards can
express.
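
To make causal ordering and staleness invalidation concrete, here is a minimal userspace sketch of the "run tests before committing" rule as a small state machine. It is not ActPlane's rule language or its eBPF implementation. The class name, the event names ("write", "tests_passed", "commit"), and the feedback message are all assumptions for illustration.

```python
class FreshTestsRule:
    """Allow a commit only if tests passed after the most recent write."""

    def __init__(self):
        self.tests_fresh = False

    def on_event(self, kind):
        """Return (allowed, reason); reason is empty when allowed."""
        if kind == "write":
            # Staleness invalidation: any edit voids the earlier test result.
            self.tests_fresh = False
            return True, ""
        if kind == "tests_passed":
            self.tests_fresh = True
            return True, ""
        if kind == "commit" and not self.tests_fresh:
            # Human-readable reason the agent can act on.
            return False, "commit blocked: run the tests after your last edit"
        return True, ""


rule = FreshTestsRule()
for kind in ["write", "tests_passed", "write", "commit", "tests_passed", "commit"]:
    allowed, reason = rule.on_event(kind)
    print(f"{kind}: {'allowed' if allowed else reason}")
```

In this sequence the first commit is blocked because a write happened after the tests passed, and the second commit goes through once the tests pass again. A static allow-list cannot express this, since whether the commit is permitted depends on the order of earlier events.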

The two tools are complementary. AgentSight provides runtime observability:
independent, below-the-application visibility into what the agent did. ActPlane
provides the enforcement plane: deterministic, kernel-level guarantees about
what the agent cannot do. Together they implement the "verify side effects"
layer of the three-layer model, independent of the harness provider and
independent of who owns the sandbox.

Both are possible implementations of this architecture, not the only ones.
The important point is the separation: observe and enforce at a layer the
environment operator controls, regardless of which agent runtime sits above.

This also addresses ecosystem gaps Anthropic identifies: the need for
cross-deployment security telemetry sharing and open standards for agent
security. Independent runtime observability that travels with the workload,
rather than being locked to a specific harness or provider, is the foundation
for both.

Practical Checklist

If you are building or evaluating an agent platform, ask these questions at
each layer.

Intent authorization (MCP / tool access):

Are MCP servers allowlisted?
Are OAuth scopes minimal and audience-bound?
Are local MCP servers treated as code execution risk?
Are high-risk tools gated by human approval?
Are tool calls logged with enough context for audit?

Execution isolation (sandboxing):

Is filesystem access default-deny or broad workspace mount?
Can the agent reach cloud metadata endpoints?
Is network egress restricted by domain, IP, or proxy?
Are service account tokens mounted into the environment?
Are process, memory, CPU, and runtime duration bounded?
Who owns the sandbox policy: the platform team or the agent provider?

Side-effect verification (runtime observability):

Can you reconstruct process lineage for an agent session?
Can you see file and credential access below the framework?
Can you correlate network egress with pod, service account, and command?
Can you detect mismatch between tool intent and OS side effects?
Can you replay an incident without trusting only framework logs?
Can you demonstrate to auditors (SOC 2, ISO 27001) how automated agent access to production data and credentials is monitored and logged?

Guardrail integration:

Which side effects should be blocked immediately?
Which should trigger alert or human review?
Which policies belong in MCP config, sandbox config, Kubernetes policy, eBPF/LSM, or network controls?
What happens when framework logs and OS-level observability disagree?
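
One way to start answering the first two questions is a first-match rule table that maps each observed side effect to block, alert, or allow. The rules below are hypothetical examples, not a recommended policy: /var/run/secrets/ is the credential path prefix from the earlier trace, 169.254.169.254 is the common cloud metadata address, and /workspace/ is an assumed mount point.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"
    BLOCK = "block"


# Hypothetical rules: (event kind, target prefix, action). First match wins.
RULES = [
    ("read", "/var/run/secrets/", Action.BLOCK),
    ("connect", "169.254.169.254", Action.BLOCK),
    ("read", "/workspace/", Action.ALLOW),
    ("write", "/workspace/", Action.ALLOW),
]


def triage(kind, target):
    """Return the action for one observed side effect."""
    for rule_kind, prefix, action in RULES:
        if kind == rule_kind and target.startswith(prefix):
            return action
    # Anything no rule explains goes to alerting or human review.
    return Action.ALERT


for kind, target in [
    ("read", "/var/run/secrets/.../token"),
    ("write", "/workspace/src/app.py"),
    ("connect", "203.0.113.7"),
]:
    print(kind, target, "->", triage(kind, target).value)
```

Defaulting to alert rather than allow means unknown behavior reaches a reviewer without blocking the agent outright. Plain prefix matching is only safe on already-resolved paths, since something like /workspace/../etc would otherwise slip through.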

Closing

Agent runtimes are becoming more capable, more managed, and more opaque. The
security model cannot depend on any single layer, especially when the layers
have different owners.

The harness is not a trusted boundary. Sandbox ownership depends on the
deployment model. The only layer the environment operator can guarantee it
owns is OS/runtime observability.

MCP authorizes intent. Sandboxes constrain execution. OS-level observability verifies side
effects. Each is necessary; none is sufficient. The practical model is their
separation:

authorize intent  →  isolate execution  →  verify side effects
(harness-owned)      (ownership contested)  (must be platform-owned)

The implementation details vary by deployment, but the separation, and the
ownership question, is the part that should remain stable.

If you are exploring this space, AgentSight and
ActPlane are our open-source starting points for the observation
and enforcement layers respectively.

DEV Community