DEV Community: 云微

An Empirical Study: AI Agent Rules Need Context and Layered Enforcement

云微 — Mon, 20 Jul 2026 00:57:55 +0000

A rule like "run the full test suite before committing" looks simple until an AI coding agent edits a source file after the last test run and then calls git commit. The kernel sees an ordinary process writing a commit object, while the harness sees one more tool call, yet the decision depends on which test result is still fresh, which edit invalidated it, and whether this commit is allowed now.

The ActPlane paper measures the gap between the behavioral rules developers write and the subset a system can actually check. Its statement-level analysis of 2,116 instructions shows that developers are not short of rules; the difficulty lies in turning natural-language requirements into state that a system can observe and evaluate over time. Many rules concern files, processes, or network activity but still depend on repository structure, task progress, or prior events, so a single OS hook can cover only part of the policy set.

Developers Have Already Written the Policies

Most discussions of AI agent safety start from threat models or attack surfaces. ActPlane starts from a different question: what do developers already tell their agents to do and not do, and what would it take to enforce those instructions?

The study examines 64 popular repositories containing CLAUDE.md and AGENTS.md files (median 20K GitHub stars, snapshot from 2026-05-23), covering 84 instruction files and 2,116 individual statements. Unlike prior work that analyzed instruction files at the file or section-heading level, ActPlane classifies every statement independently. The study asks three questions: are instruction files primarily behavioral policies or descriptive context? Which policies require OS-level enforcement, and what kinds of OS-level checks do they need? What context is needed to instantiate these policies into concrete, enforceable rules?

Statements were extracted through a two-pass LLM agent-assisted pipeline that recorded source line ranges and four labels per statement: content type, topic, enforcement level, and context requirement. A validation script verified full source coverage and verbatim span matching, then two independent agents (Claude and Codex) cross-checked the results. A stratified sample of 100 statements went through independent human review, which confirmed the labels were correct.

Across those 2,116 statements, 64% are policies that require, forbid, or condition a specific agent action. The remaining 36% are descriptive context, such as architecture notes or project background. Policy density varies widely across repositories, from 0% to 97%, with 70.1% of repositories containing more policy statements than descriptive ones. File- or heading-level studies do not report this statement-level distribution, which is why the finer classification matters.

To understand how policies distribute across concerns, the study assigns each statement to one of 12 topic categories adapted from prior instruction-file research, applied at statement granularity rather than file granularity. Development Process and Implementation Details are the most policy-dense topics: 87% and 85% of their statements, respectively, are policies. Architecture is mostly descriptive, with only 23% of its statements being policies, because directory layouts and design summaries make up the bulk of those sections. The source figures call policy statements directives and call the system-observable policy subset system-level directives; this prose follows the paper's policy and system-observable terminology.

Five real statements from the dataset illustrate the range of enforcement requirements:

Statement	Enforcement level	Context
S4: "Never push to main directly."	per-event	self-contained
S5: "Never modify upstream source code."	per-event	project
S6: "Run the full test suite before committing."	cross-event	project
S7: "Data read from .env must not reach the network."	cross-event	project
S8: "Do not update dependencies without approval."	per-event	task

The Enforcement Gap Begins with Context

Each policy exits at the first matching tier of an enforcement waterfall. Semantic-only covers reasoning, communication, or output style; content covers predicates over file contents; per-event covers a single command, file access, or network connection; and cross-event covers policies that depend on temporal ordering or data lineage across operations. The union of content, per-event, and cross-event tiers is called system-observable.

Of the 1,361 policies in the dataset, only 17% are semantic-only. The remaining 83% are system-observable, comprising 38% that require content inspection, 29% that match one OS event, and 16% that require cross-event state. Only the per-event and cross-event classes, 45% together, form the OS-enforceable subset. Cross-event policies concentrate in Development Process, which accounts for 39.5% of all cross-event policies.

These cross-event policies follow four recurring patterns. Temporal ordering constrains sequencing: "run tests before committing" requires that one event happened after another, not merely at some earlier point. Cross-file consistency links changes across artifacts: "update docs when behavior changes" couples a source edit to a documentation update. Multi-step workflows enforce release checklists with verification gates, where each step must complete before the next begins. Conditional triggers couple operations: "if you change specs, also update the SDK" fires only when a precondition is met.

None of these can be decided from a single event, so enforcement must record what ran, in what order, and what has changed since. Such policies are widespread, with 81% of repositories containing at least one cross-event policy and 43% spanning all four enforcement tiers.

Context dependence compounds the enforcement challenge. Of the 1,127 system-observable policies, only 26.4% are self-contained. The majority, 64.2%, require project context: "the test suite" or "upstream source" must be resolved against a specific repository before the policy becomes a concrete rule. Even a per-event policy like S5, "Never modify upstream source code," requires resolving which paths constitute "upstream source" before a file-write check can fire. Another 9.4% require task context, such as "unless explicitly requested" or "without approval."
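The headline percentages can be cross-checked against the counts quoted in this article. The check below is a minimal sketch that uses only those quoted counts (the 607 OS-enforceable policies are reported in later sections); rounding to whole percent is an assumption about how the published figures were derived.

```python
# Counts quoted in the article.
TOTAL_STATEMENTS = 2116
POLICIES = 1361
SYSTEM_OBSERVABLE = 1127
OS_ENFORCEABLE = 607  # per-event + cross-event policies, reported later in the article


def pct(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


policy_share = pct(POLICIES, TOTAL_STATEMENTS)        # policies among all statements
observable_share = pct(SYSTEM_OBSERVABLE, POLICIES)   # system-observable among policies
enforceable_share = pct(OS_ENFORCEABLE, POLICIES)     # OS-enforceable among policies

print(policy_share, observable_share, enforceable_share)
```

The three values line up with the 64%, 83%, and 45% figures cited in the text.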

The two difficulties compound, because the policies that require tracking state across events are also the ones that rarely specify the concrete commands and paths needed to write the rule. Cross-event policies are 95% context-dependent (77% project, 19% task), compared to 58% for content policies. A policy that says "run tests before commit" sounds simple until the enforcement engine needs to know which test command to watch for, which source directories count as "relevant edits," and whether the test passed or merely ran.

A fixed set of static rules can cover only the self-contained fraction. Instantiating the rest requires reading the repository and interpreting the current task before any check can run.

Agent policy enforcement begins by compiling repository and task context into concrete state that deterministic checks can evaluate.

One Rule Crosses Several Enforcement Layers

Prompt instructions rely on the model's own compliance, but they are vulnerable to prompt injection and compete with the user's task prompt for attention in a long context window. Separate agents or LLM guards can check prompts, responses, or action trajectories at runtime, but these checks are inherently probabilistic.

Tool-call guardrails and application-level information-flow control (IFC) systems intercept at the harness boundary deterministically, but they observe only harness-mediated requests, not system-level effects once a tool starts executing. An indirect subprocess, shell-out, or compiled binary can bypass the tool boundary. Consider an agent that writes a Python script containing subprocess.run(["git", "push"]) and then executes it: the tool-call layer sees "run python script.py," not the git push inside it.
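The bypass described above can be illustrated with a toy harness-level guard. This is a sketch under stated assumptions: the deny pattern, the `tool_call_allowed` helper, and the script text are hypothetical and do not come from any particular guardrail product.

```python
import re

# Hypothetical harness-level guard: it only sees the command string of each tool call.
DENY_PATTERNS = [re.compile(r"\bgit\s+push\b")]


def tool_call_allowed(text: str) -> bool:
    """Return False when the inspected text itself matches a denied pattern."""
    return not any(p.search(text) for p in DENY_PATTERNS)


# The script the agent wrote earlier; it shells out to git when executed.
SCRIPT_BODY = 'import subprocess\nsubprocess.run(["git", "push"])\n'

direct = tool_call_allowed("git push origin main")  # pattern visible in the tool call
indirect = tool_call_allowed("python script.py")    # push hidden behind the script
body_scan = tool_call_allowed(SCRIPT_BODY)          # argv-list form does not match the regex

print(direct, indirect, body_scan)
```

The direct call is denied, while the indirect call passes. Even scanning the script text with the same pattern misses the push, because the command is written as an argument list rather than a shell string.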

OS-level mechanisms like seccomp, AppArmor, Landlock, and Tetragon control resource access, not actions in the sense developers write about. They expect statically pre-written policies and return opaque errors that confuse the agent: a bare EPERM with no explanation of what rule was violated or how to recover.

Those layers still leave a structural split between who holds policy context and who can see every execution path. Most rules need project or task context that resides with the agent, so the agent itself must turn policies into concrete rules, yet many policies define event ordering or data flow that tool-call guardrails never see, so the rules must still be concrete enough for deterministic OS-level enforcement. Bridging that gap is what ActPlane addresses.

Two design requirements follow. The policy specification must be agent-writable yet OS-enforceable, so the agent can produce concrete rules from natural-language policies with minimal expertise and receive semantic feedback to understand violations and recover. Enforcement must also stay safe, isolated, and efficient, meaning agent-authored policy must not weaken constraints set by higher authority, must not affect other agents' policies, and must not slow the agent's normal workload.

Compiling Intent into Enforceable State

Each ActPlane rule has five components: a source that identifies what is being governed, a target operation (such as exec, write, or connect), an effect, an optional temporal gate, and a reason string for semantic feedback. The paper's running example makes this concrete:

kill exec "git" "commit" unless after exec "go" "test" exits 0 since write "**/*.go"

This rule kills any git commit unless go test has exited successfully since the most recent relevant source edit. The reason field, omitted here for brevity, provides the agent with a structured explanation when the rule fires.
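The state this rule depends on can be modeled in a few lines. The sketch below is not ActPlane's implementation; `CommitGate` and its method names are hypothetical, a logical counter stands in for event ordering, and it assumes a commit is denied when no passing test run has been recorded at all.

```python
from dataclasses import dataclass


@dataclass
class CommitGate:
    """Tracks whether a passing test run is newer than the last source edit."""
    clock: int = 0       # logical event counter, not wall-clock time
    last_edit: int = 0   # tick of the most recent write to a *.go file
    last_pass: int = -1  # tick of the most recent `go test` that exited 0

    def _tick(self) -> int:
        self.clock += 1
        return self.clock

    def on_write(self, path: str) -> None:
        if path.endswith(".go"):
            self.last_edit = self._tick()

    def on_test_exit(self, exit_code: int) -> None:
        tick = self._tick()
        if exit_code == 0:  # failed runs never refresh the gate
            self.last_pass = tick

    def commit_allowed(self) -> bool:
        # "unless after exec go test exits 0 since write **/*.go"
        return self.last_pass > self.last_edit


gate = CommitGate()
trace = []
gate.on_write("pkg/server.go")
gate.on_test_exit(0)
trace.append(gate.commit_allowed())  # tests passed after the edit
gate.on_write("pkg/server.go")
trace.append(gate.commit_allowed())  # a newer edit invalidated the result
gate.on_test_exit(1)
trace.append(gate.commit_allowed())  # tests ran but failed
gate.on_test_exit(0)
trace.append(gate.commit_allowed())  # fresh passing run
print(trace)
```

The trace shows why a single-event check is not enough: the same git commit is allowed or denied depending on what happened before it.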

Effects form a gradient matching the distinction between instructions and constraints. Block is a pre-operation synchronous denial with no TOCTOU gap: the kernel intercepts the system call before it executes, and the agent can reroute. Kill terminates the process after the operation has begun, preventing the agent from switching to an alternate channel. Notify delivers guidance without stopping the action. Constraints use block or kill; instructions use notify.

Temporal gates let rules express ordering rather than point-in-time predicates. The after ... since ... construct encodes that one event must have occurred after another: tests must have run after the most recent edit, not merely at some earlier point. The exits N qualifier distinguishes successful from failed exits. A lineage gate checks process ancestry, allowing rules to restrict operations to specific process trees.

Information-flow labels propagate along fork, exec, read, write, and connect and are monotonic: once a process reads a labeled object, the label cannot be removed. When a process reads .env, it acquires that file's source label. If it later attempts to connect to an external endpoint, the rule matching that label fires and blocks the connection. This is how S7 from the study ("Data read from .env must not reach the network") becomes an enforceable cross-event rule.
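A toy model of this propagation, assuming labels are bits in an integer mask. The `Node` class, the helper functions, and the `ENV` bit are hypothetical names for illustration, and the sketch covers only read, write, and connect.

```python
ENV = 1 << 0  # hypothetical label bit for objects matching ".env"


class Node:
    """A process or a file; both simply carry a label bitmask."""

    def __init__(self, labels: int = 0) -> None:
        self.labels = labels


def read(proc: Node, obj: Node) -> None:
    proc.labels |= obj.labels  # monotonic: bits are only ever added


def write(proc: Node, obj: Node) -> None:
    obj.labels |= proc.labels  # data written by a labeled process taints the file


def connect_allowed(proc: Node, blocked_mask: int) -> bool:
    return (proc.labels & blocked_mask) == 0


env_file, scratch = Node(ENV), Node()
reader, uploader, bystander = Node(), Node(), Node()

read(reader, env_file)   # reader now carries ENV
write(reader, scratch)   # scratch file inherits ENV
read(uploader, scratch)  # uploader picks up ENV indirectly

print(connect_allowed(uploader, ENV), connect_allowed(bystander, ENV))
```

The uploader never touched .env directly, yet its connect is denied because the label travelled through the intermediate file. The bystander process stays unaffected.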

Policy authority relies on a temporal trust boundary. Rules loaded before the agent starts are higher-authority and immutable to the agent. The agent and its sub-agents can add new rules or narrow existing ones within child domains, but they cannot weaken, remove, or disable inherited constraints. Runtime deltas arrive through a ring buffer and pass through an in-kernel authority checker that validates each change against the domain hierarchy before activation. The trusted computing base consists of the kernel enforcement engine and the higher-authority policy, and everything below this boundary is untrusted execution. A compromised userspace agent therefore cannot modify the active rule set beyond what its domain hierarchy permits.
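One way to picture the authority check is to treat each domain as a path in a tree and let a requester touch only rules owned by itself or its descendants. This is an illustrative sketch, not the in-kernel checker: the `Delta` type, the tuple-based domain paths, and the two operations are assumptions, and narrowing of existing rules is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    op: str            # "add" or "remove"
    rule_owner: tuple  # domain path owning the targeted rule, e.g. ("root", "agent")
    requester: tuple   # domain path of the process submitting the delta


def is_descendant_or_self(domain: tuple, ancestor: tuple) -> bool:
    """True when `domain` equals `ancestor` or sits below it in the hierarchy."""
    return domain[: len(ancestor)] == ancestor


def authorize(delta: Delta) -> bool:
    # Rules owned by an ancestor are inherited constraints and stay immutable;
    # a requester may only change rules in its own domain or its child domains.
    if delta.op not in ("add", "remove"):
        return False
    return is_descendant_or_self(delta.rule_owner, delta.requester)


ROOT, AGENT, SUB = ("root",), ("root", "agent"), ("root", "agent", "sub")

decisions = [
    authorize(Delta("add", AGENT, AGENT)),    # agent adds a rule to itself
    authorize(Delta("add", SUB, AGENT)),      # agent constrains its sub-agent
    authorize(Delta("remove", ROOT, AGENT)),  # agent tries to drop an inherited rule
    authorize(Delta("remove", AGENT, SUB)),   # sub-agent tries to drop its parent's rule
]
print(decisions)
```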

Because labels are monotonic, long-running sessions risk over-tainting. In a typical coding session, a process might read dozens of configuration and source files, and without mitigation each read adds a label, so after enough reads every subsequent write or connect would match some rule. ActPlane mitigates this by clearing inherited labels when a fresh subprocess is spawned, bounding taint accumulation to the lifetime of each process rather than the entire session.

The 607-policy dataset exercises most DSL features and validates the language's expressiveness. Effects skew toward observation: 66% of clauses are notify, 29% are block, and only 5% are kill, reflecting that most policies monitor rather than prevent. Hooks concentrate on code execution (60% exec) and file mutation (37% write), with network and cleanup operations under 1% each. Cross-event features see substantial use, with 28% of policies using an after/since temporal gate and 214 using unless to encode exceptions.

The implementation stays compact enough to reason about. The userspace compiler and runner are roughly 3.2K lines of Rust, and the eBPF enforcement engine is roughly 1.8K lines of BPF C. BPF-LSM hooks handle pre-operation decisions (block), while tracepoints handle observation and post-operation termination (kill). Labels live as 64-bit bitmasks in per-object BPF maps, so propagation reduces to a single bitwise OR, and the engine can support up to 128 concurrent rules, comfortably above the largest observed repository's 66 policies. For deeper coverage of the deployment architecture and mechanism details, see ActPlane: Pushing Agent Harness Enforcement Down to Kernel eBPF.

Recovery Reveals What Enforcement Alone Misses

Before recovery can matter, the DSL has to accept the policies developers already wrote. A Codex agent compiled all 607 OS-enforceable policies into ActPlane rules on the first or second attempt, with only 2 of 607 needing a syntax-error retry. The translation completed in 34 minutes using 7 subagents with 4 running in parallel, expanding 607 policies into 1,283 rule lines. The cost was roughly $0.028 per policy, using 1.7M input tokens and 177K output tokens, compared to approximately $11 per rule at typical software-engineer rates. Most policies are structurally simple, with 74% having at most two enforcement clauses and even the 95th percentile staying under 152 tokens.

Contextual enforcement resolves far more violations than any baseline. On the decision-compliance benchmark (190 traces, 38 rules drawn from the empirical study), ActPlane achieves a 75.8% Decision Compliance Rate under the primary Qwen3.6-27B setting. In this setting, the tested agent, prompt-filter classifier, and trajectory judge all use Qwen3.6-27B; the benchmark traces were generated earlier by Codex CLI under GPT-5.5, and policy artifacts for the non-prompt systems were translated separately with Claude Opus 4.6:

System	DCR
ActPlane	75.8%
ActPlane-opaque (no feedback)	53.7%
FIDES (tool-level IFC)	48.9%
Prompt-filter	48.4%
Tool-regex	45.3%

The benchmark generates five traces per rule: two compliant traces that stress false-positive detection (one exercising permitted actions near the boundary, one performing technically compliant operations that resemble violations) and three violation traces reflecting execution-path categories observed in deployed coding agents. Direct violations trigger the prohibited operation via a tool call, script violations split authoring and execution into a subprocess, and hidden violations embed the side effect in an auxiliary artifact behind a neutral entrypoint.

The gap concentrates on violation traces, where ActPlane correctly resolves 86 of 114, compared to 27 to 44 for baselines, a 2.0 to 3.2 times improvement. All systems perform comparably on compliant traces and direct violations, so ActPlane's advantage opens on script and hidden trace families, where the prohibited effect occurs inside a subprocess or behind a neutral entrypoint and tool-call interception cannot observe it.
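The trace counts and improvement factors follow directly from the benchmark design. The arithmetic below uses only numbers quoted in this article; the variable names are illustrative.

```python
RULES = 38
TRACES_PER_RULE = 5            # two compliant + three violation traces
VIOLATION_TRACES_PER_RULE = 3

total_traces = RULES * TRACES_PER_RULE
violation_traces = RULES * VIOLATION_TRACES_PER_RULE

ACTPLANE_RESOLVED = 86
BASELINE_LOW, BASELINE_HIGH = 27, 44  # range of correct outcomes across baselines

resolved_share = ACTPLANE_RESOLVED / violation_traces
factor_low = ACTPLANE_RESOLVED / BASELINE_HIGH   # against the strongest baseline
factor_high = ACTPLANE_RESOLVED / BASELINE_LOW   # against the weakest baseline

print(total_traces, violation_traces)
print(f"{resolved_share:.1%} {factor_low:.1f}x {factor_high:.1f}x")
```

This reproduces the 190 and 114 trace counts and the 2.0 to 3.2 times range, and shows that 86 of 114 is about three quarters of the violation traces.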

Semantic feedback is the dividing line between compliance and retry loops. Full ActPlane produces roughly three times as many correct violation-trace outcomes as the same engine without feedback, 86 versus 27. Among the violations ActPlane detects, the recovery rate (the fraction where the agent then complies) is 97.7% with semantic feedback versus 31.4% without. When the agent learns why an action was blocked and what state it needs to repair, it reroutes and completes the task. When it receives only a bare denial, it retries the same forbidden action through alternative paths.

That recovery loop only helps if enforcement stays cheap enough for everyday coding. End-to-end overhead was measured on two workloads under no-hit configurations where policies are loaded but no rule fires. The first workload is an agent trace suite that replays 68 tool actions with 20 Bash subprocesses. The second is a Linux kernel build (defconfig + vmlinux, make -j24). At 32 active rules, ActPlane adds 1.9% on the agent trace and 6.5% on the kernel build. Even at 100 rules, overhead stays below 8.4%.

Microbenchmarks isolate where per-syscall cost concentrates. Across the one- through 100-rule configurations, the absolute additions on fork and exec range from 3.12 to 68.73 microseconds. At 100 rules specifically, fork adds 20.39 microseconds and exec adds 68.73 microseconds over native latencies of 48.94 and 248.30 microseconds.
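Expressed relative to native latency, the 100-rule figures look like this. The calculation is a small worked example using only the numbers above; the dictionary layout is illustrative.

```python
# (added latency, native latency) in microseconds at 100 rules, as quoted above.
MEASUREMENTS = {
    "fork": (20.39, 48.94),
    "exec": (68.73, 248.30),
}

relative = {name: added / native for name, (added, native) in MEASUREMENTS.items()}

for name, ratio in relative.items():
    print(f"{name}: +{ratio:.1%} over native")
```

Fork pays the larger relative cost (about 42%) even though exec pays the larger absolute cost (about 28% of a much slower call).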

Under the same 100-rule load, absolute latencies reach 13.4 microseconds for open, 0.84 microseconds for write, and 3.17 microseconds for connect, so path lookups and rule scans dominate these otherwise sub-microsecond file and network calls. The cumulative ActPlane overhead across an entire tool call's syscall sequence is five to six orders of magnitude smaller than a single LLM inference turn of 2 to 10 seconds. Policy updates propagate quickly: a one-rule hot reload submitted through the userspace ring buffer reaches the kernel drain path in 26.3 microseconds on average, and an immediate exec violation is detected with a median (p50) latency of 176.4 microseconds, including process launch and event delivery.

ActPlane's advantage replicates under a second model. A DeepSeek-Pro V4 end-to-end replication preserves the system ranking with ActPlane highest at 77.4% DCR, and per-cell agreement between the two model settings yields a Cohen's kappa of 0.822.

Translation quality drives both detection and recovery rates, because rules that are too narrow miss violations while rules that are too broad match compliant actions. To measure improvability, the paper feeds each false-negative trace's evidence and corrective feedback to the translation agent and lets it revise the rule once. Rerunning the 28 false-negative traces with revised rules recovers 26 (93%), showing that the DSL supports iterative refinement.

Results on a selected set of real-world coding tasks suggest the pattern may extend beyond synthetic traces. On a 21-task subset of OctoBench with 61 OS-enforceable rules spanning seven repositories, ActPlane improves user-query reward by 9.9 points and implementation/test reward by 9.7 points over the no-enforcement baseline. The gains extend beyond compliance-typed checks, suggesting that OS-level enforcement with semantic feedback can help agents follow rules and complete tasks more effectively on this subset.

A separate safety benchmark extends the evidence beyond the paper's own dataset. On 361 OpenAgentSafety personal-assistant tasks, ActPlane loads agent-generated safety policies as higher-authority rules before the agent begins, preventing 74% of baseline-unsafe behaviors (78 of 106 unsafe outcomes blocked). Those policies were generated only from task descriptions, without human tuning. That deployment-like constraint also reveals a cost: ActPlane activated on 16% of tasks whose baseline was already safe when a description-only policy matched a benign operation near the prohibited boundary.

The 28 unblocked cases fall into three categories: chat or semantic harm where the unsafe behavior is a message with no OS-observable artifact, unsafe file content that falls outside ActPlane's primary scope, and service-side artifacts where the effect is a WebDAV upload or database mutation inside a service container that the current hook set does not observe.

The ActPlane source code is available on GitHub. The policies/ directory contains the full set of 607 translated rules across all 64 repositories, ready to serve as starting points for your own instruction files.

Where Layered Enforcement Stops

Is eBPF enough for AI agent safety?

eBPF provides deterministic enforcement over OS events such as file writes, process launches, and network connections. The per-event and cross-event classes form the directly OS-enforceable 45% of policies. The broader 83% system-observable set also includes 38% content policies, which need linters or static analyzers, while the remaining 17% cover reasoning, communication style, or output quality and require harness-level controls. Task intent, policy authority, content semantics, and isolation remain responsibilities of the layers around kernel enforcement.

Can behavioral baselines replace policy?

Behavioral baselines answer "is this unusual?" by flagging deviations from historical patterns. Policy answers "is this permitted under the current task?" A routine git commit can be perfectly normal by baseline standards and still violate a project rule. An unfamiliar deployment endpoint can trigger an anomaly alert and still be legitimate because the user explicitly requested it. Detection and authorization are two different problems, and a mature system uses both, letting anomaly signals propose candidate rules that go through authority-aware review before becoming enforced policy.

Does AgentSight enforce ActPlane policies?

No. AgentSight is a system-level profiler and monitor that captures what agents do at runtime. ActPlane is the enforcement engine that blocks or redirects actions violating loaded policies. The two serve complementary roles: runtime evidence from AgentSight can feed into policy review and refinement, but observation and enforcement remain separate responsibilities. An organization might use AgentSight to discover that agents frequently write to production config files, then encode a corresponding ActPlane rule to block that pattern with an explanation of why it is forbidden.

What remains outside ActPlane's coverage?

The paper's threat model is explicit about boundaries. Semantically equivalent operations can bypass string-based matching: an agent could use a custom Git client instead of calling exec git, though the underlying connect and write system calls remain visible. Service-side effects behind protocol boundaries, such as WebDAV uploads or database mutations inside service containers, also escape the current hook set. File-content semantics, kernel compromise, CAP_BPF compromise, and side channels are all out of scope. The 17% of policies that are semantic-only, covering reasoning quality, communication tone, or output formatting, require harness-layer handling rather than kernel enforcement.

For a team governing coding agents, the practical decision is narrower than "add more eBPF." Start from the statements already sitting in CLAUDE.md and AGENTS.md, resolve the project and task context they omit, and only then compile the OS-enforceable subset into kernel checks with feedback the agent can act on. Prior instruction-file studies worked at file or section granularity, while the ActPlane dataset measures individual policy statements and maps them to enforcement and context requirements. The ActPlane repository contains the implementation, and a broader three-layer security model placing kernel enforcement alongside isolation, identity, and content controls appears in Runtime Observability and Enforcement for Opaque AI Agents with eBPF.

ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore

云微 — Tue, 02 Jun 2026 11:11:23 +0000

AI agent frameworks are bringing checkpoint/restore, time travel, and rewind into everyday developer workflows. If an agent makes a mistake, it can go back to a checkpoint. If a user wants to explore another path, the agent can branch from an earlier state. This is useful for debugging and human-in-the-loop control, but it becomes dangerous once the agent has already called external tools.

Traditional checkpoint/restore rolls back local state. It cannot undo side effects that have already happened in the external world. For ordinary programs, the usual answer is idempotency: retry the external call with the same request id, and the server returns the previous result instead of executing the action again. But an LLM agent is not an ordinary deterministic program. After restore, it may synthesize a semantically equivalent tool call with slightly different fields, such as a new UUID, timestamp, nonce, or reference number. The server cannot see that this is a retry of the same intent. It only sees a new valid request.

This post is based on our arXiv paper ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore. We introduce semantic rollback attacks: attacks that exploit the gap between rolled-back agent state and non-rolled-back external state to trigger duplicate irreversible actions or revive consumed authority.

A Simple Transfer Example

Suppose a user asks an agent to transfer $500 to Bob. The agent calls a bank API, generates a unique reference id a1b2c3d4, and the transfer succeeds. The agent then calls Bob's MCP service to confirm the receipt. Bob's service returns a malformed response that crashes the agent. The framework restores the agent to a checkpoint before the transfer.

After restore, the agent again executes the intent "transfer $500 to Bob." This time, however, it generates a different reference id, f9a8b7c6. The bank's duplicate detection logic only sees two different references, so it accepts the second transfer. Bob receives $1,000, while the agent's local view remains "I transferred once."

Figure 1: Action Replay. A malicious MCP service triggers a crash after a successful transfer. After restore, the agent reissues the transfer with a new reference id, so the bank treats it as a new transaction.

The key point is not that the transfer API lacks idempotency. The problem is that the precondition for idempotency is broken. Systems such as Stripe and AWS ECS rely on the caller retrying with the same idempotency key or the same critical parameters. An LLM agent rethinks after restore and may produce a different token sequence. Even at temperature 0, byte-identical tool calls are not guaranteed. As a result, traditional server-side deduplication cannot recognize a "semantically same" retry.
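The broken precondition is easy to reproduce with a toy server that deduplicates on the reference id. This is a minimal sketch: the `Bank` class and `agent_transfer` helper are hypothetical stand-ins, not the MCP services used in the paper's experiments.

```python
import uuid


class Bank:
    """Toy service that deduplicates strictly on the caller-supplied reference."""

    def __init__(self) -> None:
        self.results = {}  # reference -> stored response
        self.ledger = []   # committed (payee, amount) transfers

    def transfer(self, reference: str, payee: str, amount: int) -> str:
        if reference in self.results:
            return self.results[reference]  # same key: return the prior result
        self.ledger.append((payee, amount))
        response = f"ok:{len(self.ledger)}"
        self.results[reference] = response
        return response


def agent_transfer(bank: Bank, payee: str, amount: int) -> str:
    """Stand-in for an agent that regenerates the reference id on every attempt."""
    return bank.transfer(uuid.uuid4().hex[:8], payee, amount)


# A deterministic client retries with the same key: one committed transfer.
safe_bank = Bank()
safe_bank.transfer("a1b2c3d4", "bob", 500)
safe_bank.transfer("a1b2c3d4", "bob", 500)

# The restored agent re-plans and mints a new key: two committed transfers.
replayed_bank = Bank()
agent_transfer(replayed_bank, "bob", 500)
agent_transfer(replayed_bank, "bob", 500)

print(len(safe_bank.ledger), len(replayed_bank.ledger))
```

The server-side logic is identical in both runs. Only the caller's behaviour changes, which is why the fix cannot live in the tool's idempotency handling alone.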

Root Cause: Local Rollback, External Progress

Checkpoint/restore systems can save local process state, conversation context, variables, file descriptors, and related runtime state. They cannot automatically undo committed external effects. Transfers, emails, cloud resource creation, data deletion, and one-time token consumption are all irreversible side effects from the framework's point of view.

In the agent setting, three facts combine badly:

The agent state is rolled back. The agent returns to an old checkpoint and no longer remembers that the transfer succeeded.
The external state is not rolled back. The bank ledger, approval system, or cloud control plane still records the previous successful action.
The post-restore tool call may differ. The LLM may regenerate UUIDs, nonces, timestamps, or even change the target object under user guidance.

Figure 2: Restore only affects local agent state. External state keeps moving forward. This divergence is the core of semantic rollback attacks.

This resembles the classic output commit problem in distributed systems: once output has been committed to the outside world, rolling back the local process alone cannot take the whole system back in time. The new twist is that an LLM agent may synthesize a different request after restore, blurring the boundary between "retry" and "new request."

Attack 1: Action Replay

Action Replay targets irreversible tool calls that have already succeeded. The attacker does not need to control the bank or compromise the agent. It is enough to control a later service in the agent's tool chain, such as Bob's invoice-confirmation MCP service or a seemingly harmless callback endpoint.

The attack path is direct:

The agent executes an irreversible action after a checkpoint, such as a transfer or cloud resource creation.
The external service returns success, and the side effect is committed.
An attacker-controlled later tool returns a malformed response, triggering crash or restore.
The agent returns to the old checkpoint and repeats the same task.
The LLM generates a fresh request id, so the target service cannot recognize the repeated intent and commits again.

Figure 3: Normal execution transfers once. In the attack path, crash-induced restore causes the same semantic action to execute twice.

In our experiments, we used Claude Code CLI backed by Qwen3-32B. External services were simulated as MCP tool servers: a bank service with UUID-based duplicate detection and a malicious payee service that crashes the agent after a successful transfer. Across 10 checkpoint/restore trials, all 10 produced duplicate commits. A no-checkpoint baseline produced none. This confirms that the vulnerability comes from the interaction between restore and external side effects, not from ordinary model randomness alone.

Attack 2: Authority Resurrection

The second attack class is Authority Resurrection, which targets one-time authorization tokens or short-lived credentials.

Consider an enterprise data deletion workflow. The agent first obtains manager approval, and the approval service returns a one-time token. The agent uses that token to delete Alice's data, and the server marks the token as consumed. A user or malicious insider then rewinds the agent to the checkpoint immediately after approval was granted. In the agent's local state, the token appears again. In the external approval system, the token should already be consumed.

If the target service validates tokens statelessly, for example by checking only a signature and expiration time, the agent may reuse the same token on another target, such as Bob's data. The audit log may show that the manager approved deletion for Alice, while Bob's data was also deleted. The discrepancy is visible only by correlating approval and execution logs.

Our experiment simulated two approval services:

Validation mode	Result
Stateless validation, checking only token signature	2/2 reuse attempts succeeded
Stateful validation, recording token consumption server-side	All reuse attempts were rejected
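The difference between the two validation modes can be sketched with an HMAC-signed token. The signing key, token format, and service classes below are assumptions for illustration and are not the approval services from the experiment.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical signing key shared with the approval service


def issue_token(approval_id: str) -> str:
    sig = hmac.new(SECRET, approval_id.encode(), hashlib.sha256).hexdigest()
    return f"{approval_id}.{sig}"


def signature_valid(token: str) -> bool:
    approval_id, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, approval_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)


class StatelessService:
    def delete(self, token: str, target: str) -> bool:
        return signature_valid(token)  # no memory of earlier uses


class StatefulService:
    def __init__(self) -> None:
        self.consumed = set()

    def delete(self, token: str, target: str) -> bool:
        if not signature_valid(token) or token in self.consumed:
            return False
        self.consumed.add(token)  # one-time authority is spent server-side
        return True


token = issue_token("approval-42")
stateless, stateful = StatelessService(), StatefulService()

# First use happens before the rewind, second use after it.
stateless_runs = [stateless.delete(token, "alice"), stateless.delete(token, "bob")]
stateful_runs = [stateful.delete(token, "alice"), stateful.delete(token, "bob")]
print(stateless_runs, stateful_runs)
```

The rewound agent holds the same token in both cases. Only the service that records consumption can tell that the authority has already been spent.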

This shows that checkpoint/restore can do more than duplicate financial side effects. It can also break authorization semantics by reviving authority that should have been consumed.

Why This Is Not One Framework's Bug

The paper surveys reports across multiple frameworks and communities. The concrete symptoms differ, but they point to the same boundary: restore, retry, approval, preemption, and human-in-the-loop flows can cause tool calls to execute more than once, while frameworks generally do not enforce exactly-once semantics at the tool boundary.

Framework or system	Observed issue type
LangGraph	Tool nodes may re-execute after resume or interrupt
CrewAI	Workflows run twice, causing repeated emails or actions
Google ADK	Rewind documentation warns that external side effects are not undone
AutoGen / OpenAI Agents	Graph nodes or function calls are triggered repeatedly
Claude Code / Cursor	Duplicate tool behavior around approval, checkpoint, or undo flows
OpenHands / Vercel AI / LiveKit / n8n	Duplicate messages, repeated tool calls, doubled token cost, or repeated charges

These cases do not mean every framework has the same bug. They show that "restoring agent state to the past" while "the external world remains in the present" is a systemic issue. Relying on developers to make tools idempotent is not enough, because the post-restore agent request may not be the same request.

ACRFence: Replay-or-Fork at the Tool Boundary

ACRFence does not try to make every LLM agent deterministic. Instead, it records irreversible effects at the tool boundary and enforces replay-or-fork semantics after restore.

ACRFence can be deployed as an MCP proxy or a similar tool-call proxy between the agent and external services. For each irreversible tool call, ACRFence records an effect log that includes:

thread and branch identifiers, to distinguish execution branches in the same session;
tool name and arguments;
return value or error;
runtime context, such as process, network, and file-access context, which can be enriched by eBPF-based system-level monitors such as AgentSight;
consumed credentials or authorization objects, when applicable.
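
One possible shape for such an effect log is sketched below. The field names mirror the list above, but the class names, types, and JSON-lines serialization are assumptions made for illustration, not a format the paper specifies.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class EffectRecord:
    thread_id: str                      # session or thread identifier
    branch_id: str                      # execution branch within the thread
    tool: str
    arguments: dict
    result: Any = None
    error: Optional[str] = None
    runtime_context: dict = field(default_factory=dict)       # e.g. process or network facts
    consumed_credentials: list = field(default_factory=list)  # tokens spent by this call
    recorded_at: float = field(default_factory=time.time)


class EffectLog:
    """Append-only, in-memory log of irreversible tool calls."""

    def __init__(self):
        self._records = []

    def append(self, record: EffectRecord) -> None:
        self._records.append(record)

    def for_thread(self, thread_id: str) -> list:
        return [r for r in self._records if r.thread_id == thread_id]

    def consumed(self) -> set:
        """All credentials that any recorded call has already spent."""
        return {c for r in self._records for c in r.consumed_credentials}

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self._records)
```

A real proxy would need durable storage that survives the agent's own restore, since a log that is rolled back together with the agent could no longer witness the earlier effects.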

When the agent restores from a checkpoint and issues another tool call, ACRFence does not immediately forward it. It first compares the new call with the historical effect log:

Semantically equivalent: replay. If the new call only changes non-intent fields such as request id or timestamp, while recipient, amount, resource target, and other intent fields are the same, ACRFence returns the previously recorded response without re-executing the external operation.
Semantically divergent: fork. If the new call changes intent-critical fields, such as a different recipient or a different customer deletion target, ACRFence blocks the call, shows the prior effect log, and requires an explicit fork.
Credential reuse: reject or inform. If the call tries to reuse a consumed token, ACRFence rejects the call or informs the agent before the request reaches the target service.

We use an analyzer LLM for semantic comparison instead of requiring every tool to provide a hand-written schema and idempotency rule. For example, two transfer calls with different UUIDs but the same amount and recipient should be treated as the same intent. Two delete_customer_data calls with the same approval token but different customer ids should be treated as dangerous divergence. The analyzer runs only on the restore path, not on every normal tool call.
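
The three outcomes can be illustrated with a deliberately simplified stand-in for the analyzer. The sketch below replaces the LLM with a hand-written split between intent and non-intent fields, which is exactly the per-tool rule the design tries to avoid requiring; the field names, the Decision enum, and the classify function are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    FORWARD = "forward"                   # nothing comparable in the log
    REPLAY = "replay"                     # same intent: return the recorded response
    FORK_REQUIRED = "fork_required"       # intent changed: block until an explicit fork
    CREDENTIAL_REUSE = "credential_reuse"


NON_INTENT_FIELDS = {"request_id", "timestamp"}  # assumed to carry no intent
CREDENTIAL_FIELD = "approval_token"              # assumed name of the token argument


def intent_of(args: dict) -> dict:
    return {k: v for k, v in args.items() if k not in NON_INTENT_FIELDS}


def classify(tool: str, args: dict, history: list, consumed_tokens: set):
    """history holds (tool, args, response) tuples recorded before the restore."""
    prior = [entry for entry in history if entry[0] == tool]
    for _, old_args, response in prior:
        if intent_of(old_args) == intent_of(args):
            return Decision.REPLAY, response
    if args.get(CREDENTIAL_FIELD) in consumed_tokens:
        return Decision.CREDENTIAL_REUSE, None
    if prior:
        return Decision.FORK_REQUIRED, prior
    return Decision.FORWARD, None
```

With this rule, a transfer that differs only in request_id replays the recorded response, a transfer to a new recipient requires a fork, and a delete_customer_data call that presents a consumed token for a different customer is flagged as credential reuse. Exact dictionary equality is far cruder than semantic comparison, so the sketch shows only the control flow.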

ACRFence aims to provide two guarantees:

Replay safety: semantically equivalent irreversible calls after restore do not execute again; ACRFence returns the cached result.
Divergence detection: semantically different calls after restore must explicitly fork; they cannot silently inherit external effects or authority from an old branch.

How This Differs from Idempotency and Durable Execution

Idempotency is still important, but it solves the problem of "the same request is retried." ACRFence works one level higher, at agent intent: request fields may change while intent stays the same, or the fields may look valid while intent has drifted to a new target.
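
A small toy service makes the limit of idempotency keys concrete. Everything here is hypothetical (the PaymentService class, the key format, the agent_step helper); the point is only that deduplication on a key works when the same request is retried, and fails when a restored agent regenerates the request with a fresh key.

```python
import uuid


class PaymentService:
    """Toy service that deduplicates on an idempotency key only."""

    def __init__(self):
        self.seen = {}        # idempotency key -> receipt
        self.transfers = []   # side effects that actually happened

    def transfer(self, idempotency_key: str, recipient: str, amount: int) -> dict:
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]   # retry of the same request
        self.transfers.append((recipient, amount))
        receipt = {"id": len(self.transfers), "recipient": recipient, "amount": amount}
        self.seen[idempotency_key] = receipt
        return receipt


def agent_step(service: PaymentService, recipient: str, amount: int) -> dict:
    # The agent derives a fresh key each time it decides to call the tool,
    # as a restored agent regenerating its request would.
    return service.transfer(str(uuid.uuid4()), recipient, amount)
```

Retrying transfer with the same key yields one side effect, while two agent_step calls with identical recipient and amount yield two.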

Durable execution systems usually require deterministic orchestrator logic, with nondeterministic values recorded as side effects and replayed on recovery. That works well for traditional workflows. LLM agents, however, generate their next action from context. Rather than assuming post-restore calls will be byte-identical, ACRFence treats divergence as expected and makes replay versus fork explicit at the tool boundary.

In this division of labor, checkpoint/restore lets the agent return to an earlier state. ACRFence ensures that reconnecting that old state to the external world does not duplicate irreversible side effects or revive consumed authority.

Limitations and Next Steps

The work validates the two attack classes, while ACRFence itself remains a design that still needs a full implementation and system evaluation. Several challenges remain:

The analyzer LLM may misclassify calls, so false replay and false fork risks need careful evaluation.
An adaptive attacker who knows the comparison logic may craft ambiguous parameters to evade semantic detection.
The boundary between "intent fields" and "non-intent fields" is not always obvious for every tool.
The current experiments cover one model and one framework; more agent frameworks, models, and real tool ecosystems should be evaluated.

The core conclusion is clear: once agent frameworks introduce checkpoint, rewind, time travel, and branch exploration, external tool calls cannot rely only on traditional idempotency keys. The restore path is a new security boundary.

Conclusion

Checkpoint/restore makes AI agents easier to debug, recover, and steer across multiple execution paths. But once agents can call external tools, local rollback and external non-rollback create a semantic gap. Action Replay can turn one payment, one resource creation, or one email into many. Authority Resurrection can make consumed authorization reappear in local agent state.

ACRFence records irreversible effects at the tool boundary and enforces replay-or-fork after restore: same intent replays the result without re-execution, different intent must explicitly fork, and consumed credentials cannot be silently reused. As more agent frameworks support checkpoint and time travel, this kind of tool-boundary semantics will become part of the reliability and security foundation.

Runtime Observability and Enforcement for Opaque AI Agents with eBPF: Beyond Sandboxes and Approvals

云微 — Tue, 02 Jun 2026 11:11:21 +0000

AI coding agents now run for hours, complete entire features end-to-end,
optimize production GPU kernels, and merge thousands of pull requests
autonomously. Meanwhile, most agent security still relies on human-in-the-loop
approval, and Anthropic's own data shows users approve 93% of prompts without
meaningful review. The result is predictable: products add bypass modes, users
disable permission gates, and 65% of firms report agent security incidents.

But the deeper problem is not approval fatigue. It is that the agent harness
(the prompt loop, tool routing, permission logic, and sandbox defaults) is
increasingly a third-party product the platform team did not write, running in a
sandbox the platform team may not own. The harness is not a trusted security
boundary. This post argues for separating agent security into three layers with
three different owners: intent authorization (harness-owned), execution
isolation (ownership contested), and side-effect verification (must be
platform-owned). When the layers agree, you have confidence. When they
disagree, you need independent observability and enforcement at the OS level to
detect it, and that is exactly the layer most agent platforms are missing. We
are building two projects in this direction: AgentSight for runtime observation
and ActPlane for runtime harness enforcement. Both use eBPF to provide
independent observability and enforcement below the agent harness.

Why Now: Complexity Up, Guardrails Behind

The important change in 2026 is not that agents exist. It is the scale and
duration of what they do.

A year ago, the typical agent task was "fix this bug" or "write this function."
In 2026, agents routinely run for hours on complex, multi-step work. OpenAI
documented a Codex session that ran for 25 hours uninterrupted,
consuming 13 million tokens and producing 30,000 lines of code from a blank
repository. Anthropic's agentic coding report cites a 12.5-million-line
codebase change completed in a single 7-hour run. Meta's
KernelEvolve uses multi-agent coordination to write and optimize
production GPU kernels, compressing work that previously required weeks of
expert systems engineering into hours. On SWE-bench Verified, top agents now
resolve 60–70% of real GitHub issues, up from under 30% in early 2025. Devin
has merged hundreds of thousands of pull requests across enterprise customers
with a 67% merge rate. Goldman Sachs deployed hundreds of Devin instances
across a 12,000-person engineering team.

Beyond coding, general-purpose autonomous agents have gone mainstream.
OpenClaw, an open-source agent with over 300,000 GitHub stars, connects to LLMs
and executes shell commands, browser automation, email, calendar, and file
operations on the user's machine. CrowdStrike called it "the AI Super Agent"
security teams need to worry about: between January and April 2026, 470
security advisories were filed against it across three disclosure waves.

These are not research demos. They are production workflows: background tasks,
parallel execution, multi-hour sessions, end-to-end feature development, kernel
optimization, and enterprise-scale code changes.

Meanwhile, the guardrails designed to keep agents safe have not kept pace.

Most agent security still relies on human-in-the-loop approval: a prompt asks
the user to approve or deny each action before it executes. This works for short
sessions with a few tool calls. It does not work when an agent makes hundreds of
decisions over hours of autonomous operation.

The evidence suggests that approval-based control is already failing in
practice. Anthropic's own data shows that Claude Code users approve 93% of
permission prompts, a rate consistent with rubber-stamping
rather than meaningful review. An independent stress test of Claude Code's auto
mode found an 81% false negative rate on ambiguous
state-changing actions, meaning the classifier allowed 4 out of 5 actions that
should have required human review. Real incidents have followed: in documented
cases, users running agents without permission gates had their home directories
deleted by rm -rf commands the agent generated. A 2026 industry survey found
that 65% of firms reported AI agent security incidents, primarily unauthorized
data access, credential exposure, and exfiltration to external endpoints, with
most involving organizations lacking proper agent access controls.

Products have responded by adding bypass mechanisms. Claude Code offers
--dangerously-skip-permissions. Windsurf's Cascade agent proceeds
autonomously where Cursor stops to ask. Community guides now
focus on "how to safely use YOLO mode." Anthropic researcher Nicholas Carlini
ran 16 parallel Claude agents with permissions bypassed, with the
caveat: "Run this in a container, not your actual machine."

This is the tension: the more capable agents become, the more users want to
let them run uninterrupted, and the less effective human-in-the-loop becomes as
the primary security boundary.

That tension is what creates the need for a different security model.

The Accountability Gap

The deeper issue is not just that agents are more capable. It is that the agent
harness, the component that decides what the agent does, is increasingly a
third-party product the platform team did not write.

A modern agent harness is not a thin wrapper around a model. It includes a
prompt loop, planning and retry logic, tool routing, MCP clients, permission
modes, approval gates, hooks, memory, logs, credential handling, and sometimes
sandbox defaults. In many deployments, that harness comes from a hosted
coding-agent service or an open-source framework the platform team does not
control.

This is already visible across the ecosystem. GitHub Copilot's coding
agent runs autonomously in GitHub Actions, researching
repositories, creating plans, making changes, and opening pull requests. OpenAI
Codex runs background tasks in sandboxed cloud environments with
controlled network access. Claude Code runs cloud sessions in Anthropic-managed
VMs with scoped credentials. Kubernetes SIG is defining Agent
Sandbox for isolated, stateful agent workloads. Recent research
datasets show agent-authored pull requests at scale across real
repositories.

The ownership split is now explicit in major platforms. Anthropic's shared
responsibility framework divides agent security into four
layers (Model, Harness, Tools, Environment) and
stresses that an agent's behavior depends on all four working together, so the
harness, tools, and environment, the layers shaped by the deploying party, are
as decisive as the model itself. Anthropic itself notes that even together,
these layered safeguards are not a guarantee. The question the framework
leaves open is what happens when a failure crosses these layers, and whether
the deployer has independent observability to detect it. In cloud infrastructure,
the analogous gap in shared responsibility led to independent observability
and audit services (CloudTrail, Config, GuardDuty) controlled by the
customer, not the provider. Agent infrastructure has no equivalent yet: the
deployer is told it owns harness, tools, and environment, but often has no
independent way to verify what those layers actually did at runtime.

GitHub's agentic workflow architecture starts from the premise that "agents
cannot be trusted by default, especially in the presence of untrusted inputs",
using kernel-enforced communication boundaries that hold even if the agent
container is compromised. OpenAI's Codex documentation acknowledges that
"devcontainers provide substantial protection, but they do not prevent every
attack."

The platform team still owns the repository, the CI runner, the Kubernetes
cluster, the service accounts, the secrets, and the internal network. But the
runtime acting on those assets may be opaque.

There is also a second split that matters even more for platform teams: the
sandbox may not be controlled by the environment owner either. If the agent
runs in a provider-managed cloud (Claude Code on the web runs in
Anthropic-managed isolated VMs with scoped credential
proxies; Codex runs in OpenAI-managed containers), the
platform team cannot attach its own monitoring, modify isolation policy, or
inspect the sandbox internals. Even Anthropic's own managed agent architecture
explicitly decouples the "brain" (Claude + harness) from the
"hands" (sandboxes), treating containers as disposable and ensuring tokens are never reachable
from the sandbox where generated code runs. This is good architecture, but it is the provider's architecture,
not the platform team's.

When agents run locally or on self-hosted infrastructure (GitHub now supports
self-hosted runners for its coding agent, and Kubernetes
Agent Sandbox provides gVisor/Kata-backed isolation under the
platform operator's control), the environment owner can wrap the agent in its
own sandbox and observability. When agents run in provider-managed
environments, independent observability and enforcement must move to the
boundaries the platform team does control.

This creates the accountability gap: the platform team is responsible for
production impact from a workload it cannot fully inspect, running in a sandbox
it may not own.

The old mental model was simple: the agent is risky, so put it in a sandbox.
The new reality has a different trust boundary: the agent and its harness are
part of the workload, and the environment owner needs independent runtime observability.

Three Layers, Three Questions

MCP, sandboxes, and OS-level observability are all necessary for agent security.
They are not interchangeable. Each answers a fundamentally different question,
and each has a different owner.

Intent authorization (MCP, tool gateways, approval prompts) answers: what
is the agent supposed to do? Which tools may it call, under which identity,
with which scopes? This is the right place to enforce access control before a
dangerous action happens. But a tool approval is not proof of side effects. A
framework log saying "run tests" does not prove that the process tree only ran
tests. An MCP server can be well-authenticated and still be part of a workflow
that causes unexpected local effects. This layer is typically owned or mediated
by the agent harness.

Execution isolation (containers, VMs, network policy, namespaces) answers:
what can the agent reach? Which files, network endpoints, credentials, and
syscalls are available? This is the right place to limit blast radius. But a
sandbox does not automatically record what the agent attempted within its
constraints: which process read a secret, which subprocess opened a network
connection, whether the sandbox policy matched the approved intent. This layer's
ownership is contested: it may belong to the agent provider, the platform team,
or both.

Side-effect verification (OS/runtime observability) answers: what actually
happened? Which processes ran, which files were read, which network connections
were opened, which credentials were accessed? This layer provides facts about
execution, independent of what the framework reported or the sandbox intended.
This layer must be owned by the environment operator. Otherwise there is no
independent source of truth.

The security model is the combination:

authorize intent  →  isolate execution  →  verify side effects
(harness-owned)      (ownership contested)  (must be platform-owned)

When all three layers agree, you have confidence. When they disagree, you need
OS-level observability and controls, independent of the harness, to detect the
mismatch, contain the damage, and reconstruct what happened.

Why Independence Matters

The reason to keep these layers independent follows from the trends above, but
also from a deeper structural argument about ownership and trust.

Approval fatigue

When approvals are relaxed (as the evidence above shows they routinely are),
the other two layers must compensate. If you auto-approve routine actions, you
need an independent way to verify what those actions actually did. If you
bypass permissions for speed, you need stronger containment and stronger observability.

Harness opacity

When the harness is opaque, application-level telemetry cannot be the sole
source of truth. OpenTelemetry GenAI conventions and framework-level tracing are
valuable when you own the framework. But opaque agent apps, closed-source
runtimes, hosted execution, stripped binaries, and arbitrary subprocess trees
can all break the assumption that the framework trace is complete. OpenClaw
illustrates this directly: its behavior is non-deterministic across runs,
producing different tool-calling sequences for the same input, which makes
static code review inadequate and drove multiple teams to build dedicated
runtime observability tools for it (OneClaw, ClawTrace).
Security researchers have already found 30+ vulnerabilities across all major AI
IDEs (Cursor, Copilot, Windsurf, Claude Code), enabling data theft
and remote code execution through prompt injection into agent tool chains.

The MCP layer records intended tool calls. The OS layer records actual side
effects. When the harness is opaque, the gap between these two is exactly where
security incidents live.

The trust boundary is an ownership boundary

The deepest reason for independence is that the three layers serve different
owners with different incentives.

The harness provider's goal is to complete the user's task: maximize
autonomous coding productivity, reduce permission friction, deliver results.
The platform team's goal is to protect the repository, secrets, cluster,
CI runner, internal network, and production APIs. These goals are not opposed,
but they are not identical. When they conflict, when the fastest path to task
completion involves reading credentials, opening network connections, or
modifying files outside the workspace, the harness will optimize for
completion unless an independent boundary stops it.

This is why Bhattarai and Vu argue that
"probabilistic compliance is not compliance": training-based and
classifier-based defenses may reduce empirical attack rates, but cannot provide
deterministic guarantees under adversarial conditions. Only architectural
enforcement can. Red Hat's experience deploying multi-agent systems on Kagenti
frames the same insight differently: this is "a multi-tenancy problem disguised
as an AI problem". The agent is an untrusted tenant. The
platform needs the same kind of isolation, identity, and audit controls it would
apply to any untrusted workload.

The OWASP Top 10 for Agentic Applications reinforces this
framing. Its top risk (ASI01, Agent Goal Hijacking) is that "agents cannot
reliably distinguish instructions from data," and a single malicious input from a
repository, issue, MCP response, or web page can redirect the agent to perform
harmful actions using its legitimate tools. This is not a hypothetical:
Bishop Fox demonstrated confused deputy attacks where
instructions embedded in support tickets caused agents to exfiltrate data using
authorized tools, with "the user's name on every audit log entry." Docker
documented a GitHub prompt injection chain where a
malicious issue hijacked an MCP-connected agent to steal confidential data from
private repositories.

The threat model for platform teams therefore has three adversary categories:

Threat	Which layer fails	Runtime observability detects
Compromised agent (prompt injection, malicious repo/issue/MCP response)	Intent layer: agent is tricked into unintended actions	Actual side effects diverge from stated intent
Untrusted harness (opaque permission logic, incomplete logs, unauditable internal state)	Cannot verify harness completeness	OS-level facts independent of harness reporting
Sandbox escape or policy gap (container breakout, mounted credentials, network bypass)	Isolation layer fails or is misconfigured	Detects behavior outside expected sandbox boundary

AISI's SandboxEscapeBench makes the third category concrete:
frontier models can reliably escape container sandboxes under
misconfigurations that plausibly occur in real systems, and the researchers
discovered four unintended escape paths the benchmark designers had missed.
Their recommendation: "treat plain Docker isolation as insufficient by
default."

In all three cases, OS/runtime observability is the independent control
that lets the platform team detect the problem, regardless of which other layer
failed.

What OS-Level Monitoring Captures

At the OS/runtime layer, observability captures:

Process lineage: the full tree from agent to subprocess to network call
File access: which paths were read or written, including credential paths
Network behavior: connections, destinations, timing, data volume
Container metadata: namespace, cgroup, pod identity, service account
Subprocess behavior: commands that bypass framework instrumentation

This data is collected below the application layer, typically via eBPF,
audit subsystems, or kernel instrumentation. It does not require modifying the
agent app. Its key property is independence: the observability is owned and
operated by the environment operator, not by the agent provider.
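
As a rough illustration of the first item, process lineage can be rebuilt from nothing more than (pid, ppid, comm) records. The event shape and function names below are assumptions for the sketch, not AgentSight's schema or any specific eBPF tool's output.

```python
def build_tables(exec_events):
    """Index events shaped like {"pid": int, "ppid": int, "comm": str}."""
    parent_of = {e["pid"]: e["ppid"] for e in exec_events}
    comm_of = {e["pid"]: e["comm"] for e in exec_events}
    return parent_of, comm_of


def lineage(pid, parent_of, comm_of):
    """Walk parent links from pid up to the first unknown ancestor."""
    chain = []
    seen = set()  # guards against cycles caused by pid reuse in the sample data
    while pid in comm_of and pid not in seen:
        seen.add(pid)
        chain.append(comm_of[pid])
        pid = parent_of.get(pid)
    return list(reversed(chain))


events = [
    {"pid": 100, "ppid": 1, "comm": "agent"},
    {"pid": 101, "ppid": 100, "comm": "shell"},
    {"pid": 102, "ppid": 101, "comm": "python"},
    {"pid": 103, "ppid": 102, "comm": "curl"},
]
parent_of, comm_of = build_tables(events)
print(" -> ".join(lineage(103, parent_of, comm_of)))
```

A production collector also has to handle pid reuse and processes that started before tracing began, which this sketch ignores.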

This makes cross-layer comparison possible:

Framework report:    run tests
Sandbox policy:      workspace mounted, registry allowed, SA token mounted
OS observability:    agent → shell → python → curl
                     read: /var/run/secrets/.../token
                     connect: unknown external host

Each layer saw a different part of the event. Without the OS layer, this is an
undetected credential theft: a service account token read and exfiltrated while
the framework logged only "running tests." The platform team discovers the
breach days later, if at all. OS-level observability is what turns an invisible data leak into a real-time
detection.
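
A comparison like the one above could be automated along these lines. The expectation table, event fields, host names, and paths are all hypothetical placeholders; a real deployment would derive them from its own policy and telemetry.

```python
from fnmatch import fnmatchcase

# Hypothetical expectations attached to one reported intent.
EXPECTED = {
    "run tests": {
        "sensitive_paths": ["/var/run/secrets/*", "/home/*/.ssh/*"],
        "allowed_hosts": {"registry.internal"},
    }
}


def find_mismatches(reported_intent, os_events):
    """Return human-readable findings where OS facts contradict the reported intent."""
    profile = EXPECTED[reported_intent]
    findings = []
    for event in os_events:
        if event["type"] == "read" and any(
            fnmatchcase(event["path"], pattern) for pattern in profile["sensitive_paths"]
        ):
            findings.append(f"{event['comm']} read sensitive path {event['path']}")
        elif event["type"] == "connect" and event["host"] not in profile["allowed_hosts"]:
            findings.append(f"{event['comm']} connected to unexpected host {event['host']}")
    return findings


os_events = [
    {"type": "read", "comm": "python", "path": "/workspace/test_app.py"},
    {"type": "read", "comm": "python", "path": "/var/run/secrets/demo/token"},
    {"type": "connect", "comm": "python", "host": "registry.internal"},
    {"type": "connect", "comm": "curl", "host": "203.0.113.7"},
]
for finding in find_mismatches("run tests", os_events):
    print(finding)
```

The sketch flags the secret read and the unknown destination while leaving the workspace read and the registry connection alone.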

Deployment Reality

OS-level observability is strongest when you control the host, node, or VM where the
agent executes. If the agent runs entirely in a provider-managed environment,
you may not be able to attach eBPF inside it.

In that case, the same model applies, but observability shifts to the boundaries you do control:

Repository permissions and branch protection
Scoped credentials with minimal lifetime
CI/CD and GitHub audit logs
Network proxies and webhook events
Artifact access logs
Provider-supplied session logs

This observability is weaker than owning the runtime boundary, but it is still better
than treating the agent transcript as the only source of truth.

The design question for platform teams is:

Where is the lowest layer I actually control?
That is where independent observability should live.

AgentSight and ActPlane: Observe, Then Enforce

We are building open-source tools that implement the verification layer
described above, each addressing a different half of the problem.

AgentSight is a zero-instrumentation observability tool for
AI agents. It uses eBPF to intercept SSL/TLS traffic and monitor process
behavior at the system boundary, with no code changes, no SDKs, and no
framework integration required. Point it at any agent process (Claude Code,
Codex, a custom Python agent) and it captures the full picture: process
lineage, LLM API calls (prompts and completions), file access, network
connections, and tool invocations, all correlated into a live timeline. This is
the "see what actually happened" layer. Because it operates below the
application, it works even when the agent runtime is opaque, closed-source, or
running arbitrary subprocesses that bypass framework-level tracing. In
practice, this means detecting credential access, data exfiltration attempts,
and unauthorized network connections as they happen, not days later when an
external party reports the breach.

ActPlane is an OS-level harness for AI agents. Where AgentSight
observes, ActPlane enforces. You write behavioral contracts in a YAML-based
rule language (labeled information-flow control, not static allow-lists), and
ActPlane compiles them into an eBPF program that enforces constraints at the
kernel level: every exec, file open, and network connect in the agent's
entire process tree is checked against the policy. When a rule is violated,
ActPlane blocks the action and feeds a human-readable reason back to the agent
through its hook system, so the agent self-corrects rather than failing
silently. The rule language supports data-flow tracking across fork/exec
chains, causal ordering ("run tests before committing"), and staleness
invalidation, going well beyond what sandboxes or tool-layer guards can
express.
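
To make causal ordering and staleness invalidation concrete, here is a minimal user-space sketch of the "run tests before committing" rule as a small state machine. It is not ActPlane's rule language or its eBPF implementation; the event kinds, class name, and file-suffix heuristic are assumptions.

```python
class FreshTestsRule:
    """Allow a commit only if tests passed after the last source edit."""

    def __init__(self, source_suffixes=(".py",)):
        self.source_suffixes = source_suffixes
        self.tests_fresh = False

    def on_event(self, kind, detail=""):
        """Return (allowed, reason) for one observed event."""
        if kind == "test_pass":
            self.tests_fresh = True
        elif kind == "write" and detail.endswith(self.source_suffixes):
            self.tests_fresh = False  # an edit invalidates the last test result
        elif kind == "commit" and not self.tests_fresh:
            return False, "commit blocked: source changed since the last passing test run"
        return True, ""


rule = FreshTestsRule()
for kind, detail in [("test_pass", ""), ("write", "src/app.py"), ("commit", "")]:
    allowed, reason = rule.on_event(kind, detail)
    print(kind, "allowed" if allowed else reason)
```

The reason string plays the role of the feedback returned to the agent: it explains what has to happen (rerun the tests) before the commit can proceed.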

The two tools are complementary. AgentSight provides runtime observability:
independent, below-the-application visibility into what the agent did. ActPlane
provides the enforcement plane: deterministic, kernel-level guarantees about
what the agent cannot do. Together they implement the "verify side effects"
layer of the three-layer model, independent of the harness provider and
independent of who owns the sandbox.

Both are possible implementations of this architecture, not the only ones.
The important point is the separation: observe and enforce at a layer the
environment operator controls, regardless of which agent runtime sits above.

This also addresses ecosystem gaps Anthropic identifies: the need for
cross-deployment security telemetry sharing and open standards for agent
security. Independent runtime observability that travels with the workload,
rather than being locked to a specific harness or provider, is the foundation
for both.

Practical Checklist

If you are building or evaluating an agent platform, ask these questions at
each layer.

Intent authorization (MCP / tool access):

Are MCP servers allowlisted?
Are OAuth scopes minimal and audience-bound?
Are local MCP servers treated as code execution risk?
Are high-risk tools gated by human approval?
Are tool calls logged with enough context for audit?

Execution isolation (sandboxing):

Is filesystem access default-deny or broad workspace mount?
Can the agent reach cloud metadata endpoints?
Is network egress restricted by domain, IP, or proxy?
Are service account tokens mounted into the environment?
Are process, memory, CPU, and runtime duration bounded?
Who owns the sandbox policy: the platform team or the agent provider?

Side-effect verification (runtime observability):

Can you reconstruct process lineage for an agent session?
Can you see file and credential access below the framework?
Can you correlate network egress with pod, service account, and command?
Can you detect mismatch between tool intent and OS side effects?
Can you replay an incident without trusting only framework logs?
Can you demonstrate to auditors (SOC 2, ISO 27001) how automated agent access to production data and credentials is monitored and logged?

Guardrail integration:

Which side effects should be blocked immediately?
Which should trigger alert or human review?
Which policies belong in MCP config, sandbox config, Kubernetes policy, eBPF/LSM, or network controls?
What happens when framework logs and OS-level observability disagree?

Closing

Agent runtimes are becoming more capable, more managed, and more opaque. The
security model cannot depend on any single layer, especially when the layers
have different owners.

The harness is not a trusted boundary. Sandbox ownership depends on the
deployment model. The only layer the environment operator can guarantee it
owns is OS/runtime observability.

MCP authorizes intent. Sandboxes constrain execution. OS-level observability verifies side
effects. Each is necessary; none is sufficient. The practical model is their
separation:

authorize intent  →  isolate execution  →  verify side effects
(harness-owned)      (ownership contested)  (must be platform-owned)

The implementation details vary by deployment, but the separation, and the
ownership question, is the part that should remain stable.

If you are exploring this space, AgentSight and
ActPlane are our open-source starting points for the observation
and enforcement layers respectively.

When CPU Noise Slows Down GPU Inference: Measuring Scheduler and IRQ Impact with eBPF

云微 — Sun, 31 May 2026 23:35:33 +0000

GPU inference often looks like a GPU problem, but the CPU still sits on the critical path. It prepares inputs, launches CUDA kernels, manages synchronization, handles runtime calls, and shares cores with system work, interrupts, and other tenants. If that CPU-side launch path is delayed, the GPU can be left waiting even when the GPU kernels themselves are fast.

This post asks a concrete question: when an LLM inference workload is running on a GPU, how much do Linux CPU scheduling decisions and IRQ handling actually matter?

To answer it, we built an eBPF tracing tool, cuda_sched_trace, that records CUDA kernel launches, scheduler context switches, and hard/soft IRQ events with nanosecond timestamps. We then ran Qwen3 0.6B inference under clean and noisy-neighbor conditions: CPU load from stress-ng, network load from iperf3, disk load from fio, a combined heavy-load case, and a mitigation case using CPU pinning and priority adjustment.

The short version: in a clean environment, scheduler and IRQ overhead are small. Under production-like noisy-neighbor conditions, they can become very real. Combined CPU, network, and disk interference reduced throughput by 20.5%, while simple CPU pinning reduced context switches by 96.3% and recovered most of the lost throughput.

Why CPU Scheduling Shows Up in GPU Inference

Modern GPU workloads, particularly LLM inference and training, require tight coordination between CPU and GPU execution. The CPU is responsible for:

preparing input data and kernel parameters
launching GPU kernels through CUDA APIs
managing memory transfers and synchronization

An interruption to that CPU-side workflow can delay GPU kernel submission. In the worst case, the GPU has available compute capacity but no new work to execute.

The motivation comes partly from Meta's work on sched_ext for AI training optimization, where production issues include "IRQs preempting our important tasks." Network interrupts (NET_RX/NET_TX) and block device interrupts can matter for large distributed training jobs, and custom scheduling policies can improve AI workload performance by 5-20%.

But the impact is workload-dependent. A single-node LLM inference loop is not the same as distributed training with all-reduce traffic. Before investing in custom scheduling, we wanted measurements that separate scheduler problems from normal application behavior.

The study has four goals:

Measure the baseline impact of CPU scheduling on GPU kernel launches.
Characterize IRQ interference patterns and their performance cost.
Quantify noisy-neighbor impact under CPU, network, disk, and combined load.
Evaluate how much CPU pinning and priority adjustment help.

Tracing the Launch Path

We developed cuda_sched_trace, an eBPF-based tracing tool that combines CUDA API uprobes, Linux scheduler tracepoints, and IRQ tracepoints.

CUDA API Tracing

The tool attaches uprobes to CUDA Driver and Runtime APIs:

// Attach to CUDA Driver API
SEC("uprobe/cuLaunchKernel")
int trace_cuLaunchKernel(struct pt_regs *ctx) {
    // Capture: timestamp, pid, tid, grid/block dimensions, shared memory, stream
    // Mark process as GPU process for scheduler tracking
}

// Attach to CUDA Runtime API
SEC("uprobe/cudaLaunchKernel")
int trace_cudaLaunchKernel(struct pt_regs *ctx) { ... }

SEC("uprobe/cudaDeviceSynchronize")
int trace_cudaDeviceSynchronize_enter(struct pt_regs *ctx) { ... }

SEC("uretprobe/cudaDeviceSynchronize")
int trace_cudaDeviceSynchronize_exit(struct pt_regs *ctx) { ... }

Scheduler Event Tracing

Scheduler activity is captured through sched_switch, filtered to GPU-related processes:

SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next) {
    // Only track if prev or next is a GPU process
    // Record: timestamp, prev/next pid, off-cpu/on-cpu duration
}

IRQ Tracing

Hard and soft IRQs are tracked through kernel tracepoints:

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action) {
    // Track hard IRQ entry, record IRQ number and handler name
}

SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action) {
    // Calculate IRQ duration
}

SEC("tp_btf/softirq_entry")
int BPF_PROG(softirq_entry, unsigned int vec_nr) {
    // Track soft IRQ: TIMER, NET_RX, NET_TX, BLOCK, SCHED, RCU, etc.
}

SEC("tp_btf/softirq_exit")
int BPF_PROG(softirq_exit, unsigned int vec_nr) {
    // Calculate soft IRQ duration
}

The data path is straightforward: the GPU application issues CUDA calls; eBPF programs observe CUDA, scheduler, and IRQ events in kernel space; events are sent through a BPF ring buffer; analysis scripts parse the resulting CSV.

┌─────────────────────────────────────────────────────────────────┐
│                         User Space                               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ GPU App     │    │ cuda_sched  │    │ Analysis Scripts    │  │
│  │ (qwen3.cu)  │    │ _trace      │    │ (Python)            │  │
│  └──────┬──────┘    └──────┬──────┘    └──────────┬──────────┘  │
│         │                  │                       │             │
│         │ CUDA calls       │ perf_event            │ CSV parsing │
│         ▼                  ▼                       ▼             │
├─────────────────────────────────────────────────────────────────┤
│                         Kernel Space                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ uprobes     │    │ tracepoints │    │ BPF Ring Buffer     │  │
│  │ (CUDA API)  │    │ (sched,irq) │    │ (Event Queue)       │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Benchmark and Environment

The benchmark is Qwen3 0.6B LLM inference using qwen3.cu.

Property	Value
Model	Qwen3-0.6B-FP32
Task	Single-turn Q&A
Input	"What is eBPF?"
Output	~30-50 tokens
Kernel Pattern	Burst submission (~950 launches per token)
GPU Memory	~3 GB

This benchmark is useful because it resembles modern LLM inference, mixes compute-bound and memory-bound kernels, shows a clear burst submission pattern, and produces a measurable throughput metric in tokens per second.

Component	Specification
CPU	24 cores (specific model TBD)
GPU	NVIDIA GPU with CUDA support
Memory	Sufficient for model + system
OS	Linux 6.15.11-061511-generic
Kernel	BTF-enabled for CO-RE eBPF
CUDA	Driver API + Runtime API

We used three interference tools:

Tool	Purpose	Configuration
stress-ng	CPU load	`--cpu 0 --cpu-method fft` (all cores)
iperf3	Network I/O	Server + Client, 10 parallel streams, 60s
fio	Disk I/O	`randwrite, bs=4k, iodepth=32, 4 jobs`

The full experiment has six scenarios:

Scenario	Description	Interference
Baseline	Clean environment	None
Noisy CPU	CPU-intensive	stress-ng on all cores
Noisy Network	Network I/O	iperf3 localhost loopback
Noisy Disk	Disk I/O	fio random write
Heavy Load	Combined	CPU + Network + Disk simultaneously
Optimized	CPU pinning	stress-ng + taskset -c 0-3 + nice -n -10

Data collection follows the same pattern in every run:

# Start tracing
sudo ./cuda_sched_trace > trace.csv 2> trace.log &
TRACE_PID=$!

# Run benchmark
cd qwen3.cu
/usr/bin/time -v ./runcu Qwen3-0.6B-FP32.gguf -q "What is eBPF?" -r 1

# Stop tracing
sudo kill -SIGINT $TRACE_PID

# Analyze results
python3 analyze_scheduler_impact.py

Analysis Method

The central analysis compares consecutive CUDA kernel launches:

Launch_i -> [interval] -> Launch_i+1

Group A: Launches with NO context switch in interval (normal flow)
Group B: Launches with context switch in interval (preempted)

Preemption Penalty = median(Group B interval) - median(Group A interval)
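The launch-pair comparison can be sketched in a few lines of Python. This is a minimal illustration rather than the code in analyze_scheduler_impact.py: the function name, the input lists, and the sample timestamps are all hypothetical, and it assumes both timestamp lists are already sorted and in nanoseconds.

```python
from bisect import bisect_right
from statistics import median

def preemption_penalty(launch_ns, switch_ns):
    """Split consecutive-launch intervals by whether a context switch fell inside.

    launch_ns: sorted kernel-launch timestamps (ns)
    switch_ns: sorted context-switch timestamps (ns) for the GPU process
    Returns (clean_intervals, preempted_intervals, penalty_ns or None).
    """
    clean, preempted = [], []
    for start, end in zip(launch_ns, launch_ns[1:]):
        # Count switches with start < t <= end using two binary searches.
        hits = bisect_right(switch_ns, end) - bisect_right(switch_ns, start)
        (preempted if hits else clean).append(end - start)
    if not clean or not preempted:
        return clean, preempted, None
    return clean, preempted, median(preempted) - median(clean)

# Hypothetical timestamps: three 2 us gaps and one gap stretched by a preemption.
launches = [0, 2_000, 4_000, 15_304_000, 15_306_000]
switches = [5_000, 15_300_000]
clean, preempted, penalty = preemption_penalty(launches, switches)
print(len(clean), len(preempted), penalty)
```

Binary search keeps the pass close to linear in the number of launches, which matters when a trace holds tens of thousands of launch events.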

To compare runs of different lengths, scheduler and IRQ counts are normalized per 1,000 kernel launches:

Sched/1K = (Total Context Switches / Total Kernel Launches) x 1000
IRQ/1K = (Total IRQs / Total Kernel Launches) x 1000

Performance impact is reported as:

Slowdown % = (Baseline tok/s - Scenario tok/s) / Baseline tok/s x 100
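As a quick check of the two formulas, the sketch below applies them to figures reported later in this post. The helper names are illustrative and not taken from the analysis scripts.

```python
def per_thousand(events, launches):
    """Normalize an event count to events per 1,000 kernel launches."""
    return events / launches * 1000

def slowdown_pct(baseline_tps, scenario_tps):
    """Throughput loss relative to baseline; negative means the scenario was faster."""
    return (baseline_tps - scenario_tps) / baseline_tps * 100

# Clean RQ1 run: 592 context switches over 51,464 launches.
print(round(per_thousand(592, 51_464), 1))
# RQ3 baseline (54.77 tok/s) against heavy load (43.56) and noisy disk (54.95).
print(round(slowdown_pct(54.77, 43.56), 1))
print(round(slowdown_pct(54.77, 54.95), 1))
```

A negative slowdown, as in the disk case, means the scenario ran slightly faster than baseline, which is best read as run-to-run noise.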

RQ1: Does CPU Scheduler Significantly Impact GPU Performance in Clean Environments?

The first question is whether scheduler preemption matters when the machine is otherwise clean.

Experiment design

Condition: clean system, no artificial interference
Metrics: context switch frequency, preemption penalty, total runtime impact
Analysis: launch-pair comparison with and without context switches

Results

Metric	Value
Total Runtime	79.5 seconds
Kernel Launches	51,464
Context Switches	592 (7.44 Hz)
OFF-CPU Time	7.88 ms (0.01%)

Launch-pair analysis shows that almost every consecutive launch pair is unaffected by context switches:

Group	Count	Percentage	P50 Interval	P90 Interval	P99 Interval
No Context Switch	51,401	99.9%	2 us	4 us	4 us
With Context Switch	62	0.1%	15.3 ms	15.5 ms	5.0 s

The median preemption penalty is 15.3 ms. That is large for the affected pairs, but only 62 pairs were affected.

Tail-latency attribution confirms that most outliers are not caused by scheduler preemption:

Percentile	Total Outliers	With Context Switch	Attribution
P95+	2,580	62 (2.4%)	97.6% application
P99+	515	62 (12.0%)	88.0% application

The total scheduler impact is:

Impact = Affected Pairs x Penalty = 62 x 15ms = 0.93 seconds
Percentage = 0.93 / 79.5 = 1.2%
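The same arithmetic, together with the tail attribution from the table above, can be reproduced with a short sketch. The function names are illustrative, and the 15 ms penalty is the rounded median used in the estimate above.

```python
def scheduler_share(affected_pairs, penalty_s, runtime_s):
    """Median-based estimate: every affected pair pays the same penalty."""
    lost_s = affected_pairs * penalty_s
    return lost_s, lost_s / runtime_s * 100

def attribution(outliers, with_switch):
    """Percent of tail outliers that coincide with a context switch, and the rest."""
    sched = with_switch / outliers * 100
    return sched, 100 - sched

lost, pct = scheduler_share(62, 0.015, 79.5)
print(f"{lost:.2f} s lost, {pct:.1f}% of runtime")
print([round(x, 1) for x in attribution(2580, 62)])  # P95+ outliers
print([round(x, 1) for x in attribution(515, 62)])   # P99+ outliers
```

Because it multiplies by the median, the estimate does not reflect the tail of the preempted group, whose P99 interval in the table above is 5.0 s.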

Finding: in clean environments, CPU scheduler impact is minimal at 1.2%. The vast majority of kernel launch pairs, 99.9%, are unaffected by context switches. Tail latency mostly comes from application behavior such as token-generation boundaries, not scheduler preemption.

RQ2: What Is the Impact of IRQ Interrupts on GPU Performance?

The second question is whether IRQs directly interfere with the CPU-side launch path.

Experiment design

Condition: clean system with IRQ tracing enabled
Metrics: IRQ frequency, duration, type distribution
Analysis: IRQ time as percentage of total runtime

Results

Metric	Value
Total Runtime	4.99 seconds
Kernel Launches	125,236
Soft IRQs	653 events
Hard IRQs	0 events

Soft IRQ type distribution:

Type	Count	Total Time	Avg Time	Max Time	Percentage
TIMER	317	0.77 ms	2.4 us	30.1 us	49%
RCU	291	0.40 ms	1.4 us	17.2 us	45%
NET_RX	30	0.13 ms	4.5 us	14.0 us	4.6%
SCHED	15	0.07 ms	4.9 us	18.9 us	2.3%

Total IRQ impact:

Total IRQ Time: 1.38 ms
Percentage of Runtime: 0.0276%

There are real reasons to worry about IRQs: direct handler time, cache pollution, CPU pipeline disruption, and delay accumulation on critical paths. But for this local inference workload, actual IRQ impact is small.

The reason is the workload shape. Qwen3 submits about 950 launches in a burst lasting less than 100 us, so IRQs rarely land inside the burst. Most IRQs happen between bursts during CPU compute. TIMER interrupts dominate and have a small cache footprint. There is little network I/O, so NET_RX appears only 30 times, and there are no hard IRQs from NVMe or SSD block-device interrupts.
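A back-of-the-envelope sketch shows why so few IRQs can land inside a burst. It assumes IRQs arrive uniformly at random over the run and independently of the workload, which is a simplification, and it takes the 100 us figure as an upper bound on burst length. The function name is illustrative.

```python
def expected_irqs_in_bursts(launches, launches_per_burst, burst_s, runtime_s, irqs):
    """Estimate the fraction of runtime spent in launch bursts and the
    number of IRQs expected to fall inside one, under uniform arrivals."""
    bursts = launches / launches_per_burst
    burst_fraction = bursts * burst_s / runtime_s
    return burst_fraction, irqs * burst_fraction

# Figures from this run: 125,236 launches, ~950 per burst, 4.99 s, 653 soft IRQs.
frac, hits = expected_irqs_in_bursts(125_236, 950, 100e-6, 4.99, 653)
print(f"{frac:.2%} of runtime in bursts, about {hits:.1f} of 653 IRQs expected inside one")
```

Under these assumptions the bursts occupy well under 1% of the run, so only a handful of the 653 soft IRQs would be expected to interrupt one.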

Finding: IRQ impact is negligible for local LLM inference at 0.0276%. This does not mean IRQs never matter. Distributed training with network communication or on-the-fly data loading can see much higher IRQ impact, estimated around 5-20%.

RQ3: How Do Noisy Neighbors Affect GPU Performance?

The third question is the most production-relevant one: what happens when the GPU workload shares a machine with other CPU, network, and disk activity?

Experiment design

Scenario	Interference	Purpose
Baseline	None	Reference point
Noisy CPU	stress-ng (all cores)	CPU contention
Noisy Network	iperf3 (10 streams)	Network IRQ
Noisy Disk	fio (4 jobs, randwrite)	Block IRQ
Heavy Load	All three combined	Production simulation
Optimized	CPU stress + taskset + nice	Mitigation test

Results

Normalized metrics per 1,000 kernel launches:

Scenario	Launches	Sched/1K	Soft IRQ/1K	Hard IRQ/1K	IRQ Time (ms)
Baseline	56,882	22.8	5.8	0.0	0.62
Noisy CPU	61,184	11,932.8	6.4	0.0	0.33
Noisy Network	154,394	6.0	2.7	0.0	0.92
Noisy Disk	126,670	29.3	3.9	0.1	1.03
Heavy Load	99,424	6,044.6	2.4	0.0	0.37
Optimized	108,984	445.2	2.8	0.0	0.71

Performance impact:

Scenario	tok/s	Runtime (s)	Slowdown	Context Switch Increase
Baseline	54.77	3.00	-	1x
Noisy CPU	49.93	4.15	8.8%	524x
Noisy Network	53.23	7.22	2.8%	0.26x
Noisy Disk	54.95	5.60	-0.3%	1.3x
Heavy Load	43.56	6.97	20.5%	265x
Optimized	53.75	5.10	1.9%	19.5x

Scenario Analysis

Noisy CPU (stress-ng) causes the most direct scheduling pressure. Context switches increase 524x, from 22.8 to 11,932.8 per 1,000 launches, and throughput drops by 8.8%. The mechanism is simple: the kernel's fair scheduler (EEVDF on this 6.15 kernel, the successor to CFS) time-slices between the GPU process and the stress-ng workers.

Noisy Network (iperf3) behaves differently. Context switches per 1,000 launches actually decrease, from 22.8 to 6.0, because the network load changes CPU competition patterns. The normalized soft IRQ rate also falls, from 5.8 to 2.7 per 1,000 launches, but total IRQ time rises from 0.62 ms to 0.92 ms. Throughput drops only 2.8%. In this local setup, network I/O primarily shows up as IRQ overhead rather than scheduler pressure.

Noisy Disk (fio) introduces the first hard IRQs, corresponding to block-device interrupts, but context switches remain low and throughput is effectively unchanged at -0.3% slowdown. Disk I/O has little impact on this workload.

Heavy Load (CPU + Network + Disk) is the worst case. Throughput drops by 20.5%, and scheduler events rise to 6,044.6 per 1,000 launches, a 265x increase over baseline. Interestingly, that is only 50.7% of the context-switch rate in the Noisy CPU case. The interference sources compete with each other, but their combined effect is still worst overall.

Heavy-load soft IRQ breakdown:

Type	Count	Total Time	Avg Time
RCU	213	217.4 us	1.0 us
TIMER	17	122.9 us	7.2 us
SCHED	5	33.3 us	6.7 us

Finding: noisy neighbors significantly affect GPU performance. Combined CPU, network, and disk interference causes 20.5% degradation. The signatures differ by source: CPU contention increases context switches, network I/O affects IRQ overhead, disk I/O introduces block interrupts with little throughput impact here, and combined load is worst due to cumulative effects.

RQ4: Can CPU Pinning Effectively Mitigate Scheduler Impact?

The fourth question is whether a simple deployment-level mitigation helps before reaching for a custom scheduler.

Experiment design

Baseline: Noisy CPU scenario with stress-ng on all cores
Optimized: same stress-ng load, but the GPU process runs with:
- taskset -c 0-3 to pin it to cores 0-3
- nice -n -10 to give it higher priority
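For a process that wants to apply the same mitigation to itself, Python's standard library exposes the underlying Linux calls. The sketch below is an illustration, not part of the experiment: the function name is hypothetical, it only works on Linux, and it falls back to the current CPU set if none of the requested cores are available.

```python
import os

def pin_and_prioritize(cpus, niceness=-10):
    """Pin the calling process to `cpus` and try to raise its priority.

    Roughly what `taskset -c` plus `nice -n` do from the shell (Linux only).
    """
    allowed = os.sched_getaffinity(0)
    # Keep only cores this process may use; fall back if none of them exist.
    target = (set(cpus) & allowed) or allowed
    os.sched_setaffinity(0, target)
    try:
        os.setpriority(os.PRIO_PROCESS, 0, niceness)
    except PermissionError:
        pass  # negative nice values need root or CAP_SYS_NICE
    return os.sched_getaffinity(0), os.getpriority(os.PRIO_PROCESS, 0)

print(pin_and_prioritize(range(0, 4)))
```

Pinning restricts where the GPU process may run, but it does not keep other tasks off those cores, which is why the optimized scenario still sees some context switches.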

Results

Metric	Noisy CPU	Optimized	Improvement
Sched/1K	11,932.8	445.2	96.3% reduction
tok/s	49.93	53.75	7.6% improvement
vs. Baseline	8.8% slower	1.9% slower	Significant recovery

CPU pinning and priority adjustment recover most of the lost throughput. But the optimized case still has 445.2 scheduler events per 1,000 launches, compared with 22.8 in the clean baseline, which is 19.5x the baseline rate.

Complete elimination is hard because:

stress-ng workers may still be scheduled on cores 0-3.
System daemons and kernel threads cannot be fully excluded by taskset.
IRQ affinity may still route interrupts to pinned cores.

For stronger isolation, the next steps are kernel-level isolation and IRQ placement:

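# 1. Use isolcpus kernel parameter (boot time, on the kernel command line)
isolcpus=4-7 nohz_full=4-7

# 2. Bind GPU process to isolated cores
taskset -c 4-7 ./gpu_app

# 3. Bind IRQs away from GPU cores. A redirect cannot target a glob, so loop
#    over the files; some IRQs cannot be moved and will report an error.
for f in /proc/irq/*/smp_affinity_list; do
    echo 0-3 | sudo tee "$f" > /dev/null
done

# 4. Use cgroups for CPU isolation. cpuset.cpus belongs to the cpuset
#    controller, which also needs cpuset.mems set on cgroup v1.
cgcreate -g cpuset:gpu_workload
cgset -r cpuset.cpus=4-7 gpu_workload
cgset -r cpuset.mems=0 gpu_workload   # assumes NUMA node 0
cgexec -g cpuset:gpu_workload ./gpu_app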

Finding: CPU pinning is highly effective. It reduces context switches by 96.3% and raises throughput by 7.6% over the unpinned Noisy CPU run, to within 1.9% of baseline. Closing the remaining gap, or holding up under heavier combined load, calls for deeper isolation such as isolcpus, nohz_full, cpusets, and IRQ affinity management.

What the Results Mean

The results point to four practical insights.

First, environment matters. The cost ranges from a 1.2% scheduler impact in a clean environment to a 20.5% throughput loss under combined heavy load. Optimizing the scheduler on a quiet dedicated server may not be worth the complexity. On a shared host, it can be the difference between stable and degraded inference.

Second, workload shape matters. Qwen3 has bursty kernel submission, roughly 950 launches in less than 100 us per token burst. That shape makes it resilient to many IRQs because interrupts usually occur between bursts. A different workload with continuous network communication, streaming input, or tighter CPU-GPU handoff might behave differently.

Third, interference sources have distinct signatures:

Interference	Primary Impact	Secondary Impact
CPU	Context switches	None
Network	IRQ overhead	Slight scheduling
Disk	Hard IRQs	Minimal
Combined	All of above	Worst overall

Fourth, simple mitigations work, but only up to a point:

CPU pinning: very effective, 96% context-switch reduction
Priority adjustment: helpful but limited
Full isolation: requires kernel configuration and IRQ affinity management

Comparison with Meta's sched_ext Findings

Our results differ from Meta's AI training observations because the workload is different.

Aspect	Meta (AI Training)	Our Study (LLM Inference)
Primary Issue	Network IRQ (NET_RX)	CPU scheduling
IRQ Impact	5-20%	0.03% (local inference)
Optimization	sched_ext layer	taskset + nice
Workload	Distributed training	Single-node inference

The key difference is communication. Distributed training constantly exchanges data through all-reduce, making NET_RX a major bottleneck. Local inference has minimal network I/O, so the dominant issue under noise is CPU scheduling rather than network interrupts.

Limitations

There are several limits to this study:

eBPF tracing itself adds 1-5% overhead.
The tool only supports CUDA, not OpenCL or HIP.
The trace does not include GPU-side execution timing, so it cannot directly measure actual kernel runtime.
IRQ attribution is limited: the trace cannot always identify which process caused a given IRQ.
The experiments use a single GPU and do not cover multi-GPU behavior.

Practical Recommendations

For production deployments:

Environment	Recommendation	Expected Benefit
Dedicated Server	No optimization needed	-
Shared Server (light)	taskset + nice	5-10% improvement
Shared Server (heavy)	isolcpus + IRQ affinity	15-20% improvement
Kubernetes	CPU limits + nodeSelector	Varies

The decision tree is simple:

Is GPU workload latency-sensitive?
├── No -> No optimization needed
└── Yes -> Is server shared?
    ├── No -> Monitor only, optimize if needed
    └── Yes -> How heavy is colocated load?
        ├── Light -> taskset + nice
        └── Heavy -> isolcpus + dedicated cores

Conclusion

CPU scheduling and IRQ handling do not always matter for GPU inference, but they matter under the conditions where production systems often run: shared hosts, background load, and noisy neighbors.

The clean baseline shows minimal overhead: 1.2% scheduler impact and 0.03% IRQ impact. But combined CPU, network, and disk interference causes 20.5% throughput degradation. Under CPU contention alone, CPU pinning cuts context switches by 96.3% and brings throughput from 8.8% below baseline to 1.9% below, recovering most of the lost performance but not all of it.

The practical lesson is to measure first. Use tracing to identify whether your workload is scheduler-bound, IRQ-sensitive, or mostly application-limited. Then choose the mitigation that matches the signature: CPU pinning for CPU contention, IRQ affinity for interrupt interference, I/O tuning for block-device pressure, and full CPU isolation when the workload is latency-sensitive and colocated load is heavy.

References

Meta Platforms, Inc. "Accelerating AI Training with sched_ext." Linux Plumbers Conference 2025. https://lpc.events/event/19/contributions/2039/
NVIDIA Corporation. "CUDA Driver API Reference." https://docs.nvidia.com/cuda/cuda-driver-api/
Linux Kernel Documentation. "BPF Documentation." https://www.kernel.org/doc/html/latest/bpf/
stress-ng. "A tool to load and stress a computer system." https://github.com/ColinIanKing/stress-ng
iperf3. "A TCP, UDP, and SCTP network bandwidth measurement tool." https://github.com/esnet/iperf
fio. "Flexible I/O Tester." https://github.com/axboe/fio

eBPF Tutorial by Example 50: Composable Traffic Control with TCX Links

云微 — Sun, 31 May 2026 23:34:31 +0000

Ever tried attaching multiple BPF programs to the TC ingress path and got frustrated managing qdisc handles, filter priorities, and the tc CLI? Or needed one application's TC program to coexist safely with another's without accidentally overwriting it? Traditional cls_bpf attachment through tc works, but it inherits decades of queueing discipline plumbing that was never designed for the BPF-centric world. What if you could attach, order, and manage TC programs using the same link-based API that XDP and cgroup programs already enjoy?

This is what TCX (Traffic Control eXtension) solves. Introduced by Daniel Borkmann and merged in Linux 6.6, TCX provides a lightweight, fd-based multi-program attach infrastructure for the TC ingress and egress data path. Programs get BPF link semantics (safe ownership, auto-detachment on close, and explicit ordering through BPF_F_BEFORE / BPF_F_AFTER flags) without touching a single qdisc or filter priority.

In this tutorial, we'll attach two TCX ingress programs to the loopback interface, place one before the other, query the kernel's live chain state, and generate traffic to verify execution order.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/50-tcx

Introduction to TCX: Why Classic TC Attachment Needed a Rethink

The Problem: Qdisc Plumbing and Unsafe Ownership

Classic tc BPF attachment (cls_bpf) was bolted onto the existing Traffic Control framework. To attach a BPF program, you first needed a clsact qdisc on the interface, then added a filter with a handle and priority. This worked fine for a single operator, but created real problems in cloud-native environments where multiple applications need to attach TC programs to the same interface:

No ownership model: A tc filter del from one application can accidentally remove another application's program. There's no protection against this because classic tc filters are identified by handle/priority, not by the process that created them.
Priority conflicts: Two applications might pick the same priority number. The second attachment silently replaces the first.
Permanent attachment by default: Classic tc filters persist until explicitly removed. If the application that attached a filter crashes without cleanup, the filter remains, potentially with stale program logic.
CLI dependency: Even with libbpf, the attachment model was tied to netlink, the same mechanism the tc CLI uses. This meant your BPF application was sharing a control plane with every other tc user on the system.

These issues became acute in projects like Cilium, where the BPF dataplane needs to coexist with third-party CNI plugins, observability agents, and security tools that all want to hook into TC.

The Solution: Link-Based Multi-Program Management

TCX takes a fundamentally different approach. Instead of piggybacking on qdisc infrastructure, it provides a dedicated, qdisc-less extension point for BPF programs at the TC ingress and egress hooks. The key design principles:

BPF Link Semantics: bpf_program__attach_tcx() creates a BPF_LINK_TYPE_TCX link. Like XDP links and cgroup links, TCX links give you safe ownership: the attachment's lifetime is tied to the link's file descriptor, it auto-detaches when the fd is closed, and it cannot be accidentally overridden by another application.

Explicit Ordering: Instead of implicit priority numbers, you place programs relative to each other using BPF_F_BEFORE and BPF_F_AFTER. You can also use BPF_F_REPLACE to atomically swap a specific program. All operations support an expected_revision field that prevents race conditions during concurrent modifications.

Chain Return Codes: TCX defines simplified return codes that make multi-program composition explicit:

Return Code	Value	Meaning
`TCX_NEXT`	-1	Non-terminating; pass the packet to the next program in the chain
`TCX_PASS`	0	Accept the packet and terminate the chain
`TCX_DROP`	2	Drop the packet and terminate the chain
`TCX_REDIRECT`	7	Redirect the packet and terminate the chain

Unknown return codes are mapped to TCX_NEXT for forward compatibility.

Coexistence with Classic TC: TCX links can coexist with traditional cls_bpf filters on the same interface. The kernel runs TCX programs first, then falls through to classic tcf_classify() if present. This allows gradual migration from classic tc to TCX without a disruptive cutover.

Writing the eBPF Program

Our BPF object contains two programs that demonstrate chain composition. Here is the complete source:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

#ifndef TCX_NEXT
#define TCX_NEXT -1
#endif

#ifndef TCX_PASS
#define TCX_PASS 0
#endif

char LICENSE[] SEC("license") = "GPL";

__u64 stats_hits;
__u64 classifier_hits;
__u32 last_len;
__u16 last_protocol;
__u32 last_ifindex;

SEC("tcx/ingress")
int tcx_stats(struct __sk_buff *skb)
{
    stats_hits++;
    last_len = skb->len;
    last_protocol = bpf_ntohs(skb->protocol);
    last_ifindex = skb->ifindex;
    return TCX_NEXT;
}

SEC("tcx/ingress")
int tcx_classifier(struct __sk_buff *skb)
{
    classifier_hits++;
    return TCX_PASS;
}

Let's walk through this step by step.

Section Names: `SEC("tcx/ingress")`

The SEC("tcx/ingress") annotation tells libbpf that this program should be attached to the TCX ingress hook rather than the classic TC classifier. This is not just a naming convention; libbpf maps this section name to BPF_PROG_TYPE_SCHED_CLS with the appropriate attach type for TCX. The corresponding egress variant is SEC("tcx/egress").

Note that SEC("tc"), SEC("classifier"), and SEC("action") are now considered deprecated by libbpf in favor of the tcx/* section names.

Global Variables as Counters

Instead of using a BPF map for counters, we use global variables (stats_hits, classifier_hits, last_len, etc.). The libbpf skeleton exposes these through skel->bss->stats_hits, which makes the user-space code simpler. This is fine for a single-CPU demo; for production use, you would want per-CPU maps to avoid data races.

Return Codes: `TCX_NEXT` vs `TCX_PASS`

This is the heart of TCX composition:

tcx_stats returns TCX_NEXT, which means "I've done my work, now pass the packet to the next program in the chain." The chain continues executing.
tcx_classifier returns TCX_PASS, which is a terminal verdict: the packet is accepted and no further programs in the chain run.

If we had placed tcx_classifier before tcx_stats in the chain, tcx_stats would never execute because TCX_PASS terminates the chain. Ordering matters, and TCX makes it explicit.
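The effect of ordering can be illustrated with a small user-space simulation. This is a toy model of the chain semantics described above, written in Python for brevity; it is not how the kernel executes programs, and run_chain is a hypothetical helper.

```python
TCX_NEXT, TCX_PASS, TCX_DROP, TCX_REDIRECT = -1, 0, 2, 7
TERMINAL = {TCX_PASS, TCX_DROP, TCX_REDIRECT}

def run_chain(chain, packet):
    """Run programs in order and stop at the first terminal verdict."""
    executed = []
    for prog in chain:
        executed.append(prog.__name__)
        verdict = prog(packet)
        if verdict in TERMINAL:
            return verdict, executed
        # TCX_NEXT, or an unknown code treated as TCX_NEXT: keep going.
    return TCX_NEXT, executed

def tcx_stats(packet):
    return TCX_NEXT   # non-terminating: hand the packet to the next program

def tcx_classifier(packet):
    return TCX_PASS   # terminal: accept and stop the chain

print(run_chain([tcx_stats, tcx_classifier], b"pkt"))
print(run_chain([tcx_classifier, tcx_stats], b"pkt"))
```

In the second call only tcx_classifier appears in the executed list, which is the situation the paragraph above warns about.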

User-Space Loader: Attaching and Querying the Chain

The user-space code demonstrates three key TCX operations: attaching programs, ordering them relative to each other, and querying the live chain.

Step 1: Attach the First Program

classifier_link = bpf_program__attach_tcx(skel->progs.tcx_classifier,
                     ifindex, NULL);

This attaches tcx_classifier to the TCX ingress hook on the specified interface. Passing NULL for options means "use defaults", so the program gets appended to the chain. At this point, the chain has one program.

Step 2: Insert the Second Program Before the First

LIBBPF_OPTS(bpf_tcx_opts, before_opts,
    .flags = BPF_F_BEFORE,
    .relative_fd = bpf_program__fd(skel->progs.tcx_classifier));

stats_link = bpf_program__attach_tcx(skel->progs.tcx_stats,
                    ifindex, &before_opts);

The bpf_tcx_opts structure tells the kernel to insert tcx_stats before tcx_classifier in the chain. The .relative_fd field identifies the reference point, which is the fd of the already-attached classifier program. After this, the chain is: tcx_stats → tcx_classifier.

You could equivalently use BPF_F_AFTER with a different reference to achieve the same ordering. The important point is that you express the desired order directly, rather than hoping that two numeric priorities sort correctly.
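To make relative placement and revision checking concrete, here is a toy Python model of a chain. It is a sketch for intuition only: the Chain class, its method names, and the starting revision of 1 are assumptions of this model, not the kernel's data structure or libbpf's API.

```python
class Chain:
    """Toy model of a TCX program chain; not the kernel data structure."""

    def __init__(self, start_revision=1):  # starting value is an assumption
        self.progs = []
        self.revision = start_revision

    def attach(self, name, before=None, after=None, expected_revision=None):
        # Reject the update if the chain changed since the caller last looked.
        if expected_revision is not None and expected_revision != self.revision:
            raise RuntimeError("stale revision: chain changed since it was queried")
        if before is not None:
            index = self.progs.index(before)
        elif after is not None:
            index = self.progs.index(after) + 1
        else:
            index = len(self.progs)  # default: append
        self.progs.insert(index, name)
        self.revision += 1  # every modification bumps the revision

chain = Chain()
chain.attach("tcx_classifier")
chain.attach("tcx_stats", before="tcx_classifier")
print(chain.progs, chain.revision)
```

A caller that queried the chain earlier and passes an outdated expected_revision gets an error instead of silently inserting into a chain it no longer understands.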

Step 3: Query the Chain

LIBBPF_OPTS(bpf_prog_query_opts, query);

query.count = 8;
query.prog_ids = prog_ids;
query.link_ids = link_ids;

err = bpf_prog_query_opts(ifindex, BPF_TCX_INGRESS, &query);

After attachment, the loader queries the kernel for the live chain state. The returned data includes:

revision: A monotonically increasing counter that changes on every chain modification. This is the value you would pass as expected_revision if you wanted to perform atomic updates.
prog_ids[]: The BPF program IDs in chain order.
link_ids[]: The corresponding BPF link IDs.

This allows any observer to determine exactly which programs are attached and in what order, which is invaluable for debugging multi-program pipelines.

Step 4: Generate Traffic and Read Counters

The loader sends a UDP packet to 127.0.0.1 (port 9, discard) to trigger the chain, waits briefly, then reads the global variables to verify both programs executed:

printf("  tcx_stats hits      : %llu\n",
       (unsigned long long)skel->bss->stats_hits);
printf("  tcx_classifier hits : %llu\n",
       (unsigned long long)skel->bss->classifier_hits);

If both counters are 1, the chain worked as expected: tcx_stats ran first (recording metadata and returning TCX_NEXT), then tcx_classifier ran second (counting the packet and returning TCX_PASS).

Compilation and Execution

This example requires Linux 6.6+ with TCX support and a recent libbpf.

cd bpf-developer-tutorial/src/50-tcx
make
sudo ./tcx_demo -i lo

Expected output:

Attached TCX programs to lo (ifindex=1)
TCX ingress chain revision: 3
  slot 0: prog_id=812 link_id=901
  slot 1: prog_id=811 link_id=900

Counters:
  tcx_stats hits      : 1
  tcx_classifier hits : 1
  last ifindex        : 1
  last protocol       : 0x0800
  last length         : 46

The revision is 3 because each attachment bumps the chain's revision counter: it starts at 1 for an empty chain, becomes 2 when tcx_classifier is attached, and becomes 3 when tcx_stats is inserted before it. The query is read-only and does not change the revision.

If you want to inspect the attach behavior without traffic, add -n:

sudo ./tcx_demo -i lo -n

Use -v to enable libbpf debug output, which is helpful for seeing the low-level BPF syscall sequence.

How This Differs from Lesson 20 (Classic TC)

Lesson 20-tc teaches the classic TC data path: creating a clsact qdisc, attaching a SEC("tc") program as a filter, and using __sk_buff for packet inspection. That lesson is still valuable because the packet processing model is identical: TCX programs receive the same __sk_buff context and use the same helpers for packet parsing.

What TCX replaces is the control plane:

Aspect	Classic TC (Lesson 20)	TCX (Lesson 50)
Attach mechanism	Netlink / `tc` CLI	`bpf_program__attach_tcx()`
Ownership	None; anyone can `tc filter del`	BPF link; auto-detaches on fd close
Ordering	Implicit priority numbers	Explicit `BPF_F_BEFORE` / `BPF_F_AFTER`
Multi-program	Manual priority management	Built-in chain with revision tracking
Section name	`SEC("tc")`	`SEC("tcx/ingress")` / `SEC("tcx/egress")`
Kernel requirement	Any modern kernel	Linux 6.6+

If you are building new libbpf-based networking tools, TCX is the recommended interface. Cilium has already migrated from classic tc to TCX for its dataplane.

Summary

In this tutorial, we learned how TCX modernizes TC program attachment by replacing qdisc-based plumbing with BPF link semantics. We attached two ingress programs, controlled their execution order with BPF_F_BEFORE, queried the live chain with bpf_prog_query_opts(), and verified that both programs executed in the correct order. TCX provides safe ownership, explicit ordering, revision-aware updates, and coexistence with classic TC, making it the foundation for composable, multi-program traffic control in modern eBPF applications.

If you'd like to learn more about eBPF, visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or website https://eunomia.dev/tutorials/ for more examples and complete tutorials.

eBPF Tutorial by Example: BPF Token for Delegated Privilege and Secure Program Loading

云微 — Tue, 17 Mar 2026 07:48:37 +0000

Ever needed to let a container or CI job load an eBPF program without giving it full CAP_BPF or CAP_SYS_ADMIN? Or wanted to expose XDP packet processing to a tenant workload while ensuring it can only create the specific map types and program types you've approved? Before BPF token, the answer was binary: either you had the capabilities to do everything in BPF, or you could do nothing. There was no middle ground.

This is what BPF Token solves. Introduced by Andrii Nakryiko and merged in Linux 6.9, BPF token is a delegation mechanism that lets a privileged process (like a container runtime or systemd) create a precisely scoped permission set for BPF operations, then hand it to an unprivileged process through a bpffs mount. The unprivileged process can load programs, create maps, and attach hooks, but only the types that were explicitly allowed. No broad capabilities required.

In this tutorial, we'll set up a delegated bpffs mount in a user namespace, derive a BPF token from it, and use libbpf to load and attach a minimal XDP program, all from a process that has zero BPF capabilities of its own.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_token

Introduction to BPF Token: Solving the Privilege Problem

The Problem: All-or-Nothing BPF Capabilities

Traditional eBPF requires CAP_BPF for program loading and map creation, plus additional capabilities like CAP_PERFMON for tracing, CAP_NET_ADMIN for networking hooks, and CAP_SYS_ADMIN for certain advanced operations. These capabilities are inherently system-wide: you cannot namespace or sandbox CAP_BPF. As the kernel documentation explains, this is by design: BPF tracing helpers like bpf_probe_read_kernel() can access arbitrary kernel memory, which fundamentally cannot be scoped to a single namespace.

This creates a real problem in multi-tenant environments:

Container isolation: A Kubernetes pod that needs to run a simple XDP program must be given CAP_BPF + CAP_NET_ADMIN, which also grants it the ability to load any BPF program type and create any map type. There's no way to say "you can load XDP programs but not kprobes."
CI/CD pipelines: A build job that tests an eBPF-based observability tool needs root-equivalent capabilities to load programs, even though the test only exercises a specific, well-known program type.
Third-party integrations: A service mesh sidecar that attaches sockops programs needs capabilities that also grant it the ability to trace every process on the host.

The result is that organizations either give broad BPF capabilities (weakening their security posture) or prohibit BPF entirely in unprivileged contexts (limiting the technology's adoption).

The Solution: Scoped Delegation Through bpffs

BPF token takes a different approach. Instead of trying to namespace capabilities (which is fundamentally unsafe for BPF), it introduces an explicit delegation model:

A privileged process (container runtime, init system, platform daemon) creates a bpffs instance with specific delegation options that define exactly which BPF operations are allowed.
The privileged process passes this bpffs mount to an unprivileged process (container, CI job, tenant workload).
The unprivileged process derives a BPF token from the bpffs mount. The token is a file descriptor that carries the delegated permission set.
When the unprivileged process makes bpf() syscalls (through libbpf or directly), it passes the token fd. The kernel checks permissions against the token instead of against the process's capabilities.

The token is scoped along four independent axes:

Delegation Option	What It Controls	Example
`delegate_cmds`	Which `bpf()` commands are allowed	`prog_load:map_create:btf_load:link_create`
`delegate_maps`	Which map types can be created	`array:hash:ringbuf`
`delegate_progs`	Which program types can be loaded	`xdp:socket_filter`
`delegate_attachs`	Which attach types are allowed	`xdp:cgroup_inet_ingress` or `any`

Each axis is a bitmask. If a bit isn't set, the corresponding operation is denied even if the token is present. This gives platform engineers fine-grained control: you can allow a container to load XDP programs with array maps but deny it access to kprobes, perf events, or hash-of-maps.
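To make the per-axis check concrete, here is a minimal user-space sketch of the idea. It is a simplified model, not kernel code: every `demo_*` name is made up for illustration, and the enum values are hypothetical and do not match the kernel's real `BPF_MAP_TYPE_*` or `BPF_PROG_TYPE_*` numbering.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type IDs, for illustration only; real kernel values differ. */
enum demo_map_type  { DEMO_MAP_HASH = 1, DEMO_MAP_ARRAY = 2, DEMO_MAP_RINGBUF = 3 };
enum demo_prog_type { DEMO_PROG_SOCKET_FILTER = 1, DEMO_PROG_KPROBE = 2, DEMO_PROG_XDP = 3 };

/* One bitmask per delegation axis: bit N set means type N is allowed. */
struct demo_token {
    uint64_t allowed_maps;
    uint64_t allowed_progs;
};

static inline uint64_t demo_bit(unsigned int type_id)
{
    return UINT64_C(1) << type_id;
}

/* A request passes only if its bit is set on the matching axis. */
bool demo_token_allows_map(const struct demo_token *t, enum demo_map_type type)
{
    return (t->allowed_maps & demo_bit(type)) != 0;
}

bool demo_token_allows_prog(const struct demo_token *t, enum demo_prog_type type)
{
    return (t->allowed_progs & demo_bit(type)) != 0;
}

/* Models delegate_maps=array and delegate_progs=xdp:socket_filter. */
struct demo_token demo_tutorial_policy(void)
{
    struct demo_token t = {
        .allowed_maps  = demo_bit(DEMO_MAP_ARRAY),
        .allowed_progs = demo_bit(DEMO_PROG_XDP) | demo_bit(DEMO_PROG_SOCKET_FILTER),
    };
    return t;
}
```

With this policy, an array map and an XDP program pass the check, while a hash map or a kprobe program is refused even though a token is present.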

The User Namespace Constraint

One critical design decision: a BPF token must be created inside the same user namespace as the bpffs instance, and that user namespace must not be init_user_ns. This is intentional. It means:

A host-namespace bpffs (the one at /sys/fs/bpf) does not produce usable tokens. Tokens only work when the bpffs is associated with a non-init user namespace.
The privileged parent configures the bpffs before passing it to the child, but the child (in its own user namespace) is the one that creates and uses the token.
This design prevents a process with an existing token from using it to escalate privileges outside its namespace boundary.

How libbpf Makes It Transparent

For applications built with libbpf (which is most of them), token usage is nearly transparent. You have three options:

Explicit path: Set bpf_object_open_opts.bpf_token_path when opening the BPF object. libbpf will derive the token from the specified bpffs mount.
Environment variable: Set LIBBPF_BPF_TOKEN_PATH to point to the bpffs mount. libbpf picks it up automatically.
Default path: If the default /sys/fs/bpf is a delegated bpffs in the current user namespace, libbpf uses it implicitly.

Once the token is derived, libbpf passes it to every relevant syscall (BPF_MAP_CREATE, BPF_BTF_LOAD, BPF_PROG_LOAD, and BPF_LINK_CREATE) without any source-code changes in the BPF application.
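The three options can be thought of as a lookup order. The sketch below models that order in plain C, assuming the explicit option takes priority over the environment variable, which in turn takes priority over the default mount. The `demo_*` names are hypothetical and this is not libbpf's actual code, so check the behavior against your libbpf version.

```c
#include <stdlib.h>

/* Default bpffs location that is tried when nothing else is configured. */
const char demo_default_bpffs[] = "/sys/fs/bpf";

/*
 * Return the bpffs path a loader would derive its token from.
 * explicit_path models bpf_object_open_opts.bpf_token_path (NULL if unset).
 */
const char *demo_resolve_token_path(const char *explicit_path)
{
    const char *env;

    if (explicit_path)                      /* 1. explicit option */
        return explicit_path;

    env = getenv("LIBBPF_BPF_TOKEN_PATH");  /* 2. environment variable */
    if (env)
        return env;

    return demo_default_bpffs;              /* 3. default mount */
}
```

In the demo, `token_trace -t <path>` corresponds to the first branch, so neither the environment variable nor `/sys/fs/bpf` is consulted.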

Writing the eBPF Program

The BPF side of this demo is intentionally minimal: a tiny XDP program on loopback. This keeps the focus on the token workflow. Here's the complete source:

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct token_stats {
    __u64 packets;
    __u32 last_ifindex;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct token_stats);
} stats_map SEC(".maps");

SEC("xdp")
int handle_packet(struct xdp_md *ctx)
{
    struct token_stats *stats;
    __u32 key = 0;

    stats = bpf_map_lookup_elem(&stats_map, &key);
    if (!stats)
        return XDP_PASS; /* never return 0 here: 0 is XDP_ABORTED, which drops the packet */

    stats->packets++;
    stats->last_ifindex = ctx->ingress_ifindex;
    return XDP_PASS;
}

A few design choices to note:

BPF_MAP_TYPE_ARRAY was chosen because the delegation policy explicitly allows array maps. If we had used a hash map instead, loading would fail because the token doesn't grant hash map creation permission. This is the token model in action; even trivial program changes can be caught by the delegation policy.

SEC("xdp") matches the delegate_progs=xdp policy. If you changed this to SEC("kprobe/..."), the kernel would reject it at load time with an EPERM because kprobe isn't in the allowed program types.

XDP_PASS simply lets every packet through. The program's only purpose is to prove that a token-backed load and attach succeeded. In production, you'd replace this with real packet-processing logic.

User-Space Loader: Token-Backed Loading

The token_trace.c loader is a standard libbpf skeleton program with one key addition: it passes a bpf_token_path:

struct bpf_object_open_opts open_opts = {};

open_opts.sz = sizeof(open_opts);
open_opts.bpf_token_path = env.token_path;

skel = token_trace_bpf__open_opts(&open_opts);

From this point on, libbpf takes over. When it calls bpf(BPF_MAP_CREATE) to create stats_map, it includes the token fd. When it calls bpf(BPF_PROG_LOAD) for the XDP program, it includes the token fd. When it calls bpf(BPF_LINK_CREATE) to attach to the interface, it includes the token fd.

The rest of the loader is straightforward:

err = token_trace_bpf__load(skel);    // token used for map_create + prog_load
link = bpf_program__attach_xdp(skel->progs.handle_packet, ifindex);  // token used for link_create

After attaching, the loader reads the map before and after generating a test packet to verify the program executed:

err = bpf_map_lookup_elem(map_fd, &key, &before);
// ... generate UDP packet to 127.0.0.1 ...
err = bpf_map_lookup_elem(map_fd, &key, &after);
printf("delta          : %llu\n", after.packets - before.packets);

If the delta is 1, the XDP program was successfully loaded and attached using only delegated capabilities.

The Namespace Orchestrator: `token_userns_demo`

Because BPF token requires a non-init user namespace, running a bare token_trace -t /sys/fs/bpf on the host won't work. The token_userns_demo.c wrapper automates the complex namespace choreography. Here's the full sequence:

Step 1: Fork and Create Namespaces

parent (root, init_user_ns)          child (unprivileged, new userns)
         │                                        │
         │   fork()                               │
         ├────────────────────────────────────────>│
         │                                        │
         │                            unshare(CLONE_NEWUSER)
         │                            unshare(CLONE_NEWNS | CLONE_NEWNET)

The child creates a new user namespace (where it maps itself to uid/gid 0), a new mount namespace (so bpffs mounts are private), and a new network namespace (so lo is a fresh interface it can attach to).

Step 2: Create bpffs and Configure Delegation

parent (root, init_user_ns)          child (new userns)
         │                                        │
         │                            fs_fd = fsopen("bpf", 0)
         │   <───── send fs_fd via SCM_RIGHTS ────│
         │                                        │
    fsconfig(fs_fd, "delegate_cmds", ...)         │  (waiting for ack)
    fsconfig(fs_fd, "delegate_maps", "array")     │
    fsconfig(fs_fd, "delegate_progs", "xdp:...")  │
    fsconfig(fs_fd, "delegate_attachs", "any")    │
    fsconfig(fs_fd, FSCONFIG_CMD_CREATE)          │
         │                                        │
         │   ───────── send ack ─────────────────>│

The child calls fsopen("bpf", 0) to create a bpffs filesystem context in its user namespace, then sends the file descriptor to the parent via a Unix socket (SCM_RIGHTS). The parent, running as root in the init namespace, configures the delegation policy with fsconfig(), then materializes the filesystem with FSCONFIG_CMD_CREATE.

This two-step dance is necessary because: (a) the bpffs must be created in the child's user namespace (for the token to be valid there), but (b) only the privileged parent can set delegation options (because those options grant BPF capabilities).
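The fd hand-off in this step relies on SCM_RIGHTS ancillary messages over a Unix-domain socket. Below is a minimal sketch of that mechanism on its own, using only standard socket calls. The `demo_send_fd` and `demo_recv_fd` helpers are illustrative names and are not taken from token_userns_demo.c.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union demo_fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over a connected Unix-domain socket. Returns 0 on success. */
int demo_send_fd(int sock, int fd)
{
    char byte = 'F';  /* one byte of ordinary data carries the ancillary message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union demo_fd_cmsg ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd. Returns the new descriptor, or -1 on failure. */
int demo_recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union demo_fd_cmsg ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file description. This is why the parent can call fsconfig() on a filesystem context that the child created.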

Step 3: Mount and Load

child (new userns)
         │
    mnt_fd = fsmount(fs_fd, 0, 0)
    token_path = "/proc/self/fd/<mnt_fd>"
    set_loopback_up()
    exec("./token_trace", "-t", token_path, "-i", "lo")

The child materializes the bpffs as a detached mount (no mount point needed, since /proc/self/fd/<mnt_fd> gives a path), brings the loopback interface up in its network namespace, and execs token_trace with the bpffs path. From token_trace's perspective, it's just opening a BPF object with a token path. It doesn't know or care about the namespace setup.
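Turning the detached mount's file descriptor into a path is a one-line format operation. A minimal sketch follows; `demo_fd_to_path` is a hypothetical helper name, not a function from the tutorial's source.

```c
#include <stdio.h>

/*
 * Format "/proc/self/fd/<fd>" into buf.
 * Returns 0 on success, -1 if buf is too small for the full path.
 */
int demo_fd_to_path(int fd, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The path is only meaningful inside the process that owns the descriptor. This works in the demo because an open descriptor survives exec() unless it is marked close-on-exec.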

Preparing a bpffs Mount Manually

If you want to experiment with the mount syntax outside the demo wrapper, the repository includes a helper script:

cd bpf-developer-tutorial/src/features/bpf_token
bash setup_token_bpffs.sh /tmp/bpf-token

This mounts bpffs at /tmp/bpf-token with:

delegate_cmds=prog_load:map_create:btf_load:link_create
delegate_maps=array
delegate_progs=xdp:socket_filter
delegate_attachs=any

Why socket_filter? libbpf performs a trivial program-load probe before loading the real BPF object. This probe uses a generic BPF_PROG_TYPE_SOCKET_FILTER program to detect kernel feature support. Without socket_filter in the delegation policy, the probe fails and libbpf refuses to proceed.

Why delegate_attachs=any? The same libbpf probe path also triggers attach-type validation in the kernel's token checking code. Using any avoids having to enumerate every possible attach type for probe compatibility.

Note that a host-namespace mount like this is useful for inspecting the delegation policy (e.g., with bpftool token list), but won't produce working tokens unless the bpf(BPF_TOKEN_CREATE) syscall comes from a matching non-init user namespace.

Compilation and Execution

Build all binaries:

cd bpf-developer-tutorial/src/features/bpf_token
make

Run the end-to-end demo:

sudo ./token_userns_demo

Expected output:

token path     : /proc/self/fd/5
interface      : lo (ifindex=1)
packets before : 0
packets after  : 1
delta          : 1
last ifindex   : 1

The delta: 1 confirms that the XDP program was successfully loaded and attached using a BPF token, with no CAP_BPF or CAP_SYS_ADMIN in the child process.

Add -v for verbose libbpf output to see the token being created and used:

sudo ./token_userns_demo -v

If you already manage your own delegated bpffs in a user namespace, you can run the loader directly:

./token_trace -t /proc/self/fd/<mnt-fd> -i lo

Real-World Applications

While this tutorial uses a minimal XDP program, the BPF token pattern scales to production scenarios:

Container runtimes (LXD, Docker, Kubernetes): Mount a delegated bpffs into a container with only the program and map types the workload needs. LXD already supports this through its security.delegate_bpf option.
CI/CD testing: Give build jobs the ability to load and test specific eBPF programs without granting them host-level capabilities. The delegation policy acts as an allowlist for BPF operations.
Multi-tenant BPF platforms: A platform daemon creates per-tenant bpffs mounts with different delegation policies. One tenant might be allowed XDP + array maps, while another might get tracepoint + ringbuf access.
LSM integration: Because BPF tokens integrate with Linux Security Modules, you can combine token delegation with SELinux or AppArmor policies for defense-in-depth. Each token gets its own security context that LSM hooks can inspect.

Summary

In this tutorial, we learned how BPF token provides a delegation model for eBPF privilege that goes beyond the binary "all or nothing" of Linux capabilities. We walked through the complete flow: a privileged parent configures a bpffs instance with specific delegation options, an unprivileged child in a user namespace derives a token from that bpffs, and libbpf transparently uses the token for map creation, program loading, and attachment. The result is a minimal XDP program running in an unprivileged context, something that was impossible before Linux 6.9.

BPF token is not a niche feature. It represents the kernel's answer to a fundamental question in the eBPF ecosystem: how do you safely share BPF capabilities in a multi-tenant world without granting unconstrained access to the BPF subsystem?


eBPF Tutorial: cgroup-based Policy Control

云微 — Tue, 24 Feb 2026 07:43:56 +0000

Do you need to enforce network access control on containers or specific process groups without affecting the entire system? Or do you need to restrict certain processes from accessing specific devices while allowing others to use them normally? Traditional iptables rules and device permissions are usually applied system-wide, which makes fine-grained control per process group awkward to express.

This is the problem cgroup eBPF solves. By attaching eBPF programs to cgroups (control groups), you can implement policy control based on process membership—only processes belonging to a specific cgroup are affected. This enables container isolation, multi-tenant security, and sandbox environments. In this tutorial, we'll build a complete "policy guard" program that demonstrates TCP connection filtering, device access control, and sysctl read restrictions—three types of cgroup eBPF usage.

What is cgroup eBPF?

The core idea of cgroup eBPF is simple: attach an eBPF program to a cgroup, and all processes in that cgroup will be controlled by this program. Unlike XDP/tc which filter traffic by network interface, cgroup eBPF filters by process membership—put a container in a cgroup, attach a policy program, and that container's network access, device access, and sysctl reads/writes are all under your control. Processes in other cgroups are completely unaffected.

This model is a good fit for container and multi-tenant scenarios. Kubernetes networking implementations such as Cilium attach eBPF programs at cgroup hooks as part of their datapath. You can also use it for device isolation (e.g., restricting which containers can access GPUs), security sandboxes (preventing reads of sensitive sysctls), and more. When a cgroup eBPF program denies an operation, the corresponding userspace syscall fails with EPERM (Operation not permitted).

cgroup eBPF Hook Points

1. `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` - Socket Address Hooks

Triggered on socket address syscalls (bind/connect/sendmsg/recvmsg):

Hook	Section Name	Description
IPv4 bind	`cgroup/bind4`	Filter bind() calls
IPv6 bind	`cgroup/bind6`	Filter bind() calls
IPv4 connect	`cgroup/connect4`	Filter connect() calls
IPv6 connect	`cgroup/connect6`	Filter connect() calls
UDP sendmsg	`cgroup/sendmsg4`, `cgroup/sendmsg6`	Filter UDP sends
UDP recvmsg	`cgroup/recvmsg4`, `cgroup/recvmsg6`	Filter UDP receives
Unix connect	`cgroup/connect_unix`	Filter Unix socket connect

Context: struct bpf_sock_addr - contains user_ip4, user_port (network byte order)

Return semantics: return 1 = allow, return 0 = deny (EPERM)

2. `BPF_PROG_TYPE_CGROUP_DEVICE` - Device Access Control

Hook	Section Name	Description
Device access	`cgroup/dev`	Filter device open/read/write/mknod

Context: struct bpf_cgroup_dev_ctx - contains major, minor, access_type

Return semantics: return 0 = deny (EPERM), non-zero = allow

3. `BPF_PROG_TYPE_CGROUP_SYSCTL` - Sysctl Access Control

Hook	Section Name	Description
Sysctl access	`cgroup/sysctl`	Filter /proc/sys reads/writes

Context: struct bpf_sysctl - use bpf_sysctl_get_name() to get sysctl name

Return semantics: return 0 = reject (EPERM), return 1 = proceed

4. Other cgroup Hooks

cgroup_skb/ingress, cgroup_skb/egress - Packet-level filtering
cgroup/getsockopt, cgroup/setsockopt - Socket option filtering
cgroup/sock_create, cgroup/sock_release - Socket lifecycle
sockops - TCP-level optimization (attached via BPF_CGROUP_SOCK_OPS)

This Tutorial: cgroup Policy Guard

We implement a single eBPF object with three programs:

Network (TCP): Block connect() to a specified destination port
Device: Block access to a specified major:minor device
Sysctl: Block reading a specified sysctl (read-only, safer for testing)

Events are sent to userspace via ringbuf for observability.

Implementation

Shared Header: cgroup_guard.h

This header defines data structures shared between kernel and userspace:

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#ifndef __CGROUP_GUARD_H
#define __CGROUP_GUARD_H

#ifndef TASK_COMM_LEN
#define TASK_COMM_LEN 16
#endif

#define SYSCTL_NAME_LEN 64

enum event_type {
    EVENT_CONNECT4 = 1,
    EVENT_DEVICE   = 2,
    EVENT_SYSCTL   = 3,
};

struct event {
    __u64 ts_ns;
    __u32 pid;
    __u32 type;
    char comm[TASK_COMM_LEN];

    union {
        struct {
            __u32 daddr;  /* IPv4, network order */
            __u16 dport;  /* host order */
            __u16 proto;  /* e.g. 6 for TCP */
        } connect4;

        struct {
            __u32 major;
            __u32 minor;
            __u32 access_type;
        } device;

        struct {
            __u32 write;
            char name[SYSCTL_NAME_LEN];
        } sysctl;
    };
};

#endif /* __CGROUP_GUARD_H */

The event structure uses a union to store type-specific data for different events, saving space while maintaining a unified event format.

eBPF Program: cgroup_guard.bpf.c

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* cgroup_guard.bpf.c - cgroup eBPF policy guard
 *
 * This program demonstrates three types of cgroup eBPF hooks:
 * 1. cgroup/connect4 - TCP connection filtering
 * 2. cgroup/dev - Device access control
 * 3. cgroup/sysctl - Sysctl read/write control
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#include "cgroup_guard.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

/* ===== Configurable options: set by userspace before load ===== */
#define IPPROTO_TCP 6

const volatile __u16 blocked_tcp_dport = 0;                   /* host order */
const volatile __u32 blocked_dev_major = 0;
const volatile __u32 blocked_dev_minor = 0;
const volatile char denied_sysctl_name[SYSCTL_NAME_LEN] = {}; /* NUL-terminated */

/* ===== ringbuf: send denied events to userspace ===== */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24); /* 16MB */
} events SEC(".maps");

static __always_inline void fill_common(struct event *e, __u32 type)
{
    e->ts_ns = bpf_ktime_get_ns();
    e->type = type;
    e->pid = (__u32)(bpf_get_current_pid_tgid() >> 32);
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
}

/* Compare two strings, return 1 if equal, 0 if not
 * Note: b is volatile to handle const volatile rodata arrays correctly */
static __always_inline int str_eq(const char *a, const volatile char *b, int max_len)
{
#pragma unroll
    for (int i = 0; i < SYSCTL_NAME_LEN; i++) {
        char ca = a[i];
        char cb = b[i];
        if (ca != cb)
            return 0;
        if (ca == '\0')
            return 1;
    }
    return 1;
}

/* ===== 1) Network: block TCP connect4 to specified port =====
 * ctx: struct bpf_sock_addr
 * user_ip4/user_port: network byte order (need conversion)
 *
 * Return semantics:
 * - return 1: allow
 * - return 0: deny (userspace gets EPERM)
 */
SEC("cgroup/connect4")
int cg_connect4(struct bpf_sock_addr *ctx)
{
    if (blocked_tcp_dport == 0)
        return 1;

    if (ctx->protocol != IPPROTO_TCP)
        return 1;

    __u16 dport = bpf_ntohs((__u16)ctx->user_port);
    if (dport != blocked_tcp_dport)
        return 1;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (e) {
        fill_common(e, EVENT_CONNECT4);
        e->connect4.daddr = ctx->user_ip4; /* network order */
        e->connect4.dport = dport;         /* host order */
        e->connect4.proto = ctx->protocol;
        bpf_ringbuf_submit(e, 0);
    }

    return 0; /* deny -> userspace gets EPERM on connect */
}

/* ===== 2) Device: block access to specified major:minor =====
 * ctx: struct bpf_cgroup_dev_ctx { access_type, major, minor }
 *
 * Return semantics:
 * - return 0: deny (userspace gets EPERM)
 * - return non-zero: allow
 */
SEC("cgroup/dev")
int cg_dev(struct bpf_cgroup_dev_ctx *ctx)
{
    if (blocked_dev_major == 0 && blocked_dev_minor == 0)
        return 1;

    if (ctx->major != blocked_dev_major || ctx->minor != blocked_dev_minor)
        return 1;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (e) {
        fill_common(e, EVENT_DEVICE);
        e->device.major = ctx->major;
        e->device.minor = ctx->minor;
        e->device.access_type = ctx->access_type;
        bpf_ringbuf_submit(e, 0);
    }

    return 0; /* deny -> -EPERM */
}

/* ===== 3) Sysctl: block reading specified sysctl =====
 * ctx: struct bpf_sysctl
 * Use bpf_sysctl_get_name() to get name
 *
 * Return semantics:
 * - return 0: reject
 * - return 1: proceed
 * If return 0, userspace read/write returns -1 with errno=EPERM
 */
SEC("cgroup/sysctl")
int cg_sysctl(struct bpf_sysctl *ctx)
{
    char name[SYSCTL_NAME_LEN];
    int ret = bpf_sysctl_get_name(ctx, name, sizeof(name), 0);
    if (ret < 0)
        return 1;

    if (denied_sysctl_name[0] == '\0')
        return 1;

    /* Only deny reads, allow writes (safer for testing) */
    if (ctx->write)
        return 1;

    if (!str_eq(name, denied_sysctl_name, SYSCTL_NAME_LEN))
        return 1;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (e) {
        fill_common(e, EVENT_SYSCTL);
        e->sysctl.write = ctx->write;
#pragma unroll
        for (int i = 0; i < SYSCTL_NAME_LEN; i++) {
            e->sysctl.name[i] = name[i];
            if (name[i] == '\0')
                break;
        }
        bpf_ringbuf_submit(e, 0);
    }

    return 0; /* deny -> -EPERM */
}

Understanding the BPF Code

The overall logic of this program is clear: three cgroup hooks handle network connections, device access, and sysctl reads/writes respectively. Each hook follows the same workflow—check if the current operation matches the configured blocking rule, report an event via ringbuf and return 0 (deny) if it matches, otherwise return 1 (allow).

The cg_connect4 function uses SEC("cgroup/connect4") to attach at IPv4 connection time. There's an important detail here: ctx->user_port is in network byte order (big-endian), while our configured port is in host byte order, so we must convert with bpf_ntohs() before comparing. If the destination port matches our configured blocked_tcp_dport, the program returns 0, and the userspace connect() call fails with EPERM.
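The byte-order comparison can be tried in ordinary userspace C. The sketch below mirrors the logic of cg_connect4 with the standard ntohs() in place of bpf_ntohs(). `demo_port_is_blocked` is an illustrative name, not part of the tutorial's source.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * user_port_be: port as stored in the hook context (network byte order).
 * blocked_port: port as configured by the operator (host byte order).
 */
bool demo_port_is_blocked(uint16_t user_port_be, uint16_t blocked_port)
{
    if (blocked_port == 0)  /* 0 means no port rule is configured */
        return false;

    return ntohs(user_port_be) == blocked_port;
}
```

On a little-endian machine, port 9090 (0x2382) read without conversion appears as 0x8223 (33315), so a comparison that skips the conversion would silently never match.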

The cg_dev function handles device access. Its context struct bpf_cgroup_dev_ctx contains three key fields: major and minor identify the device (e.g., /dev/null is 1:3), and access_type indicates the access type (read/write/mknod). We simply compare whether major:minor matches the configured values.
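To find the major:minor pair for a device you want to pass to --deny-device, you can stat the device node. The sketch below does that, and also splits an access_type value into its two halves. The `demo_*` names are hypothetical. The encoding (requested access in the upper 16 bits, device kind in the lower 16) follows the comment in the kernel's UAPI header as we read it, so treat it as an assumption and confirm it against your kernel's linux/bpf.h.

```c
#include <stdint.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Look up the major:minor pair of a device node. Returns 0 on success. */
int demo_dev_numbers(const char *path, unsigned int *out_major, unsigned int *out_minor)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;  /* not a device node */

    *out_major = major(st.st_rdev);
    *out_minor = minor(st.st_rdev);
    return 0;
}

/* Assumed layout: (access bits << 16) | device kind. */
uint32_t demo_access_bits(uint32_t access_type)
{
    return access_type >> 16;
}

uint32_t demo_device_kind(uint32_t access_type)
{
    return access_type & 0xffff;
}
```

On a typical Linux system, calling demo_dev_numbers("/dev/null", ...) yields 1 and 3, the pair used in the test commands later in this tutorial.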

The cg_sysctl function intercepts sysctl reads/writes under /proc/sys/. It uses bpf_sysctl_get_name() to get the sysctl name, in path format like kernel/hostname (slash-separated, not dots). We only block reads, allowing writes—this is safer for testing and won't accidentally change system configuration.
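If you are used to the dotted form printed by the sysctl command (kernel.hostname), you need to convert it before passing it to --deny-sysctl. A minimal sketch of that conversion follows; `demo_sysctl_to_path` is a hypothetical helper. It assumes no path component contains a literal dot, which does not hold for some network interface names.

```c
#include <stddef.h>

/*
 * Copy a dotted sysctl name into out, replacing '.' with '/'.
 * Returns 0 on success, -1 if out is too small.
 */
int demo_sysctl_to_path(const char *dotted, char *out, size_t len)
{
    size_t i;

    if (len == 0)
        return -1;

    for (i = 0; dotted[i] != '\0'; i++) {
        if (i + 1 >= len)  /* keep room for the terminating NUL */
            return -1;
        out[i] = (dotted[i] == '.') ? '/' : dotted[i];
    }
    out[i] = '\0';
    return 0;
}
```

A name that is already slash-separated passes through unchanged, so the helper can be applied to either form.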

The configuration options at the top of the program are declared as const volatile. This is the standard libbpf pattern for read-only global configuration, commonly used alongside CO-RE (Compile Once, Run Everywhere): the values hold their defaults (0 or an empty string) at compile time, and userspace sets the actual values through skel->rodata before load(). As a result, a single compiled BPF object can run with different configurations.

Userspace Loader: cgroup_guard.c

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* cgroup_guard.c - Userspace loader for cgroup eBPF policy guard */
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>
#include <arpa/inet.h>

#include <bpf/libbpf.h>

#include "cgroup_guard.skel.h"
#include "cgroup_guard.h"

static volatile sig_atomic_t exiting = 0;

static void sig_handler(int sig)
{
    (void)sig;
    exiting = 1;
}

static int libbpf_print_fn(enum libbpf_print_level level,
                           const char *format, va_list args)
{
    if (level == LIBBPF_DEBUG)
        return 0;
    return vfprintf(stderr, format, args);
}

static void usage(const char *prog)
{
    fprintf(stderr,
        "Usage: %s [OPTIONS]\n"
        "\n"
        "Options:\n"
        "  -c, --cgroup PATH           cgroup v2 path (default: /sys/fs/cgroup/ebpf_demo)\n"
        "  -p, --block-port PORT       block TCP connect() to this dst port (IPv4)\n"
        "  -d, --deny-device MAJ:MIN   deny device access for (major:minor)\n"
        "  -s, --deny-sysctl NAME      deny sysctl READ of this name\n"
        "  -h, --help                  show this help\n",
        prog);
}

static int handle_event(void *ctx, void *data, size_t data_sz)
{
    (void)ctx;
    (void)data_sz;

    const struct event *e = (const struct event *)data;

    if (e->type == EVENT_CONNECT4) {
        char ip[INET_ADDRSTRLEN] = {0};
        struct in_addr addr = { .s_addr = e->connect4.daddr };
        inet_ntop(AF_INET, &addr, ip, sizeof(ip));

        printf("[DENY connect4] pid=%u comm=%s daddr=%s dport=%u proto=%u\n",
               e->pid, e->comm, ip, e->connect4.dport, e->connect4.proto);
    } else if (e->type == EVENT_DEVICE) {
        printf("[DENY device]   pid=%u comm=%s major=%u minor=%u access_type=0x%x\n",
               e->pid, e->comm, e->device.major, e->device.minor, e->device.access_type);
    } else if (e->type == EVENT_SYSCTL) {
        printf("[DENY sysctl]   pid=%u comm=%s write=%u name=%s\n",
               e->pid, e->comm, e->sysctl.write, e->sysctl.name);
    }

    fflush(stdout);
    return 0;
}

int main(int argc, char **argv)
{
    const char *cgroup_path = "/sys/fs/cgroup/ebpf_demo";
    int block_port = 0;
    int dev_major = 0, dev_minor = 0;
    const char *deny_sysctl = NULL;

    /* Parse command line arguments */
    static const struct option long_opts[] = {
        { "cgroup",      required_argument, NULL, 'c' },
        { "block-port",  required_argument, NULL, 'p' },
        { "deny-device", required_argument, NULL, 'd' },
        { "deny-sysctl", required_argument, NULL, 's' },
        { "help",        no_argument,       NULL, 'h' },
        {}
    };

    int opt;
    while ((opt = getopt_long(argc, argv, "c:p:d:s:h", long_opts, NULL)) != -1) {
        switch (opt) {
        case 'c': cgroup_path = optarg; break;
        case 'p': block_port = atoi(optarg); break;
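        case 'd':
            /* Expect MAJ:MIN, e.g. 1:3 for /dev/null */
            if (sscanf(optarg, "%d:%d", &dev_major, &dev_minor) != 2) {
                usage(argv[0]);
                return 1;
            }
            break;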
        case 's': deny_sysctl = optarg; break;
        default: usage(argv[0]); return 1;
        }
    }

    libbpf_set_print(libbpf_print_fn);
    signal(SIGINT, sig_handler);
    signal(SIGTERM, sig_handler);

    /* Create cgroup directory if needed */
    mkdir(cgroup_path, 0755);

    int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
    if (cg_fd < 0) {
        fprintf(stderr, "open(%s) failed: %s\n", cgroup_path, strerror(errno));
        return 1;
    }

    /* Open and configure BPF skeleton */
    struct cgroup_guard_bpf *skel = cgroup_guard_bpf__open();
    if (!skel) {
        fprintf(stderr, "cgroup_guard_bpf__open() failed\n");
        close(cg_fd);
        return 1;
    }

    /* Write .rodata configuration (must be before load) */
    if (block_port > 0 && block_port <= 65535)
        skel->rodata->blocked_tcp_dport = (__u16)block_port;
    if (dev_major > 0 || dev_minor > 0) {
        skel->rodata->blocked_dev_major = (__u32)dev_major;
        skel->rodata->blocked_dev_minor = (__u32)dev_minor;
    }
    if (deny_sysctl) {
        snprintf((char *)skel->rodata->denied_sysctl_name,
                 SYSCTL_NAME_LEN, "%s", deny_sysctl);
    }

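    /* Declare links and ring buffer before any goto so cleanup never sees garbage */
    struct bpf_link *link_connect = NULL, *link_dev = NULL, *link_sysctl = NULL;
    struct ring_buffer *rb = NULL;

    /* Load BPF programs into kernel */
    int err = cgroup_guard_bpf__load(skel);
    if (err) {
        fprintf(stderr, "cgroup_guard_bpf__load() failed: %d\n", err);
        goto cleanup;
    }

    /* Attach programs to cgroup (libbpf 1.x returns NULL and sets errno on failure) */
    link_connect = bpf_program__attach_cgroup(skel->progs.cg_connect4, cg_fd);
    link_dev = bpf_program__attach_cgroup(skel->progs.cg_dev, cg_fd);
    link_sysctl = bpf_program__attach_cgroup(skel->progs.cg_sysctl, cg_fd);
    if (!link_connect || !link_dev || !link_sysctl) {
        err = -1;
        fprintf(stderr, "attach to cgroup failed: %s\n", strerror(errno));
        goto cleanup;
    }

    /* Setup ring buffer for events */
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        err = -1;
        fprintf(stderr, "ring_buffer__new() failed: %s\n", strerror(errno));
        goto cleanup;
    }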

    printf("Attached to cgroup: %s\n", cgroup_path);
    printf("Config: block_port=%d, deny_device=%d:%d, deny_sysctl_read=%s\n",
           block_port, dev_major, dev_minor, deny_sysctl ? deny_sysctl : "(none)");
    printf("Press Ctrl-C to stop.\n");

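    /* Main event loop */
    while (!exiting) {
        err = ring_buffer__poll(rb, 200 /* ms */);
        if (err == -EINTR) {
            err = 0;    /* interrupted by a signal: treat as a clean shutdown */
            break;
        }
        if (err < 0) {
            fprintf(stderr, "ring_buffer__poll() failed: %d\n", err);
            break;
        }
        err = 0;        /* positive values are the number of records consumed */
    }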

    ring_buffer__free(rb);

cleanup:
    bpf_link__destroy(link_sysctl);
    bpf_link__destroy(link_dev);
    bpf_link__destroy(link_connect);
    cgroup_guard_bpf__destroy(skel);
    close(cg_fd);
    return err ? 1 : 0;
}

Understanding the Userspace Code

The userspace loader's core job is to attach BPF programs to the specified cgroup, then continuously poll the ringbuf to print denied events.

The program first uses getopt_long to parse command-line arguments, getting the cgroup path and three policy configurations. Then it uses open() with O_RDONLY | O_DIRECTORY to open the cgroup directory and get a file descriptor. This fd is the attach target—cgroup eBPF programs are attached to cgroup directories.

Next comes the standard skeleton workflow: open() opens the BPF object, the loader writes the .rodata configuration, and load() loads the object into the kernel. The configuration must be set before load(), because .rodata is frozen as read-only once the object is loaded.

Attaching uses bpf_program__attach_cgroup(prog, cg_fd) to attach each BPF program to the cgroup. Here we attach three programs: connect4, dev, and sysctl. After successful attachment, all processes in this cgroup will have their relevant operations go through these BPF programs.

Finally, the event loop. ring_buffer__poll() polls the ringbuf, calling the handle_event callback whenever events arrive to print them. This lets you see which operations are being denied in real-time.

Building

cd src/cgroup
make

Running

Terminal A: Start the loader

# Block: TCP port 9090, /dev/null (1:3), reading kernel/hostname
sudo ./cgroup_guard \
  --cgroup /sys/fs/cgroup/ebpf_demo \
  --block-port 9090 \
  --deny-device 1:3 \
  --deny-sysctl kernel/hostname

You should see:

Attached to cgroup: /sys/fs/cgroup/ebpf_demo
Config: block_port=9090, deny_device=1:3, deny_sysctl_read=kernel/hostname
Press Ctrl-C to stop.

Terminal B: Start test servers (outside cgroup)

# Start two HTTP servers
python3 -m http.server 8080 --bind 127.0.0.1 &
python3 -m http.server 9090 --bind 127.0.0.1 &

Terminal C: Test from within the cgroup

sudo bash -c '
echo $$ > /sys/fs/cgroup/ebpf_demo/cgroup.procs

echo "== TCP test =="
curl -s http://127.0.0.1:8080 >/dev/null && echo "8080 OK"
curl -s http://127.0.0.1:9090 >/dev/null && echo "9090 OK (unexpected)" || echo "9090 BLOCKED (expected)"

echo
echo "== Device test =="
cat /dev/null && echo "/dev/null OK (unexpected)" || echo "/dev/null BLOCKED (expected)"

echo
echo "== Sysctl test =="
cat /proc/sys/kernel/hostname && echo "sysctl read OK (unexpected)" || echo "sysctl read BLOCKED (expected)"
'

Expected output:

8080 OK - Port 8080 is allowed
9090 BLOCKED (expected) - Port 9090 is blocked
/dev/null BLOCKED (expected) - Device 1:3 is blocked
sysctl read BLOCKED (expected) - Reading kernel/hostname is blocked

Terminal A output (events)

[DENY connect4] pid=12345 comm=curl daddr=127.0.0.1 dport=9090 proto=6
[DENY device]   pid=12346 comm=cat major=1 minor=3 access_type=0x...
[DENY sysctl]   pid=12347 comm=cat write=0 name=kernel/hostname

One-click Test

We provide a test script that automatically compiles, starts servers, runs tests, and cleans up:

sudo ./test.sh

Verifying with bpftool

sudo bpftool cgroup tree /sys/fs/cgroup/ebpf_demo

When to Use cgroup eBPF

Choosing the right technology depends on your control granularity requirements.

cgroup eBPF's control granularity is process groups—put processes in a cgroup, attach a BPF program, and the policy applies to that group. This is perfect for container scenarios: each container is a cgroup, and you can set different network policies, device permissions, and sysctl access rules for different containers. When a process leaves the cgroup, the policy automatically stops applying—no manual cleanup needed.

XDP and tc's control granularity is network interfaces. They handle all traffic passing through a specific NIC, regardless of which process it comes from. If you need high-performance packet processing, DDoS protection, or load balancing, XDP/tc are better choices. But if you want "only allow container A to access port 80, while container B can access any port," XDP/tc become inconvenient.

seccomp-BPF's control granularity is the individual process. It filters system calls, for example preventing a process from calling fork, exec, or socket, which makes it lower-level and well suited to process sandboxing. But a seccomp filter sees only the syscall number and raw argument values and cannot dereference pointers, so it cannot express higher-level semantics such as a connection's destination address or a device's major:minor.

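Traditional iptables/nftables rules are scoped to a network namespace, not to a process group. By default a rule applies to every process whose traffic passes through that namespace; saying "this rule only affects container A" requires extra machinery such as per-container network namespaces or cgroup match extensions.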

In summary: if you need per-container/process-group policies, want to control network, devices, and sysctls together, and want policies to automatically follow process lifecycles, cgroup eBPF is the right choice.

Summary

By binding policies to process groups, cgroup eBPF provides fine-grained control that traditional global policies cannot. This tutorial demonstrated three commonly used cgroup hooks:

cgroup/connect4: Filter destination ports at TCP connection time, blocking disallowed outbound connections
cgroup/dev: Check major:minor at device access time, restricting reads/writes to specific devices
cgroup/sysctl: Check names at sysctl read/write time, preventing sensitive configuration leaks or tampering

This "policy guard" pattern can be extended to production use cases: container network policies (similar to Kubernetes NetworkPolicy), device isolation (GPU/TPU exclusive access), security sandboxes (restricting system information access). With ringbuf event reporting, you can also implement policy auditing and alerting.

If you want to learn more about eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Kernel docs: libbpf program types - all cgroup-related section names
eBPF docs: CGROUP_SOCK_ADDR - socket address hooks explained
eBPF docs: CGROUP_DEVICE - device access control explained
eBPF docs: CGROUP_SYSCTL - sysctl access control explained
Tutorial repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/cgroup

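Full source code is available in the tutorial repository. Requires cgroup v2, libbpf, and Linux kernel 5.8 or newer: cgroup BPF attachment dates back to 4.10, but the cgroup/sysctl hook needs 5.2 and the BPF ring buffer used for event reporting needs 5.8.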

eBPF Tutorial by Example: BPF Dynamic Pointers for Variable-Length Data

云微 — Tue, 17 Feb 2026 07:43:38 +0000

Ever written an eBPF packet parser and struggled with those verbose data_end bounds checks that the verifier still rejects? Or tried to send variable-length events through ring buffers only to find yourself locked into fixed-size structures? Traditional eBPF development forces you to prove memory safety statically at compile time, which becomes painful when dealing with runtime-determined sizes like packet lengths or user-configurable snapshot lengths.

This is what BPF dynptrs (dynamic pointers) solve. Introduced gradually from Linux v5.19, dynptrs provide a verifier-friendly way to work with variable-length data by shifting some bounds checking from compile-time static analysis to runtime validation. In this tutorial, we'll build a TC ingress program that uses skb dynptrs to parse TCP packets safely and ringbuf dynptrs to output variable-length events containing configurable payload snapshots.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/dynptr

Introduction to BPF Dynamic Pointers

The Problem: When Static Verification Isn't Enough

The eBPF verifier's core mission is proving memory safety at load time. Every pointer dereference must be bounded, every array access must be within limits. This works beautifully for simple cases, but becomes a struggle when sizes are determined at runtime.

Consider parsing a packet where the IP header length comes from a 4-bit field, or reading user-configurable amounts of TCP payload. The classic approach requires extensive bounds checking with data_end comparisons, and even correctly written code sometimes fails verification because the verifier cannot trace all possible paths. When working with non-linear skb data (paged buffers), the situation gets worse since that data isn't directly accessible through ctx->data at all.
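To make the classic pattern concrete, here is a minimal user-space sketch of data/data_end style parsing, written as ordinary C rather than as a BPF program. The function name tcp_header_offset and the DEMO_ constants are assumptions for illustration. Every access is preceded by an explicit length comparison, which is the proof obligation the verifier imposes on direct packet access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN     14 /* Ethernet header length */
#define DEMO_MIN_IP_HLEN  20 /* IPv4 header without options */
#define DEMO_MIN_TCP_HLEN 20 /* TCP header without options */

/*
 * Returns the byte offset of the TCP header inside the frame, or -1 when
 * the frame is truncated or is not IPv4/TCP.
 */
static long tcp_header_offset(const uint8_t *data, const uint8_t *data_end)
{
    assert(data != NULL && data_end != NULL && data <= data_end);
    size_t len = (size_t)(data_end - data);

    /* Bounds check 1: Ethernet plus minimal IPv4 header must fit. */
    if (len < DEMO_ETH_HLEN + DEMO_MIN_IP_HLEN)
        return -1;
    if (data[12] != 0x08 || data[13] != 0x00) /* EtherType 0x0800 = IPv4 */
        return -1;

    const uint8_t *ip = data + DEMO_ETH_HLEN;
    if ((ip[0] >> 4) != 4) /* IP version */
        return -1;

    /* IHL is a 4-bit field counting 32-bit words, so it is only known at runtime. */
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    if (ihl < DEMO_MIN_IP_HLEN)
        return -1;
    if (ip[9] != 6) /* protocol 6 = TCP */
        return -1;

    /* Bounds check 2: the TCP header must fit after the variable-length IP header. */
    size_t tcp_off = DEMO_ETH_HLEN + ihl;
    if (len < tcp_off + DEMO_MIN_TCP_HLEN)
        return -1;

    return (long)tcp_off;
}
```

Even in this small example the second bounds check depends on a value read from the packet, which is the kind of runtime-dependent offset that tends to make static verification difficult.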

Variable-length output presents similar challenges. The traditional bpf_ringbuf_reserve() returns a raw pointer, but writing runtime-determined amounts of data to it makes the verifier uncomfortable because it cannot statically prove your writes stay within bounds.

The Solution: Runtime-Checked Dynamic Pointers

Dynptrs introduce an opaque handle type that carries metadata about the underlying memory region including its bounds and type. You cannot dereference a dynptr directly since the verifier will reject such attempts. Instead, you must use helper functions or kfuncs that perform the appropriate safety checks.

The key insight is that some of these checks happen at runtime rather than compile time. Functions like bpf_dynptr_read() and bpf_dynptr_write() validate bounds when they execute and return errors on failure. Functions like bpf_dynptr_slice() return NULL when the requested region cannot be accessed safely. This lets you express logic that would be unprovable statically while maintaining safety guarantees.

For the verifier, dynptrs are tracked specially. They have lifecycle rules (some must be released), type constraints (skb dynptrs behave differently than local dynptrs), and the verifier ensures you follow these rules. The runtime checks are the verifier's way of delegating what it cannot prove statically.
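The idea of an opaque handle with runtime-checked accessors can be modeled in plain user-space C. The sketch below is only a mental model: struct demo_dynptr and the demo_ functions are hypothetical names, and the real kernel implementation tracks more state (type, read-only flag, offset). The argument order mirrors bpf_dynptr_read() and bpf_dynptr_write() minus the flags argument, and out-of-bounds requests return -E2BIG as the real helpers do.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a dynptr: a region plus its size. */
struct demo_dynptr {
    uint8_t *base;
    uint32_t size;
};

/* Copy len bytes starting at off out of the region; checked at runtime. */
static int demo_dynptr_read(void *dst, uint32_t len,
                            const struct demo_dynptr *src, uint32_t off)
{
    assert(dst != NULL && src != NULL);
    /* Written to avoid overflow in off + len. */
    if (off > src->size || len > src->size - off)
        return -E2BIG;
    memcpy(dst, src->base + off, len);
    return 0;
}

/* Copy len bytes into the region at off; checked at runtime. */
static int demo_dynptr_write(const struct demo_dynptr *dst, uint32_t off,
                             const void *src, uint32_t len)
{
    assert(dst != NULL && src != NULL);
    if (off > dst->size || len > dst->size - off)
        return -E2BIG;
    memcpy(dst->base + off, src, len);
    return 0;
}
```

The caller never touches base directly, so a bad offset or length becomes an error code rather than an out-of-bounds access.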

Dynptr API Overview

Helpers vs Kfuncs

The dynptr ecosystem spans two categories of functions. Helper functions are part of the stable UAPI and generally maintain backward compatibility. Kfuncs (kernel functions) are internal kernel exports to BPF with no ABI stability guarantees, meaning they may change between kernel versions.

For dynptrs, the foundational read/write operations are helpers, while newer features like skb dynptrs and slicing are kfuncs. This means some dynptr functionality requires newer kernels and you should verify availability before relying on specific features.

Creating Dynptrs

There are several ways to create dynptrs depending on your data source. The bpf_dynptr_from_mem() helper creates a dynptr from map values or global variables, useful for working with configuration data or scratch buffers. The bpf_dynptr_from_skb() kfunc creates a dynptr from a socket buffer, enabling safe access to packet data including non-linear (paged) regions. For XDP programs, bpf_dynptr_from_xdp() provides similar functionality.

Ring buffer operations use bpf_ringbuf_reserve_dynptr() to allocate variable-length records. Unlike regular bpf_ringbuf_reserve() which returns a pointer to a fixed-size region, the dynptr variant lets you specify the size at runtime. This is crucial for variable-length event structures.

Reading and Writing

The bpf_dynptr_read() helper copies data from a dynptr into a destination buffer. It takes an offset and length, performing runtime bounds checking and returning an error if the read would exceed the dynptr's bounds. This is the safe way to extract data when you need it in a local buffer.

The bpf_dynptr_write() helper does the reverse, copying data into a dynptr. For skb dynptrs, writing may have additional semantics similar to bpf_skb_store_bytes(), and note that writes can invalidate previously obtained slices.

The bpf_dynptr_data() helper returns a direct pointer to data within the dynptr, with the verifier tracking the bounds statically. However, this does NOT work for skb or xdp dynptrs since their data may not be in a single contiguous region.

Slicing for Packet Parsing

For skb and xdp dynptrs, bpf_dynptr_slice() is the primary way to access data. You provide an offset, a length, and optionally a local buffer. The function returns a pointer to the requested data, which may be either a direct pointer into the packet or your provided buffer (if the data needed to be copied from non-linear regions).

The critical rule is that you must NULL-check the return value. A NULL return means the requested region cannot be accessed, either because it exceeds packet bounds or for other internal reasons. Once you have a valid slice pointer, you can dereference it safely within the requested bounds.

There's also bpf_dynptr_slice_rdwr() for obtaining writable slices, with availability depending on the program type and whether the underlying data supports writes.
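The "direct pointer or copy into your buffer" behavior can be sketched with a toy packet that has one linear region and one paged fragment. This is a user-space model under simplifying assumptions, not the kernel's implementation; struct demo_pkt and demo_slice are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy packet: a linear head followed by a single paged fragment. */
struct demo_pkt {
    const uint8_t *head;
    uint32_t head_len;
    const uint8_t *frag;
    uint32_t frag_len;
};

/*
 * Returns a pointer to len bytes at offset off, or NULL if the range is
 * out of bounds. Data that lies wholly in the linear head is returned
 * in place; anything touching the fragment is copied into buf.
 */
static const void *demo_slice(const struct demo_pkt *p, uint32_t off,
                              void *buf, uint32_t len)
{
    assert(p != NULL);
    uint32_t total = p->head_len + p->frag_len;

    if (off > total || len > total - off)
        return NULL;                 /* exceeds packet bounds */
    if (off + len <= p->head_len)
        return p->head + off;        /* linear: no copy needed */
    if (buf == NULL)
        return NULL;                 /* a copy is required but no buffer was given */

    uint8_t *out = buf;
    for (uint32_t i = 0; i < len; i++) {
        uint32_t pos = off + i;
        out[i] = pos < p->head_len ? p->head[pos]
                                   : p->frag[pos - p->head_len];
    }
    return buf;
}
```

Because the caller cannot know in advance which of the two pointers comes back, it should always use the returned pointer and never assume the data landed in its own buffer.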

Ring Buffer Lifecycle

The bpf_ringbuf_reserve_dynptr() function has special lifecycle rules enforced by the verifier. Once you call it, you must call either bpf_ringbuf_submit_dynptr() or bpf_ringbuf_discard_dynptr() on the dynptr, regardless of whether the reservation succeeded. This is not optional since the verifier tracks dynptr state and will reject programs that leak reserved dynptrs.

This differs from regular ringbuf usage where a NULL return from bpf_ringbuf_reserve() means nothing was allocated. With dynptrs, the reserve failure still requires explicit cleanup through discard. The verifier needs this guarantee to ensure proper resource management.

Implementation: TC Ingress with Dynptr Parsing and Variable-Length Events

Our demonstration program attaches to TC ingress and accomplishes three things. First, it creates an skb dynptr from incoming packets using bpf_dynptr_from_skb(). Second, it parses Ethernet, IPv4, and TCP headers using bpf_dynptr_slice() for safe bounds-checked access. Third, it outputs variable-length events through a ringbuf dynptr, including a configurable snapshot of TCP payload.

Complete BPF Program: dynptr_tc.bpf.c

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#include "dynptr_tc.h"

/* kfunc declarations for dynptr operations (v6.4+) */
extern int bpf_dynptr_from_skb(struct __sk_buff *s, __u64 flags,
                               struct bpf_dynptr *ptr__uninit) __ksym;
extern void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, __u32 offset,
                              void *buffer__opt, __u32 buffer__sz) __ksym;

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24); /* 16MB */
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct dynptr_cfg);
} cfg_map SEC(".maps");

SEC("tc")
int dynptr_tc_ingress(struct __sk_buff *ctx)
{
    const struct dynptr_cfg *cfg;
    struct bpf_dynptr skb_ptr;

    /* Temporary buffers for slice (data may be copied here) */
    struct ethhdr eth_buf;
    struct iphdr  ip_buf;
    struct tcphdr tcp_buf;

    const struct ethhdr *eth;
    const struct iphdr  *iph;
    const struct tcphdr *tcp;

    cfg = bpf_map_lookup_elem(&cfg_map, &(__u32){0});
    if (!cfg)
        return TC_ACT_OK;

    /* Create dynptr from skb */
    if (bpf_dynptr_from_skb(ctx, 0, &skb_ptr))
        return TC_ACT_OK;

    /* Parse Ethernet header using slice */
    eth = bpf_dynptr_slice(&skb_ptr, 0, &eth_buf, sizeof(eth_buf));
    if (!eth)
        return TC_ACT_OK;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    /* Parse IPv4 header */
    __u32 ip_off = sizeof(*eth);
    iph = bpf_dynptr_slice(&skb_ptr, ip_off, &ip_buf, sizeof(ip_buf));
    /* ihl < 5 would place tcp_off inside the IPv4 header itself */
    if (!iph || iph->version != 4 || iph->ihl < 5 || iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    /* Parse TCP header */
    __u32 tcp_off = ip_off + ((__u32)iph->ihl * 4);
    tcp = bpf_dynptr_slice(&skb_ptr, tcp_off, &tcp_buf, sizeof(tcp_buf));
    if (!tcp)
        return TC_ACT_OK;

    __u16 dport = bpf_ntohs(tcp->dest);
    __u16 sport = bpf_ntohs(tcp->source);
    __u8 drop = (cfg->blocked_port && (sport == cfg->blocked_port || dport == cfg->blocked_port));

    /* Output variable-length event using ringbuf dynptr */
    if (cfg->enable_ringbuf) {
        __u32 snap_len = cfg->snap_len;
        __u8 payload[MAX_SNAPLEN] = {};

        __u32 payload_off = tcp_off + ((__u32)tcp->doff * 4);
        if (payload_off < ctx->len) {
            __u32 avail = ctx->len - payload_off;
            if (snap_len > avail) snap_len = avail;
            if (snap_len > MAX_SNAPLEN) snap_len = MAX_SNAPLEN;

            if (bpf_dynptr_read(payload, snap_len, &skb_ptr, payload_off, 0))
                snap_len = 0;
        } else {
            snap_len = 0;
        }

        struct event_hdr hdr = {
            .ts_ns = bpf_ktime_get_ns(),
            .ifindex = ctx->ifindex,
            .pkt_len = ctx->len,
            .saddr = iph->saddr,
            .daddr = iph->daddr,
            .sport = sport,
            .dport = dport,
            .drop = drop,
            .snap_len = snap_len,
        };

        /* Reserve variable-length ringbuf record */
        struct bpf_dynptr rb;
        __u32 total_sz = sizeof(hdr) + snap_len;

        long err = bpf_ringbuf_reserve_dynptr(&events, total_sz, 0, &rb);
        if (err) {
            /* Must discard even on failure */
            bpf_ringbuf_discard_dynptr(&rb, 0);
            return drop ? TC_ACT_SHOT : TC_ACT_OK;
        }

        bpf_dynptr_write(&rb, 0, &hdr, sizeof(hdr), 0);
        if (snap_len)
            bpf_dynptr_write(&rb, sizeof(hdr), payload, snap_len, 0);

        bpf_ringbuf_submit_dynptr(&rb, 0);
    }

    return drop ? TC_ACT_SHOT : TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

Understanding the BPF Code

The program begins by declaring the kfuncs it needs. The bpf_dynptr_from_skb() function creates a dynptr from the socket buffer, and bpf_dynptr_slice() returns pointers to specific regions within it. The __ksym attribute tells the loader these are kernel symbols to be resolved at load time.

When parsing headers, notice how we provide local buffers (eth_buf, ip_buf, tcp_buf) to each slice call. The slice function may return a pointer directly into packet data if it's linearly accessible, or it may copy data into our buffer and return a pointer to the buffer. Either way, we get a valid pointer we can dereference, or NULL on failure.

The NULL check pattern is crucial. Each slice call can fail if the requested offset plus length exceeds packet bounds or if the data cannot be accessed for other reasons. Checking for NULL before using the returned pointer is mandatory.

For ringbuf output, we use bpf_dynptr_read() to copy TCP payload from the skb into a local buffer first. This demonstrates reading from an skb dynptr with runtime-determined length (bounded by configuration and available data). The read may fail if bounds are exceeded, in which case we set snap_len to zero.
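The snapshot-length clamping in the program can be read as a small pure function. The sketch below restates that logic in user-space C; the name clamp_snap_len is an assumption, and max_snaplen stands in for MAX_SNAPLEN, whose actual value is defined in dynptr_tc.h and is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns how many payload bytes to snapshot: the requested amount,
 * limited by what the packet actually contains after the TCP header
 * and by the size of the local scratch buffer.
 */
static uint32_t clamp_snap_len(uint32_t requested, uint32_t pkt_len,
                               uint32_t payload_off, uint32_t max_snaplen)
{
    assert(max_snaplen > 0);

    if (payload_off >= pkt_len)
        return 0; /* no payload, e.g. a bare SYN or ACK */

    uint32_t avail = pkt_len - payload_off;
    if (requested > avail)
        requested = avail;
    if (requested > max_snaplen)
        requested = max_snaplen;
    return requested;
}
```

The result is only known at runtime, which is why the program reads the payload through bpf_dynptr_read() instead of indexing into packet memory directly.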

The ringbuf dynptr reserve shows the variable-length allocation pattern. We compute the total size (header plus snapshot) and reserve that exact amount. After writing both the header and payload using bpf_dynptr_write(), we submit the record. Note the discard call on reserve failure to satisfy the verifier's lifecycle requirements.

Complete User-Space Program: dynptr_tc.c

// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <unistd.h>   /* usleep() */
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

#include "dynptr_tc.skel.h"
#include "dynptr_tc.h"

static volatile sig_atomic_t exiting = 0;

static void sig_handler(int signo) { exiting = 1; }

static int handle_event(void *ctx, void *data, size_t data_sz)
{
    const struct event_hdr *e = data;
    char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN];

    inet_ntop(AF_INET, &e->saddr, saddr, sizeof(saddr));
    inet_ntop(AF_INET, &e->daddr, daddr, sizeof(daddr));

    printf("if=%u %s:%u -> %s:%u len=%u drop=%u snap=%u",
           e->ifindex, saddr, e->sport, daddr, e->dport,
           e->pkt_len, e->drop, e->snap_len);

    if (e->snap_len && data_sz >= sizeof(*e) + e->snap_len) {
        printf(" payload=\"");
        for (int i = 0; i < e->snap_len; i++) {
            unsigned char c = e->payload[i];
            putchar((c >= 32 && c <= 126) ? c : '.');
        }
        printf("\"");
    }
    printf("\n");
    return 0;
}

int main(int argc, char **argv)
{
    const char *ifname = NULL;
    struct dynptr_cfg cfg = { .blocked_port = 0, .snap_len = 64, .enable_ringbuf = 1 };

    /* Parse arguments */
    for (int i = 1; i < argc; i++) {
        if (!strcmp(argv[i], "-i") && i+1 < argc) ifname = argv[++i];
        else if (!strcmp(argv[i], "-p") && i+1 < argc) cfg.blocked_port = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-s") && i+1 < argc) cfg.snap_len = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-n")) cfg.enable_ringbuf = 0;
    }

    if (!ifname) {
        fprintf(stderr, "Usage: %s -i <ifname> [-p port] [-s len] [-n]\n", argv[0]);
        return 1;
    }

    int ifindex = if_nametoindex(ifname);
    if (!ifindex) { perror("if_nametoindex"); return 1; }

    signal(SIGINT, sig_handler);
    signal(SIGTERM, sig_handler);

    struct dynptr_tc_bpf *skel = dynptr_tc_bpf__open_and_load();
    if (!skel) { fprintf(stderr, "Failed to load BPF\n"); return 1; }

    /* Configure */
    bpf_map_update_elem(bpf_map__fd(skel->maps.cfg_map), &(__u32){0}, &cfg, BPF_ANY);

    /* Attach to TC ingress */
    struct bpf_tc_hook hook = { .sz = sizeof(hook), .ifindex = ifindex, .attach_point = BPF_TC_INGRESS };
    struct bpf_tc_opts opts = { .sz = sizeof(opts), .handle = 1, .priority = 1,
                                .prog_fd = bpf_program__fd(skel->progs.dynptr_tc_ingress) };

    bpf_tc_hook_create(&hook);
    if (bpf_tc_attach(&hook, &opts)) { fprintf(stderr, "TC attach failed\n"); goto cleanup; }

    struct ring_buffer *rb = cfg.enable_ringbuf ?
        ring_buffer__new(bpf_map__fd(skel->maps.events), handle_event, NULL, NULL) : NULL;

    printf("Attached to TC ingress of %s (ifindex=%d). Ctrl-C to exit.\n", ifname, ifindex);
    printf("blocked_port=%u snap_len=%u ringbuf=%u\n",
           cfg.blocked_port, cfg.snap_len, cfg.enable_ringbuf);

    while (!exiting) {
        if (rb) ring_buffer__poll(rb, 100);
        else usleep(100000);
    }

    ring_buffer__free(rb);
    bpf_tc_detach(&hook, &opts);
    bpf_tc_hook_destroy(&hook);
cleanup:
    dynptr_tc_bpf__destroy(skel);
    return 0;
}

Understanding the User-Space Code

The userspace program loads the BPF skeleton, configures it through the array map, and attaches to TC ingress. The ring buffer callback handle_event() receives each variable-length event and prints it.

Notice how we access the variable-length payload. The struct event_hdr has a flexible array member payload[] at the end. When an event arrives, data_sz tells us the total size, and e->snap_len tells us specifically how much payload was included. We validate both before accessing the payload bytes.
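The double validation can be isolated into a helper. The sketch below uses a deliberately reduced record, struct demo_event, with only a length field and a flexible array member; it is a hypothetical stand-in, since the real struct event_hdr from dynptr_tc.h has more fields. The function name demo_event_payload is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for a variable-length event record. */
struct demo_event {
    uint32_t snap_len;  /* number of payload bytes that follow */
    uint8_t payload[];  /* flexible array member */
};

/*
 * Returns a pointer to the payload and stores its length in *len_out,
 * or returns NULL if the record is too short to contain what its own
 * header claims.
 */
static const uint8_t *demo_event_payload(const void *data, size_t data_sz,
                                         uint32_t *len_out)
{
    assert(data != NULL && len_out != NULL);
    const struct demo_event *e = data;

    if (data_sz < sizeof(*e))
        return NULL; /* header itself is truncated */
    if (e->snap_len == 0 || e->snap_len > data_sz - sizeof(*e))
        return NULL; /* empty, or claims more bytes than were delivered */

    *len_out = e->snap_len;
    return e->payload;
}
```

Checking snap_len against data_sz matters because sizeof(*e) does not include the flexible array, so the header's claim is the only thing describing how far the payload extends.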

The configuration map allows runtime control over blocking behavior and snapshot length without reloading the BPF program. This demonstrates the common pattern of using maps for user-to-kernel communication.

Compilation and Execution

Navigate to the dynptr directory and build:

cd bpf-developer-tutorial/src/features/dynptr
make

This compiles the BPF program with the repository's standard toolchain, generating the skeleton header and linking against libbpf.

Creating a Test Environment

To test properly, we need a network namespace so traffic actually traverses the veth pair rather than going through loopback. The included test.sh script handles this automatically, but here's the manual setup:

# Create network namespace
sudo ip netns add test_ns

# Create veth pair with one end in the namespace
sudo ip link add veth_host type veth peer name veth_ns
sudo ip link set veth_ns netns test_ns

# Configure host side
sudo ip addr add 10.200.0.1/24 dev veth_host
sudo ip link set veth_host up

# Configure namespace side
sudo ip netns exec test_ns ip addr add 10.200.0.2/24 dev veth_ns
sudo ip netns exec test_ns ip link set veth_ns up

# Start HTTP server inside the namespace
sudo ip netns exec test_ns python3 -m http.server 8080 --bind 10.200.0.2 &

Running the Demo

Start the dynptr TC program attached to the host side of the veth:

sudo ./dynptr_tc -i veth_host -p 0 -s 32

In another terminal, make a request:

curl http://10.200.0.2:8080/

You should see output showing captured packets:

Attached to TC ingress of veth_host (ifindex=X). Ctrl-C to exit.
blocked_port=0 snap_len=32 ringbuf=1
if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=221 drop=0 snap=32 payload="HTTP/1.0 200 OK..Server: SimpleH"
if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=742 drop=0 snap=32 payload="<!DOCTYPE HTML>.<html lang="en">"

The output shows HTTP response packets from the server, with the payload field containing the beginning of the response data.

Testing the Drop Policy

Test blocking by specifying port 8080:

sudo ./dynptr_tc -i veth_host -p 8080 -s 32

In another terminal:

curl --max-time 3 http://10.200.0.2:8080/

The curl request should time out: the server's SYN-ACK has source port 8080, so it is dropped at veth_host ingress and the TCP handshake never completes. The dynptr_tc output shows drop=1:

if=X 10.200.0.2:8080 -> 10.200.0.1:XXXXX len=74 drop=1 snap=0

Using the Test Script

For convenience, run the included test script which handles all setup automatically:

sudo ./test.sh

This creates the namespace, runs both capture and blocking tests, and cleans up afterward.

When to Use Dynptrs

Dynptrs shine in several scenarios. Variable-length events are the classic use case since ringbuf dynptrs let you allocate exactly the size you need at runtime, avoiding wasted space from oversized fixed structures or complex multi-record schemes.

Packet parsing benefits from dynptrs when dealing with non-linear skbs or complex protocol stacks where traditional bounds checking becomes unwieldy. The slice API provides a cleaner abstraction that handles both linear and paged data uniformly.

Crypto, signature-verification, and file-attribute operations like bpf_crypto_encrypt(), bpf_verify_pkcs7_signature(), and bpf_get_file_xattr() all take dynptrs as buffer arguments, making dynptr familiarity essential for these advanced use cases.

User ringbuf consumption through bpf_user_ringbuf_drain() delivers samples as dynptrs, enabling safe handling of userspace-provided data in BPF programs.

For simple fixed-size operations where you know bounds at compile time, traditional approaches may be simpler. But as your BPF programs grow more sophisticated, dynptrs become increasingly valuable.

Summary

BPF dynptrs provide a verifier-friendly mechanism for working with variable-length and runtime-bounded data. Rather than proving memory safety entirely through static analysis, dynptrs shift some verification to runtime checks, enabling patterns that would otherwise be impossible or extremely awkward to express.

Our example demonstrated the two primary dynptr patterns: using skb dynptrs with slices for clean packet parsing, and using ringbuf dynptrs for variable-length event output. The key takeaways are to always NULL-check slice returns, always submit or discard ringbuf dynptrs, and remember that skb dynptrs require kfuncs available from Linux v6.4.

As eBPF capabilities continue to expand, dynptrs form an increasingly important part of the toolkit. Whether you're building packet processors, security monitors, or performance tools, understanding dynptrs will help you write cleaner, more capable BPF programs.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Dynptr Concept Documentation: https://docs.ebpf.io/linux/concepts/dynptrs/
bpf_ringbuf_reserve_dynptr Helper: https://docs.ebpf.io/linux/helper-function/bpf_ringbuf_reserve_dynptr/
bpf_dynptr_from_skb Kfunc: https://docs.ebpf.io/linux/kfuncs/bpf_dynptr_from_skb/
bpf_dynptr_slice Kfunc: https://docs.ebpf.io/linux/kfuncs/bpf_dynptr_slice/
Kernel Kfuncs Documentation: https://docs.kernel.org/bpf/kfuncs.html
Tutorial Repository: https://github.com/eunomia-bpf/bpf-developer-tutorial

This example requires Linux kernel 6.4 or newer for the skb dynptr kfuncs. The ringbuf dynptr helpers are available from Linux 5.19. Complete source code is available in the tutorial repository.

A Taxonomy of GPU Bugs: 19 Defect Classes for CUDA Verification

云微 — Tue, 10 Feb 2026 07:53:16 +0000

Introduction

GPU programming introduces a distinct class of correctness and performance challenges that differ fundamentally from traditional CPU-based systems. The SIMT (Single Instruction, Multiple Threads) execution model, hierarchical memory architecture, and massive parallelism create unique bug patterns that require specialized verification and detection techniques.

Just as eBPF enables safe, verified extension code to run inside the Linux kernel, bpftime gpu_ext (see the arXiv paper; previously named eGPU) brings eBPF to GPUs, allowing user-defined policy code (for observability, scheduling, or resource control) to be injected into GPU drivers and kernels with static verification guarantees. Such a GPU extension framework must ensure that policy code cannot introduce crashes, hangs, data races, or unbounded overhead. A critical concern in modern GPU deployments is performance interference in multi-tenant environments: contention for shared resources makes execution time unpredictable. "Making Powerful Enemies on NVIDIA GPUs" studies how adversarial kernels can amplify slowdowns, arguing that performance interference is a system-level safety property when GPUs are shared. This motivates treating bounded overhead as a correctness property, not merely an optimization goal.

To build a sound GPU extension verifier, we must first understand what can go wrong. This taxonomy identifies the defect classes a verifier must address, drawing lessons from eBPF's success: restrict the programming model, enforce bounded execution, and verify memory safety before loading. We synthesize findings from static verifiers (GPUVerify, GKLEE, ESBMC-GPU), dynamic detectors (Compute Sanitizer, Simulee, CuSan), and empirical bug studies (Wu et al., ScoRD, iGUARD) into 19 defect classes organized primarily along two dimensions, impact type (Safety, Correctness, Performance) and GPU specificity (GPU-specific, GPU-amplified, CPU-shared), and further annotated with verification scope and assurance type. Each entry provides concrete examples, documents detection tools, and offers actionable verification strategies.

Taxonomy Overview

Each bug class is categorized along four dimensions:

Impact Type:

Safety: Program fails to complete safely (crash, hang, isolation failure, deadlock)
Correctness: Program completes but produces wrong results
Performance: Program works correctly but inefficiently

GPU Specificity:

GPU-specific: Unique to GPU/SIMT execution model
GPU-amplified: Exists on CPUs but much more severe on GPUs
CPU-shared: Similar on both platforms

Verification Scope (for GPU extension frameworks):

E (Extension-local): Can be verified by examining only the extension/policy code, without inspecting the host kernel. This is the ideal case: like eBPF, the verifier can provide strong safety guarantees for any kernel the extension attaches to.
C (Combined): Requires joint analysis of extension + kernel, or a contract between them. These bugs arise from interactions between policy code and kernel state/behavior.
H (Host+Device/System): Involves host-side API ordering, driver state, or cross-boundary interactions that cannot be verified by device-side analysis alone.

Assurance Type (Soundness/Completeness guarantees):

By-construction: Bug class is structurally impossible due to language/feature restrictions. Soundness: perfect (the bug cannot exist). Completeness: high for policy use cases (restrictions rarely limit legitimate policies).
Static-sound: If verifier accepts, property holds; but some safe programs rejected. Soundness: strong. Completeness: low (conservative).
Contract-based: Requires declared preconditions validated at attach/launch time. Soundness: conditional on contract correctness. Completeness: depends on contract expressiveness.
Bounded-sound: Sound within specified bounds (loop unrolling, context switches). Soundness: within bounds. Completeness: limited by bound coverage.
Dynamic-only: Detected at runtime; no static guarantee. Soundness: for executed paths only. Completeness: coverage-dependent.
Runtime-enforced: Property enforced via instrumentation/interception. Soundness: if enforcement is complete. Completeness: N/A (enforcement, not verification).

Why These Dimensions Matter for GPU Extension Verifiers

A GPU extension framework (like bpftime gpu_ext) aims to provide static verification guarantees analogous to eBPF: policy code should be safe to attach to any kernel without risking crashes, hangs, or unbounded overhead. The key insight is:

Extension-local verification is the only path to strong, universal guarantees. If a bug class can be eliminated by restricting the policy language or enforcing invariants on policy code alone, the verifier can guarantee safety without inspecting (potentially closed-source) kernels.

For Combined bugs, the framework has two options: (1) restrict policy capabilities so the bug becomes Extension-local (e.g., forbid policies from writing kernel memory), or (2) require kernel-side contracts/annotations and validate at attach time.

For Host+Device bugs, device-side verification is insufficient; these require host-side tooling (CuSan, TSan) or runtime enforcement in the driver/loader.

Understanding Soundness vs. Completeness

The Assurance Type dimension makes explicit what guarantees each verification approach provides:

Soundness answers: "If the verifier accepts, does the property definitely hold?" A sound verifier never produces false negatives; that is, it never accepts a program that contains a real bug.
Completeness answers: "If the property holds, will the verifier accept?" A complete verifier never produces false positives; that is, it never rejects a safe program.
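The two definitions can be checked mechanically against a set of labeled programs. The sketch below is a toy illustration in C; struct verdict and both function names are assumptions introduced here, and the ground-truth is_safe label is something a real verifier would of course not have access to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One program's ground truth paired with the verifier's decision. */
struct verdict {
    bool is_safe;   /* ground truth: does the property actually hold? */
    bool accepted;  /* verifier decision */
};

/* Sound on this set: no unsafe program was accepted (no false negatives). */
static bool verifier_is_sound(const struct verdict *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (v[i].accepted && !v[i].is_safe)
            return false;
    return true;
}

/* Complete on this set: no safe program was rejected (no false positives). */
static bool verifier_is_complete(const struct verdict *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (v[i].is_safe && !v[i].accepted)
            return false;
    return true;
}
```

A conservative verifier that rejects one safe program is still sound but no longer complete, which is the trade-off this taxonomy accepts for safety-critical extensions.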

For safety-critical GPU extensions, we prioritize soundness over completeness: it's acceptable to reject some safe policies if it means we never accept unsafe ones. The table below shows not just what can be verified, but how strong the guarantee is.

#	Bug Class	Impact	GPU Spec.	Scope	Assurance Type
1	Barrier Divergence	Safety	GPU-specific	E	Static-sound (enforce uniform barrier placement)
2	Invalid Warp Sync	Safety	GPU-specific	E	By-construction (ban warp sync)
3	Insufficient Atomic/Sync Scope	Correctness	GPU-specific	C→E	Static-sound (isolate state + device-scope)
4	Warp-divergence Race	Correctness	GPU-specific	E	Static-sound (uniform side-effects)
5	Uncoalesced Memory Access	Performance	GPU-specific	E/C	Static-sound (restrict patterns)
6	Control-Flow Divergence	Performance	GPU-specific	E	Static-sound (enforce uniformity)
7	Bank Conflicts	Performance	GPU-specific	E	Static-heuristic (enforce conflict-free patterns)
8	Block-Size Dependence	Correctness	GPU-specific	E/C	Contract-based (declare requirements)
9	Launch Config Assumptions	Correctness	GPU-specific	C	Contract-based (validate at attach)
10	Missing Volatile/Fence	Correctness	GPU-specific	E	By-construction (ban spin-wait)
11	Shared-Memory Data Races	Correctness	GPU-specific	E	Static-sound (restrict writes)
12	Redundant Barriers	Performance	GPU-specific	E	Static-heuristic (detect unnecessary barriers)
13	Host ↔ Device Async Races	Correctness	GPU-specific	H	Dynamic-only (CuSan/TSan)
14	Atomic Contention	Performance	GPU-amplified	C→E	Static-sound (budgetize atomics)
15	Non-Barrier Deadlocks	Safety	GPU-amplified	E	By-construction (ban blocking)
16	Kernel Non-Termination	Safety	GPU-amplified	E	Static-sound (bound iterations)
17	Global-Memory Data Races	Correctness	CPU-shared	C→E	Static-sound (isolate state)
18	Memory Safety	Safety	CPU-shared	E	Static-sound (restrict pointers)
19	Arithmetic Errors	Correctness	CPU-shared	E	Static-sound (range analysis)

Insights from a Taxonomy of GPU Defects

We conducted a comprehensive study of GPU correctness defects by synthesizing findings from empirical bug analyses (Wu et al., iGUARD), static verifiers (GPUVerify, GKLEE, ESBMC-GPU), and runtime detectors (Compute Sanitizer, Simulee, ScoRD). Our taxonomy identifies 19 distinct classes of GPU programming defects, uncovering fundamental insights into the unique correctness challenges posed by GPU architectures:

First, we observe that control-flow uniformity is a foundational correctness requirement for GPU kernels. Non-uniform execution across threads under the GPU's SIMT execution model breaks implicit synchronization assumptions and triggers GPU-specific correctness violations, such as barrier divergence, warp synchronization errors, and subtle warp-divergence races. This insight elevates uniformity from a performance concern to a correctness property that GPU verification frameworks must explicitly enforce.

Second, the GPU's scoped memory synchronization semantics create unique correctness hazards (e.g., insufficiently scoped atomics, missing fences, volatile misuse) that are rarely encountered on CPU platforms. Our analysis emphasizes that the scopes of synchronization primitives must be explicit, conservative, and verifiable at the kernel level. This requirement is critical for correctness given GPU memory model subtleties.

Third, performance interference in GPUs, manifested as uncoalesced accesses, atomic contention, redundant barriers, and bank conflicts, must be viewed as a safety and isolation concern rather than mere inefficiency. Our taxonomy reveals how adversarial workloads exploit GPU parallelism to amplify performance issues into denial-of-service attacks in multi-tenant environments. Consequently, bounded overhead must be explicitly enforced as a correctness property in GPU extension frameworks.

Finally, our study highlights that liveness (deadlocks, infinite loops) and memory safety (out-of-bounds accesses, temporal violations) are system-level concerns uniquely amplified by GPU parallelism. Unlike traditional CPU environments, GPU kernel hangs or memory violations can trigger hardware-level recovery affecting all tenants. Thus, GPU liveness and memory safety must be explicitly recognized as first-class system-level correctness properties in verifier designs.

Together, these insights not only characterize GPU correctness issues more precisely but also inform principled design requirements for GPU kernel extensibility and verification frameworks, moving beyond traditional CPU-centric correctness towards a GPU-aware system correctness definition. We are applying these principles in bpftime; more detail is available on arXiv.

Insights from Verification Scope and Assurance Analysis

Beyond characterizing what can go wrong, we analyze whether and how each bug class can be addressed by a GPU extension verifier. By examining each defect through the lens of verification scope (Extension-local vs. Combined vs. Host+Device) and assurance type (soundness and completeness guarantees), we arrive at several key conclusions for GPU extension framework design.

Extension-local verification is sufficient for the majority of GPU bug classes. Of the 19 defect classes identified, 12 can be fully addressed through Extension-local verification, examining only the policy code without inspecting the host kernel. Some of these (#2, #10, #15) can be eliminated by construction through language restrictions: banning warp sync primitives, spin-wait patterns, and blocking constructs makes entire bug classes structurally impossible. Others (#1, #7, #12) use static analysis to enforce safe usage patterns (uniform barrier placement, conflict-free shared-memory access, redundant barrier detection) rather than outright bans, preserving useful functionality while maintaining safety. Four additional classes (#3, #5, #14, #17) that initially appear to require Combined analysis can be reduced to Extension-local through state isolation, restricting policies to write only policy-owned objects (maps, ringbuffers) rather than kernel data structures. Together with the three classes discussed next, this accounts for all 19. This finding validates the eBPF design philosophy: by appropriately restricting extension capabilities, a verifier can provide strong safety guarantees for any kernel, including closed-source ones.

Only three bug classes fundamentally resist Extension-local verification. Block-size dependence (#8) and launch configuration assumptions (#9) depend on host-determined launch parameters invisible to the policy verifier; these require a contract-based approach where policies declare preconditions validated at attach time. Host↔device async races (#13) span the host API boundary entirely outside device-side verification scope; these can only be addressed through dynamic detection tools like CuSan. Importantly, these three classes represent a small, well-defined subset that can be handled through complementary mechanisms rather than requiring full Combined verification of kernel+extension.

Soundness and completeness trade-offs are explicit and favorable for safety-critical extensions. By-construction approaches (banning genuinely dangerous features like spin-wait and blocking primitives) achieve perfect soundness with high completeness for policy use cases. Static-sound approaches (uniform barrier placement, conflict-free access pattern enforcement, uniformity analysis, bounds checking, range analysis) provide strong soundness while preserving useful functionality, at the cost of conservatively rejecting some safe programs. For safety-critical GPU extensions, this trade-off is appropriate: it is better to reject a safe policy than to accept an unsafe one. The verifier's job is to guarantee safety for any kernel, not to accept every possible safe program.

A two-track verification pipeline emerges as the principled design. The production track provides hard guarantees for any kernel through Extension-local verification at load time, contract validation at attach time, and optional runtime enforcement for multi-tenant isolation. The CI/offline track enhances coverage through Combined analysis tools (GPUVerify, ESBMC-GPU) when kernel source is available, dynamic sanitizers (Compute Sanitizer, iGUARD, Simulee) for regression testing, and host-side race detection (CuSan) for API ordering bugs. This separation acknowledges that Combined verification, while valuable for development and testing, cannot be a production requirement for systems targeting arbitrary kernels.

Performance interference can be bounded but not eliminated. While adversarial workloads can systematically amplify interference through shared GPU resources (as demonstrated by "Making Powerful Enemies on NVIDIA GPUs"), the verifier can still provide meaningful guarantees: bounding policy overhead per invocation through instruction/helper budgets, limiting atomic contention through warp-aggregation requirements, and enforcing coalesced access patterns. These guarantees bound the policy's contribution to interference, even if system-wide slowdown bounds remain impossible to guarantee statically.
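One way to picture the per-invocation helper budget is as a worst-case count over a structured policy body. The following is a minimal C++ sketch, not the bpftime implementation: the `Node` type, its fields, and the functions `maxHelperCalls` and `withinHelperBudget` are hypothetical names, and the sketch assumes every loop carries an iteration bound that the verifier has already proven.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical structured IR for a policy body (illustration only).
struct Node {
    enum class Kind { HelperCall, Seq, Branch, Loop };
    Kind kind;
    std::vector<std::shared_ptr<Node>> children; // Seq: statements; Branch: arms; Loop: body
    std::uint64_t loopBound;                     // Loop only: proven maximum trip count
    Node(Kind k, std::vector<std::shared_ptr<Node>> kids = {}, std::uint64_t bound = 0)
        : kind(k), children(std::move(kids)), loopBound(bound) {}
};

// Worst-case number of helper calls on any path through `n`.
std::uint64_t maxHelperCalls(const Node& n) {
    std::uint64_t total = 0;
    switch (n.kind) {
    case Node::Kind::HelperCall:
        return 1;
    case Node::Kind::Seq:    // sequence: costs add up
        for (const auto& c : n.children) total += maxHelperCalls(*c);
        return total;
    case Node::Kind::Branch: // branch: take the most expensive arm
        for (const auto& c : n.children) total = std::max(total, maxHelperCalls(*c));
        return total;
    case Node::Kind::Loop:   // loop: body cost times the proven bound
        for (const auto& c : n.children) total += maxHelperCalls(*c);
        return total * n.loopBound;
    }
    return total;
}

// Load-time rule: accept the policy only if its worst case fits the budget.
bool withinHelperBudget(const Node& policy, std::uint64_t budget) {
    return maxHelperCalls(policy) <= budget;
}
```

A policy made of one helper call, a branch whose arms cost one and two calls, and a loop of four iterations with one call each has a worst case of 1 + 2 + 4 = 7 calls, so it passes a budget of 8 and fails a budget of 6.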

In summary, the verification scope analysis reveals that the eBPF success pattern (restricting extension capabilities to what can be verified without inspecting the host) transfers effectively to GPUs. Through language restrictions, state isolation, and budgetization, a GPU extension verifier can provide strong, universal safety guarantees while relegating the few irreducibly Combined or Host+Device properties to contracts and dynamic detection.

Canonical bug list

1) Barrier Divergence at Block Barriers (`__syncthreads`) [Safety, GPU-specific]

What it is / why it matters

A block-wide barrier requires all threads in the block to reach it. If the barrier is placed under a condition that evaluates differently across threads, some threads wait forever → deadlock / kernel hang. This is treated as a first-class defect in GPU kernel verification (e.g., "barrier divergence" in GPUVerify), and is also one of the main CUDA synchronization bug types characterized/targeted by AuCS/Wu. Note that general control-flow divergence is a performance issue, but barrier divergence is the specific, critical case where divergent control flow causes threads to reach a barrier non-uniformly, turning a performance issue into a liveness/correctness failure (deadlock).

Bug example

__global__ void k(float* a) {
  if (threadIdx.x < 16) __syncthreads(); // divergent barrier => UB / deadlock
  a[threadIdx.x] = 1.0f;
}

Seen in / checked by

GPUVerify: checking divergence is a core goal ("divergence freedom").(Nathan Chong)
Simulee detects barrier divergence bugs in real-world code.(zhangyuqun.github.io)
Wu et al.: explicitly defines barrier divergence and places it under improper synchronization.(arXiv)
Tools like Compute Sanitizer synccheck report "divergent thread(s) in block"; Oclgrind can also detect barrier divergence (OpenCL).

Checking approach

Static check (GPUVerify-style): prove that each barrier is reached by all threads in the relevant scope, often via uniformity reasoning.(Nathan Chong)
Dynamic check: synccheck-style runtime validation, and Simulee-style bug finding.(zhangyuqun.github.io)

Verification strategy

Require warp-/block-uniform control flow for any path reaching a barrier (GPUVerify-style uniform predicate analysis): the verifier statically proves that every __syncthreads() is reached by all threads in the block, otherwise reject. This allows policies to use barriers for legitimate shared-memory coordination while preventing divergent barriers that cause deadlocks.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Enforcing uniform barrier placement via static analysis prevents barrier divergence with strong soundness. Policies can use __syncthreads() when the verifier can prove all threads in the block reach the barrier uniformly.

Production guarantee: The verifier statically analyzes control flow to ensure every __syncthreads() call is reached by all threads in the block. Barriers under divergent conditions (e.g., if (threadIdx.x < 16) __syncthreads()) are rejected. This allows safe barrier usage for shared-memory coordination while preventing GPU hangs.

Offline/CI tools: For kernel-level analysis, GPUVerify proves divergence freedom via static verification; Compute Sanitizer synccheck detects divergent barriers at runtime; Simulee finds barrier divergence bugs through evolutionary simulation.

Residual gap: Some safe barrier placements under complex but provably uniform conditions may be conservatively rejected. The verifier guarantees policy cannot introduce barrier divergence, but cannot guarantee the kernel itself is free of this bug; kernel-level bugs require kernel-level tools.
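The uniform-barrier rule can be sketched as a walk over a structured statement tree. This is a simplified C++ illustration under stated assumptions, not GPUVerify's algorithm: the `Stmt` type and `barriersAreUniform` are hypothetical, each `If` is assumed to carry a flag saying whether its condition was already proven block-uniform, and loops and early returns (which can also cause divergence) are not modeled.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical structured statement tree for a policy (illustration only).
struct Stmt {
    enum class Kind { Barrier, If, Other };
    Kind kind;
    bool condIsUniform;                      // If only: condition proven block-uniform
    std::vector<std::shared_ptr<Stmt>> body; // If only: statements inside the branch
    explicit Stmt(Kind k, bool uniform = true, std::vector<std::shared_ptr<Stmt>> b = {})
        : kind(k), condIsUniform(uniform), body(std::move(b)) {}
};

// Returns true when every barrier is reached only under block-uniform control flow.
bool barriersAreUniform(const std::vector<std::shared_ptr<Stmt>>& stmts,
                        bool uniformSoFar = true) {
    for (const auto& s : stmts) {
        // A barrier nested under any non-uniform condition is rejected.
        if (s->kind == Stmt::Kind::Barrier && !uniformSoFar) return false;
        if (s->kind == Stmt::Kind::If &&
            !barriersAreUniform(s->body, uniformSoFar && s->condIsUniform)) return false;
    }
    return true;
}
```

Under this rule the pattern `if (threadIdx.x < 16) __syncthreads();` maps to a `Barrier` inside an `If` with `condIsUniform == false` and is rejected, while a barrier under a condition on a kernel parameter would be accepted.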

2) Invalid Warp Synchronization (`__syncwarp` mask, warp-level barriers) [Safety, GPU-specific]

What it is / why it matters

Warp-level sync requires correct participation masks. A common failure is calling __syncwarp(mask) where not all lanes that reach the barrier are included in mask, or where divergence causes only a subset to arrive.

Bug example

__global__ void k(int* out) {
  int lane = threadIdx.x & 31;
  if (lane < 16) {
    __syncwarp(0xffffffff);  // only 16 lanes arrive, but mask expects all 32
  }
  out[threadIdx.x] = lane;
}

Seen in / checked by

Compute Sanitizer synccheck explicitly reports "Invalid arguments" and "Divergent thread(s) in warp" classes for these hazards.(NERSC Documentation)
iGUARD discusses how newer CUDA features (e.g., independent thread scheduling + cooperative groups) create new race/sync hazards beyond the classic model.(Aditya K Kamath)

Checking approach

Runtime validation via synccheck.
Static analysis to verify mask correctness at each __syncwarp callsite.

Verification strategy

If policies can ever emit warp-level sync or cooperative-groups barriers, require a verifiable mask discipline: e.g., only __syncwarp(0xffffffff) (full mask) or masks proven to equal the active mask at the callsite. Otherwise, the simplest option is to ban warp sync primitives entirely inside policies.

Verification scope analysis

Scope & Assurance: Extension-local (E), By-construction. Banning __syncwarp/CG barriers entirely (or requiring only full-mask sync at provably uniform points) makes invalid warp sync structurally impossible, providing perfect soundness with high completeness for policy use cases where warp-level sync is rarely needed.

Production guarantee: Policy code cannot introduce invalid warp synchronization because the verifier bans warp-level sync primitives. If allowed, only full-mask __syncwarp(0xffffffff) at provably uniform points is permitted.

Offline/CI tools: Compute Sanitizer synccheck reports invalid sync arguments and divergent warps at runtime; iGUARD provides NVBit-based instrumentation for detecting sync hazards from modern CUDA features.

Residual gap: iGUARD notes that ITS (Independent Thread Scheduling) and CG (cooperative groups) introduce new hazards because even experienced developers misuse these features. This justifies conservative restrictions; banning these primitives in policy code is the only sound approach without complex ITS-aware analysis.
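The two options described for this class (ban warp sync outright, or allow only full-mask sync at provably uniform points) reduce to a very small acceptance rule. The C++ sketch below is illustrative only: `WarpSyncSite`, `WarpSyncPolicy`, and `acceptWarpSync` are hypothetical names, and the `atUniformPoint` flag is assumed to come from a separate uniformity analysis.

```cpp
#include <cstdint>

// One __syncwarp callsite as seen by a hypothetical verifier.
struct WarpSyncSite {
    std::uint32_t mask;  // constant mask argument at the callsite
    bool atUniformPoint; // proven warp-uniform by an earlier analysis (assumed input)
};

enum class WarpSyncPolicy { BanAll, FullMaskAtUniformPoints };

// Returns true when the callsite is allowed under the chosen policy.
bool acceptWarpSync(const WarpSyncSite& site, WarpSyncPolicy policy) {
    if (policy == WarpSyncPolicy::BanAll) return false; // by-construction option
    return site.mask == 0xffffffffu && site.atUniformPoint;
}
```

The bug example above, a full mask inside `if (lane < 16)`, is rejected under both policies because the callsite is not warp-uniform.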

3) Insufficient Atomic/Sync Scope [Correctness, GPU-specific]

What it is / why it matters

GPUs add scope and memory-model subtleties that don't exist on CPUs. Scoped races occur when synchronization/atomics are done at an insufficient scope (e.g., using atomicAdd_block when atomicAdd with device scope is needed). This is a distinct GPU bug class because scoped synchronization is a feature of GPU memory models such as CUDA's and has no counterpart in typical CPU programming.

Bug example

// Scoped race: using block-scope atomic when device-scope is needed
__global__ void k(int* counter) {
  atomicAdd_block(counter, 1);  // only block-scope, may race across blocks
}

Seen in / checked by

ScoRD introduces scoped races due to insufficient scope and argues this is a distinct bug class.(CSA - IISc Bangalore)
iGUARD further targets races introduced by "scoped synchronization" and advanced CUDA features (independent thread scheduling, cooperative groups).(Aditya K Kamath)

Checking approach

Scope verification: ensure atomics/sync use sufficient scope for the access pattern.
Require explicit scope annotations and validate against access patterns.

Verification strategy

Treat scope as part of the verifier contract: if policies do atomic/synchronizing operations, require the strongest allowed scope (or forbid nontrivial scope usage). Practically: ban cross-block shared global updates unless they're done through a small set of "safe" helpers (e.g., per-SM/per-warp buffers → host aggregation). If policies use scoped atomics, require the scope to be explicit and conservative.

Verification scope analysis

Scope & Assurance: Combined → Extension-local (C→E) via state isolation, Static-sound. If policies can touch kernel-shared global objects, scope correctness depends on kernel access patterns (Combined). However, this reduces to Extension-local by restricting policies to write only policy-owned state or requiring all atomics to use device-scope by default, providing strong soundness with medium completeness (policies needing block-scope atomics must use conservative device-scope).

Production guarantee: Two design choices enable Extension-local verification: (A) Policy only writes policy-owned state (maps, ringbuffers), never kernel globals: scope becomes irrelevant; (B) All policy atomics use device-scope by default: sufficient for any access pattern. Both approaches eliminate scope bugs without kernel inspection.

Offline/CI tools: ScoRD introduces "scoped races" as a distinct bug class and provides detection (research prototype requiring hardware support); iGUARD targets races from scoped synchronization and advanced CUDA features via NVBit GPU-side runtime instrumentation.

Residual gap: If policies must write kernel-shared objects with fine-grained scope optimization, Combined analysis or contracts are required. ScoRD and iGUARD emphasize that scope bugs are subtle and underdetected, so defaulting to device-scope is a sound engineering choice.
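The "sufficient scope" check can be sketched as a comparison on an ordered scope type. This C++ fragment is a simplified illustration limited to the two scopes discussed here: `Scope`, `requiredScope`, and `scopeIsSufficient` are hypothetical names, and whether a location is shared across blocks is assumed to be known to the framework (trivially so for policy-owned state).

```cpp
// Scopes ordered from weakest to strongest (only the two discussed in this entry).
enum class Scope { Block = 0, Device = 1 };

// Weakest scope that is still correct for the given sharing pattern.
Scope requiredScope(bool sharedAcrossBlocks) {
    return sharedAcrossBlocks ? Scope::Device : Scope::Block;
}

// An atomic is accepted when its scope is at least the required one.
bool scopeIsSufficient(Scope used, bool sharedAcrossBlocks) {
    return static_cast<int>(used) >= static_cast<int>(requiredScope(sharedAcrossBlocks));
}
```

In these terms the bug example corresponds to `scopeIsSufficient(Scope::Block, true)`, which is false, while defaulting every policy atomic to `Scope::Device` passes for both sharing patterns.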

4) Warp-divergence Race [Correctness, GPU-specific]

What it is / why it matters

A warp-divergence race is a GPU-specific phenomenon where divergence changes which threads are effectively concurrent, producing racy outcomes that don't map cleanly to CPU assumptions. SIMT execution order + reconvergence can create subtle concurrency patterns. This is one reason "CPU-style race reasoning" doesn't port directly to GPUs. While control-flow divergence is generally a performance issue (serialized execution paths), warp-divergence race is a correctness issue where divergence creates unexpected concurrency patterns leading to data races. The root cause is the same, but the failure modes differ: performance degradation vs. racy/undefined behavior.

Bug example

__global__ void k(int* A) {
  int lane = threadIdx.x & 31;
  if (lane < 16) A[0] = 1;      // first half writes
  else           A[0] = 2;      // second half writes
  // outcome depends on SIMT execution + reconvergence
}

Seen in / checked by

GKLEE explicitly lists "warp-divergence race" among discovered bug classes.(Lingming Zhang)
Simulee stresses CUDA-aware race definitions and discusses GPU-specific race interpretation constraints (e.g., avoiding false positives due to warp lockstep).(zhangyuqun.github.io)

Checking approach

Verifier rule: treat "lane-divergent side effects" as forbidden unless proven safe.
Require that any helper with side effects is guarded by a warp-uniform predicate or executed only by a designated lane (e.g., lane0). Then the verifier only needs to prove uniformity (or single-lane execution), not full SIMT interleavings.

Verification strategy

Enforce warp-uniform control flow for policy side effects. If divergence is unavoidable, force "single-lane execution" patterns where only lane0 performs the side effect. This eliminates warp-divergence races by construction.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Warp-divergence races arise from SIMT execution semantics, but can be prevented by structural restrictions on policy code, providing strong soundness with medium completeness (legitimately safe lane-divergent writes are rejected).

Production guarantee: The verifier enforces that all side-effecting operations are either (1) under warp-uniform predicates, or (2) executed only by lane0 (single-lane execution pattern). This eliminates warp-divergence races without analyzing the kernel. The verifier proves uniformity or single-lane execution statically.

Offline/CI tools: GKLEE explicitly lists "warp-divergence race" among discovered bug classes and explores divergent execution paths via concolic/symbolic testing; Simulee uses CUDA-aware race definitions that account for warp lockstep behavior to avoid false positives.

Residual gap: Policies with legitimately safe lane-divergent writes will be rejected. This trade-off is favorable: warp-divergence races are notoriously subtle (GKLEE found them in real SDK code), so eliminating them by construction is safer than complex SIMT interleaving analysis.

5) Uncoalesced / Non-Coalesceable Global Memory Access Patterns [Performance, GPU-specific]

What it is / why it matters

Warp memory coalescing is a GPU-specific performance contract. "Uncoalesced" accesses can cause large slowdowns because a single warp-wide access is split into many memory transactions.

Bug example

__global__ void k(float* a, int stride) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float x = a[tid * stride];   // stride>1 => likely uncoalesced
  a[tid * stride] = x + 1.0f;
}

Seen in / checked by

GPUDrano: "detects uncoalesced global memory accesses" and treats them as performance bugs.(GitHub, CAV17)
GKLEE: reports "non-coalesced memory accesses" as performance bugs it finds.(Lingming Zhang)
GPUCheck: detects "non-coalesceable memory accesses."(WebDocs)

Checking approach

Static analysis (GPUDrano/GPUCheck-style): analyze address expressions in terms of lane-to-address stride; flag when stride exceeds coalescing thresholds.(CAV17)

Verification strategy

If you want "performance as correctness," this is a flagship rule: restrict policy memory ops to patterns provably coalesced (e.g., affine, lane-linear indexing with small stride), and/or require warp-level aggregation so only one lane performs global updates. Require map operations to use warp-uniform keys or contiguous per-lane indices (e.g., base + lane_id), not random hashes. If policies must do random accesses, restrict them to lane0 only, amortizing the uncoalesced behavior to 1 lane/warp.

Verification scope analysis

Scope & Assurance: Extension-local (E) for policy-owned memory; Combined (C) for kernel arrays. Static-sound for policy memory: affine/lane-linear indexing guarantees coalescing with strong soundness but low completeness (random-access patterns rejected; kernel-array reads require Combined analysis).

Production guarantee: For policy-owned memory (maps, ringbuffers), restricting index expressions to affine/lane-linear forms (base + lane_id) or lane0-only access provides bounded overhead guarantees. Warp-level aggregation (only lane0 performs global updates) amortizes uncoalesced behavior to 1 lane/warp. The verifier cannot guarantee coalescing for kernel-array reads without kernel knowledge.

Offline/CI tools: GPUDrano statically detects uncoalesced global memory accesses and treats them as performance bugs; GPUCheck identifies non-coalesceable access patterns via thread-divergent expression analysis; GKLEE reports "non-coalesced memory accesses" as performance bugs via symbolic exploration.

Residual gap: True coalescing depends on hardware cache behavior and concurrent workloads, so how slow an access pattern really is remains architecture-dependent. Static analysis provides structural guarantees and warnings, not tight performance bounds.
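The effect of stride on a lane-linear access `a[base + lane * stride]` can be made concrete by counting the memory segments one warp touches. The C++ sketch below is a back-of-the-envelope model, not a description of any specific GPU: it assumes 32 lanes per warp, an aligned base address, and a 128-byte segment size (the real transaction granularity varies by architecture). The function names are hypothetical.

```cpp
#include <cstddef>
#include <set>

// Number of distinct segments one warp touches for a[base + lane * strideElems].
// Assumptions: 32 lanes, aligned base, 128-byte segments by default.
std::size_t segmentsTouched(std::size_t elemBytes, std::size_t strideElems,
                            std::size_t segmentBytes = 128) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < 32; ++lane)
        segments.insert(lane * strideElems * elemBytes / segmentBytes);
    return segments.size();
}

// Hypothetical verifier rule: accept a pattern only if it touches no more
// segments than a dense, contiguous warp access of the same element size.
bool isCoalescedPattern(std::size_t elemBytes, std::size_t strideElems) {
    const std::size_t denseSegments = (32 * elemBytes + 127) / 128;
    return segmentsTouched(elemBytes, strideElems) <= denseSegments;
}
```

For 4-byte floats, stride 1 touches one segment, stride 2 touches two, and stride 32 touches 32, which is the "split into many transactions" case from the bug example.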

6) Control-Flow Divergence (warp branch divergence) [Performance, GPU-specific]

What it is / why it matters

SIMT divergence serializes paths within a warp, lowering "branch efficiency" and increasing worst-case overhead. This entry focuses on divergence as a performance issue. However, divergence is also the root cause of more severe correctness bugs: barrier divergence (deadlock when barriers are in conditional code) and warp-divergence races (unexpected concurrency patterns leading to data races).

Bug example

__global__ void k(float* out, float* in) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if ((tid & 1) == 0) out[tid] = in[tid] * 2;
  else                out[tid] = in[tid] * 3;  // divergence within warp
}

Seen in / checked by

GPUCheck explicitly targets "branch divergence" as a performance problem arising from thread-divergent expressions.(WebDocs)
GKLEE: "divergent warps" as performance bugs.(Lingming Zhang)
Wu et al.: "non-optimal implementation" includes performance loss causes like branch divergence.(arXiv)

Checking approach

Static taint + symbolic reasoning (GPUCheck-style): identify conditions dependent on thread/lane id, and prove whether divergence is possible.(WebDocs)

Verification strategy

Divergence is the core reason you can treat performance as correctness. Enforce warp-uniform control flow for policies (or at least for any code path that triggers side effects / heavy helpers). If you can't prove uniformity, force "single-lane execution" of policy side effects (others become no-ops) to prevent warp amplification. Put a hard cap on the number of helper calls on any path, to bound the "divergence amplification factor."

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Control-flow divergence is determined entirely by the policy's branch conditions and their dependence on thread IDs, providing strong soundness via taint analysis but low completeness (data-dependent branches that happen to be uniform at runtime are rejected).

Production guarantee: The verifier tracks which values depend on threadIdx/laneId (taint analysis). Branches on tainted values are either forbidden or force single-lane execution for side effects (others become no-ops). This bounds the "warp amplification factor" and prevents SIMT-amplified performance degradation.

Offline/CI tools: GPUCheck explicitly targets "branch divergence" as a performance problem via thread-divergent expression analysis; GKLEE reports "divergent warps" as performance bugs via symbolic exploration.

Residual gap: Some safe data-dependent branches will be rejected. The gpu_ext design principle lists warp-uniform control flow as a load-time verification requirement, treating divergence as a correctness property (bounded overhead) rather than just an optimization. For kernel-level divergence analysis, use GPUCheck or GKLEE.
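The taint analysis mentioned in the production guarantee can be sketched as a recursive walk over branch-condition expressions. This is a deliberately conservative C++ illustration, not GPUCheck's analysis: the `Expr` type and `mayDivergeWithinWarp` are hypothetical, and values loaded from memory are treated as possibly divergent, which matches the low completeness noted above.

```cpp
#include <memory>

// Hypothetical expression tree for a branch condition (illustration only).
struct Expr {
    enum class Kind { Const, Param, BlockIdx, ThreadIdx, LaneId, Load, BinOp };
    Kind kind;
    std::shared_ptr<Expr> lhs, rhs; // BinOp only
};

// Conservative taint: true if the value may differ between lanes of one warp.
bool mayDivergeWithinWarp(const Expr& e) {
    switch (e.kind) {
    case Expr::Kind::ThreadIdx:
    case Expr::Kind::LaneId:
    case Expr::Kind::Load: // data-dependent values are assumed divergent
        return true;
    case Expr::Kind::BinOp:
        return mayDivergeWithinWarp(*e.lhs) || mayDivergeWithinWarp(*e.rhs);
    default:
        return false; // constants, scalar kernel parameters, and blockIdx are per-warp uniform
    }
}
```

The condition `(tid & 1) == 0` from the bug example is tainted because `tid` is built from `threadIdx.x`; a branch on `blockIdx.x` combined with a scalar parameter is not.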

7) Shared-Memory Bank Conflicts [Performance, GPU-specific]

What it is / why it matters

Bank conflicts are a shared-memory–specific performance pathology: accesses serialize when multiple lanes hit the same bank.

Bug example

__global__ void k(int* out) {
  __shared__ int s[32*32];
  int lane = threadIdx.x & 31;
  // stride hits same bank pattern (illustrative)
  int x = s[lane * 32];
  out[threadIdx.x] = x;
}

Seen in / checked by

GKLEE explicitly lists "memory bank conflicts" among detected performance bugs.(Peng Li's Homepage)

Checking approach

Static heuristic: classify shared-memory index expressions by lane stride and bank mapping; warn if likely conflict.

Verification strategy

If policies use shared scratchpads (e.g., per-block staging), enforce a conflict-free access pattern (e.g., contiguous per-lane indexing such as base + threadIdx.x). A static heuristic can classify shared-memory index expressions by lane stride and bank mapping, rejecting or warning on patterns likely to cause conflicts. Shared memory should not be banned entirely for this performance issue—it remains useful for legitimate policy scratchpads.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-heuristic. Enforcing conflict-free access patterns on shared memory eliminates most bank conflicts while still allowing policies to use shared scratchpads for legitimate purposes.

Production guarantee: Policies using shared memory are restricted to conflict-free index patterns (base + threadIdx.x for contiguous access). The verifier statically checks shared-memory index expressions and rejects patterns with likely bank conflicts (e.g., stride-32 access). This preserves shared memory availability for per-block staging and aggregation.

Offline/CI tools: GKLEE explicitly lists "memory bank conflicts" among detected performance bugs via symbolic exploration.

Residual gap: Some safe but complex index patterns may be conservatively rejected. Kernel-level bank conflict analysis requires GPUDrano-style static tools or profiling. Policies needing non-trivial shared-memory access patterns may need to demonstrate conflict-freedom through annotations or simplified indexing.
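The static heuristic of classifying index expressions by lane stride and bank mapping can be illustrated with a small calculation. The C++ sketch below assumes 32 lanes, 32 banks, 4-byte words, and successive words mapped to successive banks; lanes that read the same word are assumed to be served by a broadcast rather than a conflict. `bankConflictDegree` is a hypothetical helper, not part of any tool named here.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>

// Worst-case conflict degree for one warp accessing s[baseWord + lane * strideWords].
// Assumptions: 32 lanes, 32 banks, 4-byte words, bank = word index mod 32.
std::size_t bankConflictDegree(std::size_t strideWords, std::size_t baseWord = 0) {
    std::map<std::size_t, std::set<std::size_t>> wordsPerBank;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        const std::size_t word = baseWord + lane * strideWords;
        wordsPerBank[word % 32].insert(word); // same word twice counts once (broadcast)
    }
    std::size_t worst = 0;
    for (const auto& entry : wordsPerBank)
        worst = std::max(worst, entry.second.size());
    return worst; // 1 means conflict-free
}
```

Under these assumptions the stride-32 access `s[lane * 32]` from the bug example has degree 32 (fully serialized), contiguous indexing has degree 1, and an odd stride such as 33 is also conflict-free. A verifier rule could reject any pattern whose degree exceeds 1.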

8) Block-Size Dependence [Correctness, GPU-specific]

What it is / why it matters

Block-size independence is essential for safe block-size tuning. Kernels that implicitly depend on specific blockDim values can produce incorrect results or races when launched with different configurations. This is critical for auto-tuning and portability across GPU generations. This entry focuses on compile-time hardcoded assumptions within the kernel code itself (e.g., fixed shared memory sizes, hardcoded reduction strides), distinct from runtime launch configuration assumptions about grid dimensions.

Bug example

__global__ void reduce(float* out, float* in) {
  __shared__ float s[256];
  int tid = threadIdx.x;
  s[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();
  // Hardcoded reduction assumes exactly 256 threads
  if (tid < 128) s[tid] += s[tid + 128];  // blockDim.x < 256: reads uninitialized s[] entries
  __syncthreads();                         // blockDim.x > 256: the s[tid] store above overflows s[]
  if (tid < 64) s[tid] += s[tid + 64];
  // ... continues with warp-level reduction ...
  if (tid == 0) out[blockIdx.x] = s[0];
}
// Launched with blockDim.x != 256 => wrong results or crash

Seen in / checked by

GPUDrano explicitly includes "block-size independence" analysis.(GitHub)

Checking approach

Static analysis (GPUDrano): analyze kernel code for implicit blockDim dependencies.
Require explicit declaration of block-size assumptions in kernel metadata.

Verification strategy

Policies should not implicitly assume block shapes unless the verifier can guarantee them. If a policy depends on block-level structure, require declaring it (metadata) and validate at attach time. Add verifier rules that forbid hard-coded assumptions about blockDim unless explicitly declared.

Verification scope analysis

Scope & Assurance: Extension-local (E) if the policy is block-agnostic; Combined (C) if it assumes a particular blockDim. Contract-based for blockDim-dependent policies: conditional soundness (sound if declared requirements match the actual launch config) with high completeness (policies can declare requirements; undeclared policies are assumed block-agnostic).

Production guarantee: Two approaches enable verification: (A) Block-agnostic design: policies use only lane-local or warp-level logic, avoiding blockDim dependencies entirely, making them safe for any launch config; (B) Contract-based: policies declare block-size requirements in metadata, and the runtime validates at attach time. The verifier rejects policies with hardcoded block-size constants unless explicitly declared.

Offline/CI tools: GPUDrano explicitly includes "block-size independence" analysis for detecting implicit blockDim dependencies in kernel code.

Residual gap: Policies with undeclared blockDim dependencies may fail silently with different launch configs. The contract approach shifts responsibility to policy authors to declare requirements correctly. Recommended design: make policy APIs block-agnostic (use relative indices, not absolute sizes).

9) Launch Config Assumptions [Correctness, GPU-specific]

What it is / why it matters

Many CUDA kernels assume certain launch configurations (e.g., single block, specific grid dimensions). Violating these assumptions leads to incorrect results or races that are hard to diagnose. This entry focuses on runtime launch configuration assumptions (gridDim, number of blocks), distinct from compile-time hardcoded block-size dependencies within the kernel code.

Bug example

__global__ void reduce(float* out, float* in, int n) {
  __shared__ float s[256];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  s[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }
  if (tid == 0) {
    *out = s[0];  // BUG: assumes gridDim.x == 1, writes final result directly
  }              // if gridDim.x > 1, multiple blocks race on *out
}
// Called with <<<N/256, 256>>> where N > 256 => data race, wrong result

Seen in / checked by

Wu et al.'s discussion of detected bugs includes developer responses that kernels "should not be called with more than one block" and suggests adding assertions like assert(gridDim.x == 1).(arXiv)

Checking approach

Contract checking: encode launch preconditions (gridDim, blockDim assumptions) and enforce them at runtime or statically.
Add runtime assertions for grid/block dimension assumptions.

Verification strategy

If policy code assumes a particular block/warp mapping (e.g., keys use threadIdx.x directly), correctness or performance can regress when kernels run under different launch configs. If a policy depends on warp- or block-level structure, require the policy to declare that dependency in metadata and validate it at attach time.

Verification scope analysis

Scope & Assurance: Combined (C): launch configuration is host-determined and not visible to the policy verifier. Contract-based assurance: conditional soundness (sound only if contracts are correctly specified and validated), with completeness depending on contract expressiveness.

Production guarantee: This bug class fundamentally requires contracts: Extension-local verification cannot see launch parameters. The policy declares preconditions (e.g., "requires gridDim.x == 1" or "requires blockDim.x >= 128"), and the runtime validates at attach/launch time. Policies without explicit requirements are assumed to work with any config.

Offline/CI tools: Wu et al.'s empirical study found real bugs where developers noted kernels "should not be called with more than one block": they suggest adding runtime assertions like assert(gridDim.x == 1). Convert such requirements into contract metadata for policy verification.

Residual gap: Contract-based verification shifts responsibility to policy authors to declare requirements correctly. This is one of the few bug classes where Combined verification is unavoidable, but contracts provide a clean interface without requiring complex joint analysis of kernel + policy.
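
The attach-time contract check described in this entry can be sketched on the host side. This is a minimal illustration in plain C++, not an existing API: `LaunchConfig`, `LaunchContract`, and `validate_launch` are hypothetical names, and only the two precondition forms mentioned above (an exact gridDim.x and a minimum blockDim.x) are modeled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical launch configuration as seen by the runtime at attach/launch time.
struct LaunchConfig {
  uint32_t grid_x;
  uint32_t block_x;
};

// Hypothetical contract metadata declared by a policy. A value of 0 means
// "no requirement", so a zero-filled contract accepts any launch config.
struct LaunchContract {
  uint32_t exact_grid_x;  // e.g. 1 encodes "requires gridDim.x == 1"
  uint32_t min_block_x;   // e.g. 128 encodes "requires blockDim.x >= 128"
};

// Returns an empty string if the launch satisfies the contract, otherwise a
// human-readable reason why the attach should be rejected.
std::string validate_launch(const LaunchContract& c, const LaunchConfig& cfg) {
  if (c.exact_grid_x != 0 && cfg.grid_x != c.exact_grid_x) {
    return "gridDim.x == " + std::to_string(c.exact_grid_x) +
           " required, got " + std::to_string(cfg.grid_x);
  }
  if (c.min_block_x != 0 && cfg.block_x < c.min_block_x) {
    return "blockDim.x >= " + std::to_string(c.min_block_x) +
           " required, got " + std::to_string(cfg.block_x);
  }
  return "";
}
```

A policy that declares `exact_grid_x = 1` would be rejected for a `<<<4, 256>>>` launch, while a policy with no declared requirements attaches to any config, matching the "assumed to work with any config" default.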

10) Missing Volatile/Fence [Correctness, GPU-specific]

What it is / why it matters

GPU code often relies on compiler and memory-model subtleties. GKLEE reports a real-world category: forgetting to mark a shared memory variable as volatile, producing stale reads/writes due to compiler optimization or caching behavior. This is a GPU-flavored instance of memory visibility/ordering bugs that can be hard to reproduce.(Lingming Zhang)

Bug example

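__global__ void k() {
  __shared__ int flag;               // BUG: should be volatile / properly fenced
  int tid = threadIdx.x;
  if (tid == 0) flag = 0;
  __syncthreads();                   // every thread now sees flag == 0
  if (tid == 0) flag = 1;
  while (flag == 0) { }              // may spin if compiler hoists load / visibility issues
}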

Seen in / checked by

GKLEE explicitly lists "forgot volatile" as a discovered bug type.(Lingming Zhang)
Simulee and other tools' race detection can surface some of these issues when they manifest as data races.(zhangyuqun.github.io)

Checking approach

Symbolic exploration (GKLEE-style): explore memory access orderings and detect stale read scenarios.(Lingming Zhang)
Pattern-based linting: flag spin-wait loops on shared memory without volatile or fence.

Verification strategy

Avoid exposing raw shared/global memory communication to policies; instead provide helpers with explicit semantics (e.g., "atomic increment" or "write once" patterns), and verify policies don't implement ad-hoc synchronization loops. Forbid spin-waiting on shared memory in policy code.

Verification scope analysis

Scope & Assurance: Extension-local (E), By-construction. Banning spin-wait loops and raw shared/global memory communication eliminates volatile/fence bugs entirely, providing perfect soundness with high completeness (legitimate polling patterns are rare in policy code).

Production guarantee: The verifier bans spin-wait loops (while(flag == 0)), flag polling patterns, and raw shared/global memory communication. All inter-thread communication must go through atomic helpers with explicit semantics (e.g., "atomic increment" or "write once" patterns). This eliminates volatile/fence bugs by forbidding the patterns that cause them.

Offline/CI tools: GKLEE explicitly lists "forgot volatile" as a discovered bug type via symbolic exploration. Simulee and other race detectors can surface these issues when they manifest as data races.

Residual gap: ITS (Independent Thread Scheduling) changes assumptions about warp-lockstep execution, making traditional volatile assumptions unreliable: code that worked on pre-Volta architectures may race on newer GPUs. The safest approach is to ban ad-hoc synchronization entirely rather than trying to verify memory model subtleties.

11) Shared-Memory Data Races (`__shared__`) [Correctness, GPU-specific]

What it is / why it matters

Threads in a block access on-chip shared memory concurrently; missing/incorrect synchronization causes races. This is a classic CUDA bug class (AuCS/Wu).

Bug example

__global__ void k(int* g) {
  __shared__ int s;
  int t = threadIdx.x;
  if (t == 0) s = 1;
  if (t == 1) s = 2;   // write-write race on s
  __syncthreads();
  g[t] = s;
}

Seen in / checked by

GPUVerify explicitly targets data-race freedom and defines intra-group / inter-group races.(Nathan Chong)
GKLEE reports finding races (and related deadlocks) via symbolic exploration.(Lingming Zhang)
Simulee detects data race bugs in real projects and uses a CUDA-aware notion of race.(zhangyuqun.github.io)
Wu et al. classify data race under "improper synchronization" as a CUDA-specific root cause.(arXiv)
Compute Sanitizer racecheck is a runtime shared-memory hazard detector.(Shinhwei)

Checking approach

Static verifier route (GPUVerify-style): enforce "race-free under SIMT" by proving that any two potentially concurrent lanes/threads cannot perform conflicting accesses without proper synchronization.(Nathan Chong)
Dynamic route (Simulee-style): instrument / simulate memory accesses and flag conflicting pairs; good for bug-finding and regression tests.(zhangyuqun.github.io)

Verification strategy

If policies have any shared state, require warp-uniform side effects or single-lane side effects (e.g., lane0 updates) plus explicit atomics. A conservative verifier rule is: policy code cannot write shared memory except via restricted helpers that are race-safe (e.g., per-warp aggregation).

Option A – warp-/block-uniform single-writer rules (e.g., "only lane 0 updates").
Option B – atomic-only helpers for shared objects.
Option C – per-thread/per-warp sharding (each lane updates its own slot).

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Shared-memory races depend only on the policy's access patterns and synchronization, providing strong soundness via structural restrictions (per-lane sharding or lane0-only writes eliminate races by construction) with medium completeness (complex shared-memory algorithms rejected).

Production guarantee: Three options, all Extension-local: (1) ban shared-memory writes entirely; (2) require per-lane sharding, where each lane writes its own slot so no conflicts are possible; (3) require lane0-only writes with atomic helpers. All three make races impossible by construction without requiring complex GPUVerify-style interleaving proofs.

Offline/CI tools: GPUVerify explicitly targets data-race freedom as a core verification goal and defines intra-group/inter-group races; ESBMC-GPU checks data races via bounded model checking; Compute Sanitizer racecheck is a runtime shared-memory hazard detector; Simulee detects data race bugs using CUDA-aware race definitions; Wu et al. classify data race under "improper synchronization" as a CUDA-specific root cause.

Residual gap: GPUVerify-style proofs are possible but complex for arbitrary code; structural restrictions are simpler and equally sound for policy use cases. Policies needing complex shared-memory algorithms should use ringbuffers instead, avoiding shared memory entirely.

12) Redundant Barriers (unnecessary `__syncthreads`) [Performance, GPU-specific]

What it is / why it matters

A redundant barrier is a performance-pathology class: removing the barrier does not introduce a race, so the barrier was unnecessary overhead.

Bug example

__global__ void k(int* out) {
  __shared__ int s[256];
  int t = threadIdx.x;
  s[t] = t;             // no cross-thread dependence here
  __syncthreads();      // redundant
  out[t] = s[t];
}

Seen in / checked by

Wu et al.: defines "redundant barrier function."(arXiv)
Simulee: detects redundant barrier bugs and reports numbers across projects.(zhangyuqun.github.io)
AuCS: repairs synchronization bugs, including redundant barriers.(Shinhwei)
GPURepair tooling also exists to insert/remove barriers to fix races and remove unnecessary ones.(GitHub)

Checking approach

Static/dynamic dependence analysis: determine whether any read-after-write / write-after-read across threads is protected by the barrier; if not, barrier is removable (Simulee/AuCS angle).(zhangyuqun.github.io)

Verification strategy

Since barriers are allowed in policy code (with uniform placement enforced by #1), redundant barriers become a performance concern. Use static dependence analysis to detect barriers where no cross-thread data dependence exists between the preceding writes and subsequent reads. The verifier can warn about or reject redundant barriers to enforce bounded overhead as a correctness property, ensuring policies do not introduce unnecessary synchronization cost.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-heuristic. Static dependence analysis can identify barriers that protect no cross-thread memory dependence, flagging them as redundant. This provides good detection coverage for common patterns.

Production guarantee: The verifier performs dependence analysis on barrier sites: if no read-after-write or write-after-read across threads is protected by a barrier, the barrier is flagged as redundant and rejected. Combined with the policy overhead budget, this ensures barriers are only used when structurally necessary for shared-memory coordination.

Offline/CI tools: Simulee detects redundant barriers through evolutionary simulation; Wu et al. define "redundant barrier function" as a key synchronization bug type; GPURepair uses GPUVerify as an oracle to repair data races/barrier divergence and can remove unnecessary barriers.

Residual gap: Some barriers may appear redundant in isolation but are necessary for correctness under specific scheduling scenarios. Conservative analysis may retain some unnecessary barriers; profiling tools can identify remaining optimization opportunities at the kernel level.
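
The dependence test behind "redundant barrier" can be illustrated with a small host-side C++ sketch. It is an assumption-laden toy: it enumerates a concrete access trace for one launch config (closer to a dynamic check than to a static proof), and `Access`, `barrier_is_needed`, and `barrier_needed_for_shift` are hypothetical names.

```cpp
#include <vector>

// One shared-memory access by one thread (hypothetical trace record).
struct Access {
  int thread;
  int index;
  bool is_write;
};

// A barrier is needed if some access before it and some access after it touch
// the same index from different threads and at least one of them is a write.
bool barrier_is_needed(const std::vector<Access>& before,
                       const std::vector<Access>& after) {
  for (const Access& a : before) {
    for (const Access& b : after) {
      bool conflict = a.index == b.index && (a.is_write || b.is_write);
      if (conflict && a.thread != b.thread) return true;
    }
  }
  return false;
}

// Builds the trace for:
//   s[t] = ...; __syncthreads(); out[t] = s[(t + shift) % n];
bool barrier_needed_for_shift(int n, int shift) {
  std::vector<Access> before, after;
  for (int t = 0; t < n; ++t) {
    before.push_back({t, t, true});                // thread t writes s[t]
    after.push_back({t, (t + shift) % n, false});  // thread t reads s[(t+shift)%n]
  }
  return barrier_is_needed(before, after);
}
```

With `shift == 0` the trace matches the bug example (each thread reads only its own slot), so the barrier is reported as unnecessary; with `shift == 1` each thread reads a neighbor's slot and the barrier is required. A verdict from a single trace only holds for that launch config.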

13) Host-Device Asynchronous Data Races (API ordering bugs) [Correctness, GPU-specific]

What it is / why it matters

CUDA exposes async kernel launches/memcpy/events; host code can race with device work if synchronization is missing. This is a major real-world bug source in heterogeneous programs and is not covered by pure kernel-only verifiers.

Bug example

int* d_data;
cudaMalloc(&d_data, N * sizeof(int));
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);  // no implicit sync with the default stream
kernel<<<grid, block, 0, s>>>(d_data);
// missing cudaStreamSynchronize(s) here
int* h_data = (int*)malloc(N * sizeof(int));
cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);  // race with kernel

Seen in / checked by

CuSan is an open-source detector for "data races between (asynchronous) CUDA calls and the host," using Clang/LLVM instrumentation plus ThreadSanitizer.(GitHub)

Checking approach

Dynamic detection (CuSan-style): instrument host-side CUDA API calls and detect ordering violations at runtime.

Verification strategy

If policies interact with host-visible buffers or involve asynchronous map copies, define a strict lifetime & ordering contract (e.g., "policy writes are only consumed after a guaranteed sync point"). For testing, integrate CuSan into CI for host-side integration tests of the runtime/loader.

Verification scope analysis

Scope & Assurance: Host+Device/System (H), Dynamic-only. These races involve host-side API calls (cudaMemcpy, kernel launch, synchronization) interacting with device execution: the policy verifier provides no soundness guarantees for this bug class (host API ordering is out of scope); completeness is N/A as this is fundamentally a host-side problem.

Production guarantee: The policy verifier cannot provide guarantees for this bug class. It can only ensure policy code doesn't introduce additional async semantics (e.g., policy writes are only visible after guaranteed sync points). Define strict lifetime & ordering contracts for policy-accessible buffers.

Offline/CI tools: CuSan is the primary tool: an open-source detector for "data races between (asynchronous) CUDA calls and the host," using Clang/LLVM instrumentation plus ThreadSanitizer. Integrate CuSan into CI for host-side integration tests of the runtime/loader.

Residual gap: Dynamic detection depends on test coverage: executed paths only. For production, implement runtime checks in the loader/driver for obvious violations (e.g., policy accessing freed memory, missing sync before host read). This is the H-track core tool requirement.
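
The "missing sync before host read" check mentioned above can be sketched as a tiny state machine over a per-buffer event trace. This is a simplified host-side C++ illustration, not how CuSan works internally; the `Event` enum and `find_unsynced_host_reads` are hypothetical, and a single buffer with a single global sync point is assumed.

```cpp
#include <vector>

// Hypothetical host-side events recorded by a runtime/loader for one buffer.
enum class Event { DeviceWrite, Sync, HostRead };

// Returns the trace positions of host reads that happen while a device write
// is still pending, i.e. with no sync point in between.
std::vector<int> find_unsynced_host_reads(const std::vector<Event>& trace) {
  std::vector<int> violations;
  bool write_pending = false;
  for (int i = 0; i < static_cast<int>(trace.size()); ++i) {
    switch (trace[i]) {
      case Event::DeviceWrite:
        write_pending = true;
        break;
      case Event::Sync:
        write_pending = false;
        break;
      case Event::HostRead:
        if (write_pending) violations.push_back(i);
        break;
    }
  }
  return violations;
}
```

A trace of launch, then host read is flagged at the read; inserting a sync event between them clears it. A real loader would need per-stream and per-event ordering rather than one global flag.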

14) Atomic Contention [Performance, GPU-amplified]

What it is / why it matters

Heavy atomic contention is a classic "performance bug that behaves like a DoS" under massive parallelism. Even when correctness is preserved, contention on a single address can cause extreme slowdowns (orders of magnitude). With millions of threads, a single hot atomic can serialize execution and cause tail latency explosion.

Bug example

__global__ void k(int* counter) {
  // All threads atomically increment the same location => extreme contention
  atomicAdd(counter, 1);
}
// Called with <<<1000, 1024>>> => 1M threads contending on one address

Seen in / checked by

GPUAtomicContention: an open-source benchmark suite (2025) explicitly measuring atomic performance under contention and across different memory scopes (block/device/system) and access patterns.(GitHub)

Checking approach

Budget-based verification: limit atomic frequency per warp/block.
Benchmarking: use atomic contention benchmarks to calibrate safe budgets.
Static analysis: identify hot atomic targets and warn about contention risk.

Verification strategy

Treat "atomic frequency + contention risk" as a verifier-enforced budget: e.g., allow at most one global atomic per warp, or require warp-aggregated updates. For evaluation, you can reuse the open benchmark suite to calibrate "safe budgets" per GPU generation. Consider requiring warp-level reduction before global atomics, which cuts the number of global atomics by up to 32x (one per warp instead of one per lane).

Verification scope analysis

Scope & Assurance: Combined → Extension-local (C→E) via budgetization, Static-sound. Contention severity depends on both policy behavior (atomic frequency) and kernel behavior (concurrent atomics to same address), but this reduces to Extension-local by treating atomics as a budget, providing strong soundness for policy's contribution with medium completeness (high-throughput atomic patterns hit budget limits).

Production guarantee: The verifier treats "atomic frequency + contention risk" as a budget: (1) limit to N global atomics per warp per invocation; (2) require warp-aggregation (one atomic per warp instead of per-lane), which cuts the number of global atomics by up to 32x by construction; (3) forbid unbounded atomic loops. The budget provides bounded-overhead guarantees for policy's contribution regardless of kernel behavior.

Offline/CI tools: GPUAtomicContention is an open-source benchmark suite (2025) explicitly measuring atomic performance under contention across different memory scopes (block/device/system) and access patterns: use it to calibrate "safe budgets" per GPU generation.

Residual gap: Total system contention depends on concurrent workloads: the verifier bounds policy's contribution, not system-wide slowdown. "Making Powerful Enemies on NVIDIA GPUs" demonstrates adversarial kernels can systematically amplify interference through shared resource contention, making tight system-wide bounds impossible to guarantee statically.
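
The arithmetic behind the warp-aggregation budget can be worked through in a few lines of host-side C++. The function names are hypothetical and the warp size of 32 is assumed; the sketch only counts atomics issued, it does not model latency.

```cpp
#include <cstdint>

constexpr uint64_t kWarpSize = 32;  // assumed warp width

// Global atomics issued when every thread calls atomicAdd once.
uint64_t atomics_per_lane(uint64_t blocks, uint64_t threads_per_block) {
  return blocks * threads_per_block;
}

// Global atomics issued when each warp first reduces locally and a single
// lane performs the atomic. Partial warps still issue one atomic each.
uint64_t atomics_warp_aggregated(uint64_t blocks, uint64_t threads_per_block) {
  uint64_t warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
  return blocks * warps_per_block;
}

// Hypothetical verifier rule: at most `budget` global atomics per warp per
// policy invocation.
bool within_atomic_budget(uint64_t atomics_per_warp, uint64_t budget) {
  return atomics_per_warp <= budget;
}
```

For the `<<<1000, 1024>>>` launch in the bug example, the per-lane version issues 1,024,000 atomics on one address, while the warp-aggregated version issues 32,000. Under a budget of one atomic per warp, a policy that issues 32 per warp (one per lane) is rejected.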

15) Non-Barrier Deadlocks [Safety, GPU-amplified]

What it is / why it matters

Besides barrier divergence (which is specifically about __syncthreads under divergent control flow), SIMT lockstep can create deadlocks in other patterns that are unusual on CPUs: spin-waiting, lock contention within a warp, and named-barrier misuse. Warp-specialized kernels often use named barriers or structured synchronization patterns between warps/roles (producer/consumer). Bugs include: (a) spin deadlock due to missing signals, (b) unsafe barrier reuse ("recycling") across iterations, (c) races between producers/consumers.

Bug example (spin deadlock)

__global__ void k(int* flag, int* data) {
  // Block 0 expects Block 1 to set flag, but no global sync exists
  if (blockIdx.x == 0) while (atomicAdd(flag, 0) == 0) { }  // may spin forever
  if (blockIdx.x == 1) { data[0] = 42; /* forgot to set flag */ }
}

Bug example (named-barrier misuse, sketch)

// Producer writes buffer then signals barrier B
// Consumer waits on B then reads buffer
// Bug: consumer waits on wrong barrier instance / reused incorrectly in loop

Seen in / checked by

iGUARD notes that lockstep execution can deadlock if threads within a warp use distinct locks.(Aditya K Kamath)
GKLEE reports finding deadlocks via symbolic exploration of GPU kernels.(Lingming Zhang)
ESBMC-GPU models and checks deadlock too.(GitHub)
WEFT verifies deadlock freedom, safe barrier recycling, and race freedom for producer-consumer synchronization (named barriers).(zhangyuqun.github.io)

Checking approach

Protocol verification (WEFT-style): for specific synchronization patterns, prove deadlock freedom + race freedom + safe reuse. Model barrier instances across loop iterations and prove safe reuse.(zhangyuqun.github.io)
Symbolic exploration (GKLEE-style): explore possible interleavings and detect deadlock states.(Lingming Zhang)

Verification strategy

Ban blocking primitives in policy code (locks, spin loops, waiting on global conditions). Add a verifier rule: no unbounded loops / no "wait until" patterns. If you absolutely need synchronization, force "single-lane, nonblocking" patterns and bounded retries. Policies must not interact with named barriers (no waits, no signals). This aligns with the availability story: policies must not create device stalls.

Verification scope analysis

Scope & Assurance: Extension-local (E), By-construction. Deadlock patterns (spin-wait, lock contention, named-barrier misuse) are structural properties of policy code; banning blocking primitives makes deadlocks structurally impossible with perfect soundness and high completeness (blocking patterns are rarely needed in policy code).

Production guarantee: The verifier bans: (1) while(condition) loops that could spin indefinitely; (2) lock primitives and mutex-like patterns; (3) named-barrier operations (waits, signals); (4) waiting on global conditions; (5) any construct that could block warp/block execution. If synchronization is needed, force "single-lane, nonblocking" patterns with bounded retries.

Offline/CI tools: ESBMC-GPU models and checks deadlock via bounded model checking; WEFT verifies deadlock freedom, safe barrier recycling, and race freedom for producer-consumer synchronization with named barriers; GKLEE reports finding deadlocks via symbolic exploration. iGUARD notes that lockstep execution can deadlock if threads within a warp use distinct locks.

Residual gap: Policies with legitimate bounded-retry patterns must be structured with explicit iteration counts to prove termination. iGUARD notes that lockstep execution can deadlock when threads within a warp use distinct locks; ITS changes those warp-lockstep assumptions, so whether lock-based code makes progress depends on the GPU generation. Banning blocking primitives is the only sound approach without complex ITS-aware analysis.

16) Kernel Non-Termination / Infinite Loops [Safety, GPU-amplified]

What it is / why it matters

Infinite loops can hang GPU execution. In practice, non-termination is especially dangerous because GPU preemption/recovery can be coarse.

Bug example

__global__ void k(int* flag) {
  while (*flag == 0) { }  // infinite loop if flag never set
  // or: while (true) { /* missing break */ }
}

Seen in / checked by

CL-Vis explicitly calls out infinite loops (together with barrier divergence) as GPU-specific bug types to detect/handle.(Computing and Informatics)

Checking approach

Static bounds analysis: prove loop termination or enforce compile-time bounded loops.
Runtime watchdog: timeout-based detection (coarse but practical).

Verification strategy

This is where "bounded overhead = correctness" is easiest to justify: enforce a strict instruction/iteration bound for policy code (like eBPF on CPU). If policies may contain loops, require compile-time bounded loops only, with conservative upper bounds.

Verification scope analysis

Scope & Assurance: Extension-local (E) for policy; kernel non-termination is out of scope. Static-sound, where bounded loops or instruction budget guarantees policy termination with strong soundness but low completeness (data-dependent loop bounds rejected even if always terminating).

Production guarantee: The eBPF approach works: (1) all loops must have compile-time bounded iteration counts; OR (2) ban loops entirely; OR (3) enforce a total instruction budget. The verifier proves termination by construction without analyzing the kernel. Policies may contain loops only if bounds can be statically determined.

Offline/CI tools: ESBMC-GPU can find non-termination paths within context bounds; CL-Vis explicitly calls out infinite loops (together with barrier divergence) as GPU-specific bug types to detect; runtime watchdogs provide coarse timeout-based detection (engineering stopgap, not completeness).

Residual gap: The verifier guarantees policy termination, not kernel termination. If the kernel itself has infinite loops, the policy verifier cannot and should not try to detect this; that's a kernel bug requiring kernel-level tools. This is "bounded overhead = correctness" at its most justified.
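
A minimal sketch of the "bounded loops plus instruction budget" rule, written as host-side C++ over a toy policy representation. Everything here is hypothetical: the `Segment` IR is flat (no nested loops), and a trip bound of 0 is used to mark a loop whose bound is not statically known.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kUnknownBound = 0;  // loop bound not statically known

// Toy policy IR (hypothetical): each segment is `instructions` executed
// `trip_bound` times; straight-line code uses trip_bound == 1.
struct Segment {
  uint64_t instructions;
  uint64_t trip_bound;
};

// Returns true if every loop has a static bound and the worst-case total
// stays within `budget`. On success `worst_case` holds that total; on
// rejection it holds the cost accepted before the offending segment.
bool verify_termination(const std::vector<Segment>& policy, uint64_t budget,
                        uint64_t& worst_case) {
  worst_case = 0;
  for (const Segment& s : policy) {
    if (s.trip_bound == kUnknownBound) return false;  // data-dependent loop
    // Compare against the remaining budget by division to avoid overflow.
    uint64_t remaining = budget - worst_case;
    if (s.instructions != 0 && s.trip_bound > remaining / s.instructions) {
      return false;
    }
    worst_case += s.instructions * s.trip_bound;
  }
  return true;
}
```

A policy with 10 straight-line instructions and a 4-instruction loop bounded at 16 iterations costs 74 in the worst case and passes a budget of 100; a spin loop with an unknown bound is rejected regardless of budget, which is the low-completeness trade-off described above.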

17) Global-Memory Data Races [Correctness, CPU-shared]

What it is / why it matters

Races on global memory are a fundamental correctness issue. Unlike shared memory (block-local), global memory is accessible by all threads across all blocks, making races harder to reason about. Many GPU race detectors historically focused on shared memory and ignored global-memory races.

Bug example

__global__ void k(int* g, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // Multiple threads may write to same location without sync
  if (tid < n) g[tid % 16] += 1;  // race if multiple threads hit same index
}

Seen in / checked by

ScoRD explicitly argues that many GPU race detectors focus on shared memory and ignore global-memory races.(CSA - IISc Bangalore)
iGUARD targets races in global memory introduced by advanced CUDA features.(Aditya K Kamath)
GKLEE reports global memory races via symbolic exploration.(Lingming Zhang)

Checking approach

Static verification: extend race-freedom proofs to global memory accesses.
Dynamic detection: instrument global memory accesses and track conflicting pairs.

Verification strategy

If policies can write to global memory (maps, counters, logs), require either: (1) warp-uniform single-writer rules, (2) atomic-only helpers, or (3) per-thread/per-warp sharding. Ban unprotected global writes from policies.

Verification scope analysis

Scope & Assurance: Combined → Extension-local (C→E) via state isolation, Static-sound. If policies can write arbitrary kernel global memory, race analysis requires knowing kernel access patterns (Combined). However, restricting policies to write only policy-owned objects reduces this to Extension-local. Isolation gives strong soundness, but completeness is low for kernel-modifying policies (direct kernel writes still require Combined analysis).

Production guarantee: Restricting policies to write only policy-owned objects (maps, ringbuffers) enables Extension-local verification: (1) policy-owned objects use known-safe access patterns (atomics, per-warp sharding); (2) the verifier guarantees race-freedom for policy state without inspecting the kernel; (3) ban unprotected global writes from policies. Three safe patterns: warp-uniform single-writer rules, atomic-only helpers, or per-thread/per-warp sharding.

Offline/CI tools: ScoRD explicitly argues that many GPU race detectors focus on shared memory and ignore global-memory races, and provides detection with scope awareness; iGUARD targets races in global memory introduced by advanced CUDA features via NVBit instrumentation; GKLEE reports global memory races via symbolic exploration. Note: Compute Sanitizer racecheck is primarily a shared-memory hazard detector; do not expect it to fully cover global races.

Residual gap: Policies needing to modify kernel data structures directly cannot be verified locally; this capability should be restricted or require explicit kernel-side contracts. ScoRD/iGUARD emphasize global-memory races are underdetected by existing tools; state isolation sidesteps this entirely for policy code.
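
State isolation and per-warp sharding can be illustrated with a small host-side C++ sketch. It assumes the runtime knows the address ranges of policy-owned objects; `Region`, `write_is_policy_owned`, and `shard_slot_address` are hypothetical names, and the check is shown as a plain range test rather than a full static analysis.

```cpp
#include <cstdint>
#include <vector>

// Half-open address range [begin, end) (hypothetical descriptor).
struct Region {
  uint64_t begin;
  uint64_t end;
};

// True if the write [addr, addr + size) lies entirely inside one
// policy-owned region.
bool write_is_policy_owned(const std::vector<Region>& owned, uint64_t addr,
                           uint64_t size) {
  for (const Region& r : owned) {
    if (addr >= r.begin && addr <= r.end && size <= r.end - addr) return true;
  }
  return false;
}

// Per-warp sharding: each warp gets its own slot, so writes from different
// warps never alias as long as slot_size > 0 and one lane per warp updates.
uint64_t shard_slot_address(const Region& counters, uint64_t warp_id,
                            uint64_t slot_size) {
  return counters.begin + warp_id * slot_size;
}
```

Because each warp's slot address is distinct, the single-writer-per-slot property holds without inspecting the kernel; writes that fall outside every owned region correspond to the "kernel-modifying" case that still needs Combined analysis.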

18) Memory Safety (Out-of-Bounds / Misaligned / Use-After-Free / Use-After-Scope / Uninitialized) [Safety, CPU-shared]

What it is / why it matters

Classic memory safety includes both spatial (OOB, misaligned) and temporal (UAF, UAS) violations. Temporal bugs exist on GPUs too: pointers can outlive allocations (host frees while kernel still uses, device-side stack frame returns, etc.).

Bug example (OOB)

__global__ void k(float* a, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  a[tid + 1024] = 0.0f;   // OOB write
}

Bug example (Use-After-Scope)

__device__ int* bad() {
  int local[8];
  return local;          // returns pointer to dead stack frame (UAS)
}
__global__ void k() {
  int* p = bad();
  int x = p[0];          // UAS read
}

Seen in / checked by

Compute Sanitizer memcheck precisely detects OOB/misaligned accesses (and can detect memory leaks).(NVIDIA Docs)
Oclgrind reports invalid memory accesses in its simulator.(GitHub)
ESBMC-GPU checks pointer safety and array bounds as part of its model checking.(GitHub)
GKLEE's evaluation includes out-of-bounds global memory accesses as error cases.(Lingming Zhang)
Wu et al.: "unauthorized memory access" appears in root-cause characterization.(arXiv)
cuCatch explicitly targets temporal violations using tagging mechanisms and discusses UAF/UAS detection.(d1qx31qr3h6wln.cloudfront.net)
Guardian: PTX-level instrumentation + interception to fence illegal memory accesses under GPU sharing.(arXiv)

Checking approach

Bounds-check instrumentation (Guardian/cuCatch-style): insert base+bounds checks (or partition-fencing) around loads/stores, for example via PTX-level instrumentation plus interception to fence illegal accesses.(arXiv)
Temporal tagging + runtime checks (cuCatch-style): tag allocations to track ownership and validate access rights before deref.(d1qx31qr3h6wln.cloudfront.net)
Static verification (ESBMC-GPU): model checking for pointer safety and array bounds.(GitHub)

Verification strategy

This is the "classic verifier" portion: keep eBPF-like pointer tracking, bounds checks, and restricted helpers. Easiest for policies is to ban arbitrary pointer dereferences and force all memory access through safe helpers (maps/ringbuffers). Ideally: policies cannot allocate/free; all policy-visible objects are managed by the extension runtime and remain valid across policy execution (no UAF/UAS by construction). Also add a testing story: run policy-enabled kernels under Compute Sanitizer memcheck in CI for regression.

Verification scope analysis

Scope & Assurance: Extension-local (E) for policy memory. Static-sound for spatial safety (helper-only access with tracked bounds); By-construction for temporal safety (runtime-managed objects, no policy malloc/free). Strong soundness with low completeness (raw pointer arithmetic rejected).

Production guarantee: The eBPF approach: (1) ban arbitrary pointer dereferencing; (2) all memory access through verified helpers (map lookup, ringbuffer write); (3) verifier tracks pointer provenance and bounds; (4) policy-visible objects are runtime-managed (no policy malloc/free): UAF/UAS impossible by construction because objects remain valid for the policy's lifetime. This provides strong memory safety for policy code without analyzing the kernel.

Offline/CI tools: Compute Sanitizer memcheck precisely detects OOB/misaligned accesses and memory leaks; cuCatch explicitly targets temporal violations using tagged base&bounds mechanisms and discusses UAF/UAS detection (some deterministic, some probabilistic); ESBMC-GPU checks pointer safety and array bounds via bounded model checking; GKLEE's evaluation includes out-of-bounds global memory accesses as error cases; Wu et al. characterize "unauthorized memory access" in their root-cause analysis; Guardian provides PTX-level instrumentation + interception for multi-tenant memory isolation.

Residual gap: Policy memory safety doesn't protect against kernel bugs. For multi-tenant fault isolation in spatial sharing (streams/MPS), Guardian-style PTX instrumentation or hardware isolation is needed to prevent one tenant's OOB from crashing others: policy verification alone is insufficient for system-wide isolation.

Multi-tenant implications

In spatial sharing (streams/MPS), kernels share a GPU address space. An OOB access by one application can crash other co-running applications (fault isolation issue). Guardian's motivation explicitly calls out this problem and designs PTX-level fencing + interception as a fix.(arXiv) This directly supports the "availability is correctness" story: if policies run in privileged/shared contexts, you must prevent policy code from generating OOB accesses. Either: (a) only allow map helpers (no raw memory), or (b) instrument policy memory ops with bounds checks (Guardian-style PTX rewriting).

Bug example (multi-tenant OOB, conceptual)

// Tenant A kernel writes OOB and corrupts Tenant B memory in same context.

Bug example (Uninitialized Memory)

__global__ void k(float* out, float* in, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // 'in' was cudaMalloc'd but never initialized or memset
  out[tid] = in[tid] * 2.0f;  // reading uninitialized memory
}

Uninitialized Memory: additional notes

Accessing device global memory without initialization leads to nondeterministic behavior. This is a frequent source of heisenbugs because GPU concurrency amplifies nondeterminism. Compute Sanitizer initcheck reports cases where device global memory is accessed without being initialized.(NVIDIA Docs) For policies, require explicit initialization semantics (e.g., map lookup returns "not found" unless initialized; forbid reading uninitialized slots).
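
The helper-only access model, including the "not found unless initialized" rule, can be sketched as a runtime-managed map. This is a host-side C++ illustration under stated assumptions: `PolicyMap` is a hypothetical class with a fixed capacity and integer keys, and it is not modeled on any specific eBPF map type.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical runtime-managed map with fixed capacity. The policy never sees
// a raw pointer; it can only go through lookup() and update().
class PolicyMap {
 public:
  explicit PolicyMap(std::size_t capacity)
      : values_(capacity, 0), initialized_(capacity, false) {}

  // Returns false for out-of-range keys instead of writing out of bounds.
  bool update(std::size_t key, int64_t value) {
    if (key >= values_.size()) return false;
    values_[key] = value;
    initialized_[key] = true;
    return true;
  }

  // Returns false ("not found") for out-of-range or never-written slots, so a
  // policy cannot observe uninitialized memory.
  bool lookup(std::size_t key, int64_t& out) const {
    if (key >= values_.size() || !initialized_[key]) return false;
    out = values_[key];
    return true;
  }

 private:
  std::vector<int64_t> values_;
  std::vector<bool> initialized_;
};
```

Since the map owns its storage and outlives each policy invocation, the policy has no way to free it or hold a dangling pointer into it, which is the by-construction argument for temporal safety made above.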

19) Arithmetic Errors (overflow, division by zero) [Correctness/Safety, CPU-shared]

What it is / why it matters

Arithmetic errors can corrupt keys/indices and cascade into memory safety/perf disasters.

Bug example

__global__ void k(int* out, int* in, int divisor) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = in[tid] / divisor;  // div-by-zero if divisor == 0

  int idx = tid * 1000000;       // overflow for large tid
  out[idx] = 1;                  // corrupted index => OOB
}

Seen in / checked by

ESBMC-GPU explicitly lists arithmetic overflow and division-by-zero among the properties it checks for CUDA programs, alongside races, deadlocks, and bounds (GitHub).

Checking approach

Model checking (ESBMC-GPU): static verification of arithmetic properties.
Lightweight runtime checks: guard div/mod operations.

Verification strategy

Optional but reviewer-friendly: add lightweight verifier checks for div-by-zero and dangerous shifts, and constrain pointer arithmetic (already typical in eBPF verifiers). For "perf correctness," overflow in index computations is a common hidden cause of random/uncoalesced patterns.

Verification scope analysis

Scope & Assurance: Extension-local (E), Static-sound. Arithmetic errors depend only on the policy's operations and input value ranges, providing strong soundness via range analysis with medium completeness (complex arithmetic may require explicit assertions).

Production guarantee: The verifier performs lightweight static checks: (1) division: require static proof that divisor ≠ 0, or insert runtime guards; (2) overflow: use saturating arithmetic, or prove bounds on operands; (3) dangerous shifts: validate shift amounts; (4) index arithmetic: track value ranges to catch OOB before memory access. This is already typical in eBPF verifiers and adds minimal overhead to policy verification.

Offline/CI tools: ESBMC-GPU checks arithmetic overflow and division-by-zero (alongside races, deadlocks, and bounds) for CUDA programs via bounded model checking.

Residual gap: Policies with complex arithmetic that happens to be safe may need explicit assertions or be conservatively rejected. Cascade risk: arithmetic errors often cascade into memory safety bugs (corrupted indices → OOB) or performance bugs (overflow in index computations causing random/uncoalesced patterns). The verifier should track value ranges through index computations proactively to catch these before they become downstream violations.
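
The range-tracking idea can be illustrated with a minimal interval-arithmetic sketch in Python. It is not the verifier described here: Range, check_division, and check_index are hypothetical names, and only multiplication by a constant scale is modeled, which is enough to cover the tid * 1000000 index from the bug example.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


@dataclass(frozen=True)
class Range:
    """Closed integer interval [lo, hi] tracked for one value."""
    lo: int
    hi: int

    def mul(self, other: "Range") -> "Range":
        # The extremes of a product of intervals lie at the corner products.
        corners = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Range(min(corners), max(corners))

    def contains_zero(self) -> bool:
        return self.lo <= 0 <= self.hi

    def fits_int32(self) -> bool:
        return INT32_MIN <= self.lo and self.hi <= INT32_MAX


def check_division(divisor: Range) -> str:
    # If zero cannot be excluded statically, fall back to a runtime guard.
    return "needs runtime guard" if divisor.contains_zero() else "ok"


def check_index(tid: Range, scale: int, array_len: int) -> str:
    idx = tid.mul(Range(scale, scale))
    if not idx.fits_int32():
        return "reject: possible int32 overflow"
    if idx.lo < 0 or idx.hi >= array_len:
        return "reject: possible out-of-bounds index"
    return "ok"
```

With a thread id in [0, 4095], the scaled index can reach 4,095,000,000, which exceeds the int32 maximum, so the sketch rejects it before any memory access is considered. A policy whose arithmetic is safe for reasons the intervals cannot see would also be rejected, which is the conservative behavior described in the residual gap.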

Summary: Improper Synchronization as a Root-Cause Category (Wu et al.'s Three-Way Taxonomy)

Wu et al.'s empirical study explicitly groups CUDA-specific synchronization issues into three concrete bug types: data race, barrier divergence, and redundant barrier functions. They also highlight that these often manifest as inferior performance and flaky tests. Simulee is used to find these categories in real projects.(arXiv)

This is exactly the "verification story" hook: a GPU extension verifier can claim that policy code cannot introduce these synchronization root causes because it enforces:

barriers only at provably uniform control-flow points,
warp-uniform side effects,
bounded helper calls,
and a restricted memory model for policies.

Summary: Verification Scope and Assurance Types

The verification scope and assurance type dimensions reveal crucial insights for GPU extension framework design.

By Verification Scope

Extension-local (E): 16 of 19 classes:
Bugs #1, #2, #4, #6, #7, #10, #11, #12, #15, #16, #18, #19 (12 classes) can be eliminated purely by restricting policy code, without inspecting the host kernel. Additionally, bugs #3, #5, #14, #17 (4 classes) can be reduced from Combined to Extension-local through state isolation.

Combined (C): 2 classes requiring contracts:
Bugs #8 (block-size dependence) and #9 (launch config assumptions) fundamentally depend on kernel launch parameters. These require contract-based validation at attach time.

Host+Device (H): 1 class requiring host-side tools:
Bug #13 (host↔device async races) cannot be addressed by device-side verification. Requires CuSan/TSan and careful API design.

By Assurance Type

Assurance Type	Bug Classes	Soundness	Completeness
By-construction	#2, #10, #15	Perfect	High
Static-sound	#1, #3, #4, #5, #6, #11, #14, #16, #17, #18, #19	Strong	Low-Medium
Static-heuristic	#7, #12	Good	Medium
Contract-based	#8, #9	Conditional	Depends on contracts
Dynamic-only	#13	Executed paths only	Coverage-dependent

The Three-Stage Verification Pipeline

Stage 1: Load-time static verifier (core, analogous to eBPF verifier)

The load-time verifier employs three tiers of analysis, ranging from outright bans on genuinely dangerous constructs to static analysis that preserves useful functionality:

Tier A — By-construction bans (3 classes, no legitimate policy use):

Ban warp sync primitives (#2) — mask correctness is unverifiable without ITS-aware analysis
Ban spin-wait / polling loops (#10) — causes stale reads and ad-hoc synchronization
Ban blocking primitives: locks, mutexes, named barriers (#15) — prevents non-barrier deadlocks

Tier B — Static-sound analysis (11 classes, allow but verify safe usage):

Verification capability	Bug classes covered	What it does
Uniform control-flow analysis	#1 barrier divergence, #4 warp-divergence race, #6 control-flow divergence	Prove barriers are at uniform points; side-effects on uniform paths
Memory access pattern analysis	#5 uncoalesced access	Check stride patterns; reject non-conforming index expressions
Race-freedom structural rules	#11 shared-mem races, #17 global-mem races	Per-lane sharding / lane0-only / atomic helpers + state isolation
Scope enforcement	#3 atomic scope	Force device-scope for policy atomics + state isolation
Pointer/memory safety	#18 memory safety	Restrict pointer operations, analogous to eBPF pointer verification
Loop termination	#16 non-termination	Enforce bounded iteration counts
Range analysis	#19 arithmetic errors	Track value ranges to prevent overflow cascading into OOB
Resource budgets	#14 atomic contention	Limit atomic counts / enforce warp-aggregation

Tier C — Static-heuristic detection (2 classes, performance warnings/rejections):

#7 bank conflicts → check shared-memory index stride against bank mapping
#12 redundant barriers → dependence analysis to determine if a barrier protects actual cross-thread dependencies
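
As a rough illustration of the stride-versus-bank check, the Python sketch below computes the worst-case conflict degree for a warp in which each lane reads shared-memory word lane * stride. The 32-bank, 4-byte-word layout and the conflict_degree name are assumptions for illustration; actual bank mappings depend on the architecture.

```python
from typing import Dict, Set

NUM_BANKS = 32   # assumption: 32 banks, one 4-byte word wide each
WARP_SIZE = 32


def conflict_degree(stride_words: int) -> int:
    """Worst-case number of distinct words in one bank for index = lane * stride."""
    words_per_bank: Dict[int, Set[int]] = {}
    for lane in range(WARP_SIZE):
        word = lane * stride_words
        # Lanes reading the same word are a broadcast, not a conflict,
        # so only distinct words per bank are counted.
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())
```

A degree of 1 means conflict-free access; a stride of 32 words serializes the whole warp (degree 32), while padding the stride to 33 restores degree 1. A heuristic verifier could warn or reject above some threshold chosen by the framework.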

Stage 2: Attach-time contract validation (2 classes)

#8 block-size dependence → policy declares preconditions (e.g., requires: blockDim.x >= 128), validated when attaching to a specific kernel
#9 launch config assumptions → validate grid/block dimensions satisfy policy preconditions
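
A minimal sketch of attach-time contract validation, written in Python for brevity. Contract, LaunchConfig, and validate are hypothetical names, and the three preconditions shown are examples only; a real framework would define its own contract language.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LaunchConfig:
    grid: Tuple[int, int, int]
    block: Tuple[int, int, int]


@dataclass(frozen=True)
class Contract:
    """Preconditions a policy declares about the kernel it attaches to."""
    min_block_x: int = 1
    block_x_multiple_of: int = 1
    max_threads_per_block: int = 1024


def validate(contract: Contract, cfg: LaunchConfig) -> List[str]:
    """Return the list of violated preconditions (empty means attach is allowed)."""
    errors: List[str] = []
    bx, by, bz = cfg.block
    if bx < contract.min_block_x:
        errors.append(f"blockDim.x={bx} < required {contract.min_block_x}")
    if bx % contract.block_x_multiple_of != 0:
        errors.append(f"blockDim.x={bx} not a multiple of {contract.block_x_multiple_of}")
    if bx * by * bz > contract.max_threads_per_block:
        errors.append(f"{bx * by * bz} threads per block exceeds {contract.max_threads_per_block}")
    if min(cfg.grid) < 1:
        errors.append("grid dimensions must be positive")
    return errors
```

Returning every violated precondition, rather than failing on the first, gives the developer a complete explanation of why a policy was refused for a particular kernel launch.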

Stage 3: CI/Offline + Runtime (complementary coverage)

#13 host↔device async races → CuSan/TSan dynamic detection, beyond device-side verification scope
GPUVerify/ESBMC-GPU for kernel+extension combined analysis (when source is available)
Compute Sanitizer suite for dynamic regression testing
iGUARD/Simulee for advanced race detection
Runtime bounds enforcement for multi-tenant isolation (Guardian-style PTX instrumentation)

The eBPF Lesson Applied to GPUs

Just as eBPF succeeds by restricting extension capabilities to what can be verified without inspecting the kernel, a GPU extension verifier should:

Ban only what is genuinely dangerous and unnecessary — warp sync, spin-wait, and blocking primitives have no legitimate use in policy code
Use static analysis to allow useful features safely — barriers, shared memory, and atomics are valuable; verify their safe usage rather than banning them
Isolate policy state to reduce Combined bugs to Extension-local
Enforce warp-uniformity for side effects, bounding SIMT-amplified overhead
Use budgets for performance-affecting resources (atomics, memory ops)
Require contracts only for unavoidably Combined properties (#8, #9)

The key design principle is not to ban everything that could go wrong, but to apply the right level of restriction for each risk: outright bans for constructs with no legitimate policy use, static verification for useful but dangerous features, and heuristic detection for performance concerns. This preserves policy expressiveness while maintaining soundness for safety-critical GPU extensions.

Architectures for Agent Systems: A Survey of Isolation, Integration, and Governance

云微 — Tue, 03 Feb 2026 07:36:18 +0000

Large Language Model (LLM) based agent systems – software that leverages LLMs to autonomously plan and execute multi-step tasks using external tools – are rapidly moving from proof-of-concept demos into enterprise deployment. These agents promise to automate coding, IT operations, data analysis, and more, but deploying them in production raises new challenges in security, reliability, and integration. Over the last half-year, the community has converged on key strategies: strong isolation for executing untrusted actions, standardized protocols for tool integration, and governance frameworks to align agent behavior with enterprise policies. This survey provides a systematic review of recent developments (roughly the latter half of 2025), including agent sandbox architectures, emerging standards like MCP, open-source projects, industry initiatives, and research advances. We focus on the pain points encountered when bringing agent systems to production and how the latest solutions address (or still fall short on) those needs.

1. Agent System Architecture in the Enterprise

An enterprise-ready agent system typically consists of several layers: (i) an LLM-based reasoning core (the "agent" that decides which actions to take), (ii) an interface to invoke external tools or services (e.g. via APIs, command-line, databases), and (iii) an execution environment or runtime where the agent's tool actions (like running code or shell commands) actually occur. Surrounding these are components for memory/state storage, orchestration (especially if multiple agents work together), and monitoring & control (for safety and compliance). The overarching architectural challenge is that these systems are highly dynamic and open-ended: the agent may generate arbitrary code or tool requests at runtime, often based on unpredictable input. This requires a different approach to software architecture than traditional deterministic services.

Isolation and Safety by Design. Unlike a bounded microservice, an AI agent might decide to execute unvetted code or make system-altering calls. A core architectural principle emerging in 2025 is to sandbox the agent's actions: run them in an isolated environment that protects the host system and network. For example, the open-source Agent Sandbox for Kubernetes was introduced as a new Kubernetes primitive to run AI agents safely. Instead of letting LLM-generated code run in a standard container (which could still abuse the host kernel or other pods), Agent Sandbox runs it under a sandboxed runtime (gVisor's user-space kernel, with optional Kata Containers support for lightweight-VM isolation) to create a secure barrier between the agent's code and the cluster node's OS. This isolates potentially malicious or errant code from interfering with other applications or the host. The Sandbox is managed via a custom Kubernetes resource (CRD) called Sandbox, which represents a single, stateful, long-lived pod with a stable identity and persistent storage. This design reflects a shift from treating agent workloads as ephemeral stateless functions to treating them as session-oriented services that may hold state over time. Indeed, the Agent Sandbox supports features like pausing and resuming the sandbox, automatically reviving it if a network reconnect is needed, and even memory sharing across sandboxes for efficiency. It also provides a templating and pool mechanism (SandboxTemplate and SandboxClaim) to manage pools of pre-warmed sandbox pods. Pre-warming is crucial because launching a fresh isolated sandbox can be slow; by keeping a pool of ready-to-go sandboxes, startup latency for a new agent session is dramatically reduced (Google reports sub-second startup latency, a ~90% improvement over cold-starting sandboxes). In Google's GKE, this is paired with a new Pod Snapshots feature that can checkpoint and restore running sandbox pods (even GPU workloads), cutting startup from minutes to seconds and avoiding idle resource waste. In short, the sandbox architecture is purpose-built for autonomous agents: it provides stronger isolation than ordinary containers, yet supports persistent state and fast elasticity to accommodate long-running, interactive agent tasks at scale.
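
The pre-warming idea can be sketched independently of Kubernetes. The Python below is a toy model, not the Agent Sandbox API: WarmPool, claim, and refill are hypothetical names, and boot stands in for whatever slow operation creates an isolated sandbox.

```python
from collections import deque
from itertools import count
from typing import Callable, Deque, Tuple


class WarmPool:
    """Keeps a few already-booted sandboxes ready so a claim avoids a cold start."""

    def __init__(self, size: int, boot: Callable[[], str]) -> None:
        self._size = size
        self._boot = boot
        self._ready: Deque[str] = deque()
        self.refill()

    def refill(self) -> None:
        # In a real controller this would run in the background, off the claim path.
        while len(self._ready) < self._size:
            self._ready.append(self._boot())

    def claim(self) -> Tuple[str, bool]:
        """Return (sandbox, was_warm); fall back to a cold boot when the pool is empty."""
        if self._ready:
            return self._ready.popleft(), True
        return self._boot(), False


# Usage: boot() is a stand-in that just hands out sequential names.
_ids = count(1)
pool = WarmPool(size=2, boot=lambda: f"sandbox-{next(_ids)}")
first, was_warm = pool.claim()
```

The latency benefit comes entirely from moving boot() off the request path; the pool size trades idle resource cost against the chance of a cold start under bursty demand.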

Stateful Singleton Runtimes. Traditional cloud apps often scale by running many stateless instances behind a load balancer, but agent use-cases (like an AI coding assistant or an autonomous scheduler) often manifest as a single specialized "worker" with memory (such as cached tools or context) that persists across many tool calls. The Kubernetes Agent Sandbox explicitly targets these singleton, stateful workloads – not just for AI agents but also things like CI/CD build agents or single-node databases that require stable identity and disk state. This reflects a broader industry recognition: agent applications need new runtime primitives that can maintain continuity of state and identity across a session (for example, so the agent can incrementally build on previous tool outputs, or maintain an authenticated session to a service). Recent designs propose durable execution for agents – the ability to pause an agent's process, snapshot its memory or file system, and later resume or even migrate it. The GKE Agent Sandbox + Pod Snapshot combo is an early real-world example of this, effectively treating an agent's environment as a checkpointable virtual machine. We anticipate emerging orchestration support where an agent can be hibernated when idle and quickly reawakened when needed, balancing responsiveness with efficient resource use.

Tool Interface Layer. The other critical piece of architecture is how agents interface with external tools and data. Historically, each AI assistant platform invented its own plugin system or API schema (e.g. OpenAI's Plugins, LangChain's tool abstractions). This led to a fragmented ecosystem where tools had to be rewritten for each agent framework. Over 2025, a consensus has grown around Model Context Protocol (MCP) as a standard interface between AI models (the clients) and tools or services (the servers). MCP was released by Anthropic in late 2024 and by 2025 it has become "the universal standard protocol for connecting AI models to tools, data, and applications". Conceptually, MCP defines a simple JSON-RPC-based client-server protocol by which an AI agent can discover available tools and invoke them with arguments, and receive results/observations. The tools can be anything: database queries, file system operations, web requests, code compilation – each exposed by an MCP server that the agent connects to. The power of a common protocol is that it transforms the integration problem from M×N (every model integrating with every tool) to M+N modularity. A tool developer can create an MCP server once, and any compliant agent (whether it's OpenAI's, Anthropic's, or an open-source project) can use it. This dramatically reduces duplicated effort and makes the system more maintainable. GitHub engineers describe MCP as creating a "USB-C for AI" – a universal port for tools. In practice, MCP connections can be local (via stdio pipes) or remote (HTTP+SSE streams), and are typically stateful sessions, which aligns well with the idea of agent tools that maintain context (e.g. a database connection that stays open, or a browser that retains cookies).
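
To make the discover-then-invoke flow concrete, here is a minimal Python sketch that builds the two JSON-RPC 2.0 requests involved, using the MCP method names tools/list and tools/call. The tool name query_db and its sql argument are made up for illustration, no transport is shown, and integration_count simply restates the M×N versus M+N arithmetic.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_ids = count(1)


def rpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object with a fresh id."""
    msg: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Step 1: ask the server which tools it offers.
list_tools = rpc_request("tools/list")

# Step 2: invoke one tool by name with JSON arguments (hypothetical tool).
call_tool = rpc_request(
    "tools/call",
    {"name": "query_db", "arguments": {"sql": "SELECT 1"}},
)

wire_bytes = json.dumps(call_tool).encode("utf-8")  # what would go over stdio or HTTP


def integration_count(models: int, tools: int, shared_protocol: bool) -> int:
    """Adapters needed: one per model-tool pair, or one per participant with a shared protocol."""
    return models + tools if shared_protocol else models * tools
```

For example, 5 agent clients and 20 tools need 100 bespoke adapters without a common protocol, but only 25 protocol implementations with one.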

Orchestration and Multi-Agent Workflows. Many real tasks may be too complex for a single agent or might benefit from specialized agents collaborating. The architecture is therefore expanding to support multi-agent systems where agents communicate or coordinate. Some protocols, like Agent-to-Agent (A2A) messaging, are emerging to standardize inter-agent communication (for instance, Google's Agent2Agent protocol and Microsoft's adoption of A2A in their framework). In a multi-agent setup, you might have one agent that specializes in planning, another in executing code, another in validation, etc., passing context or subtasks among them. Orchestration frameworks now often support deterministic workflows (where the chain of sub-tasks is predefined, akin to a business process) alongside LLM-driven orchestration (where agents dynamically decide how to break down and assign tasks). For example, Microsoft's new open-source Agent Framework explicitly supports both Agent Orchestration (LLM-driven, creative, adaptive) and Workflow Orchestration (fixed logic, for reliable repeatability) within one runtime. This framework, released in late 2025, consolidates previous research prototypes (like Semantic Kernel's planner and AutoGen from MSR) into an enterprise-ready SDK. It emphasizes connectors to enterprise systems, open standards (MCP, A2A, OpenAPI), and built-in telemetry, approvals, and long-running durability to meet enterprise needs. The trend here is that agents are being treated as first-class components of software systems, with the same expectations for monitoring, security, and lifecycle management as microservices or human-in-the-loop workflows.

Summary: The architecture of modern agent systems is coalescing around a modular, layered design. A secure sandboxed execution layer ensures that any generated code or commands run in isolation with controlled privileges. A standardized tool interface layer (MCP and similar protocols) decouples agent reasoning from the implementation of tools, enabling a rich ecosystem of reusable capabilities. On top of these, orchestration mechanisms allow composing multiple agents and tools into larger autonomous workflows, while providing hooks for humans and existing DevOps processes to supervise and intervene when needed. In the following sections, we delve deeper into three crucial aspects of enterprise agent systems: (a) the sandbox and runtime isolation mechanisms, (b) the emerging standards and ecosystems of tools/plugins, and (c) the security, governance, and observability considerations that are top-of-mind as organizations deploy these systems.

2. Isolated Execution Environments for Agents (Sandboxing)

Running untrusted or machine-generated code has always been risky – the difference now is that with LLM agents the code is being generated and executed on the fly, without a human vetting each command. This opens the door to accidental failures or even malicious exploits if the agent is tricked or if its outputs are unsafe. As a result, sandboxing has become a foundational requirement for agent systems. Sandboxing in this context means confining the agent's actions (code execution, file system writes, network calls, etc.) to an environment where it can't harm other processes or breach data it shouldn't access.

Table 1: Research / OSS Projects (Papers, Benchmarks, Open-Source Runtimes)

Name	Category	Sandbox/Isolation Boundary	Key Capabilities	Reference
Kubernetes SIGs: agent-sandbox	OSS (K8s Primitives/Controller)	Sandbox CRD in Kubernetes (with Template/Claim/WarmPool)	Manage "isolated + stateful + singleton" workloads; standardized API for agent runtime	GitHub
AIO Sandbox (agent-infra/sandbox)	OSS (All-in-One Environment)	Single Docker container (integrated multi-tools)	Browser/Shell/File/MCP/VSCode Server unified; unified workspace for agents & dev	GitHub
Alibaba OpenSandbox	OSS (Universal Sandbox Platform)	Unified protocol + multi-language SDK + sandbox runtime	Universal sandbox foundation for command/file/code/browser/agent execution	GitHub
E2B (e2b-dev/E2B)	OSS (Cloud Sandbox Infrastructure)	Cloud-isolated sandbox (SDK controlled)	Run AI-generated code in cloud; Python/JS SDK; for agent code interpreter	GitHub
E2B Desktop (e2b-dev/desktop)	OSS (Virtual Desktop Sandbox)	Isolated virtual desktop environment	"Computer Use" agent: desktop GUI, customizable dependencies, per-sandbox isolation	GitHub
LLM Sandbox (vndee/llm-sandbox)	OSS (Lightweight Code Sandbox)	Containerized isolation (configurable security policies)	Run LLM-generated code; customizable security policies and isolated container environments	GitHub
SkyPilot Code Sandbox (alex000kim/…)	OSS (Self-hosted Execution Service)	SkyPilot deployment + Docker sandboxing	Self-hosted, multi-language execution, token auth, MCP integration (for agent tools)	GitHub
Microsandbox (zerocore-ai/microsandbox)	OSS (microVM Execution Environment)	Hardware-isolated microVM (fast startup)	Run untrusted workloads via microVM; emphasis on isolation strength and startup speed	GitHub
ERA (BinSquare/ERA)	OSS (Local microVM Sandbox)	Local microVM ("microVM with container ease-of-use")	Run untrusted/AI-generated code locally with hardware-level isolation	GitHub
SandboxAI (substratusai/sandboxai)	OSS (Runtime)	Isolated sandbox	Secure execution runtime for AI-generated Python code and shell commands	GitHub
Python MCP Sandbox (JohanLi233/mcp-sandbox)	OSS (MCP Server)	Docker container isolation	Expose "secure Python execution" as a tool to agent/LLM clients via MCP	GitHub
Code Sandbox MCP (Automata-Labs-team/…)	OSS (MCP Server)	Docker container isolation	MCP server: provide containerized secure code execution environment for AI applications	GitHub
ToolSandbox (Apple)	Research + OSS (Evaluation Benchmark)	Evaluation sandbox with "stateful tool execution + user simulator"	Evaluate LLM tool-use: state dependencies, multi-turn dialogue, dynamic evaluation; open-source	arXiv
ToolEmu	Research (Risk Evaluation Framework)	LM-emulated sandbox (simulate tool execution with LM)	Use LM to simulate tool execution for scalable agent risk testing; includes automatic safety evaluator	OpenReview
HAICOSYSTEM	Research + OSS (Safety Evaluation Ecosystem)	Modular interaction sandbox (human-agent-tool multi-turn simulation)	Multi-domain scenario simulation and multi-dimensional risk evaluation (operational/content/social/legal); code platform	arXiv
EnterpriseBench	Research (Enterprise Environment Evaluation Sandbox)	"Evaluation environment" for enterprise tasks/tools/data	Evaluate LLM agents in enterprise scenarios (task execution, tool dependencies, data retrieval)
Managing Linux servers with LLM-based AI agents	Research (Empirical Evaluation)	Dockerized Linux sandbox	Let agents execute server tasks in Dockerized Linux environment and evaluate performance	ScienceDirect
Multi-Programming Language Sandbox for LLMs	Research (Multi-language Execution Sandbox)	Container-isolated sub-sandbox	Multi-language compilation/execution isolation (sub-sandbox isolated from main environment)	arXiv
awesome-sandbox (restyler/awesome-sandbox)	OSS (Ecosystem Overview/List)	N/A (aggregation)	Systematic curated list & analysis of "code sandboxing solutions"; good entry point for long-tail coverage	GitHub

Note: Achieving exhaustive coverage is impractical (especially given the long tail of the MCP ecosystem), so this table covers mainstream/representative projects plus ecosystem indexes. The awesome-sandbox list serves as an entry point for additional coverage.

Table 2: Commercial / Cloud Service Projects (Agent Sandbox / Code Sandbox / Runtime)

Product/Service	Vendor	Isolation/Execution Model	Key Capabilities	Reference
Code Interpreter (Tools)	OpenAI	Managed Python sandbox execution	Model writes and runs Python; for data analysis/coding/math	OpenAI Platform
Code Interpreter (Assistants on Azure)	Microsoft Azure OpenAI	Managed Python sandbox execution	Assistants API runs Python in sandbox environment (per Azure docs)	Microsoft Learn
E2B (Managed Cloud)	E2B	Managed cloud sandbox (enterprise agent cloud)	Sandbox as agent runtime; emphasis on concurrency and execution infrastructure	E2B
Daytona	Daytona	Managed/platform sandbox infrastructure	"Stateful infra for AI agents"; ultra-fast creation and isolated execution	Daytona
Agent Sandbox	Novita AI	Managed agent runtime	Low startup latency, high concurrency; code execution/network access/browser automation	Novita AI
Sandboxes (Desktop / GUI)	Bunnyshell	Firecracker microVM virtual desktop	For GUI/Computer Use: isolated desktop, VNC/noVNC, desktop automation API	Bunnyshell
Agent Sandbox on GKE	Google Cloud (GKE)	Deploy/run Agent Sandbox controller on GKE	Isolated execution of untrusted commands in cluster; official installation and usage guide	Google Cloud Documentation
AgentCore "agent sandbox"	AWS Bedrock AgentCore	Console testing sandbox	AWS docs: test agents in agent sandbox	AWS Documentation
Modal Sandboxes	Modal	Modal platform sandbox execution unit	Official example: build code-executing agent with Modal Sandboxes + LangGraph	Modal
Vercel Sandbox	Vercel	Vercel managed execution environment (Sandbox product)	For scalable execution (fluid compute/pay-per-active-CPU, etc.)	Vercel
Docker Sandboxes (Experimental)	Docker	Local containerized sandbox (for coding agents)	Docker official: use local isolated environments to run coding agents, enforce boundaries	Docker

Agent Sandbox on Kubernetes. The Kubernetes-based Agent Sandbox, spearheaded by Google and open-sourced as a Kubernetes SIG project in late 2025, exemplifies state-of-the-art sandbox design. A sandbox instance is an isolated pod launched per agent session and managed through K8s APIs. Internally it leverages technologies like gVisor (a user-space kernel that intercepts syscalls) and Kata Containers (lightweight VM isolation) to provide a robust security boundary. This means even if an agent's code tries to perform a malicious syscall or exploit a kernel bug, it's constrained within a sandbox kernel that has minimal privileges on the host. The sandbox also limits network access by default on GKE (only allowing what's necessary for the agent tools), reducing the risk of an agent scanning internal networks or exfiltrating data. At KubeCon NA 2025, Google showcased how they can schedule thousands of sandbox pods in parallel, thanks to the lightweight nature of gVisor, and how pre-warmed sandbox pools enable sub-second startup latencies even with the isolation. This addresses the performance concern that isolation often introduces: by carefully engineering snapshot/restore and pooling, the overhead can be kept low enough for interactive use.

From an API standpoint, the Sandbox CRD provides features tailored to long-running agent processes: you can specify resource limits, attach persistent volumes for agent state, and use the Kubernetes scheduler to place sandboxes on appropriate nodes (e.g. ones with GPU if the agent needs it). It also has life-cycle controls like scheduled deletion (to clean up sandboxes after use) and the mentioned pause/resume. Collectively, these features fulfill OWASP's top recommendation for mitigating agent risks: "system isolation, access segregation, permission management, command validation, and other safeguards". In fact, OWASP added an entry to its Top 10 for LLMs called "Agent Tool Interaction Manipulation" – the risk of an AI agent being induced to misuse its tools or perform unintended actions. The primary defense listed is to run the agent in a locked-down environment with fine-grained permission controls on what it can do. By confining an agent to a Kubernetes sandbox with only specific Kubernetes API access (or none at all beyond its tools) and no broad host access, even a compromised agent will have limited blast radius.
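
Command validation, one of the safeguards named in the quoted recommendation, can be sketched as an allowlist check applied before anything reaches the sandbox shell. This Python example is an assumption-laden illustration: the allowed command set, the denied git subcommand, and validate_command are all hypothetical, and such a filter complements isolation rather than replacing it.

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "git", "pytest"}   # hypothetical allowlist
DENIED_GIT_SUBCOMMANDS = {"push"}                   # hypothetical per-tool restriction
SHELL_METACHARACTERS = set(";&|`$><")


def validate_command(command_line: str) -> bool:
    """Return True only if the command is on the allowlist and has no shell chaining."""
    # Reject chaining, substitution, and redirection outright.
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return False
    try:
        argv = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes and similar parse errors are treated as invalid.
        return False
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return False
    if argv[0] == "git" and len(argv) > 1 and argv[1] in DENIED_GIT_SUBCOMMANDS:
        return False
    return True
```

A validated command would then be executed from its parsed argument list without a shell, so the check and the execution see the same tokens.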

Local Sandboxing Solutions. Not all organizations use Kubernetes or need cloud-scale multi-tenancy; for individual developers or on-prem deployment, there are lighter-weight sandbox solutions emerging. One notable project is ERA (by BinSquare), which provides a local sandbox for running AI-generated code with "microVM security guarantees plus containers ease of use". ERA uses technologies like krunvm (a libkrun-based microVM runner) under the hood, orchestrated in a way that feels like using Docker containers. The idea is to give developers a quick way to test AI-written scripts safely on their laptop or CI pipeline, without having to set up full Kubernetes. Similarly, some frameworks allow using WebAssembly (Wasm) sandboxes for certain tasks (since Wasm can restrict file and network access for code running within it). The InfoQ article on sandboxing mentions Lightning AI's LitSandbox and a library called container-use as alternatives, which likely explore isolating Python execution or providing wrapper APIs that simulate a sandbox. While these are not yet as standardized as the Kubernetes Agent Sandbox, they indicate a broad interest in making sandboxing accessible across environments.

Integration with Agent Frameworks. Modern agent frameworks are starting to build in assumptions about sandboxing. For example, LangChain (one of the earliest agent libraries) historically would just execute Python code or bash commands directly on the host, which is obviously dangerous in production. By late 2025, we see frameworks like LangGraph 1.0 (the evolution of LangChain's agent module) emphasizing "durable and safe" execution, and CrewAI (another open-source agent framework) adding features for asynchronous tool execution and monitoring to potentially plug into sandboxed runtimes. Microsoft's Agent Framework integrates with their Azure Foundry services, which likely means an agent's code execution can be routed to a managed sandbox (e.g. an isolated Azure Function or container instance) – in their blog they highlight "enterprise-grade deployment from the beginning", including security and compliance hooks. We also see new tools like Aspire's AI agent isolation module (by Microsoft) which aims to allow developers to run multiple agent instances in parallel without conflict, hinting at port isolation and MCP proxy layers. All these efforts point to execution isolation becoming a default part of agent system design. It's no longer assumed that an agent's code runs in the same process as the host application or with full OS privileges – instead, agents run in a contained, observable slot, much like how web browsers run untrusted JavaScript in a sandboxed process.

Transactional and Fault-Tolerant Execution. A sophisticated angle to sandboxing is making execution fault-tolerant. If an agent's action fails or does something unwanted, can we roll it back? One recent research prototype, Fault-Tolerant Sandboxing for AI Coding Agents, introduced a transactional file system wrapper for agent execution. It intercepts file system writes and system changes during an agent's tool use, and if the agent misbehaves or a policy violation is detected, the sandbox can roll back to a clean snapshot. In their experiments, 100% of unsafe actions were intercepted and rolled back, at a cost of ~14.5% performance overhead. However, they note a key limitation: this works for local state (files, processes) but not for external side effects. If the agent made a cloud API call that created resources or sent emails, a local rollback doesn't undo those. This is pushing the conversation toward distributed transaction semantics for agents: treating a sequence of tool API calls as a saga that might need compensating actions if aborted. While not solved yet, it's a recognized gap (researchers call for integrating compensating transactions for external tools to truly sandbox at the multi-system level). For now, sandboxing primarily ensures the agent's local environment can be reset to a safe state even if one step goes awry.
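
The rollback behavior can be approximated with a much simpler technique than the paper's write-intercepting wrapper: copy the workspace before the action and restore the copy if a check fails. The Python sketch below uses that swapped-in approach; run_transactionally, action, and is_safe are hypothetical names, and, as the text notes, nothing here undoes external side effects.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_transactionally(
    workspace: Path,
    action: Callable[[Path], None],
    is_safe: Callable[[Path], bool],
) -> bool:
    """Run action on workspace; restore the pre-action snapshot if it fails the check."""
    snapshot_root = Path(tempfile.mkdtemp(prefix="agent-snapshot-"))
    snapshot = snapshot_root / "ws"
    shutil.copytree(workspace, snapshot)  # full copy: simple, but costly for large trees
    try:
        try:
            action(workspace)
            ok = is_safe(workspace)
        except Exception:
            ok = False  # a crashing action is treated like a policy violation
        if not ok:
            # Roll back: discard everything the action did and restore the snapshot.
            shutil.rmtree(workspace)
            shutil.copytree(snapshot, workspace)
        return ok
    finally:
        shutil.rmtree(snapshot_root, ignore_errors=True)
```

A copy-per-action design has far higher overhead than intercepting individual writes, which is one reason a production system would prefer copy-on-write snapshots or an overlay file system.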

Human Takeover and Hybrid Sandboxes. An intriguing development in sandbox design is support for human-in-the-loop interventions not just via yes/no approval prompts, but via full manual control of the sandbox. The idea is that if an agent reaches a step where it is stuck or needs privileged action (like entering a password or solving a tricky problem), a human operator can seamlessly take over the agent's sandbox session, do what's needed, and then hand control back to the AI. The research prototype AgentBay embodies this concept: it provides a unified isolated session that the AI agent can control via API (e.g. issuing OS commands, browser actions) and that a human can remote into graphically at any moment. AgentBay implements a custom Adaptive Streaming Protocol (ASP) to make this possible with very low latency. Unlike traditional screen sharing (RDP/VNC), ASP dynamically switches between sending high-level commands and video frames, adjusting to network conditions and whether the AI or human is currently in charge. The result is a much smoother experience for the human supervisor, even on weaker networks. In tests, allowing a human to intervene in AgentBay's sandbox improved task success rates by over 48% on complex benchmarks, showing the value of fluid HITL (Human-In-The-Loop) control. This approach directly addresses enterprise needs for control: rather than the agent being a black-box automation that might get stuck, it becomes a cooperative automation that an analyst or engineer can jump into whenever needed, without compromising the isolation or requiring the task to be restarted. We foresee future enterprise agent platforms offering a "panic button" or agent assist mode that spawns a secure VNC/Browser session for an operator, all actions logged, then closes back to autonomous mode.

In summary, sandboxing in agent systems has evolved into a multi-faceted capability: it's not only about securing the environment (with VMs, syscall filters, network restrictions), but also about managing the agent's lifecycle and state (persistent storage, snapshots, warm pools) and facilitating controlled handoffs (pause/resume and human takeover). The investments by major players – e.g. Google building Agent Sandbox as a CNCF project – indicate that these sandboxing techniques will likely become standard infrastructure in cloud platforms. Just as Kubernetes gave us primitives for scalable microservices, we are now getting primitives for safe autonomous agent execution on the cloud and the edge.

3. Tool Ecosystem and Standardization: From Plugins to MCP

In parallel with sandboxing the runtime, the industry has tackled the tool integration problem for agents. Early agent implementations often hard-coded a set of tools or required developers to write custom "plugin" adapters for each use case. This doesn't scale when enterprises might want agents to access dozens of internal APIs, databases, and third-party services. The last six months have seen a strong push toward standardizing how agents discover and use tools, yielding a more interoperable ecosystem.

3.1 Model Context Protocol (MCP) and the AAIF

Model Context Protocol (MCP) has emerged as the de facto standard protocol in this space. As mentioned, MCP defines a client-server schema where the AI agent (client) can list what tools a server offers, call those tools with JSON arguments, and receive results. It also covers things like authentication handshakes (e.g. OAuth flows to let an agent "login" to use a tool on a user's behalf) and streaming responses (for tools that send incremental results). By late 2025, MCP's momentum was cemented by the formation of the Agentic AI Foundation (AAIF) under the Linux Foundation. In December 2025, the Linux Foundation announced AAIF with MCP as a founding contribution alongside OpenAI's AGENTS.md and Block's Goose. The goal is to provide a neutral, open governance home for these agent standards so that no single company controls them. The AAIF launch PR notes MCP had already exploded in adoption: over 10,000 MCP servers published covering everything from dev tools to Fortune 500 internal integrations, and support built into major AI platforms including Claude, ChatGPT, GitHub Copilot, Google Gemini, VS Code, Cursor, and many others. This is remarkable considering MCP was only open-sourced in late 2024 – it resonated because it addressed an urgent pain point: without it, every AI vendor and every enterprise would be duplicating integrations. By rallying around MCP, the community effectively agreed on a "lingua franca" between agents and tools.
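
The list-then-call exchange can be sketched with plain JSON-RPC 2.0 messages. The method names `tools/list` and `tools/call` and the `name`/`arguments` parameters follow the MCP specification as I understand it, so check the current spec before relying on exact field names. The `TOOLS` registry, `handle`, and `roundtrip` are hypothetical stand-ins for a real server and transport, and authentication and streaming are left out.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)


def make_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Hypothetical in-process tool registry standing in for a real MCP server.
TOOLS = {
    "add": {
        "description": "Add two integers.",
        "inputSchema": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
        "handler": lambda args: args["a"] + args["b"],
    }
}


def handle(request: dict) -> dict:
    """Answer tools/list and tools/call; anything else is 'method not found'."""
    reply = {"jsonrpc": "2.0", "id": request["id"]}
    if request["method"] == "tools/list":
        reply["result"] = {
            "tools": [
                {"name": name, "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in TOOLS.items()
            ]
        }
    elif request["method"] == "tools/call":
        params = request["params"]
        tool = TOOLS.get(params["name"])
        if tool is None:
            reply["error"] = {"code": -32602, "message": "unknown tool"}
        else:
            output = tool["handler"](params["arguments"])
            reply["result"] = {
                "content": [{"type": "text", "text": str(output)}],
                "isError": False,
            }
    else:
        reply["error"] = {"code": -32601, "message": "method not found"}
    return reply


def roundtrip(request: dict) -> dict:
    """Serialize through JSON to mimic crossing a transport boundary."""
    return json.loads(json.dumps(handle(json.loads(json.dumps(request)))))
```

Because every call is a structured message with a named tool and typed arguments, the same envelope is also the natural place to attach logging and policy checks.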

From an enterprise perspective, MCP brings several benefits:

Interoperability: A tool (say a database query interface) can be implemented once as an MCP server and then used by different agents (Anthropic's, OpenAI's, self-hosted ones) without custom adapters. This has analogies to drivers or connectors in classical software – build it once, use anywhere.
Security and Auditability: MCP messages are structured (JSON) and typically go through a client library in the agent runtime, where they can be logged and inspected. This makes it easier to audit what the agent asked a tool to do, as opposed to the agent running free-form shell commands that are hard to intercept. The protocol includes a capability advertisement step (the server tells what it can do), which can be checked against policies. It also often requires an auth handshake (e.g. OAuth) for the agent to gain access to the tool on behalf of a user, which means existing identity systems can mediate access.
Modularity and Future-proofing: As InfoQ summarized, MCP shifts integration from a tangled web into a modular architecture, reducing the "plugin fatigue" problem and making it easier to add new tools or swap out models. It also levels the playing field – small open-source projects can publish MCP servers that become as easily usable as those from big vendors, fostering a community ecosystem of tools.
Neutral Governance: With AAIF, companies like AWS, Google, Microsoft, Anthropic, and OpenAI are all at the same table (indeed all are listed as platinum members). This reduces the risk that MCP splinters into competing versions; it's likely to become analogous to HTML or SQL – a baseline standard that everyone implements, with maybe some extensions.

It's worth noting that MCP is evolving to cover more than just "traditional API calls." Recent extensions include Agent-to-Agent messaging (so an agent can expose itself as a tool to others via MCP) and binary data support (for image and file transfer). The AGENTS.md standard, also under AAIF, complements MCP by providing a way for software projects to declare to agents how to interact with them. AGENTS.md is essentially a README for AI agents, placed in a code repo to describe the project, its build/test tools, key contexts, and constraints. Over 60k open-source repos have adopted AGENTS.md to guide coding agents. By standardizing this, when an agent (like GitHub Copilot or Cursor) is working on a new codebase, it can automatically read AGENTS.md to understand the project's specific commands (e.g. how to run tests) rather than relying on general knowledge. This reduces errors and makes code-writing agents more reliable across different environments.

MCP Tool Ecosystem. Many companies and open-source teams have published MCP servers for their systems. For instance, GitHub released an official GitHub MCP Server that exposes GitHub operations (issues, PRs, repo contents, etc.) via MCP. This allows an agent to perform GitHub actions (like creating an issue or commenting on a PR) in a safe way – the server enforces GitHub's API policies and scopes. Similarly, we have MCP servers for databases (SQL tools), cloud resources (AWS, Azure MCP servers), information lookups (Wikipedia, web search), and even OS-level tasks (there are MCP servers that wrap shell commands or Docker). A typical enterprise might run a suite of internal MCP servers: one for their ticketing system, one for their customer database, one for DevOps (Kubernetes control like the mcp-server-kubernetes we saw). By doing so, they create a catalog of approved tools that their AI agents can use. Some companies are building MCP Gateways or registries to manage this catalog, which we'll discuss in the security section.

Local-First and Offline Agents. While MCP often assumes a client (agent) connecting to a server over HTTP, it's flexible enough to work in "all local" scenarios too (using stdio pipes). The Goose framework (contributed by Block to AAIF) is described as a "local-first AI agent framework". Goose uses MCP for tool extensions – meaning you can run goose agents on your laptop, and they can spin up local MCP servers for local tools (say, accessing a local filesystem or application) without needing cloud connectivity. This is important for cases where data privacy requires everything to remain on-prem or on-device. It also means an enterprise could package up an agent + tool suite to run entirely in an isolated network (e.g. an AI agent that helps with internal network diagnostics, running in a secure enclave with no internet access, but with MCP hooking into internal systems). The push toward standardization via MCP doesn't imply centralization in the cloud – on the contrary, it can democratize who provides tools (open-source implementations, self-hosted services, etc.) as long as they speak the protocol.

Beyond MCP: Other Standards. While MCP is currently the frontrunner, there are other noteworthy efforts. OpenAPI-based tool use: some agent frameworks allow importing any OpenAPI spec and will auto-generate an "agent tool" from it. For example, Microsoft's Agent Framework highlights that any REST API with an OpenAPI definition can be instantly turned into a tool, with the framework handling schema parsing and secure invocation. This is complementary to MCP: one could imagine MCP servers automatically exposing an OpenAPI, or vice versa. Another is the concept of capability description languages – OpenAI's Function Calling spec is one example, where the model is told function signatures and it outputs JSON for calls. Some researchers propose more formal schemas for tool affordances. At the moment, however, MCP seems to be converging those threads: it provides a structured way for an agent to query "what can I do?" and then invoke a function with arguments, which is essentially function calling over a channel. It's likely we'll see alignment or bridging between OpenAPI, JSON-RPC, and whatever else emerges, to avoid fragmenting this again.
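
As a rough illustration of the OpenAPI route, the sketch below walks a spec's `paths` and emits one function-calling style descriptor per operation. It is not how any particular framework does it: `openapi_to_tools` and `EXAMPLE_SPEC` are hypothetical, and request bodies, `$ref` resolution, and authentication are ignored.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch"}


def openapi_to_tools(spec):
    """Turn each OpenAPI operation into a function-calling style tool descriptor."""
    tools = []
    for path, operations in spec.get("paths", {}).items():
        for http_method, operation in operations.items():
            if http_method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            properties, required = {}, []
            for param in operation.get("parameters", []):
                properties[param["name"]] = param.get("schema", {"type": "string"})
                if param.get("required", False):
                    required.append(param["name"])
            tools.append({
                "name": operation.get("operationId", http_method + " " + path),
                "description": operation.get("summary", ""),
                "http": {"method": http_method.upper(), "path": path},
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                },
            })
    return tools


# Hypothetical spec fragment used only for illustration.
EXAMPLE_SPEC = {
    "openapi": "3.0.0",
    "paths": {
        "/tickets/{id}": {
            "get": {
                "operationId": "getTicket",
                "summary": "Fetch one ticket by id.",
                "parameters": [
                    {"name": "id", "in": "path", "required": True,
                     "schema": {"type": "integer"}}
                ],
            }
        }
    },
}
```

The resulting `parameters` object has the same shape as a tool's input schema, which is why bridging between OpenAPI and a tool protocol is mostly a mechanical translation.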

In essence, if sandboxing addresses the agent's "body," MCP addresses the agent's "arms and legs". It standardizes how the agent reaches out to interact with the world. This was a necessary step for agents to become truly useful in enterprise settings, because no single vendor can supply every integration. By lowering the integration barrier, companies can leverage a far broader set of tools. However, as we'll discuss next, giving an AI agent access to many tools also broadens the attack surface and governance burden – thus, standardization and security have to go hand in hand.

4. Security, Governance, and Trust in Agent Systems

Deploying autonomous agents in an enterprise inherently raises the question: how do we trust them? Unlike a deterministic script, an AI agent can come up with unexpected actions, and it might be influenced by inputs (or adversaries) in ways we can't fully predict. Over the past months, a significant focus of both practitioners and researchers has been on closing the "trust gap" – ensuring that agents do what they're supposed to and nothing more, or at least that we can detect and mitigate when they misbehave. Several key themes have emerged: permission and policy models, supply chain security of tools, prompt injection defenses, auditing and observability, and fail-safe mechanisms. We'll examine each in turn.

4.1 Prompt Injection and Confused Deputy Problems

Prompt injection – where an external input is crafted to manipulate the agent's LLM into ignoring its instructions or performing unintended actions – has proven to be a very real threat. In the context of agent tools, prompt injection can become a "confused deputy" attack: the LLM is the deputy that has privileges (access to tools) and the attacker exploits it via crafted input (a prompt) to misuse those privileges. A simple example: an attacker might embed a malicious command in a user-provided email, which the agent then dutifully executes with its shell tool. Real incidents and proofs-of-concept have shown this is not just theoretical. The consensus in discussions (e.g. on Hacker News) is that prompt injection is analogous to XSS (cross-site scripting) in web apps – you cannot fully eliminate it just by sanitizing inputs, because the model's behavior with arbitrary text is hard to constrain. Thus, relying solely on prompt-based safeguards (like "don't execute if user says to do something bad") is brittle.

The more robust approach is structural: limit what the agent can do even if it's tricked. This means enforcing policy at the tool invocation layer. For instance, if the agent tries to run a shell command, have a policy that disallows rm -rf or network calls to sensitive endpoints. If it uses a database tool, ensure it cannot query tables it shouldn't. This is where sandboxing and permission models overlap. In a sandbox, you can intercept system calls – e.g. prevent file writes outside a certain directory, or limit network access to only whitelisted domains. With MCP, you can implement an allow-deny policy per tool – e.g. forbid a certain combination of API calls or detect if the arguments look suspicious (like a SQL query that's dumping all user data).
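
A check at the tool invocation layer might look like the following sketch for a shell tool. The allowlist, denylist, and `check_shell_command` are hypothetical, and the rules are deliberately conservative and incomplete (for example, long-form flags such as `--recursive` are not caught), so treat it as an illustration of where the check sits rather than a usable filter.

```python
import shlex
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example"}   # hypothetical allowlist
DENIED_PROGRAMS = {"sudo", "mkfs", "dd"}   # hypothetical denylist
NETWORK_PROGRAMS = {"curl", "wget"}
SHELL_METACHARACTERS = set("|;&`$<>")


def check_shell_command(command):
    """Return (allowed, reason) for a single shell command string."""
    if SHELL_METACHARACTERS & set(command):
        return False, "shell metacharacters are not permitted"
    try:
        argv = shlex.split(command)
    except ValueError:
        return False, "command could not be parsed"
    if not argv:
        return False, "empty command"
    program, args = argv[0], argv[1:]
    if program in DENIED_PROGRAMS:
        return False, "program is denied: " + program
    if program == "rm":
        # Collect short flags such as -rf or -r -f into one string.
        short_flags = "".join(
            a[1:] for a in args if a.startswith("-") and not a.startswith("--")
        )
        if "r" in short_flags.lower() and "f" in short_flags:
            return False, "recursive forced delete is denied"
    if program in NETWORK_PROGRAMS:
        for arg in args:
            if "://" in arg and urlparse(arg).hostname not in ALLOWED_HOSTS:
                return False, "host not on allowlist"
    return True, "ok"
```

String-level denylists like this are easy to bypass, which is why they are best paired with OS-level enforcement (file system and network restrictions in the sandbox) rather than used alone.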

One concrete advancement is AgentBound, a research framework that proposes attaching a declarative access control policy to MCP servers. Inspired by Android's app permissions, AgentBound allows a tool to declare what host resources it needs (files, network targets, etc.), and an admin can approve or limit those. At runtime, an enforcement engine monitors the agent's calls and blocks anything outside the allowed scope. In its evaluation, AgentBound auto-generated policies from source code for 296 popular MCP servers with about 80.9% accuracy, and could block the majority of malicious actions with negligible overhead. This suggests that intelligent tooling can help manage the policy burden: we can analyze a tool's code to infer "this tool should only ever need to access X API or Y file", then use that as a sandbox rule.
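
The declare-then-enforce pattern can be sketched as a small manifest plus a default-deny check. This is not AgentBound's actual policy format or engine; `Manifest`, `is_allowed`, and `DOCS_TOOL` are hypothetical, and the path check normalizes `..` segments but does not resolve symlinks.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Resources a tool declares it needs (hypothetical format)."""
    read_paths: tuple = ()
    write_paths: tuple = ()
    hosts: tuple = ()


def _within(path, roots):
    """True if the normalized path equals or sits beneath one of the roots."""
    normalized = posixpath.normpath(path)
    for root in roots:
        root = root.rstrip("/")
        if normalized == root or normalized.startswith(root + "/"):
            return True
    return False


def is_allowed(manifest, action, target):
    """Check one runtime access against the approved manifest; default is deny."""
    if action == "read":
        return _within(target, manifest.read_paths)
    if action == "write":
        return _within(target, manifest.write_paths)
    if action == "connect":
        return target in manifest.hosts
    return False


# A documentation tool that may read one directory and reach one host.
DOCS_TOOL = Manifest(read_paths=("/srv/docs",), hosts=("docs.example.com",))
```

Anything the manifest does not mention is refused, so an injected instruction that steers the tool toward another path or host fails at the enforcement point regardless of what the model was persuaded to attempt.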

Another line of defense is schema validation. Many tools expect inputs of a certain form (JSON with specific fields, numbers in ranges, etc.). If the agent's output deviates, it can indicate either a prompt injection or a model error. Rigorously validating the agent's action format before executing it can catch some attacks or mistakes. In fact, OWASP's recommendation of command validation falls here – e.g. if an agent tries to execute sudo rm -rf /, the sandbox or tool wrapper should detect that and refuse.
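
A minimal validator for tool arguments, written without third-party libraries, could look like this. The rule format, `SEARCH_SCHEMA`, and `validate_args` are hypothetical; a production system would more likely validate against the tool's published JSON Schema.

```python
# Hypothetical rule set for a search tool: field name -> constraints.
SEARCH_SCHEMA = {
    "query": {"type": str, "required": True},
    "limit": {"type": int, "required": True, "min": 1, "max": 100},
}


def validate_args(args, schema):
    """Return a list of problems; an empty list means the call may proceed."""
    errors = []
    for name in args:
        if name not in schema:
            errors.append("unexpected field: " + name)
    for name, rule in schema.items():
        if name not in args:
            if rule.get("required", False):
                errors.append("missing field: " + name)
            continue
        value = args[name]
        # bool is a subclass of int in Python; this sketch has no boolean
        # fields, so booleans are always rejected.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append("wrong type for field: " + name)
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(name + " below minimum " + str(rule["min"]))
        if "max" in rule and value > rule["max"]:
            errors.append(name + " above maximum " + str(rule["max"]))
    return errors
```

Rejecting unexpected fields matters as much as checking the expected ones, since an injected instruction often shows up as an extra argument the tool never asked for.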

It's widely acknowledged that prompt injection cannot be fully solved at the model level, so enterprise systems are layering these runtime controls. Some are even exploring two-model setups: one model generates a plan or interprets user input without any tools (and thus with no privileges), then a separate "execution model" with tools enabled but a much more constrained input (only the sanitized plan). This is analogous to separating policy decision and policy enforcement. However, this approach is in its infancy – researchers have noted it's tricky to ensure the two models stay in sync and that the first model doesn't inadvertently become a covert channel for bad instructions.

4.2 Tool Supply Chain Security

As the MCP tool ecosystem grows, a new class of security concerns appears: the tools themselves may have vulnerabilities or could be malicious. We've effectively extended our "attack surface" to any code that implements a tool API. In July 2025, security researchers disclosed critical flaws in some community-developed MCP servers:

The MCP Server for Kubernetes (an MCP tool that allowed agents to run kubectl commands on a cluster) had a command injection flaw. It constructed shell commands from user input without sanitization, so an attacker could embed | or && to execute arbitrary commands on the host. Not only that, the advisory demonstrated a prompt injection chain: if an agent was asked to read a pod's logs (which contained malicious instructions), the agent might then call a vulnerable kubectl tool with those instructions, leading to RCE (Remote Code Execution) on the MCP server host. This is a vivid example of how an innocuous high-level task (read logs) can cascade into a full compromise via weaknesses in the tool implementation. It underscores that agent security is only as strong as the weakest tool in its arsenal.
Another advisory, for mcp-package-docs (a tool for reading package documentation), described a similar shell injection issue. Many early tools naively passed unsanitized strings to exec(), a practice long known to be dangerous in any software context.
The AI coding assistant Cursor found an even more subtle exploit: an agent could be tricked into writing a malicious MCP server configuration to disk (effectively "installing" a new tool) which would then be loaded and executed, giving the attacker code execution on the system. In response, Cursor had to forbid agents from writing to certain config directories.
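
The command injection class above is usually avoided by validating inputs and passing an argument list instead of a shell string. The sketch below shows that pattern for a log-reading tool; `build_logs_argv` and `read_logs` are hypothetical helpers, not code from the affected projects, and the name pattern is a simplified approximation of Kubernetes naming rules.

```python
import re
import subprocess

# Simplified approximation of Kubernetes resource names (an assumption).
_K8S_NAME = re.compile(r"[a-z0-9]([-a-z0-9.]*[a-z0-9])?")


def build_logs_argv(pod, namespace="default"):
    """Validate the inputs and return an argv list; no shell string is built."""
    for label, value in (("pod", pod), ("namespace", namespace)):
        if len(value) > 253 or not _K8S_NAME.fullmatch(value):
            raise ValueError("invalid " + label + " name: " + repr(value))
    return ["kubectl", "logs", "--namespace", namespace, pod]


def read_logs(pod, namespace="default"):
    """Run kubectl without a shell, so '|' or '&&' in input is never interpreted."""
    completed = subprocess.run(
        build_logs_argv(pod, namespace),
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return completed.stdout
```

With an argument list and no shell, a value such as `web-1; rm -rf /` is rejected by validation, and even if validation were skipped it would reach `kubectl` as one literal argument rather than as a second command.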

These incidents highlight supply chain risk: when you install an MCP server from npm or PyPI, do you know it's safe? Could it have a dependency hijacked to steal data? Traditional supply chain best practices – code signing, vetting maintainers, vulnerability scanning – all apply here. But additionally, the dynamic nature of agent tool use requires new thinking. For example, an agent might fetch a tool definition (schema) from somewhere at runtime, and that channel could be compromised (a malicious tool listing that lies about what it does). To address this, the community is discussing tool registries with verification. Imagine an "App Store" for MCP tools where each tool is reviewed, sandboxed, and cryptographically signed. The Linux Foundation AAIF might play a role in hosting a global registry, or there may be vendor-specific ones.

Some researchers call for transparency logs and an "SBOM" (Software Bill of Materials) approach for agent tools. For instance, an enterprise might want a log of every tool version the agent ever used, so if one is later found malicious they can audit past agent runs. They also want assurance that the tool code running is exactly the code that was audited. This is akin to how modern browsers handle extensions: with strict signing and review processes.
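
The "running code is exactly the audited code" requirement reduces, at its simplest, to comparing an artifact's digest with a pinned value before loading it. The sketch below assumes a hypothetical lockfile-style mapping (`PINNED`) and tool name; it checks integrity only and says nothing about who signed the artifact.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical lockfile: (tool name, version) -> digest recorded at audit time.
AUDITED_BYTES = b"audited tool artifact"
PINNED = {("docs-tool", "1.2.0"): sha256_hex(AUDITED_BYTES)}


def verify_artifact(name, version, artifact, pinned):
    """Allow a tool to load only if its digest matches the pinned one."""
    expected = pinned.get((name, version))
    if expected is None:
        return False  # unknown tool or version: refuse by default
    return hmac.compare_digest(sha256_hex(artifact), expected)
```

Recording the `(name, version, digest)` triple for each agent run also yields the audit trail described above: if a version is later found malicious, past runs that loaded it can be identified.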

On the defense side, one idea is dynamic tool vetting – before an agent uses a new tool, run that tool in a test mode on known benign inputs to see if it behaves correctly, or run it in a shadow sandbox with instrumented monitoring to detect unexpected actions. This is analogous to how app stores do a review, but potentially automated and at runtime. For now, this is an open research problem; we haven't seen full implementations yet, but it's identified in literature as a needed control.

In summary, securing the tool ecosystem requires both preventive measures (secure coding practices for tool developers, automated scans for dangerous patterns like execSync on inputs) and mitigations (running tools with least privilege, e.g. a tool that only needs to read a database should not also have OS write access). The principle of least privilege should apply at every level: the agent only has access to certain tools, the tool only has access to certain system resources. Achieving this in practice means plumbing through the user's identity and intent: e.g., if an agent is acting on behalf of Alice, the database tool should run under Alice's credentials or a role with her permissions, not a superuser. This is an area where enterprise IAM (Identity and Access Management) integration is critical – mapping the human user's identity to the agent's allowed actions. Recent work is exploring how to tie enterprise SSO/OAuth tokens into agent sessions in a fine-grained way, so that an agent cannot escalate its privileges beyond what the user would normally have through regular apps.

4.3 Monitoring, Auditing, and Policy Enforcement

Observability is notoriously difficult for AI systems because of their nondeterminism and unstructured outputs. But for agents, observability is non-negotiable in enterprise settings. Operators need to be able to ask: "What sequence of steps did the agent take? Why did it take a certain action? What tool calls were made with what parameters? Did anything unusual happen?" To that end, agent platforms are incorporating extensive logging and tracing capabilities:

Structured Traces: There's a push to use standards like OpenTelemetry to trace agent execution like any microservice call graph. Each agent action (e.g. "called Tool X with params Y, got result Z") can be a span in a trace. This allows using existing APM (Application Performance Monitoring) tools to visualize agent workflows. Some commercial platforms now show a real-time step-by-step trace of the agent's reasoning and tool use (often known as an "Agent console" or debug pane).
Semantic Logging: Beyond raw tool call logs, there's interest in capturing higher-level events. For example, flag if an agent's plan changed drastically mid-execution (could indicate it got confused or was manipulated), or if it requested an unusually large amount of data from a tool. Logging the content of prompts and responses is tricky (for privacy reasons), but logging the intents and outcomes is feasible. Additionally, cryptographic logging (hash chaining the logs) has been suggested so that forensic analysis can trust that logs weren't tampered with.
Auditing for Compliance: In sectors like finance or healthcare, any automated system needs audit trails for compliance. If an agent made a change to a customer's record, we need to know who/what prompted that and that it was authorized. Solutions here include linking agent actions to a user session and storing that context (e.g. "Agent acted on behalf of Alice, in response to request R, at time T"). Some enterprises restrict certain tools to manual-confirmation mode where a human must approve the agent's action in a dashboard (common for things like executing a trade or sending an email). Ensuring the agent properly presents the action for approval (and doesn't hide the true intent) is an active UX/security challenge.
Policy Engines: Enterprises are beginning to employ policy-as-code systems (like Open Policy Agent or custom rule engines) to govern agent behavior. For example, a policy might be: "Agents cannot run a query against the production database tool without a LIMIT clause, unless the user is in admin role." When an agent attempts such a query, the policy engine can intercept and either block it or route it for approval. This ties into MCP Gateway architectures, where instead of the agent connecting directly to tool servers, it connects to a Gateway proxy that mediates all calls. Microsoft's preview of an MCP Gateway shows features like session persistence (to keep agent-tool sessions sticky) and a central place to enforce auth, rate limiting, etc. We can foresee these gateways becoming very sophisticated, implementing org-wide guardrails (e.g. no agent can call external web APIs that are not in a vetted list, to prevent data exfiltration).
Evaluation and Testing: An emerging practice is to treat agents like code and develop evaluation suites for them. Before deploying an agent update (new model version or new tool), run a battery of scenarios (some normal, some adversarial) to see how it behaves. In late 2025, multiple benchmarks for agent safety were released to facilitate this. The MCP-SafetyBench is one such benchmark: it tests LLM agents on realistic multi-step tasks across five domains (web browsing, financial analysis, code repo management, navigation, and web search) while injecting 20 types of attacks (from prompt tampering to tool output manipulation). The sobering result: no current model is remotely immune to MCP-based attacks – even top-tier models had 30–48% of tasks compromised. They also found a negative correlation between task performance and security: models that are more capable at completing tasks also tend to be more exploitable, presumably because they more eagerly follow any instruction including malicious ones. This points to a fundamental safety-utility trade-off. Enterprises must calibrate how "aggressive" or autonomous they want the agent to be. Some are introducing adjustable risk settings – e.g. a slider from conservative (fewer tools, more confirmations) to aggressive (full autonomy, high risk). A metric called NRP (Normalized Risk-Performance) was proposed to quantify this balance. Ultimately, continuous evaluation will be key: as new attacks are discovered, adding them to test suites and ensuring the agent (with all its tools and policies) can handle or resist them.
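
The hash-chained logging mentioned under Semantic Logging can be sketched in a few lines: each entry stores the hash of the previous one, so editing or dropping an earlier entry breaks verification. The entry layout and function names here are hypothetical, and the chain only detects tampering if the latest hash is also anchored somewhere the attacker cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """Hash the previous hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the hash of the entry before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify_chain(log):
    """Recompute every link; any edited or removed entry makes this False."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Each event can carry the fields the audit bullet asks for, such as the acting user, the tool name, and the arguments, so one structure serves both forensics and compliance.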

4.4 Identity, Authentication, and Governance

A less glamorous but absolutely crucial aspect is identity and access management (IAM) for agents. When an agent performs an action, whose authority is it under? In a multi-user environment (say an AI assistant in a company), the agent might have to act as different users at different times. Traditional OAuth wasn't designed for a scenario where an LLM is effectively a headless client acting interactively on behalf of a user. Over the past months, developers have hit practical snags integrating OAuth with MCP. For example, the OAuth Dynamic Client Registration used by MCP (so an agent can automatically register itself to use an API) sometimes fails with enterprise IdPs due to strict URL checks. Some IdPs don't allow dynamic clients at all. There are calls to allow static client credentials or out-of-band provisioning for agents in such cases. This is more of a standards gap than a research one – it's being worked through in the MCP working group.

From an enterprise architecture view, many want the agent to integrate with existing SSO. That means when an employee invokes an agent, the agent should use that employee's OAuth token to access tools. This ensures all actions are attributable and within the user's permissions. It's straightforward for some tools (an MCP server, for example, can simply require a token from the agent), but complex for others (e.g. a shell tool on a server: how do you scope that per user?). Some solutions involve impersonation tokens or scoped API keys: e.g. the agent might have a key that only allows certain operations and is tagged to the user.

The concept of "least privilege" comes into sharp focus here: the agent should only have the minimum access needed for the task, and ideally only for the duration needed. Techniques like OAuth token exchange or short-lived credentials are recommended. If an agent is spun up to do a build job, give it a temporary token that expires after, so even if it went rogue, it couldn't do damage later. One recent architecture paper emphasizes integrating enterprise identity with these agents so that all actions flow through the normal IAM checks and logs of the enterprise. That means, for instance, an agent using a Jira tool would appear in the Jira audit logs as "actions performed via AI agent on behalf of Bob". This transparency is needed for trust – people won't use the agent if it's a black box doing things in the shadows.
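
The short-lived, scoped credential idea can be illustrated with a toy HMAC-signed token. This is not OAuth token exchange or any standard token format; `issue_token`, `check_token`, and the claim names are hypothetical, and a real deployment would use the identity provider's own tokens.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(secret, subject, scope, ttl_seconds, now=None):
    """Mint a signed token bound to one user, one scope, and an expiry time."""
    now = time.time() if now is None else now
    claims = {"sub": subject, "scope": scope, "exp": now + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body.decode("ascii") + "." + signature


def check_token(secret, token, required_scope, now=None):
    """Accept only an untampered, unexpired token carrying the required scope."""
    now = time.time() if now is None else now
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(secret, body.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims["scope"] == required_scope and now < claims["exp"]
```

A build agent given a 60-second `build:run` token cannot use it for a deployment, and cannot use it at all after it expires, which bounds the damage if the agent is later manipulated.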

Governance also extends to deciding which tasks to automate vs require human approval, what data agents are allowed to see, and how to prevent data leakage. Some enterprises restrict agents from accessing production data entirely, using them only on sanitized or test datasets until trust is built. Others put heavy monitoring on outputs (e.g. scanning everything the agent is about to output to a user for sensitive data). These are areas where data loss prevention (DLP) tools intersect with AI. A future vision is that an enterprise agent platform will integrate DLP classifiers that flag if an agent's response likely contains company confidential info, and either redact it or alert a human.

Finally, we must mention user trust and adoption: beyond technical measures, building trust in agents involves user education and incremental rollout. Many organizations start with "read-only" agents (they can suggest actions but not execute them) and then gradually allow more autonomy as confidence grows. By having robust logs and a clear override path, users are more likely to accept the agent's help. Trust is also enhanced by making the agent's reasoning visible (hence the popularity of chain-of-thought traces displayed to users) and by giving users easy ways to correct or stop the agent. In essence, transparency and control are the antidotes to the unpredictability of AI.

The advancements in the last half-year – from sandbox isolation to protocol standardization and new benchmarks – all aim to shrink the trust gap. Yet, open challenges remain (discussed in the next section) before one can confidently say an autonomous agent is as well-understood and controlled as a traditional software microservice.

5. Open Challenges and Future Directions

Despite rapid progress, enterprise agent systems still have unsolved research questions and practical gaps. We conclude by highlighting some of the most pressing ones, as identified by recent discussions and publications, which represent opportunities for future work:

Unified Cross-Layer Security Model: Today we have pieces – OAuth for identity, MCP scopes for tool access, sandbox for OS isolation – but they don't always speak the same language. There is no single policy that says, for example, "User X's agent can read from database Y but not write, and can run code but only use 2 CPU and no internet, and these conditions are cryptographically verified." A comprehensive model that ties user identity, agent capabilities, tool permissions, and sandbox OS permissions into one coherent framework is needed. Early proposals like AgentBound (inspired by mobile app permissions) are a start. In the future, we might see capability tokens that encode all these at once – the agent carries a token which the sandbox and tools all check, limiting what it can do in each context. Formal verification of such models (to prove an agent cannot do X) would greatly enhance trust.
Rollback of External Side Effects: As noted, while we can roll back filesystem changes in a sandbox, we cannot yet roll back an email sent or a transaction made. Developing agent transaction protocols or sagas is an open challenge. One idea is to require critical tools to provide a compensation function – e.g. an MCP server for cloud VMs could have an "undo" for creating a VM (which would delete it). An agent planner could then use these to revert a series of actions if needed. This also ties into training the LLM or using a secondary verifier to decide when to roll back (e.g. if it notices an outcome diverges from expected state). Without solving this, enterprises will be hesitant to let agents perform irreversible operations autonomously.
Advanced Threat Defenses: The taxonomy of potential attacks (context injection, tool poisoning, cross-tool data leaks, etc.) is growing. Defenses like context signing (cryptographically signing tool outputs or important prompts to prevent tampering) have been suggested but not widely implemented. The idea there is: an agent would only trust tool outputs that come with a signature or hash, so an attacker who intercepts or modifies the content (like a man-in-the-middle on an HTTP tool) would fail. Similarly, isolating tools from each other (so one tool can't directly influence another except through the agent's vetted reasoning) is a challenge – currently the agent's memory is the meeting point of all tool data, making it a melting pot where a malicious output in one tool can affect decisions involving another.
Benchmarking and Standards for Evaluation: The community has started benchmarks like MCP-SafetyBench and MSB, but we need continuous evaluation pipelines. Perhaps an open leaderboard where agent developers can submit their agent (with a certain set of tools and policies) to be evaluated against a suite of scenarios, similar to how language models are benchmarked on GLUE or SuperGLUE for NLP. This could drive competition and improvement in safety. Also, evaluation should include cost and latency metrics – an agent that is safe but takes hours or $$$ to complete a task isn't practical. Balancing efficiency with safety will likely lead to innovations like adaptive risk modes (the agent switches to a more cautious approach if it senses something sensitive, trading speed for safety dynamically).
Human-Agent Interaction Paradigms: AgentBay's approach to HITL is one example of making agents more usable in the real world. There is still work to do on when and how an agent should ask for help. If it asks too often, it's not useful; if it asks too rarely, it might make an irrecoverable error. Finding that sweet spot (perhaps through reinforcement learning or feedback from users) is an ongoing area. Also, UI/UX research into how to present agent decisions to users in a clear way will be important (so users can confidently approve or deny actions). In enterprises, this might mean integrating agent controls into existing interfaces – e.g. showing an "AI agent suggestion" in a Jira ticket with a one-click approve.
Cross-Organization Collaboration and Data Sharing: Enterprise agents often need to work across silos; for example, an agent might coordinate between a supplier's system and the company's internal system. This raises questions of federated trust: how do you let an agent use two domains' tools in a secure way? Answering that means standardizing how agents convey identity across organizational boundaries and how audit logs are shared. The AAIF being hosted under the Linux Foundation hints at future inter-company standards to address this, since agents won't stop at the corporate firewall.
Ethical and Compliance Considerations: Beyond security, enterprises must ensure agents comply with regulations and ethical norms. For example, if an agent interacts with personal data, privacy laws apply. How do we audit that an agent didn't retain or leak personal data beyond allowed purposes? Techniques like data tagging and tracking could be employed – marking certain outputs as containing sensitive info and preventing them from being used in contexts that aren't allowed. Ensuring AI explanations for decisions (especially if used in regulated domains) is another angle – if an agent makes a decision that affects a customer, one might need a rationale logged for compliance, which is tricky given the opaque reasoning of LLMs.
Improving Model Robustness: Finally, at the heart is the LLM itself. There's ongoing research into fine-tuning models to be more resistant to manipulation (advantageous to safety but often at odds with capability). Techniques like constitutional AI or adversarial training on tool-use scenarios might yield models that inherently refuse certain dangerous actions or at least flag uncertainty. Also, specialized models for parsing and validating the agent's outputs (e.g. a secondary model that checks if a proposed action seems safe/rational) could be integrated. OpenAI and others are exploring "moderator" models that look at the main model's outputs. In agents, a "policy model" might examine the plan and tool uses and raise red flags for anything that violates training-time learned safe patterns.

Outlook: The next year will likely bring a maturation of the agent ecosystem akin to what 2010-2015 saw for cloud microservices – an explosion of tools and best practices to handle deployment, security, monitoring, and standardization. The formation of AAIF is a strong indicator that industry players see collaboration as the way forward; no one wants a fragmented, Wild West environment when so much is at stake (both in terms of safety and potential business value). We will probably see AgentOps teams emerge in organizations, analogous to MLOps, focused on managing and supervising fleets of agents. They'll use dashboards (like GitHub's Agent HQ mission control) to oversee agent activities across the enterprise. And just as DevOps developed guardrails and CI/CD for code, AgentOps will develop guardrails and continuous evaluation for autonomous AI behaviors.

In conclusion, enterprise agent systems are transitioning from the lab to the real world, carrying with them both excitement (unprecedented automation capabilities) and caution (novel failure modes). Sandbox architectures and protocols like MCP have laid a foundation that makes these systems more modular, controllable, and interoperable than before. Yet, achieving a level of trust comparable to traditional software will require continued innovation in permission modeling, verification, and human oversight integration. The last half-year's progress has been remarkable – what was mostly sci-fi a year ago (multiple AIs collaborating on complex tasks with minimal human input) is now demonstrably feasible. The coming months will likely see pilots turn into production deployments in enterprises, each teaching new lessons. By actively sharing these lessons and converging on open standards and benchmarks, the community can accelerate the safe adoption of agentic AI. The end goal is an ecosystem where AI agents become reliable teammates – tirelessly automating drudgery and navigating complexity – while humans retain ultimate control and understanding of their behavior. The path to get there is challenging, but as this survey shows, the groundwork is rapidly being put in place.

References

Open-Source Agent Sandbox Enables Secure Deployment of AI Agents on Kubernetes - InfoQ News on Agent Sandbox, gVisor/Kata isolation, CRD for stateful agents, OWASP Top 10 for AI Agents
Google launches Agent Sandbox for secure AI agents on Kubernetes - TechInformed on gVisor isolation, pre-warmed pools (90% faster startups), Pod Snapshots
Linux Foundation Announces Formation of Agentic AI Foundation (AAIF) - MCP, Goose, AGENTS.md contributions; cross-industry support
MCP: The Universal Connector for Building Smarter, Modular AI Agents - InfoQ on MCP benefits (M×N to M+N integration, interoperability)
Introducing Microsoft Agent Framework - Microsoft Foundry Blog on open standards (MCP, A2A, OpenAPI) and enterprise readiness
What's new in Microsoft Foundry (Oct/Nov 2025) - Microsoft Agent Framework updates
GitHub launches Agent HQ for AI-powered coding - InfoWorld on managing multiple coding agents with governance, audit, and mission control
CVE-2025-53355: mcp-server-kubernetes command injection vulnerability - GitHub Advisory on unsanitized execSync and prompt-injection exploit via pod logs
Securing AI Agent Execution (arXiv:2510.21236) - Bühler et al. 2025: AgentBound permission framework for MCP tools, auto-policy generation (~80% accuracy)
AgentBay: A Hybrid Interaction Sandbox (arXiv:2512.04367) - Piao et al. 2025: unified sandbox with AI API control + live human takeover (48% higher task success with HITL)
MCP-SafetyBench (OpenReview) - Lan et al. 2025: real MCP server benchmark, 30–48% attack success on tested LLMs
MCP Attacks: Threats, Taxonomy, and Defenses - Adnan Masood on threat taxonomy for tool-using LLMs

eBPF Tutorial: Extending Kernel Subsystems with BPF struct_ops

云微 — Tue, 27 Jan 2026 07:20:44 +0000

Have you ever wanted to extend kernel behavior—like adding a custom scheduler, network protocol, or security policy—but were put off by the complexity of writing and maintaining a full kernel module? What if you could define the logic directly in eBPF, with dynamic updates, safe execution, and programmable control, all without recompiling the kernel or risking system stability?

This is the power of BPF struct_ops. This advanced eBPF feature allows BPF programs to implement the callbacks of a kernel operations structure, effectively letting you "plug in" custom logic to extend kernel subsystems. It goes beyond simple tracing or filtering—you can now implement core kernel operations in BPF. For example, we use it to implement GPU scheduling and memory offloading extensions in GPU drivers (see LPC 2024 talk and gpu_ext project).

In this tutorial, we will explore how to use struct_ops to dynamically extend kernel subsystem behavior. We won't be using the common TCP congestion control example. Instead, we'll take a more fundamental approach that mirrors the extensibility seen with kfuncs. We will create a custom kernel module that defines a new, simple subsystem with a set of operations. This module will act as a placeholder, creating new attachment points for our BPF programs. Then, we will write a BPF program to implement the logic for these operations. This demonstrates a powerful pattern: using a minimal kernel module to expose a struct_ops interface, and then using BPF to provide the full, complex implementation.

The complete source code for this tutorial can be found here: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/struct_ops

Introduction to BPF struct_ops: Programmable Kernel Subsystems

The Challenge: Extending Kernel Behavior Safely and Dynamically

Traditionally, adding new functionality to the Linux kernel, such as a new file system, a network protocol, or a scheduler algorithm, requires writing a kernel module. While powerful, kernel modules come with significant challenges:

Complexity: Kernel development has a steep learning curve and requires a deep understanding of kernel internals.
Safety: A bug in a kernel module can easily crash the entire system. There are no sandboxing guarantees.
Maintenance: Kernel modules must be maintained and recompiled for different kernel versions, creating a tight coupling with the kernel's internal APIs.

eBPF has traditionally addressed these issues for tracing, networking, and security by providing a safe, sandboxed environment. However, most eBPF programs are attached to existing hooks (like tracepoints, kprobes, or XDP) and react to events. They don't typically implement the core logic of a kernel subsystem.

The Solution: Implementing Kernel Operations with BPF

BPF struct_ops bridges this gap. It allows a BPF program to implement the functions within a struct_ops—a common pattern in the kernel where a structure holds function pointers for a set of operations. Instead of these pointers pointing to functions compiled into the kernel or a module, they can point to BPF programs.

This is a paradigm shift. It's no longer just about observing or filtering; it's about implementing. Imagine a kernel subsystem that defines a set of operations like open, read, write. With struct_ops, you can write BPF programs that serve as the implementation for these very functions.
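The function-pointer pattern itself can be tried out in plain userspace C. The sketch below is only an illustration of the idea, not kernel code: `demo_ops`, `install_ops`, and the `builtin_*`/`custom_*` functions are hypothetical names, with the "custom" table standing in for what BPF programs would provide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, mirroring the open/read example above. */
struct demo_ops {
    int (*open)(void);
    int (*read)(int nbytes);
};

/* "Built-in" implementations, standing in for code compiled into the kernel. */
static int builtin_open(void) { return 0; }
static int builtin_read(int nbytes) { return nbytes; }

/* Replacement implementations, standing in for BPF programs. */
static int custom_open(void) { return 1; }
static int custom_read(int nbytes) { return nbytes / 2; }

const struct demo_ops builtin_impl = { .open = builtin_open, .read = builtin_read };
const struct demo_ops custom_impl = { .open = custom_open, .read = custom_read };

/* The subsystem only ever calls through the currently installed table. */
static const struct demo_ops *active_ops = &builtin_impl;

/* Swap the implementation at runtime without touching any caller. */
void install_ops(const struct demo_ops *ops)
{
    assert(ops != NULL);
    active_ops = ops;
}

int subsystem_open(void)
{
    return active_ops->open();
}

int subsystem_read(int nbytes)
{
    return active_ops->read(nbytes);
}
```

Callers of `subsystem_read` never change; only the table behind `active_ops` does. That is the property struct_ops relies on when the table entries point at BPF programs.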

This approach is similar in spirit to how kfuncs allow developers to extend the capabilities of BPF. With kfuncs, we can add custom helper functions to the BPF runtime by defining them in a kernel module. With struct_ops, we take this a step further: we define a whole new set of attach points for BPF programs, effectively creating a custom, BPF-programmable subsystem within the kernel.

The benefits are immense:

Dynamic Implementation: You can load, update, and unload the BPF programs implementing the subsystem logic on the fly, without restarting the kernel or the application.
Safety: The BPF verifier ensures that the BPF programs are safe to run, preventing common pitfalls like infinite loops, out-of-bounds memory access, and system crashes.
Flexibility: The logic is in the BPF program, which can be developed and updated independently of the kernel module that defines the struct_ops interface.
Programmability: Userspace applications can interact with and control the BPF programs, allowing for dynamic configuration and control of the kernel subsystem's behavior.

In this tutorial, we will walk through a practical example of this pattern. We'll start with a kernel module that defines a new struct_ops type, and then we'll write a BPF program to implement its functions.

The Kernel Module: Defining the Subsystem Interface

The first step is to create a kernel module that defines our new BPF-programmable subsystem. This module doesn't need to contain much logic itself. Its primary role is to define a struct_ops type and register it with the kernel, creating a new attachment point for BPF programs. It also provides a mechanism to trigger the operations, which in our case will be a simple proc file.

This approach is powerful because it separates the interface definition (in the kernel module) from the implementation (in the BPF program). The kernel module is stable and minimal, while the complex, dynamic logic resides in the BPF program, which can be updated at any time.

Complete Kernel Module: `module/hello.c`

Here is the complete source code for our kernel module. It defines a struct_ops named bpf_testmod_ops with three distinct operations that our BPF program will later implement.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/uaccess.h>  /* copy_from_user() */
#include <linux/bpf_verifier.h>

/* Define our custom struct_ops operations */
struct bpf_testmod_ops {
    int (*test_1)(void);
    int (*test_2)(int a, int b);
    int (*test_3)(const char *buf, int len);
};

/* Global instance that BPF programs will implement */
static struct bpf_testmod_ops __rcu *testmod_ops;

/* Proc file to trigger the struct_ops */
static struct proc_dir_entry *trigger_file;

/* CFI stub functions - required for struct_ops */
static int bpf_testmod_ops__test_1(void)
{
    return 0;
}

static int bpf_testmod_ops__test_2(int a, int b)
{
    return 0;
}

static int bpf_testmod_ops__test_3(const char *buf, int len)
{
    return 0;
}

/* CFI stubs structure */
static struct bpf_testmod_ops __bpf_ops_bpf_testmod_ops = {
    .test_1 = bpf_testmod_ops__test_1,
    .test_2 = bpf_testmod_ops__test_2,
    .test_3 = bpf_testmod_ops__test_3,
};

/* BTF and verifier callbacks */
static int bpf_testmod_ops_init(struct btf *btf)
{
    /* Initialize BTF if needed */
    return 0;
}

static bool bpf_testmod_ops_is_valid_access(int off, int size,
                        enum bpf_access_type type,
                        const struct bpf_prog *prog,
                        struct bpf_insn_access_aux *info)
{
    /* Allow all accesses for now */
    return true;
}

/* Allow specific BPF helpers to be used in struct_ops programs */
static const struct bpf_func_proto *
bpf_testmod_ops_get_func_proto(enum bpf_func_id func_id,
                   const struct bpf_prog *prog)
{
    /* Use base func proto which includes trace_printk and other basic helpers */
    return bpf_base_func_proto(func_id, prog);
}

static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
    .is_valid_access = bpf_testmod_ops_is_valid_access,
    .get_func_proto = bpf_testmod_ops_get_func_proto,
};

static int bpf_testmod_ops_init_member(const struct btf_type *t,
                       const struct btf_member *member,
                       void *kdata, const void *udata)
{
    /* No special member initialization needed */
    return 0;
}

/* Registration function */
static int bpf_testmod_ops_reg(void *kdata, struct bpf_link *link)
{
    struct bpf_testmod_ops *ops = kdata;

    /* Only one instance at a time */
    if (cmpxchg(&testmod_ops, NULL, ops) != NULL)
        return -EEXIST;

    pr_info("bpf_testmod_ops registered\n");
    return 0;
}

/* Unregistration function */
static void bpf_testmod_ops_unreg(void *kdata, struct bpf_link *link)
{
    struct bpf_testmod_ops *ops = kdata;

    if (cmpxchg(&testmod_ops, ops, NULL) != ops) {
        pr_warn("bpf_testmod_ops: unexpected unreg\n");
        return;
    }

    pr_info("bpf_testmod_ops unregistered\n");
}

/* Struct ops definition */
static struct bpf_struct_ops bpf_testmod_ops_struct_ops = {
    .verifier_ops = &bpf_testmod_verifier_ops,
    .init = bpf_testmod_ops_init,
    .init_member = bpf_testmod_ops_init_member,
    .reg = bpf_testmod_ops_reg,
    .unreg = bpf_testmod_ops_unreg,
    .cfi_stubs = &__bpf_ops_bpf_testmod_ops,
    .name = "bpf_testmod_ops",
    .owner = THIS_MODULE,
};

/* Proc file write handler to trigger struct_ops */
static ssize_t trigger_write(struct file *file, const char __user *buf,
                 size_t count, loff_t *pos)
{
    struct bpf_testmod_ops *ops;
    char kbuf[64];
    int ret = 0;

    if (count >= sizeof(kbuf))
        count = sizeof(kbuf) - 1;

    if (copy_from_user(kbuf, buf, count))
        return -EFAULT;

    kbuf[count] = '\0';

    rcu_read_lock();
    ops = rcu_dereference(testmod_ops);
    if (ops) {
        pr_info("Calling struct_ops callbacks:\n");

        if (ops->test_1) {
            ret = ops->test_1();
            pr_info("test_1() returned: %d\n", ret);
        }

        if (ops->test_2) {
            ret = ops->test_2(10, 20);
            pr_info("test_2(10, 20) returned: %d\n", ret);
        }

        if (ops->test_3) {
            ops->test_3(kbuf, count);
            pr_info("test_3() called with buffer\n");
        }
    } else {
        pr_info("No struct_ops registered\n");
    }
    rcu_read_unlock();

    return count;
}

static const struct proc_ops trigger_proc_ops = {
    .proc_write = trigger_write,
};

static int __init testmod_init(void)
{
    int ret;

    /* Register the struct_ops */
    ret = register_bpf_struct_ops(&bpf_testmod_ops_struct_ops, bpf_testmod_ops);
    if (ret) {
        pr_err("Failed to register struct_ops: %d\n", ret);
        return ret;
    }

    /* Create proc file for triggering */
    trigger_file = proc_create("bpf_testmod_trigger", 0222, NULL, &trigger_proc_ops);
    if (!trigger_file) {
        /* Note: No unregister function available in this kernel version */
        return -ENOMEM;
    }

    pr_info("bpf_testmod loaded with struct_ops support\n");
    return 0;
}

static void __exit testmod_exit(void)
{
    proc_remove(trigger_file);
    /* Note: struct_ops unregister happens automatically on module unload */
    pr_info("bpf_testmod unloaded\n");
}

module_init(testmod_init);
module_exit(testmod_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("eBPF Example");
MODULE_DESCRIPTION("BPF struct_ops test module");
MODULE_VERSION("1.0");

Understanding the Kernel Module Code

This module may seem complex, but its structure is logical and serves a clear purpose: to safely expose a new programmable interface to the BPF subsystem. Let's break it down.

First, we define the structure of our new operations. This is a simple C struct containing function pointers. This struct bpf_testmod_ops is the interface that our BPF program will implement. Each function pointer defines a "slot" that a BPF program can fill.

struct bpf_testmod_ops {
    int (*test_1)(void);
    int (*test_2)(int a, int b);
    int (*test_3)(const char *buf, int len);
};

Next, we have the core bpf_struct_ops definition. This is a special kernel structure that describes our new struct_ops type to the BPF system. It's the glue that connects our custom bpf_testmod_ops to the BPF infrastructure.

static struct bpf_struct_ops bpf_testmod_ops_struct_ops = {
    .verifier_ops = &bpf_testmod_verifier_ops,
    .init = bpf_testmod_ops_init,
    .init_member = bpf_testmod_ops_init_member,
    .reg = bpf_testmod_ops_reg,
    .unreg = bpf_testmod_ops_unreg,
    .cfi_stubs = &__bpf_ops_bpf_testmod_ops,
    .name = "bpf_testmod_ops",
    .owner = THIS_MODULE,
};

This structure is filled with callbacks that the kernel will use to manage our struct_ops:

.reg and .unreg: These are registration and unregistration callbacks. The kernel invokes .reg when a BPF program tries to attach an implementation for bpf_testmod_ops. Our implementation uses cmpxchg to ensure only one BPF program can be attached at a time. .unreg is called when the BPF program is detached.
.verifier_ops: This points to a structure of callbacks for the BPF verifier. It allows us to customize how the verifier treats BPF programs attached to this struct_ops. For example, we can control which helper functions are allowed. In our case, we use bpf_base_func_proto to allow a basic set of helpers, including bpf_trace_printk (the helper behind the bpf_printk macro), which is useful for debugging.
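.init and .init_member: These are initialization hooks. The kernel calls .init once when the struct_ops type is registered, giving the module a chance to look up any BTF types it needs. It calls .init_member for each member when a BPF implementation's map is populated, so the module can handle members that need special treatment. Our example needs neither, so both simply return 0.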
.name and .owner: These identify our struct_ops and tie it to our module, ensuring proper reference counting so the module isn't unloaded while a BPF program is still attached.
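The single-instance rule enforced by .reg and .unreg can be modeled in userspace with C11 atomics. This is a rough sketch under stated assumptions: `demo_reg`, `demo_unreg`, and `slot` are hypothetical names, `atomic_compare_exchange_strong` plays the role of the kernel's cmpxchg, and the -ENOENT return from `demo_unreg` is an illustrative choice (the module's real unreg callback returns void and only logs a warning).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_ops {
    int (*test_1)(void);
};

/* Single registration slot; NULL means nothing is attached. */
static _Atomic(struct demo_ops *) slot = NULL;

/* Models .reg: succeed only if the slot was empty. */
int demo_reg(struct demo_ops *ops)
{
    struct demo_ops *expected = NULL;

    assert(ops != NULL);
    if (!atomic_compare_exchange_strong(&slot, &expected, ops))
        return -EEXIST;
    return 0;
}

/* Models .unreg: clear the slot only if `ops` is the current owner. */
int demo_unreg(struct demo_ops *ops)
{
    struct demo_ops *expected = ops;
    struct demo_ops *none = NULL;

    if (!atomic_compare_exchange_strong(&slot, &expected, none))
        return -ENOENT;
    return 0;
}
```

Because the check and the update happen in one atomic step, two concurrent registrations cannot both succeed. The second caller sees a non-empty slot and gets -EEXIST.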

The module's testmod_init function is where the magic starts. It calls register_bpf_struct_ops, passing our definition. This makes the kernel aware of the new bpf_testmod_ops type, and from this point on, BPF programs can target it.

Finally, to make this demonstrable, the module creates a file in the proc filesystem: /proc/bpf_testmod_trigger. When a userspace program writes to this file, the trigger_write function is called. This function checks if a BPF program has registered an implementation for testmod_ops. If so, it calls the function pointers (test_1, test_2, test_3), which will execute the code in our BPF program. This provides a simple way to invoke the BPF-implemented operations from userspace. The use of RCU (rcu_read_lock, rcu_dereference) ensures that we can safely access the testmod_ops pointer even if it's being updated concurrently.
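The dispatch logic in trigger_write can be approximated without any kernel machinery. The sketch below leaves out RCU and the proc file entirely; `dispatch`, `demo_answer`, and `demo_add` are hypothetical names used only to show how unregistered tables and empty slots are skipped.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ops {
    int (*test_1)(void);
    int (*test_2)(int a, int b);
};

/* Example callbacks standing in for BPF programs. */
int demo_answer(void) { return 42; }
int demo_add(int a, int b) { return a + b; }

/*
 * Call every callback that is filled in and store the most recent return
 * value in *last. Returns the number of callbacks that ran; 0 means nothing
 * is registered or every slot is empty.
 */
int dispatch(const struct demo_ops *ops, int *last)
{
    int called = 0;

    assert(last != NULL);
    if (!ops)
        return 0;
    if (ops->test_1) {
        *last = ops->test_1();
        called++;
    }
    if (ops->test_2) {
        *last = ops->test_2(10, 20);
        called++;
    }
    return called;
}
```

Checking each slot before calling it mirrors the `if (ops->test_1)` guards in the module, so an implementation can leave a callback out without crashing the caller.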

The BPF Program: Implementing the Operations

With the kernel module in place defining the what (the bpf_testmod_ops interface), we can now write a BPF program to define the how (the actual implementation of those operations). This BPF program will contain the logic that executes when the test_1, test_2, and test_3 functions are called from the kernel.

Complete BPF Program: `struct_ops.bpf.c`

This program provides the concrete implementations for the function pointers in bpf_testmod_ops.

/* SPDX-License-Identifier: GPL-2.0 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "module/bpf_testmod.h"

char _license[] SEC("license") = "GPL";

/* Implement the struct_ops callbacks */
SEC("struct_ops/test_1")
int BPF_PROG(bpf_testmod_test_1)
{
    bpf_printk("BPF test_1 called!\n");
    return 42;
}

SEC("struct_ops/test_2")
int BPF_PROG(bpf_testmod_test_2, int a, int b)
{
    int result = a + b;
    bpf_printk("BPF test_2 called: %d + %d = %d\n", a, b, result);
    return result;
}

SEC("struct_ops/test_3")
int BPF_PROG(bpf_testmod_test_3, const char *buf, int len)
{
    char read_buf[64] = {0};
    int read_len = len < sizeof(read_buf) ? len : sizeof(read_buf) - 1;

    bpf_printk("BPF test_3 called with buffer length %d\n", len);

    /* Safely read from kernel buffer using bpf_probe_read_kernel */
    if (buf && read_len > 0) {
        long ret = bpf_probe_read_kernel(read_buf, read_len, buf);
        if (ret == 0) {
            /* Successfully read buffer - print first few characters */
            bpf_printk("Buffer content: '%c%c%c%c'\n",
                   read_buf[0], read_buf[1], read_buf[2], read_buf[3]);
            bpf_printk("Full buffer: %s\n", read_buf);
        } else {
            bpf_printk("Failed to read buffer, ret=%ld\n", ret);
        }
    }

    return len;
}

/* Define the struct_ops map */
SEC(".struct_ops")
struct bpf_testmod_ops testmod_ops = {
    .test_1 = (void *)bpf_testmod_test_1,
    .test_2 = (void *)bpf_testmod_test_2,
    .test_3 = (void *)bpf_testmod_test_3,
};

Understanding the BPF Code

The BPF code is remarkably straightforward, which is a testament to the power of the struct_ops abstraction.

Each function in the BPF program corresponds to one of the operations defined in the kernel module's bpf_testmod_ops struct. The magic lies in the SEC annotations:

SEC("struct_ops/test_1"): This tells the BPF loader that the bpf_testmod_test_1 program is an implementation for a struct_ops operation. The name after the slash isn't strictly enforced to match the function name, but it's a good convention. The key part is the struct_ops prefix.

The implementations themselves are simple:

bpf_testmod_test_1: This function takes no arguments, prints a message to the kernel trace log using bpf_printk, and returns the integer 42.
bpf_testmod_test_2: This function takes two integers, a and b, calculates their sum, prints the operation and result, and returns the sum.
bpf_testmod_test_3: This function demonstrates handling data that originated in userspace. It receives a pointer to the kernel buffer that the module filled from the proc write, along with its length. It uses bpf_probe_read_kernel to copy that data into a local buffer on the BPF stack, with the length clamped so the copy always fits. This is a crucial safety measure, as BPF programs cannot directly dereference arbitrary kernel memory pointers. After reading, it prints the content.
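The length clamp in test_3 is easy to examine in isolation. The following is a userspace sketch, not BPF code: `bounded_len` and `read_message` are hypothetical helpers, memcpy stands in for bpf_probe_read_kernel, and, unlike the BPF program, this version treats a negative length as "copy nothing", which is an assumption made here for clarity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* How many bytes to copy into a buffer of `cap` bytes, leaving room for NUL. */
size_t bounded_len(int len, size_t cap)
{
    assert(cap > 0);
    if (len <= 0)
        return 0;
    if ((size_t)len < cap)
        return (size_t)len;
    return cap - 1;
}

/* Stand-in for the copy step; returns 0 on success, -1 if nothing was copied. */
int read_message(char *dst, size_t cap, const char *src, int len)
{
    size_t n = bounded_len(len, cap);

    if (!src || n == 0)
        return -1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return 0;
}
```

With a 64-byte buffer, a 21-byte message is copied whole, while anything of 64 bytes or more is cut to 63 bytes plus the terminator.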

The final piece is the struct_ops map itself:

SEC(".struct_ops")
struct bpf_testmod_ops testmod_ops = {
    .test_1 = (void *)bpf_testmod_test_1,
    .test_2 = (void *)bpf_testmod_test_2,
    .test_3 = (void *)bpf_testmod_test_3,
};

This is the most critical part for linking everything together.

SEC(".struct_ops"): This special section identifies the following data structure as a struct_ops map.
struct bpf_testmod_ops testmod_ops: We declare a variable named testmod_ops of the type struct bpf_testmod_ops. The struct type name is what matters: it must match the name field in the bpf_struct_ops definition within the kernel module (.name = "bpf_testmod_ops"), because libbpf uses the type to find the kernel struct_ops this BPF program intends to implement. The variable name is free to choose; it becomes the map name, which is why the loader refers to it as skel->maps.testmod_ops.
The structure is initialized by assigning the BPF programs (bpf_testmod_test_1, etc.) to the corresponding function pointers. This maps our BPF functions to the "slots" in the struct_ops interface.

When the userspace loader attaches this struct_ops, libbpf and the kernel work together to find the bpf_testmod_ops registered by our kernel module and link these BPF programs as its implementation.
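Conceptually, finding the registered bpf_testmod_ops is a lookup by name. The sketch below is a deliberately simplified model of that step: the real lookup goes through BTF, and `ops_type`, `registry`, `find_ops_type`, and the `type_id` values are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registry entry: one per struct_ops type known to the kernel. */
struct ops_type {
    const char *name;   /* name the type was registered under */
    int type_id;        /* made-up identifier for illustration */
};

static const struct ops_type registry[] = {
    { "tcp_congestion_ops", 1 },
    { "bpf_testmod_ops", 2 },
};

/* Return the entry whose name matches exactly, or NULL if none is registered. */
const struct ops_type *find_ops_type(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++) {
        if (strcmp(registry[i].name, name) == 0)
            return &registry[i];
    }
    return NULL;
}
```

A failed lookup in this model corresponds to the case where the kernel module has not been loaded yet, so there is nothing for the BPF programs to attach to.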

The Userspace Loader: Attaching and Triggering

The final component is the userspace program. Its job is to load the BPF program, attach it to the struct_ops defined by the kernel module, and then trigger the operations to demonstrate that everything is working.

Complete Userspace Program: `struct_ops.c`

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "struct_ops.skel.h"

static volatile bool exiting = false;

void handle_signal(int sig) {
    exiting = true;
}

static int trigger_struct_ops(const char *message) {
    int fd, ret;

    fd = open("/proc/bpf_testmod_trigger", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/bpf_testmod_trigger");
        return -1;
    }

    ret = write(fd, message, strlen(message));
    if (ret < 0) {
        perror("write");
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}

int main(int argc, char **argv) {
    struct struct_ops_bpf *skel;
    struct bpf_link *link;
    int err;

    signal(SIGINT, handle_signal);
    signal(SIGTERM, handle_signal);

    /* Open BPF application */
    skel = struct_ops_bpf__open();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    /* Load BPF programs */
    err = struct_ops_bpf__load(skel);
    if (err) {
        fprintf(stderr, "Failed to load BPF skeleton: %d\n", err);
        goto cleanup;
    }

    /* Register struct_ops */
    link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
    if (!link) {
        fprintf(stderr, "Failed to attach struct_ops\n");
        err = -1;
        goto cleanup;
    }

    printf("Successfully loaded and attached BPF struct_ops!\n");
    printf("Triggering struct_ops callbacks...\n");

    /* Trigger the struct_ops by writing to proc file */
    if (trigger_struct_ops("Hello from userspace!") < 0) {
        printf("Failed to trigger struct_ops - is the kernel module loaded?\n");
        printf("Load it with: sudo insmod module/hello.ko\n");
    } else {
        printf("Triggered struct_ops successfully! Check dmesg for output.\n");
    }

    printf("\nPress Ctrl-C to exit...\n");

    /* Main loop - trigger periodically */
    while (!exiting) {
        sleep(2);
        if (!exiting && trigger_struct_ops("Periodic trigger") == 0) {
            printf("Triggered struct_ops again...\n");
        }
    }

    printf("\nDetaching struct_ops...\n");
    bpf_link__destroy(link);

cleanup:
    struct_ops_bpf__destroy(skel);
    return err < 0 ? -err : 0;
}

Understanding the Userspace Code

The userspace code orchestrates the entire process.

Signal Handling: It sets up a signal handler for SIGINT and SIGTERM to allow for a graceful exit. This is crucial for struct_ops because we need to ensure the BPF program is detached properly.
Open and Load: It uses the standard libbpf skeleton API to open and load the BPF application (struct_ops_bpf__open() and struct_ops_bpf__load()). This loads the BPF programs and the struct_ops map into the kernel.
Attach struct_ops: The key step is the attachment:
```
link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
```
This libbpf function does the heavy lifting. It takes the struct_ops map from our BPF skeleton (skel->maps.testmod_ops) and asks the kernel to link it to the corresponding struct_ops definition (which it finds by the name "bpf_testmod_ops"). If successful, the kernel's reg callback in our module is executed, and the function pointers in the kernel are now pointing to our BPF programs. The function returns a bpf_link, which represents the active attachment.
Triggering: The trigger_struct_ops function simply opens the /proc/bpf_testmod_trigger file and writes a message to it. This action invokes the trigger_write handler in our kernel module, which in turn calls the BPF-implemented operations.
Cleanup: When the user presses Ctrl-C, the exiting flag is set, the loop terminates, and bpf_link__destroy(link) is called. This is the counterpart to the attach step. It detaches the BPF programs, causing the kernel to call the unreg callback in our module. This cleans up the link and decrements the module's reference count, allowing it to be unloaded cleanly. If this step is skipped (e.g., by killing the process with kill -9), the module stays "in use" until the kernel has torn down the process's file descriptors and released the link, which happens asynchronously and can take a moment.
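The signal-driven exit can be tested on its own, separately from libbpf. The sketch below is a variant of the loader's handler rather than a copy of it: it uses `volatile sig_atomic_t` for the flag instead of `volatile bool`, and `install_exit_handlers` and `should_exit` are hypothetical helper names.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* sig_atomic_t is the integer type the C standard allows a handler to write. */
static volatile sig_atomic_t exiting = 0;

static void handle_signal(int sig)
{
    (void)sig;      /* the same action is taken for every signal */
    exiting = 1;
}

/* Install the handler for SIGINT and SIGTERM; returns 0 on success, -1 on failure. */
int install_exit_handlers(void)
{
    if (signal(SIGINT, handle_signal) == SIG_ERR)
        return -1;
    if (signal(SIGTERM, handle_signal) == SIG_ERR)
        return -1;
    return 0;
}

/* Polled by the main loop to decide when to start cleanup. */
bool should_exit(void)
{
    return exiting != 0;
}
```

The handler only sets a flag. The actual cleanup, including bpf_link__destroy, stays in the main flow of the program, where it is safe to call library functions.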

Compilation and Execution

Now that we have all three components—the kernel module, the BPF program, and the userspace loader—let's compile and run the example to see struct_ops in action.

1. Build the Kernel Module

First, navigate to the module directory and compile the kernel module. This requires having the kernel headers installed for your current kernel version.

cd module
make
cd ..

This will produce a hello.ko file, which is our compiled kernel module.

2. Load the Kernel Module

Load the module into the kernel using insmod. This will register our bpf_testmod_ops struct_ops type and create the /proc/bpf_testmod_trigger file.

sudo insmod module/hello.ko

You can verify that the module loaded successfully by checking the kernel log:

dmesg | tail -n 1

You should see a message like: bpf_testmod loaded with struct_ops support.

3. Build and Run the eBPF Application

Next, compile and run the userspace loader, which will also compile the BPF program.

make
sudo ./struct_ops

Upon running, the userspace application will:

Load the BPF programs.
Attach the BPF implementation to the bpf_testmod_ops struct_ops.
Write to /proc/bpf_testmod_trigger to invoke the BPF functions.

You should see output in your terminal like this:

Successfully loaded and attached BPF struct_ops!
Triggering struct_ops callbacks...
Triggered struct_ops successfully! Check dmesg for output.

Press Ctrl-C to exit...
Triggered struct_ops again...

4. Check the Kernel Log for BPF Output

While the userspace program is running, open another terminal and watch the kernel log to see the output from our BPF programs.

sudo dmesg -w

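Every time the proc file is written to, the module's pr_info messages appear in the kernel log. The messages printed by the BPF programs via bpf_printk are written to the kernel trace buffer, so if they are missing from dmesg, read them with sudo cat /sys/kernel/debug/tracing/trace_pipe. Taken together, the output looks like this: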

[ ... ] bpf_testmod_ops registered
[ ... ] Calling struct_ops callbacks:
[ ... ] BPF test_1 called!
[ ... ] test_1() returned: 42
[ ... ] BPF test_2 called: 10 + 20 = 30
[ ... ] test_2(10, 20) returned: 30
[ ... ] BPF test_3 called with buffer length 21
[ ... ] Buffer content: 'Hell'
[ ... ] Full buffer: Hello from userspace!
[ ... ] test_3() called with buffer

This output confirms that the calls from the kernel module are being correctly dispatched to our BPF programs.

5. Clean Up

When you are finished, press Ctrl-C in the terminal running ./struct_ops. The program will gracefully detach the BPF link. Then, you can unload the kernel module.

sudo rmmod hello

Finally, clean up the build artifacts:

make clean
cd module
make clean

Note on Unloading the Module: Gracefully stopping the userspace program is important. It ensures bpf_link__destroy() is called, which allows the kernel module's reference count to be decremented. If the userspace process is killed abruptly (e.g., with kill -9), the kernel module may remain "in use," and rmmod will fail until the kernel has released the BPF link as part of tearing down the process, which can take a moment.

Troubleshooting Common Issues

When working with advanced features like struct_ops, which involve kernel modules, BTF, and the BPF verifier, you may encounter some tricky issues. This section covers common problems and their solutions, based on the development process of this example.

Issue 1: Failed to find BTF for struct_ops

Symptom: The userspace loader fails with an error like:

libbpf: failed to find BTF info for struct_ops/bpf_testmod_ops
Failed to attach struct_ops

Root Cause: This error means the kernel module (hello.ko) was compiled without the necessary BTF (BPF Type Format) information. The BPF system relies on BTF to understand the structure and types defined in the module, which is essential for linking the BPF program to the struct_ops.

Solution:

Ensure vmlinux with BTF is available: The kernel build system needs access to the vmlinux file corresponding to your running kernel to generate BTF for external modules. This file is often not available by default. You may need to copy it from /sys/kernel/btf/vmlinux or build it from your kernel source. A common location for the build system to look is /lib/modules/$(uname -r)/build/vmlinux.
Ensure pahole is up-to-date: BTF generation depends on the pahole tool (part of the dwarves package). Older versions of pahole may lack the features needed for modern BTF generation. Ensure you have pahole v1.16 or newer. If your distribution's version is too old, you may need to compile it from source.
Rebuild the module: After ensuring the dependencies are met, rebuild the kernel module. The Makefile for this example already includes the -g flag, which instructs the compiler to generate debug information that pahole uses to create BTF.

You can verify that BTF information is present in your module with readelf:

readelf -S module/hello.ko | grep .BTF

You should see a .BTF section, indicating that BTF data has been embedded.

Issue 2: Kernel Panic on Module Load

Symptom: The system crashes (kernel panic) immediately after you run sudo insmod hello.ko. The dmesg log might show a NULL pointer dereference inside register_bpf_struct_ops.

Root Cause: The kernel's struct_ops registration logic expects certain callback pointers in the bpf_struct_ops structure to be non-NULL. In older kernel versions or certain configurations, if callbacks like .verifier_ops, .init, or .init_member are missing, the kernel may dereference a NULL pointer, causing a panic. The kernel's code doesn't always perform defensive NULL checks.

Solution: Always provide all required callbacks in your bpf_struct_ops definition, even if they are just empty functions.

// In module/hello.c
static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
    .is_valid_access = bpf_testmod_ops_is_valid_access,
    .get_func_proto = bpf_testmod_ops_get_func_proto,
};

static struct bpf_struct_ops bpf_testmod_ops_struct_ops = {
    .verifier_ops = &bpf_testmod_verifier_ops,  // REQUIRED
    .init = bpf_testmod_ops_init,              // REQUIRED
    .init_member = bpf_testmod_ops_init_member, // REQUIRED
    .reg = bpf_testmod_ops_reg,
    .unreg = bpf_testmod_ops_unreg,
    /* ... */
};

By explicitly defining these callbacks, you prevent the kernel from attempting to call a NULL function pointer.

Issue 3: BPF Program Fails to Load with "Invalid Argument"

Symptom: The userspace loader fails with an error indicating that a BPF helper function is not allowed.

libbpf: prog 'bpf_testmod_test_1': BPF program load failed: Invalid argument
program of this type cannot use helper bpf_trace_printk#6

Root Cause: BPF programs of type struct_ops run in a different kernel context than tracing programs (such as kprobes or tracepoints). As a result, they are subject to a different, often more restrictive, set of allowed helper functions. The bpf_trace_printk helper, which the bpf_printk macro expands to, is a tracing helper and is not allowed by default in struct_ops programs.

Solution: While you can't use bpf_printk by default, you can explicitly allow it for your struct_ops type. This is done in the kernel module by implementing the .get_func_proto callback in your bpf_verifier_ops.

// In module/hello.c
static const struct bpf_func_proto *
bpf_testmod_ops_get_func_proto(enum bpf_func_id func_id,
                   const struct bpf_prog *prog)
{
    /* Use base func proto which includes trace_printk and other basic helpers */
    return bpf_base_func_proto(func_id, prog);
}

static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
    .is_valid_access = bpf_testmod_ops_is_valid_access,
    .get_func_proto = bpf_testmod_ops_get_func_proto, // Add this line
};

The bpf_base_func_proto function provides access to a set of common, basic helpers, including bpf_trace_printk. By adding this to our verifier operations, we tell the BPF verifier that programs attached to bpf_testmod_ops are permitted to use these helpers. This makes debugging with bpf_printk possible.

Summary

In this tutorial, we explored the powerful capabilities of BPF struct_ops by moving beyond common examples. We demonstrated a robust pattern for extending the kernel: creating a minimal kernel module to define a new, BPF-programmable subsystem interface, and then providing the full, complex implementation in a safe, updatable BPF program. This approach combines the extensibility of kernel modules with the safety and flexibility of eBPF.

We saw how the kernel module registers a struct_ops type, how the BPF program implements the required functions, and how a userspace loader attaches this implementation and triggers its execution. This architecture opens the door to implementing a wide range of kernel-level features in BPF, from custom network protocols and security policies to new filesystem behaviors, all while maintaining system stability and avoiding the need to recompile the kernel.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Kernel Source for struct_ops: The implementation can be found in kernel/bpf/bpf_struct_ops.c in the Linux source tree.
Kernel Test Module for struct_ops: The official kernel self-test module provides a reference implementation: tools/testing/selftests/bpf/test_kmods/bpf_testmod.c.
BPF Documentation: The official BPF documentation in the kernel source: https://www.kernel.org/doc/html/latest/bpf/

eBPF Tutorial: BPF Workqueues for Asynchronous Sleepable Tasks

云微 — Tue, 20 Jan 2026 07:20:47 +0000

Ever needed your eBPF program to sleep, allocate memory, or wait for device I/O? Traditional eBPF programs run in restricted contexts where blocking operations are not allowed. But what if your HID device needs timing delays between injected key events, or your cleanup routine needs to sleep while freeing resources?

This is what BPF Workqueues enable. Created by Benjamin Tissoires at Red Hat in 2024 for HID-BPF device handling, workqueues let you schedule asynchronous work that runs in process context where sleeping and blocking operations are allowed. In this tutorial, we'll explore why workqueues were created, how they differ from timers, and build a complete example demonstrating async callback execution.

The complete source code: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_wq

Introduction to BPF Workqueues: Solving the Sleep Problem

The Problem: When eBPF Can't Sleep

Before BPF workqueues existed, developers had bpf_timer for deferred execution. Timers work great for scheduling callbacks after a delay, perfect for updating counters or triggering periodic events. But there's a fundamental limitation that made timers unusable for certain critical use cases: bpf_timer runs in softirq (software interrupt) context.

Softirq context has strict rules enforced by the kernel. You cannot sleep or wait for I/O - any attempt to do so will cause kernel panics or deadlocks. You cannot allocate memory using kzalloc() with GFP_KERNEL flag because memory allocation might need to wait for pages. You cannot communicate with hardware devices that require waiting for responses. Essentially, you cannot perform any blocking operations that might cause the CPU to wait.

This limitation became a real problem for Benjamin Tissoires at Red Hat when he was developing HID-BPF in 2023. HID devices (keyboards, mice, tablets, game controllers) frequently need operations that timers simply can't handle. Imagine implementing keyboard macro functionality where pressing F1 types "hello" - you need 10ms delays between each keystroke for the system to properly process events. Or consider a device with buggy firmware that needs re-initialization after system wake - you must send commands and wait for hardware responses. Timer callbacks in softirq context can't do any of this.

As Benjamin Tissoires explained in his kernel patches: "I need something similar to bpf_timers, but not in soft IRQ context... the bpf_timer functionality would prevent me to kzalloc and wait for the device."

The Solution: Process Context Execution

In early 2024, Benjamin proposed and developed bpf_wq - essentially "bpf_timer but in process context instead of softirq." The kernel community merged it in April 2024, and it is available in Linux v6.10 and later. The key insight is simple but powerful: by running callbacks in process context (through the kernel's workqueue infrastructure), BPF programs gain access to the full range of kernel operations.

Here's what changes with process context:

Feature	bpf_timer (softirq)	bpf_wq (process)
Can sleep?	❌ No - will crash	✅ Yes - safe to sleep
Memory allocation	❌ Limited flags only	✅ Full kzalloc() support
Device I/O	❌ Cannot wait	✅ Can wait for responses
Blocking operations	❌ Prohibited	✅ Fully supported
Latency	Very low (microseconds)	Higher (milliseconds)
Use case	Time-critical fast path	Sleepable slow path

Workqueues enable the classic "fast path + slow path" pattern. Your eBPF program handles performance-critical operations immediately in the fast path, then schedules expensive cleanup or I/O operations to run asynchronously in the slow path. The fast path stays responsive while the slow path gets the capabilities it needs.

Real-World Applications

The applications span multiple domains. HID device handling was the original motivation - injecting keyboard macros with timing delays, fixing broken device firmware dynamically without kernel drivers, re-initializing devices after wake from sleep, transforming input events on the fly. All these require sleepable operations that only workqueues can provide.

Network packet processing benefits from async cleanup patterns. Your XDP program enforces rate limits and drops packets in the fast path (non-blocking), while a workqueue cleans up stale tracking entries in the background. This prevents memory leaks without impacting packet processing performance.

Security monitoring can apply fast rules immediately, then use workqueues to query reputation databases or external threat intelligence services. The fast path makes instant decisions while the slow path updates policies based on complex analysis.

Resource cleanup defers expensive operations. Instead of blocking the main code path while freeing memory, closing connections, or compacting data structures, you schedule a workqueue to handle cleanup in the background.

Implementation: Simple Workqueue Test

Let's build a complete example that demonstrates the workqueue lifecycle. We'll create a program that triggers on the unlink syscall, schedules async work, and verifies that both the main path and workqueue callback execute correctly.

Complete BPF Program: wq_simple.bpf.c

// SPDX-License-Identifier: GPL-2.0
/* Simple BPF workqueue example */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

char LICENSE[] SEC("license") = "GPL";

/* Element with embedded workqueue */
struct elem {
    int value;
    struct bpf_wq work;
};

/* Array to store our element */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, int);
    __type(value, struct elem);
} array SEC(".maps");

/* Result variables */
__u32 wq_executed = 0;
__u32 main_executed = 0;

/* Workqueue callback - runs asynchronously in workqueue context */
static int wq_callback(void *map, int *key, void *value)
{
    struct elem *val = value;
    /* This runs later in workqueue context */
    wq_executed = 1;
    val->value = 42; /* Modify the value asynchronously */
    return 0;
}

/* Main program - schedules work */
SEC("fentry/do_unlinkat")
int test_workqueue(void *ctx)
{
    struct elem init = {.value = 0}, *val;
    struct bpf_wq *wq;
    int key = 0;

    main_executed = 1;

    /* Initialize element in map */
    bpf_map_update_elem(&array, &key, &init, 0);

    /* Get element from map */
    val = bpf_map_lookup_elem(&array, &key);
    if (!val)
        return 0;

    /* Initialize workqueue */
    wq = &val->work;
    if (bpf_wq_init(wq, &array, 0) != 0)
        return 0;

    /* Set callback function */
    if (bpf_wq_set_callback(wq, wq_callback, 0))
        return 0;

    /* Schedule work to run asynchronously */
    if (bpf_wq_start(wq, 0))
        return 0;

    return 0;
}

Understanding the BPF Code

The program demonstrates the complete workqueue workflow from initialization through async execution. We start by defining a structure that embeds a workqueue. The struct elem contains both application data (value) and the workqueue handle (struct bpf_wq work). This embedding pattern is critical - the workqueue infrastructure needs to know which map contains the workqueue structure, and embedding it in the map value establishes this relationship.

Our map is a simple array with one entry, chosen for simplicity in this example. In production code, you'd typically use hash maps to track multiple entities, each with its own embedded workqueue. The global variables wq_executed and main_executed serve as test instrumentation, letting userspace verify that both code paths ran.

The workqueue callback shows the signature that all workqueue callbacks must follow: int callback(void *map, int *key, void *value). The kernel invokes this function asynchronously in process context, passing the map containing the workqueue, the key of the entry, and a pointer to the value. This signature gives the callback full context about which element triggered it and access to the element's data. Our callback sets wq_executed = 1 to prove it ran, and modifies val->value = 42 to demonstrate that async modifications persist in the map.
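To see why a write through the value pointer persists, it can help to model the call in plain C. The sketch below is a userspace illustration only: nothing in it runs in the kernel, struct elem_model omits the kernel-only struct bpf_wq, and run_deferred is a hypothetical stand-in for the kernel worker that eventually invokes the callback.

```c
#include <stddef.h>

/* Same shape as the BPF-side element, minus the embedded struct bpf_wq. */
struct elem_model {
    int value;
};

/* The callback prototype described above. */
typedef int (*wq_callback_fn)(void *map, int *key, void *value);

/* Callback: receives a pointer to the stored value and mutates it in place. */
static int model_callback(void *map, int *key, void *value)
{
    struct elem_model *val = value;

    (void)map;
    (void)key;
    val->value = 42;
    return 0;
}

/* Stand-in for the kernel worker: find the entry and call the callback
 * with a pointer into the array itself, not a copy. */
static int run_deferred(struct elem_model *array, size_t n, int key,
                        wq_callback_fn cb)
{
    if (key < 0 || (size_t)key >= n)
        return -1;
    return cb(array, &key, &array[key]);
}
```

Because the callback is handed a pointer to the stored element rather than a copy, the assignment is visible to any later lookup of the same key.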

The main program attached to fentry/do_unlinkat triggers whenever the unlink syscall executes. This gives us an easy way to activate the program - userspace just needs to delete a file. We set main_executed = 1 immediately to mark the synchronous path. Then we initialize an element and store it in the map using bpf_map_update_elem(). This is necessary because the workqueue must be embedded in a map entry.

The workqueue initialization follows a three-step sequence. First, bpf_wq_init(wq, &array, 0) initializes the workqueue handle, passing the map that contains it. The verifier uses this information to validate that the workqueue and its container are properly related. Second, bpf_wq_set_callback(wq, wq_callback, 0) registers our callback function. The verifier checks that the callback has the correct signature. Third, bpf_wq_start(wq, 0) schedules the workqueue to execute asynchronously. This call returns immediately - the main program continues executing while the kernel queues the work for later execution in process context.

The flags parameter in all three functions is reserved for future use and should be 0 in current kernels. The pattern allows future extensions without breaking API compatibility.

Complete User-Space Program: wq_simple.c

// SPDX-License-Identifier: GPL-2.0
/* Userspace test for BPF workqueue */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <bpf/libbpf.h>
#include "wq_simple.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char **argv)
{
    struct wq_simple_bpf *skel;
    int err, fd;

    libbpf_set_print(libbpf_print_fn);

    /* Open and load BPF application */
    skel = wq_simple_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to open and load BPF skeleton\n");
        return 1;
    }

    /* Attach the fentry program to do_unlinkat */
    err = wq_simple_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("BPF workqueue program attached. Triggering unlink syscall...\n");

    /* Create a temporary file to trigger do_unlinkat */
    fd = open("/tmp/wq_test_file", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) {
        close(fd);
        unlink("/tmp/wq_test_file");
    }

    /* Give workqueue time to execute */
    sleep(1);

    /* Check results */
    printf("\nResults:\n");
    printf("  main_executed = %u (expected: 1)\n", skel->bss->main_executed);
    printf("  wq_executed = %u (expected: 1)\n", skel->bss->wq_executed);

    if (skel->bss->main_executed == 1 && skel->bss->wq_executed == 1) {
        printf("\n✓ Test PASSED!\n");
    } else {
        printf("\n✗ Test FAILED!\n");
        err = 1;
    }

cleanup:
    wq_simple_bpf__destroy(skel);
    return err;
}

Understanding the User-Space Code

The userspace program orchestrates the test and verifies results. We use the skeleton API from libbpf, which embeds the compiled BPF bytecode in a C structure, making loading trivial. The wq_simple_bpf__open_and_load() call opens that embedded object, loads the BPF program into the kernel, and creates all maps in one operation. Compilation has already happened at build time.

After loading, wq_simple_bpf__attach() attaches the fentry program to do_unlinkat. From this point, any unlink syscall will trigger our BPF program. We deliberately trigger this by creating and immediately deleting a temporary file. The open() creates /tmp/wq_test_file, we close the fd, then unlink() deletes it. This deletion enters the kernel's do_unlinkat function, triggering our fentry probe.

Here's the critical timing aspect: workqueue execution is asynchronous. Our main BPF program schedules the work and returns immediately. The kernel queues the callback for later execution by a kernel worker thread. This is why we sleep(1) - giving the workqueue time to execute before we check results. In production code, you'd use more sophisticated synchronization, but for a simple test, sleep is sufficient.
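One step up from a fixed sleep is to poll the result flag with a timeout, so the test finishes as soon as the callback has run. The following is a minimal sketch under that assumption. The helper name wait_for_flag is hypothetical and is not part of the tutorial's source.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Poll *flag roughly once per millisecond until it becomes nonzero or
 * timeout_ms polling steps have passed.
 * Returns 0 if the flag was seen, -1 on timeout. */
static int wait_for_flag(const volatile unsigned int *flag, int timeout_ms)
{
    const struct timespec step = { .tv_sec = 0, .tv_nsec = 1000000 };
    int waited;

    for (waited = 0; waited < timeout_ms; waited++) {
        if (*flag)
            return 0;
        nanosleep(&step, NULL);
    }
    return *flag ? 0 : -1;
}
```

In the loader, a call such as wait_for_flag(&skel->bss->wq_executed, 1000) could take the place of sleep(1). The timeout counts polling steps rather than wall-clock time, so it is only approximate.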

After the sleep, we read global variables from the BPF program's .bss section. The skeleton provides convenient access through skel->bss->main_executed and skel->bss->wq_executed. If both are 1, we know the synchronous path (fentry) and async path (workqueue callback) both executed successfully.

Understanding Workqueue APIs

The workqueue API consists of three essential functions that manage the lifecycle. bpf_wq_init(wq, map, flags) initializes a workqueue handle, establishing the relationship between the workqueue and its containing map. The map parameter is crucial - it tells the verifier which map contains the value with the embedded bpf_wq structure. The verifier uses this to ensure memory safety across async execution. Flags should be 0 in current kernels.

bpf_wq_set_callback(wq, callback_fn, flags) registers the function to execute asynchronously. The callback must have the signature int callback(void *map, int *key, void *value). The verifier checks this signature at load time and will reject programs with mismatched signatures. This type safety prevents common async programming errors. Flags should be 0.

bpf_wq_start(wq, flags) schedules the workqueue to run. This returns immediately - your BPF program continues executing synchronously. The kernel queues the callback for execution by a worker thread in process context at some point in the future. The callback might run microseconds or milliseconds later depending on system load. Flags should be 0.

The callback signature deserves attention. Workqueue callbacks receive (void *map, int *key, void *value): the map that holds the element, a pointer to its key, and a pointer to the map value in which the bpf_wq is embedded. bpf_timer callbacks receive a similar map/key/value triple, so porting a timer callback is mostly mechanical, but the two APIs are declared separately and the callback you pass must match the prototype that bpf_wq_set_callback() expects. The callback runs in process context, so it can safely perform operations that would crash in softirq context.

When to Use Workqueues vs Timers

Choose bpf_timer when you need microsecond-precision timing, operations are fast and non-blocking, you're updating counters or simple state, or implementing periodic fast-path operations like statistics collection or packet pacing. Timers excel at time-critical tasks that must execute with minimal latency.

Choose bpf_wq when you need to sleep or wait, allocate memory with kzalloc(), perform device or network I/O, or defer cleanup operations that can happen later. Workqueues are perfect for the "fast path + slow path" pattern where critical operations happen immediately and expensive processing runs asynchronously. Examples include HID device I/O (keyboard macro injection with delays), async map cleanup (preventing memory leaks), security policy updates (querying external databases), and background processing (compression, encryption, aggregation).

The fundamental trade-off is latency vs capability. Timers have lower latency but restricted capabilities. Workqueues have higher latency but full process context capabilities including sleeping and blocking I/O.

Compilation and Execution

Navigate to the bpf_wq directory and build:

cd bpf-developer-tutorial/src/features/bpf_wq
make

The Makefile compiles the BPF program with the experimental workqueue features enabled and generates a skeleton header.

Run the simple workqueue test:

sudo ./wq_simple

Expected output:

BPF workqueue program attached. Triggering unlink syscall...

Results:
  main_executed = 1 (expected: 1)
  wq_executed = 1 (expected: 1)

✓ Test PASSED!

The test verifies that both the synchronous fentry probe and the asynchronous workqueue callback executed successfully. If the workqueue callback didn't run, wq_executed would be 0 and the test would fail.
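Since bpf_wq needs Linux 6.10 or newer, a loader could check the running kernel's release string before trying to load the program. The sketch below shows one way to do the comparison. The function name kernel_at_least is hypothetical, and because distribution kernels sometimes backport features, the result should be treated as a hint rather than a guarantee.

```c
#include <stdio.h>

/* Parse a release string such as "6.10.3-generic" and compare it against a
 * minimum major.minor version.
 * Returns 1 if new enough, 0 if older, -1 if the string cannot be parsed. */
static int kernel_at_least(const char *release, int major, int minor)
{
    int maj, min;

    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;
    if (maj != major)
        return maj > major;
    return min >= minor;
}
```

The release string would typically come from the release field that uname(2) fills in (declared in <sys/utsname.h>), passed as kernel_at_least(u.release, 6, 10).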

Historical Timeline and Context

Understanding how workqueues came to exist helps appreciate their design. In 2022, Benjamin Tissoires started work on HID-BPF, aiming to let users fix broken HID devices without kernel drivers. By 2023, he realized bpf_timer limitations made HID device I/O impossible - you can't wait for hardware responses in softirq context. In early 2024, he proposed bpf_wq as "bpf_timer in process context," collaborating with the BPF community on the design. The kernel merged workqueues in April 2024 as part of Linux v6.10. Since then, they've been used for HID quirks, rate limiting, async cleanup, and other sleepable operations.

The key quote from Benjamin's patches captures the motivation perfectly: "I need something similar to bpf_timers, but not in soft IRQ context... the bpf_timer functionality would prevent me to kzalloc and wait for the device."

This real-world need drove the design. Workqueues exist because device handling and resource management require sleepable, blocking operations that timers fundamentally cannot provide.

Summary and Next Steps

BPF workqueues solve a fundamental limitation of eBPF by enabling sleepable, blocking operations in process context. Created specifically to support HID device handling where timing delays and device I/O are essential, workqueues unlock powerful new capabilities for eBPF programs. They enable the "fast path + slow path" pattern where performance-critical operations execute immediately while expensive cleanup and I/O happen asynchronously without blocking.

Our simple example demonstrates the core workqueue lifecycle: embedding a bpf_wq in a map value, initializing and configuring it, scheduling async execution, and verifying the callback runs in process context. This same pattern scales to production use cases like network rate limiting with async cleanup, security monitoring with external service queries, and device handling with I/O operations.

If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.

References

Original Kernel Patches: Benjamin Tissoires' HID-BPF and bpf_wq patches (2023-2024)
Linux Kernel Source: kernel/bpf/helpers.c - workqueue implementation
Tutorial Repository: https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/features/bpf_wq

Example adapted from Linux kernel BPF selftests with educational enhancements. Requires Linux kernel 6.10+ for workqueue support. Complete source code available in the tutorial repository.
