TooFastTooCurious

Posted on • Originally published at juliet.sh

Building Runtime Enforcement for Kubernetes with eBPF


Most Kubernetes security tools stop at scan time. They'll flag a critical CVE in a container image or complain that a pod runs as root. What they won't do is tell you that someone just spawned a shell in your production namespace, opened a connection to a mining pool, or loaded a kernel module to break out of the container sandbox.

Juliet started as a graph-based security platform. We map attack paths, score blast radius, prioritize findings. Useful stuff. But customers kept circling back to the same ask: can you actually stop the bad thing, or just tell me it happened?

So we built runtime enforcement. This post walks through the design, the tradeoffs we made, and the production incident that changed how we think about safety.

Replacing Falco

We started with Falco as a sidecar. It watches syscalls through eBPF, writes alerts to a FIFO pipe, and our node-agent reads from the other end of that pipe.

The pipe was the problem. If our agent started before Falco, the pipe didn't exist yet. If Falco restarted, the pipe broke. If the pipe filled up because our reader fell behind, Falco would block. We burned more hours managing that pipe than we spent building actual security features.

On top of that, Falco's rule language was too coarse for what we needed. We wanted to match events against customer-defined policies with namespace scoping, image pattern matching, and per-process exception lists. Translating between our internal policy model and Falco's YAML rules created a fragile middle layer that broke in subtle ways.

We ripped it out and embedded the eBPF sensor directly in our Go agent using cilium/ebpf.

What we trace

We hook 22 syscalls across five categories:

| Category | Syscalls | What we catch |
| --- | --- | --- |
| Process execution | execve, execveat | Shells, exploit toolkits, crypto miners |
| File access | openat, unlinkat, memfd_create | Reads of /etc/shadow, log deletion, fileless payloads |
| Network | connect, listen, accept4 | C2 callbacks, cloud metadata grabs, rogue listeners |
| Container escape | ptrace, mount, setns, unshare, init_module, finit_module | Namespace tricks, host filesystem mounts, module loading |
| Privilege escalation | chmod, fchmodat, capset | Setuid flips, capability changes |

Each tracepoint handler writes a fixed 304-byte struct into a 2MB ring buffer. The struct uses a C union for the syscall-specific payload (file path, network address, or process metadata), so every event is identical in size regardless of type. This keeps the ring buffer math simple and avoids variable-length parsing on the hot path.
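On the Go side, the userspace mirror of that record might look like the sketch below. The field names and the 264-byte payload size are illustrative, not our actual layout; the payload array is sized so the whole struct comes out to the 304 bytes mentioned above.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Event is an illustrative mirror of the fixed-size record the BPF
// handlers write into the ring buffer. The raw Payload array stands in
// for the C union: it holds a file path, a socket address, or process
// metadata depending on the Syscall field.
type Event struct {
	Timestamp uint64    // ktime of the syscall
	PID       uint32    // kernel tgid
	Syscall   uint32    // syscall number, selects the payload variant
	CgroupID  uint64    // used for container lookups
	Comm      [16]byte  // process name, NUL-padded
	Payload   [264]byte // union area; sized so the struct is 304 bytes
}

func eventSize() uintptr { return unsafe.Sizeof(Event{}) }

func main() {
	// A fixed size means the reader can index the ring buffer with
	// plain arithmetic and never parse variable-length records.
	fmt.Println(eventSize()) // 304
}
```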

Filtering where it matters: in the kernel

This was the single best decision we made. Instead of sending every syscall event to userspace and filtering there, we filter inside the BPF program using two maps:

monitored_syscalls: a hash map of syscall numbers that active policies actually care about. If nobody has a network policy enabled, connect and listen events never leave the kernel. When a customer toggles policies on or off, we update this map and the change takes effect on the next syscall.

container_cgroups: a fast lookup by cgroup ID to decide whether a process belongs to a monitored container. For runtimes we haven't populated the map for, we fall back to checking PID namespace depth (task->nsproxy->pid_ns_for_children->level > 0). Containers always have level > 0; host processes sit at level 0. This works across Docker, containerd, and CRI-O without any userspace coordination.

The payoff: overhead scales with the number of policies you enable, not the number of syscalls we could theoretically trace.
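The userspace half of that scheme is a recompute step: whenever policies toggle, derive the set of syscall numbers any enabled policy references, then write it into the monitored_syscalls map. Here's a pure-Go sketch of the recompute; the Policy type is a stand-in for our real policy model.

```go
package main

import "fmt"

// Policy is a hypothetical stand-in for the real policy model.
type Policy struct {
	Enabled  bool
	Syscalls []uint32 // syscall numbers this policy matches
}

// computeMonitoredSet returns every syscall number referenced by at
// least one enabled policy. This set is what gets pushed into the
// monitored_syscalls BPF map: anything outside it never leaves the kernel.
func computeMonitoredSet(policies []Policy) map[uint32]bool {
	set := make(map[uint32]bool)
	for _, p := range policies {
		if !p.Enabled {
			continue
		}
		for _, nr := range p.Syscalls {
			set[nr] = true
		}
	}
	return set
}

func main() {
	policies := []Policy{
		{Enabled: true, Syscalls: []uint32{59, 322}}, // exec policy on
		{Enabled: false, Syscalls: []uint32{42}},     // network policy off
	}
	// With the network policy off, connect (42) stays filtered in-kernel.
	fmt.Println(len(computeMonitoredSet(policies)))
}
```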

Turning PIDs into something useful

A raw eBPF event gives you a PID and a truncated process name (the kernel's 16-byte comm field). That's not enough to make a security decision. You need the container name, the pod, the namespace, the image, and the service account.

We use three caches that each pull from a different source:

  1. PID LRU reads /proc/<pid>/cgroup to get the container ID. 10K entries, 5-minute TTL.
  2. CRI cache talks to containerd over gRPC and watches container start/stop events.
  3. K8s cache watches the pod API for the local node.

If one cache goes down, the other two still contribute what they can. If all three are broken, events still carry the PID, container ID, and process name from the kernel. We never stall the pipeline waiting for metadata. An event with partial enrichment moves through and the policy matcher treats it conservatively (no enforcement on events we can't fully identify).
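The never-stall behavior boils down to each cache contributing whatever fields it has, plus a single check that gates enforcement on partial metadata. A sketch, with hypothetical type and field names:

```go
package main

import "fmt"

// Enrichment is the metadata attached to a raw kernel event. Any field
// may be empty if the cache that supplies it missed or is down.
type Enrichment struct {
	ContainerID string // from the PID cache (/proc/<pid>/cgroup)
	PodName     string // from the CRI cache
	Namespace   string // from the K8s cache
}

// enforceable reports whether the event carries enough identity to act
// on. Partially enriched events still flow through the pipeline, but
// the matcher treats them conservatively: audit only, never kill.
func enforceable(e Enrichment) bool {
	return e.ContainerID != "" && e.Namespace != ""
}

func main() {
	partial := Enrichment{ContainerID: "abc123"} // K8s cache missed
	fmt.Println(enforceable(partial))            // false: audit only
}
```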

Matching policies fast

Every two minutes, the agent syncs policies from the API and compiles them into a lookup structure:

```
CompiledPolicy {
    SyscallSet:    {59: true, 322: true}     // execve, execveat
    ProcessNames:  {"bash": true, "sh": true}
    PathPrefixes:  ["/tmp/", "/var/run/"]
    NetCIDRs:      [169.254.169.254/32]
    Scope:         {IncludeNamespaces: {"production": true}}
    Exceptions:    [{process_name: "nginx"}]
}
```

Policies are bucketed by syscall category. When an event comes in, we look up its category (derived from the syscall number), get the handful of candidate policies (usually 3-8), and check each one. The hot path uses pre-allocated maps and does zero heap allocation.

If two policies both match and one says "alert" while the other says "kill", the kill wins. We always pick the highest-severity enforce-mode match.
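That conflict resolution is just a max over a severity ordering. A minimal sketch, with illustrative action names:

```go
package main

import "fmt"

// Action severity, lowest to highest. The names and ordering here are
// illustrative of the idea, not the exact internal enum.
var severity = map[string]int{"audit": 0, "alert": 1, "kill": 2}

// resolveAction picks the highest-severity action among the matched
// policies, so "kill" always wins over "alert".
func resolveAction(matched []string) string {
	best := "audit"
	for _, a := range matched {
		if severity[a] > severity[best] {
			best = a
		}
	}
	return best
}

func main() {
	fmt.Println(resolveAction([]string{"alert", "kill"})) // kill
}
```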

Why we kill instead of block

We enforce by sending SIGKILL from userspace. The alternative is BPF LSM, where the eBPF program returns -EPERM and the kernel refuses the syscall before it completes.

LSM is objectively better at prevention. But we chose kill for three reasons:

Portability. BPF LSM requires kernel 5.7+ with CONFIG_BPF_LSM=y. A lot of production clusters still run Amazon Linux 2 or RHEL 8. We didn't want to cut out half our addressable market.

Failure mode. If an LSM policy has a bug and matches kubelet or containerd, the node goes down. You can't start new pods, can't pull images, can't recover without SSH access. With SIGKILL, the worst case is that a process dies and the kubelet restarts it. Annoying, but the node stays up.

We tested the failure mode on ourselves. Not on purpose.

How we broke staging

Three weeks into our enforcement beta, we turned on enforce mode in staging. Within minutes, Harbor (our container registry) started throwing 500 errors. Pulls failed. Deployments queued up. The cluster ground to a halt.

Here's what happened: we had a policy that flags processes running as root. That's a reasonable thing to detect. But our enforcement engine applied it globally, across every namespace on the node. Harbor's Postgres process runs as root. So does Cilium's agent. So does RabbitMQ. The enforcement engine dutifully killed all of them.

We turned enforcement off, traced the kills in our metrics, and realized the fix was obvious in hindsight: enforcement needs to be scoped to specific namespaces.

```go
func isInScope(namespace string, scope CompiledScope) bool {
    if len(scope.IncludeNamespaces) > 0 {
        return scope.IncludeNamespaces[namespace]
    }
    if len(scope.ExcludeNamespaces) > 0 {
        return !scope.ExcludeNamespaces[namespace]
    }
    return true
}
```

Three rules came out of that incident:

  • If an event has no namespace metadata (enrichment failed or it's a host process), never enforce. Default to audit.
  • If a namespace isn't in the policy's scope, downgrade from kill to audit. Still record the event, just don't act on it.
  • The UI now requires you to specify at least one namespace when you set a policy to enforce mode. No more global enforcement.
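The first two rules compose into one decision function. A sketch under assumed names (decide and its string actions are hypothetical):

```go
package main

import "fmt"

// decide applies the post-incident rules: no namespace metadata means
// audit only, and out-of-scope namespaces are downgraded from kill to
// audit. Only an in-scope, fully identified event keeps the policy's
// enforce action.
func decide(namespace string, inScope bool, action string) string {
	if namespace == "" { // enrichment failed or it's a host process
		return "audit"
	}
	if !inScope { // still recorded, never acted on
		return "audit"
	}
	return action
}

func main() {
	fmt.Println(decide("", false, "kill"))          // audit
	fmt.Println(decide("staging", false, "kill"))   // audit
	fmt.Println(decide("production", true, "kill")) // kill
}
```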

Seven things we check before every kill

After the Harbor mess, we added layers of protection to the response actor. Every kill request goes through all seven:

  1. No container ID, no kill. If we can't confirm it's a container process, we leave it alone.
  2. Simulate mode. Logs what would happen without sending the signal. You should always run a new policy in simulate for a few days first.
  3. Protected namespaces. kube-system is off-limits by default.
  4. PID 0 and PID 1 are untouchable. We will never kill init.
  5. Self-preservation. The agent will not kill its own process.
  6. Rate limiting per pod. 10 kills per pod in a 60-second window. After that, we stop and flag it. This prevents kill-restart spirals.
  7. Namespace scope. The policy must explicitly include the event's namespace.

Each attempt gets tagged with a result code: killed, failed, skipped_namespace, skipped_pid1, suppressed, or simulated. All of these show up in Prometheus, so you can see exactly what enforcement did on every node.
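The check chain reads naturally as an ordered switch that returns a result code rather than a bare yes/no, so every skipped kill stays attributable. A sketch; the struct fields are illustrative, and a couple of the result-code names here are invented since the post lists only six.

```go
package main

import "fmt"

// KillRequest carries what the response actor needs to vet a kill.
// Fields are illustrative, not the real internal shape.
type KillRequest struct {
	PID         int
	ContainerID string
	Namespace   string
	InScope     bool // policy explicitly includes this namespace
	Simulate    bool
	AgentPID    int
	RecentKills int // kills for this pod in the last 60s
}

var protectedNamespaces = map[string]bool{"kube-system": true}

// vet runs the seven checks in order; the first one that trips
// short-circuits with its result code.
func vet(r KillRequest) string {
	switch {
	case r.ContainerID == "":
		return "skipped_no_container" // invented code for check 1
	case r.Simulate:
		return "simulated"
	case protectedNamespaces[r.Namespace]:
		return "skipped_namespace"
	case r.PID <= 1:
		return "skipped_pid1"
	case r.PID == r.AgentPID:
		return "skipped_self" // invented code for check 5
	case r.RecentKills >= 10:
		return "suppressed" // rate limit: 10 kills/pod/60s
	case !r.InScope:
		return "skipped_namespace"
	}
	return "killed"
}

func main() {
	r := KillRequest{PID: 4242, ContainerID: "abc", Namespace: "production", InScope: true, AgentPID: 1000}
	fmt.Println(vet(r)) // killed
}
```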

Moving events without drowning

A busy node can produce thousands of syscall events per second. Sending each one to the API individually would saturate the network and hammer ClickHouse. So we built a five-stage pipeline:

```
Sensor (ring buffer, polls every 100ms)
  -> Response Actor (kill decisions happen here, < 200ms)
    -> Coalescer (groups by rule+container+process, 5s window)
      -> Batcher (flushes at 500 events or 5s, whichever hits first)
        -> Forwarder (gzip, retry with backoff, disk spool if API is down)
```

The important thing: kills happen in stage 2. We don't batch enforcement. If a process needs to die, it dies within 200ms of the syscall, not after a 5-second batch window.

Coalescing cuts volume by 10x to 100x on noisy workloads. If bash keeps spawning in the same container and hitting the same policy, we collapse 100 events into one record with event_count: 100.
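The grouping itself is a map keyed on rule+container+process. A simplified sketch of one coalescing window (the key and record shapes are illustrative):

```go
package main

import "fmt"

// coalesceKey groups events the way the coalescer does: same rule,
// container, and process collapse into one record per window.
type coalesceKey struct {
	Rule      string
	Container string
	Process   string
}

type record struct {
	coalesceKey
	EventCount int
}

// coalesce collapses one window's worth of events into per-key records
// carrying an event_count, preserving first-seen order.
func coalesce(events []coalesceKey) []record {
	counts := make(map[coalesceKey]int)
	var order []coalesceKey
	for _, e := range events {
		if counts[e] == 0 {
			order = append(order, e)
		}
		counts[e]++
	}
	out := make([]record, 0, len(order))
	for _, k := range order {
		out = append(out, record{k, counts[k]})
	}
	return out
}

func main() {
	var evs []coalesceKey
	for i := 0; i < 100; i++ { // bash spawning over and over
		evs = append(evs, coalesceKey{"shell-in-prod", "c1", "bash"})
	}
	recs := coalesce(evs)
	fmt.Println(len(recs), recs[0].EventCount) // 1 100
}
```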

If the API goes offline, the forwarder writes batches to a local disk spool (capped at 100MB, oldest files evicted first). When the API comes back, a drain loop picks up the files and replays them. We'd rather lose some events than let backpressure freeze the enforcement path.
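The eviction policy is the interesting part of the spool: oldest batches go first until the total fits under the cap. Here's a sketch of just that decision, operating on an in-memory file list rather than real disk state (names and shapes are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

type spoolFile struct {
	Name  string
	Size  int64
	MTime int64 // unix seconds; oldest evicted first
}

const spoolCap = 100 << 20 // 100MB cap

// evict returns which files to delete so the spool fits under the cap,
// dropping the oldest batches first. Losing old events is preferable
// to letting backpressure reach the enforcement path.
func evict(files []spoolFile) []string {
	sort.Slice(files, func(i, j int) bool { return files[i].MTime < files[j].MTime })
	var total int64
	for _, f := range files {
		total += f.Size
	}
	var victims []string
	for i := 0; total > spoolCap && i < len(files); i++ {
		victims = append(victims, files[i].Name)
		total -= files[i].Size
	}
	return victims
}

func main() {
	files := []spoolFile{
		{"batch-1.gz", 60 << 20, 100},
		{"batch-2.gz", 60 << 20, 200},
	}
	fmt.Println(evict(files)) // [batch-1.gz]: the older batch goes
}
```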

Handling different kernels

eBPF with CO-RE needs BTF data. Modern kernels (5.8+) ship it at /sys/kernel/btf/vmlinux. Plenty of production kernels don't.

Our fallback chain:

  1. Use host kernel BTF if it exists
  2. Try an embedded BTFhub archive that matches the kernel release
  3. If nothing works, run in status-only mode. The agent reports its health and syncs policies, but doesn't hook any syscalls.

ARM64 adds another wrinkle. Those kernels don't have dup2 or chmod as separate syscalls; they use dup3 and fchmodat instead. We attach tracepoints on a best-effort basis: skip what's missing, log a warning, only bail out if literally nothing attaches.

What the numbers look like

50 pods on a node, all 40 built-in policies active in audit mode:

| Metric | Value |
| --- | --- |
| CPU (steady state) | 200-300 mCPU |
| Memory | 500-800 MB |
| Raw events per second | 50-200 |
| After coalescing | 5-20 per second |
| Network to API | 50-500 KB every 5s |
| Time from syscall to ClickHouse | 5-11 seconds |
| Time from syscall to kill | under 200ms |

Storage latency is deliberately higher than enforcement latency. Killing a process can't wait for batch compression. Writing it to a database can.

Things we'd change

Scope enforcement from day one. Global enforcement without namespace scoping cost us a staging outage and a scramble to patch. If you're building enforcement for anything, make scope a required field before you write your first kill call.

Move coalescing earlier for audit-only events. Right now every event hits the response actor, even if it's just going to be logged. For audit policies, we could coalesce first and skip the per-event response check entirely. That would cut CPU on nodes with chatty workloads.

Ship a heartbeat from the start. For months we inferred agent health from when it last uploaded an SBOM or synced policies. If a node had no new images and runtime was off, the agent looked dead even though it was fine. A 60-second heartbeat ping would have saved us a lot of false alarms.

Where this is going

We're looking at BPF LSM as an opt-in mode for clusters running kernel 5.7+. SIGKILL handles most cases well, but some compliance regimes want proof that the syscall was blocked, not just that the process was terminated afterward.

We're also wiring up alert routing so enforcement events go straight to Slack and PagerDuty instead of sitting in a dashboard waiting to be noticed.

Building runtime enforcement changed Juliet from a scanner into a platform. It also taught us more about production safety than anything else we've shipped. If you're curious, juliet.sh.

Questions about any of the runtime stuff? Reach us at contact@juliet.sh.
