daniel jeong

Posted on May 14 • Originally published at manoit.co.kr

CVE-2026-31431 'Copy Fail' Deep Dive — Linux Page-Cache Bug and AF_ALG Kubernetes Container Escape

#linux #security #kubernetes #devops

CVE-2026-31431 "Copy Fail" Deep Dive — A Nine-Year-Old Linux Kernel Page-Cache Bug, AF_ALG Container Escape, and the seccomp/Falco Playbook for 2026 Kubernetes Node Security

On April 29, 2026, Theori researcher Taeyang Lee disclosed CVE-2026-31431 "Copy Fail". On the surface it is another Linux kernel local privilege escalation (LPE, CVSS 7.8); in practice it is considerably more. A 2017 in-place optimisation in the algif_aead module (commit 72548b093ee3) lay dormant for nine years until a four-syscall, 732-byte PoC woke it up. Two things make this disclosure heavier than a typical kernel LPE: (1) one AF_ALG socket plus a splice() chain yields a controlled 4-byte write into any page-cache-backed page, and (2) on Kubernetes nodes, an unprivileged Pod can corrupt a setuid binary in a shared overlayfs lower layer so that a privileged DaemonSet on the same node executes the corrupted binary, turning unprivileged Pod code execution into node-level root. Microsoft Security Blog, Sysdig, Unit 42, CERT-EU, Help Net Security and Ubuntu all published analyses the same week. The mainline fix landed on April 1, 2026 (commit a664bf3d603d), four weeks before the public disclosure, but vendor errata are still rolling through the second week of May. This post breaks Copy Fail into eight axes and shares the seccomp-seal/Falco-detect/patch-priority checklist ManoIT applied to 17 EKS, GKE, and on-premises RKE2 clusters.

1. Why Copy Fail is the inflection point — when "kernel LPE" turned into "container escape"

Most Linux kernel LPEs of the past few years assumed the attacker already had a local shell. DirtyPipe (2022), StackRot (2023) and GameOver(L)Ay (2023) were all "you already have a shell, now you become root." For cloud-native operators, Pod/VM isolation was the first line of defence. Copy Fail breaks that assumption. An ordinary unprivileged process inside a Pod, with no CAP_SYS_ADMIN, no ptrace access and no other capability, can open an AF_ALG socket and use splice() to corrupt the page cache. Because the page cache is a kernel-global resource, corrupting /usr/bin/su from inside the container also corrupts what every other reader of that same file on the node sees, including other containers that share the image layer. The GitHub repository Percivalll/Copy-Fail-CVE-2026-31431-Kubernetes-PoC validated the container escape on Alibaba Cloud ACK, Amazon EKS, and Google GKE. That moves the response to Copy Fail from "fast-track LPE patching" to "isolate your nodes tonight or risk the whole cluster."

The table below compares major Linux LPEs since 2022 alongside Copy Fail.

Name	Year	CVSS	Prerequisite	Container escape	Patch cycle
DirtyPipe	2022	7.8	local shell + read-only file	limited	1–2 weeks
StackRot	2023	7.8	local shell + maple-tree exhaustion	no	2–3 weeks
GameOver(L)Ay	2023	7.8	local shell + Ubuntu overlayfs	no	1–2 weeks
Looney Tunables	2023	7.8	glibc + setuid + GLIBC_TUNABLES	limited	3–4 weeks
Copy Fail	2026	7.8	unprivileged process + AF_ALG available	EKS/GKE/ACK validated	1–14 days per distro

Two cells in the last row do the heavy lifting: the prerequisite drops from "local shell" to "unprivileged process," and the container escape is validated on three major managed Kubernetes platforms. That is what makes Copy Fail the 2026-H1 Kubernetes node-security inflection point. The patch cycle column is operationally crucial — AlmaLinux shipped its own kernel build on May 1, but Red Hat, Ubuntu, SUSE, and Amazon Linux published errata in stages through the second week of May, so the patch gap equals your exposure window.

2. The vulnerability mechanism — algif_aead in-place optimisation meets the page cache

Copy Fail lives in crypto/algif_aead.c, the AEAD (Authenticated Encryption with Associated Data) socket interface for the kernel's userspace crypto API (AF_ALG). The 2017 commit 72548b093ee3 ("crypto: af_alg - get_page upon reading from socket") added an in-place optimisation that allowed the destination scatterlist of an AEAD operation to reference page-cache pages directly when the user-supplied destination already pointed at such pages. The optimisation never distinguished between user-space pages and page-cache pages: if an unprivileged process gets the cached pages of a read-only setuid binary into the destination scatterlist, which the PoC does with splice(), the kernel writes four AEAD-output bytes directly into the cached page.

The table below shows the four-syscall chain from unprivileged process to root.

Step	syscall	Intended use	How Copy Fail abuses it
1. AF_ALG socket	`socket(AF_ALG, SOCK_SEQPACKET, 0)`	userspace crypto entry point	reach the kernel crypto path
2. Bind algorithm	`bind()` + `setsockopt()`	register aead-aes-gcm, etc.	bind to an arbitrary AEAD alg
3. splice() in	`splice(pipe, alg_sock)`	zero-copy input	inject a page-cache page into the destination scatterlist
4. recvmsg()	`recvmsg(alg_sock)`	receive the encryption output	4-byte arbitrary write into the cached setuid binary

Step 4 is the kill: the Theori PoC overwrites four bytes in the authentication-decision branch of /usr/bin/su, making the binary accept any password. Because su is setuid root, the calling user immediately becomes root. The corruption sits in the page cache, which is kernel-wide — anything on the host that reads that file from cache reads the corrupted bytes too.

2.1 Nine years of an in-place optimisation — why 2017 code broke in 2026

The 2017 commit was a perfectly reasonable optimisation at the time: AEAD throughput improves measurably when you skip one copy. Reviewers did not catch the page-cache implication. Nine years later, the spread of eBPF, io_uring, and splice-heavy zero-copy patterns produced exactly the combinations that turn the in-place path into an arbitrary write. This is not a one-off defect. It is a structural signal that zero-copy paths in the Linux kernel are now being combined in ways that were never exhaustively reviewed. The operational implication is to treat AF_ALG as an opt-in interface: assume most workloads do not need it and seal it (§4) instead of waiting for the next nine-year-old bug to wake up.

3. The Kubernetes container-escape scenario — overlayfs shared layers are the bridge

Copy Fail upgrades from LPE to container escape because of overlayfs lower-layer sharing. Container runtimes (containerd, CRI-O) share read-only image layers across every container on a node. The sharing extends from disk into the page cache: the same file in the same image layer maps to one cache entry across all containers. Copy Fail corrupts that one entry.

Step	Actor	Action	Outcome
1. Pod compromise	attacker (vulnerable app or supply chain)	code execution in unprivileged Pod	kernel syscalls available
2. AF_ALG 4-syscall chain	attacker process	4-byte write into the page cache	shared-layer binary corrupted
3. Privileged DaemonSet runs	kubelet / cron / systemd	invokes the setuid binary	host root code execution
4. Cred theft	host root process	reads `/etc/kubernetes/kubelet.conf` and `/var/lib/kubelet/pki/`	cluster-wide credentials leak
5. Lateral movement	stolen kubeconfig	other nodes/namespaces	cluster compromise

Step 3 is the load-bearing step because most production Kubernetes clusters run many privileged DaemonSets: CNI (Calico, Cilium), CSI (EBS, Ceph), log shippers (Fluent Bit, Vector), security agents (Falco, Tetragon), node-exporter. Any one of them executing a binary backed by a shared image layer is enough. The defence while patches are landing is to make sure step 2 cannot happen, which is exactly what §4 addresses.
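To make the step 3 exposure concrete, the sketch below flags DaemonSets that run privileged containers or mount hostPath volumes. It is a minimal illustration in Python, assuming the manifests have already been loaded into dictionaries (for example from the JSON output of kubectl); the function name `risky_daemonsets` and the sample objects are hypothetical and are not part of ManoIT's tooling.

```python
from typing import Any, Dict, List


def risky_daemonsets(items: List[Dict[str, Any]]) -> List[str]:
    """Return 'namespace/name' for DaemonSets that are privileged or use hostPath."""
    flagged = []
    for item in items:
        if item.get("kind") != "DaemonSet":
            continue
        pod_spec = item.get("spec", {}).get("template", {}).get("spec", {})
        containers = (pod_spec.get("containers") or []) + (
            pod_spec.get("initContainers") or []
        )
        # A container counts as privileged only when the flag is explicitly true.
        privileged = any(
            (c.get("securityContext") or {}).get("privileged") is True
            for c in containers
        )
        host_path = any("hostPath" in v for v in (pod_spec.get("volumes") or []))
        if privileged or host_path:
            meta = item.get("metadata", {})
            flagged.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return sorted(flagged)


# Hypothetical sample objects, trimmed to the fields the check reads.
sample = [
    {
        "kind": "DaemonSet",
        "metadata": {"namespace": "kube-system", "name": "cni-agent"},
        "spec": {"template": {"spec": {"containers": [
            {"name": "agent", "securityContext": {"privileged": True}}
        ]}}},
    },
    {
        "kind": "DaemonSet",
        "metadata": {"namespace": "debug", "name": "strace-shell"},
        "spec": {"template": {"spec": {
            "containers": [{"name": "shell"}],
            "volumes": [{"name": "host", "hostPath": {"path": "/"}}],
        }}},
    },
    {
        "kind": "DaemonSet",
        "metadata": {"namespace": "monitoring", "name": "metrics"},
        "spec": {"template": {"spec": {"containers": [{"name": "exporter"}]}}},
    },
]

print(risky_daemonsets(sample))
```

The list is only an inventory aid: which of the flagged DaemonSets can be paused is an operational decision, and CNI or CSI agents will normally have to stay.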

3.1 What the EKS/GKE/ACK validation means

The PoC repository validated container escape on the default node images of Amazon EKS, Google GKE, and Alibaba Cloud ACK. The implication is simple: "managed Kubernetes will block this" stopped being true on April 29. Until each cloud rotates its node images (AWS EKS Optimized AMI, GKE COS) — usually 1–7 days — users themselves must enforce AF_ALG sealing, Pod Security Standards, and node isolation.

4. The mitigation that actually buys time — seal AF_ALG sockets with seccomp

The four-syscall chain collapses the moment step 1 fails: if socket(AF_ALG, SOCK_SEQPACKET, 0) is blocked, nothing else matters. AF_ALG is the userspace crypto interface to the kernel, but almost no container workload uses it — applications call OpenSSL/BoringSSL/libsodium directly in user space. So a container-level seccomp profile that returns EAFNOSUPPORT (errno 97) when the first argument to socket() or socketpair() is 38 (AF_ALG) seals step 1 without breaking legitimate workloads.

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket", "socketpair"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 97,
      "args": [
        {
          "index": 0,
          "value": 38,
          "op": "SCMP_CMP_EQ"
        }
      ],
      "comment": "Block AF_ALG (CVE-2026-31431 Copy Fail) — domain 38, errno EAFNOSUPPORT (97)"
    }
  ]
}
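If the profile is generated or linted in CI, a small helper can build the rule and confirm that a given profile really contains it. The following is a sketch under the assumption that profiles use the same JSON layout as the one above; `af_alg_rule` and `blocks_af_alg` are illustrative names, not an existing library API.

```python
import json

AF_ALG = 38        # address family number of AF_ALG on Linux
EAFNOSUPPORT = 97  # errno handed back to the caller on Linux


def af_alg_rule() -> dict:
    """Build the deny rule used in the profile above."""
    return {
        "names": ["socket", "socketpair"],
        "action": "SCMP_ACT_ERRNO",
        "errnoRet": EAFNOSUPPORT,
        "args": [{"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}],
    }


def blocks_af_alg(profile: dict) -> bool:
    """True if some rule denies socket() when argument 0 equals AF_ALG."""
    for rule in profile.get("syscalls") or []:
        if "socket" not in (rule.get("names") or []):
            continue
        if rule.get("action") != "SCMP_ACT_ERRNO":
            continue
        for arg in rule.get("args") or []:
            if (
                arg.get("index") == 0
                and arg.get("value") == AF_ALG
                and arg.get("op") == "SCMP_CMP_EQ"
            ):
                return True
    return False


profile = {"defaultAction": "SCMP_ACT_ALLOW", "syscalls": [af_alg_rule()]}
print(json.dumps(profile, indent=2))
print("blocks AF_ALG:", blocks_af_alg(profile))
```

The check is purely structural. It does not prove that the runtime loaded the profile, so it complements a runtime probe from inside a container and does not replace one.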

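The same JSON works under Docker, Podman, containerd, and CRI-O. One caveat: a container runs under a single seccomp profile, so a Localhost profile replaces the runtime's default profile (RuntimeDefault) instead of stacking on top of it, and with defaultAction SCMP_ACT_ALLOW this one permits every other syscall. Where workloads rely on RuntimeDefault today, add the AF_ALG rule to a copy of that default profile. On Kubernetes, drop the file at /var/lib/kubelet/seccomp/profiles/block-af-alg.json on every node and reference it in the Pod securityContext: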

apiVersion: v1
kind: Pod
metadata:
  name: app-with-af-alg-block
  annotations:
    manoit.co.kr/cve: "CVE-2026-31431"
    manoit.co.kr/mitigation: "seccomp-af-alg-block"
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-af-alg.json
  containers:
  - name: app
    image: registry.manoit.co.kr/svc/api:v1.42.0
    # OpenSSL in user space — no AF_ALG needed
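
Whether the profile is actually in force can be checked from inside a running container by attempting the first syscall of the chain and looking at the error. The probe below is a minimal sketch, assuming Python is available in the image; the status strings are made up for this example. It only creates and closes a socket and performs no crypto operation.

```python
import errno
import socket

# socket.AF_ALG exists on Linux builds of Python; fall back to the raw number elsewhere.
AF_ALG = getattr(socket, "AF_ALG", 38)


def classify(error_number):
    """Map the outcome of socket(AF_ALG, ...) to a short status string."""
    if error_number is None:
        return "open"
    if error_number == errno.EAFNOSUPPORT:
        # Either the seccomp rule fired or the kernel has no AF_ALG support at all.
        return "sealed-or-unsupported"
    return "error:" + errno.errorcode.get(error_number, str(error_number))


def probe_af_alg():
    """Try to create an AF_ALG socket and report what happened."""
    try:
        sock = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError as exc:
        return classify(exc.errno if exc.errno is not None else -1)
    sock.close()
    return classify(None)


if __name__ == "__main__":
    print("AF_ALG:", probe_af_alg())
```

A result of `open` means step 1 of the chain is still reachable from that container. `sealed-or-unsupported` is the expected result once the profile applies, although a kernel built without the userspace crypto API returns the same error.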

Cluster-wide enforcement uses Kyverno (or a ValidatingAdmissionPolicy):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-af-alg-seccomp
  annotations:
    policies.kyverno.io/severity: high
    policies.kyverno.io/subject: "CVE-2026-31431 Copy Fail"
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-block-af-alg-profile
    match:
      any:
      - resources:
          kinds: ["Pod"]
          namespaces: ["default", "app-*", "svc-*"]
    validate:
      message: "Pod must use the block-af-alg seccomp profile (CVE-2026-31431 mitigation)"
      pattern:
        spec:
          securityContext:
            seccompProfile:
              type: Localhost
              localhostProfile: "profiles/block-af-alg.json"
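
Before switching the policy to Enforce, it helps to know which manifests would be rejected. The sketch below mirrors the pattern above as a pre-check over already-parsed manifests. It is an approximation written for this article, not Kyverno's own matching logic: it looks only at the Pod-level securityContext of a Pod or of a workload with `spec.template`, and `uses_block_af_alg_profile` is a hypothetical helper name.

```python
from typing import Any, Dict

EXPECTED_PROFILE = {
    "type": "Localhost",
    "localhostProfile": "profiles/block-af-alg.json",
}


def pod_spec_of(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return the Pod spec of a Pod, or of a workload that embeds spec.template."""
    spec = manifest.get("spec") or {}
    if manifest.get("kind") == "Pod":
        return spec
    return (spec.get("template") or {}).get("spec") or {}


def uses_block_af_alg_profile(manifest: Dict[str, Any]) -> bool:
    """True if the Pod-level seccompProfile matches the expected Localhost profile."""
    security_context = pod_spec_of(manifest).get("securityContext") or {}
    profile = security_context.get("seccompProfile") or {}
    return all(profile.get(key) == value for key, value in EXPECTED_PROFILE.items())


# Hypothetical manifests, reduced to the fields the check reads.
compliant = {
    "kind": "Pod",
    "spec": {"securityContext": {"seccompProfile": dict(EXPECTED_PROFILE)}},
}
chart_without_profile = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{"name": "app"}]}}},
}

for name, manifest in [("compliant", compliant), ("chart", chart_without_profile)]:
    print(name, uses_block_af_alg_profile(manifest))
```

Running a check like this against rendered Helm output is one way to find charts that set no `seccompProfile` at all before the admission controller starts rejecting them.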

4.1 Node-level sealing — blacklist `algif_aead.ko`

If seccomp is the container-level defence, blacklisting the kernel module is the node-level defence. Most distributions ship algif_aead as a module:

cat <<'EOF' | sudo tee /etc/modprobe.d/copy-fail-cve-2026-31431.conf
# CVE-2026-31431 Copy Fail mitigation — block algif_aead until kernel is patched
blacklist algif_aead
install algif_aead /bin/true
EOF

# Unload from the current session too
sudo rmmod algif_aead 2>/dev/null || true

sudo update-initramfs -u  # Debian/Ubuntu
# RHEL family: dracut --force

Two caveats. (1) If you run LUKS or kcapi tooling, verify that nothing depends on the module before unloading it. (2) Some FIPS-mode RHEL setups depend on algif_aead. Validate on a staging node first; until then, the §4 seccomp profile is the safer first move.
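
To confirm the result on a node, the loaded-module list can be read from /proc/modules. The snippet below is a small sketch of that check. It assumes algif_aead is built as a loadable module, since a built-in driver never appears in /proc/modules, and the sample text is illustrative and not taken from a real node.

```python
from pathlib import Path
from typing import Optional, Set


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Extract module names (first column) from /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def algif_aead_loaded(path: str = "/proc/modules") -> Optional[bool]:
    """True/False if the list is readable, None when it is not (e.g. non-Linux host)."""
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    return "algif_aead" in loaded_modules(text)


# Illustrative /proc/modules excerpt.
sample = (
    "algif_aead 16384 0 - Live 0x0000000000000000\n"
    "af_alg 32768 1 algif_aead, Live 0x0000000000000000\n"
    "overlay 155648 12 - Live 0x0000000000000000\n"
)

print(sorted(loaded_modules(sample)))
print("algif_aead loaded on this host:", algif_aead_loaded())
```

A `False` result after the blacklist and `rmmod` shows that the module is currently unloaded. It does not show that the modprobe configuration will hold after a reboot, so the check is worth repeating once the node has restarted.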

5. Detection — Falco, auditd, eBPF, Sigma against the AF_ALG SOCK_SEQPACKET signal

Even on sealed nodes — and especially on un-sealed nodes — detection matters. The high-signal indicator is AF_ALG sockets opened as SOCK_SEQPACKET. Legitimate AF_ALG users (cryptsetup, systemd-cryptsetup, kcapi-*) use SOCK_DGRAM. SOCK_SEQPACKET is therefore an almost-deterministic exploitation precursor.

- macro: known_af_alg_callers
  condition: proc.name in (cryptsetup, systemd-cryptsetup, kcapi-enc, kcapi-dgst, kcapi-rng, kcapi-hasher)

- rule: AF_ALG SEQPACKET Socket — Copy Fail Precursor
  desc: >
    Detects unexpected AF_ALG SOCK_SEQPACKET socket creation —
    the documented prerequisite for CVE-2026-31431 (Copy Fail).
    Legitimate AF_ALG callers use SOCK_DGRAM.
  condition: >
    evt.type = socket and
    socket.domain = AF_ALG and
    socket.type = SOCK_SEQPACKET and
    not known_af_alg_callers
  output: >
    AF_ALG SEQPACKET socket opened by suspicious process
    (user=%user.name pid=%proc.pid comm=%proc.name parent=%proc.pname
     container=%container.name image=%container.image.repository
     k8s_pod=%k8s.pod.name k8s_ns=%k8s.ns.name)
  priority: WARNING
  tags: [cve-2026-31431, copy-fail, lpe, page-cache, T1068, T1611]
  source: syscall

Auditd matches the same signal on socket() entry where a0=38 (AF_ALG) and a1=5 (SOCK_SEQPACKET); audit log records print syscall arguments in hexadecimal, so the same call shows up as a0=26. eBPF EDRs hook tracepoint:syscalls:sys_enter_socket with the same condition; in container environments, record bpf_get_current_pid_tgid() and bpf_get_current_cgroup_id() with each event so that Pod and namespace context reaches the analyst. Elastic's privilege_escalation_potential_copy_fail_cve_2026_31431_exploitation_via_af_alg_socket.toml rule (published early May) expresses the same signal in EQL.
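
For teams that triage raw audit logs, the same condition can be expressed as a small filter over SYSCALL records. The sketch below assumes x86_64 nodes (arch c000003e, socket is syscall 41) and the usual audit record layout with hexadecimal a0..a3 fields. The sample records are constructed for illustration, and the type is masked because flags such as SOCK_CLOEXEC are OR-ed into the second argument.

```python
import re

AF_ALG = 38
SOCK_SEQPACKET = 5
SOCK_TYPE_MASK = 0xF          # low bits carry the type; SOCK_CLOEXEC/SOCK_NONBLOCK sit above
X86_64_ARCH = "c000003e"      # AUDIT_ARCH_X86_64 as printed in audit records
X86_64_SOCKET_NR = "41"       # socket() syscall number on x86_64

FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_record(line: str) -> dict:
    """Split an audit record into key/value pairs, dropping surrounding quotes."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def is_af_alg_seqpacket(line: str) -> bool:
    """True for a socket() SYSCALL record with domain AF_ALG and type SOCK_SEQPACKET."""
    fields = parse_record(line)
    if fields.get("type") != "SYSCALL":
        return False
    if fields.get("arch") != X86_64_ARCH or fields.get("syscall") != X86_64_SOCKET_NR:
        return False
    try:
        domain = int(fields["a0"], 16)   # audit prints arguments in hex
        sock_type = int(fields["a1"], 16)
    except (KeyError, ValueError):
        return False
    return domain == AF_ALG and (sock_type & SOCK_TYPE_MASK) == SOCK_SEQPACKET


# Constructed sample records (not from a real incident).
suspicious = (
    'type=SYSCALL msg=audit(1777500000.101:812): arch=c000003e syscall=41 '
    'success=yes exit=3 a0=26 a1=80005 a2=0 a3=0 items=0 pid=4242 '
    'comm="poc" exe="/tmp/poc"'
)
benign = (
    'type=SYSCALL msg=audit(1777500000.202:813): arch=c000003e syscall=41 '
    'success=yes exit=4 a0=2 a1=1 a2=0 a3=0 items=0 pid=4300 '
    'comm="curl" exe="/usr/bin/curl"'
)

for record in (suspicious, benign):
    fields = parse_record(record)
    print(fields["comm"], is_af_alg_seqpacket(record))
```

Other architectures use different arch values and syscall numbers, so a fleet with arm64 nodes would need those constants extended.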

Layer	Subject	Signal	Strength	Weakness
seccomp	kernel	socket() syscall blocked	block + audit at once	useless if profile not applied
Falco	userspace + eBPF	AF_ALG SOCK_SEQPACKET	Kubernetes context	useless if driver missing
auditd	kernel + auditbeat	socket entry a0=38 a1=5	universally available	high log volume
eBPF EDR	kernel	sys_enter_socket tracepoint	low overhead	kernel version dependence
Sigma	SIEM normalisation	vendor-agnostic ruleset	tool portability	depends on log source

The operational recommendation is seccomp (block) + Falco (detect) + Sigma (SIEM) in parallel: seccomp alone misses unsealed nodes, Falco alone does not stop the attack. ManoIT pushed all three in a single GitOps PR (§8).

6. Impact matrix — distributions, kernel versions, managed-K8s node images

Copy Fail affects every distribution built on a 2017-or-later mainline kernel. The operationally useful matrix is "above which kernel build are you safe?":

Distribution	Affected kernel	Patched kernel (≥)	Errata date	Notes
Ubuntu 24.04 LTS	6.8.x	USN-7250-1 build	2026-05-02	HWE kernel patched the same day
Ubuntu 26.04	—	not affected	—	fix back-ported pre-disclosure
RHEL 10.1	kernel-6.12.x	RHSA-2026:1xxx	2026-05-06 (staged)	FIPS-mode errata separate
AlmaLinux 10.1	kernel-6.12.x	own build	2026-05-01	shipped ahead of RHSA
SUSE 16 SP0	6.13.x	SUSE-SU-2026:1xxx	2026-05-05	SLE Micro patched same day
Amazon Linux 2023	6.1.x AL2023	kernel-6.1.x AL2023 fix	2026-05-04	EKS AMI rotates separately
Debian 12 (bookworm)	6.1.x	DSA-2026-xxxx	2026-05-03	backports channel first
Arch Linux	6.14.x	core/linux 6.14.x-fix	2026-04-30	rolling, fastest
EKS AMI (AL2023)	as above	EKS Optimized AMI 2026.05	2026-05-05–07	Karpenter NodePool drift
GKE COS	COS 121 LTS	COS 121-x.y.z fix	2026-05-04–06	auto-upgrade rollout

Two operational notes. (1) RHEL 10.1 publishes errata in stages: the first batch covers the general-purpose kernel, and the FIPS-mode and kpatch live-patch channels follow a few days later. (2) EKS and GKE publish rotated node images automatically, but Karpenter NodePools and ASGs pick up a new image only when they launch a node, so existing nodes keep the old kernel until they are replaced. A forced rolling replace is the final step of patch landing.

7. The four-window operational checklist

The checklist below mirrors what ManoIT SecOps executed across 17 clusters between April 30 and May 7. The point is to keep nodes safe while patches are still in flight.

Window	Action	Owner	Done signal
0–2h	Inventory node OS/kernel versions; audit any legitimate AF_ALG consumers	SecOps	inventory sheet + zero AF_ALG workloads confirmed
0–2h	Roll out Falco/auditd rules first — detection before blocking	SecOps + SRE	rule active on every node, firing on a benign PoC
2–24h	Deploy seccomp profile + Kyverno enforce	SRE	100% of new Pods carry the profile
2–24h	Temporarily shrink non-essential privileged DaemonSets	platform team	optional hostPath/privileged workloads paused
24–72h	Apply distro patches — canary → 25% → 100%	SRE	≥95% nodes on patched kernel, residue quarantined
24–72h	Rotate managed-K8s node images (AMI/COS), force Karpenter/ASG drift	platform team	all nodes on patched image
Week 1	Decide whether to keep `algif_aead` blacklisted permanently	SecOps	validation complete, decision recorded
Week 1	Post-mortem — patch-gap timing, detection stats, residual risk	SecOps lead	one-page board/security-committee report

The fourth row (shrinking privileged DaemonSets) drew the longest debate. CNI and CSI cannot go, but non-essential debug DaemonSets (strace shells, hostPath ad-hoc workloads) can usually be paused for 24 hours. Reducing exposure within the patch gap is exactly the operational decision worth that debate.
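
The "≥95% nodes on patched kernel, residue quarantined" done signal from the 24–72h window is easy to compute from a node inventory. The sketch below assumes the inventory is a mapping from node name to kernel release string and that the operator supplies the set of releases considered patched. All node names and release strings in it are placeholders, not real errata versions.

```python
from typing import Dict, List, Set, Tuple


def patch_coverage(
    node_kernels: Dict[str, str], patched_releases: Set[str]
) -> Tuple[float, List[str]]:
    """Return (fraction of nodes on a patched kernel, nodes still unpatched)."""
    residue = sorted(
        node for node, release in node_kernels.items()
        if release not in patched_releases
    )
    total = len(node_kernels)
    ratio = (total - len(residue)) / total if total else 0.0
    return ratio, residue


def rollout_done(
    node_kernels: Dict[str, str],
    patched_releases: Set[str],
    threshold: float = 0.95,
) -> bool:
    """Done signal: coverage at or above the threshold (residue is handled separately)."""
    ratio, _ = patch_coverage(node_kernels, patched_releases)
    return ratio >= threshold


# Placeholder inventory: 19 nodes on a patched build, 1 node left behind.
inventory = {f"node-{i:02d}": "kernel-patched-build" for i in range(19)}
inventory["node-19"] = "kernel-old-build"
patched = {"kernel-patched-build"}

ratio, residue = patch_coverage(inventory, patched)
print(f"coverage={ratio:.0%} residue={residue} done={rollout_done(inventory, patched)}")
```

Matching on exact release strings keeps the check honest across distributions, where version ordering is not comparable. The nodes returned as residue are the ones to quarantine.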

8. ManoIT retrospective — what worked, what stuck

Decision	Outcome	Note
Falco rules first (within 2h)	success	two clusters had legitimate SOCK_DGRAM mis-categorised as SOCK_SEQPACKET — macro tightened
seccomp Localhost profile on nodes	success	kubelet `--root-dir` varies per node; standardised with Ansible
Kyverno enforce all at once	fail → retry	some helm charts had no `seccompProfile`; audit mode for 24h, then enforce
`algif_aead` blacklist	partial	two LUKS-using nodes paused, the other 15 made permanent
Karpenter NodePool drift rotation	success	`nodepool.spec.disruption.budgets` capped hourly rotation; 100% on patched AMI in 8h
Managed K8s image lag	bottleneck	EKS AL2023 AMI shipped May 5; the April 30–May 5 patch gap was bridged by seccomp
FIPS RHEL nodes	delay	FIPS-mode errata trailed the general one by ~3 days

The biggest lesson was that the patch gap is not a policy gap but a procedure gap. The mainline fix landed on April 1, but a mainline merge is not a patched node, and ManoIT pushed policy only after the April 29 disclosure. The highest-ROI security posture is to modularise the disclosure-day mitigations ahead of time: a seccomp profile, a Falco macro, and a Kyverno policy are exactly that module.

9. Closing — "syscall sealing" as the new node-security baseline

Copy Fail is a 2017 optimisation that took nine years to break. More importantly, it is a PoC-validated demonstration that container isolation can be unwound through a kernel-global resource (the page cache) on three major managed-K8s platforms. Patches are landing. The bigger shift is that the operational baseline needs to move: "syscalls most container workloads do not use should be sealed by default", starting with AF_ALG, then extending to io_uring, perf_event_open, userfaultfd, and bpf. ManoIT made AF_ALG sealing a permanent part of the standard seccomp profile and put io_uring and userfaultfd sealing on next quarter's agenda. Node security is no longer a race to patch the kernel faster; it is a steady operational practice of shrinking the surface that touches the kernel. Copy Fail is the LPE that pins that one line at the top of the 2026-H1 security backlog.

Cross-posted from ManoIT. Authored by Claude (Opus 4.6), edited and technically reviewed by ManoIT.

Originally published at ManoIT Tech Blog.
