<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TooFastTooCurious</title>
    <description>The latest articles on DEV Community by TooFastTooCurious (@toofasttoocurious).</description>
    <link>https://dev.to/toofasttoocurious</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845394%2Fb319af4c-834d-4d05-ab08-2fecd8e5fa4e.jpeg</url>
      <title>DEV Community: TooFastTooCurious</title>
      <link>https://dev.to/toofasttoocurious</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/toofasttoocurious"/>
    <language>en</language>
    <item>
      <title>Building Runtime Enforcement for Kubernetes with eBPF</title>
      <dc:creator>TooFastTooCurious</dc:creator>
      <pubDate>Tue, 14 Apr 2026 14:54:45 +0000</pubDate>
      <link>https://dev.to/toofasttoocurious/building-runtime-enforcement-for-kubernetes-with-ebpf-25ll</link>
      <guid>https://dev.to/toofasttoocurious/building-runtime-enforcement-for-kubernetes-with-ebpf-25ll</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href="https://juliet.sh/blog/building-runtime-enforcement-for-kubernetes-with-ebpf" rel="noopener noreferrer"&gt;Juliet Security blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most Kubernetes security tools stop at scan time. They'll flag a critical CVE in a container image or complain that a pod runs as root. What they won't do is tell you that someone just spawned a shell in your production namespace, opened a connection to a mining pool, or loaded a kernel module to break out of the container sandbox.&lt;/p&gt;

&lt;p&gt;Juliet started as a graph-based security platform. We map attack paths, score blast radius, prioritize findings. Useful stuff. But customers kept circling back to the same ask: can you actually stop the bad thing, or just tell me it happened?&lt;/p&gt;

&lt;p&gt;So we built runtime enforcement. This post walks through the design, the tradeoffs we made, and the production incident that changed how we think about safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replacing Falco
&lt;/h2&gt;

&lt;p&gt;We started with Falco as a sidecar. It watches syscalls through eBPF, writes alerts to a FIFO pipe, and our node-agent reads from the other end of that pipe.&lt;/p&gt;

&lt;p&gt;The pipe was the problem. If our agent started before Falco, the pipe didn't exist yet. If Falco restarted, the pipe broke. If the pipe filled up because our reader fell behind, Falco would block. We burned more hours managing that pipe than we spent building actual security features.&lt;/p&gt;

&lt;p&gt;On top of that, Falco's rule language was too coarse for what we needed. We wanted to match events against customer-defined policies with namespace scoping, image pattern matching, and per-process exception lists. Translating between our internal policy model and Falco's YAML rules created a fragile middle layer that broke in subtle ways.&lt;/p&gt;

&lt;p&gt;We ripped it out and embedded the eBPF sensor directly in our Go agent using &lt;a href="https://github.com/cilium/ebpf" rel="noopener noreferrer"&gt;cilium/ebpf&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we trace
&lt;/h2&gt;

&lt;p&gt;We hook 22 syscalls across five categories; the table below shows the main ones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Syscalls&lt;/th&gt;
&lt;th&gt;What we catch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Process execution&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;execve&lt;/code&gt;, &lt;code&gt;execveat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Shells, exploit toolkits, crypto miners&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File access&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;openat&lt;/code&gt;, &lt;code&gt;unlinkat&lt;/code&gt;, &lt;code&gt;memfd_create&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Reads of /etc/shadow, log deletion, fileless payloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;connect&lt;/code&gt;, &lt;code&gt;listen&lt;/code&gt;, &lt;code&gt;accept4&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;C2 callbacks, cloud metadata grabs, rogue listeners&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container escape&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ptrace&lt;/code&gt;, &lt;code&gt;mount&lt;/code&gt;, &lt;code&gt;setns&lt;/code&gt;, &lt;code&gt;unshare&lt;/code&gt;, &lt;code&gt;init_module&lt;/code&gt;, &lt;code&gt;finit_module&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Namespace tricks, host filesystem mounts, module loading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privilege escalation&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;chmod&lt;/code&gt;, &lt;code&gt;fchmodat&lt;/code&gt;, &lt;code&gt;capset&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Setuid flips, capability changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each tracepoint handler writes a fixed 304-byte struct into a 2MB ring buffer. The struct uses a C union for the syscall-specific payload (file path, network address, or process metadata), so every event is identical in size regardless of type. This keeps the ring buffer math simple and avoids variable-length parsing on the hot path.&lt;/p&gt;
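
&lt;p&gt;A userspace mirror of that struct can be sketched in Go. The field names and the payload split are illustrative (the real layout lives in a C header shared with the BPF program), but the shape shows why fixed-size events keep the ring buffer math trivial:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unsafe"
)

// rawEvent is a userspace mirror of the fixed-size C struct the BPF program
// writes into the ring buffer. Field names and the payload split are
// illustrative; the real layout is defined in a C header shared with BPF.
type rawEvent struct {
	Timestamp uint64
	CgroupID  uint64
	Pid       uint32
	SyscallNr uint32
	Uid       uint32
	Gid       uint32
	Comm      [16]byte  // process name, NUL-padded
	Payload   [256]byte // union area: file path, sockaddr, or exec metadata
}

// eventSize is the fixed per-event size; with this layout it is 304 bytes.
func eventSize() uintptr {
	return unsafe.Sizeof(rawEvent{})
}

// comm extracts the NUL-terminated process name.
func (e *rawEvent) comm() string {
	for i, b := range e.Comm {
		if b == 0 {
			return string(e.Comm[:i])
		}
	}
	return string(e.Comm[:])
}

func main() {
	var e rawEvent
	copy(e.Comm[:], "bash")
	e.SyscallNr = 59 // execve
	fmt.Println(eventSize(), e.comm()) // prints: 304 bash
}
```

&lt;p&gt;Because every event is the same size, the reader can advance through the buffer by a constant stride and cast each slot directly, with no length prefix to parse.&lt;/p&gt;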

&lt;h2&gt;
  
  
  Filtering where it matters: in the kernel
&lt;/h2&gt;

&lt;p&gt;This was the single best decision we made. Instead of sending every syscall event to userspace and filtering there, we filter inside the BPF program using two maps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;monitored_syscalls&lt;/code&gt;&lt;/strong&gt;: a hash map of syscall numbers that active policies actually care about. If nobody has a network policy enabled, &lt;code&gt;connect&lt;/code&gt; and &lt;code&gt;listen&lt;/code&gt; events never leave the kernel. When a customer toggles policies on or off, we update this map and the change takes effect on the next syscall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;container_cgroups&lt;/code&gt;&lt;/strong&gt;: a fast lookup by cgroup ID to decide whether a process belongs to a monitored container. For runtimes we haven't populated the map for, we fall back to checking PID namespace depth (&lt;code&gt;task-&amp;gt;nsproxy-&amp;gt;pid_ns_for_children-&amp;gt;level &amp;gt; 0&lt;/code&gt;). Containers always have level &amp;gt; 0; host processes sit at level 0. This works across Docker, containerd, and CRI-O without any userspace coordination.&lt;/p&gt;
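
&lt;p&gt;The in-kernel check itself is C, but its decision logic is small enough to sketch in Go. Plain maps stand in for the two BPF maps, and the syscall numbers are x86-64 (&lt;code&gt;execve&lt;/code&gt; is 59, &lt;code&gt;execveat&lt;/code&gt; is 322, &lt;code&gt;connect&lt;/code&gt; is 42):&lt;/p&gt;

```go
package main

import "fmt"

// Userspace sketch of the in-kernel filter. The real check runs in the BPF
// program against two BPF maps; plain Go maps stand in for them here.
var (
	monitoredSyscalls = map[uint32]bool{59: true, 322: true} // execve, execveat
	containerCgroups  = map[uint64]bool{7321: true}          // cgroup IDs of monitored containers
)

// shouldEmit decides whether an event ever leaves the kernel. pidNSLevel is
// the PID-namespace depth fallback: containers sit at level 1 or deeper,
// host processes at level 0.
func shouldEmit(syscallNr uint32, cgroupID uint64, pidNSLevel int) bool {
	if !monitoredSyscalls[syscallNr] {
		return false // no active policy cares about this syscall
	}
	if containerCgroups[cgroupID] {
		return true // known monitored container
	}
	return pidNSLevel > 0 // fallback: any containerized process
}

func main() {
	fmt.Println(shouldEmit(59, 7321, 1)) // true: execve in a monitored container
	fmt.Println(shouldEmit(42, 7321, 1)) // false: connect with no network policy active
}
```

&lt;p&gt;Toggling a policy is then just inserting or deleting keys in &lt;code&gt;monitored_syscalls&lt;/code&gt;; nothing needs to be reloaded.&lt;/p&gt;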

&lt;p&gt;The payoff: overhead scales with the number of policies you enable, not the number of syscalls we could theoretically trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning PIDs into something useful
&lt;/h2&gt;

&lt;p&gt;A raw eBPF event gives you a PID and a 16-character process name. That's not enough to make a security decision. You need the container name, the pod, the namespace, the image, and the service account.&lt;/p&gt;

&lt;p&gt;We use three caches that each pull from a different source:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PID LRU&lt;/strong&gt; reads &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/cgroup&lt;/code&gt; to get the container ID. 10K entries, 5-minute TTL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRI cache&lt;/strong&gt; talks to containerd over gRPC and watches container start/stop events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8s cache&lt;/strong&gt; watches the pod API for the local node.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If one cache goes down, the other two still contribute what they can. If all three are broken, events still carry the PID, container ID, and process name from the kernel. We never stall the pipeline waiting for metadata. An event with partial enrichment moves through and the policy matcher treats it conservatively (no enforcement on events we can't fully identify).&lt;/p&gt;
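
&lt;p&gt;A sketch of the merge, with function arguments standing in for the three caches (the real implementations read &lt;code&gt;/proc&lt;/code&gt;, talk gRPC, and watch the API server; any of them may miss):&lt;/p&gt;

```go
package main

import "fmt"

// Enrichment carries whatever metadata the three caches could resolve for a
// PID. Empty fields mean that layer missed; the event still flows through.
type Enrichment struct {
	ContainerID string // PID LRU: read from /proc/<pid>/cgroup
	PodName     string // CRI cache: containerd over gRPC
	Image       string
	Namespace   string // K8s cache: pod watch for the local node
}

// enrich merges the layers; each one fills in what it knows.
func enrich(pid uint32, containerOf func(uint32) string, podOf func(string) (string, string), nsOf func(string) string) Enrichment {
	var e Enrichment
	e.ContainerID = containerOf(pid)
	if e.ContainerID != "" {
		e.PodName, e.Image = podOf(e.ContainerID)
	}
	if e.PodName != "" {
		e.Namespace = nsOf(e.PodName)
	}
	return e
}

// fullyIdentified gates enforcement: the policy matcher never enforces on an
// event it cannot fully attribute.
func (e Enrichment) fullyIdentified() bool {
	return e.ContainerID != "" && e.Namespace != ""
}

func main() {
	e := enrich(4242,
		func(uint32) string { return "abc123" },
		func(string) (string, string) { return "web-0", "nginx:1.27" },
		func(string) string { return "production" })
	fmt.Println(e.Namespace, e.fullyIdentified())
}
```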

&lt;h2&gt;
  
  
  Matching policies fast
&lt;/h2&gt;

&lt;p&gt;Every two minutes, the agent syncs policies from the API and compiles them into a lookup structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;CompiledPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;SyscallSet&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="m"&gt;59&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;322&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;     &lt;span class="c"&gt;// execve, execveat&lt;/span&gt;
    &lt;span class="n"&gt;ProcessNames&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"bash"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sh"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;PathPrefixes&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"/tmp/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/var/run/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;NetCIDRs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;169.254.169.254&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;Scope&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;         &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;IncludeNamespaces&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"production"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="n"&gt;Exceptions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="n"&gt;process_name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"nginx"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Policies are bucketed by syscall category. When an event comes in, we look up its category (derived from the syscall number), get the handful of candidate policies (usually 3-8), and check each one. The hot path uses pre-allocated maps and does zero heap allocation.&lt;/p&gt;

&lt;p&gt;If two policies both match and one says "alert" while the other says "kill", the kill wins. We always pick the highest-severity enforce-mode match.&lt;/p&gt;
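
&lt;p&gt;The resolution rule itself is tiny. A sketch, with an ordered action enum so "highest severity wins" is a plain comparison (type names are illustrative):&lt;/p&gt;

```go
package main

import "fmt"

// Action severity, ordered: enforce-mode (Kill) outranks everything.
type Action int

const (
	Audit Action = iota
	Alert
	Kill
)

// Match is one policy that matched the event.
type Match struct {
	PolicyID string
	Action   Action
}

// resolve picks the single action to take when several policies match the
// same event: the highest-severity match always wins.
func resolve(matches []Match) Match {
	best := Match{Action: Audit}
	for _, m := range matches {
		if m.Action > best.Action {
			best = m
		}
	}
	return best
}

func main() {
	m := resolve([]Match{{"shell-in-prod", Alert}, {"miner-exec", Kill}})
	fmt.Println(m.PolicyID) // prints: miner-exec
}
```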

&lt;h2&gt;
  
  
  Why we kill instead of block
&lt;/h2&gt;

&lt;p&gt;We enforce by sending &lt;code&gt;SIGKILL&lt;/code&gt; from userspace. The alternative is BPF LSM, where the eBPF program returns &lt;code&gt;-EPERM&lt;/code&gt; and the kernel refuses the syscall before it completes.&lt;/p&gt;

&lt;p&gt;LSM is objectively better at prevention. But we chose kill for three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portability.&lt;/strong&gt; BPF LSM requires kernel 5.7+ with &lt;code&gt;CONFIG_BPF_LSM=y&lt;/code&gt;. A lot of production clusters still run Amazon Linux 2 or RHEL 8. We didn't want to cut out half our addressable market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure mode.&lt;/strong&gt; If an LSM policy has a bug and matches kubelet or containerd, the node goes down. You can't start new pods, can't pull images, can't recover without SSH access. With SIGKILL, the worst case is that a process dies and the kubelet restarts it. Annoying, but the node stays up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We tested the failure mode on ourselves.&lt;/strong&gt; Not on purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we broke staging
&lt;/h2&gt;

&lt;p&gt;Three weeks into our enforcement beta, we turned on enforce mode in staging. Within minutes, Harbor (our container registry) started throwing 500 errors. Pulls failed. Deployments queued up. The cluster ground to a halt.&lt;/p&gt;

&lt;p&gt;Here's what happened: we had a policy that flags processes running as root. That's a reasonable thing to detect. But our enforcement engine applied it globally, across every namespace on the node. Harbor's Postgres process runs as root. So does Cilium's agent. So does RabbitMQ. The enforcement engine dutifully killed all of them.&lt;/p&gt;

&lt;p&gt;We turned enforcement off, traced the kills in our metrics, and realized the fix was obvious in hindsight: enforcement needs to be scoped to specific namespaces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;isInScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="n"&gt;CompiledScope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IncludeNamespaces&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IncludeNamespaces&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExcludeNamespaces&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExcludeNamespaces&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three rules came out of that incident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If an event has no namespace metadata (enrichment failed or it's a host process), never enforce. Default to audit.&lt;/li&gt;
&lt;li&gt;If a namespace isn't in the policy's scope, downgrade from kill to audit. Still record the event, just don't act on it.&lt;/li&gt;
&lt;li&gt;The UI now requires you to specify at least one namespace when you set a policy to enforce mode. No more global enforcement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Seven things we check before every kill
&lt;/h2&gt;

&lt;p&gt;After the Harbor mess, we added layers of protection to the response actor. Every kill request goes through all seven:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No container ID, no kill.&lt;/strong&gt; If we can't confirm it's a container process, we leave it alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulate mode.&lt;/strong&gt; Logs what would happen without sending the signal. You should always run a new policy in simulate for a few days first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protected namespaces.&lt;/strong&gt; &lt;code&gt;kube-system&lt;/code&gt; is off-limits by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PID 0 and PID 1 are untouchable.&lt;/strong&gt; We will never kill init.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-preservation.&lt;/strong&gt; The agent will not kill its own process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting per pod.&lt;/strong&gt; 10 kills per pod in a 60-second window. After that, we stop and flag it. This prevents kill-restart spirals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Namespace scope.&lt;/strong&gt; The policy must explicitly include the event's namespace.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each attempt gets tagged with a result code: &lt;code&gt;killed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;skipped_namespace&lt;/code&gt;, &lt;code&gt;skipped_pid1&lt;/code&gt;, &lt;code&gt;suppressed&lt;/code&gt;, or &lt;code&gt;simulated&lt;/code&gt;. All of these show up in Prometheus, so you can see exactly what enforcement did on every node.&lt;/p&gt;
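
&lt;p&gt;A sketch of the guard chain as a single decision function. Types are hypothetical; the result-code strings follow the list above where the post names them, and are otherwise illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// KillRequest is what the response actor receives for an enforce-mode match.
type KillRequest struct {
	Pid         int
	ContainerID string
	Namespace   string
	PodKey      string // namespace/pod, used for rate limiting
}

type guardState struct {
	simulate    bool
	protectedNS map[string]bool
	selfPid     int
	killsPerPod map[string]int // kills within the current 60s window
}

// check runs the guard chain and returns the result code reported to
// Prometheus. Cheap structural checks run first; simulate is checked last so
// a simulated run reports the same skips a real run would.
func (g *guardState) check(r KillRequest, inScope bool) string {
	switch {
	case r.ContainerID == "":
		return "skipped_no_container" // can't confirm it's a container process
	case g.protectedNS[r.Namespace]:
		return "skipped_namespace"
	case r.Pid == 0 || r.Pid == 1:
		return "skipped_pid1" // never kill init
	case r.Pid == g.selfPid:
		return "skipped_self" // self-preservation
	case g.killsPerPod[r.PodKey] >= 10:
		return "suppressed" // kill-restart spiral protection
	case !inScope:
		return "skipped_namespace"
	case g.simulate:
		return "simulated"
	}
	g.killsPerPod[r.PodKey]++
	// here the real agent sends the signal, e.g. syscall.Kill(r.Pid, syscall.SIGKILL)
	return "killed"
}

func main() {
	g := guardState{protectedNS: map[string]bool{"kube-system": true}, selfPid: 999, killsPerPod: map[string]int{}}
	fmt.Println(g.check(KillRequest{Pid: 1234, ContainerID: "abc", Namespace: "production", PodKey: "production/web-0"}, true))
}
```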

&lt;h2&gt;
  
  
  Moving events without drowning
&lt;/h2&gt;

&lt;p&gt;A busy node can produce thousands of syscall events per second. Sending each one to the API individually would saturate the network and hammer ClickHouse. So we built a five-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sensor (ring buffer, polls every 100ms)
  -&amp;gt; Response Actor (kill decisions happen here, &amp;lt; 200ms)
    -&amp;gt; Coalescer (groups by rule+container+process, 5s window)
      -&amp;gt; Batcher (flushes at 500 events or 5s, whichever hits first)
        -&amp;gt; Forwarder (gzip, retry with backoff, disk spool if API is down)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important thing: kills happen in stage 2. We don't batch enforcement. If a process needs to die, it dies within 200ms of the syscall, not after a 5-second batch window.&lt;/p&gt;

&lt;p&gt;Coalescing cuts volume by 10x to 100x on noisy workloads. If &lt;code&gt;bash&lt;/code&gt; keeps spawning in the same container and hitting the same policy, we collapse 100 events into one record with &lt;code&gt;event_count: 100&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If the API goes offline, the forwarder writes batches to a local disk spool (capped at 100MB, oldest files evicted first). When the API comes back, a drain loop picks up the files and replays them. We'd rather lose some events than let backpressure freeze the enforcement path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling different kernels
&lt;/h2&gt;

&lt;p&gt;eBPF with CO-RE needs BTF data. Modern kernels (5.8+) ship it at &lt;code&gt;/sys/kernel/btf/vmlinux&lt;/code&gt;. Plenty of production kernels don't.&lt;/p&gt;

&lt;p&gt;Our fallback chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use host kernel BTF if it exists&lt;/li&gt;
&lt;li&gt;Try an embedded BTFhub archive that matches the kernel release&lt;/li&gt;
&lt;li&gt;If nothing works, run in status-only mode. The agent reports its health and syncs policies, but doesn't hook any syscalls.&lt;/li&gt;
&lt;/ol&gt;
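
&lt;p&gt;The chain reduces to a small decision function. A sketch, with the host-BTF check and the BTFhub lookup injected as arguments so the fallback order is explicit (names are illustrative):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"os"
)

type Mode int

const (
	FullSensor Mode = iota // eBPF hooks attached, CO-RE relocated against BTF
	StatusOnly             // no usable BTF: report health and sync policies only
)

// pickBTF walks the fallback chain. hostBTF says whether
// /sys/kernel/btf/vmlinux exists; hasEmbedded stands in for a lookup into the
// bundled BTFhub archive, keyed by the kernel release string.
func pickBTF(kernelRelease string, hostBTF bool, hasEmbedded func(string) bool) (Mode, string) {
	if hostBTF {
		return FullSensor, "host"
	}
	if hasEmbedded(kernelRelease) {
		return FullSensor, "btfhub"
	}
	return StatusOnly, "none"
}

func main() {
	_, err := os.Stat("/sys/kernel/btf/vmlinux")
	mode, src := pickBTF("4.18.0-477.el8.x86_64", err == nil, func(rel string) bool { return false })
	fmt.Println(mode, src)
}
```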

&lt;p&gt;ARM64 adds another wrinkle. Those kernels don't have &lt;code&gt;dup2&lt;/code&gt; or &lt;code&gt;chmod&lt;/code&gt; as separate syscalls; they use &lt;code&gt;dup3&lt;/code&gt; and &lt;code&gt;fchmodat&lt;/code&gt; instead. We attach tracepoints on a best-effort basis: skip what's missing, log a warning, only bail out if literally nothing attaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers look like
&lt;/h2&gt;

&lt;p&gt;50 pods on a node, all 40 built-in policies active in audit mode:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (steady state)&lt;/td&gt;
&lt;td&gt;200-300 mCPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;500-800 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw events per second&lt;/td&gt;
&lt;td&gt;50-200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After coalescing&lt;/td&gt;
&lt;td&gt;5-20 per second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network to API&lt;/td&gt;
&lt;td&gt;50-500 KB every 5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time from syscall to ClickHouse&lt;/td&gt;
&lt;td&gt;5-11 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time from syscall to kill&lt;/td&gt;
&lt;td&gt;under 200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Storage latency is deliberately higher than enforcement latency. Killing a process can't wait for batch compression. Writing it to a database can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things we'd change
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scope enforcement from day one.&lt;/strong&gt; Global enforcement without namespace scoping cost us a staging outage and a scramble to patch. If you're building enforcement for anything, make scope a required field before you write your first kill call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move coalescing earlier for audit-only events.&lt;/strong&gt; Right now every event hits the response actor, even if it's just going to be logged. For audit policies, we could coalesce first and skip the per-event response check entirely. That would cut CPU on nodes with chatty workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ship a heartbeat from the start.&lt;/strong&gt; For months we inferred agent health from when it last uploaded an SBOM or synced policies. If a node had no new images and runtime was off, the agent looked dead even though it was fine. A 60-second heartbeat ping would have saved us a lot of false alarms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;We're looking at BPF LSM as an opt-in mode for clusters running kernel 5.7+. SIGKILL handles most cases well, but some compliance regimes want proof that the syscall was blocked, not just that the process was terminated afterward.&lt;/p&gt;

&lt;p&gt;We're also wiring up alert routing so enforcement events go straight to Slack and PagerDuty instead of sitting in a dashboard waiting to be noticed.&lt;/p&gt;

&lt;p&gt;Building runtime enforcement changed Juliet from a scanner into a platform. It also taught us more about production safety than anything else we've shipped. If you're curious, &lt;a href="https://juliet.sh" rel="noopener noreferrer"&gt;juliet.sh&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Questions about any of the runtime stuff? Reach us at &lt;a href="mailto:contact@juliet.sh"&gt;contact@juliet.sh&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>ebpf</category>
      <category>devops</category>
    </item>
    <item>
      <title>Axios was compromised for 3 hours - how to find it in your running Kubernetes clusters</title>
      <dc:creator>TooFastTooCurious</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:32:15 +0000</pubDate>
      <link>https://dev.to/toofasttoocurious/axios-was-compromised-for-3-hours-how-to-find-it-in-your-running-kubernetes-clusters-dfj</link>
      <guid>https://dev.to/toofasttoocurious/axios-was-compromised-for-3-hours-how-to-find-it-in-your-running-kubernetes-clusters-dfj</guid>
      <description>&lt;p&gt;On March 31, 2026, a compromised maintainer account was used to publish two malicious versions of &lt;a href="https://github.com/axios/axios" rel="noopener noreferrer"&gt;axios&lt;/a&gt;, the most popular JavaScript HTTP client on npm with over 100 million weekly downloads. Versions 1.14.1 and 0.30.4 contained a hidden dependency that deployed a cross-platform remote access trojan (RAT) to any machine that ran &lt;code&gt;npm install&lt;/code&gt; during a three-hour window.&lt;/p&gt;

&lt;p&gt;The malicious versions were pulled from npm by 03:29 UTC. But npm lockfiles only protect your source repos. If a container image was built during that window, the compromised package is baked into the image and running in your cluster right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;The attacker gained publishing access to the official axios npm package, likely through a compromised maintainer account. Instead of modifying axios source code directly, they added a malicious dependency — &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt; — to the package.json. That package had a "clean" version published 18 hours earlier to establish a plausible history on the registry.&lt;/p&gt;

&lt;p&gt;On &lt;code&gt;npm install&lt;/code&gt;, the malicious package ran a &lt;code&gt;postinstall&lt;/code&gt; hook that executed a double-obfuscated dropper script. The dropper detected the host OS, downloaded a platform-specific RAT from a C2 server at &lt;code&gt;sfrclak[.]com:8000&lt;/code&gt;, and then deleted all traces of the postinstall script.&lt;/p&gt;

&lt;p&gt;The RAT capabilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS:&lt;/strong&gt; Binary at &lt;code&gt;/Library/Caches/com.apple.act.mond&lt;/code&gt; disguised as an Apple daemon. Accepts commands for arbitrary binary injection, shell execution, and filesystem enumeration. Beacons every 60 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows:&lt;/strong&gt; PowerShell RAT disguised as Windows Terminal at &lt;code&gt;%PROGRAMDATA%\wt.exe&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux:&lt;/strong&gt; Python RAT at &lt;code&gt;/tmp/ld.py&lt;/code&gt; launched as an orphaned background process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why your lockfile isn't enough
&lt;/h2&gt;

&lt;p&gt;Most of the incident response guidance focuses on checking lockfiles and running &lt;code&gt;snyk test&lt;/code&gt; against your source repository. That's necessary but incomplete.&lt;/p&gt;

&lt;p&gt;The gap: container images. If any image in your cluster was built between 00:21 and 03:29 UTC on March 31, the build may have pulled axios 1.14.1 or 0.30.4. That image is now running in your cluster with the RAT baked in, regardless of whether you've since fixed your lockfile.&lt;/p&gt;

&lt;p&gt;This matters because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipelines that build images overnight&lt;/strong&gt; in UTC-aligned schedules were squarely in the window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-stage Docker builds&lt;/strong&gt; that run &lt;code&gt;npm install&lt;/code&gt; without a committed lockfile (or with &lt;code&gt;npm install&lt;/code&gt; instead of &lt;code&gt;npm ci&lt;/code&gt;) would have pulled the latest malicious version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images already deployed don't get rescanned&lt;/strong&gt; unless you explicitly trigger it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Checking your source repo is step one. Checking what's actually running in your clusters is step two, and most organizations skip it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to check your Kubernetes clusters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Find affected images
&lt;/h3&gt;

&lt;p&gt;If you generate SBOMs from your container images (via Syft, Trivy, or similar), query them for the compromised versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scan a running image for the compromised package&lt;/span&gt;
grype &amp;lt;image&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"axios.*1&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;14&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;1|axios.*0&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;30&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;4|plain-crypto-js"&lt;/span&gt;

&lt;span class="c"&gt;# Or generate an SBOM and search it&lt;/span&gt;
syft &amp;lt;image&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.artifacts[] | select(.name == "axios" and (.version == "1.14.1" or .version == "0.30.4"))'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
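
&lt;p&gt;If you'd rather script the check across many images, the same filter works in Go. The field names follow syft's JSON output (a top-level &lt;code&gt;artifacts&lt;/code&gt; array with &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;version&lt;/code&gt;):&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// artifact holds the two fields we need from syft's JSON SBOM output.
type artifact struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

type sbom struct {
	Artifacts []artifact `json:"artifacts"`
}

// isCompromised flags the known-bad versions. plain-crypto-js is flagged at
// any version, since the package only existed to stage this attack.
func isCompromised(name, version string) bool {
	switch name {
	case "axios":
		return version == "1.14.1" || version == "0.30.4"
	case "plain-crypto-js":
		return true
	}
	return false
}

// findCompromised scans one SBOM document for the bad packages.
func findCompromised(sbomJSON []byte) ([]artifact, error) {
	var s sbom
	if err := json.Unmarshal(sbomJSON, &s); err != nil {
		return nil, err
	}
	var hits []artifact
	for _, a := range s.Artifacts {
		if isCompromised(a.Name, a.Version) {
			hits = append(hits, a)
		}
	}
	return hits, nil
}

func main() {
	doc := []byte(`{"artifacts":[{"name":"axios","version":"1.14.1"},{"name":"left-pad","version":"1.3.0"}]}`)
	hits, _ := findCompromised(doc)
	fmt.Println(hits) // prints: [{axios 1.14.1}]
}
```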



&lt;h3&gt;
  
  
  2. Check for the RAT indicators on nodes
&lt;/h3&gt;

&lt;p&gt;If you have node-level access, check for the platform-specific IOCs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux nodes — check for the Python RAT&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; name | xargs &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;{}&lt;/span&gt; kubectl debug &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;--&lt;/span&gt; find /tmp &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"ld.py"&lt;/span&gt;

&lt;span class="c"&gt;# Check for outbound connections to the C2&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; name | xargs &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;{}&lt;/span&gt; kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"cat /proc/net/tcp 2&amp;gt;/dev/null | grep '&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s2"&gt;"%X"&lt;/span&gt; 142.11.206.73 | &lt;span class="nb"&gt;fold&lt;/span&gt; &lt;span class="nt"&gt;-w2&lt;/span&gt; | &lt;span class="nb"&gt;tac&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;'"&lt;/span&gt; 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Check network policies for C2 egress
&lt;/h3&gt;

&lt;p&gt;The RAT beacons to &lt;code&gt;142.11.206.73:8000&lt;/code&gt;. If you have network policy enforcement (Cilium, Calico), check whether any pod has made outbound connections to that IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# If using Cilium with Hubble&lt;/span&gt;
hubble observe &lt;span class="nt"&gt;--to-ip&lt;/span&gt; 142.11.206.73 &lt;span class="nt"&gt;--verdict&lt;/span&gt; FORWARDED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Block the compromised package at admission
&lt;/h3&gt;

&lt;p&gt;If you run an admission controller with OPA policies, add a rule to reject images containing the compromised dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Pod"&lt;/span&gt;
    &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Flag images built during the compromise window&lt;/span&gt;
    &lt;span class="c1"&gt;# (requires SBOM-aware admission — see your scanner's docs)&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Container %s may contain compromised axios — verify image SBOM"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Or skip the manual work
&lt;/h3&gt;

&lt;p&gt;The steps above work, but they're per-image and per-node. If you're running dozens of namespaces across multiple clusters, doing this manually doesn't scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://juliet.sh?ref=devto" rel="noopener noreferrer"&gt;Juliet&lt;/a&gt; continuously generates SBOMs from every container image running in your clusters and builds a graph of how vulnerabilities connect to workloads, RBAC permissions, network policies, and secrets. For an incident like this, you open Explorer and type what you're looking for in plain English:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv0z559qmukqsj4cgijw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv0z559qmukqsj4cgijw.png" alt="Juliet Explorer — natural language query for compromised axios versions" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Juliet converts the natural language query into structured filters across your entire cluster graph and returns every match:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w1cqc7604oyqoiec8go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w1cqc7604oyqoiec8go.png" alt="Juliet Explorer — results showing compromised axios 0.30.4 found across container images" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every affected pod across every cluster, plus the blast radius: what service accounts those pods use, what secrets they can access, whether they have network egress to the C2 IP, and which other workloads they can reach. No grep, no per-image scanning, no guessing which namespaces to check.&lt;/p&gt;

&lt;p&gt;You can also set an admission control policy to block any new deployment containing &lt;code&gt;plain-crypto-js&lt;/code&gt; or the affected axios versions — so even if a team hasn't seen the advisory yet, the compromised image can't land in the cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do right now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you find affected images:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't just update the lockfile and redeploy.&lt;/strong&gt; The RAT may have already exfiltrated secrets from the container's environment. Rotate every secret that was mounted into or accessible from affected pods — service account tokens, API keys, database credentials, cloud provider credentials.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rebuild images from scratch.&lt;/strong&gt; Don't layer a fix on top of a potentially compromised image. Rebuild from the base image with a clean &lt;code&gt;npm ci&lt;/code&gt; against a verified lockfile.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check for lateral movement.&lt;/strong&gt; If the RAT was active, the attacker had arbitrary code execution inside your cluster. Review RBAC permissions of affected pods — could they access other namespaces, secrets, or the Kubernetes API?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Block the C2 at the network level.&lt;/strong&gt; Add &lt;code&gt;142.11.206.73&lt;/code&gt; and &lt;code&gt;sfrclak[.]com&lt;/code&gt; to your network policy deny lists and DNS blocklists immediately.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
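&lt;p&gt;For step 4, a standard Kubernetes NetworkPolicy can do the IP-level block. One caveat: NetworkPolicy is allow-list based, so blocking a single IP means allowing all egress &lt;em&gt;except&lt;/em&gt; that address. A minimal sketch (policy name and namespace are placeholders; repeat per namespace):&lt;/p&gt;

```yaml
# Allow all egress except the known C2 address. podSelector {} matches
# every pod in the namespace; copy this policy into each namespace you run.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-c2-egress        # illustrative name
  namespace: production       # repeat per namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 142.11.206.73/32
```

&lt;p&gt;Pair this with the DNS blocklist for &lt;code&gt;sfrclak[.]com&lt;/code&gt; — an IP-only block is brittle if the C2 moves.&lt;/p&gt;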

&lt;p&gt;&lt;strong&gt;If you don't find affected images:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify, don't assume.&lt;/strong&gt; The absence of evidence in a spot check isn't the same as a clean bill of health. Scan every image in every namespace, not just the ones you think might be affected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add &lt;code&gt;plain-crypto-js&lt;/code&gt; to your package blocklist&lt;/strong&gt; in whatever registry proxy or admission policy you use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce &lt;code&gt;npm ci&lt;/code&gt; in all Dockerfiles.&lt;/strong&gt; If any of your Dockerfiles use &lt;code&gt;npm install&lt;/code&gt; instead of &lt;code&gt;npm ci&lt;/code&gt;, the build can resolve fresh versions within your semver ranges and rewrite the lockfile to match; &lt;code&gt;npm ci&lt;/code&gt; installs exactly what the lockfile pins and fails on any mismatch. That's how a three-hour window becomes your problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
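&lt;p&gt;A quick audit for that last step — this sketch flags any &lt;code&gt;RUN&lt;/code&gt; line invoking &lt;code&gt;npm install&lt;/code&gt; in any Dockerfile under the current directory (assumes GNU grep):&lt;/p&gt;

```shell
# Flag Dockerfiles that use `npm install` (which can drift from the lockfile)
# instead of `npm ci` (which installs exactly what the lockfile pins).
# Matching on RUN lines avoids false positives from comments.
grep -rnE '^[[:space:]]*RUN .*npm install' --include='Dockerfile*' . \
  || echo "no lockfile-ignoring installs found"
```

&lt;p&gt;Wire it into CI so a stray &lt;code&gt;npm install&lt;/code&gt; fails the build rather than waiting for the next advisory.&lt;/p&gt;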

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;This is the third major npm supply chain attack in 2026. The playbook is consistent: compromise a maintainer account, inject a malicious transitive dependency rather than modifying the source directly, use postinstall hooks for execution, and deploy platform-specific payloads that self-delete.&lt;/p&gt;

&lt;p&gt;The defenses that matter are also consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lockfile enforcement&lt;/strong&gt; (&lt;code&gt;npm ci&lt;/code&gt;, not &lt;code&gt;npm install&lt;/code&gt;) in every build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM generation&lt;/strong&gt; on built images, not just source repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime visibility&lt;/strong&gt; into what's actually deployed in your clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admission control&lt;/strong&gt; that can block known-bad packages before they run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network policy&lt;/strong&gt; that limits egress from workloads by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your source repo being clean doesn't mean your cluster is clean. The question after every supply chain incident is: what's actually running right now, and can it reach anything it shouldn't?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://juliet.sh/blog/axios-npm-supply-chain-compromise-finding-it-in-your-kubernetes-clusters?ref=devto" rel="noopener noreferrer"&gt;juliet.sh&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>kubernetes</category>
      <category>npm</category>
      <category>supplychain</category>
    </item>
    <item>
      <title>Introducing the ABOM: Why Your CI/CD Pipelines Need a Bill of Materials</title>
      <dc:creator>TooFastTooCurious</dc:creator>
      <pubDate>Fri, 27 Mar 2026 00:26:52 +0000</pubDate>
      <link>https://dev.to/toofasttoocurious/introducing-the-abom-why-your-cicd-pipelines-need-a-bill-of-materials-5dj6</link>
      <guid>https://dev.to/toofasttoocurious/introducing-the-abom-why-your-cicd-pipelines-need-a-bill-of-materials-5dj6</guid>
      <description>&lt;p&gt;An &lt;strong&gt;ABOM (Actions Bill of Materials)&lt;/strong&gt; is a complete inventory of every GitHub Action your CI/CD pipelines depend on — including transitive dependencies buried inside composite actions, reusable workflows, and tool wrappers that your workflow files never mention directly.&lt;/p&gt;

&lt;p&gt;If you know what an SBOM is, you already get it. SBOMs catalog your application dependencies. ABOMs catalog your pipeline dependencies. And right now, most organizations have no idea what's actually running in their CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Take this workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scan for vulnerabilities&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crazy-max/ghaction-container-scan@v3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No mention of Trivy anywhere. But &lt;code&gt;ghaction-container-scan&lt;/code&gt; downloads and runs Trivy internally. When &lt;a href="https://juliet.sh/blog/trivy-supply-chain-compromise-what-kubernetes-teams-need-to-know" rel="noopener noreferrer"&gt;76 of 77 Trivy release tags were poisoned with credential-stealing malware&lt;/a&gt; in March 2026, organizations that grepped their workflows for &lt;code&gt;trivy-action&lt;/code&gt; found nothing — and assumed they were safe.&lt;/p&gt;

&lt;p&gt;They weren't.&lt;/p&gt;

&lt;p&gt;This isn't a Trivy-specific problem. It's a structural one. GitHub Actions have a dependency tree just like application code does, but nobody's been tracking it.&lt;/p&gt;
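&lt;p&gt;You can see one level of that tree yourself: an action's &lt;code&gt;action.yml&lt;/code&gt; declares what it actually runs. A hedged sketch (the URL pattern assumes the action keeps &lt;code&gt;action.yml&lt;/code&gt; at the repo root, which most do):&lt;/p&gt;

```shell
# Fetch an action's metadata and list its nested `uses:` references —
# the dependencies that grepping your own workflow files will never show.
curl -fsSL https://raw.githubusercontent.com/crazy-max/ghaction-container-scan/v3/action.yml |
  grep -E '^[[:space:]]*(-[[:space:]]*)?uses:' || echo "no nested actions found"
```

&lt;p&gt;This only goes one level deep, and it won't surface tools the action downloads at runtime rather than declaring — which is exactly why the resolution needs to be recursive and metadata-aware.&lt;/p&gt;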

&lt;h2&gt;
  
  
  Why SBOMs don't cover this
&lt;/h2&gt;

&lt;p&gt;SBOMs document what goes &lt;em&gt;into&lt;/em&gt; your software — libraries, packages, container base images. That's the artifact side.&lt;/p&gt;

&lt;p&gt;But the pipeline that builds, tests, scans, and deploys that software has its own dependency tree. A compromised CI action can steal every secret in your pipeline, poison every artifact it touches, and propagate to every downstream system — and none of that shows up in an SBOM.&lt;/p&gt;

&lt;p&gt;After Trivy, this stopped being theoretical. The attack exfiltrated AWS credentials, Kubernetes tokens, Docker configs, and SSH keys from CI runners. It then used stolen npm credentials to publish a self-propagating worm into downstream packages. The pipeline was the entry point for all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an ABOM contains
&lt;/h2&gt;

&lt;p&gt;An ABOM maps every action in your workflows, resolved recursively:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct dependencies&lt;/strong&gt; — the actions your workflows reference explicitly. This is what grep finds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transitive dependencies&lt;/strong&gt; — actions called by composite actions or reusable workflows your workflows use. This is what grep misses. A single &lt;code&gt;uses:&lt;/code&gt; line in your workflow might resolve to a chain of five or six nested actions, any one of which could be compromised.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded tools&lt;/strong&gt; — actions that don't call other actions but silently download and execute external tools like Trivy, Grype, or Snyk. These don't show up as action dependencies at all — you have to analyze the action's metadata and inputs to detect them.&lt;/p&gt;

&lt;p&gt;For each action, the ABOM records the owner, repository, version reference, whether it's pinned to an immutable SHA or a mutable tag, and the full chain of how your workflow reaches it.&lt;/p&gt;
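&lt;p&gt;As a sketch, one plausible shape for a single record — field names here are illustrative, not a published schema:&lt;/p&gt;

```json
{
  "owner": "crazy-max",
  "repository": "ghaction-container-scan",
  "reference": "v3",
  "pinned": false,
  "pin_type": "mutable-tag",
  "reached_via": [
    ".github/workflows/ci.yml",
    "crazy-max/ghaction-container-scan@v3"
  ],
  "embedded_tools": ["trivy"]
}
```

&lt;p&gt;The &lt;code&gt;reached_via&lt;/code&gt; chain is the part grep can't give you: it tells you not just that you're exposed, but through which workflow and which intermediate action.&lt;/p&gt;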

&lt;h2&gt;
  
  
  What you do with it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Incident response.&lt;/strong&gt; When a GitHub Action gets compromised, you need to know in minutes whether you're affected — not after a manual audit of every composite action your workflows use. Query the ABOM for the affected action and get an immediate answer, including transitive and embedded exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI gate.&lt;/strong&gt; Generate an ABOM on every pull request and fail the build if it contains a known-compromised action. The same way you'd fail a build for a critical CVE in an application dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance.&lt;/strong&gt; If you're already generating SBOMs for regulatory or customer requirements, your CI/CD pipeline is a gap in that inventory. An ABOM closes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection.&lt;/strong&gt; Compare ABOMs across builds to detect when a new transitive dependency appears or when a previously pinned action gets changed to a mutable tag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standard formats
&lt;/h2&gt;

&lt;p&gt;ABOMs shouldn't be a proprietary format. The dependency relationships in a CI pipeline map cleanly onto existing BOM standards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CycloneDX 1.5&lt;/strong&gt; — actions become components, transitive relationships go in the dependency graph, compromised actions show up as vulnerabilities. Plugs directly into Dependency-Track, Grype, and other tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPDX 2.3&lt;/strong&gt; — actions become packages with &lt;code&gt;DEPENDS_ON&lt;/code&gt; relationships. Works with existing license compliance and SBOM aggregation tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means you can manage your pipeline dependencies with the same tools you already use for application dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;We built &lt;a href="https://github.com/JulietSecurity/abom" rel="noopener noreferrer"&gt;abom&lt;/a&gt; to generate ABOMs from any GitHub repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/julietsecurity/abom@latest

&lt;span class="c"&gt;# Generate an ABOM for your repo&lt;/span&gt;
abom scan &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Check against known-compromised actions&lt;/span&gt;
abom scan &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--check&lt;/span&gt;

&lt;span class="c"&gt;# Export as CycloneDX&lt;/span&gt;
abom scan &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; cyclonedx-json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It resolves transitive dependencies by fetching action metadata from GitHub, caches results locally, and checks against a &lt;a href="https://github.com/JulietSecurity/abom-advisories" rel="noopener noreferrer"&gt;community-maintained advisory database&lt;/a&gt; of known-compromised actions. It's open source under Apache 2.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  This will happen again
&lt;/h2&gt;

&lt;p&gt;The Trivy compromise was not a one-off. GitHub Actions are a high-value target: they run with access to cloud credentials, deployment keys, package registry tokens, and production infrastructure. Any widely used action is one misconfigured token away from becoming the next supply chain incident.&lt;/p&gt;

&lt;p&gt;The question is whether you'll find out you were affected from a tool, or from an incident report.&lt;/p&gt;




&lt;p&gt;Questions? Reach us at &lt;a href="mailto:contact@juliet.sh"&gt;contact@juliet.sh&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>github</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
