<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SURYANSH GUPTA</title>
    <description>The latest articles on DEV Community by SURYANSH GUPTA (@suryansh_gupta).</description>
    <link>https://dev.to/suryansh_gupta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3820737%2F3a7e0ac6-8521-4611-ba29-562d4b6410ba.png</url>
      <title>DEV Community: SURYANSH GUPTA</title>
      <link>https://dev.to/suryansh_gupta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/suryansh_gupta"/>
    <language>en</language>
    <item>
      <title>Hands-On AWS CI/CD: CodeBuild, CodeDeploy, CodePipeline &amp; Zero-Downtime Blue/Green Releases</title>
      <dc:creator>SURYANSH GUPTA</dc:creator>
      <pubDate>Sat, 06 Jun 2026 15:06:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/hands-on-aws-cicd-codebuild-codedeploy-codepipeline-zero-downtime-bluegreen-releases-3ikb</link>
      <guid>https://dev.to/aws-builders/hands-on-aws-cicd-codebuild-codedeploy-codepipeline-zero-downtime-bluegreen-releases-3ikb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you've ever wondered how production teams ship code dozens of times a day without breaking things (or how they recover fast when they do), the answer almost always comes down to a solid CI/CD pipeline. In this post, I'm going to walk you through exactly how I built one end-to-end on AWS — from pushing code to a Git repository all the way through automated build, test, deploy, rollback, and finally a blue/green deployment strategy.&lt;/p&gt;

&lt;p&gt;Here's what the full pipeline looks like at a high level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Git Push → S3 (source) → AWS CodeBuild (build + test) → AWS CodeDeploy → EC2 (production)
                                    ↑
                           AWS CodePipeline orchestrates it all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's go step by step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving in, here's what was already in place in this lab environment (you'd provision these yourself in a real project):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An EC2 instance used as a &lt;strong&gt;development environment&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A self-hosted &lt;strong&gt;Gitea&lt;/strong&gt; SCM (Git-based source control)&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Auto Scaling Group&lt;/strong&gt; with 2 EC2 production instances&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Application Load Balancer&lt;/strong&gt; (ALB) targeting those instances&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;CodeDeploy Agent&lt;/strong&gt; pre-installed on production instances&lt;/li&gt;
&lt;li&gt;IAM roles for CodeBuild, CodeDeploy, and CodePipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application itself is a simple &lt;strong&gt;Node.js + Express&lt;/strong&gt; app with an AngularJS frontend. It has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;gulp&lt;/code&gt;-based build process&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Karma&lt;/code&gt; + &lt;code&gt;Jasmine&lt;/code&gt; unit tests&lt;/li&gt;
&lt;li&gt;A dev mode (port 3000) and production mode (port 8080)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1 — Committing Code to the Git Repository
&lt;/h2&gt;

&lt;p&gt;The first step in any CI/CD pipeline is getting your code into source control. I connected to the EC2 dev instance via &lt;strong&gt;EC2 Instance Connect&lt;/strong&gt; (browser-based SSH — no key pair needed), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate into the app directory and run the tests first&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;app
npm &lt;span class="nb"&gt;test&lt;/span&gt;

&lt;span class="c"&gt;# Verify the app runs locally in dev mode&lt;/span&gt;
&lt;span class="nv"&gt;NODE_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;development &lt;span class="nv"&gt;DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aws-code-services:&lt;span class="k"&gt;*&lt;/span&gt; npm start
&lt;span class="c"&gt;# App is now accessible at http://&amp;lt;EC2-IP&amp;gt;:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once tests passed and the app looked good, I set up Git credentials and pushed to the remote repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configure Git identity&lt;/span&gt;
git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.email student@platform.qa.com
git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.name student
git config &lt;span class="nt"&gt;--global&lt;/span&gt; credential.helper store
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"http://student:LabPassword123@&amp;lt;SCM-IP&amp;gt;:3000"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ~/.git-credentials

&lt;span class="c"&gt;# Clone the empty remote repo&lt;/span&gt;
git clone http://&amp;lt;SCM-IP&amp;gt;:3000/student/app-repo.git
&lt;span class="nb"&gt;cd &lt;/span&gt;app-repo

&lt;span class="c"&gt;# Copy the app source into the repo&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; ../app/. &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Stage, commit, and push&lt;/span&gt;
git add &lt;span class="nt"&gt;-A&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"app v1.0"&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnibfc0iag4y767ldtse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnibfc0iag4y767ldtse.png" alt=" " width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, &lt;code&gt;app v1.0&lt;/code&gt; is live in the remote SCM repository. The SCM was configured to automatically &lt;strong&gt;zip and upload the source to an S3 bucket&lt;/strong&gt; (&lt;code&gt;code-build-source-*&lt;/code&gt;) whenever a push lands — this is the bridge between the self-hosted SCM and CodeBuild, which doesn't natively support self-hosted Git.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2 — Automated Build with AWS CodeBuild
&lt;/h2&gt;

&lt;p&gt;With source code in S3, &lt;strong&gt;AWS CodeBuild&lt;/strong&gt; picks it up and runs the build. The entire build is defined in a &lt;code&gt;buildspec.yml&lt;/code&gt; file at the root of the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;

&lt;span class="na"&gt;phases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;install&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runtime-versions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodejs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install&lt;/span&gt;   &lt;span class="c1"&gt;# Install ALL dependencies (including dev)&lt;/span&gt;

  &lt;span class="na"&gt;pre_build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;      &lt;span class="c1"&gt;# Run automated unit tests&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm prune --production&lt;/span&gt;  &lt;span class="c1"&gt;# Remove dev dependencies&lt;/span&gt;

  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt; &lt;span class="c1"&gt;# Production build via gulp (minification, bundling)&lt;/span&gt;

&lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;**/*'&lt;/span&gt;          &lt;span class="c1"&gt;# Package everything for CodeDeploy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening in each phase:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;install&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pulls in the Node.js runtime and installs all npm dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pre_build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs unit tests; if they fail, the build stops here. Then strips dev deps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the gulp production build — minifies and bundles frontend assets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;artifacts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Packages the entire working directory into a ZIP and uploads to S3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CodeBuild project was configured to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use an &lt;strong&gt;AWS-managed Docker container&lt;/strong&gt; (no infrastructure to manage)&lt;/li&gt;
&lt;li&gt;Store build artifacts in a dedicated S3 bucket (&lt;code&gt;code-build-artifacts-*&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Log everything to &lt;strong&gt;Amazon CloudWatch Logs&lt;/strong&gt; for debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I triggered a manual build to verify the setup, watched the phase details, and confirmed the artifact landed in S3. ✅&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ghrnhh3grprnld2z7rc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ghrnhh3grprnld2z7rc.png" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpnjq7o0qwl0rv2ypwtz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpnjq7o0qwl0rv2ypwtz.png" alt=" " width="799" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Configuring AWS CodeDeploy
&lt;/h2&gt;

&lt;p&gt;This is where the actual deployment to EC2 happens. CodeDeploy uses an &lt;code&gt;appspec.yml&lt;/code&gt; at the project root to know how to deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.0&lt;/span&gt;
&lt;span class="na"&gt;os&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;linux&lt;/span&gt;

&lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
    &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/ec2-user/app&lt;/span&gt;

&lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ApplicationStop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/stop_server.sh&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;

  &lt;span class="na"&gt;ApplicationStart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/start_server.sh&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;

  &lt;span class="na"&gt;ValidateService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/validate_service.sh&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The lifecycle hooks are critical:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ApplicationStop&lt;/code&gt; — gracefully stops any running instance of the Node server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ApplicationStart&lt;/code&gt; — starts the server in production mode on port 8080&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ValidateService&lt;/code&gt; — curls port 8080 and fails the deployment if the app doesn't respond&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any hook script returns a non-zero exit code, &lt;strong&gt;CodeDeploy fails the deployment and triggers a rollback automatically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvt5u2xeq9uiucc63g4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvt5u2xeq9uiucc63g4k.png" alt=" " width="675" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the Deployment Application and Groups
&lt;/h3&gt;

&lt;p&gt;In the CodeDeploy console, I created an application named &lt;code&gt;lab-app&lt;/code&gt; and two deployment groups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-place deployment group (&lt;code&gt;in-place&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Targets the Auto Scaling Group (&lt;code&gt;lab-app-prod-asg&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;CodeDeployDefault.OneAtATime&lt;/code&gt; — updates one instance at a time, keeping the other serving traffic&lt;/li&gt;
&lt;li&gt;ALB integration with connection draining enabled&lt;/li&gt;
&lt;li&gt;Automatic rollback on failure ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Blue/green deployment group (&lt;code&gt;blue-green&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same ASG and ALB configuration&lt;/li&gt;
&lt;li&gt;CodeDeploy provisions &lt;strong&gt;fresh EC2 instances&lt;/strong&gt; for every deployment (no config drift)&lt;/li&gt;
&lt;li&gt;Original instances kept running until cutover completes&lt;/li&gt;
&lt;li&gt;Traffic switches via the ALB — zero downtime&lt;/li&gt;
&lt;li&gt;Automatic rollback on failure ✅&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4h8hbcbi15krhtk9wh1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4h8hbcbi15krhtk9wh1n.png" alt=" " width="707" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqtvam5e1gjiypmt2z4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqtvam5e1gjiypmt2z4c.png" alt=" " width="680" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Wiring It All Together with AWS CodePipeline
&lt;/h2&gt;

&lt;p&gt;CodePipeline is the orchestrator that connects source → build → deploy into a single automated workflow.&lt;/p&gt;

&lt;p&gt;I created a pipeline named &lt;code&gt;lab-app&lt;/code&gt; with these stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Source] → [Build] → [Production]
  S3         CodeBuild   CodeDeploy (in-place)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pipeline configuration highlights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source stage&lt;/strong&gt;: Watches the S3 bucket (&lt;code&gt;code-build-source-*&lt;/code&gt;) for &lt;code&gt;source.zip&lt;/code&gt; changes. When a new zip lands, the pipeline fires.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build stage&lt;/strong&gt;: Delegates to the &lt;code&gt;lab-app&lt;/code&gt; CodeBuild project. The output &lt;code&gt;BuildArtifact&lt;/code&gt; is passed downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production stage&lt;/strong&gt;: I added this manually after creating the pipeline (you can't rename stages, so skip the default "Deploy" to avoid the misleading label). It runs a CodeDeploy action pointing at the &lt;code&gt;lab-app&lt;/code&gt; application and &lt;code&gt;in-place&lt;/code&gt; deployment group. Automatic rollback on stage failure is enabled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One important note: the Pipeline checks S3 periodically for changes. In production, you'd configure &lt;strong&gt;EventBridge (CloudWatch Events)&lt;/strong&gt; for near-instant triggers, or use a native SCM integration if you're on GitHub, GitLab, or BitBucket.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5 — Following a Successful Deployment
&lt;/h2&gt;

&lt;p&gt;With the pipeline in place, I triggered it manually via &lt;strong&gt;Release change&lt;/strong&gt; and watched each stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Source → Succeeded&lt;/strong&gt; (green) — latest &lt;code&gt;source.zip&lt;/code&gt; pulled from S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build → Succeeded&lt;/strong&gt; — CodeBuild ran all phases, artifact uploaded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production → In Progress&lt;/strong&gt; — CodeDeploy started the in-place rollout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Clicking into the CodeDeploy deployment view showed the &lt;strong&gt;lifecycle events per instance&lt;/strong&gt; in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BeforeAllowTraffic → ApplicationStop → ApplicationStart → ValidateService → AfterAllowTraffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since &lt;code&gt;OneAtATime&lt;/code&gt; was configured, one instance was taken out of the ALB at a time, upgraded, validated, then returned to service before the second instance was touched. The app stayed available throughout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final verification:&lt;/strong&gt; I grabbed the ALB DNS name and loaded it in the browser. No "development mode" banner — confirmed production mode. Refreshing showed different server IPs alternating, proving both instances were serving traffic behind the load balancer. ✅&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthy4gpspkospadryvumi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthy4gpspkospadryvumi.png" alt=" " width="799" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lqg31t9k3dqdrgafxxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lqg31t9k3dqdrgafxxn.png" alt=" " width="799" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6 — Intentional Failure and Automatic Rollback
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. I simulated a bad deployment by pushing &lt;code&gt;v1.1&lt;/code&gt; — a version where someone accidentally changed the server's listening port from &lt;code&gt;8080&lt;/code&gt; to &lt;code&gt;80&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; commits/v1_1/. app-repo/
&lt;span class="nb"&gt;cd &lt;/span&gt;app-repo
git add &lt;span class="nt"&gt;-A&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"app v1.1"&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline triggered automatically. CodeBuild passed (the unit tests didn't catch a port misconfiguration — a realistic scenario). The deployment reached the &lt;code&gt;ValidateService&lt;/code&gt; hook, which tried to &lt;code&gt;curl localhost:8080&lt;/code&gt;... and got a connection refused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened next, automatically:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;ValidateService&lt;/code&gt; script failed → deployment failed on instance 1&lt;/li&gt;
&lt;li&gt;CodeDeploy's &lt;code&gt;OneAtATime&lt;/code&gt; config meant instance 2 was &lt;strong&gt;skipped&lt;/strong&gt; — it never received the broken version&lt;/li&gt;
&lt;li&gt;CodeDeploy triggered an &lt;strong&gt;automatic rollback&lt;/strong&gt; — a new deployment was initiated using the last successful revision&lt;/li&gt;
&lt;li&gt;The rollback deployment showed &lt;code&gt;Initiating event: codeDeployRollback&lt;/code&gt; in the deployment history&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The app was &lt;strong&gt;never fully broken in production&lt;/strong&gt;. Only one instance briefly served the broken version, and it was rolled back before any real user impact. This is exactly why the &lt;code&gt;ValidateService&lt;/code&gt; hook exists — your automated unit tests can't catch everything, but a post-deploy smoke test can.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1155lsibnb9clkq4ko8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1155lsibnb9clkq4ko8.png" alt=" " width="705" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7 — Blue/Green Deployment
&lt;/h2&gt;

&lt;p&gt;Finally, I switched the pipeline to use the &lt;code&gt;blue-green&lt;/code&gt; deployment group and pushed &lt;code&gt;v1.2&lt;/code&gt; — a legitimate new feature (message emphasis toggles in the accumulator app).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; commits/v1_2/. app-repo/
&lt;span class="nb"&gt;cd &lt;/span&gt;app-repo
git add &lt;span class="nt"&gt;-A&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"app v1.2"&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The blue/green deployment flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;New (green) instances provisioned&lt;/strong&gt; — CodeDeploy creates fresh EC2 instances from the ASG configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App installed on green instances&lt;/strong&gt; — &lt;code&gt;v1.2&lt;/code&gt; deployed and validated on the new instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic rerouted&lt;/strong&gt; — ALB gradually shifts traffic from blue (original) to green (new), &lt;code&gt;OneAtATime&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original (blue) instances retained&lt;/strong&gt; — kept running for rollback capability&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Watching the CodeDeploy console during this was genuinely satisfying: you could see the replacement instances appear, the lifecycle events complete, and traffic start routing to them — while the original instances continued serving users without interruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final verification:&lt;/strong&gt; Refreshed the app. New feature (emphasis toggles) was visible. Server IPs in the bottom corner were &lt;strong&gt;different from before&lt;/strong&gt; — confirming entirely new instances were serving traffic, not the same ones from the in-place deployment. That's immutable infrastructure in action.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbj6s821gloegs18ptol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbj6s821gloegs18ptol.png" alt=" " width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What You Learned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;buildspec.yml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Defines CodeBuild phases: install → pre_build → build → artifacts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;appspec.yml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Defines CodeDeploy lifecycle hooks: stop → start → validate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In-place deployment&lt;/td&gt;
&lt;td&gt;Rolling update on existing instances; faster but risks config drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blue/green deployment&lt;/td&gt;
&lt;td&gt;New instances every time; zero-downtime cutover; immutable infra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic rollback&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ValidateService&lt;/code&gt; hook + rollback config = self-healing pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer (EC2 dev-instance)
        │
        │ git push
        ▼
    SCM (Gitea)
        │
        │ webhook → uploads source.zip
        ▼
    Amazon S3 (source bucket)
        │
        │ triggers CodePipeline
        ▼
  AWS CodePipeline
   ┌────┴────────────────────────────────┐
   │                                     │
[Source]     [Build]              [Production]
  S3    →  CodeBuild        →    CodeDeploy
           (build + test)        (in-place or
           artifact → S3          blue/green)
                                      │
                              ┌───────┴───────┐
                              │               │
                           Instance 1    Instance 2
                           (EC2, prod)   (EC2, prod)
                              └───────┬───────┘
                                      │
                           Application Load Balancer
                                      │
                                   Users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This pipeline covers the core CI/CD loop. From here, you could extend it with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual approval stage&lt;/strong&gt; in CodePipeline before production (for regulated environments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SNS notifications&lt;/strong&gt; on pipeline success/failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch alarms&lt;/strong&gt; tied to CodeDeploy to trigger rollbacks on metrics (not just script failures)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EventBridge rules&lt;/strong&gt; for instant pipeline triggers instead of S3 polling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameter Store / Secrets Manager&lt;/strong&gt; integration in the buildspec for managing environment variables securely&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built as part of the AWS CI/CD hands-on lab. Published under the AWS Builders community.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#aws&lt;/code&gt; &lt;code&gt;#devops&lt;/code&gt; &lt;code&gt;#cicd&lt;/code&gt; &lt;code&gt;#codepipeline&lt;/code&gt; &lt;code&gt;#codedeploy&lt;/code&gt; &lt;code&gt;#codebuild&lt;/code&gt; &lt;code&gt;#cloud&lt;/code&gt; &lt;code&gt;#awscommunity&lt;/code&gt;&lt;/p&gt;

</description>
      <category>containers</category>
      <category>aws</category>
      <category>codebuild</category>
      <category>codedeploy</category>
    </item>
    <item>
      <title>I Built a Bot That Updates My EKS Nodes While I Sleep — Here's How</title>
      <dc:creator>SURYANSH GUPTA</dc:creator>
      <pubDate>Sat, 30 May 2026 10:03:45 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-built-a-bot-that-updates-my-eks-nodes-while-i-sleep-heres-how-2lgd</link>
      <guid>https://dev.to/aws-builders/i-built-a-bot-that-updates-my-eks-nodes-while-i-sleep-heres-how-2lgd</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Manual EKS AMI updates are slow, risky, and easy to forget. I wired together EventBridge, Lambda, Amazon Bedrock (Claude 3.5 Haiku), GitHub PRs, ArgoCD, and Karpenter into a pipeline that detects new AMIs, runs AI risk analysis, opens a PR for human review, and rolls out nodes automatically — zero downtime, full audit trail.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The problem every EKS team hits eventually
&lt;/h2&gt;

&lt;p&gt;You're running production Kubernetes on AWS. You know you're supposed to keep worker nodes patched. But between sprints, incidents, and everything else — checking for new EKS-optimized AMIs falls through the cracks.&lt;/p&gt;

&lt;p&gt;When you finally do an update, there's a whole ritual: find the new AMI ID, read through the release notes, assess any CVEs, draft a PR, wait for approvals, then carefully roll out nodes without taking down your workloads.&lt;/p&gt;

&lt;p&gt;It's not rocket science — it's just slow, manual, and one of those tasks that always feels lower priority than the thing currently on fire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the whole thing ran itself?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The solution in one sentence
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Twice a day, a Lambda checks for new EKS AMIs. If one exists, Bedrock analyzes the risk and opens a GitHub PR. A human reviews it. Merging the PR triggers ArgoCD + Karpenter to roll out the new nodes with zero downtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The magic is that &lt;strong&gt;the only thing a human needs to do is read the AI's analysis and merge (or close) the PR.&lt;/strong&gt; Everything else — detection, analysis, branch creation, notification, node rollout — is automated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: Three clean phases
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qckdrg596fjpqprtncw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qckdrg596fjpqprtncw.png" alt="Three-phase EKS AMI automation pipeline: Detect via EventBridge and Lambda, AI Analyze via Amazon Bedrock and GitHub PR, Deploy via ArgoCD and Karpenter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — Detection
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;EventBridge scheduled rule&lt;/strong&gt; fires at 9 AM and 9 PM UTC every day. It triggers a Lambda that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Queries &lt;strong&gt;AWS SSM Parameter Store&lt;/strong&gt; for the latest EKS-optimized AMI ID (&lt;code&gt;/aws/service/eks/optimized-ami/1.34/amazon-linux-2023/recommended/image_id&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Compares it against what's currently committed in your &lt;strong&gt;GitHub repository&lt;/strong&gt; (your source of truth)&lt;/li&gt;
&lt;li&gt;If they differ — new AMI exists → triggers the Step Functions workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No new AMI? The Lambda exits quietly. Nothing else happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2 — AI Analysis + Pull Request
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. &lt;strong&gt;AWS Step Functions&lt;/strong&gt; orchestrates three Lambda functions in sequence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda 1 — &lt;code&gt;bedrock-analyzer&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fetches the real AMI release notes from GitHub (awslabs/amazon-eks-ami) and sends them to &lt;strong&gt;Amazon Bedrock running Claude 3.5 Haiku&lt;/strong&gt; with this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyze this Amazon EKS AMI update using the actual release notes.
New AMI ID: {ami_id}
Previous AMI ID: {previous_ami}

ACTUAL EKS AMI RELEASE NOTES:
{release_notes}

Respond in JSON with:
- risk_score: 1–10
- recommendation: APPROVE or REJECT
- summary: one-line summary of actual changes
- pr_description: full markdown PR body with CVEs, package versions,
  risk assessment, and review guidance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is a structured JSON object with a risk score and a ready-to-paste PR description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda 2 — &lt;code&gt;gitops-updater&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Uses GitHub App credentials (stored in &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new branch&lt;/li&gt;
&lt;li&gt;Update the &lt;strong&gt;Karpenter EC2NodeClass&lt;/strong&gt; YAML with the new AMI ID&lt;/li&gt;
&lt;li&gt;Open a Pull Request with the full Bedrock analysis embedded in the description&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lambda 3 — &lt;code&gt;send-notification&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fires an &lt;strong&gt;SNS email&lt;/strong&gt; to the team: "New AMI detected, PR #N is open for your review." Includes the PR link and the one-line AI summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The human's job:&lt;/strong&gt; Read the AI analysis. Check the YAML diff (it's literally one line — the AMI ID). Merge to approve, close to reject.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3 — GitOps Deployment
&lt;/h3&gt;

&lt;p&gt;After the PR is merged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; detects the commit on &lt;code&gt;main&lt;/code&gt;, auto-syncs the updated &lt;code&gt;EC2NodeClass&lt;/code&gt; manifest to the EKS cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; sees the new AMI ID in the &lt;code&gt;EC2NodeClass&lt;/code&gt;, provisions new EC2 nodes with the updated AMI, then gracefully drains the old nodes&lt;/li&gt;
&lt;li&gt;Workloads migrate to new nodes. &lt;strong&gt;Zero downtime.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole rollout happens without anyone touching &lt;code&gt;kubectl&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the PR actually looks like
&lt;/h2&gt;

&lt;p&gt;This is what your team sees in GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## EKS AMI Update — ami-04b406d4e6eaca578&lt;/span&gt;

&lt;span class="gs"&gt;**AI Risk Score: 2/10 — APPROVE**&lt;/span&gt;

&lt;span class="gu"&gt;### What changed&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Go updated to 1.25.9
&lt;span class="p"&gt;-&lt;/span&gt; Kernel updated to 6.12.79-101.147.amzn2023
&lt;span class="p"&gt;-&lt;/span&gt; No new CVEs introduced

&lt;span class="gu"&gt;### CVE Assessment&lt;/span&gt;
No critical or high-severity CVEs in this update. Two previously
known CVEs (CVE-2024-XXXX, CVE-2024-YYYY) are patched.

&lt;span class="gu"&gt;### Review guidance&lt;/span&gt;
This is a routine kernel + runtime update. Low risk. Recommend
merging during business hours with normal monitoring in place.
&lt;span class="p"&gt;
---&lt;/span&gt;
&lt;span class="ge"&gt;*Merge this PR to trigger ArgoCD + Karpenter rollout.*&lt;/span&gt;
&lt;span class="ge"&gt;*Close this PR to skip this AMI version.*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your reviewer doesn't need to dig through release notes. The AI already did it.&lt;/p&gt;




&lt;h2&gt;
  
  
  CloudFormation: everything in one stack
&lt;/h2&gt;

&lt;p&gt;The whole solution deploys from a single CloudFormation template. Here's what it provisions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS Secrets Manager&lt;/td&gt;
&lt;td&gt;GitHub App credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon SNS + subscription&lt;/td&gt;
&lt;td&gt;Email alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 IAM roles&lt;/td&gt;
&lt;td&gt;Per-function least-privilege&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 Lambda functions&lt;/td&gt;
&lt;td&gt;Detector, analyzer, PR creator, notifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Bedrock Guardrail&lt;/td&gt;
&lt;td&gt;Content filtering on AI output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step Functions state machine&lt;/td&gt;
&lt;td&gt;Orchestrates analyze → PR → notify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EventBridge rule&lt;/td&gt;
&lt;td&gt;Twice-daily schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deploy it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation create-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--stack-name&lt;/span&gt; eks-ami-update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--template-body&lt;/span&gt; file://cloudformation-template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_NAMED_IAM &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;NotificationEmail,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;your@email.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubAppId,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;app-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubAppInstallationId,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;install-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubAppPrivateKey,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; app.pem | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubRepoOwner,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-org&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubRepoName,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-repo&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubFilePath,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;karpenter-configs/clusters/your-cluster/nodeclass.yaml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;GitHubBranch,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ParameterKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EKSVersion,ParameterValue&lt;span class="o"&gt;=&lt;/span&gt;1.34
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Takes about 2–3 minutes. Confirm the SNS subscription email when it arrives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites checklist
&lt;/h2&gt;

&lt;p&gt;Before deploying, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] An existing EKS cluster (v1.34+)&lt;/li&gt;
&lt;li&gt;[ ] Karpenter installed and configured&lt;/li&gt;
&lt;li&gt;[ ] ArgoCD installed with &lt;strong&gt;auto-sync enabled&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;[ ] A GitHub repository for Karpenter configs&lt;/li&gt;
&lt;li&gt;[ ] A GitHub App installed on that repo (you need App ID, Installation ID, and Private Key)&lt;/li&gt;
&lt;li&gt;[ ] Amazon Bedrock enabled in your region (enable Claude 3.5 Haiku access in the Bedrock console)&lt;/li&gt;
&lt;li&gt;[ ] AWS CLI + kubectl configured&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Fork the &lt;a href="https://github.com/suryansh639/sample-eks-ami-gitops-pipeline.git" rel="noopener noreferrer"&gt;aws-samples repository&lt;/a&gt; to your own account — you need write access to configure the GitHub App. Deploy your EC2NodeClass config to the repo before running the stack.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Testing it without waiting for an AMI release
&lt;/h2&gt;

&lt;p&gt;Don't want to wait up to 12 hours for the schedule to fire? Trigger it manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--function-name&lt;/span&gt; eks-ami-detector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-binary-format&lt;/span&gt; raw-in-base64-out &lt;span class="se"&gt;\&lt;/span&gt;
  /tmp/response.json &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/response.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check your inbox. You should get an SNS email with the risk analysis and PR link within a couple of minutes.&lt;/p&gt;

&lt;p&gt;After merging, verify the ArgoCD sync:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update your kubeconfig&lt;/span&gt;
aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;region&amp;gt; &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;cluster-name&amp;gt;

&lt;span class="c"&gt;# Check ArgoCD sync policy&lt;/span&gt;
kubectl get application karpenter-nodeclass &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.syncPolicy}'&lt;/span&gt;

&lt;span class="c"&gt;# Verify the AMI ID was applied&lt;/span&gt;
kubectl get ec2nodeclass default &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep &lt;/span&gt;ami-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Common issues and how to fix them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SNS subscription not confirmed&lt;/strong&gt; — Check your spam folder. The confirmation email comes from AWS and sometimes gets filtered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub App auth failure&lt;/strong&gt; — Double-check the App is installed on the correct repository with read/write permissions. Regenerate the private key in GitHub if needed and re-run the CloudFormation update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock access denied&lt;/strong&gt; — Go to the Amazon Bedrock console → Model access → enable Claude 3.5 Haiku in your region. This is a manual step that's easy to miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD not syncing&lt;/strong&gt; — Verify the Application resource has &lt;code&gt;spec.syncPolicy.automated&lt;/code&gt; set. Check that the repo URL and path match exactly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step Functions failures&lt;/strong&gt; — Check CloudWatch Logs for the failing Lambda. 99% of the time it's an IAM permission issue or a missing secret.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this architecture is worth copying
&lt;/h2&gt;

&lt;p&gt;A few design decisions I want to highlight:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub PRs as the approval interface&lt;/strong&gt; — Engineers already live in GitHub. Using a PR as the human gate means no new tool to learn, built-in commenting, and a permanent audit trail in Git history. The PR description IS the change record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI analysis on real release notes&lt;/strong&gt; — The Bedrock prompt fetches actual release notes from the awslabs/amazon-eks-ami repo. It's not making things up — it's summarizing real content. The risk score is grounded in actual CVE and package data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Karpenter over managed node groups&lt;/strong&gt; — Karpenter watches the EC2NodeClass for changes and handles the node lifecycle automatically. You don't need to write any drain/cordon scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least-privilege IAM&lt;/strong&gt; — Each Lambda has its own role with only the permissions it needs. The CF template provisions five separate roles. This matters in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails on Bedrock&lt;/strong&gt; — The solution includes a Bedrock Guardrail for content filtering on the AI output. Belt and suspenders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cleaning up
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation delete-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; eks-ami-update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I'd add next
&lt;/h2&gt;

&lt;p&gt;A few things that would make this even better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slack notification&lt;/strong&gt; instead of (or in addition to) SNS email — PR link directly in your #platform channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run mode&lt;/strong&gt; — run the full pipeline but don't actually open a PR, just log the analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster support&lt;/strong&gt; — one stack managing AMI updates across dev/staging/prod with different approval thresholds per environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom risk criteria&lt;/strong&gt; — tune the Bedrock prompt to your org's specific compliance requirements (PCI-DSS, SOC 2, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic REJECT on critical CVEs&lt;/strong&gt; — skip the PR entirely and alert the team if the risk score is 8+&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Get the code
&lt;/h2&gt;

&lt;p&gt;Fork the repo, follow the README, and deploy:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/suryansh639/sample-eks-ami-gitops-pipeline.git" rel="noopener noreferrer"&gt;GitHub: suryansh639/sample-eks-ami-gitops-pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CloudFormation template, Lambda code, and example Karpenter configs are all there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The goal wasn't to remove humans from the loop — it was to remove the &lt;em&gt;boring&lt;/em&gt; part of the loop. The AI reads the release notes. The AI writes the PR description. The human decides. The automation executes.&lt;/p&gt;

&lt;p&gt;That's the right split. And it means your nodes actually get updated on time, every time, with a full audit trail and no 2 AM surprises.&lt;/p&gt;

&lt;p&gt;If you try this out, drop a comment — I'd love to hear what customizations you make.&lt;/p&gt;




</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>containers</category>
    </item>
    <item>
      <title>What Is an Agent Harness? And Why Every AI Agent Needs One</title>
      <dc:creator>SURYANSH GUPTA</dc:creator>
      <pubDate>Sat, 09 May 2026 07:28:09 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-is-an-agent-harness-and-why-every-ai-agent-needs-one-382l</link>
      <guid>https://dev.to/aws-builders/what-is-an-agent-harness-and-why-every-ai-agent-needs-one-382l</guid>
      <description>&lt;p&gt;If you've spent any time building with AI lately, you've probably heard the word "agent" thrown around a lot. But here's something that doesn't get talked about nearly as much: &lt;strong&gt;before you can have a real AI agent, you need a harness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I know that term might sound unfamiliar or even a little abstract. When I first came across it, I had the same reaction. But once it clicked, I couldn't unsee it — and I genuinely think it's one of the most important concepts to understand if you want to go beyond just calling an LLM API and actually building something that &lt;em&gt;does things&lt;/em&gt; autonomously.&lt;/p&gt;

&lt;p&gt;Let's break it all down from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With "Just Using a Model"
&lt;/h2&gt;

&lt;p&gt;Picture this: you've got API access to a powerful model like Claude or GPT-4. You send it a prompt, it sends back a response. That's great for chatbots and one-shot completions — but what if you want the model to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse the web and pull real-time data?&lt;/li&gt;
&lt;li&gt;Execute Python code to analyze that data?&lt;/li&gt;
&lt;li&gt;Remember what you told it last week?&lt;/li&gt;
&lt;li&gt;Coordinate across multiple steps — each one depending on the last?&lt;/li&gt;
&lt;li&gt;Call your internal APIs or tools?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A raw model, on its own, can't do any of that. It can &lt;em&gt;talk&lt;/em&gt; about doing those things, but it has no way to actually &lt;em&gt;carry them out&lt;/em&gt;. It's like hiring a brilliant analyst who has no laptop, no internet, and can only communicate by passing notes. The intelligence is there — the infrastructure is not.&lt;/p&gt;

&lt;p&gt;That missing infrastructure is the harness.&lt;/p&gt;




&lt;h2&gt;
  
  
  So, What Exactly Is an Agent Harness?
&lt;/h2&gt;

&lt;p&gt;An agent harness is everything you build &lt;em&gt;around&lt;/em&gt; a model to transform it from a text-generator into an agent that can act in the real world.&lt;/p&gt;

&lt;p&gt;The cleanest formula I've come across is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = Model + Harness&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anything in your agent that isn't the model itself — is part of the harness.&lt;/p&gt;

&lt;p&gt;In concrete terms, the harness typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The orchestration loop&lt;/strong&gt; — the logic that takes a user message, asks the model what to do, runs that action, feeds the result back, and repeats until the task is complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool connections&lt;/strong&gt; — the plumbing that lets the model call a browser, run code, query a database, or hit an external API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — short-term context within a session AND long-term memory that persists across sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context management&lt;/strong&gt; — deciding what information goes into the prompt at each step (you can't just keep appending forever — models have token limits).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute and sandboxing&lt;/strong&gt; — somewhere safe for the agent to run code without blowing up your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; — so your agent can securely call external APIs without leaking credentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — logs, traces, and debugging tools so you know what happened when things go sideways at 2 AM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session management&lt;/strong&gt; — the ability for users to pause and resume, pick up right where they left off.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look at any AI-powered product you use today — Claude Code, GitHub Copilot, Cursor, Perplexity — and behind the scenes, there's a harness doing all of this work. The model is just one piece of a much larger machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Harness Building Has Been So Painful
&lt;/h2&gt;

&lt;p&gt;Here's the honest reality: building a harness from scratch is &lt;em&gt;hard&lt;/em&gt; and &lt;em&gt;time-consuming&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you've done it before, you know the drill. You pick a framework — LangGraph, LlamaIndex, CrewAI, Strands Agents — and start writing code. You wire up your tools. You manage your prompt structure carefully so the model doesn't get confused. You add error handling for when tool calls fail. You build retry logic. You handle streaming output. You set up logging. You package everything into a container, provision some compute, and deploy it.&lt;/p&gt;

&lt;p&gt;And then you realize you need session persistence. So you add a database. And then you realize you need the agent to authenticate with an external API. So you set up credential management. And now you need to understand why the agent went down a weird reasoning path, so you add tracing.&lt;/p&gt;

&lt;p&gt;For a straightforward use case, this might take a few days. For a complex one, it could take weeks — and a whole team.&lt;/p&gt;

&lt;p&gt;This is the real barrier to building with AI agents. Not the model. The harness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter Managed Harnesses: The Agent Factory Model
&lt;/h2&gt;

&lt;p&gt;Tooling has finally started catching up to this problem. The idea behind a managed harness is simple: instead of writing all that orchestration and infrastructure code yourself, you declare what your agent needs as &lt;em&gt;configuration&lt;/em&gt;, and the service builds the harness for you.&lt;/p&gt;

&lt;p&gt;Think of it like the difference between setting up your own server (writing harness code from scratch) versus using a managed cloud service (declaring config and letting the platform handle the rest).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt; is one of the services taking this approach. With AgentCore's harness feature, you define your agent in a JSON config file — model, system prompt, tools, memory settings — and the platform compiles that into a fully running agent, handling all the infrastructure underneath.&lt;/p&gt;

&lt;p&gt;Under the hood, AgentCore harness uses &lt;strong&gt;Strands Agents&lt;/strong&gt; (AWS's open-source agent SDK) to assemble the orchestration loop, tool execution, memory management, context handling, and streaming. Then it runs the whole thing inside an isolated &lt;strong&gt;microVM&lt;/strong&gt; — its own dedicated CPU, memory, and filesystem — without you provisioning a single server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Build Something: An AI Trends Analyst in Minutes
&lt;/h2&gt;

&lt;p&gt;To make this concrete, here's how you'd go from zero to a working AI agent using AgentCore harness — and yes, this genuinely takes about 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Goal
&lt;/h3&gt;

&lt;p&gt;Build an agent that browses HackerNews and dev.to, pulls today's top AI and developer tools posts, clusters them by topic, and produces a ranked summary with a chart — all autonomously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install the CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aws/agentcore@preview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create Your Agent Config Interactively
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command walks you through a set of prompts — which model to use, which tools to enable, authentication type, and so on. At the end, it generates a config file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TrendsAgentHarness"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"modelId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"global.anthropic.claude-sonnet-4-6"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentcore_browser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agentcore_code_interpreter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code-interpreter"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authorizerType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS_IAM"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The browser tool lets the agent navigate real websites. The code interpreter gives it a Python sandbox to crunch data and generate charts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Write Your System Prompt
&lt;/h3&gt;

&lt;p&gt;Edit the &lt;code&gt;system-prompt.md&lt;/code&gt; file that was created alongside the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Your job is to keep a pulse on what the AI and dev community is buzzing 
about right now. Every session, head over to HackerNews and dev.to, 
scrape today's hottest posts, then use the code interpreter to make sense 
of it all — cluster the topics, rank them by how often they show up, and 
summarize the top 5 in plain language. Throw in a bar chart. No fluff.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt is your agent's personality and operating instructions. This is where you define what it does, how it thinks, and what output you expect from it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Deploy It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, this takes your config and system prompt, assembles a Strands Agents program, and deploys it into a managed microVM environment. No Dockerfile, no Kubernetes, no EC2 instance. Just one command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Invoke It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore invoke &lt;span class="nt"&gt;--harness&lt;/span&gt; TrendsAgentHarness &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;uuidgen&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"What's trending in IT today?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens when you run this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent opens a browser and navigates to HackerNews.&lt;/li&gt;
&lt;li&gt;It scrolls through and reads the top posts.&lt;/li&gt;
&lt;li&gt;It does the same on dev.to.&lt;/li&gt;
&lt;li&gt;It pulls all the results into the code interpreter.&lt;/li&gt;
&lt;li&gt;It runs Python to cluster topics, calculate frequency, and build a bar chart.&lt;/li&gt;
&lt;li&gt;It streams a formatted summary back to your terminal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this runs in an isolated microVM that spins up for this session and tears down when it's done. No cross-session data leakage, no noisy neighbors.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Built In
&lt;/h2&gt;

&lt;p&gt;Here's a breakdown of what AgentCore harness gives you without any extra setup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Actually Means For You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Isolated microVM per session&lt;/td&gt;
&lt;td&gt;Your agent gets its own CPU, memory, and filesystem. Sessions are completely isolated from each other.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shell access&lt;/td&gt;
&lt;td&gt;The agent can run shell commands directly without going through the model's reasoning loop — faster and cheaper.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent filesystem&lt;/td&gt;
&lt;td&gt;Mid-session, the agent can save files, pause, and resume exactly where it left off.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model-agnostic routing&lt;/td&gt;
&lt;td&gt;Switch between Bedrock, OpenAI, and Google Gemini. You can even change providers mid-session and the conversation context stays intact.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in browser tool&lt;/td&gt;
&lt;td&gt;Powered by AgentCore Browser — the agent can navigate real websites, not just search APIs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in code interpreter&lt;/td&gt;
&lt;td&gt;A full Python sandbox. The agent can write and execute code, generate charts, process files, and more.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP server support&lt;/td&gt;
&lt;td&gt;Connect to any MCP-compatible tool server — Slack, Notion, GitHub, whatever your workflow needs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Gateway&lt;/td&gt;
&lt;td&gt;Connect to APIs you've registered centrally, so credentials are managed outside the agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom tool definitions&lt;/td&gt;
&lt;td&gt;Define your own inline function tools for the agent to call.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;Package domain knowledge as markdown + scripts and give your agent expert-level context on demand.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full observability&lt;/td&gt;
&lt;td&gt;Every action is auto-traced via AgentCore Observability, so you can debug and audit everything that happened.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Agent Skills: Teaching Your Agent Domain Expertise
&lt;/h2&gt;

&lt;p&gt;One feature worth calling out specifically is &lt;strong&gt;skills&lt;/strong&gt;. An agent skill is a bundle of markdown instructions and (optionally) scripts that gives your agent deep knowledge about a specific domain or workflow.&lt;/p&gt;

&lt;p&gt;Think of it this way: you can train a general model on your specific context. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A skill that teaches the agent how to work with your internal data format.&lt;/li&gt;
&lt;li&gt;A skill that walks the agent through your company's API conventions.&lt;/li&gt;
&lt;li&gt;A skill that gives the agent step-by-step knowledge of how to process Excel reports your way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You package the skill into the agent's environment, point the harness at it, and the agent picks it up and uses it automatically when relevant. No fine-tuning. No custom model training. Just structured knowledge the agent can reference.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Escape Hatch: When You Outgrow Config
&lt;/h2&gt;

&lt;p&gt;One question you might be asking: "What happens when my use case gets complex enough that a config file isn't enough?"&lt;/p&gt;

&lt;p&gt;That's a fair and important question. Maybe you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom multi-agent orchestration where agents hand off tasks to each other.&lt;/li&gt;
&lt;li&gt;Specialized routing logic based on the content of a message.&lt;/li&gt;
&lt;li&gt;A fully custom memory layer with your own vector database.&lt;/li&gt;
&lt;li&gt;Integration with internal infrastructure that doesn't fit a standard pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AgentCore harness has an answer for this: &lt;strong&gt;export to code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you need full control, you can export your harness configuration to Strands Agents code. You get the equivalent Python program that AgentCore was running for you — fully readable, fully editable — and you can extend it however you need. You stay on the same platform, just with more control.&lt;/p&gt;

&lt;p&gt;This is a smart design. You start with the fast path (config), and you graduate to the custom path (code) only when you actually need it. You're not locked into one or the other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Questions Answered
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need to build a harness if I'm just using Claude.ai or ChatGPT?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Those are consumer products where someone else already built the harness for you. You need to build your own when you're creating &lt;em&gt;custom&lt;/em&gt; agents — ones that call your specific tools, connect to your internal systems, maintain state, or run autonomously over multiple steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is a harness the same as an agent framework?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not quite. A framework (like Strands Agents, LangGraph, or CrewAI) gives you the building blocks — tool interfaces, loop patterns, model connectors. A harness is the fully assembled, running system: framework code plus compute, sandboxing, memory, auth, and observability. You use a framework to &lt;em&gt;build&lt;/em&gt; a harness, or you use a managed service to build one &lt;em&gt;for you&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I build a harness without a framework?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Technically yes, but you'd be writing the entire orchestration loop, tool dispatch, error recovery, and context management from scratch. Frameworks exist precisely so you don't have to. It's a bit like writing raw socket code instead of using Express.js — possible, but almost never the right call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the browser tool expensive on tokens?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, it does consume more tokens than simpler tools since it's processing full web pages. For the trends analyst use case, it's absolutely worth it. For agents that need lighter-weight data fetching, you might want to explore API-based tools or MCP servers that return structured data instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for the Community
&lt;/h2&gt;

&lt;p&gt;For a long time, building a production-grade AI agent required deep expertise across model APIs, orchestration frameworks, cloud infrastructure, and security. That's a lot of disciplines to combine, and it's been a genuine barrier for developers who want to experiment and build.&lt;/p&gt;

&lt;p&gt;Managed harness services like AgentCore change that equation. The gap between "I have an idea for an agent" and "I have a running agent" is now measured in minutes for straightforward use cases. That's genuinely exciting.&lt;/p&gt;

&lt;p&gt;It also means the interesting work shifts. Instead of spending your energy on infrastructure plumbing, you can focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should your agent actually &lt;em&gt;do&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;What domain knowledge does it need?&lt;/li&gt;
&lt;li&gt;What tools should it have access to?&lt;/li&gt;
&lt;li&gt;How should it reason and communicate?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are the questions worth spending your time on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Go From Here
&lt;/h2&gt;

&lt;p&gt;AgentCore harness is currently in &lt;strong&gt;public preview&lt;/strong&gt; in four AWS regions: US West (Oregon), US East (N. Virginia), Europe (Frankfurt), and Asia Pacific (Sydney).&lt;/p&gt;

&lt;p&gt;Here are the resources to get started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore.html" rel="noopener noreferrer"&gt;AgentCore harness documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://strandsagents.com" rel="noopener noreferrer"&gt;Strands Agents SDK (open source)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;AgentCore pricing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trends analyst agent described in this post — browsing HackerNews, clustering AI topics, generating a chart — took about 5 minutes from idea to first working invocation. The JSON config is 15 lines. The system prompt is 5 lines.&lt;/p&gt;

&lt;p&gt;What would you build with 5 minutes and a config file? I'd love to see what the community comes up with. Drop your ideas or experiments in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this post helped you understand agent harnesses better, consider sharing it with someone who's been struggling to wrap their head around the agent architecture puzzle. And if you're already building harnesses the hard way, maybe it's time to let the factory do some of that work for you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Cut Amazon Bedrock Costs with a 3-Layer Caching Pipeline on AWS Lambda + ElastiCache</title>
      <dc:creator>SURYANSH GUPTA</dc:creator>
      <pubDate>Tue, 05 May 2026 03:46:22 +0000</pubDate>
      <link>https://dev.to/aws-builders/cut-amazon-bedrock-costs-with-a-3-layer-caching-pipeline-on-aws-lambda-elasticache-1oi</link>
      <guid>https://dev.to/aws-builders/cut-amazon-bedrock-costs-with-a-3-layer-caching-pipeline-on-aws-lambda-elasticache-1oi</guid>
      <description>&lt;p&gt;If you're building AI-powered apps on AWS, you've probably felt the sting of Bedrock inference costs. Every token counts — and when users hammer your app with similar or identical questions, you're paying for the same answer over and over again.&lt;/p&gt;

&lt;p&gt;In this post I'll walk through a three-layer caching and optimization pipeline I built inside a single Lambda function backed by ElastiCache (Redis). By the end, you'll have a pattern that can dramatically reduce Bedrock calls in any support chatbot, internal knowledge assistant, or document Q&amp;amp;A tool you're shipping.&lt;/p&gt;

&lt;p&gt;Here's what we're building:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User prompt → Hash Check → Semantic Check → Prompt Compression → Bedrock → Cache Write
                  ↓               ↓
             hash_hit        semantic_hit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Architecture at a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;AWS Lambda&lt;/strong&gt; (Python)&lt;/td&gt;
&lt;td&gt;Caching logic, embedding, compression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Amazon ElastiCache&lt;/strong&gt; (Redis 7.1)&lt;/td&gt;
&lt;td&gt;Persistent shared memory across invocations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; (Nova Micro)&lt;/td&gt;
&lt;td&gt;Foundation model, only called on a true miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Titan Embeddings v2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts prompts to semantic vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because Lambda is stateless, every invocation starts fresh with zero memory of prior calls. ElastiCache fills that gap — it's the shared brain that persists across invocations and across different users hitting your function simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1 — Hash-Based Caching: The Fastest Win
&lt;/h2&gt;

&lt;p&gt;Before anything touches Bedrock, we check whether we've already answered this exact question.&lt;/p&gt;

&lt;p&gt;The trick is &lt;strong&gt;normalizing&lt;/strong&gt; the prompt first — lowercase, collapse whitespace — so &lt;code&gt;"  What is   MACHINE LEARNING?  "&lt;/code&gt; and &lt;code&gt;"what is machine learning?"&lt;/code&gt; produce the same SHA-256 fingerprint and share one cache entry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On every invocation, we check Redis with the &lt;code&gt;hash:&lt;/code&gt; prefix before doing anything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hash_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HASH_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;compute_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hash_hit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A hash hit costs you a single Redis &lt;code&gt;GET&lt;/code&gt; — no embedding call, no Bedrock invocation, no tokens burned. This is the fastest and cheapest path through the entire pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When does this shine?&lt;/strong&gt; Any FAQ-style workload where users repeatedly ask the same questions. Support bots. Help center chatbots. Internal HR assistants.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2 — Semantic Similarity Caching: Catching Paraphrases
&lt;/h2&gt;

&lt;p&gt;Hash-based caching misses paraphrases. &lt;code&gt;"What is machine learning?"&lt;/code&gt; and &lt;code&gt;"How would you define machine learning?"&lt;/code&gt; are semantically identical but produce completely different hashes.&lt;/p&gt;

&lt;p&gt;Semantic caching solves this with &lt;strong&gt;vector embeddings&lt;/strong&gt;. We convert every prompt to a list of floats that encodes its &lt;em&gt;meaning&lt;/em&gt;, then compare incoming prompts to stored vectors using cosine similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;norm_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norm_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;norm_a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;norm_b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;norm_b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since Redis stores bytes, not arrays, we serialize the vector with &lt;code&gt;struct.pack&lt;/code&gt; before writing and unpack it on read:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;serialize_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deserialize_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the handler, after a hash miss we embed the incoming prompt and scan stored vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_embeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stored&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
        &lt;span class="n"&gt;best_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;SIMILARITY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;best_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic_hit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SIMILARITY_THRESHOLD&lt;/code&gt; environment variable (default &lt;code&gt;0.90&lt;/code&gt;) is your dial for how aggressive the matching should be. Lower it to &lt;code&gt;0.80&lt;/code&gt; and you'll catch more paraphrases at the risk of serving a slightly off response. Tune it against your own traffic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 In practice, I've seen semantic_hit catch prompts like &lt;em&gt;"Explain ML to me"&lt;/em&gt; against a cached answer for &lt;em&gt;"What is machine learning?"&lt;/em&gt; with a score around &lt;strong&gt;0.94&lt;/strong&gt; — well above threshold, and a completely avoided Bedrock call.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Layer 3 — Prompt Compression: Saving Tokens on Every Miss
&lt;/h2&gt;

&lt;p&gt;Even with two cache layers, some prompts will always be new. Prompt compression squeezes cost out of every genuine cache miss by stripping filler language before the prompt reaches Bedrock.&lt;/p&gt;

&lt;p&gt;Filler phrases like &lt;code&gt;"Could you please"&lt;/code&gt;, &lt;code&gt;"I was wondering if"&lt;/code&gt;, or &lt;code&gt;"As an AI language model"&lt;/code&gt; consume tokens without improving the model's response. We maintain a simple list and strip them at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FILLER_PHRASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;please could you&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i was wondering if&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;could you please&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i would like you to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;as an ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can you please&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... extend this list based on your traffic patterns
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FILLER_PHRASES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compressed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phrase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compressed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;original_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;compressed_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compressed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[compression] original: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compressed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;compressed_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;compressed_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;compressed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CloudWatch log line gives you a measurable view of the savings on every miss — you can query these logs over time to identify your most common filler patterns and keep optimizing the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One critical design decision:&lt;/strong&gt; compression runs &lt;em&gt;after&lt;/em&gt; both cache checks, not before.&lt;/p&gt;

&lt;p&gt;If you compressed first, you'd alter the prompt before hashing it — so &lt;code&gt;"Could you please explain ML?"&lt;/code&gt; and &lt;code&gt;"Explain ML"&lt;/code&gt; would hash to the same key on the second call but different keys on the first, breaking cache consistency. The original prompt is always used for cache lookups; compression is purely a token cost optimization that only fires when a Bedrock call is actually going to happen.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Pipeline in &lt;code&gt;lambda_handler&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Putting it all together, the handler becomes a clean sequential pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 1: Exact hash match — fastest path, zero AI calls
&lt;/span&gt;    &lt;span class="n"&gt;hash_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HASH_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;compute_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hash_hit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 2: Semantic similarity — catches paraphrases
&lt;/span&gt;    &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_embeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stored&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;SIMILARITY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;best_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic_hit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="c1"&gt;# Layer 3: Compress before sending to Bedrock
&lt;/span&gt;    &lt;span class="n"&gt;compressed_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_bedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compressed_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write back both hash and embedding for future hits
&lt;/span&gt;    &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CACHE_TTL_SECONDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;embed_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EMBED_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;compute_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;store_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embed_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;miss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Observations from Testing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Cache Result&lt;/th&gt;
&lt;th&gt;Bedrock Call?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;"What is machine learning?"&lt;/code&gt; (1st call)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;miss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;"What is machine learning?"&lt;/code&gt; (2nd call)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hash_hit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"  What is   MACHINE LEARNING?  "&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hash_hit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"How would you define machine learning?"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;semantic_hit (0.94)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"Could you please explain what machine learning is?"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;miss&lt;/code&gt; → compressed&lt;/td&gt;
&lt;td&gt;✅ Yes (fewer tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  When to Use This Pattern
&lt;/h2&gt;

&lt;p&gt;This three-layer pipeline is most valuable when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query volume is high&lt;/strong&gt; — the cost savings on cache hits compound quickly at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users tend to ask similar questions&lt;/strong&gt; — support bots, knowledge bases, FAQ tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts are verbose&lt;/strong&gt; — compression delivers more savings when users write long-winded queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters&lt;/strong&gt; — a Redis &lt;code&gt;GET&lt;/code&gt; is orders of magnitude faster than a Bedrock roundtrip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's less impactful for highly creative or unique queries (content generation, code synthesis) where every prompt is genuinely different and semantic similarity won't trigger often.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently in Production
&lt;/h2&gt;

&lt;p&gt;A few things worth considering as you take this pattern to prod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replace the linear embedding scan&lt;/strong&gt; with a proper vector search (Redis Stack's &lt;code&gt;HNSW&lt;/code&gt; index, or OpenSearch with k-NN). Scanning every stored embedding is fine at low volume but doesn't scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument cache hit rates&lt;/strong&gt; with CloudWatch metrics so you can track ROI over time and justify the ElastiCache spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune &lt;code&gt;SIMILARITY_THRESHOLD&lt;/code&gt; per use case.&lt;/strong&gt; A support bot can be aggressive (0.85); a medical or legal assistant should be conservative (0.95+).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze your CloudWatch compression logs&lt;/strong&gt; weekly and update &lt;code&gt;FILLER_PHRASES&lt;/code&gt; based on real traffic patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a warm-up step&lt;/strong&gt; for known common queries — pre-populate the cache on deploy so the very first user gets a cache hit.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Three layers, one Lambda function, one ElastiCache cluster. Together they cover the most common sources of Bedrock cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it eliminates&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hash caching&lt;/td&gt;
&lt;td&gt;Exact duplicate calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Paraphrased duplicate calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt compression&lt;/td&gt;
&lt;td&gt;Excess tokens on every genuine miss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is modular — you can adopt any one layer independently, and each one pays for itself at a different traffic threshold. Start with hash caching (zero additional AWS cost beyond ElastiCache), add semantic caching once you see recurring paraphrases in your logs, and layer in prompt compression as your prompt corpus grows longer.&lt;/p&gt;

&lt;p&gt;If you're building on Amazon Bedrock, this is one of the highest-ROI architectural patterns you can drop into an existing Lambda-based backend with minimal rework.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built and tested as part of an AWS hands-on lab. All code runs on Python 3.12, Redis 7.1, and Amazon Bedrock Nova Micro via a cross-region inference profile.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have questions or want to share your own caching numbers? Drop them in the comments below 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>awscommunitybuilder</category>
    </item>
  </channel>
</rss>
