<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jayesh Shinde</title>
    <description>The latest articles on DEV Community by Jayesh Shinde (@jayesh_shinde).</description>
    <link>https://dev.to/jayesh_shinde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3517237%2Fb526557a-5a75-4338-814a-d6d3277a47b7.jpg</url>
      <title>DEV Community: Jayesh Shinde</title>
      <link>https://dev.to/jayesh_shinde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jayesh_shinde"/>
    <language>en</language>
    <item>
      <title>Stop Using IAM Access Keys: Secure Cross-Cloud Workloads with OIDC Federation</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Thu, 11 Jun 2026 02:21:33 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/stop-using-iam-access-keys-secure-cross-cloud-workloads-with-oidc-federation-4jgd</link>
      <guid>https://dev.to/jayesh_shinde/stop-using-iam-access-keys-secure-cross-cloud-workloads-with-oidc-federation-4jgd</guid>
      <description>&lt;p&gt;As developers and DevOps engineers, we’ve all been there. You have an external service—maybe an &lt;strong&gt;Azure Dynamics 365 (D365)&lt;/strong&gt; business application or a &lt;strong&gt;GitHub Actions&lt;/strong&gt; CI/CD pipeline—that needs to upload a file to Amazon S3 or trigger an AWS Lambda function. &lt;/p&gt;

&lt;p&gt;The easiest path? Create an AWS IAM User, generate a pair of static &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; credentials, dump them into your external service secrets, and call it a day. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop doing this.&lt;/strong&gt; 🛑&lt;/p&gt;

&lt;p&gt;The AWS Well-Architected Framework treats long-lived access keys as a significant risk and recommends temporary credentials instead. If those keys are leaked, hardcoded by accident, or left unrotated, an attacker gets everything that IAM user can reach, for as long as the keys stay valid.&lt;/p&gt;

&lt;p&gt;The solution? &lt;strong&gt;Workload Identity Federation via OpenID Connect (OIDC)&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In this post, we’ll look at why you need to ditch IAM users and exactly how to connect external workloads like Azure D365 and GitHub Actions securely using short-lived, temporary tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why NOT IAM Users?
&lt;/h2&gt;

&lt;p&gt;Here is how managing static keys by hand compares with assuming roles dynamically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;IAM User (Access Keys)&lt;/th&gt;
&lt;th&gt;IAM Role (OIDC / Assume Role)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Credential Rotation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual, tedious, and error-prone&lt;/td&gt;
&lt;td&gt;Automatic (handled by AWS STS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Leakage Risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (long-lived keys stay valid until someone rotates or deletes them)&lt;/td&gt;
&lt;td&gt;Low (short-lived credentials expire after 1 hr by default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auditability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard to trace back to specific sessions&lt;/td&gt;
&lt;td&gt;Clear session-based trails in CloudTrail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 user key per service to track&lt;/td&gt;
&lt;td&gt;1 IAM role, multiple trusted identity claim mappings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🛑 Discouraged&lt;/td&gt;
&lt;td&gt;Preferred&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Core Concept: Workload Identity Federation
&lt;/h2&gt;

&lt;p&gt;Instead of authenticating with a pre-shared password (an Access Key), AWS and your external identity provider establish a cryptographic trust.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[External Workload] ──(Requests OIDC JWT)──&amp;gt; [Identity Provider (Entra ID / GitHub)]
│                                                      │
│ (Presents JWT Token)                                 │
▼                                                      ▼
[AWS STS AssumeRoleWithWebIdentity] &amp;lt;──(Verifies Token)─────────┘
│
▼
[Temporary AWS Credentials Granted] ───&amp;gt; [Access AWS Resources (S3, Lambda, etc.)]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS acts as the validation authority: it inspects the incoming JSON Web Token (JWT) from your external provider, matches it against the rules you define in the role's trust policy, and grants temporary AWS credentials that expire after a short session (one hour by default).&lt;/p&gt;
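
&lt;p&gt;When a trust policy rejects a token, it usually helps to look at the claims AWS is matching against. Below is a minimal sketch in Python (our choice for illustration; nothing in the setup requires it) that decodes a JWT payload without verifying it. The sample token and the helper names are hypothetical, and because the signature is never checked this is only suitable for debugging, not for making trust decisions.&lt;/p&gt;

```python
import base64
import json


def b64url_decode(segment):
    # JWT segments are base64url without padding; restore it before decoding.
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(data):
    # Inverse helper, used only to build the sample token below.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def unverified_claims(jwt):
    # Split header.payload.signature and parse only the payload.
    # No signature check happens here: inspection only, never trust.
    header, payload, signature = jwt.split(".")
    return json.loads(b64url_decode(payload))


# Hypothetical token built locally so the sketch runs without any provider.
sample_claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:your-github-org/your-repo-name:ref:refs/heads/main",
}
sample_jwt = ".".join([
    b64url_encode(json.dumps({"alg": "none"}).encode()),
    b64url_encode(json.dumps(sample_claims).encode()),
    "signature-placeholder",
])

claims = unverified_claims(sample_jwt)
print(claims["iss"], claims["aud"], claims["sub"])
```

&lt;p&gt;The &lt;code&gt;iss&lt;/code&gt;, &lt;code&gt;aud&lt;/code&gt;, and &lt;code&gt;sub&lt;/code&gt; values printed here are the ones your trust policy conditions have to match.&lt;/p&gt;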




&lt;h2&gt;
  
  
  Scenario 1: Azure D365 / Entra ID ➔ AWS
&lt;/h2&gt;

&lt;p&gt;If you have a business workflow running in Microsoft Dynamics 365 or an Azure Function trying to authenticate to AWS, this is the gold-standard implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Register Microsoft as an Identity Provider in AWS IAM
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;AWS IAM&lt;/strong&gt; ➔ &lt;strong&gt;Identity Providers&lt;/strong&gt; ➔ &lt;strong&gt;Add Provider&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;OpenID Connect&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider URL:&lt;/strong&gt; &lt;code&gt;https://login.microsoftonline.com/&amp;lt;YOUR_AZURE_TENANT_ID&amp;gt;/v2.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience:&lt;/strong&gt; Enter the Application (Client) ID of your Azure App Registration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Configure the IAM Role &amp;amp; Trust Policy
&lt;/h3&gt;

&lt;p&gt;When creating the IAM role that D365 will assume, the &lt;strong&gt;Trust Policy&lt;/strong&gt; must enforce strict conditions. Do not check only the audience (&lt;code&gt;aud&lt;/code&gt;); you &lt;strong&gt;must&lt;/strong&gt; also pin the specific subject (&lt;code&gt;sub&lt;/code&gt;) so that other identities in your tenant that can obtain a token for the same audience can't assume the role.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::&amp;lt;AWS_ACCOUNT_ID&amp;gt;:oidc-provider/login.microsoftonline.com/&amp;lt;AZURE_TENANT_ID&amp;gt;/v2.0"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"login.microsoftonline.com/&amp;lt;AZURE_TENANT_ID&amp;gt;/v2.0:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;AZURE_APP_CLIENT_ID&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"login.microsoftonline.com/&amp;lt;AZURE_TENANT_ID&amp;gt;/v2.0:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;AZURE_SERVICE_PRINCIPAL_OBJECT_ID&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scenario 2: GitHub Actions ➔ AWS
&lt;/h2&gt;

&lt;p&gt;The exact same principles apply to your CI/CD pipelines. No more storing AWS keys in GitHub Secrets.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Register GitHub as an Identity Provider in AWS IAM
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;AWS IAM&lt;/strong&gt; ➔ &lt;strong&gt;Identity Providers&lt;/strong&gt; ➔ &lt;strong&gt;Add Provider&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider URL:&lt;/strong&gt; &lt;code&gt;https://token.actions.githubusercontent.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience:&lt;/strong&gt; &lt;code&gt;sts.amazonaws.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Configure the GitHub IAM Trust Policy
&lt;/h3&gt;

&lt;p&gt;To keep things secure, your condition should restrict role-assumption to a &lt;strong&gt;specific GitHub Organization&lt;/strong&gt;, &lt;strong&gt;Repository&lt;/strong&gt;, or even a &lt;strong&gt;specific git branch&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::&amp;lt;AWS_ACCOUNT_ID&amp;gt;:oidc-provider/token.actions.githubusercontent.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:your-github-org/your-repo-name:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
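
&lt;p&gt;GitHub's OIDC token carries the repository and ref in its &lt;code&gt;sub&lt;/code&gt; claim, in the form &lt;code&gt;repo:ORG/REPO:ref:refs/heads/BRANCH&lt;/code&gt;, so the wildcard in a &lt;code&gt;StringLike&lt;/code&gt; condition decides how much of that string is pinned. The Python sketch below is a rough local model of that matching, assuming conditions are simply ANDed together; &lt;code&gt;string_like&lt;/code&gt; and &lt;code&gt;trust_allows&lt;/code&gt; are hypothetical helper names, not AWS APIs, and the real evaluation happens inside AWS STS.&lt;/p&gt;

```python
import re


def string_like(pattern, value):
    # IAM StringLike treats * as any run of characters and ? as one character.
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value) is not None


def trust_allows(claims, equals, like):
    # Every StringEquals and StringLike entry has to match the token's claims.
    exact_ok = all(claims.get(key) == want for key, want in equals.items())
    like_ok = all(string_like(pat, claims.get(key, "")) for key, pat in like.items())
    return exact_ok and like_ok


equals = {"aud": "sts.amazonaws.com"}
like = {"sub": "repo:your-github-org/your-repo-name:*"}

main_branch = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:your-github-org/your-repo-name:ref:refs/heads/main",
}
other_repo = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:someone-else/fork:ref:refs/heads/main",
}

print(trust_allows(main_branch, equals, like))  # True
print(trust_allows(other_repo, equals, like))   # False
```

&lt;p&gt;Tightening the pattern to &lt;code&gt;repo:your-github-org/your-repo-name:ref:refs/heads/main&lt;/code&gt; would restrict the role to workflows running on the &lt;code&gt;main&lt;/code&gt; branch.&lt;/p&gt;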



&lt;h3&gt;
  
  
  3. Use it in your GitHub Actions Workflow
&lt;/h3&gt;

&lt;p&gt;In your &lt;code&gt;.github/workflows/deploy.yml&lt;/code&gt;, grant the workflow the &lt;code&gt;id-token: write&lt;/code&gt; permission so it can request a JWT from GitHub's OIDC provider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to AWS&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt; &lt;span class="c1"&gt;# Mandatory for OIDC federation&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;AWSLogin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS Credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::&amp;lt;AWS_ACCOUNT_ID&amp;gt;:role/YourGitHubOidcRole&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test AWS Connection&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws s3 ls&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pro-Tips for Production Implementations 💡
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure-as-Code (IaC) Thumbprints:&lt;/strong&gt; If you are automating this setup with Terraform or CloudFormation, you may be asked for a server certificate thumbprint for the OIDC provider. AWS now validates providers whose TLS certificates chain to a root CA in its own trust library against that library, and falls back to the configured thumbprints only when the certificate is not signed by one of those CAs. If your provider falls into the second group, make sure your automation fetches or references the current thumbprints so authentication doesn't suddenly fail when the provider rotates its certificates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Multi-Tenant" Trap:&lt;/strong&gt; In Azure, always make your App Registration &lt;strong&gt;Single-Tenant&lt;/strong&gt; unless you have an explicit multi-organization architecture requirement. This is an excellent defense-in-depth practice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What if I have an on-premises legacy workload?&lt;/strong&gt; If you have legacy servers with no centralized OIDC identity provider, don't revert to IAM users right away! Look into &lt;strong&gt;AWS IAM Roles Anywhere&lt;/strong&gt;. It allows local legacy infrastructure to use local X.509 certificates to securely request short-lived tokens from AWS STS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary: Choose Your Strategy
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Scenario&lt;/th&gt;
&lt;th&gt;Recommended Authentication Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure D365 / Azure Functions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OIDC Federation (Entra ID) + IAM Role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub Actions / GitLab CI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OIDC Federation (GitHub/GitLab Provider) + IAM Role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Cross-Account Communication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native IAM Role Cross-Account Trust Policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Legacy On-Premises Server (No IdP)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS IAM Roles Anywhere (X.509 Certificates)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Migrating to Workload Identity Federation might take a few extra minutes of initial configuration compared to copying and pasting static keys, but the massive leap in cloud security makes it non-negotiable for modern systems.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>iam</category>
      <category>security</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The Mystery of the 37-Second Lambda Delay (And How AWS EventBridge Fooled Us)</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 07 Jun 2026 09:31:20 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/the-mystery-of-the-37-second-lambda-delay-and-how-aws-eventbridge-fooled-us-21m6</link>
      <guid>https://dev.to/jayesh_shinde/the-mystery-of-the-37-second-lambda-delay-and-how-aws-eventbridge-fooled-us-21m6</guid>
      <description>&lt;h2&gt;
  
  
  The Mystery of the 37-Second Lambda Delay (And How AWS EventBridge Fooled Us)
&lt;/h2&gt;

&lt;p&gt;We’ve all been there. Everything works flawlessly in your SQA environment, but the moment your code hits UAT, it behaves like it’s wading through molasses.&lt;/p&gt;

&lt;p&gt;Recently, we ran into a bizarre ghost in our AWS infrastructure: a Node.js Lambda function, triggered on a regular 10-minute interval by Amazon EventBridge, was consistently taking &lt;strong&gt;37 seconds&lt;/strong&gt; to log its very first line of code.&lt;/p&gt;

&lt;p&gt;We initially thought it was a classic cold start or VPC network issue, but the real culprit turned out to be much sneakier. Here is how we realized EventBridge completely fooled us.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The 37-Second Wall
&lt;/h2&gt;

&lt;p&gt;In SQA, the Lambda was invoked within 2 to 3 seconds of the scheduled time. In UAT, it took 37 seconds.&lt;/p&gt;

&lt;p&gt;We tried changing the trigger to run every single minute, but the 37-second delay &lt;em&gt;still&lt;/em&gt; happened. We even threw Provisioned Concurrency at it to force the containers to stay warm, but it made absolutely no difference. The &lt;code&gt;START RequestId&lt;/code&gt; log line stubbornly refused to print until the 37th second of the minute.&lt;/p&gt;

&lt;p&gt;The environment wasn't lagging; it was completely stalled before our code even kicked off.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Debugged It
&lt;/h2&gt;

&lt;p&gt;The breakthrough came when we stopped looking at the CloudWatch timestamps and looked inside the actual &lt;code&gt;event&lt;/code&gt; object payload passed into the Node.js handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws.events"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-06-14T18:00:37Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:events:...:rule/ten-min-cron"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we looked at that &lt;code&gt;"time"&lt;/code&gt; field generated by EventBridge, the lightbulb finally went on. The timestamp read exactly &lt;code&gt;:37&lt;/code&gt; seconds past the minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EventBridge wasn't even sending the event to our Lambda until the 37th second.&lt;/strong&gt; Our Lambda wasn't lagging; it was executing the exact millisecond AWS handed it the job.&lt;/p&gt;
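
&lt;p&gt;The same check can be scripted so it is easy to repeat across environments. The sketch below is in Python purely for illustration (our Lambda is Node.js); it separates the two delays that are easy to confuse: how far past the minute EventBridge stamped the event, and how long the handler took to see it. The function names and the &lt;code&gt;now&lt;/code&gt; value are hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timezone


def parse_event_time(event):
    # EventBridge sends an ISO 8601 UTC timestamp ending in "Z".
    return datetime.fromisoformat(event["time"].replace("Z", "+00:00"))


def seconds_past_minute(event):
    # Offset of the event's own timestamp from the top of the minute.
    return parse_event_time(event).second


def handler_lag_seconds(event, now):
    # Time between EventBridge stamping the event and the handler observing it.
    return (now - parse_event_time(event)).total_seconds()


event = {"source": "aws.events", "time": "2026-06-14T18:00:37Z"}
# Hypothetical handler clock, a quarter of a second after the event time.
now = datetime(2026, 6, 14, 18, 0, 37, 250000, tzinfo=timezone.utc)

print(seconds_past_minute(event))       # 37
print(handler_lag_seconds(event, now))  # 0.25
```

&lt;p&gt;A large first number with a small second number points at the schedule, not at the function.&lt;/p&gt;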




&lt;h2&gt;
  
  
  The Realization: EventBridge Jitter &amp;amp; The Redo
&lt;/h2&gt;

&lt;p&gt;As it turns out, AWS documents that EventBridge schedule expressions offer no finer than &lt;strong&gt;one-minute precision&lt;/strong&gt;, and that a scheduled rule can fire some time after the nominal minute boundary. Our reading is that the spread is deliberate: if millions of customer crons fired at exactly &lt;code&gt;12:00:00.000&lt;/code&gt;, they would hammer downstream services worldwide (the "thundering herd" problem), so executions are staggered across the minute.&lt;/p&gt;

&lt;p&gt;Our UAT cron rule just happened to get dealt a brutal 37-second delay slot by AWS's internal scheduling engine when the infrastructure was first built.&lt;/p&gt;

&lt;p&gt;To prove it, we completely &lt;strong&gt;destroyed our UAT infrastructure and stood it up again from scratch&lt;/strong&gt;. When the new EventBridge rule was created, AWS assigned it a completely different internal bucket. Boom—the delay instantly dropped to &lt;strong&gt;8 seconds&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Final Takeaway
&lt;/h3&gt;

&lt;p&gt;If you are triggering Lambdas via an ALB or API Gateway, AWS treats it as live traffic and routes it in milliseconds. But with EventBridge crons, you are completely at the mercy of the schedule lottery.&lt;/p&gt;

&lt;p&gt;Our code wasn't broken, and our network was fine. It was just luck of the draw with AWS's background clock engine. If your background tasks can handle running a few seconds late, save yourself the headache and just let EventBridge do its thing!&lt;/p&gt;

</description>
      <category>lambda</category>
      <category>aws</category>
      <category>serverless</category>
      <category>eventbridge</category>
    </item>
    <item>
      <title>How we tracked down a mysterious latency issue in our AWS Lambda + RDS Proxy stack, and discovered Prisma was the culprit all along.</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 17 May 2026 00:56:32 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/how-we-tracked-down-a-mysterious-latency-issue-in-our-aws-lambda-rds-proxy-stack-and-discovered-4h0i</link>
      <guid>https://dev.to/jayesh_shinde/how-we-tracked-down-a-mysterious-latency-issue-in-our-aws-lambda-rds-proxy-stack-and-discovered-4h0i</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Our API Was Fine. Database Was Fine. So Why Were Queries Taking 16 Seconds?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It started with a support ticket. A customer-facing API that normally responds in 200ms was occasionally spiking to 16 seconds. Not every request — just enough to make people nervous.&lt;/p&gt;

&lt;p&gt;We run a fairly standard serverless stack: AWS Lambda (Node.js), RDS Proxy in front of Aurora MySQL, and Prisma ORM handling all the database interactions. About 27 microservices, 13 database schemas, and roughly 14 Prisma client instances managed through a shared dependency injection container. It had been running fine for months.&lt;/p&gt;

&lt;p&gt;So what changed? Honestly, nothing. And that's what made it so confusing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ghost in the Logs
&lt;/h2&gt;

&lt;p&gt;The first thing we did was check our application logs. A simple query (something like "fetch a customer's info by ID") was reported as taking 16 seconds from the Lambda's perspective. We grabbed the same query from the logs and ran it manually against the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3.2 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yeah. Three milliseconds.&lt;/p&gt;

&lt;p&gt;So the database was fast. The query was fast. But our application was experiencing 16 seconds of... something. The time was being spent somewhere between the Lambda and the database, and we had no idea where.&lt;/p&gt;

&lt;h2&gt;
  
  
  Following the Breadcrumbs to RDS Proxy
&lt;/h2&gt;

&lt;p&gt;Since the query itself was not slow, we started looking at the connection layer. We use RDS Proxy to manage connection pooling for our Lambda fleet — standard practice to avoid overwhelming Aurora with hundreds of short-lived connections.&lt;/p&gt;

&lt;p&gt;We pulled up CloudWatch and looked at a metric we had honestly never paid much attention to before:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;DatabaseConnectionsCurrentlySessionPinned&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And there it was. During business hours, we were seeing &lt;strong&gt;400 to 870 pinned connections&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For those unfamiliar: RDS Proxy's whole purpose is to &lt;strong&gt;multiplex&lt;/strong&gt; database connections. Multiple Lambda invocations should be sharing a small pool of backend connections. But when a connection gets "pinned," the proxy dedicates that backend connection exclusively to one client session. It can't be shared. It can't be reused. It just sits there, held hostage.&lt;/p&gt;

&lt;p&gt;With 870 pinned connections, our proxy was essentially not proxying. Lambda invocations were queuing up, waiting for a pinned connection to free up, and that waiting time was showing up as query latency on the application side.&lt;/p&gt;

&lt;p&gt;But why were connections getting pinned?&lt;/p&gt;

&lt;h2&gt;
  
  
  The RDS Proxy Logs Tell All
&lt;/h2&gt;

&lt;p&gt;We dug into the RDS Proxy log group (&lt;code&gt;/aws/rds/proxy/&amp;lt;proxy-name&amp;gt;&lt;/code&gt;) using CloudWatch Logs Insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pinned&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Pinning&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dominant pinning reason, appearing hundreds of times per minute:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"A protocol-level prepared statement was detected"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Prepared statements. That's what was pinning every single connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Down the MySQL Protocol Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;Here's something most people don't realize about Prisma — and honestly, we didn't either until this investigation.&lt;/p&gt;

&lt;p&gt;MySQL has two query protocols:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text Protocol&lt;/strong&gt; (&lt;code&gt;COM_QUERY&lt;/code&gt;) — You send a SQL string, the server parses and executes it, and sends back the result. Stateless. RDS Proxy can multiplex these connections freely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary Protocol&lt;/strong&gt; (&lt;code&gt;COM_STMT_PREPARE&lt;/code&gt; → &lt;code&gt;COM_STMT_EXECUTE&lt;/code&gt; → &lt;code&gt;COM_STMT_CLOSE&lt;/code&gt;) — You first ask the server to prepare a statement, which creates a server-side handle. Then you execute against that handle with bound parameters. The handle is tied to the specific connection.&lt;/p&gt;

&lt;p&gt;RDS Proxy &lt;strong&gt;cannot multiplex&lt;/strong&gt; connections that have prepared statement handles open. The moment it sees &lt;code&gt;COM_STMT_PREPARE&lt;/code&gt;, it pins the connection.&lt;/p&gt;

&lt;p&gt;And here's the kicker: &lt;strong&gt;Prisma's query engine uses the binary protocol for everything.&lt;/strong&gt; Every &lt;code&gt;findMany&lt;/code&gt;, every &lt;code&gt;update&lt;/code&gt;, every &lt;code&gt;create&lt;/code&gt; — they all go through &lt;code&gt;COM_STMT_PREPARE&lt;/code&gt; under the hood. Your application code looks like innocent ORM calls, but on the wire, every single one is a prepared statement.&lt;/p&gt;

&lt;p&gt;We had 14 Prisma clients, each running queries through the binary protocol. Every query pinned its connection. Multiply that across a fleet of Lambda invocations, and you get 870 pinned connections during peak hours.&lt;/p&gt;
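
&lt;p&gt;To make the effect concrete, here is a deliberately simplified toy model in Python. It is not how RDS Proxy is implemented: &lt;code&gt;ToyProxy&lt;/code&gt; is a hypothetical class that assumes a text-protocol query borrows a backend connection and returns it immediately, while a prepare pins the connection to that session for good.&lt;/p&gt;

```python
class ToyProxy:
    def __init__(self, backend_connections):
        self.free = backend_connections  # connections still available for sharing
        self.pinned = set()              # sessions that own a connection outright

    def run(self, session, command):
        if session in self.pinned:
            return "ok"        # keeps using the connection it already holds
        if self.free == 0:
            return "queued"    # nothing left to borrow
        if command == "COM_STMT_PREPARE":
            self.free -= 1     # connection leaves the shared pool
            self.pinned.add(session)
        return "ok"            # COM_QUERY borrows and returns immediately


text = ToyProxy(backend_connections=2)
text_results = [text.run(s, "COM_QUERY") for s in ("a", "b", "c")]
print(text_results)    # ['ok', 'ok', 'ok']

binary = ToyProxy(backend_connections=2)
binary_results = [binary.run(s, "COM_STMT_PREPARE") for s in ("a", "b", "c")]
print(binary_results)  # ['ok', 'ok', 'queued']
```

&lt;p&gt;With two backend connections, three text-protocol sessions share them without trouble, while the third session that prepares a statement has to wait. Scale that up and the wait shows up as query latency.&lt;/p&gt;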

&lt;h2&gt;
  
  
  First Attempt: &lt;code&gt;statement_cache_size=0&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Our first mitigation idea came from Prisma's documentation. There's a connection URL parameter called &lt;code&gt;statement_cache_size&lt;/code&gt; that controls how many prepared statements are cached. We set it to zero:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3306&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;mydb&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;statement_cache_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The theory was: if we disable caching, Prisma will close each prepared statement immediately after execution instead of holding it open.&lt;/p&gt;

&lt;p&gt;We deployed it. We watched the metrics. And initially, it looked promising: &lt;code&gt;Prepared_stmt_count&lt;/code&gt; on the MySQL server dropped to near zero. But the actual &lt;code&gt;DatabaseConnectionsCurrentlySessionPinned&lt;/code&gt; metric? After a brief post-deploy dip, it was back at 400-870.&lt;/p&gt;

&lt;p&gt;After more digging, we figured out why. &lt;code&gt;statement_cache_size=0&lt;/code&gt; disables the &lt;strong&gt;cache&lt;/strong&gt;, but it doesn't change the &lt;strong&gt;protocol&lt;/strong&gt;. Prisma still sends &lt;code&gt;COM_STMT_PREPARE&lt;/code&gt; for every query. Even though the statement is closed immediately after execution, RDS Proxy pins the connection the moment it sees that &lt;code&gt;COM_STMT_PREPARE&lt;/code&gt; packet. The pin happens before the statement is even executed, let alone closed.&lt;/p&gt;

&lt;p&gt;So &lt;code&gt;statement_cache_size=0&lt;/code&gt; was a hygiene improvement (it keeps the server-side prepared statement count from growing unbounded), but it didn't solve our actual problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Attempt: Reducing Pool Size
&lt;/h2&gt;

&lt;p&gt;We tried reducing the connection pool size per Prisma client. The idea was: fewer connections per Lambda means fewer connections to pin, which means the proxy has more backend connections available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;connection_limit&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It helped a little. The peaks were lower. But we were just managing the symptom — the pinning itself never went away. Every connection was still getting pinned, just fewer of them at once.&lt;/p&gt;
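
&lt;p&gt;A back-of-the-envelope estimate shows why this only moves the ceiling. The sketch below is illustrative: the function name is hypothetical, and the container and client counts are assumed values, not our production numbers.&lt;/p&gt;

```typescript
// Back-of-the-envelope estimate; every input is an assumed value.
function worstCaseConnections(
  concurrentLambdas: number,
  clientsPerLambda: number,
  connectionLimit: number,
): number {
  // Each warm container can hold one full pool per Prisma client.
  return concurrentLambdas * clientsPerLambda * connectionLimit;
}

// Hypothetical fleet: 20 warm containers, 2 Prisma clients each.
console.log(worstCaseConnections(20, 2, 10)); // 400
console.log(worstCaseConnections(20, 2, 5)); // 200
```

&lt;p&gt;Halving &lt;code&gt;connection_limit&lt;/code&gt; halves the worst case, but every connection that is actually used still ends up pinned.&lt;/p&gt;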

&lt;h2&gt;
  
  
  The Actual Fix: Prisma 7.8.0 and Driver Adapters
&lt;/h2&gt;

&lt;p&gt;While digging through Prisma's changelog and GitHub issues, we discovered that Prisma 7.x had fundamentally changed how the ORM talks to databases.&lt;/p&gt;

&lt;p&gt;In Prisma 5.x (our version), every query goes through a &lt;strong&gt;Rust query engine&lt;/strong&gt; — a compiled binary that speaks the MySQL binary protocol. You literally ship a &lt;code&gt;libquery_engine-rhel-openssl-3.0.x.so.node&lt;/code&gt; file with your Lambda. There's no way to make it use the text protocol.&lt;/p&gt;

&lt;p&gt;In Prisma 7.x, the Rust engine is gone. Instead, you use &lt;strong&gt;driver adapters&lt;/strong&gt; — thin wrappers around native JavaScript database drivers like &lt;code&gt;mysql2&lt;/code&gt; or &lt;code&gt;mariadb&lt;/code&gt;. And crucially, &lt;code&gt;@prisma/adapter-mariadb&lt;/code&gt; supports a &lt;code&gt;useTextProtocol&lt;/code&gt; option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Prisma 5.x — Rust engine, binary protocol, pins RDS Proxy&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PrismaClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;connectionUrl&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Prisma 7.x — Driver adapter, text protocol, RDS Proxy friendly&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PrismaMariaDb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;password&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mydb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;useTextProtocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PrismaClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;adapter&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;useTextProtocol: true&lt;/code&gt;, the adapter uses &lt;code&gt;connection.query()&lt;/code&gt; (text protocol, &lt;code&gt;COM_QUERY&lt;/code&gt;) instead of &lt;code&gt;connection.execute()&lt;/code&gt; (binary protocol, &lt;code&gt;COM_STMT_PREPARE&lt;/code&gt;). No prepared statements. No pinning.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Migration Wasn't Trivial, But It Was Focused
&lt;/h3&gt;

&lt;p&gt;Upgrading from Prisma 5.12 to 7.8.0 is a major version jump. Here's what we had to change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema files (all 13 of them):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generator changed from &lt;code&gt;prisma-client-js&lt;/code&gt; to &lt;code&gt;prisma-client&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Removed &lt;code&gt;binaryTargets&lt;/code&gt; entirely — no more Rust binary&lt;/li&gt;
&lt;li&gt;Removed &lt;code&gt;previewFeatures = ["tracing"]&lt;/code&gt; — tracing is GA now&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dependency injection container:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replaced URL-based datasource configuration with adapter-based instantiation&lt;/li&gt;
&lt;li&gt;Each of our 14 Prisma clients got its own adapter instance with &lt;code&gt;useTextProtocol: true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Connection config changed from URL string to config object (host, port, user, password, SSL)&lt;/li&gt;
&lt;/ul&gt;
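
&lt;p&gt;The URL-to-config change can be handled with a small helper. This is a sketch under assumptions: &lt;code&gt;configFromUrl&lt;/code&gt; and the &lt;code&gt;DbConfig&lt;/code&gt; field names are hypothetical, SSL settings are left out, and the exact option names should be checked against the adapter and driver versions you install.&lt;/p&gt;

```typescript
// Hypothetical helper: turn a MySQL connection URL into the kind of
// config object a driver adapter takes. Field names are assumptions.
type DbConfig = {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
};

function configFromUrl(connectionUrl: string): DbConfig {
  const url = new URL(connectionUrl);
  return {
    host: url.hostname,
    // Fall back to MySQL's default port when the URL omits one.
    port: url.port === '' ? 3306 : Number(url.port),
    // Credentials in URLs are percent-encoded, so decode them.
    user: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password),
    // Drop the leading slash from the path to get the schema name.
    database: url.pathname.replace(/^\//, ''),
  };
}

const config = configFromUrl('mysql://user:p%40ss@proxy:3306/mydb?connection_limit=5');
console.log(config);
```

&lt;p&gt;Query-string parameters such as &lt;code&gt;connection_limit&lt;/code&gt; are deliberately ignored here, since they would need to be mapped to whatever pool options the chosen driver supports.&lt;/p&gt;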

&lt;p&gt;&lt;strong&gt;Build pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deleted the entire block that copied the Rust query engine binary into Lambda bundles&lt;/li&gt;
&lt;li&gt;Lambda bundles immediately got ~90% smaller (roughly 14 MB down to 1.6 MB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What stayed the same:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All our &lt;code&gt;$extends&lt;/code&gt; wrappers (retry logic) worked without changes&lt;/li&gt;
&lt;li&gt;esbuild bundling continued to output CJS — no ESM migration needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total effort was about 2-3 days including testing across all services.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After deploying Prisma 7.8.0 with driver adapters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DatabaseConnectionsCurrentlySessionPinned&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;400–870 (business hours)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Near zero&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS Proxy pin reason&lt;/td&gt;
&lt;td&gt;"protocol-level prepared statement"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gone&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda cold start&lt;/td&gt;
&lt;td&gt;~800ms – 1.5s&lt;/td&gt;
&lt;td&gt;~250ms – 400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda bundle size&lt;/td&gt;
&lt;td&gt;~14 MB (Rust binary)&lt;/td&gt;
&lt;td&gt;~1.6 MB (WASM / TS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query latency (P99)&lt;/td&gt;
&lt;td&gt;16s spikes&lt;/td&gt;
&lt;td&gt;Consistent &amp;lt; 500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mysterious 16-second latencies vanished. Not reduced — vanished.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: ORM Comparison
&lt;/h2&gt;

&lt;p&gt;During our investigation we also evaluated whether switching ORMs entirely might be a cleaner path. Here's what we found:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Prisma 5.12 (Rust Engine)&lt;/th&gt;
&lt;th&gt;Prisma 7.8.0 (Adapter)&lt;/th&gt;
&lt;th&gt;Drizzle ORM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engine Footprint&lt;/td&gt;
&lt;td&gt;~14 MB (Heavy Rust Binary)&lt;/td&gt;
&lt;td&gt;~1.6 MB (WASM / TS)&lt;/td&gt;
&lt;td&gt;Minimal (Pure JS/TS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical Cold Start&lt;/td&gt;
&lt;td&gt;Very Poor (800ms – 1.5s+)&lt;/td&gt;
&lt;td&gt;Good (250ms – 400ms)&lt;/td&gt;
&lt;td&gt;Excellent (100ms – 200ms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS Proxy Friendly?&lt;/td&gt;
&lt;td&gt;No (Pins everything)&lt;/td&gt;
&lt;td&gt;Yes (Via &lt;code&gt;useTextProtocol&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (Via &lt;code&gt;prepare: false&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type Safety Style&lt;/td&gt;
&lt;td&gt;Generated Schema Client&lt;/td&gt;
&lt;td&gt;Generated Schema Client&lt;/td&gt;
&lt;td&gt;Code-first / Infer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Drizzle is objectively lighter and has better cold starts, but migrating 27 services away from Prisma would be a months-long project. Prisma 7.8.0 with driver adapters got us to "RDS Proxy friendly" without changing our query layer, model definitions, or testing patterns. For us, that was the right trade-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Your ORM's wire protocol matters more than you think.&lt;/strong&gt;&lt;br&gt;
We spent weeks optimizing queries that didn't need optimizing. The queries were fast. The protocol was the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;statement_cache_size=0&lt;/code&gt; is not the same as "no prepared statements."&lt;/strong&gt;&lt;br&gt;
This one burned us. Disabling the cache still uses &lt;code&gt;COM_STMT_PREPARE&lt;/code&gt; — it just closes the statement immediately. RDS Proxy doesn't care. It pins on the &lt;code&gt;PREPARE&lt;/code&gt;, not on the cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. RDS Proxy logs are your friend.&lt;/strong&gt;&lt;br&gt;
The RDS Proxy log group in CloudWatch (named &lt;code&gt;/aws/rds/proxy/your-proxy-name&lt;/code&gt;) tells you exactly why connections are being pinned. We should have looked there on day one instead of chasing query performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CloudWatch metrics can mislead without context.&lt;/strong&gt;&lt;br&gt;
We initially saw &lt;code&gt;DatabaseConnectionsCurrentlySessionPinned&lt;/code&gt; drop after deploying &lt;code&gt;statement_cache_size=0&lt;/code&gt; and thought we'd fixed it. The drop was real but temporary — a deployment artifact from Lambda instances recycling. The metric climbed right back up once traffic resumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Check your Rust binary targets.&lt;/strong&gt;&lt;br&gt;
While investigating, we also discovered our Lambda was bundling an OpenSSL 1.0.x query engine binary on an Amazon Linux 2023 runtime that uses OpenSSL 3.0.x. A latent mismatch that happened to work by accident but could have caused cryptic failures at any time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Still Some Loose Ends
&lt;/h2&gt;

&lt;p&gt;Even after the Prisma 7 migration, we found two residual pinning patterns in our RDS Proxy logs — both from raw SQL queries that use MySQL user variables (&lt;code&gt;SET @val = ...&lt;/code&gt;). These pin for a different reason: "SQL changed session settings that the proxy doesn't track."&lt;/p&gt;

&lt;p&gt;One we converted to a Prisma model query (it was using &lt;code&gt;SELECT INTO @var&lt;/code&gt; when a simple &lt;code&gt;findFirst&lt;/code&gt; + &lt;code&gt;update&lt;/code&gt; would do). The other involves a stored procedure with an &lt;code&gt;OUT&lt;/code&gt; parameter, which genuinely requires user variables — that's a stored proc refactor we've deferred.&lt;/p&gt;

&lt;p&gt;The prepared statement pinning that was causing 400-870 pinned connections? That's completely gone.&lt;/p&gt;
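
&lt;p&gt;To keep an eye on the pin reasons that remain, exported proxy log lines can be bucketed by reason. The sketch below is a hypothetical triage helper: only the two quoted message fragments come from the logs described in this post, and the sample lines are illustrative rather than real log output.&lt;/p&gt;

```typescript
// Hypothetical triage helper. Only the two matched message fragments
// come from the proxy logs quoted in this post.
type PinReason = 'prepared-statement' | 'session-settings' | 'other';

function classifyPinReason(logLine: string): PinReason {
  const line = logLine.toLowerCase();
  if (line.indexOf('prepared statement was detected') !== -1) return 'prepared-statement';
  if (line.indexOf('session settings') !== -1) return 'session-settings';
  return 'other';
}

// Count how many log lines fall into each bucket.
function tallyPinReasons(logLines: string[]): { [reason: string]: number } {
  const tally: { [reason: string]: number } = {};
  for (const logLine of logLines) {
    const reason = classifyPinReason(logLine);
    tally[reason] = (tally[reason] || 0) + 1;
  }
  return tally;
}

// Illustrative sample lines, not real log output.
const sampleLines = [
  'A protocol-level prepared statement was detected',
  "SQL changed session settings that the proxy doesn't track",
  'A protocol-level prepared statement was detected',
  'some unrelated log line',
];
const tally = tallyPinReasons(sampleLines);
console.log(tally);
```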

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you're running Prisma with AWS RDS Proxy and experiencing mysterious latency:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check &lt;code&gt;DatabaseConnectionsCurrentlySessionPinned&lt;/code&gt; in CloudWatch&lt;/li&gt;
&lt;li&gt;Check your RDS Proxy logs for &lt;code&gt;"prepared statement was detected"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If that's your issue, &lt;code&gt;statement_cache_size=0&lt;/code&gt; won't fix it — the protocol is the problem&lt;/li&gt;
&lt;li&gt;Upgrade to Prisma 7.x with &lt;code&gt;@prisma/adapter-mariadb&lt;/code&gt; (or &lt;code&gt;@prisma/adapter-mysql&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;useTextProtocol: true&lt;/code&gt; on the adapter&lt;/li&gt;
&lt;li&gt;Watch your pinned connections drop to near zero&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This post documents a real production investigation. The codebase manages ~27 microservices across 13 MySQL schemas on AWS Lambda with RDS Proxy. If you've hit something similar, I hope this saves you the weeks of debugging it took us.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>prisma</category>
      <category>mysql</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How We Fixed Intermittent ECS Image-Not-Found Errors in AWS CDK</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Fri, 13 Mar 2026 09:00:11 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/how-we-fixed-intermittent-ecs-image-not-found-errors-in-aws-cdk-477f</link>
      <guid>https://dev.to/jayesh_shinde/how-we-fixed-intermittent-ecs-image-not-found-errors-in-aws-cdk-477f</guid>
      <description>&lt;p&gt;At one point, our ECS deployments started failing in a way that felt random.&lt;/p&gt;

&lt;p&gt;Sometimes a deployment would work perfectly. Sometimes the service would try to roll forward and fail because the container image it expected was no longer available. Nothing was wrong with the application code. The problem was in the deployment asset flow.&lt;/p&gt;

&lt;p&gt;We were using AWS CDK to deploy container-based workloads, and like many teams, we were relying on CDK’s default bootstrap ECR repository for Docker image assets. That was convenient at first, but it became a problem once repository retention rules were tightened for cost control.&lt;/p&gt;

&lt;p&gt;In environments with frequent deployments, older intermediate images were being cleaned up faster than our deployment flow could safely tolerate. The result was intermittent ECS deploy failures caused by missing images.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause
&lt;/h2&gt;

&lt;p&gt;AWS CDK Docker assets are published during the &lt;strong&gt;asset publishing phase&lt;/strong&gt;, which happens before CloudFormation starts deploying stacks.&lt;/p&gt;

&lt;p&gt;That means two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CDK is not just defining infrastructure; it is also managing where deployable image assets are stored.&lt;/li&gt;
&lt;li&gt;If the default asset repository has aggressive cleanup policies, your deployments can become fragile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially painful in non-production environments where deployment frequency is high and image churn is constant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Strategy We Took
&lt;/h2&gt;

&lt;p&gt;We wanted a solution that was simple, low-risk, and did not require redesigning the whole build pipeline.&lt;/p&gt;

&lt;p&gt;So instead of pushing ECS image assets to the shared default CDK ECR repository, we moved to a &lt;strong&gt;dedicated ECR repository per environment/application area&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, the fix looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create a dedicated ECR repository ahead of time&lt;/li&gt;
&lt;li&gt;configure the CDK synthesizer to publish image assets there&lt;/li&gt;
&lt;li&gt;keep lifecycle control on that repository&lt;/li&gt;
&lt;li&gt;deploy the ECR stack first, then the app stacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gave us isolation from the shared bootstrap repository while keeping the rest of the CDK deployment model mostly unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample: Dedicated ECR Stack
&lt;/h2&gt;

&lt;p&gt;Here is a simplified example of creating a dedicated ECR repository with a lifecycle policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Repository&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-ecr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AppEnvProps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContainerAssetRepoStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AppEnvProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AppContainerRepo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;repositoryName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`myapp-assets-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lifecycleRules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;rulePriority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Keep only the latest 100 images&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;maxImageCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few important details here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RETAIN&lt;/code&gt; protects the repository if the stack is deleted later.&lt;/li&gt;
&lt;li&gt;lifecycle rules still clean up old images over time.&lt;/li&gt;
&lt;li&gt;the repository name is normalized to lowercase, because ECR rejects repository names that contain uppercase letters.&lt;/li&gt;
&lt;/ul&gt;
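
&lt;p&gt;When picking a value for &lt;code&gt;maxImageCount&lt;/code&gt;, it can help to estimate how long a pushed image survives a count-based rule. The sketch below is a rough model with assumed inputs; &lt;code&gt;retentionWindowDays&lt;/code&gt; is a hypothetical helper, not part of CDK.&lt;/p&gt;

```typescript
// Rough estimate of how long a pushed image survives a count-based
// lifecycle rule. All inputs are assumed values, not measurements.
function retentionWindowDays(
  maxImageCount: number,
  imagesPerDeploy: number,
  deploysPerDay: number,
): number {
  const imagesPerDay = imagesPerDeploy * deploysPerDay;
  return maxImageCount / imagesPerDay;
}

// Busy shared repository: 5 images per deploy, 10 deploys a day.
console.log(retentionWindowDays(100, 5, 10)); // 2
// Dedicated repository: 1 image per deploy, 4 deploys a day.
console.log(retentionWindowDays(100, 1, 4)); // 25
```

&lt;p&gt;With the same limit of 100 images, the dedicated repository in this example keeps an image around for weeks instead of days, which is the isolation we were after.&lt;/p&gt;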

&lt;h2&gt;
  
  
  Sample: Point CDK to the Dedicated Repo
&lt;/h2&gt;

&lt;p&gt;Once the repository exists, the application stack can tell CDK to publish image assets there using &lt;code&gt;DefaultStackSynthesizer&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;App&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DefaultStackSynthesizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;ServiceStackProps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ServiceStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ServiceStackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;synthesizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultStackSynthesizer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;imageAssetsRepositoryName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`myapp-assets-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// ECS service, task definition, container asset usage, etc.&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ServiceStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ServiceStackDev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the existing CDK asset publishing model, but moves the destination away from the shared default bootstrap repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Important Gotcha
&lt;/h2&gt;

&lt;p&gt;A stack dependency is &lt;strong&gt;not enough&lt;/strong&gt; if the same deployment run tries to create the ECR repository and publish assets into it.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because asset publishing happens before CloudFormation stack deployment.&lt;/p&gt;

&lt;p&gt;So if the repository does not already exist, the asset publish step can fail before your “repo stack” is even deployed.&lt;/p&gt;

&lt;p&gt;The safest pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;deploy the ECR repository stack first&lt;/li&gt;
&lt;li&gt;run the normal application deployment after that&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That sequencing matters.&lt;/p&gt;
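
&lt;p&gt;In a pipeline, the two phases can be written down as an explicit command list. This is only a sketch: &lt;code&gt;deploymentPlan&lt;/code&gt; is a hypothetical helper and the stack names are placeholders taken from the samples above.&lt;/p&gt;

```typescript
// Hypothetical helper that lists the two-phase deployment order.
// Stack names are placeholders; `cdk deploy` is the real CLI command.
function deploymentPlan(repoStack: string, appStacks: string[]): string[] {
  // Phase 1: create the repository before any asset is published.
  const repoPhase = [`cdk deploy ${repoStack}`];
  // Phase 2: publish image assets and deploy the application stacks.
  const appPhase = appStacks.map((stack) => `cdk deploy ${stack}`);
  return repoPhase.concat(appPhase);
}

const plan = deploymentPlan('ContainerAssetRepoStack', ['ServiceStackDev']);
plan.forEach((command) => console.log(command));
```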

&lt;h2&gt;
  
  
  Another Important Gotcha: IAM Permissions
&lt;/h2&gt;

&lt;p&gt;Changing the repository target is not enough by itself.&lt;/p&gt;

&lt;p&gt;The identity or role that CDK uses to publish Docker assets must also have permission to push to the new ECR repository.&lt;/p&gt;

&lt;p&gt;That usually means allowing actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ecr:PutImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:InitiateLayerUpload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:UploadLayerPart&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:CompleteLayerUpload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchCheckLayerAvailability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchGetImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:GetDownloadUrlForLayer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you forget this part, the deployment simply moves from “image missing” problems to “access denied” problems.&lt;/p&gt;
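
&lt;p&gt;As a sketch, the permissions above could be expressed as an IAM policy document like the one built below. The account ID, region, and repository name are placeholders, and &lt;code&gt;ecrPushPolicy&lt;/code&gt; is a hypothetical helper; how you attach the policy depends on which role publishes your assets.&lt;/p&gt;

```typescript
// Sketch of an IAM policy document as a plain object. The ARN passed
// in below uses a placeholder account ID, region, and repository name.
function ecrPushPolicy(repositoryArn: string) {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        // Push and pull actions can be scoped to the one repository.
        Effect: 'Allow',
        Action: [
          'ecr:PutImage',
          'ecr:InitiateLayerUpload',
          'ecr:UploadLayerPart',
          'ecr:CompleteLayerUpload',
          'ecr:BatchCheckLayerAvailability',
          'ecr:BatchGetImage',
          'ecr:GetDownloadUrlForLayer',
        ],
        Resource: repositoryArn,
      },
      {
        // GetAuthorizationToken does not support resource-level
        // scoping, so it has to be granted on "*".
        Effect: 'Allow',
        Action: ['ecr:GetAuthorizationToken'],
        Resource: '*',
      },
    ],
  };
}

const policy = ecrPushPolicy('arn:aws:ecr:us-east-1:123456789012:repository/myapp-assets-dev');
console.log(JSON.stringify(policy, null, 2));
```

&lt;p&gt;Splitting the statement in two keeps the repository-scoped actions narrow while still allowing the registry login that every push needs.&lt;/p&gt;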

&lt;h2&gt;
  
  
  Why This Worked Well for Us
&lt;/h2&gt;

&lt;p&gt;We liked this approach because it was a practical middle ground.&lt;/p&gt;

&lt;p&gt;It did not require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rebuilding our CI/CD image strategy from scratch&lt;/li&gt;
&lt;li&gt;changing every ECS service definition&lt;/li&gt;
&lt;li&gt;introducing a more complex app-owned image publishing flow immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it did give us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;predictable image retention&lt;/li&gt;
&lt;li&gt;environment-specific isolation&lt;/li&gt;
&lt;li&gt;fewer surprises during ECS deployments&lt;/li&gt;
&lt;li&gt;better control over cost and cleanup behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Use This Pattern
&lt;/h2&gt;

&lt;p&gt;This approach makes sense if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you already use CDK-managed Docker/container assets&lt;/li&gt;
&lt;li&gt;the default bootstrap ECR repository is shared across too many deployments&lt;/li&gt;
&lt;li&gt;retention rules on that shared repository are causing instability&lt;/li&gt;
&lt;li&gt;you want a fast, low-disruption improvement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want a more explicit long-term model, the next step is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;build image in CI&lt;/li&gt;
&lt;li&gt;push image to a named ECR repository yourself&lt;/li&gt;
&lt;li&gt;reference the image directly in ECS by repo and tag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives maximum control, but it also requires more changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;CDK defaults are great for getting started, but they are not always ideal once platform constraints like retention, cost control, and deployment frequency start to matter.&lt;/p&gt;

&lt;p&gt;In our case, moving Docker assets to dedicated ECR repositories was a small change with a big operational impact. It made deployments more predictable without forcing a major rework of the pipeline.&lt;/p&gt;

</description>
      <category>cdk</category>
      <category>aws</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The Silent Connection Killer: MySQL2 and AWS Lambda's Freeze/Thaw Problem</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Thu, 05 Feb 2026 09:03:13 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/the-silent-connection-killer-mysql2-and-aws-lambdas-freezethaw-problem-1pek</link>
      <guid>https://dev.to/jayesh_shinde/the-silent-connection-killer-mysql2-and-aws-lambdas-freezethaw-problem-1pek</guid>
      <description>&lt;h2&gt;
  
  
  The Mystery Error
&lt;/h2&gt;

&lt;p&gt;You're running a Node.js Lambda with MySQL2, everything works great in testing, but production logs show intermittent failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Connection lost: The server closed the connection.
code: "PROTOCOL_CONNECTION_LOST"
fatal: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pattern. No warning. Just random failures that make you question your life choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Real Problem: How Lambda Actually Works
&lt;/h2&gt;

&lt;p&gt;Lambda doesn't spin up a fresh container for every request. AWS keeps containers "warm" for reuse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1 → Lambda runs → Response
                ↓
           [FREEZE] ← Container paused (not terminated)
                ↓     
         (5-15 min pass)
                ↓
           [THAW] ← Container resumed
                ↓
Request 2 → Lambda runs → 💥 PROTOCOL_CONNECTION_LOST

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During freeze, your Lambda is literally paused. The JavaScript event loop stops. Timers stop. Everything stops.&lt;br&gt;
But here's the catch: the outside world doesn't stop.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Happens to Your Database Connection
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Lambda creates a MySQL connection pool&lt;/li&gt;
&lt;li&gt;Connections sit idle in the pool&lt;/li&gt;
&lt;li&gt;Lambda freezes (container paused)&lt;/li&gt;
&lt;li&gt;Real world time passes (5-30 minutes)&lt;/li&gt;
&lt;li&gt;Network timeouts occur, NAT gateways clear state, RDS Proxy cleans up&lt;/li&gt;
&lt;li&gt;The TCP socket dies, but your pool doesn't know&lt;/li&gt;
&lt;li&gt;Lambda thaws, tries to use the dead connection&lt;/li&gt;
&lt;li&gt;💥 PROTOCOL_CONNECTION_LOST&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Why idleTimeout Doesn't Help
&lt;/h2&gt;

&lt;p&gt;You might think: "I'll set idleTimeout: 60000 to clean up idle connections!"&lt;br&gt;
Here's why it doesn't work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timer starts (60s countdown)
    ↓
Lambda FREEZES at 1s elapsed
    ↓
████████████████████████████████
█  15 minutes pass in REAL WORLD  █
█  Timer is PAUSED at 1s          █
████████████████████████████████
    ↓
Lambda THAWS - timer resumes at 1s
    ↓
Connection still in pool (timer thinks 2s passed)
    ↓
Connection is DEAD but pool doesn't know

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The timer doesn't run during freeze.&lt;/strong&gt; Your 60-second timeout is useless against a 15-minute freeze.&lt;/p&gt;
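&lt;p&gt;One way to see the difference is to compare an in-process timer with a wall-clock reading. The sketch below is a minimal illustration, not part of mysql2: &lt;code&gt;createIdleTracker&lt;/code&gt; and its methods are hypothetical names, and it assumes &lt;code&gt;Date.now()&lt;/code&gt; reflects real elapsed time once the container thaws. A fake clock is injected so the example runs without waiting.&lt;/p&gt;

```javascript
// Hypothetical helper: none of these names come from mysql2 or the Lambda runtime.
// Assumption: Date.now() reads the wall clock, so the gap it reports includes time
// spent frozen, unlike a timer that only counts down while the process is running.
function createIdleTracker(maxIdleMs, now = Date.now) {
  let lastUsedAt = now();
  return {
    // Call at the start of an invocation, before touching the pool.
    wasIdleTooLong() {
      return now() - lastUsedAt > maxIdleMs;
    },
    // Call after each successful query.
    markUsed() {
      lastUsedAt = now();
    },
  };
}

// Usage with an injected clock so the behaviour is easy to follow.
let fakeTime = 0;
const tracker = createIdleTracker(60000, () => fakeTime);

fakeTime = 1000;                      // 1 second later, container still warm
const warm = tracker.wasIdleTooLong();

fakeTime = 1000 + 15 * 60000;         // 15 minutes of real time pass during a freeze
const afterFreeze = tracker.wasIdleTooLong();

console.log({ warm, afterFreeze });
```

&lt;p&gt;A check like this only tells you that pooled connections are probably stale. It does not replace the retry logic described next.&lt;/p&gt;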

&lt;h2&gt;
  
  
  The Solution: Detect and Retry
&lt;/h2&gt;

&lt;p&gt;Since we can't prevent stale connections, we detect them and retry transparently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Enable TCP Keep-Alive
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const pool = mysql.createPool({
  ...config,
  enableKeepAlive: true,
  keepAliveInitialDelay: 10000,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


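&lt;p&gt;Keep-alive probes are not sent while the Lambda is frozen, so this does not keep a connection alive across a freeze. What it does help with is getting clear error codes (ECONNRESET, PROTOCOL_CONNECTION_LOST) instead of hanging indefinitely.&lt;/p&gt;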
&lt;h2&gt;
  
  
  Step 2: Implement Retry Logic
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async executeQuery(sql, params) {
  const maxRetries = 1;
  let lastError = null;

  for (let attempt = 0; attempt &amp;lt;= maxRetries; attempt++) {
    let connection = null;

    try {
      connection = await this.getConnectionFromPool();
      const result = await this.executeQueryWithConnection(connection, sql, params);
      connection.release();
      return result;
    } catch (error) {
      // Non-recoverable error - throw immediately
      if (!this.isConnectionLostError(error)) {
        if (connection) connection.release();
        throw error;
      }

      // Connection lost - destroy stale connection
      if (connection) connection.destroy();

      // Retry if attempts left
      lastError = error;
      if (attempt &amp;lt; maxRetries) {
        console.warn("Connection lost, retrying...", { attempt: attempt + 1 });
        continue;
      }
    }
  }

  throw lastError;
}

isConnectionLostError(error) {
  const recoverableCodes = [
    "PROTOCOL_CONNECTION_LOST",  // Server closed connection
    "ECONNRESET",                // TCP reset
    "EPIPE",                     // Broken pipe
    "ETIMEDOUT",                 // Connection timeout
    "ECONNREFUSED",              // Connection refused
  ];
  return recoverableCodes.includes(error?.code);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Points:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;connection.destroy()&lt;/code&gt;: removes the stale connection from the pool (don't reuse it!)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connection.release()&lt;/code&gt;: returns a healthy connection to the pool&lt;/li&gt;
&lt;li&gt;One retry is usually enough: the second attempt gets a fresh connection&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What About Pool Settings?
&lt;/h2&gt;

&lt;p&gt;Do you need to tune &lt;code&gt;connectionLimit&lt;/code&gt;, &lt;code&gt;maxIdle&lt;/code&gt;, and similar settings?&lt;br&gt;
Short answer: not really.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Helps with freeze/thaw?&lt;/th&gt;
&lt;th&gt;Why?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;idleTimeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Timer paused during freeze&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxIdle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Marginally&lt;/td&gt;
&lt;td&gt;Fewer connections = fewer stale ones, but adds reconnection overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;connectionLimit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Doesn't affect stale connections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles stale connections at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're using RDS Proxy, it handles connection pooling at the infrastructure level. Keep your Lambda pool settings simple and let the retry logic do the heavy lifting.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using RDS Proxy?
&lt;/h2&gt;

&lt;p&gt;RDS Proxy's "Idle client connection timeout" (default: 30 minutes) is separate from MySQL's wait_timeout. The proxy manages Lambda→Proxy connections independently.&lt;br&gt;
But even with RDS Proxy, connections can die due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT gateway idle timeouts (an AWS NAT gateway drops TCP flows that have been idle for 350 seconds, just under 6 minutes)&lt;/li&gt;
&lt;li&gt;Network state table cleanup&lt;/li&gt;
&lt;li&gt;Proxy internal connection recycling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The retry logic is still your safety net.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Final Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    Your Lambda                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│   executeQuery()                                     │
│       ↓                                             │
│   for (attempt = 0; attempt &amp;lt;= 1; attempt++)        │
│       ↓                                             │
│   getConnection() → Try query                       │
│       ↓                                             │
│   Success? → return result                          │
│       ↓                                             │
│   Connection lost? → destroy() → retry              │
│       ↓                                             │
│   Other error? → release() → throw                  │
│                                                      │
└─────────────────────────────────────────────────────┘
           ↓
    [RDS Proxy] (optional)
           ↓
    [MySQL/Aurora]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda freezes, connections go stale&lt;/td&gt;
&lt;td&gt;Retry logic detects and recovers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pool doesn't know connections are dead&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;enableKeepAlive&lt;/code&gt; for faster detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;idleTimeout&lt;/code&gt; doesn't work during freeze&lt;/td&gt;
&lt;td&gt;Accept it, rely on retry instead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random &lt;code&gt;PROTOCOL_CONNECTION_LOST&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Transparent retry = users don't notice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: You can't prevent stale connections in a serverless environment. But you can detect them instantly and retry transparently.&lt;/p&gt;
&lt;h1&gt;
  
  
  BUT...
&lt;/h1&gt;


&lt;h2&gt;
  
  
  The Problem With Just Retrying Once
&lt;/h2&gt;

&lt;p&gt;The approach above detects a stale connection error and retries once. That works, but only if a &lt;strong&gt;single connection&lt;/strong&gt; went stale. After a longer Lambda freeze (10–15+ minutes), the entire pool can go stale. Here's the scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pool has 5 connections → Lambda freezes → all 5 TCP sockets die

Request comes in after thaw:
→ gets conn #1 from _freeConnections → FAILS (stale)
→ retry: gets conn #2 from _freeConnections → FAILS (also stale!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One retry is not enough because &lt;code&gt;_freeConnections&lt;/code&gt; is a queue — the retry just picks the next dead connection in line.&lt;/p&gt;




&lt;h2&gt;
  
  
  How mysql2's Pool Actually Works Internally
&lt;/h2&gt;

&lt;p&gt;Looking at mysql2's &lt;code&gt;pool.js&lt;/code&gt; source, &lt;code&gt;getConnection()&lt;/code&gt; does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified from mysql2 internals&lt;/span&gt;
&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// FIFO — no health check&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// if allConnections.length &amp;lt; connectionLimit → create a NEW connection&lt;/span&gt;
  &lt;span class="c1"&gt;// otherwise → queue the request in _connectionQueue&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;There is zero validation when pulling from &lt;code&gt;_freeConnections&lt;/code&gt;.&lt;/strong&gt; The pool hands you whatever is sitting there, stale or not.&lt;/p&gt;

&lt;p&gt;The inverse is also useful to know — when &lt;code&gt;_freeConnections&lt;/code&gt; is empty AND &lt;code&gt;_allConnections.length &amp;lt; connectionLimit&lt;/code&gt;, mysql2 will automatically create a &lt;strong&gt;brand new&lt;/strong&gt; TCP connection. This is the behavior we want to exploit.&lt;/p&gt;
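&lt;p&gt;The decision described above can be modelled in a few lines. This is a toy sketch for intuition only: &lt;code&gt;checkoutDecision&lt;/code&gt; and the plain-object pool are made-up stand-ins, not mysql2 code.&lt;/p&gt;

```javascript
// Toy model of the checkout decision described in the text. This is not mysql2
// code: the "pool" is a plain object carrying only the three fields mentioned.
function checkoutDecision(pool) {
  if (pool.freeConnections.length > 0) {
    return 'reuse-idle';       // handed out as-is, with no health check
  }
  if (pool.connectionLimit > pool.allConnections.length) {
    return 'create-new';       // room under the limit: open a fresh connection
  }
  return 'wait-in-queue';      // pool is full and every connection is checked out
}

const a = { name: 'a' };
const b = { name: 'b' };

const decisions = [
  checkoutDecision({ connectionLimit: 2, allConnections: [a], freeConnections: [a] }),
  checkoutDecision({ connectionLimit: 2, allConnections: [a], freeConnections: [] }),
  checkoutDecision({ connectionLimit: 2, allConnections: [a, b], freeConnections: [] }),
];

console.log(decisions);
```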




&lt;h2&gt;
  
  
  The Fix: Drain All Free Connections on a Stale Error
&lt;/h2&gt;

&lt;p&gt;Instead of retrying once and hoping the next connection is healthy, destroy &lt;strong&gt;every connection&lt;/strong&gt; in &lt;code&gt;_freeConnections&lt;/code&gt; the moment you detect a stale error. The pool's own logic then forces a fresh connection on retry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;STALE_ERRORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PROTOCOL_CONNECTION_LOST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ECONNRESET&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EPIPE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ETIMEDOUT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ECONNREFUSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;drainFreeConnections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Destroy all idle connections in one sweep&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conn&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// sets conn._pool = null, removes from _allConnections&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// clear the array in-place&lt;/span&gt;
  &lt;span class="c1"&gt;// pool._allConnections.length is now reduced → next getConnection()&lt;/span&gt;
  &lt;span class="c1"&gt;// sees: freeConnections empty + allConnections &amp;lt; connectionLimit&lt;/span&gt;
  &lt;span class="c1"&gt;// → creates a fresh TCP connection automatically&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// kill the one that triggered the error&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;STALE_ERRORS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Nuke all remaining free connections — they're all suspect&lt;/span&gt;
      &lt;span class="nf"&gt;drainFreeConnections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Retry — pool is now forced to open a fresh connection&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;freshConn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;freshConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;freshConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retryErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;freshConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;retryErr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why &lt;code&gt;pool._freeConnections.length = 0&lt;/code&gt; instead of a &lt;code&gt;while&lt;/code&gt; loop with &lt;code&gt;shift()&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pool._freeConnections&lt;/code&gt; is an array used as a &lt;strong&gt;FIFO queue&lt;/strong&gt; — mysql2 uses &lt;code&gt;shift()&lt;/code&gt; when getting connections and &lt;code&gt;push()&lt;/code&gt; when releasing them. Since we're destroying &lt;strong&gt;all&lt;/strong&gt; of them, iterating with &lt;code&gt;for...of&lt;/code&gt; then zeroing the length is simpler and safer than mutating the array mid-iteration with &lt;code&gt;shift()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What &lt;code&gt;conn.destroy()&lt;/code&gt; does under the hood
&lt;/h3&gt;

&lt;p&gt;When you call &lt;code&gt;destroy()&lt;/code&gt;, mysql2 does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified from connection._removeFromPool()&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_allConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_allConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// TCP socket is closed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So after &lt;code&gt;drainFreeConnections&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;_freeConnections&lt;/code&gt; → empty ✅&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_allConnections.length&lt;/code&gt; → drops back toward 0 ✅&lt;/li&gt;
&lt;li&gt;Next &lt;code&gt;getConnection()&lt;/code&gt; → pool sees room under &lt;code&gt;connectionLimit&lt;/code&gt; → creates fresh TCP connection ✅&lt;/li&gt;
&lt;/ul&gt;
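&lt;p&gt;To make the difference concrete, here is a small simulation of both strategies against a fully stale pool. It is a sketch under stated assumptions: &lt;code&gt;ToyPool&lt;/code&gt;, &lt;code&gt;frozenPool&lt;/code&gt;, and &lt;code&gt;runWithRetry&lt;/code&gt; are hypothetical, synchronous stand-ins that only mimic the checkout behaviour described in this article, and they do not call mysql2.&lt;/p&gt;

```javascript
// Toy, synchronous stand-in for a connection pool. It is NOT mysql2: names and
// behaviour are simplified assumptions that mirror the description in the text.
class ToyPool {
  constructor(connectionLimit) {
    this.connectionLimit = connectionLimit;
    this.all = [];
    this.free = [];
    this.created = 0;
  }
  getConnection() {
    if (this.free.length > 0) return this.free.shift();   // no health check
    if (this.all.length >= this.connectionLimit) throw new Error('pool exhausted');
    const conn = { id: ++this.created, stale: false };
    this.all.push(conn);
    return conn;
  }
  release(conn) {
    this.free.push(conn);
  }
  destroy(conn) {
    this.all = this.all.filter((c) => c !== conn);
    this.free = this.free.filter((c) => c !== conn);
  }
  query(conn) {
    if (conn.stale) {
      const err = new Error('Connection lost');
      err.code = 'PROTOCOL_CONNECTION_LOST';
      throw err;
    }
    return 'ok via conn #' + conn.id;
  }
}

// Build a pool whose idle connections all died during a simulated freeze.
function frozenPool(size) {
  const pool = new ToyPool(size);
  const conns = [];
  for (let i = 0; size > i; i++) conns.push(pool.getConnection());
  conns.forEach((c) => {
    c.stale = true;
    pool.release(c);
  });
  return pool;
}

// One attempt plus one retry; optionally drain every idle connection first.
function runWithRetry(pool, drainFirst) {
  const first = pool.getConnection();
  try {
    const result = pool.query(first);
    pool.release(first);
    return result;
  } catch (err) {
    pool.destroy(first);
    // Copy the idle list before destroying, since destroy() mutates it.
    if (drainFirst) pool.free.slice().forEach((c) => pool.destroy(c));
    const second = pool.getConnection();
    try {
      const result = pool.query(second);
      pool.release(second);
      return result;
    } catch (retryErr) {
      pool.destroy(second);
      return 'failed: ' + retryErr.code;
    }
  }
}

const singleRetry = runWithRetry(frozenPool(5), false);
const drainThenRetry = runWithRetry(frozenPool(5), true);
console.log({ singleRetry, drainThenRetry });
```

&lt;p&gt;In the toy model the single retry picks up the second dead connection, while the drained pool has no idle connections left and has to create a new one.&lt;/p&gt;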




&lt;h2&gt;
  
  
  Comparison: Single Retry vs. Drain All
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;After 1 stale conn&lt;/th&gt;
&lt;th&gt;After full pool stale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single retry (earlier approach)&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;td&gt;❌ Retry hits another stale conn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drain all + retry (this approach)&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;td&gt;✅ Pool forced to create fresh conn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  A Note on &lt;code&gt;_freeConnections&lt;/code&gt; Being a Private API
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;_freeConnections&lt;/code&gt; is not exported in mysql2's TypeScript typings (it's missing from &lt;code&gt;Pool.d.ts&lt;/code&gt;). In TypeScript, you'll need a cast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



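&lt;p&gt;The field has been around for a long time, but because it is not part of the public API, neither its presence nor its exact shape (for example a plain array versus a queue object, or which end connections are taken from) is guaranteed. Pin your mysql2 version and re-check this code on every upgrade, not only major ones.&lt;/p&gt;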
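&lt;p&gt;Since the field is private, it may be worth guarding access to it. The helper below is a defensive sketch, not a mysql2 API: &lt;code&gt;snapshotFreeConnections&lt;/code&gt; and &lt;code&gt;drainFreeConnectionsSafely&lt;/code&gt; are hypothetical names, and the two shapes it recognises (a plain array, or an object exposing &lt;code&gt;toArray()&lt;/code&gt;) are assumptions to verify against the version you have installed. The demo uses fake pool objects only.&lt;/p&gt;

```javascript
// Defensive sketch around a private field. The shapes handled here (a plain
// array, or a queue-like object exposing toArray()) are assumptions, not a
// documented contract of any library.
function snapshotFreeConnections(pool) {
  const free = pool?._freeConnections;
  if (Array.isArray(free)) return free.slice();
  if (typeof free?.toArray === 'function') return free.toArray();
  return null; // unknown shape: the caller should fall back to a plain retry
}

// Returns the number of idle connections destroyed, or -1 if the private
// field is missing or has an unrecognised shape.
// Assumption: destroy() also removes the connection from the pool's own
// bookkeeping. If it does not, the idle list would need clearing as well.
function drainFreeConnectionsSafely(pool) {
  const idle = snapshotFreeConnections(pool);
  if (idle === null) return -1;
  // Iterate over the copy so destroy() is free to mutate the original list.
  idle.forEach((conn) => conn.destroy());
  return idle.length;
}

// Demo with fake pools (no database involved).
const log = [];
const fakeConn = () => ({ destroy: () => log.push('destroyed') });

const arrayPool = { _freeConnections: [fakeConn(), fakeConn()] };
const queuePool = { _freeConnections: { toArray: () => [fakeConn()] } };
const unknownPool = {};

const results = [
  drainFreeConnectionsSafely(arrayPool),
  drainFreeConnectionsSafely(queuePool),
  drainFreeConnectionsSafely(unknownPool),
];
console.log(results, log.length);
```

&lt;p&gt;Returning a sentinel instead of throwing keeps the query path working if a future release renames or reshapes the field. You lose the drain optimisation, but the basic retry still runs.&lt;/p&gt;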




&lt;h2&gt;
  
  
  Why Prisma Doesn't Have This Problem
&lt;/h2&gt;

&lt;p&gt;If you're using Prisma, you may have noticed it doesn't suffer from the freeze/thaw stale connection issue as badly. There's a concrete reason for this — it's not magic, it's the Rust query engine.&lt;/p&gt;

&lt;p&gt;Prisma uses a connection pool built on the &lt;code&gt;mobc&lt;/code&gt; library inside its Rust engine. Before handing a connection to your query, it performs a &lt;strong&gt;time-gated pre-ping&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connection pulled from pool
         ↓
Has more than 15 seconds passed since this connection was last used?
    YES → run SELECT 1
              ↓
         SELECT 1 succeeds? → proceed with query
         SELECT 1 fails?    → discard, open fresh connection
    NO  → skip ping, proceed directly (optimization for rapid queries)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is essentially the same pattern as SQLAlchemy's &lt;code&gt;pool_pre_ping=True&lt;/code&gt;, with a 15-second grace window to avoid pinging on every rapid-fire query.&lt;/p&gt;

&lt;p&gt;After a Lambda freeze of any meaningful duration (seconds to minutes), the timer has expired, so Prisma will ping &lt;strong&gt;before&lt;/strong&gt; your query even runs — and silently replace any dead connection. Your application code never sees the error.&lt;/p&gt;
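&lt;p&gt;The same idea can be sketched for any pool. The example below illustrates a time-gated pre-ping in plain JavaScript and is not Prisma's actual implementation: &lt;code&gt;createCheckout&lt;/code&gt; and its &lt;code&gt;acquire&lt;/code&gt;, &lt;code&gt;ping&lt;/code&gt;, and &lt;code&gt;discard&lt;/code&gt; callbacks are hypothetical, and the 15-second default simply mirrors the figure quoted above. With a real driver, &lt;code&gt;ping&lt;/code&gt; might wrap a cheap query such as &lt;code&gt;SELECT 1&lt;/code&gt;.&lt;/p&gt;

```javascript
// Generic sketch of a time-gated pre-ping. It is not Prisma's implementation;
// acquire, ping and discard are hypothetical callbacks supplied by the caller.
function createCheckout({ acquire, ping, discard, graceMs = 15000, now = Date.now }) {
  // Remembers when each connection was last handed out (an approximation of
  // "last used"). Connections never seen before are treated as fresh.
  const lastUsed = new WeakMap();

  return async function checkout() {
    // Bounded loop: give up after a few dead connections instead of spinning.
    for (let attempt = 0; 5 > attempt; attempt++) {
      const conn = await acquire();
      const last = lastUsed.get(conn);
      const idleTooLong = last === undefined ? false : now() - last > graceMs;

      if (idleTooLong) {
        const alive = await ping(conn).then(() => true, () => false);
        if (!alive) {
          discard(conn);       // dead socket: drop it and try the next one
          continue;
        }
      }
      lastUsed.set(conn, now());
      return conn;
    }
    throw new Error('no healthy connection after 5 attempts');
  };
}

// Demo with fake connections and a fake clock.
async function demo() {
  let clock = 0;
  const conn = { id: 1, dead: false };
  const replacement = { id: 2, dead: false };
  const handout = [conn, conn, replacement];   // order in which acquire() returns them
  const discarded = [];

  const checkout = createCheckout({
    acquire: async () => handout.shift(),
    ping: async (c) => {
      if (c.dead) throw new Error('dead socket');
    },
    discard: (c) => discarded.push(c.id),
    now: () => clock,
  });

  const first = await checkout();   // new connection, no ping needed
  clock += 20 * 60 * 1000;          // simulated freeze: 20 minutes pass
  conn.dead = true;                 // the socket died while frozen
  const second = await checkout();  // ping fails, connection is replaced

  return { first: first.id, second: second.id, discarded };
}

demo().then((r) => console.log(r));
```

&lt;p&gt;The grace window is the trade-off knob: a larger value means fewer pings on busy paths, a smaller value catches short freezes sooner.&lt;/p&gt;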

&lt;h3&gt;
  
  
  How it stacks up
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;mysql2 (raw pool)&lt;/th&gt;
&lt;th&gt;Prisma (Rust engine)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-ping on checkout&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;td&gt;✅ If &amp;gt;15s idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles full pool going stale&lt;/td&gt;
&lt;td&gt;❌ Needs manual drain logic&lt;/td&gt;
&lt;td&gt;✅ Each connection validated individually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error surfaces to app code&lt;/td&gt;
&lt;td&gt;✅ Yes — you must handle it&lt;/td&gt;
&lt;td&gt;❌ Transparent — retried internally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;None (no extra queries)&lt;/td&gt;
&lt;td&gt;One &lt;code&gt;SELECT 1&lt;/code&gt; per connection after idle period&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Even Prisma Isn't Bulletproof
&lt;/h3&gt;

&lt;p&gt;Worth noting: Prisma's pre-ping protects against stale connections, but the 15-second threshold means a freeze shorter than 15 seconds could still theoretically slip through. And connection-level issues outside the pool (e.g. NAT gateway state tables, RDS Proxy recycling) can still cause failures that the pre-ping doesn't catch. Retry logic at the application layer remains a good safety net regardless of ORM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The key insight is to work &lt;em&gt;with&lt;/em&gt; mysql2's internal pool logic rather than against it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On a stale connection error, don't just retry — &lt;strong&gt;drain &lt;code&gt;_freeConnections&lt;/code&gt; first&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;mysql2 will automatically open fresh connections to fill the gap (it's built into &lt;code&gt;getConnection()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Your retry then gets a genuinely new TCP connection instead of another dead one from the queue&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want this behavior without managing it yourself, Prisma's Rust engine gives you a time-gated pre-ping out of the box — which is the more principled long-term solution for serverless MySQL workloads.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>lambda</category>
      <category>backenddevelopment</category>
      <category>prisma</category>
    </item>
    <item>
      <title>Building a Clean Event Pipeline in Spring: From Simple Events to Async Listeners to the Outbox Pattern</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 03 Jan 2026 07:43:05 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/building-a-clean-event-pipeline-in-spring-from-simple-events-to-async-listeners-to-the-outbox-3pi7</link>
      <guid>https://dev.to/jayesh_shinde/building-a-clean-event-pipeline-in-spring-from-simple-events-to-async-listeners-to-the-outbox-3pi7</guid>
      <description>&lt;p&gt;Event‑driven architecture sounds simple on paper: &lt;em&gt;“emit an event when something happens.”&lt;/em&gt;&lt;br&gt;&lt;br&gt;
But once you start implementing it inside a real Spring Boot service, you quickly discover the hidden trade‑offs.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through a real‑world progression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;emitting domain events inside a service
&lt;/li&gt;
&lt;li&gt;handling them with &lt;code&gt;@EventListener&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;realizing enrichment logic slows down the request
&lt;/li&gt;
&lt;li&gt;making listeners async
&lt;/li&gt;
&lt;li&gt;adding a production‑grade executor
&lt;/li&gt;
&lt;li&gt;and finally touching the &lt;strong&gt;gold standard&lt;/strong&gt;: the &lt;strong&gt;Outbox Pattern&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1. The initial requirement: emit an event inside the service&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine a simple use case: when a user is created, we want to emit an event so other parts of the system can react.&lt;/p&gt;

&lt;p&gt;A clean way to do this in Spring is to wrap &lt;code&gt;ApplicationEventPublisher&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="nd"&gt;@AllArgsConstructor&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserEventPublisher&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ApplicationEventPublisher&lt;/span&gt; &lt;span class="n"&gt;applicationEventPublisher&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;applicationEventPublisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publishEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now inside your service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;userEventPublisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;UserCreatedEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;2. Handling the event with &lt;code&gt;@EventListener&lt;/code&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A simple listener:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserAuditListener&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@EventListener&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;handleUserCreateEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;UserCreatedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"User created: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works beautifully… until you need to do more than just print.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. The problem: enrichment logic slows down the request&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s say before publishing to Kafka, you want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fetch additional data from DB
&lt;/li&gt;
&lt;li&gt;call another service
&lt;/li&gt;
&lt;li&gt;enrich the event payload
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since &lt;code&gt;@EventListener&lt;/code&gt; is &lt;strong&gt;synchronous by default&lt;/strong&gt;, all this work blocks the original request thread.&lt;/p&gt;

&lt;p&gt;Your API response time suddenly spikes.&lt;/p&gt;

&lt;p&gt;Not good.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Making listeners async with &lt;code&gt;@Async&lt;/code&gt; and &lt;code&gt;@EnableAsync&lt;/code&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Spring makes this easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@EnableAsync&lt;/span&gt;
&lt;span class="nd"&gt;@SpringBootApplication&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the listener:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Async&lt;/span&gt;
&lt;span class="nd"&gt;@EventListener&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;handleUserCreateEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;UserCreatedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// runs in a background thread&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the main request returns immediately while the listener does its work asynchronously.&lt;/p&gt;

&lt;p&gt;But there’s a catch…&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. The default executor is not production‑grade&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you don’t configure anything, plain Spring Framework falls back to &lt;code&gt;SimpleAsyncTaskExecutor&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creates a new thread per task
&lt;/li&gt;
&lt;li&gt;no pooling
&lt;/li&gt;
&lt;li&gt;no backpressure
&lt;/li&gt;
&lt;li&gt;no monitoring
&lt;/li&gt;
&lt;/ul&gt;

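&lt;p&gt;This is fine for demos, not for real systems. (Spring Boot auto‑configures a pooled &lt;code&gt;ThreadPoolTaskExecutor&lt;/code&gt; by default, but its queue is unbounded out of the box, so it still deserves explicit sizing.)&lt;/p&gt;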




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Adding a custom executor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A better approach is to define your own thread pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="nd"&gt;@EnableAsync&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"taskExecutor"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Executor&lt;/span&gt; &lt;span class="nf"&gt;taskExecutor&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newCachedThreadPool&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all &lt;code&gt;@Async&lt;/code&gt; methods use this executor.&lt;/p&gt;

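&lt;p&gt;Note that &lt;code&gt;newCachedThreadPool()&lt;/code&gt; is still unbounded: it reuses idle threads but creates as many new ones as the load demands. For real backpressure, replace it with a &lt;code&gt;ThreadPoolTaskExecutor&lt;/code&gt; that has an explicit core size, maximum size, and queue capacity.&lt;/p&gt;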




&lt;h2&gt;
  
  
  &lt;strong&gt;7. The gold standard: the Outbox Pattern&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Async listeners solve the latency problem, but they don’t solve the &lt;strong&gt;reliability&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;What if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the DB transaction commits
&lt;/li&gt;
&lt;li&gt;but the async listener fails before sending to Kafka?
&lt;/li&gt;
&lt;li&gt;or the service crashes?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You lose the event.&lt;/p&gt;

&lt;p&gt;This is why mature systems use the &lt;strong&gt;Outbox Pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How the Outbox Pattern works (high‑level)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write the event into an “outbox” table inside the same DB transaction&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the user is created, the outbox record is also created
&lt;/li&gt;
&lt;li&gt;Atomic, consistent, no partial failures&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A background process reads the outbox table&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a scheduled Spring job
&lt;/li&gt;
&lt;li&gt;a Kafka Connect Debezium connector
&lt;/li&gt;
&lt;li&gt;a lightweight polling thread
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The background process publishes the event to Kafka&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;After successful publish, the outbox record is marked as processed&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why this is the gold standard&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;no lost events
&lt;/li&gt;
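&lt;li&gt;at‑least‑once delivery: a crash between publishing and marking the record as processed can resend an event, so consumers should be idempotent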
&lt;/li&gt;
&lt;li&gt;no dependency on async listeners
&lt;/li&gt;
&lt;li&gt;fully decoupled from request latency
&lt;/li&gt;
&lt;li&gt;widely used in production event‑driven systems
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;8. Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the journey we walked through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with simple Spring events
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;@EventListener&lt;/code&gt; to react to them
&lt;/li&gt;
&lt;li&gt;Realize enrichment logic slows down the request
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;@Async&lt;/code&gt; + &lt;code&gt;@EnableAsync&lt;/code&gt; to make listeners non‑blocking
&lt;/li&gt;
&lt;li&gt;Add a custom executor for production‑grade async processing
&lt;/li&gt;
&lt;li&gt;Finally, adopt the &lt;strong&gt;Outbox Pattern&lt;/strong&gt; for guaranteed delivery and reliability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This progression mirrors how real systems evolve as they scale.&lt;/p&gt;

&lt;p&gt;If you’re building event‑driven microservices, the outbox pattern is the foundation you eventually want to reach.&lt;/p&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>kafka</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>How a Cache Invalidation Bug Nearly Took Down Our System - And What We Changed After</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Fri, 05 Dec 2025 01:57:56 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/how-a-cache-invalidation-bug-nearly-took-down-our-system-and-what-we-changed-after-2dd2</link>
      <guid>https://dev.to/jayesh_shinde/how-a-cache-invalidation-bug-nearly-took-down-our-system-and-what-we-changed-after-2dd2</guid>
      <description>&lt;p&gt;A few weeks ago, we had one of those production incidents that quietly start in the background and explode right when the traffic peaks.&lt;br&gt;
This one involved &lt;strong&gt;Aurora MySQL&lt;/strong&gt;, a &lt;strong&gt;Lambda with a 30-second timeout&lt;/strong&gt;, and a poorly-designed &lt;strong&gt;cache invalidation strategy&lt;/strong&gt; that ended up flooding our database.&lt;/p&gt;

&lt;p&gt;Here’s the story, what went wrong, and the changes we made so it never happens again.&lt;/p&gt;


&lt;h2&gt;
  
  
  🎬 The Setup
&lt;/h2&gt;

&lt;p&gt;The night before the incident, we upgraded our &lt;strong&gt;Aurora MySQL engine version&lt;/strong&gt;.&lt;br&gt;
Everything looked good. No alarms. No red flags.&lt;/p&gt;

&lt;p&gt;The next morning around &lt;strong&gt;8 AM&lt;/strong&gt;, our daily job kicked in — the one responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deleting the stale “master data” cache&lt;/li&gt;
&lt;li&gt;refetching fresh master data from the DB&lt;/li&gt;
&lt;li&gt;storing it back in cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application depends on this master dataset to work correctly, so if the cache isn’t warm, the DB gets hammered.&lt;/p&gt;


&lt;h2&gt;
  
  
  💥 The Explosion
&lt;/h2&gt;

&lt;p&gt;Right after the engine upgrade, a specific query in the Lambda suddenly started taking &lt;strong&gt;30+ seconds&lt;/strong&gt;.&lt;br&gt;
But our Lambda had a &lt;strong&gt;30-second timeout&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So what happened?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cacheInvalidate → cacheRebuild flow failed.&lt;/li&gt;
&lt;li&gt;The cache remained empty.&lt;/li&gt;
&lt;li&gt;Every user request resulted in a &lt;strong&gt;cache miss&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;All those requests hit the DB directly.&lt;/li&gt;
&lt;li&gt;Aurora CPU spiked to &lt;strong&gt;99%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Application responses stalled across the board.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Classic &lt;strong&gt;cache stampede&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We eventually triggered a &lt;strong&gt;failover&lt;/strong&gt;, and luckily the same query ran in &lt;em&gt;~28.7 seconds&lt;/em&gt; on the new writer, just under the Lambda timeout. That bought us a few minutes to stabilize.&lt;/p&gt;

&lt;p&gt;Later that night, we found the real culprit:&lt;br&gt;
➡️ &lt;strong&gt;The query needed a new index&lt;/strong&gt;, and the upgrade changed its execution plan.&lt;/p&gt;

&lt;p&gt;We created the index via a hotfix, and the DB stabilized.&lt;/p&gt;

&lt;p&gt;But the deeper problem was our cache invalidation approach.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧹 Our Original Cache Invalidation: Delete First, Hope Later
&lt;/h2&gt;

&lt;p&gt;Our initial flow was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delete the existing cache key&lt;/li&gt;
&lt;li&gt;Fetch fresh data from DB&lt;/li&gt;
&lt;li&gt;Save it back to cache&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If step 2 fails, everything collapses.&lt;/p&gt;

&lt;p&gt;It’s simple… until it isn’t.&lt;br&gt;
In our case, the Lambda failed to fetch fresh data, so the cache stayed empty.&lt;/p&gt;


&lt;h2&gt;
  
  
  🔧 What We Changed (and Recommend)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Never delete the cache before you have fresh data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We inverted the flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch → Validate → Update cache&lt;/li&gt;
&lt;li&gt;Only delete if we &lt;em&gt;already&lt;/em&gt; have fresh data ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminates the “empty cache” window.&lt;/p&gt;
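
&lt;p&gt;A minimal sketch of that inverted flow, assuming &lt;code&gt;cache&lt;/code&gt; is any Map-like store and that &lt;code&gt;fetchFn&lt;/code&gt; and &lt;code&gt;isValid&lt;/code&gt; are hypothetical helpers you supply (none of these names come from our codebase). The old value is only overwritten once a new one has been fetched and validated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: `cache` is a Map-like store; `fetchFn` and `isValid`
// are hypothetical helpers, not part of the original system.
async function refreshCache(cache, key, fetchFn, isValid) {
  let fresh;
  try {
    // 1. Fetch first; the existing cache entry stays untouched.
    fresh = await fetchFn();
  } catch (err) {
    return { updated: false, reason: "fetch failed: " + err.message };
  }

  // 2. Validate before touching the cache.
  if (!isValid(fresh)) {
    return { updated: false, reason: "validation failed" };
  }

  // 3. Overwrite in a single step, so there is never an empty window.
  cache.set(key, fresh);
  return { updated: true };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the fetch throws or validation fails, the function reports the reason and leaves the previous entry in place, which is the property the delete-first flow was missing.&lt;/p&gt;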


&lt;h3&gt;
  
  
  &lt;strong&gt;2. Use “stale rollover” instead of blunt deletion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the refresh job fails, we now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;rename the key&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"Master-Data"&lt;/code&gt; → &lt;code&gt;"Master-Data-Stale"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;keep the old value available&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;add an internal notification so the team can investigate&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that even if the DB is slow or down, the system still has &lt;em&gt;something&lt;/em&gt; to serve.&lt;/p&gt;

&lt;p&gt;It’s not ideal, but it prevents a meltdown.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;3. API layer now returns stale data when fresh data is unavailable&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The API logic became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try to read &lt;code&gt;"Master-Data"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If not found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attempt to rebuild (only if allowed)&lt;/li&gt;
&lt;li&gt;If rebuild fails → return stale data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This avoids cascading failures.&lt;/p&gt;
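
&lt;p&gt;The read path above could be sketched roughly like this. The key names match the ones used in this post; &lt;code&gt;cache&lt;/code&gt; (a Map-like store), &lt;code&gt;rebuild&lt;/code&gt;, and &lt;code&gt;allowRebuild&lt;/code&gt; are assumptions made for the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: `cache` is a Map-like store and `rebuild` is a hypothetical
// function that resolves with fresh data or throws.
async function readMasterData(cache, rebuild, allowRebuild) {
  const fresh = cache.get("Master-Data");
  if (fresh !== undefined) {
    return { data: fresh, stale: false };
  }

  if (allowRebuild) {
    try {
      const rebuilt = await rebuild();
      cache.set("Master-Data", rebuilt);
      return { data: rebuilt, stale: false };
    } catch (err) {
      // Fall through to the stale copy instead of failing the request.
    }
  }

  const stale = cache.get("Master-Data-Stale");
  if (stale === undefined) {
    throw new Error("master data unavailable: no fresh or stale copy");
  }
  return { data: stale, stale: true };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning a &lt;code&gt;stale&lt;/code&gt; flag alongside the data lets callers log or surface the degraded state without changing the response shape.&lt;/p&gt;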


&lt;h3&gt;
  
  
  &lt;strong&gt;4. Add a Redis distributed lock to prevent cache stampede&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Without this, even if stale data existed, multiple API nodes or Lambdas could all try to rebuild simultaneously — hammering the DB again.&lt;/p&gt;

&lt;p&gt;With a Redis lock:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only &lt;em&gt;one&lt;/em&gt; request gets the lock and rebuilds&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do &lt;strong&gt;not&lt;/strong&gt; hit the DB&lt;/li&gt;
&lt;li&gt;Simply return stale data&lt;/li&gt;
&lt;li&gt;Pick up the fresh value on a later request, once the winner has repopulated the cache&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This one change alone removes most of the stampede risk.&lt;/p&gt;
&lt;h3&gt;
  
  
  Node.js — Acquire Distributed Lock (Redis)
&lt;/h3&gt;

&lt;p&gt;Below is a simple Redis-based lock using &lt;code&gt;SET NX PX&lt;/code&gt; (no separate locking library).&lt;br&gt;
The example uses node-redis (the &lt;code&gt;redis&lt;/code&gt; package, v4 API); adapt the calls if your stack uses ioredis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// redis.js
const { createClient } = require("redis");

const redis = createClient({
  url: process.env.REDIS_URL
});
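// Without an "error" listener, a connection error would crash the process.
redis.on("error", function (err) {
  console.error("Redis client error", err);
});

// connect() returns a promise; log failures instead of leaving the rejection unhandled.
redis.connect().catch(function (err) {
  console.error("Redis connect failed", err);
});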

module.exports = redis;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Acquiring and Releasing the Lock
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// lock.js
const redis = require("./redis");
const { randomUUID } = require("crypto");

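const LOCK_KEY = "lock:master-data-refresh";
// Keep the TTL above the worst-case rebuild time (here: above the 30-second
// Lambda timeout), otherwise the lock can expire while the rebuild is still running.
const LOCK_TTL = 35000; // 35 seconds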

async function acquireLock() {
  const lockId = randomUUID();

  const result = await redis.set(LOCK_KEY, lockId, {
    NX: true,
    PX: LOCK_TTL
  });

  if (result === "OK") {
    return lockId; // lock acquired
  }

  return null; // lock not acquired
}

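// Compare-and-delete in one atomic step, so we never delete a lock that
// expired and was re-acquired by another worker between a GET and a DEL.
const RELEASE_SCRIPT = `
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  end
  return 0
`;

async function releaseLock(lockId) {
  await redis.eval(RELEASE_SCRIPT, {
    keys: [LOCK_KEY],
    arguments: [lockId]
  });
}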

module.exports = { acquireLock, releaseLock };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { acquireLock, releaseLock } = require("./lock");

// getStaleData, fetchFromDB and saveToCache are your application's own helpers.
async function refreshMasterData() {
  const lockId = await acquireLock();

  if (!lockId) {
    console.log("Another request is refreshing. Returning stale data.");
    return getStaleData();
  }

  try {
    const newData = await fetchFromDB();
    await saveToCache(newData);
    return newData;
  } finally {
    await releaseLock(lockId);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;5. Add observability around refresh times&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We now record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query execution time&lt;/li&gt;
&lt;li&gt;cache refresh duration&lt;/li&gt;
&lt;li&gt;lock acquisition metrics&lt;/li&gt;
&lt;li&gt;alerts when a refresh exceeds a threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to catch slowdowns &lt;em&gt;before&lt;/em&gt; a timeout happens.&lt;/p&gt;
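
&lt;p&gt;One way to capture those timings is a small wrapper around the refresh call. This is a sketch under assumptions: &lt;code&gt;timed&lt;/code&gt; and the &lt;code&gt;onSlow&lt;/code&gt; alert hook are hypothetical names, and the threshold is whatever fits your timeout budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: `onSlow` is a hypothetical alert hook (for example, a function
// that posts to your paging or chat tool); the threshold is an assumed value.
async function timed(label, thresholdMs, fn, onSlow) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    // Runs on success and on failure, so slow failures are recorded too.
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(JSON.stringify({ metric: label, durationMs }));
    if (durationMs > thresholdMs) {
      onSlow(label, durationMs);
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Wrapping the DB query and the whole refresh separately, with a threshold well below the Lambda timeout, gives an early warning while the job is still succeeding.&lt;/p&gt;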




&lt;h2&gt;
  
  
  📝 Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engine upgrades can change execution plans&lt;/strong&gt;, sometimes dramatically.&lt;/li&gt;
&lt;li&gt;Always benchmark critical queries after major DB changes.&lt;/li&gt;
&lt;li&gt;Cache invalidation strategies must assume that &lt;strong&gt;refresh can fail&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Serving &lt;strong&gt;stale-but-valid data&lt;/strong&gt; is often better than serving errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed locks&lt;/strong&gt; are essential in preventing cache stampede.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The incident was stressful, but the learnings were worth it.&lt;br&gt;
Caching problems rarely show up during normal traffic — they appear right when your system is busiest.&lt;/p&gt;

&lt;p&gt;If you have a similar “delete-then-refresh” pattern somewhere in your application… you may want to review it before it reviews you.&lt;/p&gt;

</description>
      <category>node</category>
      <category>redis</category>
      <category>aws</category>
      <category>mysql</category>
    </item>
    <item>
      <title>🧩 From 15 Minutes to Infinite: Scaling STT Jobs with AWS Batch</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 09 Nov 2025 05:25:17 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/how-we-fixed-missing-transcripts-with-aws-batch-e7e</link>
      <guid>https://dev.to/jayesh_shinde/how-we-fixed-missing-transcripts-with-aws-batch-e7e</guid>
      <description>&lt;h3&gt;
  
  
  💡 The Problem
&lt;/h3&gt;

&lt;p&gt;We recently ran into a production issue — our &lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt; service stopped working for a few hours.&lt;br&gt;
The feature was fixed quickly, but the &lt;strong&gt;transcripts for that downtime were missing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Luckily, in &lt;strong&gt;Amazon Connect&lt;/strong&gt;, all call recordings are stored in &lt;strong&gt;S3&lt;/strong&gt;.&lt;br&gt;
So the audio was there, but no transcripts.&lt;/p&gt;

&lt;p&gt;We needed to reprocess all those missed files — &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  🧠 First Attempt: Lambda (and its Limitations)
&lt;/h3&gt;

&lt;p&gt;We quickly built a &lt;strong&gt;Lambda&lt;/strong&gt; function to process unprocessed files from S3.&lt;/p&gt;

&lt;p&gt;It worked fine — until it didn’t.&lt;br&gt;
AWS Lambda has a &lt;strong&gt;15-minute execution limit&lt;/strong&gt;, and processing large audio files can easily exceed that.&lt;/p&gt;

&lt;p&gt;We could have switched to EC2, but that felt like using a hammer for a small screw — no auto-scaling, no graceful shutdown, no built-in retry or job management.&lt;/p&gt;

&lt;p&gt;We needed something that behaved &lt;strong&gt;like a job&lt;/strong&gt;, not a script.&lt;/p&gt;


&lt;h3&gt;
  
  
  🚀 Enter AWS Batch + Fargate
&lt;/h3&gt;

&lt;p&gt;That’s when &lt;strong&gt;AWS Batch&lt;/strong&gt; came to the rescue.&lt;br&gt;
It’s perfect for this kind of workload — long-running, batch-style, event-driven jobs.&lt;/p&gt;

&lt;p&gt;Here’s the setup we used:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Created a Compute Environment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backed by &lt;strong&gt;AWS Fargate&lt;/strong&gt; → no EC2 management.&lt;/li&gt;
&lt;li&gt;Scales automatically depending on job load.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Defined a Job Queue&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All reprocessing jobs will be submitted here.&lt;/li&gt;
&lt;li&gt;The queue ensures controlled concurrency and retries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Built a Job Definition&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packaged our STT processing logic as a &lt;strong&gt;Docker image&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Uploaded it to &lt;strong&gt;Amazon ECR&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Defined required vCPU and memory for each job.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Triggered via Lambda&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small Lambda fetches a list of unprocessed S3 files.&lt;/li&gt;
&lt;li&gt;For each batch (say 50 files), it &lt;strong&gt;submits a Batch Job&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
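
&lt;p&gt;The Lambda side of the last step might look something like the sketch below. It only builds the SubmitJob inputs; the queue and job definition names are the placeholders used in the CLI examples in this post, and the &lt;code&gt;FILE_KEYS&lt;/code&gt; environment variable is an assumption about how the container receives its work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: "my-queue", "my-job-def" and FILE_KEYS are placeholders.
function chunk(items, size) {
  if (!Number.isInteger(size) || Math.sign(size) !== 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  let rest = items;
  while (rest.length !== 0) {
    chunks.push(rest.slice(0, size));
    rest = rest.slice(size);
  }
  return chunks;
}

// Builds one SubmitJob input per batch of S3 keys.
function buildSubmitJobInputs(fileKeys, batchSize) {
  return chunk(fileKeys, batchSize).map(function (keys, index) {
    return {
      jobName: "process-audio-" + index,
      jobQueue: "my-queue",
      jobDefinition: "my-job-def",
      containerOverrides: {
        environment: [{ name: "FILE_KEYS", value: JSON.stringify(keys) }]
      }
    };
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each returned object can be passed to &lt;code&gt;SubmitJobCommand&lt;/code&gt; from &lt;code&gt;@aws-sdk/client-batch&lt;/code&gt;. For very long key lists, writing a manifest object to S3 and passing only its key is likely the safer choice, since job overrides are size-limited.&lt;/p&gt;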


&lt;h3&gt;
  
  
  ⚙️ The Flow in Action
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lambda →&lt;/strong&gt; Checks for unprocessed audio files in S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda → AWS Batch:&lt;/strong&gt; Submits a job to process them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Batch (Fargate)&lt;/strong&gt; spins up compute, runs the job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job →&lt;/strong&gt; Downloads audio → runs STT → uploads transcript → updates metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate shuts down automatically&lt;/strong&gt; when the job finishes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No idle servers, no manual cleanup, no stress.&lt;/p&gt;


&lt;h3&gt;
  
  
  🧩 Why This Design Rocks
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Serverless all the way&lt;/strong&gt; — Lambda + Fargate + S3&lt;br&gt;
✅ &lt;strong&gt;Auto-scaling compute&lt;/strong&gt; — no EC2 to babysit&lt;br&gt;
✅ &lt;strong&gt;Long-running safe zone&lt;/strong&gt; — runs beyond Lambda’s 15-min cap&lt;br&gt;
✅ &lt;strong&gt;Reusable&lt;/strong&gt; — we can reprocess any backlog anytime&lt;br&gt;
✅ &lt;strong&gt;Cost-efficient&lt;/strong&gt; — pay only for what’s used&lt;/p&gt;


&lt;h3&gt;
  
  
  🪄 Bonus Tip
&lt;/h3&gt;

&lt;p&gt;You can even schedule a &lt;strong&gt;“missed transcript” job&lt;/strong&gt; to run daily or weekly,&lt;br&gt;
checking for any files without transcripts and triggering a Batch job automatically.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧩 Understanding AWS Batch Scaling
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;AWS Batch&lt;/strong&gt;, the number of tasks (containers) that run &lt;strong&gt;in parallel&lt;/strong&gt; depends on &lt;strong&gt;three things&lt;/strong&gt; working together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compute Environment capacity&lt;/strong&gt;&lt;br&gt;
→ e.g., your environment has a maximum of &lt;code&gt;10 vCPUs&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Job Definition requirements&lt;/strong&gt;&lt;br&gt;
→ e.g., each job needs &lt;code&gt;1 vCPU&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How many jobs are in the queue (and their array size, if used).&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  🔹 Case 1: You Submit Multiple Independent Jobs
&lt;/h3&gt;

&lt;p&gt;If you submit &lt;strong&gt;10 jobs&lt;/strong&gt;, each with &lt;code&gt;1 vCPU&lt;/code&gt;, and your environment allows &lt;code&gt;10 vCPUs&lt;/code&gt;,&lt;br&gt;
then AWS Batch can &lt;strong&gt;run all 10 in parallel&lt;/strong&gt; (subject to available Fargate capacity).&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# pseudo example&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..10&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;aws batch submit-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-name&lt;/span&gt; process-audio-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-queue&lt;/span&gt; my-queue &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-definition&lt;/span&gt; my-job-def
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job = 1 vCPU → up to 10 can run simultaneously.&lt;/p&gt;

&lt;p&gt;AWS Batch’s &lt;strong&gt;Job Scheduler&lt;/strong&gt; will automatically pack as many as possible based on available compute.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔹 Case 2: You Use an Array Job
&lt;/h3&gt;

&lt;p&gt;Instead of manually looping, you can submit &lt;strong&gt;an array job&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws batch submit-job &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--job-name&lt;/span&gt; process-audios &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--job-queue&lt;/span&gt; my-queue &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--job-definition&lt;/span&gt; my-job-def &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--array-properties&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates &lt;strong&gt;10 child jobs&lt;/strong&gt; under a single parent, each running independently (great for S3 list chunking).&lt;/p&gt;

&lt;p&gt;Same result — 10 parallel containers, each with 1 vCPU.&lt;/p&gt;
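
&lt;p&gt;Inside the container, each child job can work out which part of the list is its own from the &lt;code&gt;AWS_BATCH_JOB_ARRAY_INDEX&lt;/code&gt; environment variable that Batch sets for array children (starting at 0). A minimal sketch, where the function names and the way the full key list reaches the container are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: function names are illustrative, not an AWS API.
function sliceForChild(allKeys, arraySize, childIndex) {
  // Split the list into `arraySize` contiguous slices of (almost) equal length.
  const perChild = Math.ceil(allKeys.length / arraySize);
  const start = childIndex * perChild;
  return allKeys.slice(start, start + perChild);
}

// Reads the index Batch assigned to this child; defaults to 0 outside Batch.
function currentChildIndex(env) {
  return Number(env.AWS_BATCH_JOB_ARRAY_INDEX || 0);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 25 keys and &lt;code&gt;size=10&lt;/code&gt;, children 0 to 7 get three keys each, child 8 gets one, and child 9 gets none, so it helps to pick an array size that matches the amount of work.&lt;/p&gt;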




&lt;h3&gt;
  
  
  🔹 Case 3: You Submit a Single Job that Needs More vCPUs
&lt;/h3&gt;

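&lt;p&gt;If your &lt;strong&gt;job definition&lt;/strong&gt; requests 4 vCPUs (shown here in the legacy &lt;code&gt;vcpus&lt;/code&gt; form; Fargate job definitions express the same thing through &lt;code&gt;resourceRequirements&lt;/code&gt; with type &lt;code&gt;VCPU&lt;/code&gt;):&lt;br&gt;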
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"vcpus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and your environment has 10 total vCPUs →&lt;br&gt;
then &lt;strong&gt;Batch will reserve 4 vCPUs&lt;/strong&gt; for that job, leaving room for other smaller jobs.&lt;/p&gt;

&lt;p&gt;So the compute environment doesn’t spawn “10 copies automatically” —&lt;br&gt;
it just enforces &lt;strong&gt;a maximum pool of total CPU&lt;/strong&gt; that concurrent jobs can consume.&lt;/p&gt;
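
&lt;p&gt;As back-of-the-envelope arithmetic, the CPU ceiling on concurrency is just the pool size divided by the per-job request. The helper below is a hypothetical illustration that looks at vCPUs only and ignores memory limits and available Fargate capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: considers vCPUs alone; memory and capacity can lower the number.
function maxConcurrentJobs(maxVcpus, vcpusPerJob) {
  return Math.floor(maxVcpus / vcpusPerJob);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;maxConcurrentJobs(10, 1)&lt;/code&gt; gives 10, while &lt;code&gt;maxConcurrentJobs(10, 4)&lt;/code&gt; gives 2, with the remaining 2 vCPUs left over for smaller jobs.&lt;/p&gt;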




&lt;h2&gt;
  
  
  ⚙️ TL;DR — How to Scale
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;What to Do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run multiple tasks concurrently&lt;/td&gt;
&lt;td&gt;Submit multiple jobs or an array job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Each job’s CPU need&lt;/td&gt;
&lt;td&gt;Defined in Job Definition (e.g., 1 vCPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max parallel limit&lt;/td&gt;
&lt;td&gt;Based on compute environment capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control at runtime&lt;/td&gt;
&lt;td&gt;You can pass &lt;code&gt;--array-properties size=N&lt;/code&gt; dynamically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling behavior&lt;/td&gt;
&lt;td&gt;Batch automatically scales Fargate/EC2 capacity up/down&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🏁 Closing Thoughts
&lt;/h3&gt;

&lt;p&gt;This experience reminded me —&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When your script starts feeling like a job, give it job-like powers.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AWS Batch (especially with Fargate) is often underrated,&lt;br&gt;
but it’s a powerful tool when you need &lt;strong&gt;on-demand, containerized, long-running compute&lt;/strong&gt;&lt;br&gt;
without managing any servers.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>fargate</category>
      <category>lambda</category>
      <category>node</category>
    </item>
    <item>
      <title>Reusing HTTP and SDK clients in AWS Lambda to avoid “too many open files” (FD) errors</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Wed, 15 Oct 2025 05:04:25 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/reusing-http-and-sdk-clients-in-aws-lambda-to-avoid-too-many-open-files-fd-errors-1p24</link>
      <guid>https://dev.to/jayesh_shinde/reusing-http-and-sdk-clients-in-aws-lambda-to-avoid-too-many-open-files-fd-errors-1p24</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We hit sporadic network errors in a high-throughput Lambda that made HTTP calls (Axios) and AWS SDK calls.&lt;/li&gt;
&lt;li&gt;Root cause: creating new HTTP clients/agents per invocation ballooned the number of open sockets (file descriptors).&lt;/li&gt;
&lt;li&gt;Fix: initialize clients and their &lt;code&gt;https.Agent&lt;/code&gt; once at module scope with keep-alive and reuse them across warm invocations. For AWS SDK v2, also set &lt;code&gt;AWS_NODEJS_CONNECTION_REUSE_ENABLED=1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The scenario
&lt;/h3&gt;

&lt;p&gt;We had a Lambda that was invoked asynchronously to process a large dataset (thousands of events). Inside the handler, we created an Axios client and AWS SDK client(s) for each invocation. Under sustained concurrency, we started seeing intermittent network failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptoms we saw
&lt;/h3&gt;

&lt;p&gt;These popped up in CloudWatch logs while the Lambda was busy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“too many open files” errors:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Error: EMFILE: too many open files, open&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NodeError: getaddrinfo ENFILE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Connection instability:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AxiosError: socket hang up&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Error: read ECONNRESET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Error: connect ECONNRESET&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Occasional timeouts and throttling-like behavior despite healthy downstream services&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These were worse during bursts when many async invocations overlapped.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s really happening (FDs and sockets in Lambda)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Every TCP connection (HTTP/HTTPS) consumes a file descriptor (FD).&lt;/li&gt;
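&lt;li&gt;Lambda caps each execution environment at 1,024 file descriptors, and that quota cannot be raised.&lt;/li&gt;
&lt;li&gt;If you create a new HTTP client (and thus a new &lt;code&gt;https.Agent&lt;/code&gt;) per invocation, each agent gets its own socket pool and cannot reuse connections opened by earlier agents. Sockets that are still open when an invocation finishes stay with the reused environment, so under sustained load they pile up until the FD limit is hit, leading to the errors above.&lt;/li&gt;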
&lt;li&gt;Lambda reuses the same execution environment for multiple “warm” invocations. Objects created at module scope are kept alive and reused, which is exactly what we want for clients and connection pools.&lt;/li&gt;
&lt;/ul&gt;
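
&lt;p&gt;To watch this happen, you can log the descriptor count from inside the function. The sketch below is not from our codebase: &lt;code&gt;countOpenFds&lt;/code&gt; is a hypothetical helper, and it assumes a Linux runtime where &lt;code&gt;/proc/self/fd&lt;/code&gt; is readable (which should hold for Lambda’s Node.js runtimes, but not for macOS or Windows).&lt;/p&gt;

```typescript
import { readdirSync } from 'node:fs';

// Hypothetical helper: counts the file descriptors currently open in this
// process by listing /proc/self/fd (Linux only). Returns undefined when the
// directory cannot be read, for example on macOS or Windows.
function countOpenFds(): number | undefined {
  try {
    return readdirSync('/proc/self/fd').length;
  } catch {
    return undefined;
  }
}

// Log the count at the end of an invocation. A number that keeps climbing
// across warm invocations suggests sockets are not being reused or closed.
console.log('open FDs:', countOpenFds());
```

&lt;p&gt;Logging this value once per invocation is usually enough to tell a steady pool from a leak.&lt;/p&gt;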

&lt;h3&gt;
  
  
  Why Node’s &lt;code&gt;https.Agent&lt;/code&gt; matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The agent controls connection pooling and keep-alive.&lt;/li&gt;
&lt;li&gt;Creating a new agent per invocation increases the number of socket pools and the total sockets in use.&lt;/li&gt;
&lt;li&gt;Reusing a single agent keeps the number of open sockets bounded and allows connection reuse across requests, reducing FD pressure and latency.&lt;/li&gt;
&lt;/ul&gt;
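
&lt;p&gt;The pooling difference is easy to reproduce locally. The sketch below is illustrative only: &lt;code&gt;countConnections&lt;/code&gt; is a hypothetical helper, and it uses plain &lt;code&gt;http&lt;/code&gt; against a throwaway local server to avoid certificates (&lt;code&gt;https.Agent&lt;/code&gt; extends &lt;code&gt;http.Agent&lt;/code&gt;, so the pooling logic is the same). With a shared keep-alive agent the server should see a single connection; with a fresh agent per request it should see one connection per request.&lt;/p&gt;

```typescript
import * as http from 'node:http';
import { once } from 'node:events';
import type { AddressInfo } from 'node:net';

// Hypothetical helper: sends `n` sequential GETs to a throwaway local server
// and returns how many TCP connections the server accepted. `makeAgent` is
// called once per request, so it decides whether the agent is shared.
async function countConnections(n: number, makeAgent: () => http.Agent) {
  let connections = 0;
  const server = http.createServer((_req, res) => { res.end('ok'); });
  server.on('connection', () => { connections += 1; });
  server.listen(0, '127.0.0.1');
  await once(server, 'listening');
  const { port } = server.address() as AddressInfo;

  const agents: http.Agent[] = [];
  for (let remaining = n; remaining > 0; remaining -= 1) {
    const agent = makeAgent();
    agents.push(agent);
    const req = http.get({ host: '127.0.0.1', port, path: '/', agent });
    const [res] = await once(req, 'response');
    res.resume(); // drain the body so the socket can go back to the pool
    await once(res, 'end');
  }

  agents.forEach((agent) => agent.destroy()); // close any idle keep-alive sockets
  server.close();
  return connections;
}

async function main() {
  const shared = new http.Agent({ keepAlive: true });
  console.log('shared agent:', await countConnections(5, () => shared));
  console.log('agent per request:', await countConnections(5, () => new http.Agent({ keepAlive: true })));
}

main().catch(console.error);
```

&lt;p&gt;In the per-request case every idle socket stays open until its agent is destroyed, which is the same build-up that eats file descriptors in a warm Lambda environment.&lt;/p&gt;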

&lt;h3&gt;
  
  
  The anti-pattern (what we had)
&lt;/h3&gt;

&lt;p&gt;Creating new clients and agents inside the handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Anti-pattern: runs on every invocation&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;// new agent each time&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same issue applies to the AWS SDK if you construct a &lt;code&gt;new&lt;/code&gt; client per invocation, especially if each client also gets its own agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix (module-level reuse with keep-alive)
&lt;/h3&gt;

&lt;p&gt;Move client and agent creation to module scope so they’re created once per warm environment and then reused.&lt;/p&gt;

&lt;h4&gt;
  
  
  Axios
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpsAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// tune based on expected concurrency per environment&lt;/span&gt;
  &lt;span class="na"&gt;maxFreeSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// socket idle timeout&lt;/span&gt;
  &lt;span class="c1"&gt;// freeSocketTimeout is an agentkeepalive option, not a core https.Agent option, so it is omitted here&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS SDK v3
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeHttpHandler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/node-http-handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;S3Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ListBucketsCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-s3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Core https.Agent options only (freeSocketTimeout belongs to agentkeepalive)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpsAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxFreeSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;S3Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeHttpHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;connectionTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;socketTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// S3Client exposes send(); the listBuckets() shortcut exists only on the aggregated S3 class&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ListBucketsCommand&lt;/span&gt;&lt;span class="p"&gt;({}));&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS SDK v2
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reuse clients, and enable connection reuse via env var.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Also set in Lambda env: AWS_NODEJS_CONNECTION_REUSE_ENABLED=1&lt;/span&gt;
&lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;httpOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listBuckets&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results after the change
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;FD-related errors (EMFILE, ENFILE, socket hang ups) disappeared under the same workload.&lt;/li&gt;
&lt;li&gt;Lower p95 latency due to connection reuse.&lt;/li&gt;
&lt;li&gt;Fewer outbound connection spikes visible on NAT Gateway/ENI metrics (for VPC Lambdas).&lt;/li&gt;
&lt;li&gt;More predictable behavior during bursts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bonus mitigations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Concurrency control: use an SQS event source mapping with a sane maximum concurrency and batch size (&lt;code&gt;ScalingConfig.MaximumConcurrency&lt;/code&gt;, &lt;code&gt;BatchSize&lt;/code&gt;), reserved concurrency, or step-wise throttling to prevent bursts from scaling FD usage across many environments at once.&lt;/li&gt;
&lt;li&gt;Timeouts and retries: set realistic timeouts; add backoff with jitter to avoid synchronized retries.&lt;/li&gt;
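&lt;li&gt;
&lt;code&gt;context.callbackWaitsForEmptyEventLoop = false&lt;/code&gt;: for callback-style handlers, this lets the function return even if the agent keeps idle sockets open (don’t overuse). Async handlers return as soon as their promise settles, so they do not need it.&lt;/li&gt;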
&lt;li&gt;Consider &lt;code&gt;undici&lt;/code&gt; for HTTP in Node 18+; it provides efficient HTTP/1.1 keep-alive by default.&lt;/li&gt;
&lt;/ul&gt;
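
&lt;p&gt;For the backoff-with-jitter item, here is a minimal sketch of the “full jitter” variant. It is not what we run in production: &lt;code&gt;backoffDelayMs&lt;/code&gt;, &lt;code&gt;withRetries&lt;/code&gt;, and the default numbers are hypothetical and should be tuned to your downstream service.&lt;/p&gt;

```typescript
// Full-jitter backoff: pick a random delay between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5_000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical retry wrapper: runs `task` up to `maxAttempts` times and
// waits a jittered delay between failures. The last error is rethrown.
async function withRetries(task: () => unknown, maxAttempts = 4) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

&lt;p&gt;Wrapping a call as &lt;code&gt;withRetries(() =&amp;gt; ax.get(url))&lt;/code&gt; spreads retries out in time, so many environments that fail together do not all retry at the same instant.&lt;/p&gt;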

&lt;h3&gt;
  
  
  Quick checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initialize HTTP clients and SDK clients at module scope.&lt;/li&gt;
&lt;li&gt;Use a shared &lt;code&gt;https.Agent&lt;/code&gt; with &lt;code&gt;keepAlive: true&lt;/code&gt;; set &lt;code&gt;maxSockets&lt;/code&gt;, &lt;code&gt;maxFreeSockets&lt;/code&gt;, and timeouts.&lt;/li&gt;
&lt;li&gt;For AWS SDK v2, set &lt;code&gt;AWS_NODEJS_CONNECTION_REUSE_ENABLED=1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Avoid creating clients/agents inside loops or inside the handler.&lt;/li&gt;
&lt;li&gt;Monitor and tune under realistic concurrency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Closing thoughts
&lt;/h3&gt;

&lt;p&gt;FD exhaustion is easy to miss until traffic scales. In serverless, the simplest lever is to reuse resources across warm invocations. One shared agent + one shared client per execution environment eliminates a whole class of flaky, intermittent network issues.&lt;/p&gt;

</description>
      <category>lambda</category>
      <category>serverless</category>
      <category>node</category>
      <category>aws</category>
    </item>
    <item>
      <title>🧠 How We Upgraded Our WordPress Search with OpenSearch Neural + Cohere for Multilingual Semantic Search</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 11 Oct 2025 04:32:51 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/how-we-upgraded-our-wordpress-search-with-opensearch-neural-cohere-for-multilingual-semantic-4j56</link>
      <guid>https://dev.to/jayesh_shinde/how-we-upgraded-our-wordpress-search-with-opensearch-neural-cohere-for-multilingual-semantic-4j56</guid>
      <description>&lt;p&gt;At our company, we use WordPress as a knowledge base (KB) for internal articles.&lt;br&gt;&lt;br&gt;
We indexed those articles in &lt;strong&gt;OpenSearch&lt;/strong&gt;, but the default keyword search felt… old-school.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Searching &lt;strong&gt;“Netflix subscription”&lt;/strong&gt; missed “How to manage ネットフリックス plans”.&lt;/li&gt;
&lt;li&gt;Searching &lt;strong&gt;“AWS cost optimization”&lt;/strong&gt; returned random hits because keywords didn’t align.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we upgraded our search to &lt;strong&gt;semantic search&lt;/strong&gt; using the &lt;strong&gt;OpenSearch ML plugin&lt;/strong&gt; + &lt;strong&gt;Cohere embeddings served via Amazon Bedrock&lt;/strong&gt; — a great combination for &lt;strong&gt;multilingual understanding&lt;/strong&gt; and secure enterprise integration.&lt;/p&gt;


&lt;h2&gt;
  
  
  ⚙️ The Problem
&lt;/h2&gt;

&lt;p&gt;Our initial setup used simple keyword mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even after tweaking analyzers, users searching in Japanese (katakana) or English weren’t getting expected matches.&lt;/p&gt;

&lt;p&gt;For example, “netflix” and “ネットフリックス” should be the same — but OpenSearch treated them as completely different tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 The Plan
&lt;/h2&gt;

&lt;p&gt;We wanted to add &lt;strong&gt;semantic search&lt;/strong&gt; on top of our existing index.&lt;br&gt;
That means converting both documents &lt;strong&gt;and&lt;/strong&gt; queries into &lt;strong&gt;vectors&lt;/strong&gt; using the same embedding model, and comparing them using cosine similarity.&lt;/p&gt;
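
&lt;p&gt;For intuition, cosine similarity itself is only a few lines. This is a sketch rather than what OpenSearch runs internally, and the three-dimensional vectors are toy values (the Cohere model used in this post produces 1024-dimensional embeddings):&lt;/p&gt;

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean the vectors
// point in almost the same direction; values near 0 mean they are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same dimension');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    dot += x * b[i];
    normA += x * x;
    normB += b[i] * b[i];
  });
  if (normA === 0 || normB === 0) return 0; // similarity is undefined for a zero vector; 0 is returned by convention here
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for a query embedding and a document embedding.
const queryVector = [0.9, 0.1, 0.0];
const documentVector = [0.8, 0.2, 0.1];
console.log('similarity:', cosineSimilarity(queryVector, documentVector));
```

&lt;p&gt;Because the score depends only on direction, a query and a document in different languages can score highly as long as the model places them close together in the vector space.&lt;/p&gt;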

&lt;p&gt;Our plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an &lt;strong&gt;ML connector&lt;/strong&gt; for the Cohere embedding API&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;model&lt;/strong&gt; for document embeddings (input_type = &lt;code&gt;search_document&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Create another &lt;strong&gt;model&lt;/strong&gt; for query embeddings (input_type = &lt;code&gt;search_query&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Update our &lt;strong&gt;ingestion pipeline&lt;/strong&gt; to generate vectors during indexing&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;script_score&lt;/strong&gt; (or kNN query) to retrieve the best semantic matches&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  🔌 Step 1: Create a Connector (to Cohere)
&lt;/h2&gt;

&lt;p&gt;First, we create an &lt;strong&gt;ML connector&lt;/strong&gt; in OpenSearch that calls Cohere’s embedding model through Amazon Bedrock.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/connectors/_create
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-connector"&lt;/span&gt;,
  &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Amazon Bedrock connector for Cohere embeddings (document)"&lt;/span&gt;,
  &lt;span class="s2"&gt;"protocol"&lt;/span&gt;: &lt;span class="s2"&gt;"aws_sigv4"&lt;/span&gt;,
  &lt;span class="s2"&gt;"parameters"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"region"&lt;/span&gt;: &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;,
    &lt;span class="s2"&gt;"service_name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock"&lt;/span&gt;,
    &lt;span class="s2"&gt;"model_id"&lt;/span&gt;: &lt;span class="s2"&gt;"cohere.embed-multilingual-v3"&lt;/span&gt;,
    &lt;span class="s2"&gt;"input_type"&lt;/span&gt;: &lt;span class="s2"&gt;"search_document"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



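&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;protocol: aws_sigv4&lt;/code&gt; makes OpenSearch sign Bedrock API calls with IAM credentials.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;model_id&lt;/code&gt; refers to Cohere’s multilingual embedding model hosted on Bedrock.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;input_type=search_document&lt;/code&gt; requests document-level embeddings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The request above is abbreviated: a working connector also needs a &lt;code&gt;credential&lt;/code&gt; block and an &lt;code&gt;actions&lt;/code&gt; array that describes the Bedrock invoke call.&lt;/p&gt;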




&lt;h2&gt;
  
  
  🧩 Step 2: Register the Model
&lt;/h2&gt;

&lt;p&gt;Next, we register a remote model that uses the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/_register
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-embed"&lt;/span&gt;,
  &lt;span class="s2"&gt;"function_name"&lt;/span&gt;: &lt;span class="s2"&gt;"remote"&lt;/span&gt;,
  &lt;span class="s2"&gt;"connector_id"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-connector"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model will be used in our ingestion pipeline. For readability, this post uses names such as &lt;code&gt;bedrock-cohere-doc-connector&lt;/code&gt; where IDs are expected; in a real cluster, pass the generated &lt;code&gt;connector_id&lt;/code&gt; returned by the connector call and the &lt;code&gt;model_id&lt;/code&gt; returned by registration.&lt;/p&gt;

&lt;p&gt;Then deploy it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/bedrock-cohere-doc-embed/_deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenSearch now knows how to call Cohere (through Bedrock) to embed documents.&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Step 3: Create the Index with a Vector Field
&lt;/h2&gt;

&lt;p&gt;Next, we create our knowledge base index that includes a vector field for embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT kb-articles
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"settings"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"index"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"knn"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"mappings"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"properties"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"text"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
      &lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"text"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
      &lt;span class="s2"&gt;"embedding"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"knn_vector"&lt;/span&gt;,
        &lt;span class="s2"&gt;"dimension"&lt;/span&gt;: 1024,
        &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"hnsw"&lt;/span&gt;,
          &lt;span class="s2"&gt;"space_type"&lt;/span&gt;: &lt;span class="s2"&gt;"cosinesimil"&lt;/span&gt;,
          &lt;span class="s2"&gt;"engine"&lt;/span&gt;: &lt;span class="s2"&gt;"nmslib"&lt;/span&gt;,
          &lt;span class="s2"&gt;"parameters"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"ef_construction"&lt;/span&gt;: 512,
            &lt;span class="s2"&gt;"m"&lt;/span&gt;: 16
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vector field &lt;code&gt;embedding&lt;/code&gt; will hold our document-level embeddings from Cohere (1024 dimensions).&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Step 4: Add an Ingest Pipeline
&lt;/h2&gt;

&lt;p&gt;We’ll generate document embeddings automatically during ingestion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT _ingest/pipeline/kb-embed-pipeline
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"processors"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"ml_inference"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"model_id"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-embed"&lt;/span&gt;,
        &lt;span class="s2"&gt;"input_map"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"text"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
        &lt;span class="s2"&gt;"output_map"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"embedding"&lt;/span&gt;: &lt;span class="s2"&gt;"embedding"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, when we index a document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST kb-articles/_doc?pipeline&lt;span class="o"&gt;=&lt;/span&gt;kb-embed-pipeline
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Netflix subscription help"&lt;/span&gt;,
  &lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="s2"&gt;"Steps to manage your Netflix account and billing."&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenSearch automatically calls Amazon Bedrock, retrieves Cohere’s embedding, and stores it in the &lt;code&gt;embedding&lt;/code&gt; vector field.&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Step 5: Handle User Queries with Another Connector
&lt;/h2&gt;

&lt;p&gt;Initially, I used the same connector (with &lt;code&gt;input_type=search_document&lt;/code&gt;) for both documents and queries.&lt;br&gt;
That caused a mismatch: “ネットフリックス” (Katakana) and “Netflix” still did not match.&lt;/p&gt;

&lt;p&gt;The fix was to create another connector and model specifically for query embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/connectors/_create
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-query-connector"&lt;/span&gt;,
  &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Amazon Bedrock connector for Cohere embeddings (query)"&lt;/span&gt;,
  &lt;span class="s2"&gt;"protocol"&lt;/span&gt;: &lt;span class="s2"&gt;"aws_sigv4"&lt;/span&gt;,
  &lt;span class="s2"&gt;"parameters"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"region"&lt;/span&gt;: &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;,
    &lt;span class="s2"&gt;"service_name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock"&lt;/span&gt;,
    &lt;span class="s2"&gt;"model_id"&lt;/span&gt;: &lt;span class="s2"&gt;"cohere.embed-multilingual-v3"&lt;/span&gt;,
    &lt;span class="s2"&gt;"input_type"&lt;/span&gt;: &lt;span class="s2"&gt;"search_query"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then register and deploy a second model that uses this connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/_register
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-query-embed"&lt;/span&gt;,
  &lt;span class="s2"&gt;"function_name"&lt;/span&gt;: &lt;span class="s2"&gt;"embedding"&lt;/span&gt;,
  &lt;span class="s2"&gt;"connector_id"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-query-connector"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

POST _plugins/_ml/models/bedrock-cohere-query-embed/_deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures documents and queries are embedded by the same model into the same vector space, each with the &lt;code&gt;input_type&lt;/code&gt; that Cohere expects for its role.&lt;/p&gt;

&lt;p&gt;Now, when we run a search, we first generate a &lt;strong&gt;query embedding&lt;/strong&gt; via the ML model’s &lt;code&gt;_predict&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/bedrock-cohere-query-embed/_predict
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"text"&lt;/span&gt;: &lt;span class="s2"&gt;"ネットフリックス subscription"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
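
&lt;p&gt;The &lt;code&gt;_predict&lt;/code&gt; call returns the vector nested inside a response object rather than as a bare array. As a rough sketch, assuming the response follows the common ML Commons layout (&lt;code&gt;inference_results[0].output[0].data&lt;/code&gt;, which can differ depending on the connector's post-processing), a hypothetical helper such as &lt;code&gt;extractEmbedding&lt;/code&gt; could pull the vector out and check that it matches the 1024 dimensions of the &lt;code&gt;embedding&lt;/code&gt; field:&lt;/p&gt;

```typescript
// Hypothetical types describing the ASSUMED shape of a _predict response.
// The real shape depends on your connector and its post-processing function.
interface PredictOutput {
  name: string;
  data: number[];
}

interface PredictResponse {
  inference_results: { output: PredictOutput[] }[];
}

// Dimension of the knn_vector field used in this article.
const EXPECTED_DIMENSION = 1024;

// Pull the first embedding out of the response and verify its length,
// so a wrong model or mapping fails early instead of at search time.
function extractEmbedding(
  response: PredictResponse,
  expectedDimension: number = EXPECTED_DIMENSION
): number[] {
  const firstResult = response.inference_results[0];
  const firstOutput = firstResult ? firstResult.output[0] : undefined;
  if (firstOutput === undefined) {
    throw new Error("No embedding found in _predict response");
  }
  if (firstOutput.data.length !== expectedDimension) {
    throw new Error(
      "Dimension mismatch: expected " + expectedDimension +
        ", got " + firstOutput.data.length
    );
  }
  return firstOutput.data;
}
```

&lt;p&gt;The returned array is what goes into &lt;code&gt;query_vector&lt;/code&gt; in the next step.&lt;/p&gt;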






&lt;h2&gt;
  
  
  🔍 Step 6: Semantic Search Query
&lt;/h2&gt;

&lt;p&gt;Finally, we plug the embedding into a &lt;code&gt;script_score&lt;/code&gt; query to rank results by cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST kb-articles/_search
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"size"&lt;/span&gt;: 5,
  &lt;span class="s2"&gt;"query"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"script_score"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"query"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"match_all"&lt;/span&gt;: &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
      &lt;span class="s2"&gt;"script"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"source"&lt;/span&gt;: &lt;span class="s2"&gt;"cosineSimilarity(params.query_vector, 'embedding') + 1.0"&lt;/span&gt;,
        &lt;span class="s2"&gt;"params"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"query_vector"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt; embedding array from _predict &lt;span class="k"&gt;*&lt;/span&gt;/]
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"sort"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"_score"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"order"&lt;/span&gt;: &lt;span class="s2"&gt;"desc"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
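
&lt;p&gt;The &lt;code&gt;+ 1.0&lt;/code&gt; in the script is there because cosine similarity ranges from -1 to 1, while &lt;code&gt;script_score&lt;/code&gt; does not accept negative scores. Adding 1 shifts the result into the 0 to 2 range. The sketch below reproduces that arithmetic locally on toy vectors; &lt;code&gt;cosineSimilarity&lt;/code&gt; and &lt;code&gt;shiftedScore&lt;/code&gt; here are illustrative TypeScript functions, not part of any OpenSearch client.&lt;/p&gt;

```typescript
// Plain cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, i) => {
    dot += value * b[i];
    normA += value * value;
    normB += b[i] * b[i];
  });
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors the "+ 1.0" in the script: maps [-1, 1] onto [0, 2]
// so the final score is never negative.
function shiftedScore(queryVector: number[], docVector: number[]): number {
  return cosineSimilarity(queryVector, docVector) + 1.0;
}

// Toy 2-dimensional vectors standing in for real 1024-dimensional embeddings.
const toyQueryVector = [1, 0];
const toyDocs = [
  { title: "orthogonal", embedding: [0, 3] },
  { title: "opposite direction", embedding: [-1, 0] },
  { title: "same direction", embedding: [2, 0] },
];

// Rank documents the way the search does: highest shifted score first.
const rankedDocs = toyDocs
  .map((doc) => ({
    title: doc.title,
    score: shiftedScore(toyQueryVector, doc.embedding),
  }))
  .sort((x, y) => y.score - x.score);

// same direction: 2, orthogonal: 1, opposite direction: 0
console.log(rankedDocs);
```

&lt;p&gt;Vector length does not matter here, only direction, which is why the document with embedding &lt;code&gt;[2, 0]&lt;/code&gt; still gets the maximum score against the query &lt;code&gt;[1, 0]&lt;/code&gt;.&lt;/p&gt;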



&lt;p&gt;Now “Netflix” and “ネットフリックス” both match beautifully.&lt;br&gt;
🎯 Cohere’s multilingual embeddings + OpenSearch vector search did the trick.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 Why We Chose Cohere
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong multilingual understanding&lt;/strong&gt; — perfect for English + Japanese content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy integration&lt;/strong&gt; via Amazon Bedrock and an OpenSearch ML connector (SigV4-signed, no separate Cohere API key)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent embedding dimensions&lt;/strong&gt; — works well with &lt;code&gt;knn_vector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast inference&lt;/strong&gt; — good for production-scale pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🪛 Troubleshooting Tips
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Katakana &amp;amp; English not matching&lt;/td&gt;
&lt;td&gt;Used &lt;code&gt;search_document&lt;/code&gt; for query embeddings&lt;/td&gt;
&lt;td&gt;Create a separate connector with &lt;code&gt;input_type=search_query&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Dimension mismatch” errors&lt;/td&gt;
&lt;td&gt;Wrong embedding model or field dimension&lt;/td&gt;
&lt;td&gt;Make sure both model &amp;amp; field use same &lt;code&gt;dimension&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference timeout&lt;/td&gt;
&lt;td&gt;Bedrock throttling or rate limits on the Cohere model&lt;/td&gt;
&lt;td&gt;Batch or cache embeddings during ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weird scores&lt;/td&gt;
&lt;td&gt;Missing &lt;code&gt;space_type: cosinesimil&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Use cosine similarity in mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  ✅ Summary
&lt;/h2&gt;

&lt;p&gt;We started with basic keyword search and ended up with &lt;strong&gt;multilingual semantic search&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch ML plugin&lt;/li&gt;
&lt;li&gt;Cohere embedding models&lt;/li&gt;
&lt;li&gt;Cosine similarity on &lt;code&gt;knn_vector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Separate connectors for documents vs. queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✨ Outcome
&lt;/h2&gt;

&lt;p&gt;After integrating Cohere via Bedrock and separating the input types:&lt;/p&gt;

&lt;p&gt;✅ We can now search using phrases, not just keywords&lt;/p&gt;

&lt;p&gt;🌏 Cross-lingual search works — “Netflix” ≈ “ネットフリックス”&lt;/p&gt;

&lt;p&gt;💬 Semantic matching improved drastically (e.g., “streaming issue” finds “再生 エラー”)&lt;/p&gt;

&lt;p&gt;📈 Search relevance and recall are noticeably better, even for non-English content&lt;/p&gt;

&lt;p&gt;The combination of OpenSearch ML plugin + Cohere embeddings via Bedrock turned our keyword search into a truly semantic multilingual search engine — all running within the AWS ecosystem.&lt;/p&gt;




&lt;p&gt;💡 &lt;em&gt;If you’re building multilingual or brand-sensitive search, don’t skip the &lt;code&gt;input_type&lt;/code&gt; difference — it can make or break your semantic matching.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>opensearch</category>
      <category>semanticsearch</category>
      <category>wordpress</category>
      <category>aws</category>
    </item>
    <item>
      <title>🛠️ Fixing Lost SecurityContext and Correlation IDs in Async Calls with Spring Boot</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 05 Oct 2025 11:57:58 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/fixing-lost-securitycontext-and-correlation-ids-in-async-calls-with-spring-boot-4pc8</link>
      <guid>https://dev.to/jayesh_shinde/fixing-lost-securitycontext-and-correlation-ids-in-async-calls-with-spring-boot-4pc8</guid>
      <description>&lt;p&gt;When we started parallelizing API calls in our Spring Boot service using &lt;code&gt;CompletableFuture&lt;/code&gt; and a custom &lt;code&gt;ExecutorService&lt;/code&gt;, everything looked great… until we checked the logs.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our &lt;strong&gt;JWT &lt;code&gt;SecurityContext&lt;/code&gt;&lt;/strong&gt; wasn’t available in the async threads.
&lt;/li&gt;
&lt;li&gt;Our &lt;strong&gt;MDC correlation IDs&lt;/strong&gt; (used for distributed tracing/log correlation) were missing too.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That meant downstream services didn’t know &lt;em&gt;who&lt;/em&gt; was calling, and our logs lost the ability to tie requests together. Not good.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚨 The Problem
&lt;/h2&gt;

&lt;p&gt;Spring Security stores authentication in a &lt;code&gt;ThreadLocal&lt;/code&gt; (&lt;code&gt;SecurityContextHolder&lt;/code&gt;).&lt;br&gt;&lt;br&gt;
SLF4J’s MDC (Mapped Diagnostic Context) also uses &lt;code&gt;ThreadLocal&lt;/code&gt; to store correlation IDs.  &lt;/p&gt;

&lt;p&gt;When you hop threads (e.g., via &lt;code&gt;CompletableFuture.supplyAsync&lt;/code&gt;), those &lt;code&gt;ThreadLocal&lt;/code&gt; values don’t magically follow along. So in worker threads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SecurityContextHolder.getContext()&lt;/code&gt; → empty
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MDC.get("correlationId")&lt;/code&gt; → null
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  ✅ The Solution: Wrap the Executor
&lt;/h2&gt;

&lt;p&gt;We solved this by wrapping our &lt;code&gt;ExecutorService&lt;/code&gt; in a lightweight decorator that &lt;strong&gt;captures the MDC + SecurityContext from the submitting thread&lt;/strong&gt; and restores them inside the worker thread.&lt;/p&gt;

&lt;p&gt;Here’s the implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContextPropagatingExecutorService&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AbstractExecutorService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;ContextPropagatingExecutorService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;delegate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mdc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MDC&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCopyOfContextMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;SecurityContext&lt;/span&gt; &lt;span class="n"&gt;securityContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SecurityContextHolder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContext&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mdc&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="no"&gt;MDC&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setContextMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mdc&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;securityContext&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;SecurityContextHolder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setContext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;securityContext&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="no"&gt;MDC&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clear&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                &lt;span class="nc"&gt;SecurityContextHolder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clearContext&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;};&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// delegate lifecycle methods...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚙️ Wiring It Up
&lt;/h2&gt;

&lt;p&gt;In our Spring Boot config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"executorService"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="nf"&gt;executorService&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newCachedThreadPool&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ContextPropagatingExecutorService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, whenever we do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;AccountDTO&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fromAccount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;supplyAsync&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;accountClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAccountsById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;executorService&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…the async thread has the &lt;strong&gt;same SecurityContext and MDC&lt;/strong&gt; as the request thread.  &lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Before vs After
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SecurityContextHolder.getContext()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Empty in async thread&lt;/td&gt;
&lt;td&gt;Correctly populated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MDC.get("correlationId")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;null&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same correlation ID as request thread&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;Missing trace IDs&lt;/td&gt;
&lt;td&gt;Full traceability across async calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downstream services&lt;/td&gt;
&lt;td&gt;No JWT propagated&lt;/td&gt;
&lt;td&gt;JWT available for Feign/RestTemplate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🔑 Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ThreadLocals don’t cross thread boundaries&lt;/strong&gt; — you need to propagate them manually.
&lt;/li&gt;
&lt;li&gt;Wrapping your &lt;code&gt;ExecutorService&lt;/code&gt; is a clean, reusable fix.
&lt;/li&gt;
&lt;li&gt;This pattern works not just for MDC + SecurityContext, but for any contextual data you need across async boundaries.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;If you’re building microservices with Spring Boot and using async execution (&lt;code&gt;CompletableFuture&lt;/code&gt;, &lt;code&gt;@Async&lt;/code&gt;, Kafka listeners, etc.), don’t forget about context propagation. Without it, your logs and security checks will silently break.  &lt;/p&gt;

&lt;p&gt;Wrapping your executor is a small change that pays off big in &lt;strong&gt;observability&lt;/strong&gt; and &lt;strong&gt;security consistency&lt;/strong&gt;.  &lt;/p&gt;




</description>
      <category>springboot</category>
      <category>java</category>
      <category>backenddevelopment</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why My CDK Deploys Started Failing After Org Added Strict SCP Rules</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 27 Sep 2025 07:56:29 +0000</pubDate>
      <link>https://dev.to/jayesh_shinde/why-my-cdk-deploys-started-failing-after-org-added-strict-scp-rules-o88</link>
      <guid>https://dev.to/jayesh_shinde/why-my-cdk-deploys-started-failing-after-org-added-strict-scp-rules-o88</guid>
      <description>&lt;p&gt;Recently, I ran into a head‑scratcher while deploying a CDK stack. Everything used to work fine, but once my organization introduced &lt;strong&gt;strict SCP rules based on tags&lt;/strong&gt;, my &lt;code&gt;cdk deploy&lt;/code&gt; started failing with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AccessDenied: action cloudformation:CreateChangeSet is not authorized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, it didn’t make sense. I was &lt;em&gt;already tagging everything&lt;/em&gt; in my CDK code. I even had a loop that pulled tags from &lt;code&gt;props.Tags&lt;/code&gt; and attached them with &lt;code&gt;cdk.Tags.of(this).add(...)&lt;/code&gt;. These tags used to flow down nicely to all resources — including the CloudFormation stack itself.&lt;/p&gt;

&lt;p&gt;So why did it suddenly stop working? 🤔&lt;/p&gt;




&lt;h3&gt;
  
  
  What Changed? SCPs and Request‑Time Enforcement
&lt;/h3&gt;

&lt;p&gt;The key is &lt;strong&gt;where SCP rules get evaluated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An SCP restricts &lt;em&gt;which API calls are allowed&lt;/em&gt; in an account, and it can make that decision based on the contents of the request itself. In my case, the org had a policy statement like this (simplified example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudformation:CreateChangeSet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StringNotEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"aws:RequestTag/Org"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ABC"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means: &lt;em&gt;if the &lt;code&gt;CreateChangeSet&lt;/code&gt; request doesn’t already include the required tag, the call is blocked right away.&lt;/em&gt;&lt;/p&gt;
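
&lt;p&gt;To see why the request is rejected before any resource exists, it can help to evaluate the condition by hand. The sketch below is a deliberately simplified model of that single &lt;code&gt;StringNotEquals&lt;/code&gt; check (real IAM policy evaluation involves many more rules); &lt;code&gt;isDeniedByScp&lt;/code&gt; and the &lt;code&gt;RequestTags&lt;/code&gt; type are hypothetical names used only for illustration.&lt;/p&gt;

```typescript
// Tags sent along with an API request, e.g. the CreateChangeSet call.
type RequestTags = { [key: string]: string };

// Simplified model of the Deny statement above:
// deny when aws:RequestTag/Org is missing or is not exactly "ABC".
// A missing key also satisfies StringNotEquals, so untagged requests are denied.
function isDeniedByScp(requestTags: RequestTags): boolean {
  const org = requestTags["Org"];
  return org !== "ABC";
}

// A request that carries no tags at all.
const untaggedRequest: RequestTags = {};
// A request that carries the required tag.
const taggedRequest: RequestTags = { Org: "ABC", Owner: "TeamA" };

console.log(isDeniedByScp(untaggedRequest)); // true: blocked
console.log(isDeniedByScp(taggedRequest)); // false: allowed through
```

&lt;p&gt;The comparison is case-sensitive, so a request tagged &lt;code&gt;Org=abc&lt;/code&gt; would be denied as well.&lt;/p&gt;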




&lt;h3&gt;
  
  
  CDK Tagging: Two Different Worlds
&lt;/h3&gt;

&lt;p&gt;This is where CDK behavior matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;StackProps.tags&lt;/strong&gt;&lt;br&gt;
When you pass tags into the &lt;code&gt;super(scope, id, props)&lt;/code&gt; constructor, CDK includes those tags in the &lt;code&gt;CreateChangeSet&lt;/code&gt; API call. They are the request tags that the SCP evaluates through the &lt;code&gt;aws:RequestTag/Org&lt;/code&gt; condition key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cdk.Tags.of(resource).add(...)&lt;/strong&gt;&lt;br&gt;
This method attaches tags to resources &lt;strong&gt;inside the CloudFormation template&lt;/strong&gt;. They are applied &lt;em&gt;after&lt;/em&gt; the stack is already created.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So my old approach of looping through &lt;code&gt;props.Tags&lt;/code&gt; and calling &lt;code&gt;cdk.Tags.of(this).add(...)&lt;/code&gt; worked fine in the past, but now fails because the SCP never lets the stack get created in the first place. The required tags simply aren’t present yet at request time.&lt;/p&gt;




&lt;h3&gt;
  
  
  Fix: Pass Tags via StackProps
&lt;/h3&gt;

&lt;p&gt;The solution was simple once I understood the difference. My props previously carried a &lt;strong&gt;Tags&lt;/strong&gt; property (capital T). I replaced it with &lt;strong&gt;tags&lt;/strong&gt;, the property that &lt;code&gt;cdk.StackProps&lt;/code&gt; expects and that CDK uses when it initializes its in-memory construct tree (the Stack object inside your app).&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;cdk deploy&lt;/code&gt;, the CLI uses the CloudFormation SDK to call &lt;code&gt;CreateChangeSet&lt;/code&gt; (with change set type &lt;code&gt;CREATE&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt;). This is the first call that creates or changes the stack.&lt;/p&gt;

&lt;p&gt;This is where the stack-level tags (&lt;code&gt;props.tags&lt;/code&gt;) are injected into the request payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MyTaggedStack&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;Org&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ABC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TeamA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StackProps.tags&lt;/strong&gt; → go straight into the &lt;code&gt;CreateChangeSet&lt;/code&gt; request (SCP passes ✅).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cdk.Tags.of(this).add(...)&lt;/strong&gt; → still ensures all resources get the same tags after creation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Takeaway
&lt;/h3&gt;

&lt;p&gt;If your organization enforces strict &lt;strong&gt;SCP rules on &lt;code&gt;cloudformation:CreateChangeSet&lt;/code&gt;&lt;/strong&gt;, you can’t rely on &lt;code&gt;cdk.Tags.of(...)&lt;/code&gt; alone. Those tags arrive too late. You need to use &lt;strong&gt;&lt;code&gt;StackProps.tags&lt;/code&gt;&lt;/strong&gt; so the tags are present &lt;em&gt;in the request itself&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It’s a subtle but important difference — and once I understood it, the “AccessDenied” error finally made sense.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>cloudformation</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
