DEV Community: Jayesh Shinde

Stop Using IAM Access Keys: Secure Cross-Cloud Workloads with OIDC Federation

Jayesh Shinde — Thu, 11 Jun 2026 02:21:33 +0000

As developers and DevOps engineers, we’ve all been there. You have an external service—maybe an Azure Dynamics 365 (D365) business application or a GitHub Actions CI/CD pipeline—that needs to upload a file to Amazon S3 or trigger an AWS Lambda function.

The easiest path? Create an AWS IAM User, generate a pair of static AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY credentials, dump them into your external service secrets, and call it a day.

Stop doing this. 🛑

The AWS Well-Architected Framework recommends temporary credentials over long-lived access keys, which rank among the biggest security risks in a cloud environment. If those keys are leaked, hardcoded by accident, or left unrotated, an attacker can do everything the key's IAM user is allowed to do until someone notices and revokes the key.

The solution? Workload Identity Federation via OpenID Connect (OIDC).

In this post, we’ll look at why you need to ditch IAM users and exactly how to connect external workloads like Azure D365 and GitHub Actions securely using short-lived, temporary tokens.

Why NOT IAM Users?

Let’s compare managing static keys by hand with assuming roles dynamically:

Concern	IAM User (Access Keys)	IAM Role (OIDC / Assume Role)
Credential Rotation	Manual, tedious, and error-prone	Automatic (handled by AWS STS)
Leakage Risk	High (long-lived keys stay valid until rotated or deleted)	Low (temporary credentials expire after 1 hour by default)
Auditability	Hard to trace back to specific sessions	Clear session-based trails in CloudTrail
Scalability	1 user key per service to track	1 IAM role, multiple trusted identity claim mappings
AWS Status	🛑 Discouraged	Preferred

The Core Concept: Workload Identity Federation

Instead of authenticating with a pre-shared password (an Access Key), AWS and your external identity provider establish a cryptographic trust.


[External Workload] ──(1. Requests OIDC JWT)──> [Identity Provider (Entra ID / GitHub)]
        │                                                        │
        │ (2. Presents JWT)                                      │
        ▼                                                        │
[AWS STS AssumeRoleWithWebIdentity] <──(3. Verifies Token)───────┘
        │
        ▼
[Temporary AWS Credentials Granted] ───> [Access AWS Resources (S3, Lambda, etc.)]

AWS acts as the validation authority, inspecting the incoming JSON Web Token (JWT) from your external provider, matching it against rules you define, and granting temporary AWS IAM credentials valid for only a brief window.
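When a trust policy rejects a token, it often helps to look at the claims AWS actually receives. The following is a minimal debugging sketch, not part of the original setup: `decodeJwtClaims` and the `OidcClaims` type are hypothetical names, and the function only decodes the payload. It does not verify the signature, which remains the job of AWS STS.

```typescript
// Claims that an IAM trust policy typically inspects.
interface OidcClaims {
  iss?: string;            // issuer: should match the registered provider URL
  aud?: string | string[]; // audience: compared against the ":aud" condition key
  sub?: string;            // subject: compared against the ":sub" condition key
  exp?: number;            // expiry as a Unix timestamp in seconds
}

// Decode the payload (middle segment) of a JWT WITHOUT verifying its signature.
// For debugging only: never use the result to make an authorization decision.
function decodeJwtClaims(token: string): OidcClaims {
  const segments = token.split('.');
  if (segments.length !== 3) {
    throw new Error('Expected a JWT with three dot-separated segments');
  }
  // Convert base64url to standard base64 and restore the stripped padding.
  const base64 = segments[1].replace(/-/g, '+').replace(/_/g, '/');
  const padded = base64.padEnd(Math.ceil(base64.length / 4) * 4, '=');
  const bytes = Uint8Array.from(atob(padded), (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes)) as OidcClaims;
}
```

Printing `iss`, `aud`, and `sub` from a real token and comparing them character by character with the trust policy conditions is usually the quickest way to find a mismatch.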

Scenario 1: Azure D365 / Entra ID ➔ AWS

If you have a business workflow running in Microsoft Dynamics 365 or an Azure Function trying to authenticate to AWS, this is the gold-standard implementation.

1. Register Microsoft as an Identity Provider in AWS IAM

Navigate to AWS IAM ➔ Identity Providers ➔ Add Provider.
Select OpenID Connect.
Provider URL: https://login.microsoftonline.com/<YOUR_AZURE_TENANT_ID>/v2.0
Audience: Enter the Application (Client) ID of your Azure App Registration.

2. Configure the IAM Role & Trust Policy

When creating the IAM role that D365 will assume, the Trust Policy must enforce strict conditions. Do not just check the audience (aud); also pin the specific subject (sub) so that other apps or service principals in your tenant that can obtain a token for the same audience can't hijack the role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<AWS_ACCOUNT_ID>:oidc-provider/login.microsoftonline.com/<AZURE_TENANT_ID>/v2.0"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "login.microsoftonline.com/<AZURE_TENANT_ID>/v2.0:aud": "<AZURE_APP_CLIENT_ID>",
          "login.microsoftonline.com/<AZURE_TENANT_ID>/v2.0:sub": "<AZURE_SERVICE_PRINCIPAL_OBJECT_ID>"
        }
      }
    }
  ]
}

Scenario 2: GitHub Actions ➔ AWS

The exact same principles apply to your CI/CD pipelines. No more storing AWS keys in GitHub Secrets.

1. Register GitHub as an Identity Provider in AWS IAM

Go to AWS IAM ➔ Identity Providers ➔ Add Provider.
Provider URL: https://token.actions.githubusercontent.com
Audience: sts.amazonaws.com

2. Configure the GitHub IAM Trust Policy

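To keep things secure, your condition should restrict role assumption to a specific GitHub organization, repository, or even a specific Git branch. GitHub encodes all three in the token's sub claim, for example repo:your-github-org/your-repo-name:ref:refs/heads/main.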

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<AWS_ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-github-org/your-repo-name:*"
        }
      }
    }
  ]
}
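To get a feel for how wide a wildcard pattern is before putting it in a trust policy, a small local check can help. The sketch below approximates IAM's StringLike operator (case-sensitive, `*` for any run of characters, `?` for a single character) against GitHub's subject format `repo:<org>/<repo>:<context>`. The function name `iamStringLike` is hypothetical, and this is an illustration of the matching rule, not AWS's implementation.

```typescript
// Approximate IAM's StringLike: case-sensitive, "*" matches any run of
// characters (including none), "?" matches exactly one character.
function iamStringLike(pattern: string, value: string): boolean {
  // Escape regex metacharacters, leaving "*" and "?" for the next step.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const regex = new RegExp(
    '^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$'
  );
  return regex.test(value);
}

// Subject values in the shape GitHub uses for branch pushes and pull requests.
const mainBranch = 'repo:your-github-org/your-repo-name:ref:refs/heads/main';
const pullRequest = 'repo:your-github-org/your-repo-name:pull_request';
const otherRepo = 'repo:your-github-org/other-repo:ref:refs/heads/main';

// Repo-wide pattern: any branch, tag, or pull request in this one repository.
const repoWide = 'repo:your-github-org/your-repo-name:*';
console.log(iamStringLike(repoWide, mainBranch));  // true
console.log(iamStringLike(repoWide, pullRequest)); // true
console.log(iamStringLike(repoWide, otherRepo));   // false

// Branch-pinned pattern: only workflows running on main.
console.log(iamStringLike(mainBranch, pullRequest)); // false
```

The repo-wide pattern also admits pull request runs, so pinning the branch is the tighter choice for roles that can deploy to production.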

3. Use it in your GitHub Actions Workflow

In your .github/workflows/deploy.yml, grant the workflow the id-token: write permission so it can request a JWT from GitHub's OIDC provider.

name: Deploy to AWS
on: [push]

permissions:
  id-token: write # Mandatory for OIDC federation
  contents: read

jobs:
  AWSLogin:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<AWS_ACCOUNT_ID>:role/YourGitHubOidcRole
          aws-region: us-east-1

      - name: Test AWS Connection
        run: aws s3 ls

Pro-Tips for Production Implementations 💡

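Infrastructure-as-Code (IaC) Thumbprints: If you are automating this setup with Terraform or CloudFormation, you may still see a server certificate thumbprint on the OIDC provider resource. For providers whose TLS certificates chain to a widely trusted root CA, which covers Microsoft and GitHub, IAM validates the connection against its own library of trusted root CAs and falls back to the configured thumbprints only when the certificate is not signed by one of them. Check the current IAM documentation and your provider version before hardcoding a thumbprint, so authentication doesn't depend on a value that goes stale when Microsoft or GitHub rotates certificates.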
The "Multi-Tenant" Trap: In Azure, always make your App Registration Single-Tenant unless you have an explicit multi-organization architecture requirement. This is an excellent defense-in-depth practice.
What if I have an on-premises legacy workload? If you have legacy servers with no centralized OIDC identity provider, don't revert to IAM users right away! Look into AWS IAM Roles Anywhere. It lets on-premises infrastructure use X.509 certificates, issued by a certificate authority you register as a trust anchor, to obtain temporary AWS credentials.

Explaining the `aud` Difference Between Scenario 1 and Scenario 2

The aud (audience) claim in a JWT answers the question: "Who is this token intended for?" The difference reflects how each identity provider issues tokens.

Scenario 1: Azure D365 / Entra ID → AWS

"login.microsoftonline.com/<AZURE_TENANT_ID>/v2.0:aud": "<AZURE_APP_CLIENT_ID>"

Audience = Your Azure App Registration's Client ID

When Microsoft Entra ID (formerly Azure AD) issues a JWT, the aud is set to the application the token was requested for: your Azure App Registration.
This is because Entra ID's flow is application-centric: the caller requests a token for a specific registered application (identified by its Client ID).
The token is scoped to YOUR app, so your app's Client ID is the intended recipient.
Think of it as: "This token was issued by Azure, and it's meant to be consumed by MY specific application."

Scenario 2: GitHub Actions → AWS

"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"

Audience = sts.amazonaws.com

When a GitHub Actions workflow requests a JWT, the aud is whatever audience the request asks for. If none is given, GitHub defaults it to the URL of the repository owner.
The aws-actions/configure-aws-credentials action asks for sts.amazonaws.com by default, so the token is stamped with its intended consumer: AWS STS (AssumeRoleWithWebIdentity).
The value sts.amazonaws.com is therefore a convention for AWS integration rather than something GitHub hardcodes, and it must match the audience registered on the IAM identity provider.
Think of it as: "This token was issued by GitHub, and it's meant to be consumed by AWS STS."

Side-by-Side Summary

	Azure (Scenario 1)	GitHub (Scenario 2)
`aud` value	Your Azure App Client ID	`sts.amazonaws.com`
Why this value?	Azure scopes tokens to the registered application	The AWS credentials action requests AWS STS as the audience
Who defines it?	You define it when registering the Azure App	The workflow (or the action it uses) sets it when requesting the token
Flexibility	You control the audience per app	Configurable, but `sts.amazonaws.com` is the convention for AWS

The Core Reason

Both identity providers speak OIDC, but the aud claim ends up different because of how each one issues tokens.

Azure is a general-purpose identity platform: the audience is the application the token was issued for (your app).
GitHub Actions is purpose-built for CI/CD: the workflow names the cloud service the token will authenticate against (AWS STS), and the official AWS action sets this to sts.amazonaws.com as a convention.

When you configure the AWS IAM OIDC condition, you must mirror exactly what each provider puts into the JWT's aud claim — that's why the two scenarios look different.
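To make the GitHub side concrete, here is a sketch of how a workflow step could ask the runner for an OIDC token with a chosen audience. As far as the GitHub documentation describes it, a job with the id-token: write permission receives the `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN` environment variables, and the audience is passed as a query parameter. The helper name `buildGithubOidcTokenRequest` and the `TokenRequest` type are hypothetical, and the function only builds the request so that it can be inspected without a network call.

```typescript
interface TokenRequest {
  url: string;
  headers: Record<string, string>;
}

// Build the HTTP request a workflow step would send to obtain an OIDC token.
// The JSON response is expected to carry the JWT in its "value" field.
function buildGithubOidcTokenRequest(
  env: Record<string, string | undefined>,
  audience: string
): TokenRequest {
  const requestUrl = env.ACTIONS_ID_TOKEN_REQUEST_URL;
  const requestToken = env.ACTIONS_ID_TOKEN_REQUEST_TOKEN;
  if (!requestUrl || !requestToken) {
    // Both variables are missing when the job lacks "id-token: write".
    throw new Error('OIDC request variables not set; is id-token: write granted?');
  }
  const url = new URL(requestUrl);
  // The audience requested here becomes the "aud" claim of the issued JWT.
  url.searchParams.set('audience', audience);
  return {
    url: url.toString(),
    headers: { Authorization: `bearer ${requestToken}` },
  };
}
```

In a real workflow you would normally let aws-actions/configure-aws-credentials do this for you. The sketch only shows where the aud value comes from and why the IAM condition has to match it.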

Summary: Choose Your Strategy

Workload Scenario	Recommended Authentication Mechanism
Azure D365 / Azure Functions	OIDC Federation (Entra ID) + IAM Role
GitHub Actions / GitLab CI	OIDC Federation (GitHub/GitLab Provider) + IAM Role
AWS Cross-Account Communication	Native IAM Role Cross-Account Trust Policy
Legacy On-Premises Server (No IdP)	AWS IAM Roles Anywhere (X.509 Certificates)

Migrating to Workload Identity Federation might take a few extra minutes of initial configuration compared to copying and pasting static keys, but the massive leap in cloud security makes it non-negotiable for modern systems.

The Mystery of the 37-Second Lambda Delay (And How AWS EventBridge Fooled Us)

Jayesh Shinde — Sun, 07 Jun 2026 09:31:20 +0000

We’ve all been there. Everything works flawlessly in your SQA environment, but the moment your code hits UAT, it behaves like it’s wading through molasses.

Recently, we ran into a bizarre ghost in our AWS infrastructure: a Node.js Lambda function, triggered on a regular 10-minute interval by Amazon EventBridge, was consistently taking 37 seconds to log its very first line of code.

We initially thought it was a classic cold start or VPC network issue, but the real culprit turned out to be much sneakier. Here is how we realized EventBridge completely fooled us.

The Problem: The 37-Second Wall

In SQA, the Lambda was invoked within 2 to 3 seconds of the scheduled time. In UAT, it took 37 seconds.

We tried changing the trigger to run every single minute, but the 37-second delay still happened. We even threw Provisioned Concurrency at it to force the containers to stay warm, but it made absolutely no difference. The START RequestId log line stubbornly refused to print until the 37th second of the minute.

The environment wasn't lagging; it was completely stalled before our code even kicked off.

How We Debugged It

The breakthrough came when we stopped looking at the CloudWatch timestamps and looked inside the actual event object payload passed into the Node.js handler:

{
  "source": "aws.events",
  "time": "2026-06-14T18:00:37Z",
  "resources": ["arn:aws:events:...:rule/ten-min-cron"]
}

When we looked at that "time" field generated by EventBridge, the lightbulb finally went on. The timestamp read exactly :37 seconds past the minute.

EventBridge wasn't even sending the event to our Lambda until the 37th second. Our Lambda wasn't lagging; it was executing the exact millisecond AWS handed it the job.
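A check like this can be built into the handler itself. The sketch below separates the two delays that are easy to confuse: how far past the minute EventBridge emitted the event, and how long the event took to reach the function afterwards. The helper names and the `ScheduledEvent` type are hypothetical, and only the `time` field from the payload above is assumed.

```typescript
// Only the field used here; the real scheduled event carries more.
interface ScheduledEvent {
  time: string; // ISO 8601 timestamp set by EventBridge, second precision
}

// Seconds past the minute at which EventBridge emitted the event.
function secondsPastMinute(event: ScheduledEvent): number {
  const emittedAt = new Date(event.time);
  if (Number.isNaN(emittedAt.getTime())) {
    throw new Error(`Unparseable event time: ${event.time}`);
  }
  return emittedAt.getUTCSeconds();
}

// Milliseconds between emission and the moment the handler observed it.
// Approximate, because event.time is truncated to whole seconds.
function deliveryLagMs(event: ScheduledEvent, nowMs: number = Date.now()): number {
  return nowMs - new Date(event.time).getTime();
}

// Example with the payload shown above:
const sample: ScheduledEvent = { time: '2026-06-14T18:00:37Z' };
console.log(secondsPastMinute(sample)); // 37
```

A large `secondsPastMinute` with a small `deliveryLagMs` points at the schedule offset rather than at a cold start or a network problem.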

The Realization: EventBridge Jitter & The Redo

As it turns out, AWS documents that EventBridge scheduled rules have one-minute granularity: a rule fires at some point within its scheduled minute, not on the 0th second. Spreading invocations across the minute keeps millions of customer crons from firing at exactly 12:00:00.000 and melting downstream services worldwide (the "thundering herd" problem).

Our UAT cron rule just happened to get dealt a brutal 37-second delay slot by AWS's internal scheduling engine when the infrastructure was first built.

To prove it, we completely destroyed our UAT infrastructure and stood it up again from scratch. When the new EventBridge rule was created, AWS assigned it a completely different internal bucket. Boom—the delay instantly dropped to 8 seconds.

The Final Takeaway

If you are triggering Lambdas via an ALB or API Gateway, AWS treats it as live traffic and routes it in milliseconds. But with EventBridge crons, you are completely at the mercy of the schedule lottery.

Our code wasn't broken, and our network was fine. It was just luck of the draw with AWS's background clock engine. If your background tasks can handle running a few seconds late, save yourself the headache and just let EventBridge do its thing!

How we tracked down a mysterious latency issue in our AWS Lambda + RDS Proxy stack, and discovered Prisma was the culprit all along.

Jayesh Shinde — Sun, 17 May 2026 00:56:32 +0000

Our API Was Fine. Database Was Fine. So Why Were Queries Taking 16 Seconds?

It started with a support ticket. A customer-facing API that normally responds in 200ms was occasionally spiking to 16 seconds. Not every request — just enough to make people nervous.

We run a fairly standard serverless stack: AWS Lambda (Node.js), RDS Proxy in front of Aurora MySQL, and Prisma ORM handling all the database interactions. About 27 microservices, 13 database schemas, and roughly 14 Prisma client instances managed through a shared dependency injection container. It had been running fine for months.

So what changed? Honestly, nothing. And that's what made it so confusing.

The Ghost in the Logs

The first thing we did was check our application logs. A simple query (something like "fetch customer info by ID") was reported as taking 16 seconds from the Lambda's perspective. We grabbed the same query from the logs and ran it manually against the database.

3.2 ms

Yeah. Three milliseconds.

So the database was fast. The query was fast. But our application was experiencing 16 seconds of... something. The time was being spent somewhere between the Lambda and the database, and we had no idea where.

Following the Breadcrumbs to RDS Proxy

Since the query itself was not slow, we started looking at the connection layer. We use RDS Proxy to manage connection pooling for our Lambda fleet — standard practice to avoid overwhelming Aurora with hundreds of short-lived connections.

We pulled up CloudWatch and looked at a metric we had honestly never paid much attention to before:

DatabaseConnectionsCurrentlySessionPinned

And there it was. During business hours, we were seeing 400 to 870 pinned connections.

For those unfamiliar: RDS Proxy's whole purpose is to multiplex database connections. Multiple Lambda invocations should be sharing a small pool of backend connections. But when a connection gets "pinned," the proxy dedicates that backend connection exclusively to one client session. It can't be shared. It can't be reused. It just sits there, held hostage.

With 870 pinned connections, our proxy was essentially not proxying. Lambda invocations were queuing up, waiting for a pinned connection to free up, and that waiting time was showing up as query latency on the application side.

But why were connections getting pinned?

The RDS Proxy Logs Tell All

We dug into the RDS Proxy log group (/aws/rds/proxy/<proxy-name>) using CloudWatch Logs Insights:

fields @timestamp, @message
| filter @message like /pinned/ or @message like /Pinning/
| sort @timestamp desc
| limit 200

The dominant pinning reason, appearing hundreds of times per minute:

"A protocol-level prepared statement was detected"

Prepared statements. That's what was pinning every single connection.

Down the MySQL Protocol Rabbit Hole

Here's something most people don't realize about Prisma — and honestly, we didn't either until this investigation.

MySQL has two query protocols:

Text Protocol (COM_QUERY) — You send a SQL string, the server parses and executes it, and sends back the result. Stateless. RDS Proxy can multiplex these connections freely.

Binary Protocol (COM_STMT_PREPARE → COM_STMT_EXECUTE → COM_STMT_CLOSE) — You first ask the server to prepare a statement, which creates a server-side handle. Then you execute against that handle with bound parameters. The handle is tied to the specific connection.

RDS Proxy cannot multiplex connections that have prepared statement handles open. The moment it sees COM_STMT_PREPARE, it pins the connection.

And here's the kicker: Prisma's query engine uses the binary protocol for everything. Every findMany, every update, every create — they all go through COM_STMT_PREPARE under the hood. Your application code looks like innocent ORM calls, but on the wire, every single one is a prepared statement.

We had 14 Prisma clients, each running queries through the binary protocol. Every query pinned its connection. Multiply that across a fleet of Lambda invocations, and you get 870 pinned connections during peak hours.
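The effect of pinning on the proxy is easier to see with a toy model. The sketch below is a deliberate simplification, not RDS Proxy's actual algorithm: it assumes a pinned session holds one backend connection for its whole lifetime, while an unpinned session needs one only while a query is in flight. The names `ClientSession` and `backendConnectionsNeeded` are hypothetical.

```typescript
interface ClientSession {
  pinned: boolean;    // true once the proxy has seen COM_STMT_PREPARE on it
  activeNow: boolean; // true while a query is in flight
}

// Backend connections required at this instant under the toy model.
function backendConnectionsNeeded(sessions: ClientSession[]): number {
  // Pinned sessions cannot share, so each one holds a connection even when idle.
  const pinned = sessions.filter((s) => s.pinned).length;
  // Unpinned sessions are multiplexed and only count while they run a query.
  const activeUnpinned = sessions.filter((s) => !s.pinned && s.activeNow).length;
  return pinned + activeUnpinned;
}

// 870 open client sessions, of which only 40 are running a query right now.
const makeSessions = (pinned: boolean): ClientSession[] =>
  Array.from({ length: 870 }, (_, i) => ({ pinned, activeNow: i < 40 }));

console.log(backendConnectionsNeeded(makeSessions(false))); // 40
console.log(backendConnectionsNeeded(makeSessions(true)));  // 870
```

With the text protocol, the same traffic fits in a few dozen backend connections. With every session pinned, the proxy needs one backend connection per client, which is the situation the metric was showing.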

First Attempt: `statement_cache_size=0`

Our first mitigation idea came from Prisma's documentation. There's a connection URL parameter called statement_cache_size that controls how many prepared statements are cached. We set it to zero:

mysql://user:pass@proxy:3306/mydb?statement_cache_size=0

The theory was: if we disable caching, Prisma will close each prepared statement immediately after execution instead of holding it open.

We deployed it. We watched the metrics. And initially, it looked promising — Prepared_stmt_count on the MySQL server dropped to near zero. But the actual DatabaseConnectionsCurrentlySessionPinned metric? Still 400-870.

After more digging, we figured out why. statement_cache_size=0 disables the cache, but it doesn't change the protocol. Prisma still sends COM_STMT_PREPARE for every query. Even though the statement is closed immediately after execution, RDS Proxy pins the connection the moment it sees that COM_STMT_PREPARE packet. The pin happens before the statement is even executed, let alone closed.

So statement_cache_size=0 was a hygiene improvement (it prevents prepared statement count from growing unbounded), but it didn't solve our actual problem.

Second Attempt: Reducing Pool Size

We tried reducing the connection pool size per Prisma client. The idea was: fewer connections per Lambda means fewer connections to pin, which means the proxy has more backend connections available.

connection_limit=5

It helped a little. The peaks were lower. But we were just managing the symptom — the pinning itself never went away. Every connection was still getting pinned, just fewer of them at once.
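With 14 clients, hand-editing connection strings gets error-prone, so both parameters can be applied in one place. This is a small sketch using the standard `URL` class. The helper name `withConnectionParams` is hypothetical, and the parameter names are the two discussed above.

```typescript
// Return a copy of a datasource URL with the given query parameters set.
// Existing parameters with the same name are overwritten, others are kept.
function withConnectionParams(
  baseUrl: string,
  params: Record<string, string | number>
): string {
  const url = new URL(baseUrl);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, String(value));
  }
  return url.toString();
}

// Apply both mitigations from the first and second attempts.
const tunedUrl = withConnectionParams('mysql://user:pass@proxy:3306/mydb', {
  statement_cache_size: 0,
  connection_limit: 5,
});
console.log(tunedUrl);
```

As the results above show, this only tidies up the configuration. It does not change the wire protocol, so it does not stop the pinning.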

The Actual Fix: Prisma 7.8.0 and Driver Adapters

While digging through Prisma's changelog and GitHub issues, we discovered that Prisma 7.x had fundamentally changed how the ORM talks to databases.

In Prisma 5.x (our version), every query goes through a Rust query engine — a compiled binary that speaks the MySQL binary protocol. You literally ship a libquery_engine-rhel-openssl-3.0.x.so.node file with your Lambda. There's no way to make it use the text protocol.

In Prisma 7.x, the Rust engine is gone. Instead, you use driver adapters — thin wrappers around native JavaScript database drivers like mysql2 or mariadb. And crucially, @prisma/adapter-mariadb supports a useTextProtocol option.

// Prisma 5.x — Rust engine, binary protocol, pins RDS Proxy
const client = new PrismaClient({
  datasources: { db: { url: connectionUrl } }
});

// Prisma 7.x — Driver adapter, text protocol, RDS Proxy friendly
const adapter = new PrismaMariaDb(
  { host, port, user, password, database: 'mydb' },
  { useTextProtocol: true }
);
const client = new PrismaClient({ adapter });

With useTextProtocol: true, the adapter uses connection.query() (text protocol, COM_QUERY) instead of connection.execute() (binary protocol, COM_STMT_PREPARE). No prepared statements. No pinning.

The Migration Wasn't Trivial, But It Was Focused

Upgrading from Prisma 5.12 to 7.8.0 spans two major versions. Here's what we had to change:

Schema files (all 13 of them):

Generator changed from prisma-client-js to prisma-client
Removed binaryTargets entirely — no more Rust binary
Removed previewFeatures = ["tracing"] — tracing is GA now

Dependency injection container:

Replaced URL-based datasource configuration with adapter-based instantiation
Each of our 14 Prisma clients got its own adapter instance with useTextProtocol: true
Connection config changed from URL string to config object (host, port, user, password, SSL)

Build pipeline:

Deleted the entire block that copied the Rust query engine binary into Lambda bundles
Lambda bundles immediately got ~90% smaller

What stayed the same:

All our $extends wrappers (retry logic) worked without changes
esbuild bundling continued to output CJS — no ESM migration needed

Total effort was about 2-3 days including testing across all services.

The Results

After deploying Prisma 7.8.0 with driver adapters:

Metric	Before	After
`DatabaseConnectionsCurrentlySessionPinned`	400–870 (business hours)	Near zero
RDS Proxy pin reason	"protocol-level prepared statement"	Gone
Lambda cold start	~800ms – 1.5s	~250ms – 400ms
Lambda bundle size	~14 MB (Rust binary)	~1.6 MB (pure JS/TS)
Query latency (P99)	16s spikes	Consistent < 500ms

The mysterious 16-second latencies vanished. Not reduced — vanished.

Bonus: ORM Comparison

During our investigation we also evaluated whether switching ORMs entirely might be a cleaner path. Here's what we found:

Metric	Prisma 5.12 (Rust Engine)	Prisma 7.8.0 (Adapter)	Drizzle ORM
Engine Footprint	~14 MB (Heavy Rust Binary)	~1.6 MB (WASM / TS)	Minimal (Pure JS/TS)
Typical Cold Start	Very Poor (800ms – 1.5s+)	Good (250ms – 400ms)	Excellent (100ms – 200ms)
RDS Proxy Friendly?	No (Pins everything)	Yes (Via `useTextProtocol`)	Yes (Via `prepare: false`)
Type Safety Style	Generated Schema Client	Generated Schema Client	Code-first / Infer

Drizzle is objectively lighter and has better cold starts, but migrating 27 services away from Prisma would be a months-long project. Prisma 7.8.0 with driver adapters got us to "RDS Proxy friendly" without changing our query layer, model definitions, or testing patterns. For us, that was the right trade-off.

What We Learned

1. Your ORM's wire protocol matters more than you think.
We spent weeks optimizing queries that didn't need optimizing. The queries were fast. The protocol was the problem.

2. statement_cache_size=0 is not the same as "no prepared statements."
This one burned us. Disabling the cache still uses COM_STMT_PREPARE — it just closes the statement immediately. RDS Proxy doesn't care. It pins on the PREPARE, not on the cache.

3. RDS Proxy logs are your friend.
The /aws/rds/proxy/<proxy-name> CloudWatch log group tells you exactly why connections are being pinned. We should have looked there on day one instead of chasing query performance.

4. CloudWatch metrics can mislead without context.
We initially saw DatabaseConnectionsCurrentlySessionPinned drop after deploying statement_cache_size=0 and thought we'd fixed it. The drop was real but temporary — a deployment artifact from Lambda instances recycling. The metric climbed right back up once traffic resumed.

5. Check your Rust binary targets.
While investigating, we also discovered our Lambda was bundling an OpenSSL 1.0.x query engine binary on an Amazon Linux 2023 runtime that uses OpenSSL 3.0.x. A latent mismatch that happened to work by accident but could have caused cryptic failures at any time.

Still Some Loose Ends

Even after the Prisma 7 migration, we found two residual pinning patterns in our RDS Proxy logs — both from raw SQL queries that use MySQL user variables (SET @val = ...). These pin for a different reason: "SQL changed session settings that the proxy doesn't track."

One we converted to a Prisma model query (it was using SELECT INTO @var when a simple findFirst + update would do). The other involves a stored procedure with an OUT parameter, which genuinely requires user variables — that's a stored proc refactor we've deferred.

The prepared statement pinning that was causing 400-870 pinned connections? That's completely gone.

TL;DR

If you're running Prisma with AWS RDS Proxy and experiencing mysterious latency:

Check DatabaseConnectionsCurrentlySessionPinned in CloudWatch
Check your RDS Proxy logs for "prepared statement was detected"
If that's your issue, statement_cache_size=0 won't fix it — the protocol is the problem
Upgrade to Prisma 7.x with @prisma/adapter-mariadb (or @prisma/adapter-mysql)
Set useTextProtocol: true on the adapter
Watch your pinned connections drop to zero

This post documents a real production investigation. The codebase manages ~27 microservices across 13 MySQL schemas on AWS Lambda with RDS Proxy. If you've hit something similar, I hope this saves you the weeks of debugging it took us.

How We Fixed Intermittent ECS Image-Not-Found Errors in AWS CDK

Jayesh Shinde — Fri, 13 Mar 2026 09:00:11 +0000

At one point, our ECS deployments started failing in a way that felt random.

Sometimes a deployment would work perfectly. Sometimes the service would try to roll forward and fail because the container image it expected was no longer available. Nothing was wrong with the application code. The problem was in the deployment asset flow.

We were using AWS CDK to deploy container-based workloads, and like many teams, we were relying on CDK’s default bootstrap ECR repository for Docker image assets. That was convenient at first, but it became a problem once repository retention rules were tightened for cost control.

In environments with frequent deployments, older intermediate images were being cleaned up faster than our deployment flow could safely tolerate. The result was intermittent ECS deploy failures caused by missing images.

The Root Cause

AWS CDK Docker assets are published during the asset publishing phase, which happens before CloudFormation starts deploying stacks.

That means two things:

CDK is not just defining infrastructure, it is also managing where deployable image assets are stored.
If the default asset repository has aggressive cleanup policies, your deployments can become fragile.

This is especially painful in non-production environments where deployment frequency is high and image churn is constant.

The Strategy We Took

We wanted a solution that was simple, low-risk, and did not require redesigning the whole build pipeline.

So instead of pushing ECS image assets to the shared default CDK ECR repository, we moved to a dedicated ECR repository per environment/application area.

At a high level, the fix looked like this:

create a dedicated ECR repository ahead of time
configure the CDK synthesizer to publish image assets there
keep lifecycle control on that repository
deploy the ECR stack first, then the app stacks

This gave us isolation from the shared bootstrap repository while keeping the rest of the CDK deployment model mostly unchanged.

Sample: Dedicated ECR Stack

Here is a simplified example of creating a dedicated ECR repository with a lifecycle policy:

import { Stack, RemovalPolicy } from 'aws-cdk-lib';
import { Repository } from 'aws-cdk-lib/aws-ecr';
import type { Construct } from 'constructs';

type AppEnvProps = {
  stage: string;
};

export class ContainerAssetRepoStack extends Stack {
  constructor(scope: Construct, id: string, props: AppEnvProps) {
    super(scope, id);

    new Repository(this, 'AppContainerRepo', {
      repositoryName: `myapp-assets-${props.stage.toLowerCase()}`,
      removalPolicy: RemovalPolicy.RETAIN,
      lifecycleRules: [
        {
          rulePriority: 1,
          description: 'Keep only the latest 100 images',
          maxImageCount: 100,
        },
      ],
    });
  }
}

A few important details here:

RETAIN protects the repository if the stack is deleted later.
lifecycle rules still clean up old images over time.
the repository name is normalized to lowercase, which is important for ECR.

Sample: Point CDK to the Dedicated Repo

Once the repository exists, the application stack can tell CDK to publish image assets there using DefaultStackSynthesizer.

import { App, DefaultStackSynthesizer, Stack } from 'aws-cdk-lib';
import type { Construct } from 'constructs';

type ServiceStackProps = {
  stage: string;
};

class ServiceStack extends Stack {
  constructor(scope: Construct, id: string, props: ServiceStackProps) {
    super(scope, id, {
      synthesizer: new DefaultStackSynthesizer({
        imageAssetsRepositoryName: `myapp-assets-${props.stage.toLowerCase()}`,
      }),
    });

    // ECS service, task definition, container asset usage, etc.
  }
}

const app = new App();

new ServiceStack(app, 'ServiceStackDev', {
  stage: 'dev',
});

This keeps the existing CDK asset publishing model, but moves the destination away from the shared default bootstrap repository.

One Important Gotcha

A stack dependency is not enough if the same deployment run tries to create the ECR repository and publish assets into it.

Why?

Because asset publishing happens before CloudFormation stack deployment.

So if the repository does not already exist, the asset publish step can fail before your “repo stack” is even deployed.

The safest pattern is:

deploy the ECR repository stack first
run the normal application deployment after that

That sequencing matters.

Another Important Gotcha: IAM Permissions

Changing the repository target is not enough by itself.

The identity or role that CDK uses to publish Docker assets must also have permission to push to the new ECR repository.

That usually means allowing actions such as:

ecr:PutImage
ecr:InitiateLayerUpload
ecr:UploadLayerPart
ecr:CompleteLayerUpload
ecr:BatchCheckLayerAvailability
ecr:BatchGetImage
ecr:GetDownloadUrlForLayer
ecr:GetAuthorizationToken

If you forget this part, the deployment simply moves from “image missing” problems to “access denied” problems.
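One way to keep that permission set reviewable is to generate the policy document from the repository ARN. The sketch below builds a plain IAM policy object from the actions listed above. The function name `buildEcrPushPolicy` and the example ARN are hypothetical. The one detail worth noting is that `ecr:GetAuthorizationToken` is not scoped to a repository and needs `"Resource": "*"`, so it gets its own statement.

```typescript
// Build an IAM policy document that lets a publishing role push and pull
// images in one specific ECR repository.
function buildEcrPushPolicy(repositoryArn: string) {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        // Registry login is account-wide and cannot be limited to one repo.
        Sid: 'EcrLogin',
        Effect: 'Allow',
        Action: ['ecr:GetAuthorizationToken'],
        Resource: '*',
      },
      {
        // Everything else is restricted to the dedicated asset repository.
        Sid: 'PushPullAssets',
        Effect: 'Allow',
        Action: [
          'ecr:PutImage',
          'ecr:InitiateLayerUpload',
          'ecr:UploadLayerPart',
          'ecr:CompleteLayerUpload',
          'ecr:BatchCheckLayerAvailability',
          'ecr:BatchGetImage',
          'ecr:GetDownloadUrlForLayer',
        ],
        Resource: repositoryArn,
      },
    ],
  };
}

// Example with a placeholder account ID and the repository name used earlier.
const policy = buildEcrPushPolicy(
  'arn:aws:ecr:us-east-1:123456789012:repository/myapp-assets-dev'
);
console.log(JSON.stringify(policy, null, 2));
```

The resulting JSON can be attached to whichever role publishes your Docker assets. Which role that is depends on how your environment was bootstrapped.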

Why This Worked Well for Us

We liked this approach because it was a practical middle ground.

It did not require:

rebuilding our CI/CD image strategy from scratch
changing every ECS service definition
introducing a more complex app-owned image publishing flow immediately

But it did give us:

predictable image retention
environment-specific isolation
fewer surprises during ECS deployments
better control over cost and cleanup behavior

When to Use This Pattern

This approach makes sense if:

you already use CDK-managed Docker/container assets
the default bootstrap ECR repository is shared across too many deployments
retention rules on that shared repository are causing instability
you want a fast, low-disruption improvement

If you want a more explicit long-term model, the next step is usually:

build image in CI
push image to a named ECR repository yourself
reference the image directly in ECS by repo and tag

That gives maximum control, but it also requires more changes.

Final Thought

CDK defaults are great for getting started, but they are not always ideal once platform constraints like retention, cost control, and deployment frequency start to matter.

In our case, moving Docker assets to dedicated ECR repositories was a small change with a big operational impact. It made deployments more predictable without forcing a major rework of the pipeline.

The Silent Connection Killer: MySQL2 and AWS Lambda's Freeze/Thaw Problem

Jayesh Shinde — Thu, 05 Feb 2026 09:03:13 +0000

The Mystery Error

You're running a Node.js Lambda with MySQL2, everything works great in testing, but production logs show intermittent failures:

Error: Connection lost: The server closed the connection.
code: "PROTOCOL_CONNECTION_LOST"
fatal: true

No pattern. No warning. Just random failures that make you question your life choices.

Understanding the Real Problem: How Lambda Actually Works

Lambda doesn't spin up a fresh container for every request. AWS keeps containers "warm" for reuse:

Request 1 → Lambda runs → Response
                ↓
           [FREEZE] ← Container paused (not terminated)
                ↓     
         (5-15 min pass)
                ↓
           [THAW] ← Container resumed
                ↓
Request 2 → Lambda runs → 💥 PROTOCOL_CONNECTION_LOST

During freeze, your Lambda is literally paused. The JavaScript event loop stops. Timers stop. Everything stops.
But here's the catch: the outside world doesn't stop.

What Happens to Your Database Connection

Lambda creates a MySQL connection pool
Connections sit idle in the pool
Lambda freezes (container paused)
Real world time passes (5-30 minutes)
Network timeouts occur, NAT gateways clear state, RDS Proxy cleans up
The TCP socket dies, but your pool doesn't know
Lambda thaws, tries to use the dead connection
💥 PROTOCOL_CONNECTION_LOST

Why idleTimeout Doesn't Help

You might think: "I'll set idleTimeout: 60000 to clean up idle connections!"
Here's why it doesn't work:

Timer starts (60s countdown)
    ↓
Lambda FREEZES at 1s elapsed
    ↓
████████████████████████████████
█  15 minutes pass in REAL WORLD  █
█  Timer is PAUSED at 1s          █
████████████████████████████████
    ↓
Lambda THAWS - timer resumes at 1s
    ↓
Connection still in pool (timer thinks 2s passed)
    ↓
Connection is DEAD but pool doesn't know

The timer doesn't run during freeze. Your 60-second timeout is useless against a 15-minute freeze.

The Solution: Detect and Retry
Since we can't prevent stale connections, we detect them and retry transparently.

Step 1: Enable TCP Keep-Alive

const pool = mysql.createPool({
  ...config,
  enableKeepAlive: true,
  keepAliveInitialDelay: 10000,
});

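This helps dead sockets surface as clear error codes (ECONNRESET, PROTOCOL_CONNECTION_LOST) instead of queries hanging indefinitely. It does not prevent stale connections, since nothing runs while the environment is frozen; it only speeds up detection after a thaw.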

Step 2: Implement Retry Logic

async executeQuery(sql, params) {
  const maxRetries = 1;
  let lastError = null;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    let connection = null;

    try {
      connection = await this.getConnectionFromPool();
      const result = await this.executeQueryWithConnection(connection, sql, params);
      connection.release();
      return result;
    } catch (error) {
      // Non-recoverable error - throw immediately
      if (!this.isConnectionLostError(error)) {
        if (connection) connection.release();
        throw error;
      }

      // Connection lost - destroy stale connection
      if (connection) connection.destroy();

      // Retry if attempts left
      lastError = error;
      if (attempt < maxRetries) {
        console.warn("Connection lost, retrying...", { attempt: attempt + 1 });
        continue;
      }
    }
  }

  throw lastError;
}

isConnectionLostError(error) {
  const recoverableCodes = [
    "PROTOCOL_CONNECTION_LOST",  // Server closed connection
    "ECONNRESET",                // TCP reset
    "EPIPE",                     // Broken pipe
    "ETIMEDOUT",                 // Connection timeout
    "ECONNREFUSED",              // Connection refused
  ];
  return recoverableCodes.includes(error?.code);
}

Key Points:

connection.destroy() - Removes stale connection from pool (don't reuse it!)
connection.release() - Returns healthy connection to pool
One retry is usually enough - The second attempt gets a fresh connection

What About Pool Settings?

Do I Need to Tune connectionLimit, maxIdle, etc.?
Short answer: Not really.

Setting	Helps with freeze/thaw?	Why?
`idleTimeout`	No	Timer paused during freeze
`maxIdle`	Marginally	Fewer connections = fewer stale ones, but adds reconnection overhead
`connectionLimit`	No	Doesn't affect stale connections
Retry logic	Yes	Handles stale connections at runtime

If you're using RDS Proxy, it handles connection pooling at the infrastructure level. Keep your Lambda pool settings simple and let the retry logic do the heavy lifting.

Using RDS Proxy?

RDS Proxy's "Idle client connection timeout" (default: 30 minutes) is separate from MySQL's wait_timeout. The proxy manages Lambda→Proxy connections independently.
But even with RDS Proxy, connections can die due to:

NAT gateway timeouts (typically 5-15 minutes for idle TCP)
Network state table cleanup
Proxy internal connection recycling

The retry logic is still your safety net.

The Final Architecture

┌─────────────────────────────────────────────────────┐
│                    Your Lambda                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│   executeQuery()                                     │
│       ↓                                             │
│   for (attempt = 0; attempt <= 1; attempt++)        │
│       ↓                                             │
│   getConnection() → Try query                       │
│       ↓                                             │
│   Success? → return result                          │
│       ↓                                             │
│   Connection lost? → destroy() → retry              │
│       ↓                                             │
│   Other error? → release() → throw                  │
│                                                      │
└─────────────────────────────────────────────────────┘
           ↓
    [RDS Proxy] (optional)
           ↓
    [MySQL/Aurora]

Summary

Problem	Solution
Lambda freezes, connections go stale	Retry logic detects and recovers
Pool doesn't know connections are dead	`enableKeepAlive` for faster detection
`idleTimeout` doesn't work during freeze	Accept it, rely on retry instead
Random `PROTOCOL_CONNECTION_LOST`	Transparent retry = users don't notice

The key insight: You can't prevent stale connections in a serverless environment. But you can detect them instantly and retry transparently.

BUT....

The Problem With Just Retrying Once

The previous article suggested detecting a stale connection error and retrying once. That works — but only if a single connection went stale. After a longer Lambda freeze (10–15+ minutes), the entire pool goes stale. Here's the scenario:

Pool has 5 connections → Lambda freezes → all 5 TCP sockets die

Request comes in after thaw:
→ gets conn #1 from _freeConnections → FAILS (stale)
→ retry: gets conn #2 from _freeConnections → FAILS (also stale!)

One retry is not enough because _freeConnections is a queue — the retry just picks the next dead connection in line.

How mysql2's Pool Actually Works Internally

Looking at mysql2's pool.js source, getConnection() does this:

// Simplified from mysql2 internals
getConnection(cb) {
  if (this._freeConnections.length > 0) {
    const connection = this._freeConnections.shift(); // FIFO — no health check
    return cb(null, connection);
  }
  // if allConnections.length < connectionLimit → create a NEW connection
  // otherwise → queue the request in _connectionQueue
}

There is zero validation when pulling from _freeConnections. The pool hands you whatever is sitting there, stale or not.

The inverse is also useful to know — when _freeConnections is empty AND _allConnections.length < connectionLimit, mysql2 will automatically create a brand new TCP connection. This is the behavior we want to exploit.

The Fix: Drain All Free Connections on a Stale Error

Instead of retrying once and hoping the next connection is healthy, destroy every connection in _freeConnections the moment you detect a stale error. The pool's own logic then forces a fresh connection on retry.

const STALE_ERRORS = new Set([
  'PROTOCOL_CONNECTION_LOST',
  'ECONNRESET',
  'EPIPE',
  'ETIMEDOUT',
  'ECONNREFUSED',
]);

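function drainFreeConnections(pool) {
  // These are private internals, so treat the shapes as assumptions: a promise
  // pool exposes the core pool as `pool.pool`, and the free list may be a plain
  // array or a denque-style queue depending on the mysql2 version.
  const corePool = pool.pool ?? pool;
  const free = corePool._freeConnections;
  if (!free) return;

  // Snapshot first: destroy() can remove the connection from the pool's own
  // lists, so avoid iterating the live collection while destroying.
  const idle = typeof free.toArray === 'function' ? free.toArray() : [...free];
  for (const conn of idle) {
    conn.destroy(); // closes the socket and detaches the connection from the pool
  }

  // Clear whatever is left in place (a no-op if destroy() already removed them)
  if (typeof free.clear === 'function') free.clear();
  else free.length = 0;
  // Next getConnection() sees no free connections and room under
  // connectionLimit, so it creates a fresh TCP connection automatically
}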

async function executeQuery(pool, sql, params) {
  let connection;
  try {
    connection = await pool.getConnection();
    const [rows] = await connection.query(sql, params);
    connection.release();
    return rows;
  } catch (err) {
    if (connection) connection.destroy(); // kill the one that triggered the error

    if (STALE_ERRORS.has(err.code)) {
      // Nuke all remaining free connections — they're all suspect
      drainFreeConnections(pool);

      // Retry — pool is now forced to open a fresh connection
      const freshConn = await pool.getConnection();
      try {
        const [rows] = await freshConn.query(sql, params);
        freshConn.release();
        return rows;
      } catch (retryErr) {
        freshConn.destroy();
        throw retryErr;
      }
    }

    throw err;
  }
}

Why `pool._freeConnections.length = 0` instead of a `while` loop with `shift()`?

pool._freeConnections is an array used as a FIFO queue — mysql2 uses shift() when getting connections and push() when releasing them. Since we're destroying all of them, iterating with for...of then zeroing the length is simpler and safer than mutating the array mid-iteration with shift().

What `conn.destroy()` does under the hood

When you call destroy(), mysql2 does:

// Simplified from connection._removeFromPool()
this._allConnections.splice(this._allConnections.indexOf(connection), 1);
connection._pool = null;
// TCP socket is closed

So after drainFreeConnections:

_freeConnections → empty ✅
_allConnections.length → drops back toward 0 ✅
Next getConnection() → pool sees room under connectionLimit → creates fresh TCP connection ✅

Comparison: Single Retry vs. Drain All

Approach	After 1 stale conn	After full pool stale
Single retry (article)	✅ Works	❌ Retry hits another stale conn
Drain all + retry (this approach)	✅ Works	✅ Pool forced to create fresh conn
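
To make the table concrete, here is a toy model of the two strategies. It is not mysql2: FakePool, FakeConn and runQuery are hypothetical stand-ins, and the free list is modeled as a plain array that hands out idle connections without a health check, as described above.

```javascript
// Toy model only: FakePool and FakeConn are hypothetical, not mysql2 classes.
class FakeConn {
  constructor(stale) { this.stale = stale; }
  query() {
    if (this.stale) {
      const err = new Error("Connection lost");
      err.code = "PROTOCOL_CONNECTION_LOST";
      throw err;
    }
    return "ok";
  }
}

class FakePool {
  constructor(size) {
    // Every pooled connection went stale during the freeze
    this.free = Array.from({ length: size }, () => new FakeConn(true));
  }
  getConnection() {
    // Hand out the next idle connection unchecked,
    // or open a fresh one when none are idle
    return this.free.length > 0 ? this.free.shift() : new FakeConn(false);
  }
  drain() { this.free.length = 0; }
}

function runQuery(pool, { drainOnStale }) {
  try {
    return pool.getConnection().query();
  } catch (err) {
    if (err.code !== "PROTOCOL_CONNECTION_LOST") throw err;
    if (drainOnStale) pool.drain();
    try {
      return pool.getConnection().query(); // the single retry
    } catch (retryErr) {
      return `failed: ${retryErr.code}`;
    }
  }
}

console.log(runQuery(new FakePool(5), { drainOnStale: false })); // failed: PROTOCOL_CONNECTION_LOST
console.log(runQuery(new FakePool(5), { drainOnStale: true }));  // ok
```

With a pool of one, the plain retry also succeeds, which is why the single-retry version looks fine in light testing.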

A Note on `_freeConnections` Being a Private API

_freeConnections is not exported in mysql2's TypeScript typings (it's missing from Pool.d.ts). In TypeScript, you'll need a cast:

(pool as any)._freeConnections

It has been present for a long time, but it is not part of the public API: its concrete type is an implementation detail, and if you create the pool through mysql2/promise, the core pool that owns it is reached through pool.pool. Keep an eye on it across version upgrades.

Why Prisma Doesn't Have This Problem

If you're using Prisma, you may have noticed it doesn't suffer from the freeze/thaw stale connection issue as badly. There's a concrete reason for this — it's not magic, it's the Rust query engine.

Prisma uses a connection pool built on the mobc library inside its Rust engine. Before handing a connection to your query, it performs a time-gated pre-ping:

Connection pulled from pool
         ↓
Has more than 15 seconds passed since this connection was last used?
    YES → run SELECT 1
              ↓
         SELECT 1 succeeds? → proceed with query
         SELECT 1 fails?    → discard, open fresh connection
    NO  → skip ping, proceed directly (optimization for rapid queries)

This is essentially the same pattern as SQLAlchemy's pool_pre_ping=True, with a 15-second grace window to avoid pinging on every rapid-fire query.

After a Lambda freeze of any meaningful duration (seconds to minutes), the timer has expired, so Prisma will ping before your query even runs — and silently replace any dead connection. Your application code never sees the error.
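
If you want the same idea in a hand-rolled wrapper, a minimal sketch could look like the one below. This is not Prisma's implementation: checkout, lastUsedAt, openFresh and PING_AFTER_MS are hypothetical names, and the connection only needs query() and destroy(). The point is the wall-clock comparison, which still accounts for time spent frozen.

```javascript
// Sketch of a time-gated pre-ping. All names here are hypothetical.
const PING_AFTER_MS = 15_000;

async function checkout(conn, openFresh, now = Date.now()) {
  // Wall-clock check: unlike a timer, this reflects time that passed during a freeze
  if (now - conn.lastUsedAt <= PING_AFTER_MS) {
    return conn; // used recently, skip the ping
  }
  try {
    await conn.query("SELECT 1"); // cheap liveness probe
    return conn;
  } catch {
    conn.destroy();     // discard the dead connection
    return openFresh(); // caller decides how a new connection is opened
  }
}
```

The caller is expected to update lastUsedAt after each successful query, so rapid back-to-back queries skip the ping.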

How it stacks up

	mysql2 (raw pool)	Prisma (Rust engine)
Pre-ping on checkout	❌ None	✅ If >15s idle
Handles full pool going stale	❌ Needs manual drain logic	✅ Each connection validated individually
Error surfaces to app code	✅ Yes — you must handle it	❌ Transparent — retried internally
Overhead	None (no extra queries)	One `SELECT 1` per connection after idle period

Even Prisma Isn't Bulletproof

Worth noting: Prisma's pre-ping protects against stale connections, but the 15-second threshold means a freeze shorter than 15 seconds could still theoretically slip through. And connection-level issues outside the pool (e.g. NAT gateway state tables, RDS Proxy recycling) can still cause failures that the pre-ping doesn't catch. Retry logic at the application layer remains a good safety net regardless of ORM.

Summary

The key insight is to work with mysql2's internal pool logic rather than against it:

On a stale connection error, don't just retry — drain _freeConnections first
mysql2 will automatically open fresh connections to fill the gap (it's built into getConnection())
Your retry then gets a genuinely new TCP connection instead of another dead one from the queue

If you want this behavior without managing it yourself, Prisma's Rust engine gives you a time-gated pre-ping out of the box — which is the more principled long-term solution for serverless MySQL workloads.

Building a Clean Event Pipeline in Spring: From Simple Events to Async Listeners to the Outbox Pattern

Jayesh Shinde — Sat, 03 Jan 2026 07:43:05 +0000

Event‑driven architecture sounds simple on paper: “emit an event when something happens.”

But once you start implementing it inside a real Spring Boot service, you quickly discover the hidden trade‑offs.

In this post, we’ll walk through a real‑world progression:

emitting domain events inside a service
handling them with @EventListener
realizing enrichment logic slows down the request
making listeners async
adding a production‑grade executor
and finally touching the gold standard: the Outbox Pattern

Let’s dive in.

1. The initial requirement: emit an event inside the service

Imagine a simple use case: when a user is created, we want to emit an event so other parts of the system can react.

A clean way to do this in Spring is to wrap ApplicationEventPublisher:

@Component
@AllArgsConstructor
public class UserEventPublisher {
    private final ApplicationEventPublisher applicationEventPublisher;

    public <T> void publish(T event) {
        applicationEventPublisher.publishEvent(event);
    }
}

Now inside your service:

userEventPublisher.publish(new UserCreatedEvent(userId, email));

2. Handling the event with `@EventListener`

A simple listener:

@Component
public class UserAuditListener {

    @EventListener
    public void handleUserCreateEvent(UserCreatedEvent event) {
        System.out.println("User created: " + event);
    }
}

This works beautifully… until you need to do more than just print.

3. The problem: enrichment logic slows down the request

Let’s say before publishing to Kafka, you want to:

fetch additional data from DB
call another service
enrich the event payload

Since @EventListener is synchronous by default, all this work blocks the original request thread.

Your API response time suddenly spikes.

Not good.

4. Making listeners async with `@Async` and `@EnableAsync`

Spring makes this easy:

@EnableAsync
@SpringBootApplication
public class App { }

And in the listener:

@Async
@EventListener
public void handleUserCreateEvent(UserCreatedEvent event) {
    // runs in a background thread
}

Now the main request returns immediately while the listener does its work asynchronously.

But there’s a catch…

5. The default executor is not production‑grade

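If you don’t configure anything, plain Spring falls back to SimpleAsyncTaskExecutor (Spring Boot auto-configures a ThreadPoolTaskExecutor instead, but with an unbounded queue by default). SimpleAsyncTaskExecutor: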

creates a new thread per task
no pooling
no backpressure
no monitoring

This is fine for demos, not for real systems.

6. Adding a custom executor

A better approach is to define your own thread pool:

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "taskExecutor")
    public Executor taskExecutor() {
        return Executors.newCachedThreadPool();
    }
}

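Now all @Async methods use this executor.

One caveat: Executors.newCachedThreadPool() reuses idle threads but is still unbounded, so it gives you pooling without backpressure. For real limits, replace it with a ThreadPoolTaskExecutor that has a fixed maximum pool size and a bounded queue.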

7. The gold standard: the Outbox Pattern

Async listeners solve the latency problem, but they don’t solve the reliability problem.

What if:

the DB transaction commits
but the async listener fails before sending to Kafka?
or the service crashes?

You lose the event.

This is why mature systems use the Outbox Pattern.

How the Outbox Pattern works (high‑level)

Write the event into an “outbox” table inside the same DB transaction
- If the user is created, the outbox record is also created
- Atomic, consistent, no partial failures
A background process reads the outbox table. This can be:
- a scheduled Spring job
- a Kafka Connect Debezium connector
- a lightweight polling thread
The background process publishes the event to Kafka
After successful publish, the outbox record is marked as processed

Why this is the gold standard

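no lost events: anything committed to the outbox is eventually published
at-least-once delivery: a crash between publish and mark-as-processed can resend an event, so consumers should be idempotent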
no dependency on async listeners
fully decoupled from request latency
battle‑tested at Uber, Netflix, Stripe, Shopify

8. Summary

Here’s the journey we walked through:

Start with simple Spring events
Add @EventListener to react to them
Realize enrichment logic slows down the request
Add @Async + @EnableAsync to make listeners non‑blocking
Add a custom executor for production‑grade async processing
Finally, adopt the Outbox Pattern for guaranteed delivery and reliability

This progression mirrors how real systems evolve as they scale.

If you’re building event‑driven microservices, the outbox pattern is the foundation you eventually want to reach.

How a Cache Invalidation Bug Nearly Took Down Our System - And What We Changed After

Jayesh Shinde — Fri, 05 Dec 2025 01:57:56 +0000

A few weeks ago, we had one of those production incidents that quietly start in the background and explode right when the traffic peaks.
This one involved Aurora MySQL, a Lambda with a 30-second timeout, and a poorly-designed cache invalidation strategy that ended up flooding our database.

Here’s the story, what went wrong, and the changes we made so it never happens again.

🎬 The Setup

The night before the incident, we upgraded our Aurora MySQL engine version.
Everything looked good. No alarms. No red flags.

The next morning around 8 AM, our daily job kicked in — the one responsible for:

deleting the stale “master data” cache
refetching fresh master data from the DB
storing it back in cache

The application needs this master dataset to work correctly, so if the cache isn’t warm, the DB gets hammered.

💥 The Explosion

Right after the engine upgrade, a specific query in the Lambda suddenly started taking 30+ seconds.
But our Lambda had a 30-second timeout.

So what happened?

The cacheInvalidate → cacheRebuild flow failed.
The cache remained empty.
Every user request resulted in a cache miss.
All those requests hit the DB directly.
Aurora CPU spiked to 99%.
Application responses stalled across the board.

Classic cache stampede.

We eventually triggered a failover, and luckily the same query ran in ~28.7 seconds on the new writer, just under the Lambda timeout. That bought us a few minutes to stabilize.

Later that night, we found the real culprit:
➡️ The upgrade changed the query's execution plan, and the new plan needed an index that didn't exist.

We created the index via a hotfix, and the DB stabilized.

But the deeper problem was our cache invalidation approach.

🧹 Our Original Cache Invalidation: Delete First, Hope Later

Our initial flow was:

Delete the existing cache key
Fetch fresh data from DB
Save it back to cache

If step 2 fails, everything collapses.

It’s simple… until it isn’t.
In our case, the Lambda failed to fetch fresh data, so the cache stayed empty.

🔧 What We Changed (and Recommend)

1. Never delete the cache before you have fresh data

We inverted the flow:

Fetch → Validate → Update cache
Only delete if we already have fresh data ready

This eliminates the “empty cache” window.
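
A minimal sketch of the inverted flow, assuming the cache can be modeled as a Map and that you supply your own loader and validator. refreshCache, fetchFresh and isValid are hypothetical names, not code from our job.

```javascript
// Sketch of "fetch, validate, then overwrite". All names are hypothetical.
async function refreshCache(cache, key, fetchFresh, isValid) {
  let fresh;
  try {
    fresh = await fetchFresh();
  } catch (err) {
    // Fetch failed (timeout, DB down): the old value stays in place
    return { updated: false, reason: err.message };
  }
  if (!isValid(fresh)) {
    return { updated: false, reason: "validation failed" };
  }
  cache.set(key, fresh); // overwrite in one step, never delete first
  return { updated: true };
}
```

Because the old value is only replaced once the new one is in hand, a failed refresh leaves readers with yesterday's data instead of an empty cache.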

2. Use “stale rollover” instead of blunt deletion

If the refresh job fails, we now:

rename the key
- "Master-Data" → "Master-Data-Stale"
keep the old value available
add an internal notification so the team can investigate

This ensures that even if the DB is slow or down, the system still has something to serve.

It’s not ideal, but it prevents a meltdown.

3. API layer now returns stale data when fresh data is unavailable

The API logic became:

Try to read "Master-Data"
If not found:
- Attempt to rebuild (only if allowed)
- If rebuild fails → return stale data

This avoids cascading failures.
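
The read path above can be sketched roughly like this. The key names come from the steps above; readMasterData, rebuild and allowRebuild are hypothetical, and the cache is again modeled as a Map.

```javascript
// Sketch of the API read path with a stale fallback. Names are hypothetical.
async function readMasterData(cache, rebuild, allowRebuild = true) {
  const fresh = cache.get("Master-Data");
  if (fresh !== undefined) return { data: fresh, stale: false };

  if (allowRebuild) {
    try {
      const rebuilt = await rebuild();
      cache.set("Master-Data", rebuilt);
      return { data: rebuilt, stale: false };
    } catch {
      // Rebuild failed: fall through to the stale copy
    }
  }

  const stale = cache.get("Master-Data-Stale");
  if (stale === undefined) {
    throw new Error("No fresh or stale master data available");
  }
  return { data: stale, stale: true };
}
```

Returning a stale flag lets callers log or surface the degraded state without failing the request.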

4. Add a Redis distributed lock to prevent cache stampede

Without this, even if stale data existed, multiple API nodes or Lambdas could all try to rebuild simultaneously — hammering the DB again.

With a Redis lock:

Only one request gets the lock and rebuilds
Others:
- Do not hit DB
- Simply return stale data
- Wait for the winner to repopulate the cache

This one change alone removes most of the stampede risk.

Node.js — Acquire Distributed Lock (Redis)

Below is a simple Redis-based lock using SET NX PX (no separate locking library).
The example uses node-redis (the redis package); ioredis works as well, with its own argument style for SET.

// redis.js
const { createClient } = require("redis");

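const redis = createClient({
  url: process.env.REDIS_URL
});
// Without an error listener, a connection error would crash the process
redis.on("error", (err) => console.error("Redis client error", err));
redis.connect().catch((err) => console.error("Redis connect failed", err));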

module.exports = redis;

Acquiring and Releasing the Lock

// lock.js
const redis = require("./redis");
const { randomUUID } = require("crypto");

const LOCK_KEY = "lock:master-data-refresh";
const LOCK_TTL = 10000; // 10 seconds; keep this above your worst-case refresh time, or the lock can expire mid-rebuild

async function acquireLock() {
  const lockId = randomUUID();

  const result = await redis.set(LOCK_KEY, lockId, {
    NX: true,
    PX: LOCK_TTL
  });

  if (result === "OK") {
    return lockId; // lock acquired
  }

  return null; // lock not acquired
}

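// Compare-and-delete must be atomic. With a separate GET and DEL, the lock
// could expire and be re-acquired in between, and we would delete someone else's lock.
const RELEASE_SCRIPT = `
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  end
  return 0
`;

async function releaseLock(lockId) {
  await redis.eval(RELEASE_SCRIPT, { keys: [LOCK_KEY], arguments: [lockId] });
}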

module.exports = { acquireLock, releaseLock };

Usage

const { acquireLock, releaseLock } = require("./lock");

async function refreshMasterData() {
  const lockId = await acquireLock();

  if (!lockId) {
    console.log("Another request is refreshing. Returning stale data.");
    return getStaleData();
  }

  try {
    const newData = await fetchFromDB();
    await saveToCache(newData);
    return newData;
  } finally {
    await releaseLock(lockId);
  }
}

5. Add observability around refresh times

We now record:

query execution time
cache refresh duration
lock acquisition metrics
alerts when a refresh exceeds a threshold

The goal is to catch slowdowns before timeout happens.
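
A small timing wrapper is enough to get started. This is a sketch under assumptions: timed and its threshold argument are hypothetical, and the logger stands in for whatever metrics client you use.

```javascript
// Sketch of a timing wrapper. timed() is a hypothetical helper;
// swap the logger for your metrics client.
async function timed(name, thresholdMs, fn, log = console) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    // Runs on success and on failure, so slow failures are recorded too
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    log.info(JSON.stringify({ metric: name, elapsedMs }));
    if (elapsedMs > thresholdMs) {
      log.warn(`${name} took ${elapsedMs.toFixed(0)} ms (threshold ${thresholdMs} ms)`);
    }
  }
}
```

Setting the threshold well below the Lambda timeout (for example 20 seconds against a 30-second limit) gives you a warning while the refresh still succeeds.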

📝 Key Takeaways

Engine upgrades can change execution plans, sometimes dramatically.
Always benchmark critical queries after major DB changes.
Cache invalidation strategies must assume that refresh can fail.
Serving stale-but-valid data is often better than serving errors.
Distributed locks are essential in preventing cache stampede.

🚀 Final Thoughts

The incident was stressful, but the learnings were worth it.
Caching problems rarely show up during normal traffic — they appear right when your system is busiest.

If you have a similar “delete-then-refresh” pattern somewhere in your application… you may want to review it before it reviews you.

🧩 From 15 Minutes to Infinite: Scaling STT Jobs with AWS Batch

Jayesh Shinde — Sun, 09 Nov 2025 05:25:17 +0000

💡 The Problem

We recently ran into a production issue — our Speech-to-Text (STT) service stopped working for a few hours.
The feature was fixed quickly, but the transcripts for that downtime were missing.

Luckily, in Amazon Connect, all call recordings are stored in S3.
So the audio was there, but no transcripts.

We needed to reprocess all those missed files — fast.

🧠 First Attempt: Lambda (and its Limitations)

We quickly built a Lambda function to process unprocessed files from S3.

It worked fine — until it didn’t.
AWS Lambda has a 15-minute execution limit, and processing large audio files can easily exceed that.

We could have switched to EC2, but that felt like using a hammer for a small screw — no auto-scaling, no graceful shutdown, no built-in retry or job management.

We needed something that behaved like a job, not a script.

🚀 Enter AWS Batch + Fargate

That’s when AWS Batch came to the rescue.
It’s perfect for this kind of workload — long-running, batch-style, event-driven jobs.

Here’s the setup we used:

Created a Compute Environment

Backed by AWS Fargate → no EC2 management.
Scales automatically depending on job load.

Defined a Job Queue

All reprocessing jobs will be submitted here.
The queue ensures controlled concurrency and retries.

Built a Job Definition

Packaged our STT processing logic as a Docker image.
Uploaded it to Amazon ECR.
Defined required vCPU and memory for each job.

Triggered via Lambda

A small Lambda fetches a list of unprocessed S3 files.
For each batch (say 50 files), it submits a Batch Job.
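
The chunking step in that trigger Lambda can be sketched as below. The request objects follow the shape of the Batch SubmitJob call, but nothing is sent here; chunk, buildJobRequests and the S3_KEYS environment variable name are hypothetical, and the queue and definition names are placeholders.

```javascript
// Sketch of the chunking step. Helper names and S3_KEYS are hypothetical.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

function buildJobRequests(keys, batchSize = 50) {
  return chunk(keys, batchSize).map((group, index) => ({
    jobName: `process-audio-${index}`,
    jobQueue: "my-queue",
    jobDefinition: "my-job-def",
    containerOverrides: {
      // Each job receives its own list of S3 keys to process
      environment: [{ name: "S3_KEYS", value: JSON.stringify(group) }],
    },
  }));
}
```

For very long key lists, writing a manifest object to S3 and passing only its key is a reasonable alternative to packing everything into an environment variable.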

⚙️ The Flow in Action

Lambda → Checks for unprocessed audio files in S3.
Lambda → AWS Batch: Submits a job to process them.
AWS Batch (Fargate) spins up compute, runs the job.
Job → Downloads audio → runs STT → uploads transcript → updates metadata.
Fargate shuts down automatically when the job finishes.

No idle servers, no manual cleanup, no stress.

🧩 Why This Design Rocks

✅ Serverless all the way — Lambda + Fargate + S3
✅ Auto-scaling compute — no EC2 to babysit
✅ Long-running safe zone — runs beyond Lambda’s 15-min cap
✅ Reusable — we can reprocess any backlog anytime
✅ Cost-efficient — pay only for what’s used

🪄 Bonus Tip

You can even schedule a “missed transcript” job to run daily or weekly,
checking for any files without transcripts and triggering a Batch job automatically.

🧩 Understanding AWS Batch Scaling

In AWS Batch, the number of tasks (containers) that run in parallel depends on three things working together:

Compute Environment capacity
→ e.g., your environment has a maximum of 10 vCPUs.
Job Definition requirements
→ e.g., each job needs 1 vCPU.
How many jobs are in the queue (and their array size, if used).

🔹 Case 1: You Submit Multiple Independent Jobs

If you submit 10 jobs, each with 1 vCPU, and your environment allows 10 vCPUs,
then AWS Batch can run all 10 in parallel (subject to available Fargate capacity).

Example:

# pseudo example
for i in {1..10}; do
  aws batch submit-job \
    --job-name process-audio-$i \
    --job-queue my-queue \
    --job-definition my-job-def
done

Each job = 1 vCPU → up to 10 can run simultaneously.

AWS Batch’s Job Scheduler will automatically pack as many as possible based on available compute.

🔹 Case 2: You Use an Array Job

Instead of manually looping, you can submit an array job.

Example:

aws batch submit-job \
  --job-name process-audios \
  --job-queue my-queue \
  --job-definition my-job-def \
  --array-properties size=10

This creates 10 child jobs under a single parent, each running independently (great for S3 list chunking).

Same result — 10 parallel containers, each with 1 vCPU.
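
Inside the container, each child needs to know which share of the work is its own. AWS Batch sets AWS_BATCH_JOB_ARRAY_INDEX on every child job; the rest of this sketch (sliceForChild, keysForThisChild, and how the full key list reaches the job) is an assumption.

```javascript
// Sketch: how one array child could pick its slice of the S3 keys.
// sliceForChild and keysForThisChild are hypothetical helpers.
function sliceForChild(keys, arraySize, index) {
  const perChild = Math.ceil(keys.length / arraySize);
  return keys.slice(index * perChild, (index + 1) * perChild);
}

function keysForThisChild(keys, arraySize, env = process.env) {
  // Batch provides the child's zero-based index as a string
  const index = Number(env.AWS_BATCH_JOB_ARRAY_INDEX ?? 0);
  return sliceForChild(keys, arraySize, index);
}
```

When the key count does not divide evenly, the last children simply get a shorter or empty slice.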

🔹 Case 3: You Submit a Single Job that Needs More vCPUs

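If your job definition requests 4 vCPUs (on Fargate this goes under resourceRequirements, together with a matching memory value):

"resourceRequirements": [
  { "type": "VCPU", "value": "4" },
  { "type": "MEMORY", "value": "8192" }
]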

and your environment has 10 total vCPUs →
then Batch will reserve 4 vCPUs for that job, leaving room for other smaller jobs.

So the compute environment doesn’t spawn “10 copies automatically” —
it just enforces a maximum pool of total CPU that concurrent jobs can consume.
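
The arithmetic behind the three cases is a simple division. This sketch only looks at vCPUs and ignores memory, account quotas and available Fargate capacity; maxParallelJobs is a hypothetical helper.

```javascript
// Rough upper bound on concurrent jobs, considering vCPUs only.
function maxParallelJobs(maxVcpus, vcpusPerJob) {
  if (vcpusPerJob <= 0) {
    throw new RangeError("vcpusPerJob must be positive");
  }
  return Math.floor(maxVcpus / vcpusPerJob);
}

console.log(maxParallelJobs(10, 1)); // 10
console.log(maxParallelJobs(10, 4)); // 2
```

So with a 10 vCPU environment, two 4-vCPU jobs can run side by side and the remaining 2 vCPUs stay available for smaller jobs.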

⚙️ TL;DR — How to Scale

Goal	What to Do
Run multiple tasks concurrently	Submit multiple jobs or an array job
Each job’s CPU need	Defined in Job Definition (e.g., 1 vCPU)
Max parallel limit	Based on compute environment capacity
Control at runtime	You can pass `--array-properties size=N` dynamically
Scaling behavior	Batch automatically scales Fargate/EC2 capacity up/down

🏁 Closing Thoughts

This experience reminded me —

“When your script starts feeling like a job, give it job-like powers.”

AWS Batch (especially with Fargate) is often underrated,
but it’s a powerful tool when you need on-demand, containerized, long-running compute
without managing any servers.

Reusing HTTP and SDK clients in AWS Lambda to avoid “too many open files” (FD) errors

Jayesh Shinde — Wed, 15 Oct 2025 05:04:25 +0000

TL;DR

We hit sporadic network errors in a high-throughput Lambda that made HTTP calls (Axios) and AWS SDK calls.
Root cause: creating new HTTP clients/agents per invocation ballooned the number of open sockets (file descriptors).
Fix: initialize clients and their https.Agent once at module scope with keep-alive and reuse them across warm invocations. For AWS SDK v2, also set AWS_NODEJS_CONNECTION_REUSE_ENABLED=1.

The scenario

We had a Lambda that was invoked asynchronously to process a large dataset (thousands of events). Inside the handler, we created an Axios client and AWS SDK client(s) for each invocation. Under sustained concurrency, we started seeing intermittent network failures.

Symptoms we saw

These popped up in CloudWatch logs while the Lambda was busy:

“too many open files” errors:
- Error: EMFILE: too many open files, open
- NodeError: getaddrinfo ENFILE
Connection instability:
- AxiosError: socket hang up
- Error: read ECONNRESET
- Error: connect ECONNRESET
Occasional timeouts and throttling-like behavior despite healthy downstream services

These were worse during bursts when many async invocations overlapped.

What’s really happening (FDs and sockets in Lambda)

Every TCP connection (HTTP/HTTPS) consumes a file descriptor (FD).
Lambda execution environments have a relatively low per-process FD limit (commonly around 1024).
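If you create a new HTTP client (and thus a new https.Agent) per invocation, each agent gets its own socket pool, so connections are never shared and sockets that have not closed yet linger alongside the new ones. Under sustained load, you exhaust FDs, leading to the errors above.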
Lambda reuses the same execution environment for multiple “warm” invocations. Objects created at module scope are kept alive and reused, which is exactly what we want for clients and connection pools.

Why Node’s `https.Agent` matters

The agent controls connection pooling and keep-alive.
Creating a new agent per invocation increases the number of socket pools and the total sockets in use.
Reusing a single agent keeps the number of open sockets bounded and allows connection reuse across requests, reducing FD pressure and latency.

The anti-pattern (what we had)

Creating new clients and agents inside the handler:

import axios from 'axios';
import https from 'https';

// Anti-pattern: runs on every invocation
export const handler = async () => {
  const ax = axios.create({
    httpsAgent: new https.Agent(), // new agent each time
  });

  const resp = await ax.get('https://api.example.com/data');
  return resp.data;
};

Same issue with AWS SDK if you new a client per invocation, especially if you also create its own agent.

The fix (module-level reuse with keep-alive)

Move client and agent creation to module scope so they’re created once per warm environment and then reused.

Axios

import axios from 'axios';
import https from 'https';

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 60,         // tune based on expected concurrency per environment
  maxFreeSockets: 10,     // idle sockets kept open for reuse
  timeout: 30_000,        // socket timeout in ms
  // core https.Agent has no freeSocketTimeout; that option comes from the agentkeepalive package
});

const ax = axios.create({
  headers: { 'Content-Type': 'application/json' },
  httpsAgent,
});

export const handler = async () => {
  const resp = await ax.get('https://api.example.com/data');
  return resp.data;
};

AWS SDK v3

import https from 'https';
// Newer SDK v3 releases publish this handler as @smithy/node-http-handler
import { NodeHttpHandler } from '@aws-sdk/node-http-handler';
import { S3Client, ListBucketsCommand } from '@aws-sdk/client-s3';

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 60,
  maxFreeSockets: 10,
  timeout: 30_000,
});

const s3 = new S3Client({
  region: process.env.AWS_REGION,
  requestHandler: new NodeHttpHandler({
    httpsAgent,
    connectionTimeout: 3_000,
    socketTimeout: 30_000,
  }),
});

export const handler = async () => {
  // S3Client is the bare client: operations go through send() with a command object
  const out = await s3.send(new ListBucketsCommand({}));
  return out;
};

AWS SDK v2

Reuse clients, and enable connection reuse via env var.

import https from 'https';
import AWS from 'aws-sdk';

// Also set in Lambda env: AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
AWS.config.update({
  region: process.env.AWS_REGION,
  httpOptions: { agent: new https.Agent({ keepAlive: true, maxSockets: 60 }) },
});

const s3 = new AWS.S3();

export const handler = async () => {
  const out = await s3.listBuckets().promise();
  return out;
};

Results after the change

FD-related errors (EMFILE, ENFILE, socket hang ups) disappeared under the same workload.
Lower p95 latency due to connection reuse.
Fewer outbound connection spikes visible on NAT Gateway/ENI metrics (for VPC Lambdas).
More predictable behavior during bursts.

Bonus mitigations

Concurrency control: use SQS with a sane maxConcurrency/batchSize, reserved concurrency, or step-wise throttling to prevent bursts from scaling FD usage across many environments at once.
Timeouts and retries: set realistic timeouts; add backoff with jitter to avoid synchronized retries.
context.callbackWaitsForEmptyEventLoop = false: for callback-style handlers, this lets the function return as soon as the callback fires, even if the agent keeps idle sockets open (don’t overuse).
Consider undici for HTTP in Node 18+; it provides efficient HTTP/1.1 keep-alive by default.
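
For the "backoff with jitter" point, here is a minimal sketch of full-jitter exponential backoff. The function names and the default base and cap values are assumptions chosen for illustration, not values from our production setup.

```typescript
// Full-jitter backoff: pick a random delay in [0, min(cap, base * 2^attempt)).
// The random source is injectable so the function is easy to test.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 5000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Upper bound of the delay for each attempt, useful for sanity-checking a config.
function backoffCeilings(maxAttempts: number, baseMs: number = 100, capMs: number = 5000): number[] {
  const ceilings: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    ceilings.push(Math.min(capMs, baseMs * Math.pow(2, attempt)));
  }
  return ceilings;
}

console.log(backoffCeilings(8)); // upper bounds: 100, 200, 400, 800, 1600, 3200, 5000, 5000
```

Because each environment draws its own random delay below the ceiling, retries from many concurrent environments spread out instead of hitting the downstream service at the same instant.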

Quick checklist

Initialize HTTP clients and SDK clients at module scope.
Use a shared https.Agent with keepAlive: true; set maxSockets, maxFreeSockets, and timeouts.
For AWS SDK v2, set AWS_NODEJS_CONNECTION_REUSE_ENABLED=1.
Avoid creating clients/agents inside loops or inside the handler.
Monitor and tune under realistic concurrency.

Closing thoughts

FD exhaustion is easy to miss until traffic scales. In serverless, the simplest lever is to reuse resources across warm invocations. One shared agent + one shared client per execution environment eliminates a whole class of flaky, intermittent network issues.

🧠 How We Upgraded Our WordPress Search with OpenSearch Neural + Cohere for Multilingual Semantic Search

Jayesh Shinde — Sat, 11 Oct 2025 04:32:51 +0000

At our company, we use WordPress as a knowledge base (KB) for internal articles.

We indexed those articles in OpenSearch, but the default keyword search felt… old-school.

Searching “Netflix subscription” missed “How to manage ネットフリックス plans”.
Searching “AWS cost optimization” returned random hits because keywords didn’t align.

So we upgraded our search to semantic search using the OpenSearch ML plugin + Cohere embeddings served via Amazon Bedrock — a great combination for multilingual understanding and secure enterprise integration.

⚙️ The Problem

Our initial setup used a plain text mapping for keyword-based (lexical) search:

{
  "properties": {
    "title": { "type": "text" },
    "content": { "type": "text" }
  }
}

Even after tweaking analyzers, users searching in Japanese (katakana) or English weren’t getting expected matches.

For example, “netflix” and “ネットフリックス” should be the same — but OpenSearch treated them as completely different tokens.

🚀 The Plan

We wanted to add semantic search on top of our existing index.
That means converting both documents and queries into vectors using the same embedding model, and comparing them using cosine similarity.
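
As a quick illustration of the comparison step, here is a minimal cosine similarity sketch. The three-dimensional vectors are made-up numbers, not real Cohere embeddings (those have 1024 dimensions), and OpenSearch does this computation for us at query time.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for a query, a related document, and an unrelated one.
const toyQuery = [0.25, 0.65, 0.05];
const toyRelatedDoc = [0.2, 0.7, 0.1];
const toyUnrelatedDoc = [0.9, -0.3, 0.1];

console.log(cosineSimilarity(toyQuery, toyRelatedDoc) > cosineSimilarity(toyQuery, toyUnrelatedDoc)); // true
```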

Our plan:

Create an ML connector for the Cohere embedding API
Create a model for document embeddings (input_type = search_document)
Create another model for query embeddings (input_type = search_query)
Update our ingestion pipeline to generate vectors during indexing
Use script_score (or kNN query) to retrieve the best semantic matches

🔌 Step 1: Create a Connector (to Cohere)

First, we create an ML connector in OpenSearch that calls Cohere’s embedding model through Amazon Bedrock.

POST _plugins/_ml/connectors/_create
{
  "name": "bedrock-cohere-doc-connector",
  "description": "Amazon Bedrock connector for Cohere embeddings (document)",
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "us-east-1",
    "service_name": "bedrock",
    "model_id": "cohere.embed-multilingual-v3",
    "input_type": "search_document"
  }
}

Here:

protocol: aws_sigv4 makes OpenSearch sign Bedrock API calls with IAM credentials.

model_id refers to Cohere’s multilingual embedding model hosted on Bedrock.

We use input_type=search_document for document-level embeddings.

🧩 Step 2: Register the Model

Now, we create a model using the connector:

POST _plugins/_ml/models/_register
{
  "name": "bedrock-cohere-doc-embed",
  "function_name": "remote",
  "connector_id": "bedrock-cohere-doc-connector"
}

This model will be used in our ingestion pipeline.

Then deploy it:

POST _plugins/_ml/models/bedrock-cohere-doc-embed/_deploy

OpenSearch now knows how to call Cohere to embed documents.

💾 Step 3: Create the Index with a Vector Field

Next, we create our knowledge base index that includes a vector field for embeddings.

PUT kb-articles
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      }
    }
  }
}

The vector field embedding will hold our document-level embeddings from Cohere (1024 dimensions).

🧠 Step 4: Add an Ingest Pipeline

We’ll generate document embeddings automatically during ingestion.

PUT _ingest/pipeline/kb-embed-pipeline
{
  "processors": [
    {
      "ml_inference": {
        "model_id": "bedrock-cohere-doc-embed",
        "input_map": { "title": "text" },
        "output_map": { "embedding": "embedding" }
      }
    }
  ]
}

Now, when we index a document:

POST kb-articles/_doc?pipeline=kb-embed-pipeline
{
  "title": "Netflix subscription help",
  "content": "Steps to manage your Netflix account and billing."
}

OpenSearch automatically calls Amazon Bedrock, retrieves Cohere’s embedding, and stores it in the embedding vector field.

💬 Step 5: Handle User Queries with Another Connector

Initially, I used the same connector (with input_type=search_document) for both documents and queries.
That caused a mismatch — “ネットフリックス” (Katakana) and “Netflix” were still not matching.

The fix was to create another connector and model specifically for query embeddings.

POST _plugins/_ml/connectors/_create
{
  "name": "bedrock-cohere-query-connector",
  "description": "Amazon Bedrock connector for Cohere embeddings (query)",
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "us-east-1",
    "service_name": "bedrock",
    "model_id": "cohere.embed-multilingual-v3",
    "input_type": "search_query"
  }
}

Then register and deploy a second model:

POST _plugins/_ml/models/_register
{
  "name": "bedrock-cohere-query-embed",
  "function_name": "remote",
  "connector_id": "bedrock-cohere-query-connector"
}

POST _plugins/_ml/models/bedrock-cohere-query-embed/_deploy

This ensures both documents and queries are embedded in compatible vector spaces.
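
To show what actually differs between the two connectors, here is a small sketch of the Cohere embed request body for each side. The type and function names are hypothetical. Only the texts and input_type fields come from Cohere's embed request format, and how the connector forwards this body to Bedrock is not shown.

```typescript
// The two input types used in this article.
type CohereInputType = "search_document" | "search_query";

interface CohereEmbedRequest {
  texts: string[];
  input_type: CohereInputType;
}

// Builds the request body; the only difference between indexing and searching is input_type.
function buildEmbedRequest(texts: string[], inputType: CohereInputType): CohereEmbedRequest {
  if (texts.length === 0) {
    throw new Error("at least one text is required");
  }
  return { texts, input_type: inputType };
}

const forIndexing = buildEmbedRequest(["Netflix subscription help"], "search_document");
const forSearching = buildEmbedRequest(["ネットフリックス subscription"], "search_query");

console.log(JSON.stringify(forIndexing));
console.log(JSON.stringify(forSearching));
```

Same model, same dimension, different input_type: that one field is what the second connector exists to change.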

Now, when we run a search, we first generate a query embedding via the ML model’s _predict endpoint:

POST _plugins/_ml/models/bedrock-cohere-query-embed/_predict
{
  "text": "ネットフリックス subscription"
}

🔍 Step 6: Semantic Search Query

Finally, we plug the embedding into a script_score query to rank results by cosine similarity:

POST kb-articles/_search
{
  "size": 5,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['embedding']) + 1.0",
        "params": {
          "query_vector": [/* embedding array from _predict */]
        }
      }
    }
  },
  "sort": [
    { "_score": { "order": "desc" } }
  ]
}

Now “Netflix” and “ネットフリックス” both match beautifully.
🎯 Cohere’s multilingual embeddings + OpenSearch vector search did the trick.
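
On the client side, the two-step flow (embed the query, then search) can be glued together with a small helper like the sketch below. The function name and the dimension guard are my own additions for illustration. The script source assumes the OpenSearch form that reads the field through doc['embedding'], so check it against your OpenSearch version.

```typescript
// Must match the "dimension" of the knn_vector field in the index mapping.
const EMBEDDING_DIMENSION = 1024;

// Builds the _search request body from a query embedding returned by _predict.
function buildSemanticSearchBody(queryVector: number[], size: number = 5) {
  if (queryVector.length !== EMBEDDING_DIMENSION) {
    throw new Error(
      "expected " + EMBEDDING_DIMENSION + " dimensions, got " + queryVector.length,
    );
  }
  return {
    size,
    query: {
      script_score: {
        query: { match_all: {} },
        script: {
          // Cosine similarity is in [-1, 1]; adding 1.0 keeps the score non-negative.
          source: "cosineSimilarity(params.query_vector, doc['embedding']) + 1.0",
          params: { query_vector: queryVector },
        },
      },
    },
  };
}

// Placeholder vector standing in for a real embedding.
const placeholderVector: number[] = [];
for (let i = 0; i < EMBEDDING_DIMENSION; i += 1) {
  placeholderVector.push(0.01);
}

const searchBody = buildSemanticSearchBody(placeholderVector);
console.log(searchBody.query.script_score.script.params.query_vector.length); // 1024
```

Failing fast on a wrong vector length surfaces the "dimension mismatch" class of errors before the request ever reaches the cluster.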

🧩 Why We Chose Cohere

Strong multilingual understanding — perfect for English + Japanese content
Easy integration: it plugs in through an OpenSearch ML connector (in our setup via Bedrock with SigV4-signed requests rather than an API key)
Consistent embedding dimensions — works well with knn_vector
Fast inference — good for production-scale pipelines

🪛 Troubleshooting Tips

Problem	Cause	Fix
Katakana & English not matching	Used `search_document` for query embeddings	Create a separate connector with `input_type=search_query`
“Dimension mismatch” errors	Wrong embedding model or field dimension	Make sure both model & field use same `dimension`
Inference timeout	Cohere API rate limits	Batch or cache embeddings during ingestion
Weird scores	Missing `space_type: cosinesimil`	Use cosine similarity in mapping

✅ Summary

We started with basic keyword search and ended up with multilingual semantic search using:

OpenSearch ML plugin
Cohere embedding models
Cosine similarity on knn_vector
Separate connectors for documents vs. queries

✨ Outcome

After integrating Cohere via Bedrock and separating the input types:

✅ We can now search using phrases, not just keywords

🌏 Cross-lingual search works — “Netflix” ≈ “ネットフリックス”

💬 Semantic matching improved drastically (e.g., “streaming issue” finds “再生エラー”)

📈 Search relevance and recall are noticeably better, even for non-English content

The combination of OpenSearch ML plugin + Cohere embeddings via Bedrock turned our keyword search into a truly semantic multilingual search engine — all running within the AWS ecosystem.

💡 If you’re building multilingual or brand-sensitive search, don’t skip the input_type difference — it can make or break your semantic matching.

🛠️ Fixing Lost SecurityContext and Correlation IDs in Async Calls with Spring Boot

Jayesh Shinde — Sun, 05 Oct 2025 11:57:58 +0000

When we started parallelizing API calls in our Spring Boot service using CompletableFuture and a custom ExecutorService, everything looked great… until we checked the logs.

Our JWT SecurityContext wasn’t available in the async threads.
Our MDC correlation IDs (used for distributed tracing/log correlation) were missing too.

That meant downstream services didn’t know who was calling, and our logs lost the ability to tie requests together. Not good.

🚨 The Problem

Spring Security stores authentication in a ThreadLocal (SecurityContextHolder).

SLF4J’s MDC (Mapped Diagnostic Context) also uses ThreadLocal to store correlation IDs.

When you hop threads (e.g., via CompletableFuture.supplyAsync), those ThreadLocal values don’t magically follow along. So in worker threads:

SecurityContextHolder.getContext() → empty
MDC.get("correlationId") → null

✅ The Solution: Wrap the Executor

We solved this by wrapping our ExecutorService in a lightweight decorator that captures the MDC + SecurityContext from the submitting thread and restores them inside the worker thread.

Here’s the implementation:

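import java.util.List;
import java.util.Map;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.MDC;
import org.springframework.security.core.context.SecurityContext;
import org.springframework.security.core.context.SecurityContextHolder;

public class ContextPropagatingExecutorService extends AbstractExecutorService {

    private final ExecutorService delegate;

    public ContextPropagatingExecutorService(ExecutorService delegate) {
        this.delegate = delegate;
    }

    private Runnable wrap(Runnable task) {
        // Captured on the submitting (request) thread
        final Map<String, String> mdc = MDC.getCopyOfContextMap();
        final SecurityContext securityContext = SecurityContextHolder.getContext();

        return () -> {
            // Restored on the worker thread before the task runs
            if (mdc != null) MDC.setContextMap(mdc);
            if (securityContext != null) SecurityContextHolder.setContext(securityContext);
            try {
                task.run();
            } finally {
                // Always clean up so pooled threads don't leak context into the next task
                MDC.clear();
                SecurityContextHolder.clearContext();
            }
        };
    }

    @Override
    public void execute(Runnable command) {
        delegate.execute(wrap(command));
    }

    // Lifecycle methods simply delegate to the wrapped executor.
    @Override public void shutdown() { delegate.shutdown(); }
    @Override public List<Runnable> shutdownNow() { return delegate.shutdownNow(); }
    @Override public boolean isShutdown() { return delegate.isShutdown(); }
    @Override public boolean isTerminated() { return delegate.isTerminated(); }
    @Override public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
        return delegate.awaitTermination(timeout, unit);
    }
}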

⚙️ Wiring It Up

In our Spring Boot config:

@Configuration
public class AsyncConfig {

    @Bean(name = "executorService")
    public ExecutorService executorService() {
        ExecutorService base = Executors.newCachedThreadPool();
        return new ContextPropagatingExecutorService(base);
    }
}

Now, whenever we do:

CompletableFuture<AccountDTO> fromAccount =
    CompletableFuture.supplyAsync(() -> accountClient.getAccountsById(id), executorService);

…the async thread has the same SecurityContext and MDC as the request thread.

📊 Before vs After

Aspect	Before	After
`SecurityContextHolder.getContext()`	Empty in async thread	Correctly populated
`MDC.get("correlationId")`	`null`	Same correlation ID as request thread
Logs	Missing trace IDs	Full traceability across async calls
Downstream services	No JWT propagated	JWT available for Feign/RestTemplate

🔑 Takeaways

ThreadLocals don’t cross thread boundaries — you need to propagate them manually.
Wrapping your ExecutorService is a clean, reusable fix.
This pattern works not just for MDC + SecurityContext, but for any contextual data you need across async boundaries.

🚀 Closing Thoughts

If you’re building microservices with Spring Boot and using async execution (CompletableFuture, @Async, Kafka listeners, etc.), don’t forget about context propagation. Without it, your logs and security checks will silently break.

Wrapping your executor is a small change that pays off big in observability and security consistency.

Why My CDK Deploys Started Failing After Org Added Strict SCP Rules

Jayesh Shinde — Sat, 27 Sep 2025 07:56:29 +0000

Recently, I ran into a head‑scratcher while deploying a CDK stack. Everything used to work fine, but once my organization introduced strict SCP rules based on tags, my cdk deploy started failing with:

AccessDenied: action cloudformation:CreateChangeSet is not authorized

At first glance, it didn’t make sense. I was already tagging everything in my CDK code. I even had a loop that pulled tags from props.Tags and attached them with cdk.Tags.of(this).add(...). These tags used to flow down nicely to all resources — including the CloudFormation stack itself.

So why did it suddenly stop working? 🤔

What Changed? SCPs and Request‑Time Enforcement

The key is where SCP rules get evaluated.

An SCP can restrict not only what resources exist, but also what API calls are allowed. In my case, the org had a policy like this (dummy example):

{
  "Effect": "Deny",
  "Action": "cloudformation:CreateChangeSet",
  "Resource": "*",
  "Condition": {
    "StringNotEquals": {
      "aws:RequestTag/Org": "ABC"
    }
  }
}

This means: if the CreateChangeSet request doesn’t already include the required tag, the call is blocked right away.
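
To see why a missing tag is treated the same as a wrong one, here is a toy model of that single condition. It is a simplified sketch, not how IAM evaluates policies in full, and the function name is made up. The one behavior it encodes is that a negated operator like StringNotEquals also matches when the key is absent from the request.

```typescript
// Tags sent with the API request (what aws:RequestTag/<key> looks at).
type RequestTags = { [key: string]: string | undefined };

// Returns true when the Deny statement applies to the request.
function isDeniedByTagScp(
  requestTags: RequestTags,
  requiredKey: string = "Org",
  requiredValue: string = "ABC",
): boolean {
  // A missing key yields undefined, which is also "not equal" to the required value.
  return requestTags[requiredKey] !== requiredValue;
}

console.log(isDeniedByTagScp({}));                              // true: no tags on the request
console.log(isDeniedByTagScp({ Owner: "TeamA" }));              // true: required tag missing
console.log(isDeniedByTagScp({ Org: "XYZ" }));                  // true: wrong value
console.log(isDeniedByTagScp({ Org: "ABC", Owner: "TeamA" }));  // false: request is allowed through
```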

CDK Tagging: Two Different Worlds

This is where CDK behavior matters:

StackProps.tags
When you pass tags into the super(scope, id, props) constructor, CDK includes those tags in the CreateChangeSet API call. These show up as RequestTags. That’s what SCPs check.
cdk.Tags.of(resource).add(...)
This method attaches tags to resources inside the CloudFormation template. They are applied after the stack is already created.

So my old approach of looping through props.Tags and calling cdk.Tags.of(this).add(...) worked fine in the past, but now fails because the SCP never lets the stack get created in the first place. The required tags simply aren’t present yet at request time.

Fix: Pass Tags via StackProps

The solution was simple once I understood the difference. Previously my props carried a custom Tags property. I replaced it with the lowercase tags property that cdk.StackProps defines, which CDK reads when it initializes the in-memory construct tree (the Stack object inside your app).
When you run cdk deploy, the CLI uses the CloudFormation SDK to call:

CreateChangeSet (with ChangeSetType set to CREATE or UPDATE) → this is the first call that asks CloudFormation to create or change the stack.

This is where stack-level tags (props.tags) are injected into the request payload.

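import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class MyStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    // props.tags become stack-level tags on the CreateChangeSet request
    super(scope, id, props);

    // props and props.tags are both optional, so fall back to an empty object
    for (const [k, v] of Object.entries(props?.tags ?? {})) {
      cdk.Tags.of(this).add(k, v);
    }
  }
}

const app = new cdk.App();

new MyStack(app, "MyTaggedStack", {
  tags: {
    Org: "ABC",
    Owner: "TeamA"
  }
});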

Now:

StackProps.tags → go straight into the CreateChangeSet request (SCP passes ✅).
cdk.Tags.of(this).add(...) → still ensures all resources get the same tags after creation.

Takeaway

If your organization enforces strict SCP rules on cloudformation:CreateChangeSet, you can’t rely on cdk.Tags.of(...) alone. Those tags arrive too late. You need to use StackProps.tags so the tags are present in the request itself.

It’s a subtle but important difference — and once I understood it, the “AccessDenied” error finally made sense.
