<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AWS Community Builders </title>
    <description>The latest articles on DEV Community by AWS Community Builders  (@aws-builders).</description>
    <link>https://dev.to/aws-builders</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png</url>
      <title>DEV Community: AWS Community Builders </title>
      <link>https://dev.to/aws-builders</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aws-builders"/>
    <language>en</language>
    <item>
      <title>Event-driven media intelligence with AWS Step Functions and Bedrock</title>
      <dc:creator>Collins Ushi</dc:creator>
      <pubDate>Sat, 18 Apr 2026 19:00:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/event-driven-media-intelligence-with-aws-step-functions-and-bedrock-46lp</link>
      <guid>https://dev.to/aws-builders/event-driven-media-intelligence-with-aws-step-functions-and-bedrock-46lp</guid>
      <description>&lt;p&gt;Every modern product that handles user-generated media; say a podcast platform, a video CMS, a learning product, a content-moderation layer   - eventually runs into the same problem. A file lands in storage, and now the system needs to understand it: extract the speech, identify what's on screen, summarise it, tag it, and make it queryable. Doing that on a single server is expensive, fragile, and impossible to scale predictably.&lt;/p&gt;

&lt;p&gt;This article walks through a serverless design for that problem on AWS. The pipeline ingests audio, video, or images, runs them through managed AI services (Rekognition, Transcribe, Bedrock), and persists the extracted intelligence into DynamoDB for downstream use - all without operating a single EC2 instance. The focus is on why each piece exists and how it fits together, not just naming the services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why serverless is the right shape for this problem
&lt;/h2&gt;

&lt;p&gt;Media workloads are spiky and long-running in a way that punishes traditional compute. &lt;br&gt;
A single 45-minute video can take several minutes to transcribe; ten of them landing at once shouldn't require keeping ten servers warm all day. The workflow also fans out naturally: transcription, visual analysis, and thumbnail generation are independent steps that can run in parallel.&lt;/p&gt;

&lt;p&gt;Three properties of a serverless, event-driven design map cleanly onto this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Elastic concurrency: Lambda, Rekognition, and Transcribe scale out on demand. You pay per request and per second of processing, not for idle capacity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Native asynchrony: Step Functions turns a multi-stage AI pipeline into a declarative state machine with retries, timeouts, and parallel branches built in - no message queues or cron jobs to wire by hand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Composability: Every stage is a managed service with a clear IAM contract. Swapping Transcribe for a different ASR provider, or Bedrock for a self-hosted model, is a local change.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Architecture at a glance
&lt;/h2&gt;

&lt;p&gt;The pipeline has five logical layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: S3 receives the upload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event routing&lt;/strong&gt;: EventBridge picks up the &lt;code&gt;Object Created&lt;/code&gt; event and triggers the workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Step Functions coordinates the processing stages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligence&lt;/strong&gt;: Lambda functions call Rekognition, Transcribe, and Bedrock to extract structured data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt;: DynamoDB stores the metadata, keyed by media ID, for downstream querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yk07lx3aczm988zhs2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yk07lx3aczm988zhs2c.png" alt=" " width="632" height="289"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Walking through each component
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. S3 as the ingestion surface&lt;/strong&gt;&lt;br&gt;
The pipeline starts where most media pipelines start: a single S3 bucket configured for object uploads. A few things are worth doing at this layer that are easy to skip:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Separate buckets (or at least prefixes) for raw input and processed artefacts&lt;/strong&gt;. Keeping &lt;code&gt;incoming/&lt;/code&gt; and &lt;code&gt;processed/&lt;/code&gt; distinct means lifecycle rules, replication, and IAM policies stay clean as the system grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multipart upload configuration&lt;/strong&gt;. Large video files need multipart uploads; the default S3 SDKs handle this, but the bucket should have a lifecycle rule to abort incomplete multipart uploads after N days, or costs quietly accumulate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-side encryption and versioning&lt;/strong&gt;. If the system ever handles sensitive content (KYC video, medical, private podcasts), this is non-negotiable.&lt;/li&gt;
&lt;/ul&gt;
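
&lt;p&gt;As a concrete sketch of that abort rule with boto3 (the bucket name, prefix, and seven-day window here are assumptions, not recommendations):&lt;/p&gt;

```python
# Lifecycle rule: abort multipart uploads left incomplete for 7 days,
# so abandoned upload parts stop accruing storage cost.
LIFECYCLE = {
    "Rules": [
        {
            "ID": "abort-stale-multipart",
            "Filter": {"Prefix": "incoming/"},
            "Status": "Enabled",
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

def apply_lifecycle(bucket):
    import boto3  # real call; requires AWS credentials for the target account
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE
    )
```

&lt;p&gt;Scoping the rule to the &lt;code&gt;incoming/&lt;/code&gt; prefix keeps it from touching processed artefacts, which is one reason to keep the prefixes distinct in the first place.&lt;/p&gt;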

&lt;p&gt;&lt;strong&gt;2. EventBridge as the event router&lt;/strong&gt;&lt;br&gt;
You could wire S3 notifications directly to a Lambda; it works, but it's rigid. Putting EventBridge between S3 and the rest of the pipeline gives you something much more valuable: a declarative event bus whose rules pattern-match on the event payload.&lt;br&gt;
The immediate benefit is filtering. A rule can fire the pipeline only for specific prefixes, file extensions, or size ranges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"aws.s3"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Object Created"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"media-ingest-prod"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"incoming/"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"numeric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The longer-term benefit is that the event bus becomes a seam. When a second consumer appears (say, an analytics team that wants to count uploads per tenant), it attaches its own rule without touching the core pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Step Functions as the orchestrator&lt;/strong&gt;&lt;br&gt;
This is the heart of the design. Step Functions expresses the pipeline as a JSON (or YAML) state machine, and the payoff is substantial: retries with exponential backoff, per-state timeouts, parallel branches, and error paths are all configuration rather than code.&lt;/p&gt;

&lt;p&gt;A reasonable shape for the state machine looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClassifyMedia&lt;/strong&gt;: a Lambda that inspects the MIME type and branches on whether the file is image, audio, or video.&lt;/p&gt;
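
&lt;p&gt;A minimal sketch of that classification step, assuming the EventBridge S3 event shape; the returned labels are illustrative, not a fixed contract:&lt;/p&gt;

```python
import mimetypes

def classify_media(event):
    # Guess the MIME type from the S3 object key and branch on the top-level
    # media family. Anything unrecognised falls through to "unsupported".
    key = event["detail"]["object"]["key"]
    mime, _ = mimetypes.guess_type(key)
    mime = mime or "application/octet-stream"
    if mime.startswith("image/"):
        media_type = "image"
    elif mime.startswith("audio/"):
        media_type = "audio"
    elif mime.startswith("video/"):
        media_type = "video"
    else:
        media_type = "unsupported"
    return {"key": key, "mime": mime, "media_type": media_type}
```
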

&lt;p&gt;&lt;strong&gt;ParallelAnalysis&lt;/strong&gt;: a Parallel state that runs, for a video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StartTranscriptionJob&lt;/strong&gt; (Transcribe) on the extracted audio track.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StartLabelDetection&lt;/strong&gt; and &lt;strong&gt;StartContentModeration&lt;/strong&gt; (Rekognition Video) on the visual track.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;WaitForCompletion&lt;/strong&gt;: asynchronous Rekognition and Transcribe jobs publish completion events; the state machine uses the &lt;code&gt;.waitForTaskToken&lt;/code&gt; integration pattern (or polling) to resume once results are ready.&lt;/p&gt;
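
&lt;p&gt;The callback side of the task-token pattern can be sketched as a small handler: a completion notification arrives, the stored token is looked up, and the paused execution resumes. The message shape and token lookup here are illustrative; in practice the token is persisted (for example in DynamoDB, keyed by job ID) when the job starts:&lt;/p&gt;

```python
import json

def handle_job_completion(event, lookup_token, sfn_client):
    # SNS delivers the job-completion notification; resume the paused
    # Step Functions execution with its stored task token.
    # sfn_client would be boto3.client("stepfunctions") in a real Lambda.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    job_id = message["JobId"]
    token = lookup_token(job_id)  # hypothetical lookup, e.g. a DynamoDB read
    if message.get("Status") == "SUCCEEDED":
        sfn_client.send_task_success(taskToken=token, output=json.dumps(message))
    else:
        sfn_client.send_task_failure(taskToken=token, error="JobFailed")
    return job_id
```
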

&lt;p&gt;&lt;strong&gt;SummariseWithBedrock&lt;/strong&gt;: a Lambda that sends the transcript and label set to a Bedrock model and receives a structured summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PersistMetadata&lt;/strong&gt;: writes the final object into DynamoDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CatchAll&lt;/strong&gt;: a catch block that routes any failure to a dead-letter state, writes the error into a failures table, and optionally notifies via SNS.&lt;/p&gt;

&lt;p&gt;The thing to internalise is that Step Functions is not "just" a workflow visualiser; it's the reason the pipeline is resilient. A transient Bedrock throttling error retries automatically; a malformed file fails into the catch branch instead of silently leaving the system in a half-processed state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Lambda as the glue&lt;/strong&gt;&lt;br&gt;
Each state in the machine is either a direct service integration (Step Functions can call Rekognition, Transcribe, and DynamoDB without a Lambda in between) or a small Lambda function for the bits that need custom logic: MIME classification, payload shaping before Bedrock, post-processing transcripts.&lt;/p&gt;

&lt;p&gt;A rule of thumb: prefer direct service integrations wherever possible. Every Lambda you add is code to test, package, monitor, and patch. The direct integrations are zero-code and zero-cold-start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The AI layer: Rekognition, Transcribe, and Bedrock&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each of these handles a different modality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Rekognition&lt;/strong&gt; handles visual analysis, object and scene labels, celebrity detection, content moderation flags, and text-in-image (OCR) for video and still images. For video, jobs are asynchronous: you start a job and receive an SNS notification when results are ready.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Transcribe&lt;/strong&gt; turns audio into a structured transcript with timestamps, speaker diarisation, and optional custom vocabularies. For a podcast pipeline, custom vocabularies dramatically improve accuracy on domain-specific terms (product names, acronyms, people).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; is where the "AI-powered" part lives in the modern sense. Given the raw transcript and Rekognition labels, a Bedrock model (Claude, Nova, Llama, depending on preference) produces the artefacts downstream systems actually want: a one-paragraph summary, chapter markers with timestamps, topic tags, a short SEO description, a list of named entities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The prompt design for the Bedrock step deserves its own attention. A pattern that works well is to ask the model for a &lt;strong&gt;strict JSON response&lt;/strong&gt; against a schema you define, rather than free-form text, so the persistence step can store it without fragile parsing:&lt;/p&gt;

&lt;p&gt;You will receive a transcript and a list of visual labels from a video.&lt;/p&gt;

&lt;p&gt;Return ONLY a JSON object with this exact shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;words)&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"chapters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"start_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"people"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"organisations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"places"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
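
&lt;p&gt;On the consuming side, treat the model output as untrusted and validate it before persisting. A hedged sketch, assuming the response text has already been extracted from the Bedrock API reply; the required keys mirror the schema above, and the fence-stripping heuristic is illustrative:&lt;/p&gt;

```python
import json

REQUIRED_KEYS = ("summary", "chapters", "topics", "entities")

def parse_model_output(text):
    # Models sometimes wrap JSON in prose or code fences, so take the
    # outermost brace-delimited object before parsing.
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON object in model output")
    doc = json.loads(text[start:end])
    missing = [k for k in REQUIRED_KEYS if k not in doc]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return doc
```

&lt;p&gt;Raising on a malformed response is deliberate: it pushes the failure into the state machine's retry and catch paths instead of persisting a partial record.&lt;/p&gt;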



&lt;p&gt;&lt;strong&gt;6. DynamoDB as the metadata store&lt;/strong&gt;&lt;br&gt;
DynamoDB is the right fit here because the access patterns are simple and the shape is predictable. A single table keyed by &lt;code&gt;media_id&lt;/code&gt; (partition key) with &lt;code&gt;created_at&lt;/code&gt; as the sort key handles the primary "fetch processing result for this upload" pattern. A GSI on tenant or status lets you answer "show me all videos processed in the last 24 hours" without scanning.&lt;br&gt;
Keep the Bedrock-generated summary and the raw transcript reference separate: summaries are small and hot, transcripts can be large and belong back in S3 with just a pointer in DynamoDB.&lt;/p&gt;
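
&lt;p&gt;A sketch of that key design in code; the table name, attribute names, and GSI choice are assumptions:&lt;/p&gt;

```python
import datetime

def build_record(media_id, tenant_id, summary, transcript_s3_uri, now=None):
    # Small, hot summary stays inline; the large transcript stays in S3,
    # referenced only by URI.
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return {
        "media_id": media_id,              # partition key
        "created_at": now.isoformat(),     # sort key; ISO 8601 sorts lexically
        "tenant_id": tenant_id,            # candidate GSI partition key
        "status": "PROCESSED",             # candidate GSI filter attribute
        "summary": summary,
        "transcript_s3_uri": transcript_s3_uri,
    }

def persist(record):
    import boto3  # real write; requires AWS credentials
    boto3.resource("dynamodb").Table("media-intelligence").put_item(Item=record)
```
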

&lt;h2&gt;
  
  
  Putting it together: an end-to-end request
&lt;/h2&gt;

&lt;p&gt;Concretely, here's what happens when a user uploads a 20-minute interview video:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The client performs a multipart upload to &lt;code&gt;s3://media-ingest-prod/incoming/&amp;lt;tenant-id&amp;gt;/&amp;lt;uuid&amp;gt;.mp4&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 emits an Object Created event to the default event bus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An EventBridge rule matches on the &lt;code&gt;incoming/&lt;/code&gt; prefix and starts a Step Functions execution, passing the bucket and key as input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The state machine classifies the file as video, then fans out in parallel: Transcribe starts on the audio track, Rekognition starts label detection and content moderation on the visual track.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Both jobs complete asynchronously; the state machine resumes via task tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Lambda gathers the transcript and labels, constructs a Bedrock prompt, and receives a structured JSON summary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A final step writes the full metadata record to DynamoDB and moves the original file from &lt;code&gt;incoming/&lt;/code&gt; to &lt;code&gt;processed/&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Any failure along the way is caught, logged to a failures table, and surfaced via an SNS topic that pages the on-call engineer if it crosses a threshold.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole thing runs without a single long-lived server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that bite you in production
&lt;/h2&gt;

&lt;p&gt;Diagrams make this look clean. A few practical notes from taking a pipeline like this to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IAM is the hard part&lt;/strong&gt;. Each Lambda and each Step Functions state needs a narrowly scoped role. Do not share a single execution role across the pipeline; one overly broad policy is how &lt;code&gt;s3:GetObject&lt;/code&gt; on &lt;code&gt;*&lt;/code&gt; turns into a security incident.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bedrock throttling is real&lt;/strong&gt;. Regional model quotas can and will throttle you under load. Wrap the Bedrock call in a retry policy with jitter, and consider provisioned throughput if the pipeline is on a critical path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Large transcripts exceed context windows&lt;/strong&gt;. A two-hour podcast transcript can be larger than a single model context. Chunk on speaker turns or paragraph boundaries, summarise each chunk, then do a reduce pass - a classic map-reduce over text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost lurks in the idle corners&lt;/strong&gt;. The obvious costs (Transcribe per audio-minute, Rekognition per video-minute, Bedrock per token) are easy to forecast. The easily-missed ones are Step Functions state transitions on high-volume pipelines and CloudWatch Logs ingestion if you log every event verbatim. Sample your logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability needs to span the whole flow&lt;/strong&gt;. A single correlation ID (the S3 object key or a generated media ID) should travel through every Lambda, Step Functions state, and DynamoDB record, so that when something breaks you can reconstruct exactly what happened with a single query.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
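
&lt;p&gt;The chunk-then-reduce idea from the context-window note above can be sketched with a pluggable summarise function standing in for the Bedrock call; the character budget and joining strategy are illustrative:&lt;/p&gt;

```python
def summarise_long_transcript(turns, summarise, max_chars=8000):
    # turns: list of speaker-turn strings; summarise: callable taking a string
    # and returning its summary (a Bedrock invocation in the real pipeline).
    # Map: pack whole speaker turns into chunks under the size budget.
    chunks, current = [], ""
    for turn in turns:
        if current and len(current) + len(turn) + 1 > max_chars:
            chunks.append(current)
            current = turn
        else:
            current = (current + "\n" + turn) if current else turn
    if current:
        chunks.append(current)
    partials = [summarise(c) for c in chunks]
    # Reduce: one final pass over the concatenated partial summaries.
    return summarise("\n".join(partials)) if len(partials) > 1 else partials[0]
```

&lt;p&gt;Splitting on speaker turns rather than raw character offsets keeps each chunk semantically coherent, which noticeably improves the partial summaries.&lt;/p&gt;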

&lt;h2&gt;
  
  
  Where this design shines
&lt;/h2&gt;

&lt;p&gt;The same skeleton supports a surprising range of products with only the Bedrock prompt and the DynamoDB schema changing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Podcast platforms&lt;/strong&gt; that want automatic show notes, chapter markers, and searchable transcripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Video intelligence tools&lt;/strong&gt; for media libraries: tagging, searching, and moderating large content archives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning and compliance products&lt;/strong&gt; that need to extract key points and generate quizzes from lecture recordings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content moderation systems&lt;/strong&gt; combining Rekognition's moderation labels with LLM-based policy reasoning for edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer support analytics&lt;/strong&gt; processing recorded calls to surface sentiment, topics, and escalation signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to take it next
&lt;/h2&gt;

&lt;p&gt;Once the base pipeline is in place, the interesting extensions are mostly at the edges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Realtime mode&lt;/strong&gt;. Swap batch Transcribe for Transcribe Streaming and emit partial results over WebSockets for live captioning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic search&lt;/strong&gt;. Pipe the Bedrock-generated summary and transcript chunks into an embedding model and store the vectors in OpenSearch or a vector store; now the media library is searchable by meaning, not just tags.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human-in-the-loop review&lt;/strong&gt;. For content moderation or compliance, route low-confidence Bedrock decisions to an SQS queue backed by a reviewer UI, and feed the decisions back as training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-tenant isolation&lt;/strong&gt;. Use S3 access points and dynamic Step Functions execution roles to enforce tenant boundaries at the infrastructure layer rather than in application code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My closing thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The interesting thing about this architecture isn't any individual service; it's that the boundaries between them are event-driven and declarative. You can reason about the system by looking at the state machine definition and the EventBridge rules, not by reading through layers of application code. That's what makes it durable: when the product team asks for a new capability ("can we also detect languages automatically?"), you add a state, not a service.&lt;/p&gt;

&lt;p&gt;Serverless isn't the right answer for every system, but for "something landed in storage, now go understand it," it's hard to beat. The pipeline scales with the work, costs track usage, and the blast radius of any one failure is a single execution, not the whole platform.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>lambda</category>
      <category>aws</category>
      <category>awscommunitybuilder</category>
    </item>
    <item>
      <title>Resurface Claude Code Usage Across Your Team with CloudWatch OTEL (No Lambda)</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Sat, 18 Apr 2026 12:59:36 +0000</pubDate>
      <link>https://dev.to/aws-builders/resurface-claude-code-usage-across-your-team-with-cloudwatch-otel-no-lambda-4p0i</link>
      <guid>https://dev.to/aws-builders/resurface-claude-code-usage-across-your-team-with-cloudwatch-otel-no-lambda-4p0i</guid>
      <description>&lt;p&gt;I've been building AI tooling infrastructure to empower a team of 50+ software engineers to do vibe coding safely. We went from 3 engineers using AI full-time to 50+ in 6 months — including non-engineers. (I co-presented on this journey at &lt;a href="https://the-quantum-nargle.github.io/agentcon-2026-hk-slides/" rel="noopener noreferrer"&gt;AgentCon Hong Kong 2026&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;One thing we learned: &lt;strong&gt;you can't improve what you can't measure.&lt;/strong&gt; Once you give a team AI coding tools, you want visibility into how they're being used — not to evaluate individual engineers, but to understand adoption patterns. Which tools are people reaching for? How large are the prompts? What tool calls are being made? Making these metrics visible to everyone helps the team learn from each other and helps champions pull others forward.&lt;/p&gt;

&lt;p&gt;This post is about the plumbing: how to get that telemetry data from coding agents into CloudWatch with minimal infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"But we already have an LLM gateway."&lt;/strong&gt; If your team routes AI traffic through a gateway like &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; or &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;AWS Bedrock&lt;/a&gt;, you already have token-level usage data. But if your engineers are on coding plans — Claude Team/Max, OpenCode Go, GitHub Copilot seats, ChatGPT Codex — the LLM calls bypass your gateway entirely. You lose visibility into the interesting stuff: how many tool calls per session, prompt sizes, which tools are being invoked, who's active at what times. That's where OTEL telemetry fills the gap.&lt;/p&gt;

&lt;p&gt;AI coding tools are shipping with built-in OpenTelemetry support. &lt;a href="https://docs.anthropic.com/en/docs/claude-code/monitoring-usage" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://support.claude.com/en/articles/14477985-monitor-claude-cowork-activity-with-opentelemetry" rel="noopener noreferrer"&gt;Claude CoWork&lt;/a&gt;, &lt;a href="https://docs.github.com/copilot/how-tos/copilot-sdk/observability/opentelemetry" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://geminicli.com/docs/cli/telemetry/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, and &lt;a href="https://github.com/LangGuard-AI/cursor-otel-hook" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; (via hooks) all export metrics, traces, and log events over OTLP/HTTP — token counts, tool durations, model latency, the works. &lt;a href="https://github.com/kirodotdev/Kiro/issues/6319" rel="noopener noreferrer"&gt;Kiro has an open feature request&lt;/a&gt; for native OTEL support too.&lt;/p&gt;

&lt;p&gt;There's one catch: &lt;strong&gt;CloudWatch's OTLP endpoints require SigV4 signing.&lt;/strong&gt; These tools' OTEL SDKs can't do that. Neither can most OTEL SDKs without an AWS-specific exporter or a collector sidecar.&lt;/p&gt;

&lt;p&gt;The usual fix is a Lambda function that receives OTLP, signs it, and forwards it. That means cold starts, packaging, and another thing to maintain.&lt;/p&gt;

&lt;p&gt;Here's a simpler way: &lt;strong&gt;API Gateway REST API with AWS Service Integration.&lt;/strong&gt; APIGW signs the request with SigV4 using an execution role. No Lambda. No collector. No code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef5flvd70ipoiq5n5psj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef5flvd70ipoiq5n5psj.png" alt="Expanding Brain Meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; CloudWatch has supported OTLP ingestion for &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;traces and logs&lt;/a&gt; for some time (availability varies by region — check the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;OTLP endpoints doc&lt;/a&gt;). &lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-cloudwatch-opentelemetry-metrics/" rel="noopener noreferrer"&gt;Native OTLP metrics support launched April 2, 2026&lt;/a&gt; in public preview, completing all three pillars of observability via OTLP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3fguhkeofv8x6gf2e2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3fguhkeofv8x6gf2e2w.png" alt="Architecture"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Coding Tool (OTEL SDK)
  ↓ OTLP/HTTP + x-api-key
API Gateway REST API
  ├→ POST /v1/metrics  → AWS Integration → monitoring (SigV4) → CloudWatch Metrics
  ├→ POST /v1/traces   → AWS Integration → xray (SigV4)      → X-Ray / CloudWatch Logs
  └→ POST /v1/logs     → AWS Integration → logs (SigV4)       → CloudWatch Logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client sends standard OTLP/HTTP requests with an API key. APIGW validates the key, assumes an IAM role, signs the request with SigV4, and forwards it to the CloudWatch OTLP endpoint. That's it.&lt;/p&gt;
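
&lt;p&gt;For an end-to-end smoke test, you can hand-post a minimal OTLP metrics payload through the gateway. The endpoint URL and API key below are placeholders; note the tools' SDKs send protobuf, while this sketch uses the OTLP/HTTP JSON encoding (verify against the CloudWatch OTLP docs that JSON is accepted in your region):&lt;/p&gt;

```python
import json
import time
import urllib.request

def build_payload():
    # Minimal OTLP/JSON metrics export request with a single gauge datapoint.
    # Field names follow the proto3 JSON mapping used by OTLP/HTTP.
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "otlp-smoke-test"}}
            ]},
            "scopeMetrics": [{"metrics": [{
                "name": "smoke.test",
                "gauge": {"dataPoints": [
                    {"asDouble": 1.0, "timeUnixNano": str(time.time_ns())}
                ]},
            }]}],
        }]
    }

def post_test_metric(endpoint, api_key):
    # endpoint and api_key are placeholders for your APIGW stage URL and key.
    req = urllib.request.Request(
        endpoint + "/v1/metrics",
        data=json.dumps(build_payload()).encode(),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    return urllib.request.urlopen(req).status
```
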

&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;p&gt;API Gateway REST API has an integration type called &lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/getting-started-aws-proxy.html" rel="noopener noreferrer"&gt;AWS Service Integration&lt;/a&gt;&lt;/strong&gt;. It can call any AWS service API and sign the request with SigV4 using an execution role. The &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;CloudWatch OTLP endpoints&lt;/a&gt; are standard AWS service endpoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;&lt;code&gt;monitoring.{region}.amazonaws.com/v1/metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;monitoring&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traces&lt;/td&gt;
&lt;td&gt;&lt;code&gt;xray.{region}.amazonaws.com/v1/traces&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;xray&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;logs.{region}.amazonaws.com/v1/logs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;logs&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;APIGW's integration URI format maps directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:apigateway:{region}:monitoring:path/v1/metrics
arn:aws:apigateway:{region}:xray:path/v1/traces
arn:aws:apigateway:{region}:logs:path/v1/logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
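&lt;p&gt;To see how the endpoint table and the URI format line up, a small helper (a sketch, with the region as a parameter) can derive each integration URI from its signal:&lt;/p&gt;

```python
# Signal -> (service, path), taken from the endpoint table above.
SIGNALS = {
    "metrics": ("monitoring", "/v1/metrics"),
    "traces": ("xray", "/v1/traces"),
    "logs": ("logs", "/v1/logs"),
}

def integration_uri(region: str, signal: str) -> str:
    """Build the APIGW AWS-integration URI for an OTLP signal."""
    service, path = SIGNALS[signal]
    return f"arn:aws:apigateway:{region}:{service}:path{path}"

for signal in SIGNALS:
    print(integration_uri("us-west-2", signal))
```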



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The full infrastructure is defined in a CloudFormation template (link at the bottom). Here's what it creates:&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM Execution Role
&lt;/h3&gt;

&lt;p&gt;APIGW needs an IAM role to sign requests to CloudWatch. The policy is scoped to only the actions and resources needed for OTLP ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;OtlpLogGroupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;otlp-logs"&lt;/span&gt;
    &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CloudWatch Logs log group for OTLP log ingestion&lt;/span&gt;

&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;OtlpExecutionRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::IAM::Role&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
            &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apigateway.amazonaws.com&lt;/span&gt;
            &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts:AssumeRole&lt;/span&gt;
      &lt;span class="na"&gt;Policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otlp-metrics&lt;/span&gt;
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
                &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cloudwatch:PutMetricData&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otlp-traces&lt;/span&gt;
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
                &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;xray:PutTraceSegments&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;xray:PutTelemetryRecords&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otlp-logs&lt;/span&gt;
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
                &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:PutLogEvents&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:CreateLogStream&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:DescribeLogStreams&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${OtlpLogGroupName}:*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: &lt;code&gt;cloudwatch:PutMetricData&lt;/code&gt; doesn't support resource-level ARNs. The &lt;code&gt;cloudwatch:namespace&lt;/code&gt; condition key exists but does not apply to the OTLP ingestion path — metrics are accepted regardless of namespace. X-Ray &lt;code&gt;PutTraceSegments&lt;/code&gt; also doesn't support resource-level restrictions. Logs permissions are scoped to a specific log group via the &lt;code&gt;OtlpLogGroupName&lt;/code&gt; parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Gateway with AWS Service Integration
&lt;/h3&gt;

&lt;p&gt;Each OTLP signal gets its own resource with an AWS integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;MetricsMethod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::ApiGateway::Method&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;HttpMethod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
    &lt;span class="na"&gt;AuthorizationType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
    &lt;span class="na"&gt;ApiKeyRequired&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;Integration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS&lt;/span&gt;
      &lt;span class="na"&gt;IntegrationHttpMethod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
      &lt;span class="na"&gt;Uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:apigateway:${AWS::Region}:monitoring:path/v1/metrics"&lt;/span&gt;
      &lt;span class="na"&gt;Credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;OtlpExecutionRole.Arn&lt;/span&gt;
      &lt;span class="na"&gt;PassthroughBehavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WHEN_NO_MATCH&lt;/span&gt;
      &lt;span class="na"&gt;ContentHandling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CONVERT_TO_TEXT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern for &lt;code&gt;/v1/traces&lt;/code&gt; (service: &lt;code&gt;xray&lt;/code&gt;) and &lt;code&gt;/v1/logs&lt;/code&gt; (service: &lt;code&gt;logs&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  API Key Authentication
&lt;/h3&gt;

&lt;p&gt;Protect the endpoint with an API key so only your tools can send telemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ApiKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::ApiGateway::ApiKey&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;UsagePlan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::ApiGateway::UsagePlan&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ApiStages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ApiId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;Api&lt;/span&gt;
        &lt;span class="na"&gt;Stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;Stage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configure Your Tools
&lt;/h2&gt;

&lt;p&gt;The proxy works with any tool that supports standard OTEL environment variables. Here's how to configure each:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; I personally use Claude Code routed through a custom LLM gateway (not a coding plan), since some coding plans aren't available in the region I live in. The configurations below are based on each tool's official documentation — your mileage may vary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/monitoring-usage" rel="noopener noreferrer"&gt;Official monitoring docs&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRICS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_LOGS_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/json
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://xxx.execute-api.us-west-2.amazonaws.com/prod
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;x-api-key&lt;span class="o"&gt;=&lt;/span&gt;your-api-key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For short-lived tasks, lower the export interval so data flushes before the process exits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_METRIC_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_LOGS_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For traces (beta):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENHANCED_TELEMETRY_BETA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_TRACES_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;otlp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_TRACES_EXPORT_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enforcing OTEL across your team:&lt;/strong&gt; Claude Code supports &lt;a href="https://code.claude.com/docs/en/settings#settings-files" rel="noopener noreferrer"&gt;managed settings&lt;/a&gt; via &lt;code&gt;managed-settings.json&lt;/code&gt;, deployable through MDM (Jamf, Intune, etc.). This lets you enforce OTEL configuration org-wide — engineers don't need to set environment variables manually, and they can't opt out.&lt;/p&gt;
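&lt;p&gt;As a sketch, assuming the &lt;code&gt;env&lt;/code&gt; key documented for Claude Code settings files, a managed-settings file enforcing the configuration above might look like this (endpoint and key are placeholders):&lt;/p&gt;

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/json",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://xxx.execute-api.us-west-2.amazonaws.com/prod",
    "OTEL_EXPORTER_OTLP_HEADERS": "x-api-key=your-api-key"
  }
}
```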

&lt;h3&gt;
  
  
  Claude CoWork (Team &amp;amp; Enterprise)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://support.claude.com/en/articles/14477985-monitor-claude-cowork-activity-with-opentelemetry" rel="noopener noreferrer"&gt;CoWork monitoring docs&lt;/a&gt; — configure via Admin Settings → Cowork → Monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OTLP endpoint: your APIGW URL&lt;/li&gt;
&lt;li&gt;OTLP protocol: &lt;code&gt;http/json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;OTLP headers: &lt;code&gt;x-api-key=your-api-key&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CoWork streams user prompts, tool/MCP invocations, file access, human approval decisions, and API request details. It shares the same OTel event schema as Claude Code via the Claude Agent SDK — you can distinguish them by &lt;code&gt;terminal.type&lt;/code&gt; (&lt;code&gt;cowork&lt;/code&gt; vs &lt;code&gt;cli&lt;/code&gt;).&lt;/p&gt;
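&lt;p&gt;Splitting a mixed event stream on that attribute is a simple transform. A sketch, with event shapes simplified from the samples shown later in this post:&lt;/p&gt;

```python
def split_by_terminal(events):
    """Group OTel log events by their terminal.type attribute."""
    groups = {}
    for event in events:
        terminal = event.get("attributes", {}).get("terminal.type", "unknown")
        groups.setdefault(terminal, []).append(event)
    return groups

# Hypothetical events: one from Claude Code CLI, one from CoWork.
events = [
    {"body": "claude_code.user_prompt", "attributes": {"terminal.type": "cli"}},
    {"body": "claude_code.user_prompt", "attributes": {"terminal.type": "cowork"}},
]
groups = split_by_terminal(events)
```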

&lt;h3&gt;
  
  
  GitHub Copilot CLI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.github.com/en/copilot/reference/copilot-cli-reference/cli-command-reference#opentelemetry-monitoring" rel="noopener noreferrer"&gt;Copilot CLI OTel reference&lt;/a&gt; — available since Copilot CLI 1.0.4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;COPILOT_OTEL_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://xxx.execute-api.us-west-2.amazonaws.com/prod
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;COPILOT_OTEL_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;x-api-key&lt;span class="o"&gt;=&lt;/span&gt;your-api-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gemini CLI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://geminicli.com/docs/cli/telemetry/" rel="noopener noreferrer"&gt;Gemini CLI telemetry docs&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_CLI_OTEL_EXPORT_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://xxx.execute-api.us-west-2.amazonaws.com/prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cursor (via Hooks)
&lt;/h3&gt;

&lt;p&gt;Cursor doesn't have native OTEL export yet, but the community &lt;a href="https://github.com/LangGuard-AI/cursor-otel-hook" rel="noopener noreferrer"&gt;cursor-otel-hook&lt;/a&gt; project captures agent activity via Cursor's hook system and exports traces to any OTLP endpoint. Configure via &lt;code&gt;otel_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_EXPORTER_OTLP_ENDPOINT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://xxx.execute-api.us-west-2.amazonaws.com/prod/v1/traces"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_EXPORTER_OTLP_PROTOCOL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http/json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"OTEL_EXPORTER_OTLP_HEADERS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x-api-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-api-key"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;p&gt;CloudWatch receives standard OTLP data. For Claude Code specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: &lt;code&gt;claude_code.token.usage&lt;/code&gt; (by &lt;code&gt;token.type&lt;/code&gt;: input/output/cache_read/cache_creation), &lt;code&gt;claude_code.cost.usage&lt;/code&gt; (USD), &lt;code&gt;claude_code.session.count&lt;/code&gt;, &lt;code&gt;claude_code.lines_of_code.count&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt; (beta): Spans linking each user prompt → API requests → tool executions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log events&lt;/strong&gt;: &lt;code&gt;claude_code.user_prompt&lt;/code&gt;, &lt;code&gt;claude_code.tool_decision&lt;/code&gt;, &lt;code&gt;claude_code.tool_result&lt;/code&gt;, &lt;code&gt;claude_code.api_request&lt;/code&gt; — all tagged with &lt;code&gt;session.id&lt;/code&gt; and &lt;code&gt;service.name=claude-code&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what real Claude Code log events look like after flowing through the proxy into CloudWatch Logs. This is actual data from an E2E test — a single prompt that triggered a Bash tool call:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;claude_code.user_prompt&lt;/code&gt;&lt;/strong&gt; — emitted when the user sends a prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host.arch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arm64"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"os.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"linux"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"service.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"service.version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.1.114"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"os.version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"6.17.0-1010-aws"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.anthropic.claude_code.events"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.1.114"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude_code.user_prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event.sequence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1c257d04..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_length"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"40"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"terminal.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"non-interactive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event.timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-18T11:23:13.187Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;REDACTED&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"session.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"846ab649-8bba-471e-8ec5-8756116d0840"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"88475ee2-59c2-4137-9201-5540c6a6cad1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;claude_code.tool_result&lt;/code&gt;&lt;/strong&gt; — emitted after each tool execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude_code.tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool_result_size_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"899"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool_input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;command&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ls&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;description&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;List files in current directory&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"95"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"session.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"846ab649-8bba-471e-8ec5-8756116d0840"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"88475ee2-59c2-4137-9201-5540c6a6cad1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;claude_code.api_request&lt;/code&gt;&lt;/strong&gt; — emitted after each API call with token counts and cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude_code.api_request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-5-20250929"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"142"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"61"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cache_read_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cache_creation_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"25848"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.098271"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4950"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"speed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"normal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"session.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"846ab649-8bba-471e-8ec5-8756116d0840"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"88475ee2-59c2-4137-9201-5540c6a6cad1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All events share the same &lt;code&gt;prompt.id&lt;/code&gt;, linking them into a single interaction. The &lt;code&gt;event.sequence&lt;/code&gt; field orders events within a prompt. Every record carries &lt;code&gt;service.name=claude-code&lt;/code&gt; in resource attributes, so isolating Claude Code telemetry in a mixed pipeline is trivial — just filter on that in CloudWatch Logs Insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`service.name`&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'claude-code'&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'claude_code.api_request'&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
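&lt;p&gt;The same fields support aggregation. As a sketch (the attribute names come from the event payload shown earlier; note that &lt;code&gt;cost_usd&lt;/code&gt; is emitted as a string, so verify that Logs Insights coerces it numerically in your account), a per-model daily cost rollup could look like:&lt;/p&gt;

```sql
fields @timestamp
| filter resource.attributes.`service.name` = 'claude-code'
| filter body = 'claude_code.api_request'
| stats sum(attributes.cost_usd) as total_cost_usd, count(*) as requests by attributes.model, bin(1d)
```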



&lt;h2&gt;
  
  
  Region Availability
&lt;/h2&gt;

&lt;p&gt;CloudWatch OTLP endpoints are available in most regions but &lt;strong&gt;not all&lt;/strong&gt;. The &lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-cloudwatch-opentelemetry-metrics/" rel="noopener noreferrer"&gt;OTLP metrics preview&lt;/a&gt; launched in 5 regions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Regions&lt;/th&gt;
&lt;th&gt;Docs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metrics (preview)&lt;/td&gt;
&lt;td&gt;us-east-1, us-west-2, ap-southeast-1, ap-southeast-2, eu-west-1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-cloudwatch-opentelemetry-metrics/" rel="noopener noreferrer"&gt;Announcement&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traces&lt;/td&gt;
&lt;td&gt;Most commercial regions&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;OTLP Endpoints&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;Most commercial regions&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;OTLP Endpoints&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tested and confirmed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;th&gt;Traces&lt;/th&gt;
&lt;th&gt;Logs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;us-west-2&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ap-southeast-1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ap-east-1 (Hong Kong)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your primary region doesn't support it, deploy the proxy in a supported region. The APIGW endpoint is accessible from anywhere.&lt;/p&gt;

&lt;p&gt;For the full list of CloudWatch service endpoints by region, see the &lt;a href="https://docs.aws.amazon.com/general/latest/gr/cw_region.html" rel="noopener noreferrer"&gt;AWS General Reference&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotchas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;X-Ray traces require manual setup.&lt;/strong&gt; The CloudFormation template creates the proxy endpoints, but X-Ray traces need two additional steps that aren't in the template:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set CloudWatch Logs as the trace segment destination:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws xray update-trace-segment-destination &lt;span class="nt"&gt;--destination&lt;/span&gt; CloudWatchLogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Create a CloudWatch Logs resource policy allowing X-Ray to write to the &lt;code&gt;aws/spans&lt;/code&gt; log group:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws logs put-resource-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-name&lt;/span&gt; XRayAccessPolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-document&lt;/span&gt; &lt;span class="s1"&gt;'{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"xray.amazonaws.com"},"Action":["logs:PutLogEvents","logs:CreateLogGroup","logs:CreateLogStream"],"Resource":"*"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without these, traces will return &lt;code&gt;AccessDeniedException&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Logs supports bearer token auth.&lt;/strong&gt; The &lt;code&gt;/v1/logs&lt;/code&gt; endpoint supports &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;bearer token authentication&lt;/a&gt; without SigV4 — but only for logs. Metrics and traces still require SigV4, which is why the APIGW proxy is needed for a unified endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use &lt;code&gt;http/json&lt;/code&gt;, not &lt;code&gt;http/protobuf&lt;/code&gt;.&lt;/strong&gt; CloudWatch accepts both formats, but API Gateway's &lt;code&gt;CONVERT_TO_TEXT&lt;/code&gt; content handling can corrupt binary protobuf payloads in transit. Set &lt;code&gt;OTEL_EXPORTER_OTLP_PROTOCOL=http/json&lt;/code&gt; to avoid this. JSON is also easier to debug in APIGW execution logs. Most coding tools default to protobuf, so you'll need to override this explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Gateway payload limit.&lt;/strong&gt; REST API has a 10MB payload limit. OTLP batches from coding tools are well under this, but keep it in mind if you're aggregating from multiple sources. CloudWatch's own limits are 1MB for metrics and logs, 5MB for traces (&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;full limits&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST API, not HTTP API.&lt;/strong&gt; Only REST API supports the &lt;code&gt;AWS&lt;/code&gt; integration type needed for SigV4 service proxying. HTTP API does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;p&gt;This is about as cheap as it gets for a telemetry pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;~$3.50 / million requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Metrics&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;Standard CW pricing&lt;/a&gt; (free during OTel metrics preview)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Logs&lt;/td&gt;
&lt;td&gt;&lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;Standard CW pricing&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda&lt;/td&gt;
&lt;td&gt;$0 (there is none)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No idle cost. No provisioned capacity. Pure pay-per-request.&lt;/p&gt;

&lt;p&gt;For comparison: a Lambda-based OTLP forwarder would add ~$0.20/million invocations plus compute time, but gives you retry logic and transformation capabilities. At typical coding agent volumes (a few hundred requests/day per developer), the cost difference is negligible — the real win is operational simplicity.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Use This
&lt;/h2&gt;

&lt;p&gt;This proxy is optimized for simplicity. It's the right choice for low-to-moderate telemetry volumes from coding agents and developer tools. But it has tradeoffs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Retries&lt;/th&gt;
&lt;th&gt;Multi-destination&lt;/th&gt;
&lt;th&gt;Transformation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;This proxy (APIGW)&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;~$3.50/M req&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTel Collector&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Compute cost&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda forwarder&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;~$0.20/M + compute&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ADOT SDK (in-app)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Free (SigV4 native)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS (Datadog, etc.)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Consider an OTel Collector or Lambda forwarder instead if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt; — thousands of requests/second from many sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry and buffering&lt;/strong&gt; — this proxy is fire-and-forget; if CloudWatch returns an error, the data is lost. OTEL SDKs have built-in retry, but only for transient failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-destination routing&lt;/strong&gt; — fan out to CloudWatch + Datadog + S3 simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payload transformation&lt;/strong&gt; — filter, enrich, or redact telemetry before ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance requirements&lt;/strong&gt; — audit trails, guaranteed delivery, or data residency controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most coding agent monitoring use cases (a team of 5-50 developers), this proxy handles the volume comfortably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;p&gt;The proxy uses API key authentication — simple but not the strongest option. Here's how to harden it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attach AWS WAF to the REST API.&lt;/strong&gt; Add rate limiting, IP allowlisting, or geo-blocking to prevent abuse. A single WAF WebACL with a rate-based rule (e.g., 1000 req/5min per IP) costs ~$6/month and stops most abuse patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rotate API keys.&lt;/strong&gt; APIGW supports multiple API keys per usage plan. Create a new key, distribute it, then disable the old one — zero downtime rotation.&lt;/p&gt;
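&lt;p&gt;A rotation sketch with the plain CLI, under the assumption of one usage plan. All IDs and key names here are placeholders; confirm the flags against your AWS CLI version:&lt;/p&gt;

```shell
# Sketch: zero-downtime API key rotation. All IDs below are placeholders.
# 1. Create the replacement key
aws apigateway create-api-key --name otlp-proxy-key-2 --enabled
# 2. Attach it to the existing usage plan
aws apigateway create-usage-plan-key \
  --usage-plan-id abc123 --key-id newkey456 --key-type API_KEY
# 3. Once all clients use the new key, disable the old one
aws apigateway update-api-key --api-key oldkey789 \
  --patch-operations op=replace,path=/enabled,value=false
```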

&lt;p&gt;&lt;strong&gt;Consider IAM auth for internal use.&lt;/strong&gt; If your tools run inside AWS (EC2, ECS, Lambda), switch &lt;code&gt;AuthorizationType&lt;/code&gt; from &lt;code&gt;NONE&lt;/code&gt; to &lt;code&gt;AWS_IAM&lt;/code&gt; and drop the API key entirely. The caller signs requests with SigV4 using their IAM role — no shared secrets. This doesn't work for external tools like Claude Code on developer laptops, but it's ideal for CI/CD pipelines or server-side agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Egress control.&lt;/strong&gt; If you're running coding agents in a controlled environment, restrict outbound traffic to only your APIGW endpoint. This prevents telemetry from leaking to unauthorized collectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Coding Agents
&lt;/h2&gt;

&lt;p&gt;This proxy works with &lt;strong&gt;any OTEL SDK&lt;/strong&gt; that supports OTLP/HTTP. If your tool can set &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; and &lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt;, it can ship telemetry to CloudWatch through this proxy.&lt;/p&gt;
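&lt;p&gt;As a minimal configuration sketch (the endpoint URL and API key are placeholders for your own deployed stack's outputs), the full environment for a generic OTLP/HTTP tool is three variables:&lt;/p&gt;

```shell
# Placeholders: substitute your own APIGW stage URL and API key.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://abc123.execute-api.us-east-1.amazonaws.com/prod"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=YOUR_API_KEY"
# JSON, not protobuf, per the Gotchas section
export OTEL_EXPORTER_OTLP_PROTOCOL="http/json"
```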

&lt;p&gt;Potential use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI coding agents&lt;/strong&gt; (Claude Code, CoWork, Copilot, Cursor, Gemini CLI) — track token usage, costs, and tool calls across your org&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools&lt;/strong&gt; — ship metrics without embedding AWS credentials in client apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipelines&lt;/strong&gt; — export build/test telemetry to CloudWatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-premises services&lt;/strong&gt; — send OTLP from outside AWS without running ADOT Collector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For apps running inside AWS with IAM roles available, consider the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLP-UsingADOT.html" rel="noopener noreferrer"&gt;ADOT SDK&lt;/a&gt; for collector-less telemetry with native SigV4 signing — no proxy needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code &amp;amp; One-Click Deploy
&lt;/h2&gt;

&lt;p&gt;The CloudFormation template and full documentation are on GitHub:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/gabrielkoo/otlp-cloudwatch-proxy" rel="noopener noreferrer"&gt;gabrielkoo/otlp-cloudwatch-proxy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One-click deploy to supported regions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;th&gt;Deploy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;US East (N. Virginia)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://raw.githubusercontent.com/gabrielkoo/otlp-cloudwatch-proxy/main/template.yaml&amp;amp;stackName=otlp-cloudwatch-proxy" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2ghcs7c9kfqqlnbap7j.png" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US West (Oregon)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/create/review?templateURL=https://raw.githubusercontent.com/gabrielkoo/otlp-cloudwatch-proxy/main/template.yaml&amp;amp;stackName=otlp-cloudwatch-proxy" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2ghcs7c9kfqqlnbap7j.png" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asia Pacific (Singapore)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-1#/stacks/create/review?templateURL=https://raw.githubusercontent.com/gabrielkoo/otlp-cloudwatch-proxy/main/template.yaml&amp;amp;stackName=otlp-cloudwatch-proxy" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2ghcs7c9kfqqlnbap7j.png" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asia Pacific (Sydney)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/create/review?templateURL=https://raw.githubusercontent.com/gabrielkoo/otlp-cloudwatch-proxy/main/template.yaml&amp;amp;stackName=otlp-cloudwatch-proxy" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2ghcs7c9kfqqlnbap7j.png" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Europe (Ireland)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/create/review?templateURL=https://raw.githubusercontent.com/gabrielkoo/otlp-cloudwatch-proxy/main/template.yaml&amp;amp;stackName=otlp-cloudwatch-proxy" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2ghcs7c9kfqqlnbap7j.png" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Built and validated on a Saturday morning with Claude Code + OpenClaw. Zero Lambda functions were harmed in the making of this article.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/aws-solutions-library-samples/guidance-for-claude-code-with-amazon-bedrock/blob/main/assets/docs/MONITORING.md" rel="noopener noreferrer"&gt;AWS Guidance for Claude Code with Amazon Bedrock — Monitoring&lt;/a&gt;&lt;/strong&gt; — A comprehensive (and admittedly overkill) reference implementation using ECS Fargate + ALB + ADOT Collector + Lambda + DynamoDB + Kinesis + Athena. Great if you want to see the full spectrum of what can be measured: per-user token tracking, quota monitoring, cost dashboards, and an analytics data lake. If you need all of that, use it. If you just need telemetry flowing to CloudWatch, the one-template proxy in this post will do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://code.claude.com/docs/en/monitoring-usage" rel="noopener noreferrer"&gt;Claude Code Monitoring Docs&lt;/a&gt;&lt;/strong&gt; — Official OTEL configuration reference, including all metrics, events, and traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://code.claude.com/docs/en/settings#settings-files" rel="noopener noreferrer"&gt;Claude Code Managed Settings&lt;/a&gt;&lt;/strong&gt; — How to deploy &lt;code&gt;managed-settings.json&lt;/code&gt; via MDM for org-wide OTEL enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OTLPEndpoint.html" rel="noopener noreferrer"&gt;CloudWatch OTLP Endpoints&lt;/a&gt;&lt;/strong&gt; — AWS docs on native OTLP ingestion for metrics, traces, and logs.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>opentelemetry</category>
      <category>cloudwatch</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Data &amp; AI Stories #02: Amazon Bedrock Data Automation</title>
      <dc:creator>Sedat SALMAN</dc:creator>
      <pubDate>Sat, 18 Apr 2026 05:01:56 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-data-ai-stories-02-amazon-bedrock-data-automation-1gg7</link>
      <guid>https://dev.to/aws-builders/aws-data-ai-stories-02-amazon-bedrock-data-automation-1gg7</guid>
      <description>&lt;p&gt;In the first article, I talked about multimodal AI at a high level.&lt;/p&gt;

&lt;p&gt;Now it is time to go one step deeper.&lt;/p&gt;

&lt;p&gt;When we say multimodal AI, one of the first real challenges is not the model itself. The first challenge is the data. In most environments, the input is messy, unstructured, and spread across different formats such as documents, images, audio, and video. Amazon Bedrock Data Automation, or BDA, is designed for exactly this problem: extracting useful insights from unstructured multimodal content and turning it into structured output that applications can use.&lt;/p&gt;

&lt;p&gt;For me, BDA is not the “chat” layer. It is the processing layer.&lt;/p&gt;

&lt;p&gt;That is what makes it important.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon Bedrock Data Automation?
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock Data Automation is a managed AWS capability that automates insight generation from unstructured content such as documents, images, audio, and video. Instead of building separate extraction pipelines for each format, you can use BDA to generate structured outputs from multimodal input in a more consistent way.&lt;/p&gt;

&lt;p&gt;This is useful because many AI projects fail at the beginning.&lt;/p&gt;

&lt;p&gt;Not because the model is weak, but because the source data is not ready.&lt;/p&gt;

&lt;p&gt;Think about a few common examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scanned PDFs&lt;/li&gt;
&lt;li&gt;invoices&lt;/li&gt;
&lt;li&gt;screenshots&lt;/li&gt;
&lt;li&gt;call recordings&lt;/li&gt;
&lt;li&gt;inspection videos&lt;/li&gt;
&lt;li&gt;photos from the field&lt;/li&gt;
&lt;li&gt;reports with mixed text and visuals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all valuable, but none of them are naturally clean inputs for downstream AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does BDA matter?
&lt;/h2&gt;

&lt;p&gt;Because AI systems need structure.&lt;/p&gt;

&lt;p&gt;Before you build search, RAG, analytics, or assistants, you usually need to answer a simpler question:&lt;/p&gt;

&lt;p&gt;How do I turn raw content into usable information?&lt;/p&gt;

&lt;p&gt;That is where BDA fits.&lt;/p&gt;

&lt;p&gt;AWS describes BDA as a service that can produce both standard output and custom output depending on the use case. Standard output gives predefined insights for a data type, while custom output lets you define tailored extraction logic. This makes BDA useful not only for generic processing, but also for business-specific workflows.&lt;/p&gt;

&lt;p&gt;So in practical terms, BDA can help when you want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extract content from complex documents&lt;/li&gt;
&lt;li&gt;summarize audio or video&lt;/li&gt;
&lt;li&gt;generate structured metadata&lt;/li&gt;
&lt;li&gt;prepare content for retrieval&lt;/li&gt;
&lt;li&gt;feed another AI workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I would position BDA in an AWS architecture
&lt;/h2&gt;

&lt;p&gt;I would place BDA near the beginning of the workflow.&lt;/p&gt;

&lt;p&gt;A simple view looks like this:&lt;/p&gt;

&lt;p&gt;Input data → BDA processing → structured output → storage/indexing → retrieval/generation&lt;/p&gt;

&lt;p&gt;This is also how AWS examples position it. In AWS guidance and solution examples, BDA is commonly used after content lands in S3, and before services such as Knowledge Bases, vector stores, or agentic applications use the extracted results.&lt;/p&gt;

&lt;p&gt;So if Part 1 was about multimodal AI, Part 2 is about making multimodal content usable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Main concepts to understand
&lt;/h2&gt;

&lt;p&gt;There are two core ideas in BDA that matter most: projects and blueprints.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Projects
&lt;/h3&gt;

&lt;p&gt;A project is the main configuration container in BDA. AWS documentation describes it as the grouping that holds standard and optional custom output settings for processing. When you call the async API with a project ARN, BDA uses that project’s configuration to process the file and produce the defined outputs.&lt;/p&gt;
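&lt;p&gt;As a hedged sketch of what that call looks like (the ARNs, bucket names, and region are placeholders, and the parameter names should be checked against the current CLI reference), invoking a project against a file in S3 goes through the async runtime API:&lt;/p&gt;

```shell
# Sketch only: all ARNs, buckets, and the region below are placeholders.
aws bedrock-data-automation-runtime invoke-data-automation-async \
  --input-configuration '{"s3Uri": "s3://my-input-bucket/invoice.pdf"}' \
  --output-configuration '{"s3Uri": "s3://my-output-bucket/results/"}' \
  --data-automation-configuration '{"dataAutomationProjectArn": "arn:aws:bedrock:us-east-1:123456789012:data-automation-project/my-project", "stage": "LIVE"}' \
  --data-automation-profile-arn "arn:aws:bedrock:us-east-1:123456789012:data-automation-profile/us.data-automation-v1"
```

&lt;p&gt;The call returns an invocation ARN you can poll for status; the structured results land in the output S3 location.&lt;/p&gt;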

&lt;p&gt;In simple terms, a project is where you define how BDA should behave for your use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Blueprints
&lt;/h3&gt;

&lt;p&gt;Blueprints are what make custom extraction possible. AWS documentation explains that blueprints define the extraction logic and output format for custom outputs, allowing you to tailor BDA to your own business fields and data structures.&lt;/p&gt;

&lt;p&gt;This is one of the most valuable parts of BDA.&lt;/p&gt;

&lt;p&gt;Because in real projects, we usually do not want only generic output. We want specific fields, specific structure, and specific business meaning.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invoice number&lt;/li&gt;
&lt;li&gt;customer name&lt;/li&gt;
&lt;li&gt;incident category&lt;/li&gt;
&lt;li&gt;equipment ID&lt;/li&gt;
&lt;li&gt;inspection result&lt;/li&gt;
&lt;li&gt;priority level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where blueprints become important.&lt;/p&gt;
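&lt;p&gt;To make that concrete, here is an illustrative sketch of a minimal custom-output blueprint for an invoice. The key names (&lt;code&gt;inferenceType&lt;/code&gt;, &lt;code&gt;instruction&lt;/code&gt;) follow the BDA blueprint schema format as I understand it; treat this as a sketch to adapt against the official docs, not a copy-paste artifact:&lt;/p&gt;

```shell
# Illustrative blueprint schema for two of the fields listed above.
BLUEPRINT='{
  "class": "invoice",
  "description": "Extract key fields from supplier invoices",
  "properties": {
    "invoice_number": {
      "type": "string",
      "inferenceType": "explicit",
      "instruction": "The unique invoice identifier printed on the document"
    },
    "customer_name": {
      "type": "string",
      "inferenceType": "explicit",
      "instruction": "The legal name of the billed customer"
    }
  }
}'
# Write it to disk and confirm the JSON is well-formed before registering it
printf '%s\n' "$BLUEPRINT" | tee invoice-blueprint.json | python3 -m json.tool
```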

&lt;h2&gt;
  
  
  Standard output vs custom output
&lt;/h2&gt;

&lt;p&gt;This is one of the most important design choices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard output
&lt;/h3&gt;

&lt;p&gt;Standard output is faster to start with. AWS says it provides predefined insights based on the data type being processed, such as document semantics, audio transcripts, or video summaries and chapter summaries.&lt;/p&gt;

&lt;p&gt;This is a good option when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you want speed&lt;/li&gt;
&lt;li&gt;you are validating a use case&lt;/li&gt;
&lt;li&gt;you do not need very specific extraction fields yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Custom output
&lt;/h3&gt;

&lt;p&gt;Custom output is for more targeted use cases. With blueprints, you define the extraction logic and expected structure so the output matches your business need more closely. AWS has also added features such as blueprint instruction optimization to improve custom extraction accuracy using example assets and ground-truth labels.&lt;/p&gt;

&lt;p&gt;This is a better option when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you need specific fields&lt;/li&gt;
&lt;li&gt;you need consistency&lt;/li&gt;
&lt;li&gt;you are building a production workflow&lt;/li&gt;
&lt;li&gt;your documents or media are domain-specific&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, the normal journey is:&lt;br&gt;
start with standard output, then move to custom output when the use case becomes clearer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What types of content can BDA process?
&lt;/h2&gt;

&lt;p&gt;BDA is built for multimodal content. AWS documentation and product pages describe support for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;documents&lt;/li&gt;
&lt;li&gt;images&lt;/li&gt;
&lt;li&gt;audio&lt;/li&gt;
&lt;li&gt;video&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because many organizations have all four.&lt;/p&gt;

&lt;p&gt;And the value is not only in “understanding” each file type independently. The real value is creating a single processing layer for all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where BDA fits best
&lt;/h2&gt;

&lt;p&gt;I think BDA is strongest in these scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Intelligent document processing
&lt;/h3&gt;

&lt;p&gt;AWS has positioned BDA strongly for document-heavy workflows, and AWS blog content around intelligent document processing shows it being used to accelerate extraction and automation for business documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Preparing content for RAG
&lt;/h3&gt;

&lt;p&gt;AWS examples show BDA being used before Knowledge Bases and vector indexing so that multimodal content can be turned into cleaner, more useful retrieval input.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Audio and video understanding
&lt;/h3&gt;

&lt;p&gt;BDA can generate outputs such as transcripts and summaries from audio and video, and AWS recently expanded it with custom vocabulary support through the Data Automation Library to improve speech recognition accuracy for domain-specific terms.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Compliance and content review
&lt;/h3&gt;

&lt;p&gt;AWS has also shown BDA in workflows such as extracting attachment content for later PII detection and redaction with Guardrails, which makes it relevant beyond simple summarization.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical workflow example
&lt;/h2&gt;

&lt;p&gt;Let’s take a simple example.&lt;/p&gt;

&lt;p&gt;Imagine you are building a support or operations workflow.&lt;/p&gt;

&lt;p&gt;The input may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a maintenance PDF&lt;/li&gt;
&lt;li&gt;a photo from the field&lt;/li&gt;
&lt;li&gt;a voice note from an engineer&lt;/li&gt;
&lt;li&gt;a short inspection video&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a processing layer, each file stays isolated.&lt;/p&gt;

&lt;p&gt;With BDA, the system can extract usable outputs from these files, and the rest of the architecture can work with those results more easily. Those extracted outputs can then be stored, indexed, sent to a knowledge base, or used by an agentic workflow. This is consistent with AWS’s documented BDA flow and solution examples that combine S3, BDA, Knowledge Bases, OpenSearch, and AgentCore.&lt;/p&gt;

&lt;p&gt;That is why I see BDA as the bridge between raw content and usable AI workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Newer capabilities worth watching
&lt;/h2&gt;

&lt;p&gt;Two additions make BDA more interesting for real projects.&lt;/p&gt;

&lt;p&gt;First, AWS added blueprint instruction optimization, which helps improve custom field extraction accuracy using example documents and labels. This is useful because custom extraction often needs tuning before it becomes reliable.&lt;/p&gt;

&lt;p&gt;Second, AWS added custom vocabulary through the Data Automation Library for audio and video processing. This helps when your environment uses domain-specific terms, product names, or technical language that general transcription may miss.&lt;/p&gt;

&lt;p&gt;These are good signs that BDA is moving from “interesting feature” toward “serious processing layer.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to keep in mind
&lt;/h2&gt;

&lt;p&gt;BDA is powerful, but I would still keep a few points in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Start simple
&lt;/h3&gt;

&lt;p&gt;AWS recommends starting with standard output if you are new to the service. That makes sense because it helps validate the value quickly before you invest in custom extraction logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Design for the business output
&lt;/h3&gt;

&lt;p&gt;Do not begin with “what model do I want?”&lt;br&gt;
Begin with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what fields do I need?&lt;/li&gt;
&lt;li&gt;what decision will use this output?&lt;/li&gt;
&lt;li&gt;what system will consume it?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Watch input requirements
&lt;/h3&gt;

&lt;p&gt;AWS documents prerequisites and file requirements for BDA, including file-specific constraints that differ by content type and processing mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Treat prompts and blueprints carefully
&lt;/h3&gt;

&lt;p&gt;AWS explicitly notes that blueprint prompt input should come from trusted sources, which is an important reminder for secure enterprise design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;For me, Amazon Bedrock Data Automation is one of the most practical pieces in the current AWS multimodal stack.&lt;/p&gt;

&lt;p&gt;It does not replace retrieval, RAG, or agents.&lt;/p&gt;

&lt;p&gt;It enables them.&lt;/p&gt;

&lt;p&gt;If multimodal AI is the bigger vision, BDA is one of the first services that helps turn that vision into a usable workflow. It helps convert raw documents, images, audio, and video into outputs that the rest of your architecture can actually work with.&lt;/p&gt;

&lt;p&gt;That is why I would not treat BDA as a side feature.&lt;/p&gt;

&lt;p&gt;I would treat it as a foundational building block.&lt;/p&gt;

&lt;p&gt;In the next article, I will move one step further and focus on multimodal knowledge bases and how retrieval fits after the processing layer.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>datascience</category>
      <category>awsbigdata</category>
    </item>
    <item>
      <title>Supercharge AWS Diagrams in VSCode with Mermaid and Custom Icons</title>
      <dc:creator>Moya Richards</dc:creator>
      <pubDate>Sat, 18 Apr 2026 04:45:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/supercharge-aws-diagrams-in-vscode-with-mermaid-and-custom-icons-d0m</link>
      <guid>https://dev.to/aws-builders/supercharge-aws-diagrams-in-vscode-with-mermaid-and-custom-icons-d0m</guid>
      <description>&lt;p&gt;Want to turn your architecture docs into visual gold? With the new &lt;a href="https://mermaid.js.org/syntax/architecture.html" rel="noopener noreferrer"&gt;Mermaid.js architecture diagram syntax&lt;/a&gt; and custom icon packs, you can create AWS diagrams—all from within VSCode using Markdown.&lt;/p&gt;

&lt;p&gt;This guide walks you through how to:&lt;/p&gt;

&lt;p&gt;✅ Use Mermaid’s new architecture syntax&lt;br&gt;
✅ Set up custom AWS and icon libraries&lt;br&gt;
✅ Preview it all in VSCode using the Markdown Preview Enhanced extension&lt;/p&gt;

&lt;p&gt;Let’s dive in. 👇&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Icon Packs You'll Use
&lt;/h2&gt;

&lt;p&gt;Here are the icon libraries we’ll pull in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://icones.js.org/collection/logos" rel="noopener noreferrer"&gt;Iconify Logos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://icones.js.org/collection/lucide" rel="noopener noreferrer"&gt;Lucide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://icones.js.org/collection/fa" rel="noopener noreferrer"&gt;Font Awesome&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/awslabs/aws-icons-for-plantuml" rel="noopener noreferrer"&gt;AWS Icons-for-plantuml Pack (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These icons can be referenced in your Mermaid diagrams by name (e.g. &lt;code&gt;logos:aws-lambda&lt;/code&gt;, &lt;code&gt;aws:aurora&lt;/code&gt;, &lt;code&gt;fa:user&lt;/code&gt;).&lt;/p&gt;
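
&lt;p&gt;Once the packs are registered (Step 2), a node references any icon through its &lt;code&gt;pack:name&lt;/code&gt; identifier. A minimal sketch using two icons that appear later in this guide:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;architecture-beta
service fn(logos:aws-lambda)[AWS Lambda]
service db(logos:aws-dynamodb)[Amazon DynamoDB]
fn:R --&amp;gt; L:db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;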




&lt;h2&gt;
  
  
  ✅ Step 1: Install Markdown Preview Enhanced
&lt;/h2&gt;

&lt;p&gt;First, install the &lt;strong&gt;&lt;a href="https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced" rel="noopener noreferrer"&gt;Markdown Preview Enhanced&lt;/a&gt;&lt;/strong&gt; extension in VSCode.&lt;/p&gt;

&lt;p&gt;It enables rich previews for &lt;code&gt;.md&lt;/code&gt; files—supporting Mermaid, LaTeX, charts, diagrams, and more.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Step 2: Inject Icon Packs into Mermaid
&lt;/h2&gt;

&lt;p&gt;We’ll now customize the Markdown preview to load external icon packs into Mermaid.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to set it up:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In VSCode, press &lt;code&gt;Ctrl + P&lt;/code&gt;, then type:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   &amp;gt; Markdown Preview Enhanced: Customize Preview Html Head (WORKSPACE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Select the command. It will create a file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   .crossnote/head.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Paste the following code into that file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- Load custom icon packs for Mermaid architecture diagrams --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text/javascript"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;configureMermaidIconPacks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mermaid&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;registerIconPacks&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;logos&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://unpkg.com/@iconify-json/logos/icons.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
          &lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lucide&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://unpkg.com/@iconify-json/lucide/icons.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
          &lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fa&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://unpkg.com/@iconify-json/fa/icons.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
          &lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://raw.githubusercontent.com/awslabs/aws-icons-for-plantuml/aa30729ab2e125f13526020fa98ed5eb0ed86cc1/dist/aws-icons-mermaid.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
          &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;readyState&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;loading&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;configureMermaidIconPacks&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DOMContentLoaded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;configureMermaidIconPacks&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures Mermaid can load and render your custom icons directly in preview.&lt;/p&gt;




&lt;h2&gt;
  
  
  🖼️ Step 3: Create a Diagram with AWS Icons
&lt;/h2&gt;

&lt;p&gt;Now create a &lt;code&gt;.md&lt;/code&gt; file (e.g. &lt;code&gt;README.md&lt;/code&gt;) and embed a Mermaid diagram using the new architecture syntax and icons.&lt;/p&gt;

&lt;p&gt;Here’s a full example using AWS services:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
config:
  theme: base
  themeVariables:
    darkMode: false
    archEdgeColor: "#232F3E"
    archEdgeArrowColor: "#232F3E"
    archGroupBorderColor: "#ff862a"
---
architecture-beta


service user_service(lucide:user)[Users via Web Browser]

group cdk_infra(cloud)[AWS CDK Infrastructure]

service waf(logos:aws-waf)[AWS WAF] in cdk_infra
service cloudfront(logos:aws-cloudfront)[Amazon CloudFront] in cdk_infra
service cognito(logos:aws-cognito)[Amazon Cognito] in cdk_infra
service s3_front(logos:aws-s3)[Amazon S3 Frontend Hosting] in cdk_infra
service apigw(logos:aws-api-gateway)[Amazon API Gateway] in cdk_infra
service appsync(logos:aws-appsync)[AWS AppSync] in cdk_infra
service cdk_deploy(aws:cloudformation)[AWS CDK Deployment] in cdk_infra

group vpc_private_subnet(cloud)[VPC Private Subnet] in cdk_infra

service lambda(logos:aws-lambda)[AWS Lambda] in vpc_private_subnet
service dynamodb(logos:aws-dynamodb)[Amazon DynamoDB] in vpc_private_subnet
service aurora(logos:aws-aurora)[Amazon Aurora PostgreSQL] in vpc_private_subnet
service sagemaker(aws:sagemaker)[Amazon SageMaker] in vpc_private_subnet
service bedrock(aws:bedrock)[Amazon Bedrock] in vpc_private_subnet
service step_functions(logos:aws-step-functions)[AWS Step Functions] in vpc_private_subnet
service s3_ingest(logos:aws-s3)[Amazon S3 Ingestion] in vpc_private_subnet
service lambda_ingest(logos:aws-lambda)[AWS Lambda Ingestion] in vpc_private_subnet


user_service:R --&amp;gt; L:waf
waf:R --&amp;gt; L:cloudfront
cloudfront:R --&amp;gt; L:cognito
cognito:R --&amp;gt; L:s3_front
s3_front:R --&amp;gt; L:apigw
s3_front:B --&amp;gt; L:appsync
apigw:R --&amp;gt; L:lambda
appsync:R --&amp;gt; L:lambda
lambda:R --&amp;gt; L:dynamodb
lambda:B --&amp;gt; T:aurora
lambda:T --&amp;gt; B:bedrock
sagemaker:R --&amp;gt; B:s3_ingest
step_functions:R --&amp;gt; L:s3_ingest
s3_ingest:R --&amp;gt; L:lambda_ingest
lambda_ingest:R --&amp;gt; L:dynamodb
lambda_ingest:R --&amp;gt; L:aurora
cdk_deploy:R --&amp;gt; L:step_functions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://mermaid.live/edit#pako:eNqNVm1v2jAQ_iuW94VKpGpIC9QfJvFSumqtxEbXSmsq5CQmjUrsyElgtOp_39kmTihiTEjge-7u8RP7csc7DkXEMMGO4_g8FHyRxMTnCBUvLGUEBTRn1nygMqHBkuU6AqGIytc7SCdoQZcmDiEqw5erKGYjsRSSIB9_6XidiXfl413_QEqx_lfQtRRlNhQyYrIOWyz63Q5VYVqwiksKFhalZE7ACupz9cmZXCUhQyUs5lujtSzDBLQq7OTpF3znaJVQ9MgCNAQpADyr3Fhti8LodZ7whaStcCnK6ORp8DhDo_F3dKPAvJCl3vO5uduaLlpLEYuc0HXugGWyHgeTZ5TwmrLO0NwLKXjRSKxByE_pm-BopKCJgg4yiZgnhWjSGKTmMPYhgtybfxaSezZ55iG9PeMR-ibyIuHxISKaJfG6wQK2E9OCrenG0g2mN-jaYIdpsnzDwx0ijZhDHWTZDIyDxwFQxLKl2LQglZgjFTKlRSJ4fZljHZKy_XOt6mCVhfNMJisQO8_LgLOiKoiH6QhNjQfNtGefo5KzpGkQ0cazGMAIudVrnby_W80RbThNRRQ0WCrInutYA-PhUTJaSiGbggxQX5A20RSuOpZs9uP2KGNOY5bSVyb1gVurriBA7hRylClgkRThq-bZri3L0NjH1RQsmy9KHqr7zps1DQ7HOswFzABDkwo7zu3BBccsP_yq3Gg3kB3lMnWwz7dfH_9B6vNmvyM_keN8RbcEGpHP4csCdYOBjm_Xtds0CjUN9MI6qhYB8rcr69IvfQMfWly_s9CoVYANN4-nUO3ew82vhas6tw5Df09M1Vr4XsPDqmpAT1WGW6ohsZcHzp0aaT6ljaiWnwTagB3zoNzP_kp13aTqzXc04TaOZRJhosdrG6cMOpiy8bsakz7WU9nHajJCd1OSYDB-QFpG-W8hUkxgUEEitLL4xdKUWQRlM05oLGlqUVoWQrXUKgcoMHnHfzDxXO_03O15_f6l2z1zL7ptvMGkc9Y77fR6fQBcr9tz-_2PNn7Tm56ddl23e9HtnnsXl5ce_LQxi6CU5J35s6H_c4AsGCdquJe8wOTc63z8BTMIDjo" rel="noopener noreferrer"&gt;view in the Mermaid Live Editor&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🎉 That’s It!
&lt;/h2&gt;

&lt;p&gt;You now have a fully working setup to design AWS diagrams using Mermaid in VSCode—with custom icons and rich architecture syntax. It’s perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal docs and wikis&lt;/li&gt;
&lt;li&gt;Cloud architecture planning&lt;/li&gt;
&lt;li&gt;README files that &lt;em&gt;actually&lt;/em&gt; explain things&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🛠 Bonus Tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If icons don’t appear, double-check you’re in the Markdown &lt;strong&gt;Preview Enhanced&lt;/strong&gt; window (not the default preview).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>documentation</category>
      <category>vscode</category>
    </item>
    <item>
      <title>Building a Multimodal Agent with the ADK, AWS Fargate, and Gemini Flash Live 3.1</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 17 Apr 2026 19:20:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-a-multimodal-agent-with-the-adk-aws-fargate-and-gemini-flash-live-31-1613</link>
      <guid>https://dev.to/aws-builders/building-a-multimodal-agent-with-the-adk-aws-fargate-and-gemini-flash-live-31-1613</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build Agentic apps using the Gemini Live API with the Python programming language deployed to Amazon Fargate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsswy7qrkejwmjphhl7z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsswy7qrkejwmjphhl7z.jpeg" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python ADK Demos?
&lt;/h4&gt;

&lt;p&gt;Yes, there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal, working ADK streaming multimodal agent built on the latest Gemini Live models.&lt;/p&gt;

&lt;h4&gt;
  
  
  In the Spirit of Mr. McConaughey’s “alright, alright, alright”
&lt;/h4&gt;

&lt;p&gt;So what is different about this lab compared to all the others out there?&lt;/p&gt;

&lt;p&gt;This is one of the first implementations of the latest Gemini 3.1 Flash Live model with the Agent Development Kit (ADK). The starting point for the demo was an existing Codelab, which was updated and re-engineered with Gemini CLI.&lt;/p&gt;

&lt;p&gt;The original Codelab is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/way-back-home-level-3/instructions#0" rel="noopener noreferrer"&gt;Way Back Home - Building an ADK Bi-Directional Streaming Agent | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing, and it has a deep ecosystem of libraries for ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To validate your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python --version
Python 3.13.13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS Fargate
&lt;/h4&gt;

&lt;p&gt;AWS Fargate is a serverless, pay-as-you-go compute engine for containers that works with &lt;a href="https://aws.amazon.com/documentation-overview/fargate/" rel="noopener noreferrer"&gt;Amazon Elastic Container Service (ECS)&lt;/a&gt; or Elastic Kubernetes Service (EKS). It eliminates the need to manage, patch, or scale underlying &lt;a href="https://www.geeksforgeeks.org/devops/introduction-to-aws-fargate/" rel="noopener noreferrer"&gt;EC2 virtual machines&lt;/a&gt;. Fargate automatically allocates, scales, and manages compute infrastructure, allowing developers to focus solely on designing and operating applications.&lt;/p&gt;

&lt;p&gt;Details are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/fargate/" rel="noopener noreferrer"&gt;Serverless Compute - AWS Fargate - AWS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More information on Fargate is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html" rel="noopener noreferrer"&gt;Architect for AWS Fargate for Amazon ECS&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini Live Models
&lt;/h4&gt;

&lt;p&gt;Gemini Live is a conversational AI feature from Google that enables free-flowing, real-time voice, video, and screen-sharing interactions, allowing you to brainstorm, learn, or problem-solve through natural dialogue. Powered by the &lt;strong&gt;Gemini 3.1 Flash Live model&lt;/strong&gt;, it provides low-latency, human-like, and emotionally aware speech in over 200 countries.&lt;/p&gt;

&lt;p&gt;More details are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-live-preview" rel="noopener noreferrer"&gt;Gemini 3.1 Flash Live Preview | Gemini API | Google AI for Developers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Gemini Live models bring unique real-time capabilities that can be used directly from an agent. A summary of the model is also available here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://deepmind.google/models/model-cards/gemini-3-1-flash-live/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cpkde11pc37h41rmfar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cpkde11pc37h41rmfar.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm install -g @google/gemini-cli
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of Gemini CLI. You will need to authenticate with an API key or your Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;Gemini CLI needs a consistent, up-to-date version of Node.js. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting multimodal real-time agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK agent is built and tested locally. Next, the entire solution is deployed to Amazon ECS on Fargate.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. All of the relevant code examples and documentation are available in GitHub. The repo has a wide variety of samples, but this lab will focus on the ‘gemini31-fargate’ setup.&lt;/p&gt;

&lt;p&gt;The next step is to clone the GitHub repository to your local environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;cd ~
git clone https://github.com/xbill9/gemini-cli-aws
cd gemini31-fargate

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;span class="go"&gt;Environment setup complete.
GOOGLE_GENAI_USE_VERTEXAI=false
GOOGLE_CLOUD_PROJECT=aisprint-491218
GOOGLE_CLOUD_LOCATION=us-central1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;source set_env.sh
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;

&lt;h4&gt;
  
  
  Build the User Interface
&lt;/h4&gt;

&lt;p&gt;The front end files provide the user interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make frontend
&lt;span class="go"&gt;cd frontend &amp;amp;&amp;amp; npm install &amp;amp;&amp;amp; npm run build

up to date, audited 219 packages in 800ms

49 packages are looking for funding
  run `npm fund` for details

1 high severity vulnerability

To address all issues, run:
  npm audit fix

Run `npm audit` for details.

&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;frontend@0.0.0 build
&lt;span class="gp"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;vite build
&lt;span class="go"&gt;
vite v7.3.1 building client environment for production...
✓ 33 modules transformed.
dist/index.html 0.46 kB │ gzip: 0.29 kB
dist/assets/index-xOQlTZZB.css 21.60 kB │ gzip: 4.54 kB
dist/assets/index-DZmIx3HW.js 214.58 kB │ gzip: 67.45 kB
✓ built in 1.18s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Test The User Interface
&lt;/h4&gt;

&lt;p&gt;The mock server test script lets you exercise the interface and the browser's multimedia permission settings without making any external model calls or consuming tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make mock
&lt;span class="go"&gt;python mock/mock_server.py
Serving static files from: /home/xbill/gemini-cli-aws/gemini31-fargate/frontend/dist
INFO: Started server process [8689]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deployed mock front-end will look similar to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkwtr15968j3a2quyc3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkwtr15968j3a2quyc3z.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with the biometric_agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;testadk.sh
&lt;span class="go"&gt;connect to local ADK CLI 

/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72: UserWarning: [EXPERIMENTAL] feature FeatureName.PLUGGABLE_AUTH is enabled.
  check_feature_enabled()
Log setup complete: /tmp/agents_log/agent.20260415_200105.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent biometric_agent, type exit to exit.

[biometric_agent]: Scanner Online.

[user]: 

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Test The ADK Web Interface
&lt;/h4&gt;

&lt;p&gt;This tests the Audio / Video ADK agent interactions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;runadk.sh 
&lt;span class="go"&gt;connect on http://127.0.0.1:8000/

/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72: UserWarning: [EXPERIMENTAL] feature FeatureName.PLUGGABLE_AUTH is enabled.
  check_feature_enabled()
2026-04-15 20:01:46,272 - INFO - service_factory.py:266 - Using in-memory memory service
2026-04-15 20:01:46,272 - INFO - local_storage.py:84 - Using per-agent session storage rooted at /home/xbill/gemini-cli-aws/gemini31-fargate/backend/app
2026-04-15 20:01:46,272 - INFO - local_storage.py:110 - Using file artifact service at /home/xbill/gemini-cli-aws/gemini31-fargate/backend/app/.adk/artifacts
/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/cli/fast_api.py:198: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super().__init__()
INFO: Started server process [10520]
INFO: Waiting for application startup.

+-----------------------------------------------------+
| ADK Web Server started                              |
|                                                     |
| For local testing, access at http://0.0.0.0:8000.   |
+-----------------------------------------------------+

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:41986 - "GET / HTTP/1.1" 307 Temporary Redirect
INFO: 127.0.0.1:41986 - "GET /dev-ui/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:41986 - "GET /dev-ui/styles-YY6V3TJU.css HTTP/1.1" 200 OK
INFO: 127.0.0.1:41990 - "GET /dev-ui/chunk-RGCH6K7F.js HTTP/1.1" 200 OK
INFO: 127.0.0.1:42002 - "GET /dev-ui/chunk-W7GRJBO5.js HTTP/1.1" 200 OK
INFO: 127.0.0.1:42026 - "GET /dev-ui/main-7SJG752M.js HTTP/1.1" 200 OK
INFO: 127.0.0.1:42016 - "GET /dev-ui/polyfills-5CFQRCPP.js HTTP/1.1" 200 OK
INFO: 127.0.0.1:42026 - "GET /dev-ui/assets/config/runtime-config.json HTTP/1.1" 200 OK
INFO: 127.0.0.1:42026 - "GET /list-apps?relative_path=./ HTTP/1.1" 200 OK
INFO: 127.0.0.1:41986 - "GET /dev-ui/assets/ADK-512-color.svg HTTP/1.1" 200 OK
INFO: 127.0.0.1:42026 - "GET /dev-ui/adk_favicon.svg HTTP/1.1" 200 OK
2026-04-15 20:01:49,369 - INFO - local_storage.py:60 - Creating local session service at /home/xbill/gemini-cli-aws/gemini31-fargate/backend/app/biometric_agent/.adk/session.db
INFO: 127.0.0.1:42016 - "GET /builder/app/biometric_agent?ts=1776297709357 HTTP/1.1" 200 OK
2026-04-15 20:01:49,393 - INFO - adk_web_server.py:867 - New session created: b1b2e791-b792-414a-9d46-90a3ddac1e53
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use the web interface, either on the local interface &lt;strong&gt;127.0.0.1&lt;/strong&gt; or the catch-all interface &lt;strong&gt;0.0.0.0&lt;/strong&gt;, depending on your environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k252zwo6necaoqeydni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k252zwo6necaoqeydni.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Special note for Google Cloud Shell deployments: add a CORS &lt;strong&gt;allow_origins&lt;/strong&gt; exemption so the ADK agent can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;adk web --host 0.0.0.0 --allow_origins 'regex:.*'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Lint and Test the Main Python Code
&lt;/h4&gt;

&lt;p&gt;The final step is to build, lint, and test the main Python code.&lt;/p&gt;

&lt;p&gt;To Lint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make lint
&lt;span class="go"&gt;Linting Python code with Ruff...
ruff check backend
All checks passed!
Linting Frontend code with ESLint...
cd frontend &amp;amp;&amp;amp; npm run lint

&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;frontend@0.0.0 lint
&lt;span class="gp"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;eslint &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To Test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;span class="go"&gt;Running backend and connectivity tests...
python3 -m pytest test_live_connection.py test_ws_backend.py test_ws_backend_v2.py backend/app/biometric_agent/test_agent.py
================================================================ test session starts ================================================================
platform linux -- Python 3.13.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /home/xbill/gemini-cli-aws/gemini31-fargate
plugins: anyio-4.13.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 8 items                                                                                                                                   

test_live_connection.py . [12%]
test_ws_backend.py . [25%]
test_ws_backend_v2.py . [37%]
backend/app/biometric_agent/test_agent.py ..... [100%]

================================================================= warnings summary ==================================================================
../../.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72
  /home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72: UserWarning: [EXPERIMENTAL] feature FeatureName.PLUGGABLE_AUTH is enabled.
    check_feature_enabled()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================== 8 passed, 1 warning in 2.67s ============================================================
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
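
&lt;p&gt;The &lt;code&gt;lint&lt;/code&gt; and &lt;code&gt;test&lt;/code&gt; targets seen above could be sketched roughly as follows. This is an illustrative reconstruction from the console output; the project's actual Makefile may differ (for example, it chains the frontend lint with &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;):&lt;/p&gt;

```makefile
# Illustrative Makefile targets reconstructed from the console output above.
lint:
	@echo "Linting Python code with Ruff..."
	ruff check backend
	@echo "Linting Frontend code with ESLint..."
	cd frontend; npm run lint

test:
	@echo "Running backend and connectivity tests..."
	python3 -m pytest test_live_connection.py test_ws_backend.py \
		test_ws_backend_v2.py backend/app/biometric_agent/test_agent.py
```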



&lt;h4&gt;
  
  
  Running Locally
&lt;/h4&gt;

&lt;p&gt;The main Python code can then be run locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;biosync.sh
&lt;span class="go"&gt;Local URL
http://127.0.0.1:8080/
/home/xbill/.pyenv/versions/3.13.13/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72: UserWarning: [EXPERIMENTAL] feature FeatureName.PLUGGABLE_AUTH is enabled.
  check_feature_enabled()
2026-04-15 20:06:48,642 - INFO - System Config: 2.0 FPS, 10.0s Heartbeat
Serving static files from: /home/xbill/gemini-cli-aws/gemini31-fargate/frontend/dist
INFO: Started server process [11513]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect to the local front end:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtxz2qtacinjpbb4sgbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtxz2qtacinjpbb4sgbc.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Deploying to ECS
&lt;/h4&gt;

&lt;p&gt;A utility script runs the deployment to AWS ECS Fargate. First, save your AWS credentials on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;aws login --remote

&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh 
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
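
&lt;p&gt;The &lt;code&gt;save-aws-creds.sh&lt;/code&gt; helper likely looks something like the sketch below. This is illustrative only; the variable handling and file format in the real script may differ:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# save-aws-creds.sh (illustrative sketch; the real script may differ).
# Captures the shell's current AWS credentials into .aws_creds so the
# Makefile can reuse them during `make deploy`.
set -euo pipefail

CREDS_FILE=".aws_creds"

echo "Exporting AWS credentials..."
{
  echo "export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID:-}"
  echo "export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:-}"
  echo "export AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN:-}"
} > "${CREDS_FILE}"
chmod 600 "${CREDS_FILE}"
echo "Successfully saved credentials to ${CREDS_FILE}"
```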



&lt;p&gt;The system can now be deployed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make deploy
&lt;span class="go"&gt;./save-aws-creds.sh
Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.
./deploy-fargate.sh
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And status checked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make status
&lt;span class="go"&gt;--- Fargate Cluster Status ---
-------------------------------------------------------------
|                     DescribeClusters                      |
+--------------------------+----------+----------+----------+
|           Name           | Pending  | Running  |  Status  |
+--------------------------+----------+----------+----------+
|  biometric-scout-cluster |  0       |  1       |  ACTIVE  |
+--------------------------+----------+----------+----------+
--- Fargate Service Status ---
-------------------------------------------------------------
|                     DescribeServices                      |
+---------+----------+---------------------------+----------+
| Desired | Running  |          Service          |  Status  |
+---------+----------+---------------------------+----------+
|  1      |  1       |  biometric-scout-service  |  ACTIVE  |
+---------+----------+---------------------------+----------+
&lt;/span&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container is deployed, you can retrieve the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/gemini31-fargate$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make endpoint
&lt;span class="go"&gt;--- Fargate HTTPS Endpoint ---
Application URL: https://biometric-scout-alb-1410555012.us-east-1.elb.amazonaws.com
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will be visible in the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s9t47eexdq0mngu4nvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s9t47eexdq0mngu4nvz.png" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the Web Interface
&lt;/h4&gt;

&lt;p&gt;Start a connection to the deployed app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://biometric-scout-alb-1410555012.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect to the app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsd0jys3igkz3kari9pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsd0jys3igkz3kari9pj.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then use the Live model to process audio and video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4muhndap65r4ou8fsa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4muhndap65r4ou8fsa5.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, complete the sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8xd82i11fookffa2kgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8xd82i11fookffa2kgh.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI Code Review
&lt;/h4&gt;

&lt;p&gt;As a final step, Gemini CLI was used for a full code review of the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ Based on my comprehensive review of the "Alpha Rescue Drone - Biometric Security System," the codebase is exceptionally well-structured,
  follows modern engineering standards, and strictly adheres to the specialized requirements for Gemini 3.1 Flash Live and the Google Agent
  Development Kit (ADK).

Core Architecture &amp;amp; Integration
  The system excels at orchestrating a high-frequency, multimodal data stream. The use of a binary protocol (0x01 for audio, 0x02 for video)
  in useGeminiSocket.js minimizes overhead, while the monkey-patching utility in patch_adk.py demonstrates a sophisticated understanding of
  the evolving Gemini 3.1 Live API. This proactive approach ensures compatibility with features like send_realtime_input that are not yet
  natively handled by the standard ADK.

Backend Analysis
   - Agent Intelligence: The instructions in agent.py are precise and prioritize safety (offensive gesture detection) and hidden overrides
     ("Devil's Horns") effectively.
   - Resilience: The backend implements a heartbeat stimulus (CONTINUE_SURVEILLANCE) and a neural handshake, which are critical for
     maintaining model engagement during visual-only periods.
   - Model Fallback: The get_model_id() logic is clever, automatically switching to gemini-2.5-flash for CLI runs to prevent 404 errors,
     while defaulting to the high-performance gemini-3.1-flash-live-preview for production.

Frontend &amp;amp; UI/UX
   - Aesthetic: The "Neon Cyan" HUD in BiometricLock.jsx provides a high-fidelity, futuristic feel that aligns with the "Mission Alpha"
     theme.
   - Feedback Loop: The synchronization between the random sequence generation, visual matching, and the special "Heavy Metal" audio event
     creates a compelling user experience.
   - Efficiency: Utilizing toBlob with JPEG compression (0.6) for the 2 FPS video stream is an optimal choice for balancing visual clarity
     with network latency.

Testing &amp;amp; Maintainability
   - Validation: The project maintains a clean slate with 100% test and linting compliance. The inclusion of mock.sh is a significant
     developer productivity boost.
   - Standards: The use of Ruff for Python and ESLint for the frontend ensures long-term code quality.

Recommendations
   1. Connection Resilience: While the SYSTEM_ERROR closes the socket correctly, the frontend could benefit from a more graceful
      "Reconnecting..." state if the network fluctuates during a mission.
   2. Patch Management: As the Gemini Live API stabilizes, keep an eye on official ADK updates to deprecate the manual unrolling of
      media_chunks in patch_adk.py.
   3. Security: Verify that the .env variables are correctly injected into the Cloud Run environment via the Makefile or cloudbuild.yaml
      without being exposed in the build logs.

Overall, the system is a robust and creative implementation of real-time multimodal AI. Authentication Protocol: Stable. 🤘
                                                                                                                             ? for shortcuts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to build a multimodal agent on the Gemini Live model. The agent was tested locally with the CLI and the ADK web interface, then deployed to AWS Fargate. Along the way, several key takeaways and lessons learned from the transition to the new Gemini Live model were captured. Finally, Gemini CLI was used for a complete code review of the project.&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>multimodal</category>
      <category>aws</category>
      <category>awsfargate</category>
    </item>
    <item>
      <title>I Replaced 47 DevOps Scripts With One AI Agent — Here’s What Happened</title>
      <dc:creator>POTHURAJU JAYAKRISHNA YADAV</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:15:34 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-replaced-47-devops-scripts-with-one-ai-agent-heres-what-happened-4jbo</link>
      <guid>https://dev.to/aws-builders/i-replaced-47-devops-scripts-with-one-ai-agent-heres-what-happened-4jbo</guid>
      <description>&lt;h2&gt;
  
  
  The Hook: I Was Wrong About Automation
&lt;/h2&gt;

&lt;p&gt;I thought I automated DevOps.&lt;/p&gt;

&lt;p&gt;I had 47 deployment scripts.&lt;/p&gt;

&lt;p&gt;Then I started replacing them with an AI agent —&lt;br&gt;
and most scripts became unnecessary.&lt;/p&gt;

&lt;p&gt;Not by following instructions.&lt;br&gt;
By making decisions.&lt;/p&gt;

&lt;p&gt;And the 2 AM debugging stopped.&lt;/p&gt;



&lt;p&gt;Note: The code shown here is simplified for clarity.&lt;br&gt;
The GitHub repo contains a more modular implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's when I realized:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wasn't automating.&lt;br&gt;
I was hardcoding decisions.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;I gave Claude a list of AWS tools and one instruction: "Deploy this app."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No hardcoded logic.&lt;br&gt;
No decision trees.&lt;/p&gt;

&lt;p&gt;Just: &lt;strong&gt;describe the goal, Claude figures out how.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment time:&lt;/strong&gt; 3 hours → minutes, for most cases.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;(Built this while deploying real workloads on AWS: Docker, ECS, EC2, IAM.)&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🎯 Is This For You?
&lt;/h2&gt;

&lt;p&gt;This post is for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOps engineers&lt;/strong&gt; with 10+ deployment scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform engineers&lt;/strong&gt; building internal developer platforms
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone&lt;/strong&gt; exploring AI agents beyond chatbots&lt;/li&gt;
&lt;li&gt;Teams on AWS (Docker, EC2, ECS, IAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If manual deployment takes &amp;gt; 30 minutes, read this.&lt;/p&gt;


&lt;h2&gt;
  
  
  ✨ What's Possible
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Old way (manual):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs create-cluster &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; prod
aws ecs register-task-definition &lt;span class="nt"&gt;--family&lt;/span&gt; myapp ...
&lt;span class="c"&gt;# [50+ commands, 3 hours, manual debugging]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New way (agent):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy FastAPI with PostgreSQL, auto-scale to 20, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimal IAM perms, CloudWatch monitoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Done in minutes for most cases, with significantly reduced debugging
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No scripts.&lt;/p&gt;

&lt;p&gt;Just Claude thinking out loud about your infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 How It Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You describe the goal (natural language):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Deploy my app on 5 ECS tasks, auto-scale to 20 on high CPU"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude gets these tools:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dispatcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs__create_cluster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs__register_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_task_definition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs__create_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ecs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iam__create_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# For permissions
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude reasons autonomously:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"User wants ECS deployment.
 I need to:
 1. Check if cluster exists
 2. Register task definition
 3. Create service with 5 tasks
 4. Setup auto-scaling
 5. Verify it's running"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;It executes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call &lt;code&gt;ecs__create_cluster()&lt;/code&gt; → Get cluster ARN&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;ecs__register_task()&lt;/code&gt; → Get task definition &lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;ecs__create_service()&lt;/code&gt; → Get running tasks&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;ecs__setup_autoscaling()&lt;/code&gt; → Confirmed&lt;/li&gt;
&lt;li&gt;Return: "All 5 tasks running, auto-scaling 5-20"&lt;/li&gt;
&lt;/ol&gt;
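
&lt;p&gt;Step 4 above ("Setup auto-scaling") maps onto the Application Auto Scaling API. A hedged sketch follows; the function name, cluster/service names, and the 70% CPU target are assumptions for illustration, not taken from the repo:&lt;/p&gt;

```python
# Sketch of the auto-scaling step using Application Auto Scaling
# (names and the 70% CPU target are illustrative assumptions).
def setup_autoscaling(client, cluster, service, min_tasks=5, max_tasks=20):
    """Register the ECS service as a scalable target and attach a
    target-tracking policy on average CPU utilization."""
    resource_id = f"service/{cluster}/{service}"
    client.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=min_tasks,
        MaxCapacity=max_tasks,
    )
    client.put_scaling_policy(
        PolicyName=f"{service}-cpu-target",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,  # scale out when average CPU is above 70%
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    )
    return resource_id
```

&lt;p&gt;Here &lt;code&gt;client&lt;/code&gt; would be a &lt;code&gt;boto3.client("application-autoscaling")&lt;/code&gt; instance, which keeps the function easy to exercise with a stub in tests.&lt;/p&gt;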

&lt;p&gt;&lt;strong&gt;Minimal manual intervention in most cases: just reasoning + execution + feedback loops.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ The Problem I Started With
&lt;/h2&gt;

&lt;p&gt;I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deploy_docker.py&lt;/code&gt; — 150 lines of Docker logic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deploy_ec2.py&lt;/code&gt; — 200 lines of EC2 logic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deploy_ecs.py&lt;/code&gt; — 300 lines of ECS logic&lt;/li&gt;
&lt;li&gt;3 routers trying to chain them together&lt;/li&gt;
&lt;li&gt;0 ways to handle "deploy to both EC2 AND ECS"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Every new service = new script. Every new workflow = rewrite everything.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Solution: Reduce rigid scripts — let the agent handle orchestration logic dynamically.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of routing to specific tools, give Claude all available tools and let it decide which ones to use, in which order, adapting as it goes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# All tools in one place
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker__run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run_container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2__create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;create_instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs__deploy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;create_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iam__create_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;create_role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... more tools
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Let Claude orchestrate
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy with Docker locally, then scale to ECS production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Claude figures out: Docker first, then ECS, then IAM for permissions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
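
&lt;p&gt;Under the hood, a call like &lt;code&gt;agent.run(...)&lt;/code&gt; can be a simple reason-act loop. A minimal sketch follows; every name here (&lt;code&gt;dispatcher&lt;/code&gt;, &lt;code&gt;llm_call&lt;/code&gt;, the reply shape) is an assumption for illustration, not the repo's actual API:&lt;/p&gt;

```python
# Minimal reason-act loop behind an `agent.run(goal)` call.
# The dispatcher, llm_call callable, and reply shape are all
# illustrative assumptions, not the repo's actual API.
def run_agent(goal, dispatcher, llm_call):
    """Send the goal to the model, execute each tool it requests,
    and loop until the model produces a final answer."""
    messages = [{"role": "user", "content": goal}]
    while True:
        reply = llm_call(messages, tools=list(dispatcher))
        if reply["type"] == "final":
            return reply["text"]
        # The model requested a tool: run it, feed the result back,
        # and let the model decide the next step.
        result = dispatcher[reply["tool"]](**reply["args"])
        messages.append(
            {"role": "tool", "name": reply["tool"], "content": str(result)}
        )
```

&lt;p&gt;The point of the loop is that the orchestration order lives in the model's replies, not in your code: adding a new AWS capability means adding one dispatcher entry, not a new script.&lt;/p&gt;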






&lt;h2&gt;
  
  
  🏗️ The Architecture (3 Layers)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────┐
│ User Goal (natural language) │  
│ "Deploy scalable production  │
│  stack with auto-scaling"    │
└──────────┬───────────────────┘
           │
           ▼
    ┌─────────────────────────────────────────────┐
    │  Claude (Bedrock)                           │
    │  ← Reads goal + available tools             │
    │  ← Decides sequence of actions              │
    │  ← Adapts when things fail                  │
    └──────────────┬────────────────────────────┘
                   │
    ┌──────────────┴──────────────┐
    │                             │
    ▼                             ▼
┌─────────────┐  ┌──────────────────┐
│ AWS APIs    │  │ Conversation     │
│ (via boto3) │  │ Memory (DynamoDB)│
│ ← Executes  │  │ ← Recalls setup  │
│   decisions │  │   from last week │
└─────────────┘  └──────────────────┘
    │                    │
    └────────┬───────────┘
             │
    ┌────────▼─────────┐
    │ Actual Resources │
    │ EC2, ECS, Docker │
    │      IAM, etc    │
    └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3 moving parts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; — Reasons about the task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — Execute AWS API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — Remember past deployments for coherence&lt;/li&gt;
&lt;/ol&gt;
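
&lt;p&gt;The DynamoDB-backed memory layer could look like the sketch below. Function names, key names, and the table schema are assumptions for illustration; the table is passed in so the helpers stay testable without AWS:&lt;/p&gt;

```python
# Sketch of DynamoDB-backed conversation memory (names and schema
# are illustrative assumptions; the real module may differ).
import time

def save_message(table, session_id, role, content):
    """Append one message to the session's history.

    Assumes a table with partition key `session_id` and numeric
    sort key `ts` (milliseconds, giving chronological order).
    """
    table.put_item(Item={
        "session_id": session_id,
        "ts": int(time.time() * 1000),
        "role": role,
        "content": content,
    })

def load_history(table, session_id):
    """Return the session's messages in chronological order."""
    resp = table.query(
        KeyConditionExpression="session_id = :sid",
        ExpressionAttributeValues={":sid": session_id},
        ScanIndexForward=True,  # oldest first
    )
    return [(item["role"], item["content"]) for item in resp["Items"]]
```

&lt;p&gt;In production, &lt;code&gt;table&lt;/code&gt; would be &lt;code&gt;boto3.resource("dynamodb").Table(...)&lt;/code&gt;; in tests it can be any object with &lt;code&gt;put_item&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt;.&lt;/p&gt;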




&lt;h2&gt;
  
  
  🔧 The Setup (Code Foundation)
&lt;/h2&gt;

&lt;p&gt;Here's the entire base system in ~100 lines. &lt;a href="https://github.com/jayakrishnayadav24/ai-agents" rel="noopener noreferrer"&gt;Full code on GitHub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_message&lt;/span&gt;

&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apac.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Fast, cheap 
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Foundation for all agents (Docker, EC2, ECS, IAM)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;AGENT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# "docker", "ec2", etc.
&lt;/span&gt;    &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  &lt;span class="c1"&gt;# Agent's personality
&lt;/span&gt;    &lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# What this agent can do
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lazy load Bedrock connection (only when needed)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bedrock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bedrock&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The agentic loop: think → decide → act → repeat&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;dispatcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dispatcher&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;}]}]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# Max 10 iterations to prevent infinite loops
&lt;/span&gt;            &lt;span class="c1"&gt;# Ask Claude what to do
&lt;/span&gt;            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;toolUseDepth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Is Claude done?
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stopReason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endTurn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

            &lt;span class="c1"&gt;# Execute tools Claude wants to call
&lt;/span&gt;            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_use&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;toolUse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

                &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUseId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUseId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}]})&lt;/span&gt;

            &lt;span class="c1"&gt;# Add Claude's decision + results to conversation
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolResult&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

            &lt;span class="c1"&gt;# Save for future reference
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;save_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lazy loading&lt;/strong&gt; (&lt;code&gt;@property bedrock&lt;/code&gt;): the Bedrock client is created only on first use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool loop&lt;/strong&gt;: run the requested tools, capture the results, and feed the output back to the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling&lt;/strong&gt;: don't crash; report what failed to the model and let it adapt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: every turn is saved, so the agent can pick the session back up days later&lt;/li&gt;
&lt;/ul&gt;
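&lt;p&gt;The loop above calls &lt;code&gt;get_tools()&lt;/code&gt; and &lt;code&gt;get_dispatcher()&lt;/code&gt; without showing them. Here is one minimal sketch of what they might look like, assuming each &lt;code&gt;CAPABILITIES&lt;/code&gt; entry becomes a Converse &lt;code&gt;toolSpec&lt;/code&gt; and maps to a plain Python callable; the names and the open-ended input schema are illustrative, not from the project:&lt;/p&gt;

```python
# Sketch only: assumed shapes for the helpers the loop relies on.
# The toolSpec layout follows the Bedrock Converse API's toolConfig format.

def get_tools(capabilities):
    """Turn CAPABILITIES entries into Converse toolSpec dicts."""
    return [
        {
            "toolSpec": {
                "name": cap["name"],
                "description": cap["description"],
                # Open-ended input for brevity; a real agent would declare
                # each tool's parameters in this JSON schema.
                "inputSchema": {"json": {"type": "object"}},
            }
        }
        for cap in capabilities
    ]


def get_dispatcher(handlers):
    """Map tool names to plain Python callables."""
    return dict(handlers)


# Example with a stubbed handler:
capabilities = [{"name": "list_containers", "description": "List running containers"}]
tools = get_tools(capabilities)
dispatcher = get_dispatcher({"list_containers": lambda tool_input: {"containers": []}})
```

&lt;p&gt;Keeping the dispatcher a plain dict means an unknown tool name raises &lt;code&gt;KeyError&lt;/code&gt;, which the loop's &lt;code&gt;try/except&lt;/code&gt; turns into an error result the model can see.&lt;/p&gt;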




&lt;h2&gt;
  
  
  🚀 Agents in Action: 3 Real Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DockerAgent: Deploy Locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DockerAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;AGENT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You manage Docker containers.
Rules: Pull image first, check if container exists, use sensible defaults.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_containers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List running containers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pull and run image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stop a container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;docker_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DockerAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;docker_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy FastAPI on port 8000 with health check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Claude calls: list_containers → run_container → confirms it's running
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
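&lt;p&gt;Behind each capability sits an ordinary function. As a sketch of what a &lt;code&gt;run_container&lt;/code&gt; handler could look like, assuming it shells out to the Docker CLI (the helper names and defaults below are assumptions, not from the project):&lt;/p&gt;

```python
import subprocess


def build_run_command(image, port, name):
    """Translate tool input into a docker CLI invocation (assumed defaults)."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", f"{port}:{port}",  # publish the container port on the host
        image,
    ]


def run_container(tool_input):
    """Handler for the 'run_container' capability."""
    cmd = build_run_command(
        tool_input["image"],
        tool_input.get("port", 8000),
        tool_input.get("name", "app"),
    )
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the failure to the model instead of raising
        return {"status": "failed", "error": proc.stderr.strip()}
    # `docker run -d` prints the new container id on stdout
    return {"status": "running", "container_id": proc.stdout.strip()}
```

&lt;p&gt;Separating the command builder from the execution makes the handler easy to dry-run and unit-test without Docker installed.&lt;/p&gt;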






&lt;h3&gt;
  
  
  EC2Agent: Scale to Cloud
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EC2Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;AGENT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You manage EC2 instances.
Rules: Tag for organization, verify security groups, create new→test→retire old.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;describe_instances&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List instances&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Launch new instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stop instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ec2_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EC2Agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ec2_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create 2 t2.micro instances, tag as app-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Claude: checks if instances exist → creates 2 → returns IPs → confirms running
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
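&lt;p&gt;A &lt;code&gt;create_instance&lt;/code&gt; handler would ultimately call boto3's &lt;code&gt;run_instances&lt;/code&gt;. One way to sketch the request builder, with the system prompt's tagging rule baked in (the AMI id is a placeholder you would resolve per region):&lt;/p&gt;

```python
def build_run_instances_request(count, tag, ami_id, instance_type="t2.micro"):
    """Build kwargs for ec2_client.run_instances(**request) -- a sketch."""
    return {
        "ImageId": ami_id,  # placeholder; look up a real AMI for your region
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "TagSpecifications": [{
            "ResourceType": "instance",
            # the "tag for organization" rule from the system prompt
            "Tags": [{"Key": "Name", "Value": tag}],
        }],
    }


# The handler would then do:
#   ec2 = self.session.client("ec2")
#   ec2.run_instances(**build_run_instances_request(2, "app-server", AMI_ID))
request = build_run_instances_request(2, "app-server", "ami-12345678")
```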






&lt;h3&gt;
  
  
  ECSAgent: Production Auto-Scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ECSAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;AGENT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ecs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You manage ECS services at production scale.
Rules: Task definition first, use Fargate, always configure auto-scaling.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_cluster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create ECS cluster&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;register_task_definition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Register blueprint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;setup_autoscaling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Configure scaling rules&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ecs_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ECSAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ecs_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy with 5 tasks, auto-scale to 20 on high CPU, health monitoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Claude orchestrates: cluster → task definition → service → autoscaling → verification
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
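&lt;p&gt;The &lt;code&gt;setup_autoscaling&lt;/code&gt; step maps to two Application Auto Scaling calls: register the service as a scalable target, then attach a target-tracking policy on average CPU. A sketch of the request shapes for the 5-to-20-task scenario (the builder function, policy name, and CPU target are illustrative):&lt;/p&gt;

```python
def build_scaling_requests(cluster, service, min_tasks=5, max_tasks=20,
                           target_cpu=70.0):
    """Build kwargs for application-autoscaling's register_scalable_target
    and put_scaling_policy (target tracking on average ECS service CPU)."""
    resource_id = f"service/{cluster}/{service}"
    target = {
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }
    policy = {
        "PolicyName": f"{service}-cpu-target-tracking",  # illustrative name
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,  # scale out when avg CPU exceeds this
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    }
    return target, policy


target, policy = build_scaling_requests("prod-cluster", "api")
```

&lt;p&gt;The handler would pass these to &lt;code&gt;boto3.client("application-autoscaling")&lt;/code&gt;'s &lt;code&gt;register_scalable_target&lt;/code&gt; and &lt;code&gt;put_scaling_policy&lt;/code&gt; respectively.&lt;/p&gt;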






&lt;h2&gt;
  
  
  Real Usage: From Local to Production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; deploy a FastAPI app from a laptop to production in minutes, with far less manual debugging along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Test locally&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;docker_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run myapp:latest on port 8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ Docker container running
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2: Scale to cloud&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ec2_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create 2 instances for load balancing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ 2 EC2 instances up (same session = Claude remembers port 8000)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 3: Auto-scaling production&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ecs_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy with 5-20 auto-scaling, CloudWatch monitoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ 5 ECS tasks running, auto-scaling 5-20 based on CPU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;All with the same &lt;code&gt;session_id&lt;/code&gt;:&lt;/strong&gt; the agent remembers the image name, port, and configuration from Stage 1, so each stage builds on the last.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Memory: Agents Remember
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Day 1
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy on Docker with port 8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Week later
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Move this to ECS for production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Claude reads history from DynamoDB:
# "I remember this app used port 8000. I'll keep that for ECS too."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DynamoDB stores every conversation turn, so Claude can recall past decisions and keep new ones consistent with them across sessions.&lt;/p&gt;
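&lt;p&gt;The pattern is easy to sketch locally. Here is an in-memory stand-in for the DynamoDB-backed store (the class and field names are illustrative, not the article's exact schema):&lt;/p&gt;

```python
from collections import defaultdict

# In-memory stand-in for the DynamoDB-backed session store.
# Shapes are illustrative, not the article's exact schema.
class SessionMemory:
    def __init__(self):
        self._turns = defaultdict(list)

    def append(self, session_id, role, content):
        # One item per conversation turn, keyed by session_id
        self._turns[session_id].append({"role": role, "content": content})

    def history(self, session_id):
        # Replayed to Claude on every new request in the same session
        return list(self._turns[session_id])

memory = SessionMemory()
memory.append("user_123", "user", "Deploy on Docker with port 8000")
memory.append("user_123", "assistant", "Deployed myapp:latest on port 8000")

# A week later, the same session_id pulls the full history back --
# that replay is how Claude "remembers" port 8000.
print(len(memory.history("user_123")))  # 2
```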




&lt;h2&gt;
  
  
  🛡️ Production-Ready Code Patterns
&lt;/h2&gt;

&lt;p&gt;This is where user feedback pushed hardest: real code needs safety rails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Error Handling (Don't Crash, Recover)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: crashes on error
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Right: tell Claude about the error
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# For debugging
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Claude sees the error and tries a different approach
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; If Docker pull fails, don't give up. Tell Claude: "Pull failed, but let me check if the image is locally cached." Claude adapts.&lt;/p&gt;
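&lt;p&gt;The same shape as a runnable sketch, with a stubbed tool standing in for the real Docker call:&lt;/p&gt;

```python
# Stub standing in for a real Docker pull; it always fails here
# so the recovery path is visible.
def docker_pull(image):
    raise ConnectionError(f"registry unreachable while pulling {image}")

def run_tool(tool, **kwargs):
    """Never raise into the agent loop: package errors as data."""
    try:
        return {"status": "ok", "result": tool(**kwargs)}
    except Exception as e:
        # This dict goes back to Claude as the tool result,
        # so it can pick an alternative (e.g. a locally cached image).
        return {"status": "failed", "error": str(e)}

result = run_tool(docker_pull, image="myapp:latest")
print(result["status"])  # failed
```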




&lt;h3&gt;
  
  
  Pattern 2: Timeouts (Prevent Hanging)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wraps&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;with_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prevent tools from running forever&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution exceeded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGALRM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Cancel timeout
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TimeoutError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;

&lt;span class="nd"&gt;@with_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_instance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;If EC2 creation hangs, timeout after 60 seconds&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;ec2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Pattern 3: Input Validation (Prevent Bad Requests)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_instance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Validate before executing&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Check required fields
&lt;/span&gt;    &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ImageId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required field: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Validate instance type
&lt;/span&gt;    &lt;span class="n"&gt;valid_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2.micro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2.small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InstanceType must be one of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Now safe to execute
&lt;/span&gt;    &lt;span class="n"&gt;ec2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
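&lt;p&gt;Because these checks run before any AWS call, they can be unit-tested in isolation. A stripped-down version of the same logic (function name is mine; no boto3 required):&lt;/p&gt;

```python
REQUIRED = ("ImageId", "InstanceType")
VALID_TYPES = ("t2.micro", "t2.small", "t3.medium")

def validate_instance_params(params):
    """Same checks as create_instance above, minus the boto3 call."""
    for field in REQUIRED:
        if field not in params:
            return {"error": f"Missing required field: {field}", "status": "failed"}
    if params["InstanceType"] not in VALID_TYPES:
        return {"error": f"InstanceType must be one of {list(VALID_TYPES)}",
                "status": "failed"}
    return {"status": "ok"}

print(validate_instance_params({"ImageId": "ami-0abc"}))
# {'error': 'Missing required field: InstanceType', 'status': 'failed'}
```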






&lt;h3&gt;
  
  
  Pattern 4: Audit Logging (Proof for Compliance)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Log everything for audits&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;log_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# What was requested
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Save to CloudWatch Logs or DynamoDB
&lt;/span&gt;    &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TableName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_audit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In the tool dispatcher:
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;log_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Pattern 5: Cost Guards (Don't Deploy Expensive Mistakes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MONTHLY_BUDGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  &lt;span class="c1"&gt;# $1000/month limit
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Estimate AWS cost before executing&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instance_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InstanceType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2.micro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;hourly_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2.micro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2.small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.042&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;months_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hourly_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;730&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;months_running&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_rds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# RDS: ~$400/month for 20GB
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# In the runnable:
&lt;/span&gt;&lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MONTHLY_BUDGET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cost $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;estimated&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; exceeds budget $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MONTHLY_BUDGET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rejected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Only execute if under budget
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
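&lt;p&gt;A quick sanity check on the arithmetic (the hourly rates above are illustrative, not live AWS pricing): a single &lt;code&gt;t3.medium&lt;/code&gt; comes in around $31/month, comfortably under the budget, so the guard only trips on requests that stack up, like fleets or databases.&lt;/p&gt;

```python
MONTHLY_BUDGET = 1000
# Same illustrative rates as estimate_cost above -- not live AWS pricing
hourly_rates = {"t2.micro": 0.012, "t2.small": 0.023, "t3.medium": 0.042}

# ~730 hours in a month, same formula as estimate_cost
monthly = round(hourly_rates["t3.medium"] * 730, 2)
print(monthly)                   # 30.66
print(monthly > MONTHLY_BUDGET)  # False: a single instance passes the guard
```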






&lt;h3&gt;
  
  
  Pattern 6: Role-Based Access (Security Boundaries)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define who can do what
&lt;/span&gt;&lt;span class="n"&gt;USER_PERMISSIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;developer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_instance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy_to_ecs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readonly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;describe_instances&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_containers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_permission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prevent unauthorized actions&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;USER_PERMISSIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="c1"&gt;# Before executing:
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;check_permission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User role &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; cannot perform &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;permission_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
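&lt;p&gt;For context, here is a minimal, self-contained version of the permission gate above. The roles and tool names in &lt;code&gt;USER_PERMISSIONS&lt;/code&gt; are illustrative placeholders, not values from the actual framework:&lt;/p&gt;

```python
# Illustrative role-to-tool allow-list (hypothetical roles and tools).
USER_PERMISSIONS = {
    "admin": ["deploy_service", "delete_service", "read_logs"],
    "developer": ["deploy_service", "read_logs"],
    "viewer": ["read_logs"],
}

def check_permission(user_role, action):
    """Return True only if the role's allow-list contains the action."""
    allowed = USER_PERMISSIONS.get(user_role, [])
    return action in allowed

# Unknown roles get an empty allow-list, so everything is denied by default.
print(check_permission("developer", "deploy_service"))  # True
print(check_permission("viewer", "delete_service"))     # False
print(check_permission("intern", "read_logs"))          # False
```

&lt;p&gt;The deny-by-default behaviour falls out of &lt;code&gt;dict.get(user_role, [])&lt;/code&gt;: a role that was never registered can perform nothing.&lt;/p&gt;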






&lt;h2&gt;
  
  
  ✅ Why Agents &amp;gt; Scripts
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Scripts&lt;/th&gt;
&lt;th&gt;Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New workflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite code&lt;/td&gt;
&lt;td&gt;Claude adapts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crashes&lt;/td&gt;
&lt;td&gt;Tries alternatives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (hardcoded)&lt;/td&gt;
&lt;td&gt;Full decision log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grows linearly&lt;/td&gt;
&lt;td&gt;One framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual proof&lt;/td&gt;
&lt;td&gt;Automatic audit trail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🎯 Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Just 3 commands:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/jayakrishnayadav24/ai-agents
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-agents
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Then try:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.docker_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerAgent&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DockerAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy FastAPI app on port 8000 with health check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's it.&lt;/strong&gt; Claude handles the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 What's Next (Part 2)
&lt;/h2&gt;

&lt;p&gt;This article covered the concept. Part 2, along with the &lt;a href="https://github.com/jayakrishnayadav24/ai-agents" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;, will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Full working code (all agents)&lt;/li&gt;
&lt;li&gt;✅ DynamoDB setup for memory&lt;/li&gt;
&lt;li&gt;✅ Deploying agents to Lambda&lt;/li&gt;
&lt;li&gt;✅ Real production patterns (request tracing, cost estimation)&lt;/li&gt;
&lt;li&gt;✅ Demo with actual deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Where This Breaks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex compliance environments (manual approvals still needed)&lt;/li&gt;
&lt;li&gt;Cost estimation is approximate&lt;/li&gt;
&lt;li&gt;Requires well-defined tools (garbage in → garbage out)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 47 scripts, 3 hours/deployment, constant debugging&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; 1 agent framework, deployments done in minutes for most cases, and significantly less time spent debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Because Claude doesn't follow scripts. Claude plans and executes the steps based on the goal.&lt;/p&gt;

&lt;p&gt;It adapts. It learns. It remembers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's not automation anymore. That's the future of infrastructure.&lt;/strong&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;How many deployment scripts are you still maintaining?&lt;/p&gt;

&lt;p&gt;10+? 20+? 50+?&lt;/p&gt;

&lt;p&gt;I want to see how bad this problem is. Drop a comment 👇&lt;/p&gt;

&lt;p&gt;Your feedback shapes Part 2.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>agents</category>
      <category>aws</category>
      <category>eks</category>
      <category>automation</category>
    </item>
    <item>
      <title>Processing long running events on AWS API Gateway</title>
      <dc:creator>Evertson Croes</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:19:51 +0000</pubDate>
      <link>https://dev.to/aws-builders/processing-long-running-events-on-aws-api-gateway-bn4</link>
      <guid>https://dev.to/aws-builders/processing-long-running-events-on-aws-api-gateway-bn4</guid>
      <description>&lt;h1&gt;
  
  
  Processing long running events on AWS API Gateway
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;AWS API Gateway&lt;/a&gt; is a managed HTTP/REST service provided by AWS. It provides a relatively simple way to host an API and offers rich functionality when it comes to customizability, security and integration. AWS API Gateway enforces a maximum integration timeout of 29 seconds. For most APIs this is perfectly reasonable.&lt;/p&gt;

&lt;p&gt;However, problems arise when an API must trigger operations that take minutes to complete, such as generating large exports or running complex background jobs. In our case, we needed to generate large database exports that could take several minutes, so a synchronous API request was not an option. This became a challenge with the default setup we had with API Gateway.&lt;/p&gt;

&lt;p&gt;AWS provides some &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/process-events-asynchronously-with-amazon-api-gateway-and-aws-lambda.html" rel="noopener noreferrer"&gt;guidance&lt;/a&gt; on this in their documentation. However, in this blog article I want to share how we solved this problem in our project in more detail and also provide a working CDK project as an example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;As mentioned in the intro, API Gateway enforces a time &lt;strong&gt;limit of 29 seconds&lt;/strong&gt; on requests, while the functionality we want to trigger could potentially run for many minutes. In this case, we do not want a &lt;strong&gt;synchronous&lt;/strong&gt; call but an &lt;strong&gt;asynchronous&lt;/strong&gt; one: even if API Gateway allowed us to keep a request open for 30 minutes, it would not be beneficial for a frontend application to keep one blocking request open for that long, since it ties up resources.&lt;/p&gt;

&lt;p&gt;So we also need a mechanism to handle asynchronous requests while still using API Gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The TaskManager
&lt;/h2&gt;

&lt;p&gt;We ended up calling the solution to this problem the “TaskManager”. It can be seen as a single microservice with the sole responsibility of keeping track of tasks; it does not actually process them. The following diagram provides a high-level overview:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyduvfgc98uiyz7yahd52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyduvfgc98uiyz7yahd52.png" alt=" " width="639" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this overview, there are Task Suppliers, Task Processors and the TaskManager itself. It is important to note that in our use cases, we have not yet found a scenario where there are multiple suppliers of the same task and multiple processors for the same type of task. However, the pattern introduced in this blog could be expanded to include this if necessary.&lt;/p&gt;

&lt;p&gt;If we zoom in to the TaskManager, we have the following components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8668okbvcxcu6wk7qost.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8668okbvcxcu6wk7qost.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;br&gt;
(Click &lt;a href="https://evertson-public.s3.eu-central-1.amazonaws.com/task_manager_diagram.png" rel="noopener noreferrer"&gt;here&lt;/a&gt; to see diagram better)&lt;/p&gt;

&lt;p&gt;This diagram depicts the deployment and the flow at the same time. On the left side, we can see a Task Supplier. For this example, it does not matter what this is; it could be any component that can make an HTTP request. In our example, it was a Backend-for-Frontend API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd87rwcuqfyoj5f4lo1ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd87rwcuqfyoj5f4lo1ev.png" alt=" " width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Task Supplier calls one of the two endpoints available in the TaskManager API (itself an API Gateway) that allows the creation of a Task via an HTTP POST request. This triggers a Lambda function that creates the task in the TaskStatus DynamoDB Table.&lt;/p&gt;
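&lt;p&gt;A sketch of what that creation Lambda could look like. The helper name &lt;code&gt;build_task_item&lt;/code&gt; and the exact attribute layout are assumptions for illustration; the DynamoDB table client is passed in so the logic stays testable without AWS:&lt;/p&gt;

```python
import json
import uuid
from datetime import datetime, timezone

def build_task_item(task_type, payload=None):
    """Build the initial TaskStatus record; every new task starts as CREATED."""
    return {
        "taskId": str(uuid.uuid4()),
        "taskType": task_type,
        "status": "CREATED",
        "payload": payload or {},
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }

def create_task_handler(event, table):
    """API Gateway proxy handler: persist the task and echo its id back."""
    body = json.loads(event["body"])
    item = build_task_item(body["taskType"], body.get("payload"))
    table.put_item(Item=item)  # a boto3 DynamoDB Table resource in the real Lambda
    return {"statusCode": 201, "body": json.dumps({"taskId": item["taskId"]})}
```

&lt;p&gt;The Task Supplier keeps the returned &lt;code&gt;taskId&lt;/code&gt; and uses it later to poll for the status.&lt;/p&gt;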

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e0jvvocgb3d4fb5kgp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e0jvvocgb3d4fb5kgp4.png" alt=" " width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When an entry is created in this Table, a DynamoDB Stream will trigger the TaskStatusPublisher Lambda. This Lambda checks whether the record is a new entry (indicated by the INSERT event type) and, if so, publishes a “TaskCreatedEvent”. It is important to note that this event also contains the Task Type, as the type determines which processor needs to process the task.&lt;/p&gt;
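&lt;p&gt;The filtering logic of that publisher can be sketched as a pure function. The event name matches the blog; the record shape follows the DynamoDB Streams format, but attribute names such as &lt;code&gt;taskType&lt;/code&gt; are assumptions:&lt;/p&gt;

```python
def stream_record_to_event(record):
    """Map a DynamoDB Stream record to an EventBridge entry, or None.

    Only INSERT records produce a TaskCreatedEvent; MODIFY/REMOVE records are
    ignored here (status changes are published separately).
    """
    if record.get("eventName") != "INSERT":
        return None
    new_image = record["dynamodb"]["NewImage"]
    return {
        "Source": "task-manager",
        "DetailType": "TaskCreatedEvent",
        # Detail would be json.dumps-ed before calling put_events for real.
        "Detail": {
            "taskId": new_image["taskId"]["S"],
            # The task type decides which processor's EventBridge rule fires.
            "taskType": new_image["taskType"]["S"],
        },
    }
```

&lt;p&gt;Keeping this mapping pure makes it trivial to unit-test the INSERT-only behaviour without a stream attached.&lt;/p&gt;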

&lt;p&gt;This is essentially where the first flow of the TaskManager ends. It is the responsibility of the Task Processor to create an EventBridge rule to consume this event and process the event.&lt;/p&gt;

&lt;p&gt;The TaskManager expects to be updated regularly by the Task Processor via the TaskUpdatedEvent. The status of the task can be updated to RUNNING, SUCCESSFUL, or FAILED. In the case of SUCCESSFUL, a payload can also be added. This could be the result of the task or, in our case for the large exports, a pre-signed S3 URL to download it.&lt;/p&gt;
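&lt;p&gt;The status names come straight from the article; the transition table below is an assumption added to show how the update handler might guard against out-of-order events:&lt;/p&gt;

```python
# Status values from the article; the allowed-transition table is an assumption.
VALID_TRANSITIONS = {
    "CREATED": {"RUNNING", "FAILED"},
    "RUNNING": {"RUNNING", "SUCCESSFUL", "FAILED"},
}

def apply_task_update(item, new_status, payload=None):
    """Validate and apply a TaskUpdatedEvent to a task record (non-mutating)."""
    allowed = VALID_TRANSITIONS.get(item["status"], set())
    if new_status not in allowed:
        raise ValueError(f"cannot move {item['status']} to {new_status}")
    item = dict(item, status=new_status)
    if new_status == "SUCCESSFUL" and payload is not None:
        # e.g. a pre-signed S3 URL pointing at the finished export
        item["resultPayload"] = payload
    return item
```

&lt;p&gt;Terminal states (SUCCESSFUL, FAILED) have no outgoing transitions, so a late or duplicate update is rejected instead of silently overwriting the result.&lt;/p&gt;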

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi08kelo9m8lon5aglo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi08kelo9m8lon5aglo4.png" alt=" " width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Task Supplier can poll for the task regularly and get the status. Based on this status it can decide how to react.&lt;/p&gt;
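&lt;p&gt;A minimal polling loop on the supplier side might look like this. The &lt;code&gt;get_status&lt;/code&gt; callable stands in for a GET against the TaskManager API, and &lt;code&gt;sleep&lt;/code&gt; is injectable so the loop can be exercised without waiting; all names here are illustrative:&lt;/p&gt;

```python
import time

def poll_until_done(get_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll a task until it reaches a terminal state or we give up."""
    for _ in range(max_attempts):
        status = get_status()
        if status in ("SUCCESSFUL", "FAILED"):
            return status
        sleep(interval)  # fixed interval; exponential backoff is also an option
    raise TimeoutError("task did not finish in time")
```

&lt;p&gt;A frontend would typically do the same thing on a timer rather than a blocking loop, reacting once the terminal status arrives.&lt;/p&gt;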

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd40mqyg8tpgqjohugcop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd40mqyg8tpgqjohugcop.png" alt=" " width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that the TaskStatusPublisher also publishes TaskRunningEvent, TaskSuccessfulEvent and TaskFailedEvent. These could be used in combination with a WebSocket mechanism to receive live updates instead of polling. However, this is out of scope for this blog.&lt;/p&gt;

&lt;p&gt;This setup benefits from being completely serverless, meaning we can scale up on higher loads but also scale down to zero if there are no tasks. For this reason, in our case, we have only created one instance of this TaskManager that is shared by all tasks in our system. However, you could create multiple TaskManagers for different bounded contexts or even for each type of task.&lt;/p&gt;

&lt;h2&gt;
  
  
  CDK Project
&lt;/h2&gt;

&lt;p&gt;The example CDK project of this setup can be found in this GitHub &lt;a href="https://github.com/evertson90/aws-task-manager" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. There is a README file which explains how to build and deploy the project to your own AWS environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This pattern is a simple but powerful way to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work around API Gateway limitations&lt;/li&gt;
&lt;li&gt;Build scalable async workflows&lt;/li&gt;
&lt;li&gt;Keep your frontend responsive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're dealing with long-running operations in AWS, this approach is definitely worth considering.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>apigateway</category>
      <category>architecture</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>Multi-Agent A2A with the Agent Development Kit(ADK), Amazon ECS Express, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Thu, 16 Apr 2026 22:47:15 +0000</pubDate>
      <link>https://dev.to/aws-builders/multi-agent-a2a-with-the-agent-development-kitadk-amazon-ecs-express-and-gemini-cli-5ag7</link>
      <guid>https://dev.to/aws-builders/multi-agent-a2a-with-the-agent-development-kitadk-amazon-ecs-express-and-gemini-cli-5ag7</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build Multi-Agent Applications with A2A protocol support using the Python programming language deployed to AWS ECS Express.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgdtt0hyrg9041coa25d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgdtt0hyrg9041coa25d.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python ADK Demos?
&lt;/h4&gt;

&lt;p&gt;Yes there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a multi-agent test bed for building, debugging, and deploying multi-agent applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rock and roll ain’t noise pollution
&lt;/h4&gt;

&lt;p&gt;So what is different about this lab compared to all the others out there?&lt;/p&gt;

&lt;p&gt;This is one of the first deep dives into a Multi-Agent application leveraging the advanced tooling of Gemini CLI. The starting point for the demo was an existing Codelab, which was updated and re-engineered with Gemini CLI.&lt;/p&gt;

&lt;p&gt;The original Codelab is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/1-building-a-multi-agent-system/building-a-multi-agent-system#0" rel="noopener noreferrer"&gt;Building a Multi-Agent System | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of writing, the mainstream Python version is 3.13. To validate your current Python version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python --version
Python 3.13.13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+ECS+Express+Mode&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+ecs+express&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIJCAEQIRgKGKAB0gEIMzI0MWowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfAELWySw4fS4VoaovwdGE8MUNcOltEQ-lyCKwxY4t3OArbcxO8JX30JpX02tjJDKML-JgcQEQDIaZjDgUHMoJTycp046hy8F-_Y_zxJ9Bo0rZyERUQ6geXGT9MPUb02ZLA7LpFjGlcpRgGkURGERCNHTKdtI2kGtm-bh5XT5dS4hpo&amp;amp;csui=3&amp;amp;ved=2ahUKEwiu_YSzptWTAxVPF1kFHY8nLbwQgK4QegQIARAB" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt; (announced Nov 2025) is a simplified deployment feature for Amazon Elastic Container Service (ECS) designed to rapidly launch containerized applications, APIs, and web services on AWS Fargate. It automates infrastructure setup — including load balancing, networking, scaling, and HTTPS endpoints — allowing developers to deploy from container image to production in a single step.&lt;/p&gt;

&lt;p&gt;More details are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/express-service-overview.html" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ECS status is visible from the AWS Console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9ydwmp94aodxr907d3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9ydwmp94aodxr907d3d.png" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If not pre-installed you can download the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of Gemini CLI. You will need to authenticate with a key or your Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;Gemini CLI needs a consistent, up to date version of Node. The &lt;strong&gt;nvm&lt;/strong&gt; command can be used to get a standard Node environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Skills
&lt;/h4&gt;

&lt;p&gt;Gemini CLI can be customized to work with ADK agents. Both an Agent Development MCP server and specific Agent Skills are available.&lt;/p&gt;

&lt;p&gt;More details are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://adk.dev/tutorials/coding-with-ai/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get the Agent Skills in Gemini CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /skills list
Available Agent Skills:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the ADK documentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; /mcp list
Configured MCP servers:
🟢 adk-docs-mcp (from adk-docs-ext) - Ready (2 tools)
  Tools:
  - mcp_adk-docs-mcp_fetch_docs
  - mcp_adk-docs-mcp_list_doc_sources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting multi-agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, the ADK Multi-Agent application is built, debugged, and tested locally. Finally, the entire solution is deployed to Amazon ECS Express.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. All of the relevant code examples and documentation are available in GitHub.&lt;/p&gt;

&lt;p&gt;The next step is to clone the GitHub repository to your local environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;multi-ecsexpress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then source &lt;strong&gt;init2.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init2.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if you time out.&lt;/p&gt;

&lt;p&gt;Login to the AWS console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally install the packages and dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with the researcher agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/multi-eks/agents$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run researcher
&lt;span class="go"&gt;/home/xbill/.local/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72: UserWarning: [EXPERIMENTAL] feature FeatureName.PLUGGABLE_AUTH is enabled.
  check_feature_enabled()
Log setup complete: /tmp/agents_log/agent.20260412_164250.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
{"asctime": "2026-04-12 16:42:50,986", "name": "root", "levelname": "INFO", "message": "Logging initialized for researcher", "filename": "logging_config.py", "lineno": 54, "service": "researcher", "log_level": "INFO"}
{"asctime": "2026-04-12 16:42:50,987", "name": "researcher.agent", "levelname": "INFO", "message": "Initialized researcher agent with model: gemini-2.5-flash", "filename": "agent.py", "lineno": 85}
{"asctime": "2026-04-12 16:42:50,988", "name": "google_adk.google.adk.cli.utils.envs", "levelname": "INFO", "message": "Loaded .env file for researcher at /home/xbill/gemini-cli-aws/multi-eks/.env", "filename": "envs.py", "lineno": 83}
{"asctime": "2026-04-12 16:42:50,988", "name": "google_adk.google.adk.cli.utils.local_storage", "levelname": "INFO", "message": "Using per-agent session storage rooted at /home/xbill/gemini-cli-aws/multi-eks/agents", "filename": "local_storage.py", "lineno": 84}
{"asctime": "2026-04-12 16:42:50,988", "name": "google_adk.google.adk.cli.utils.local_storage", "levelname": "INFO", "message": "Using file artifact service at /home/xbill/gemini-cli-aws/multi-eks/agents/researcher/.adk/artifacts", "filename": "local_storage.py", "lineno": 110}
{"asctime": "2026-04-12 16:42:50,988", "name": "google_adk.google.adk.cli.utils.service_factory", "levelname": "INFO", "message": "Using in-memory memory service", "filename": "service_factory.py", "lineno": 266}
{"asctime": "2026-04-12 16:42:50,993", "name": "google_adk.google.adk.cli.utils.local_storage", "levelname": "INFO", "message": "Creating local session service at /home/xbill/gemini-cli-aws/multi-eks/agents/researcher/.adk/session.db", "filename": "local_storage.py", "lineno": 60}
Running agent researcher, type exit to exit.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Test The ADK Web Interface
&lt;/h4&gt;

&lt;p&gt;This tests the ADK agent interactions with a browser:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/multi-eks/agents$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk web &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0
&lt;span class="go"&gt;/home/xbill/.local/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:72: UserWarning: [EXPERIMENTAL] feature FeatureName.PLUGGABLE_AUTH is enabled.
  check_feature_enabled()
2026-04-12 16:43:14,152 - INFO - service_factory.py:266 - Using in-memory memory service
2026-04-12 16:43:14,153 - INFO - local_storage.py:84 - Using per-agent session storage rooted at /home/xbill/gemini-cli-aws/multi-eks/agents
2026-04-12 16:43:14,153 - INFO - local_storage.py:110 - Using file artifact service at /home/xbill/gemini-cli-aws/multi-eks/agents/.adk/artifacts
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/fast_api.py:198: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
INFO: Started server process [32675]
INFO: Waiting for application startup.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use the web interface, either on the local interface &lt;strong&gt;127.0.0.1&lt;/strong&gt; or the catch-all interface &lt;strong&gt;0.0.0.0&lt;/strong&gt;, depending on your environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdsixkis3hdhngrjbooa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdsixkis3hdhngrjbooa.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Special note for Google Cloud Shell deployments: add a CORS &lt;strong&gt;allow_origins&lt;/strong&gt; configuration exemption to allow the ADK agent to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adk web &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--allow_origins&lt;/span&gt; &lt;span class="s1"&gt;'regex:.*'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Multi-Agent Design
&lt;/h4&gt;

&lt;p&gt;The multi-agent deployment consists of 5 agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Researcher&lt;/li&gt;
&lt;li&gt;Judge&lt;/li&gt;
&lt;li&gt;Orchestrator&lt;/li&gt;
&lt;li&gt;Content Builder&lt;/li&gt;
&lt;li&gt;Course Builder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a detailed analysis of the multi-agent architecture, this article provides the background:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xbill999.medium.com/multi-agent-a2a-with-the-agent-development-kit-adk-cloud-run-and-gemini-cli-52f8be838ad6" rel="noopener noreferrer"&gt;Multi-Agent A2A with the Agent Development Kit(ADK), Cloud Run, and Gemini CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running/Testing/Debugging Locally
&lt;/h4&gt;

&lt;p&gt;The main Makefile has been extended with extensive targets for managing the agents in the local development environment.&lt;/p&gt;

&lt;p&gt;The key targets include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xbill@penguin:~/multi-agent$ make help
Available commands:
  install - Install all dependencies for root, agents, and app
  start - Start all services locally (alias for start-local)
  stop - Stop all local services (alias for stop-local)
  run - Start all services locally (alias for start-local)
  local - Show local service URLs
  start-local - Start all local services in background
  stop-local - Stop all local processes
  test - Run all tests (pytest)
  test-researcher - Test the Researcher agent directly
  test-judge - Test the Judge agent directly
  test-orchestrator - Test the Orchestrator logic
  lint - Run linting checks (ruff)
  deploy - Deploy all services to Cloud Run
  destroy - Delete all Cloud Run services
  clean - Remove caches and logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, check for locally running agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make status
&lt;span class="go"&gt;Checking status of locally running agents and servers...
--- Network Status ---
No services listening on expected ports (8000-8004, 5173).
--- Process Status ---
No matching processes found.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then all the agents can be started together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The frontend has been built and all local services (Researcher, Judge, Content Builder, Orchestrator, App Backend, and Frontend dev server) have
  been started in the background.

   - Frontend: http://localhost:5173 (http://localhost:5173)
   - Backend: http://localhost:8000 (http://localhost:8000)

  Logs for each service are available in the project root: researcher.log, judge.log, content_builder.log, orchestrator.log, backend.log, and
  frontend.log.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect to the front end locally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eui8i77oqdiftnerk4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eui8i77oqdiftnerk4r.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the entire agent system will run in the local environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkygh1ukveelo7tr9uw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkygh1ukveelo7tr9uw.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Local Logging / Debugging
&lt;/h4&gt;

&lt;p&gt;Gemini CLI has full access to the local agent logs for debugging and troubleshooting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; review the local logs for the multi-agent run
✦ I will check run_local.sh and the project structure to determine where logs are stored or how they are generated.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will verify the local operation of the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The local logs confirm a successful multi-agent run for the topic "the history of the internet." Here is a summary of the execution flow:

* Orchestrator (orchestrator.log): Successfully managed the pipeline. It captured the topic, coordinated the Research-Judge loop, and escalated to
     content generation after the first iteration.
   * Researcher (researcher.log): Gathered approximately 9.2KB of research data using the google_search tool. It completed its task with a 200 OK
     response.
   * Judge (judge.log): Evaluated the findings and gave a "pass" status. The judge noted that the research was "comprehensive, accurate, and
     well-structured," making it an "excellent foundation for educational content."
   * Content Builder (content_builder.log): Successfully transformed the validated research into a structured Markdown course module and finished
     execution.
   * Backend &amp;amp; Frontend (backend.log, frontend.log): The FastAPI backend (port 8000) and Vite frontend (port 5173) are running and communicating with
     the orchestrator as expected.

Notable Observation:
   * A minor warning appeared in researcher.log: Researcher callback: 'response' not found or has no parts. However, this did not impact the
     successful completion of the research or the overall pipeline.

All services are currently active and logged at the DEBUG level as specified in run_local.sh.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploying to Amazon Elastic Container Service (ECS)
&lt;/h4&gt;

&lt;p&gt;The project-level Makefile has targets for managing agent deployment to serverless endpoints. You can submit the build for an ECS Express deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make deploy
✦ The deployment was successful, and all 5 services are active on AWS ECS Express. You can access the Web App at the following URL:

  Web App (Course Creator): https://ad-65d6861112ff49099782001efe5e2721.ecs.us-east-1.on.aws

  The other microservices are also deployed and integrated:
   - Researcher: https://ad-8779b3dc720e4d9e9ca9b1091499084a.ecs.us-east-1.on.aws
   - Judge: https://ad-ab123be1fad04a9390e1d918f9b8ec04.ecs.us-east-1.on.aws
   - Content Builder: https://ad-622b8527fd1f41668624714f62deee0f.ecs.us-east-1.on.aws
   - Orchestrator: https://ad-c0f70b2d021744ec9761c2e54ca60287.ecs.us-east-1.on.aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the containers are deployed, you can retrieve the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The Web App (Course Creator) endpoint is:

  https://ad-65d6861112ff49099782001efe5e2721.ecs.us-east-1.on.aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will be visible in the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5t022b841ry8jn2ibzw8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5t022b841ry8jn2ibzw8.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the entire system can be tested:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; make e2e-test-ecsexpress

✦ The end-to-end test of the AI Course Creator on AWS ECS Express was successful. The system, comprising 5 microservices, correctly researched
  "The History of the Internet," evaluated the findings, and generated a structured 4-module course. The public URL for the Web App is:

  https://ad-65d6861112ff49099782001efe5e2721.ecs.us-east-1.on.aws

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Running the Web Interface
&lt;/h4&gt;

&lt;p&gt;Start a connection to the deployed app via its load balancer URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;http://a27c61bc6fb3c425ca13d862e0fe4aed-865627292.us-east-1.elb.amazonaws.com
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect to the app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x3ewby1onl8r03adirm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x3ewby1onl8r03adirm.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then use the online course generator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduba7v2df0ovz2kxqfz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduba7v2df0ovz2kxqfz7.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Final Gemini CLI Code Review
&lt;/h4&gt;

&lt;p&gt;As a final step, Gemini CLI was used for a full code review of the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ This multi-agent project is a well-engineered application of ADK 2.5 and the A2A protocol. The separation of specialized agents (Researcher,
  Judge, Content Builder) coordinated by a central Orchestrator demonstrates a mature microservice-oriented design.

Key Strengths
   * Coordinated Orchestration: The use of SequentialAgent and LoopAgent creates a robust, iterative research-judge cycle.
   * Resilient State Flow: The strategy of passing data through event content and "recovering" it via before_agent_callback heuristics is a clever
     way to handle state across distributed, independent session stores.
   * Polished Streaming: The web app's SSE implementation, specifically the greedy overlap deduplication (merge_strings) and system message
     cleanup, ensures a high-quality user experience despite the inherent noise in multi-agent LLM streams.
   * Cloud-Native Readiness: Using Identity Tokens for authenticated service-to-service communication and middleware for dynamic A2A URL rewriting
     makes the system ready for production deployment on Cloud Run.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit (ADK) was used to build a multi-agent system with A2A support using the Gemini Flash LLM Model. This application was tested locally with Gemini CLI and then deployed to AWS ECS Express. Finally, Gemini CLI was used for a complete project code review.&lt;/p&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>googleadk</category>
      <category>a2aprotocol</category>
      <category>gemini</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS Data &amp; AI Stories #01: Multimodal AI</title>
      <dc:creator>Sedat SALMAN</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:49:58 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-data-ai-stories-01-multimodal-ai-2k4k</link>
      <guid>https://dev.to/aws-builders/aws-data-ai-stories-01-multimodal-ai-2k4k</guid>
      <description>&lt;p&gt;In traditional AI systems, text was usually the main input.&lt;/p&gt;

&lt;p&gt;But text alone is not enough to solve real-life problems.&lt;/p&gt;

&lt;p&gt;Today, many workloads include documents, images, audio, and video at the same time. A user may upload a PDF report, attach a photo, send a voice note, or provide a short video clip. If our solution only understands text, we miss a big part of the context.&lt;/p&gt;

&lt;p&gt;This is where multimodal AI becomes important.&lt;/p&gt;

&lt;p&gt;On AWS, multimodal AI is now becoming more practical. Amazon Bedrock Knowledge Bases supports multimodal content such as images, audio, and video, and AWS now provides different processing approaches depending on whether the goal is retrieval or structured extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Multimodal AI?
&lt;/h2&gt;

&lt;p&gt;Multimodal AI means an AI system can work with more than one type of data.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;text&lt;/li&gt;
&lt;li&gt;images&lt;/li&gt;
&lt;li&gt;scanned documents&lt;/li&gt;
&lt;li&gt;audio&lt;/li&gt;
&lt;li&gt;video&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of focusing on only one format, the system can process and combine multiple data types to produce better results.&lt;/p&gt;

&lt;p&gt;This is useful because enterprise data is rarely pure text. A lot of business value sits inside screenshots, scanned forms, call recordings, diagrams, inspection videos, and media-rich documents. AWS’s current multimodal stack is built around exactly that problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does it matter?
&lt;/h2&gt;

&lt;p&gt;Because real systems are multimodal by nature.&lt;/p&gt;

&lt;p&gt;Think about a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A support team receives images and voice notes from the field&lt;/li&gt;
&lt;li&gt;A finance team works with reports, charts, and scanned documents&lt;/li&gt;
&lt;li&gt;A healthcare team uses forms, reports, and medical images&lt;/li&gt;
&lt;li&gt;An industrial operation stores inspection photos, maintenance PDFs, and recorded operator observations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all of these cases, text-only AI is limited.&lt;/p&gt;

&lt;p&gt;A multimodal approach helps us move from isolated files to connected understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multimodal AI on AWS
&lt;/h2&gt;

&lt;p&gt;When I look at AWS from a practical point of view, I see multimodal AI as a workflow, not just a model.&lt;/p&gt;

&lt;p&gt;A simple logical flow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect multimodal data&lt;/li&gt;
&lt;li&gt;Process and extract useful information&lt;/li&gt;
&lt;li&gt;Store or index the result&lt;/li&gt;
&lt;li&gt;Retrieve relevant context&lt;/li&gt;
&lt;li&gt;Generate answers, summaries, or actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS already has building blocks for this. Amazon Bedrock Data Automation is designed to extract structured insights from documents, images, audio, and video. Amazon Bedrock Knowledge Bases supports multimodal retrieval. AWS also supports Nova Multimodal Embeddings for visual and cross-modal retrieval scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Main AWS services to know
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Amazon Bedrock Data Automation
&lt;/h3&gt;

&lt;p&gt;This is useful when the main challenge is understanding raw input.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracting information from documents&lt;/li&gt;
&lt;li&gt;analyzing images&lt;/li&gt;
&lt;li&gt;processing audio&lt;/li&gt;
&lt;li&gt;turning video into structured output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the input is messy and unstructured, this is a strong starting point. AWS positions Bedrock Data Automation specifically for automating insight generation from unstructured multimodal content.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Amazon Bedrock Knowledge Bases
&lt;/h3&gt;

&lt;p&gt;This is useful when the goal is retrieval.&lt;/p&gt;

&lt;p&gt;If you want your AI application to search your content and answer questions based on it, Knowledge Bases becomes important. AWS documentation now states that Bedrock Knowledge Bases supports images, audio, and video in addition to traditional text-based content.&lt;/p&gt;
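
&lt;p&gt;As a concrete sketch, a knowledge base query is a single API request. The helper below only builds the request parameters in the shape that boto3's &lt;code&gt;bedrock-agent-runtime&lt;/code&gt; &lt;code&gt;retrieve&lt;/code&gt; call expects; the knowledge base ID and query string are placeholders:&lt;/p&gt;

```python
# Sketch: building the request for a Bedrock Knowledge Bases query.
# The kwargs match the shape of boto3's bedrock-agent-runtime
# `retrieve` call; the knowledge base ID below is a placeholder.

def build_retrieve_request(kb_id: str, query: str, top_k: int = 5) -> dict:
    """Return the keyword arguments for a Knowledge Bases retrieval."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }

# Usage (requires boto3 and AWS credentials):
#   client = boto3.client("bedrock-agent-runtime")
#   response = client.retrieve(
#       **build_retrieve_request("KB123EXAMPLE", "refund policy for damaged items"))

request = build_retrieve_request("KB123EXAMPLE", "refund policy for damaged items")
print(request["retrievalQuery"]["text"])
```

&lt;p&gt;With multimodal content indexed, the same call can surface passages whose source was an image, an audio transcript, or a video segment, not just text chunks.&lt;/p&gt;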

&lt;h3&gt;
  
  
  3. Amazon Nova Multimodal Embeddings
&lt;/h3&gt;

&lt;p&gt;This is useful when the goal is similarity and cross-modal search.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;finding images similar to another image&lt;/li&gt;
&lt;li&gt;searching media with text&lt;/li&gt;
&lt;li&gt;creating a shared semantic space across different content types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Nova Multimodal Embeddings supports text, documents, images, video, and audio in a single embedding space, which makes cross-modal retrieval possible.&lt;/p&gt;
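
&lt;p&gt;Because every modality lands in one vector space, cross-modal search reduces to nearest-neighbor ranking over embeddings. A minimal stdlib sketch with made-up three-dimensional vectors (real Nova embeddings are far higher-dimensional, and the file names here are invented):&lt;/p&gt;

```python
import math

# Sketch of cross-modal search over a shared embedding space.
# With a multimodal embedding model, text, images, audio, and video
# all map into one vector space; these tiny vectors are made up to
# show how retrieval then reduces to nearest-neighbor search.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these came from embedding an image, a video clip, and a PDF page.
index = {
    "inspection_photo.jpg": [0.9, 0.1, 0.0],
    "training_clip.mp4":    [0.1, 0.9, 0.1],
    "report_page_3.pdf":    [0.2, 0.1, 0.9],
}

def search(query_vec, index, top_k=2):
    """Rank indexed items by similarity to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# A text query such as "corroded pipe" would be embedded into the
# same space; here we use a hand-written query vector instead.
print(search([0.85, 0.15, 0.05], index))
```

&lt;p&gt;This is what "searching media with text" means in practice: the modality of the query and the modality of the stored content no longer have to match.&lt;/p&gt;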

&lt;h2&gt;
  
  
  A simple architecture view
&lt;/h2&gt;

&lt;p&gt;At a high level, a multimodal AI architecture on AWS can look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data comes from users, applications, devices, or storage&lt;/li&gt;
&lt;li&gt;Files are stored in Amazon S3&lt;/li&gt;
&lt;li&gt;Bedrock Data Automation extracts useful content and structure&lt;/li&gt;
&lt;li&gt;Bedrock Knowledge Bases indexes or connects relevant knowledge&lt;/li&gt;
&lt;li&gt;Nova Multimodal Embeddings can support semantic and visual retrieval&lt;/li&gt;
&lt;li&gt;A Bedrock-based application or assistant generates the final output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is also reflected in AWS guidance and recent AWS machine learning posts around multimodal retrieval and multimodal assistants.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval or extraction?
&lt;/h2&gt;

&lt;p&gt;This is one of the first design questions I would ask.&lt;/p&gt;

&lt;p&gt;Do I want to extract information from the content?&lt;/p&gt;

&lt;p&gt;Or do I want to retrieve relevant content across multiple modalities?&lt;/p&gt;

&lt;p&gt;These are not exactly the same problem.&lt;/p&gt;

&lt;p&gt;If the main need is converting raw media into structured output, Bedrock Data Automation is usually the right starting point.&lt;/p&gt;

&lt;p&gt;If the main need is visual similarity or cross-modal search, Nova Multimodal Embeddings is often the better fit.&lt;/p&gt;

&lt;p&gt;AWS explicitly separates these two approaches in its multimodal guidance, which is useful because many teams try to solve all multimodal problems in the same way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where can this be used?
&lt;/h2&gt;

&lt;p&gt;There are many practical scenarios.&lt;/p&gt;

&lt;p&gt;A few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intelligent document processing&lt;/li&gt;
&lt;li&gt;visual search&lt;/li&gt;
&lt;li&gt;media search&lt;/li&gt;
&lt;li&gt;support case analysis&lt;/li&gt;
&lt;li&gt;industrial inspection workflows&lt;/li&gt;
&lt;li&gt;knowledge assistants&lt;/li&gt;
&lt;li&gt;report summarization from mixed content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, the important point is this: multimodal AI is not only about chat. It is also about turning different content types into usable business knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistake
&lt;/h2&gt;

&lt;p&gt;A common mistake is to think multimodal AI means only “upload file and ask question.”&lt;/p&gt;

&lt;p&gt;That is too simple.&lt;/p&gt;

&lt;p&gt;A real solution usually needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion&lt;/li&gt;
&lt;li&gt;extraction&lt;/li&gt;
&lt;li&gt;indexing&lt;/li&gt;
&lt;li&gt;retrieval&lt;/li&gt;
&lt;li&gt;generation&lt;/li&gt;
&lt;li&gt;governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is only one part of the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Multimodal AI is becoming a practical architecture topic on AWS.&lt;/p&gt;

&lt;p&gt;Instead of treating text, images, audio, and video as separate worlds, we can now build workflows that connect them. AWS already provides managed building blocks for processing, retrieval, and generation across these content types, which makes multimodal design much more realistic than before.&lt;/p&gt;

&lt;p&gt;For me, the first step is not choosing the model.&lt;/p&gt;

&lt;p&gt;The first step is asking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What type of data do I have?&lt;/li&gt;
&lt;li&gt;What insight do I need?&lt;/li&gt;
&lt;li&gt;Do I need extraction, retrieval, or both?&lt;/li&gt;
&lt;li&gt;Do I want an assistant, a search engine, or a workflow?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If these answers are clear, the architecture becomes much easier.&lt;/p&gt;

&lt;p&gt;In the next article, I will focus on Amazon Bedrock Data Automation and how it fits into a real multimodal workflow on AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>datascience</category>
      <category>awsbigdata</category>
    </item>
    <item>
      <title>Monitorando AWS com Datadog: o que aprendi partindo do zero</title>
      <dc:creator>Diogo Maske</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:07:47 +0000</pubDate>
      <link>https://dev.to/aws-builders/monitorando-aws-com-datadog-o-que-aprendi-partindo-do-zero-3n24</link>
      <guid>https://dev.to/aws-builders/monitorando-aws-com-datadog-o-que-aprendi-partindo-do-zero-3n24</guid>
      <description>&lt;p&gt;🇧🇷&lt;em&gt;O artigo está em PT-BR, fique à vontade para traduzir.&lt;/em&gt;&lt;br&gt;
🇺🇲&lt;em&gt;The article is in PT-BR, feel free to translate it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;☁️☁️☁️&lt;/p&gt;

&lt;p&gt;I just finished the &lt;strong&gt;Introduction to Monitoring AWS with Datadog&lt;/strong&gt; lab and want to share what I learned, because it was far more interesting than I expected. And also because I had some questions that I think many beginners would have too.&lt;/p&gt;

&lt;p&gt;Before starting, I'll be honest: I knew what AWS solves and delivers, and I had a vague notion of what Datadog was &lt;em&gt;("that expensive monitoring tool")&lt;/em&gt;. But I didn't understand how the two connect in practice, or why that would matter in the day-to-day of a DevOps team.&lt;/p&gt;

&lt;p&gt;The course uses a fictional app called &lt;strong&gt;TechStories&lt;/strong&gt;, &lt;em&gt;a news and social media platform hosted entirely on AWS&lt;/em&gt;, as the basis for the exercises. That helped a lot, because it gave a real context to everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem the course &lt;strong&gt;solves&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine your team maintains an application running on EC2, with containers on ECS Fargate, a database on RDS, Lambda functions, and DynamoDB tables. How do you know everything is working well? You keep switching between the AWS console, CloudWatch, and scattered logs; it is chaotic.&lt;/p&gt;

&lt;p&gt;Datadog comes in as a &lt;em&gt;single observability layer&lt;/em&gt;: you collect metrics, logs, and traces from all of it in one place. The lab simulated exactly that scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf8xcmafb95yv64826pi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf8xcmafb95yv64826pi.png" alt="Architecture" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the AWS &lt;strong&gt;integration&lt;/strong&gt; works under the hood
&lt;/h2&gt;

&lt;p&gt;The first thing the course explains is that there are basically two ways to collect AWS data in Datadog:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Via CloudWatch polling&lt;/strong&gt;: Datadog queries the AWS API from time to time to fetch the metrics. It is the simplest to configure, but it has a delay of a few minutes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Via Metric Streams&lt;/strong&gt;: you configure AWS to push metrics to Datadog in near real time. Much lower latency, ideal for critical alerts.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, to get started you install the AWS integration in Datadog, create an IAM role in your account with the correct permissions, and point Datadog at that role. After that, it starts discovering your resources automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The &lt;strong&gt;Datadog Forwarder&lt;/strong&gt; and the log story
&lt;/h2&gt;

&lt;p&gt;One of the parts that most made me &lt;em&gt;&lt;strong&gt;stop and reread&lt;/strong&gt;&lt;/em&gt; was the log flow. On AWS, the logs of managed services (Lambda, RDS, etc.) go to CloudWatch Logs. But Datadog does not read CloudWatch Logs directly; &lt;em&gt;you need an intermediary&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That intermediary is the &lt;strong&gt;Datadog Forwarder&lt;/strong&gt;: a Lambda function that you install via CloudFormation and that "listens" to the CloudWatch Log Groups. When a new log arrives, it forwards it to Datadog. It works nicely once it is in place, but it requires configuring subscriptions for each log group you want to monitor.&lt;/p&gt;

&lt;p&gt;In the lab, after configuring this, I was able to see the logs of the &lt;code&gt;keyword-insights-processor&lt;/code&gt; Lambda directly in the &lt;strong&gt;Datadog Log Explorer&lt;/strong&gt;, with all the structured fields and automatic tags for environment, service, function ARN, and so on. Very different from digging through CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Framnc382tpilx0705ndi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Framnc382tpilx0705ndi.png" alt="keyword-insights-processor" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Datadog Agent:&lt;/strong&gt; when managed services are not enough
&lt;/h2&gt;

&lt;p&gt;For EC2 and containers, Datadog has its own agent, a process that runs inside the machine/container and collects much more granular metrics than CloudWatch offers: per-process memory, disk latency, custom metrics, traces...&lt;/p&gt;

&lt;p&gt;In the case of ECS Fargate, you run the Agent as a sidecar container in the same task definition as your service. When I first saw this it seemed very laborious, but in the lab they show it can be done via CloudFormation, and the tags stay consistent thanks to a &lt;code&gt;key:value&lt;/code&gt; pattern defined on the resources:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;App:TechStories / Env:monitoring-aws-lab / Service:&amp;lt;nome-do-serviço&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jqhs1ufod41olx3hj0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jqhs1ufod41olx3hj0m.png" alt="ECS Fargate Services - Tags" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is fundamental.&lt;/strong&gt; Without consistent tags, you cannot filter anything in Datadog afterwards.&lt;/p&gt;
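
&lt;p&gt;For illustration, the Agent sidecar entry in a Fargate task definition looks roughly like this. This is a sketch: the service name, account ID, secret ARN, and tag values are placeholders, while &lt;code&gt;DD_TAGS&lt;/code&gt; and &lt;code&gt;ECS_FARGATE&lt;/code&gt; are the Datadog Agent's actual configuration variables:&lt;/p&gt;

```json
{
  "name": "datadog-agent",
  "image": "public.ecr.aws/datadog/agent:latest",
  "environment": [
    { "name": "DD_TAGS", "value": "App:TechStories Env:monitoring-aws-lab Service:techstories-web" },
    { "name": "ECS_FARGATE", "value": "true" }
  ],
  "secrets": [
    { "name": "DD_API_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:datadog-api-key" }
  ]
}
```

&lt;p&gt;Setting the tags once in the task definition is what keeps them consistent across every metric, log, and trace the sidecar ships.&lt;/p&gt;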




&lt;h2&gt;
  
  
  What you can see once everything is integrated
&lt;/h2&gt;

&lt;p&gt;The Resource Catalog showed everything discovered in the account: 343 databases, 70 containers, 22 serverless functions, 1 EC2 host... all categorized, with the associated region, environment, and service. You can click on any resource and see its metrics directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbf57w8t1z61b8pca5lz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbf57w8t1z61b8pca5lz.png" alt="Resource Catalog" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Metrics section, I searched for &lt;code&gt;aws.dynamodb&lt;/code&gt; and 22 different metrics appeared: read/write capacity, item counts, table sizes. Things I would previously have had to build dashboards for manually in CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5jbp6kvtxrr6mlgtvtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5jbp6kvtxrr6mlgtvtl.png" alt="Metrics" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;AWS Overview Dashboard&lt;/strong&gt;, which ships out of the box (OOTB) in Datadog, already gave an overall picture: 1 EC2 instance running, 4 status monitors all OK, 491 Lambda invocations in the last hour, a 0% error rate, and an average duration of 988 ms. You can also see traces of HTTP requests directly on the EC2 host: each request with its host, service, resource, and latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oaqp4r9hqeq4pxre7pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oaqp4r9hqeq4pxre7pw.png" alt="AWS Overview Dashboard" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The lesson that stuck 🧠
&lt;/h2&gt;

&lt;p&gt;More than the tools themselves, the lab reinforced one thing: observability is not about piling up data. It is about having the right data, with enough context &lt;strong&gt;(tags!)&lt;/strong&gt; to answer questions when something goes wrong.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;AWS + Datadog&lt;/strong&gt; integration, when well configured, gives you exactly that: &lt;em&gt;a single place for metrics, logs, and traces, with correlation between them.&lt;/em&gt; You see response time go up, click through to the trace, land on the Lambda log, and understand what happened.&lt;/p&gt;

&lt;p&gt;I still have a lot to learn (APM, alerting, SLOs...), but this was a great foundation. I recommend the &lt;em&gt;Datadog Learning Center&lt;/em&gt; to anyone getting started: the labs are hands-on and the environment comes pre-configured, so you focus on learning instead of fighting IAM.&lt;/p&gt;

&lt;p&gt;If you want to talk about any of this or follow my journey, find me on &lt;a href="https://www.linkedin.com/in/diogomaske/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; ✌️&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datadog</category>
      <category>productivity</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The Setup Is the Strategy: How I Orchestrated a Product Migration with Claude Code</title>
      <dc:creator>Karthik Subramanian</dc:creator>
      <pubDate>Thu, 16 Apr 2026 13:49:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-setup-is-the-strategy-how-i-orchestrated-a-product-migration-with-claude-code-b92</link>
      <guid>https://dev.to/aws-builders/the-setup-is-the-strategy-how-i-orchestrated-a-product-migration-with-claude-code-b92</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Most engineers using Claude Code are getting a fraction of its value. Not because the tool isn't capable — but because they're using it out of the box, unconfigured, the way you'd use a new IDE without installing extensions or setting up your build system. The default experience is decent. The configured experience is transformative.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I'm a Senior Software Engineering Manager leading a team focused on leveraging AI to find acceleration opportunities across the software development lifecycle. I've been playing around with AI tooling for a while and could see its potential, so I proposed a proof of concept: take a real product migration — the kind of project that would normally require a team of engineers across multiple sprints — and attempt it solo, using Claude Code as my primary development platform. Not "AI-assisted development," but an AI-first model where every phase of the SDLC runs through Claude Code's feature set.&lt;/p&gt;

&lt;p&gt;The migration itself was substantial: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 legacy repositories consolidated into 2&lt;/li&gt;
&lt;li&gt;a database migration from MySQL to PostgreSQL&lt;/li&gt;
&lt;li&gt;framework upgrades across the full stack (Spring Boot 2 to 3, Java 17 to 21, React 17 to 18)&lt;/li&gt;
&lt;li&gt;an authentication model replacement&lt;/li&gt;
&lt;li&gt;complete test suites for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What made this work wasn't Claude Code itself. It was how I set it up.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The skills that made me effective at leading engineering teams — providing clear context, delegating with specificity, reviewing rigorously, building repeatable processes — turned out to be exactly the skills that make Claude Code most effective. Engineers whose day-to-day revolves around writing code by hand sometimes struggle with this shift because the instinct is to do the work yourself, not to set up the system and direct it. I'd spent years not writing the code myself. I was already an orchestrator. The medium changed, the model didn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post walks through how I configured Claude Code for a real migration, organized as building blocks. Each layer depends on the one below it. Skip the foundation and the rest falls apart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6s9pycdaj67n89elccc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6s9pycdaj67n89elccc.png" alt="Building blocks of an AI-first migration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Building Block 1: Planning — Laying the Foundation
&lt;/h2&gt;

&lt;p&gt;Before any code was written, I built the context and planning infrastructure. This is the layer most people skip, and it's the layer that matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLAUDE.md — The Master Context
&lt;/h3&gt;

&lt;p&gt;Claude Code reads CLAUDE.md files on every session start. They're your project memory — the equivalent of onboarding documentation for a new team member. I built a multi-level hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workspace-level&lt;/strong&gt; (181 lines): overview of all 25+ repositories, tech stack summary, cross-service architecture, shared build commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-level&lt;/strong&gt;: 18-service catalog, glossary of domain terms, JWT authentication architecture, service communication patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-repo&lt;/strong&gt;: service-specific conventions, module structure, testing standards, local dev setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each level inherits from its parent. When Claude Code opens a session in any repo, it automatically has the full context stack — from the broadest architectural view down to the specific service conventions.&lt;/p&gt;

&lt;p&gt;This is the single highest-leverage thing you can configure. A well-written CLAUDE.md prevents Claude from re-exploring your codebase every session, asking questions you've already answered, or making assumptions that contradict your architecture. It's free, it's immediate, and it compounds over time.&lt;/p&gt;
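&lt;p&gt;As a concrete illustration, a minimal per-repo CLAUDE.md might look like the sketch below. The service name, commands, and conventions here are hypothetical placeholders, not taken from the project in this post:&lt;/p&gt;

```markdown
# billing-service

## Stack
Spring Boot 3, Java 21, PostgreSQL 15. Frontend lives in billing-ui (React 18).

## Build and test
- `./mvnw verify`: full build with tests, run it before every commit
- `./mvnw spring-boot:run`: local dev server on port 8080

## Conventions
- Database changes go through Flyway migrations; never edit an applied migration
- Integration tests use Testcontainers; name them *IT, unit tests *Test
- Controllers return DTOs, never JPA entities
```

&lt;p&gt;Even a file this small saves Claude a round of exploration and questions at the start of every session.&lt;/p&gt;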

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev8c962aj9abj5k571cv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev8c962aj9abj5k571cv.png" alt="CLAUDE.md hierarchy — workspace, domain, and repo levels" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Memory System
&lt;/h3&gt;

&lt;p&gt;Claude Code has a persistent memory system — files that survive across sessions. I built 17 memory files organized by type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt;: who I am, my role, what I'm working on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback&lt;/strong&gt;: corrections that become permanent rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project&lt;/strong&gt;: active work context, ticket maps, architectural decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference&lt;/strong&gt;: pointers to external systems (Confluence pages, Jira boards, SonarQube dashboards)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The feedback memories are the most powerful. Every time I corrected Claude — "don't amend commits, it breaks CI and MR reviews" or "always fix all test failures before committing, even seemingly pre-existing ones" or "copy application-local.yml to worktrees because it's gitignored" — that correction became a permanent rule. One-time mistakes became permanent automation. After a few weeks, the memory system had captured dozens of workflow-specific rules that would take a new team member months to internalize.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Context Funnel
&lt;/h3&gt;

&lt;p&gt;For the planning phase itself, I fed Claude Code everything it would need to design the migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The four legacy repositories (full source access)&lt;/li&gt;
&lt;li&gt;Dev database connections (read-only) to both MySQL and PostgreSQL&lt;/li&gt;
&lt;li&gt;A reference implementation — a similar product that had already been migrated to the platform&lt;/li&gt;
&lt;li&gt;Live Swagger documentation from upstream services (router, rostering, authentication APIs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this context loaded, I used the &lt;strong&gt;brainstorming skill&lt;/strong&gt; (from the superpowers plugin) to generate the migration design across all eight repositories simultaneously. The skill enforces a structured process: explore context, ask clarifying questions, propose approaches with trade-offs, present the design for approval, then write a spec document.&lt;/p&gt;

&lt;p&gt;I also used &lt;strong&gt;agent teams&lt;/strong&gt; — an experimental Claude Code feature that runs parallel reviewers with independent context windows — to stress-test the design. Three independent agents reviewed the same architecture and caught issues a single pass missed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing resume logic for interrupted user flows (a legacy endpoint had been removed without accounting for in-progress sessions)&lt;/li&gt;
&lt;li&gt;Frontend state invalidation gaps in the data fetching layer&lt;/li&gt;
&lt;li&gt;Unnecessary network hops that could be eliminated now that previously separate services lived in the same JVM&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Giving AI direct database access allowed exact column-by-column mapping between the legacy and target schemas. It caught DDL mismatches — timestamp type differences, nullable column discrepancies, default value conflicts — that ORM annotations hide. Without this, the migration would have hit runtime errors that are painful to debug after the fact.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3q877d0dn02arswueu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3q877d0dn02arswueu4.png" alt="The context funnel — from raw inputs to structured outputs" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Atlassian MCP + Custom Jira Skill — From Design to Tickets
&lt;/h3&gt;

&lt;p&gt;This is where the &lt;strong&gt;Atlassian MCP&lt;/strong&gt; enters the story. It connects Claude Code directly to Jira and Confluence — no browser, no context switching.&lt;/p&gt;

&lt;p&gt;First, the design became documentation: 20+ Confluence pages generated directly from Claude Code via MCP. Design documents, use case specifications, system architecture diagrams — all created and published without leaving the terminal. That said, this is where I hit my first major failure. The Atlassian MCP's &lt;code&gt;updateConfluencePage&lt;/code&gt; tool silently truncates content beyond ~5KB. I asked Claude to update two design documents — 37KB and 46KB — and both were overwritten with partial content. I had to manually restore them from Confluence's page history. The data loss was real. I immediately encoded a memory rule: never update large Confluence pages via MCP, only add comments. Lesson learned the hard way.&lt;/p&gt;

&lt;p&gt;Then came the decomposition. This was one of the most powerful things I did: I tasked Claude with breaking the architecture into Jira tickets scoped for three constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reviewable code reviews&lt;/strong&gt;: no 2,000-line merge requests that reviewers rubber-stamp. Each ticket's scope had to produce a merge request a human could meaningfully assess.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA throughput&lt;/strong&gt;: QA can't test a monolithic "migrate everything" ticket. Each ticket needed to be independently testable with clear acceptance criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel development&lt;/strong&gt;: tickets needed clean boundaries so multiple could be in-flight simultaneously without merge conflicts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: &lt;strong&gt;19 Jira tickets created in a single session&lt;/strong&gt; from the design docs. Each with acceptance criteria in Atlassian Document Format, story points, sprint assignment, and a parent epic link. But it couldn't link them — the MCP tool for creating issue links between tickets throws a "not found" error. I had to go into Jira manually and add the "is blocked by" relationships myself. Not everything is automatable yet.&lt;/p&gt;

&lt;p&gt;The Jira API has other quirks that would bite you every session without the right setup. So I built the &lt;strong&gt;&lt;code&gt;my-jira&lt;/code&gt; skill&lt;/strong&gt; — a custom skill file that encodes all the workarounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;createJiraIssue&lt;/code&gt; renders newlines as literal &lt;code&gt;\n&lt;/code&gt; text. The skill enforces a follow-up &lt;code&gt;editJiraIssue&lt;/code&gt; call to fix formatting.&lt;/li&gt;
&lt;li&gt;Story points live in &lt;code&gt;customfield_10058&lt;/code&gt;, not the obvious-looking field. The wrong field silently saves to the wrong place — you'd never know until someone checks the sprint board.&lt;/li&gt;
&lt;li&gt;QA testing note templates, project constants, sprint IDs, assignee account IDs — all encoded in one place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One skill file eliminated an entire class of silent failures.&lt;/p&gt;
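&lt;p&gt;For readers who haven't written one: a skill is essentially a markdown file with frontmatter that gets loaded on demand. A stripped-down sketch of what a skill like this could look like follows; the structure is typical, but the frontmatter fields and template wording are illustrative rather than copied from the author's file:&lt;/p&gt;

```markdown
---
name: my-jira
description: Create and update Jira tickets with our project's API quirks applied
---

## Rules
- After every createJiraIssue call, follow up with editJiraIssue to re-apply
  description formatting (the create call renders newlines as literal \n text).
- Write story points to customfield_10058, never to the default-looking field;
  the wrong field saves silently and only shows up later on the sprint board.

## Constants
- QA testing-note template: environment, credentials, test steps, caveats
- Keep sprint IDs and assignee account IDs here so they never need re-lookup
```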




&lt;h2&gt;
  
  
  Building Block 2: Execution — The Development Loop
&lt;/h2&gt;

&lt;p&gt;With the plan in place and tickets created, execution begins. Each ticket follows the same cycle: design doc, plan doc, execute, review, iterate, merge, move to QA. The tools enter the story as the workflow demands them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zm5g17lasdpj2blujq5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zm5g17lasdpj2blujq5.png" alt="The per-ticket execution loop" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Ticket Planning
&lt;/h3&gt;

&lt;p&gt;Every ticket — no matter how small — gets two documents before any code is written:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design doc&lt;/strong&gt; (what + why): the problem being solved, the approach, the constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan doc&lt;/strong&gt; (how + steps): every file to change, every migration rule, every commit interval&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I review the plan before execution starts. This is the gate. The AI generates, I validate. I know the domain, the constraints, the edge cases that don't show up in code. This is where the orchestrator model is most visible: I'm not writing plans by hand, but I'm reading every one and catching the things that only domain knowledge reveals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Worktrees — Parallel Execution
&lt;/h3&gt;

&lt;p&gt;Each ticket executes in a dedicated git worktree. Claude Code's &lt;code&gt;/execute-plan&lt;/code&gt; skill runs the plan step-by-step in an isolated working directory.&lt;/p&gt;

&lt;p&gt;At peak, I had &lt;strong&gt;8 active worktrees across 2 repositories&lt;/strong&gt; — 4 tickets developed concurrently. The tickets depended on each other, but the worktree model let me develop them in parallel and rebase with &lt;code&gt;--onto&lt;/code&gt; as dependencies merged upstream. All four hit QA in the same sprint.&lt;/p&gt;
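&lt;p&gt;To make the worktree mechanics concrete, here is a self-contained sketch using a throwaway local repo and hypothetical ticket names. The real flow targets a remote and rebases onto origin/main, but the rebase --onto step works the same way:&lt;/p&gt;

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q main-repo
cd main-repo
git config user.email dev@example.com
git config user.name dev
echo base > app.txt
git add .
git commit -qm "base"
git branch -M main

# One worktree per ticket; ticket-102 is stacked on top of ticket-101.
git worktree add ../wt-101 -b ticket-101
(cd ../wt-101; echo one >> app.txt; git commit -qam "TICKET-101: change")
git worktree add ../wt-102 -b ticket-102 ticket-101
(cd ../wt-102; echo two > extra.txt; git add .; git commit -qm "TICKET-102: change")

# ticket-101 merges upstream (a fast-forward here, for simplicity)...
git merge -q --ff-only ticket-101

# ...then transplant only ticket-102's own commits onto the new main.
(cd ../wt-102; git rebase -q --onto main ticket-101 ticket-102)
git log --format=%s ticket-102
```

&lt;p&gt;The key detail is the three-argument form of rebase: only the commits between ticket-101 and ticket-102 are replayed, so the already-merged work is never duplicated.&lt;/p&gt;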

&lt;p&gt;The practical limit: about 4 active Claude Code sessions at a time, depending on how many contexts you can keep in your head. You're reviewing output from multiple streams, making judgment calls, and keeping the overall architecture coherent. It's project management, not coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bfrwix1mw81qve1t0z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bfrwix1mw81qve1t0z0.png" alt="8 worktrees, 2 repos, 4 tickets shipping in parallel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  GitLab CLI + The &lt;code&gt;manage-mr&lt;/code&gt; Skill
&lt;/h3&gt;

&lt;p&gt;Merge requests aren't just &lt;code&gt;git push&lt;/code&gt;. There's the strategy description, the pipeline to monitor, quality gates to check, and the fix-push-recheck cycle when something fails.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;manage-mr&lt;/code&gt; skill&lt;/strong&gt; wraps the full lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the MR with a description derived from the plan doc&lt;/li&gt;
&lt;li&gt;Monitor the CI pipeline with &lt;code&gt;/loop&lt;/code&gt; on a recurring interval&lt;/li&gt;
&lt;li&gt;Check SonarQube quality gates&lt;/li&gt;
&lt;li&gt;If anything fails: read the failure, trace to source, fix, re-push&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;/loop&lt;/code&gt; skill&lt;/strong&gt; deserves its own mention. It runs a command on a configurable interval — I used it to poll CI pipelines. Pipeline fails? Claude reads the build log, traces the error to the source file, applies a fix, pushes, and the loop continues. No browser, no manual checking.&lt;/p&gt;

&lt;p&gt;One recurring failure pattern worth mentioning: AI would sometimes aggressively remove "unused" state variables without checking the callbacks that referenced them, breaking CI. It also missed secondary integration tests that asserted on the removed behavior. The fix was straightforward each time, but the pattern recurred enough that I added a memory rule: "verify all references before removing anything." The pipeline loop caught these quickly, but they shouldn't have happened in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  SonarQube MCP
&lt;/h3&gt;

&lt;p&gt;Connected via MCP server, Claude can query pull request issues, check quality gates, and fix vulnerabilities directly from the terminal. The migration shipped with &lt;strong&gt;95%+ API coverage, 91% frontend coverage, zero bugs, zero vulnerabilities, and zero security hotspots&lt;/strong&gt; at the time of QA handoff.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chrome DevTools MCP
&lt;/h3&gt;

&lt;p&gt;For frontend work, Claude needs to see the actual rendered application — not just the code. The Chrome DevTools MCP connects Claude to a live browser session. I log in, navigate to the page, and Claude inspects the live DOM, console errors, and network requests.&lt;/p&gt;

&lt;p&gt;This is a game-changer for UI work. It finds CSS/layout bugs, missing state updates, and rendering issues that code-level analysis and screenshots could never surface. Claude can see what the user sees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Figma MCP
&lt;/h3&gt;

&lt;p&gt;During frontend porting, Claude references Figma designs directly via MCP. No screenshotting, no describing layouts in words. It reads the design context — component structure, spacing, colors, typography — and translates to code. This kept the ported UI faithful to the design without the constant back-and-forth of "does this match the mockup?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Postman MCP
&lt;/h3&gt;

&lt;p&gt;The API collection stays in sync with endpoint changes. Test scripts auto-chain with dynamic JWT extraction. This matters because QA depends on Postman to validate the API — if the collection is stale, they're blocked. The MCP integration ensures the collection reflects the latest API state at all times.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;move-to-qa&lt;/code&gt; Skill
&lt;/h3&gt;

&lt;p&gt;When a ticket is ready for QA, there's a ritual: add structured testing notes (environment, credentials, test steps, caveats), transition the ticket to the QA column, notify the QA channel. Getting any step wrong means the QA engineer wastes time asking clarifying questions.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;move-to-qa&lt;/code&gt; skill&lt;/strong&gt; encodes the entire handoff as a single invocation. One command handles the comment (in the exact template QA expects), the Jira transition, and the notification. Consistent handoff, every time, no steps skipped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keeping Everything in Sync
&lt;/h3&gt;

&lt;p&gt;Here's where the orchestration model really pays off. When implementation changes a design decision or uncovers a requirement gap, Claude updates Confluence and Jira in the same session — with the caveat that large page edits go through comments, not full page updates (see the Confluence truncation lesson above). The old friction — finish code, open browser, update Jira, update Confluence — is the kind of manual chore that developers frequently skip. Now it's one prompt: "Update the design doc to reflect that we're using a materialized view instead of a join, and comment the change on the Jira ticket."&lt;/p&gt;

&lt;p&gt;Documentation stays in sync because it's part of the workflow, not an afterthought.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uwf130ruubiryvq07co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uwf130ruubiryvq07co.png" alt="The MCP ecosystem — every system connected to one terminal" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Building Block 3: Review &amp;amp; Iteration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Review Plugins
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;code-review&lt;/strong&gt; and &lt;strong&gt;pr-review-toolkit&lt;/strong&gt; plugins run multi-agent PR reviews with confidence-based scoring. They're effective as a first-pass filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caught syntax and formatting bugs&lt;/li&gt;
&lt;li&gt;Off-by-one errors in date filters&lt;/li&gt;
&lt;li&gt;Missing transactional annotations&lt;/li&gt;
&lt;li&gt;Raw data leakage in error responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI caught roughly &lt;strong&gt;30-40% of issues&lt;/strong&gt; — the low-to-mid-level stuff that humans miss under time pressure.&lt;/p&gt;

&lt;p&gt;But the high-level stuff still needed a human reviewer: permission architecture that needed refactoring, structural design decisions for domain enums, Java stream filtering optimizations, missing API documentation annotations. These aren't bugs — they're design-level improvements that require understanding the system's intent, not just its syntax. AI review is a floor-raiser, not a ceiling-raiser. It catches what slips through, but it doesn't replace architectural judgment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y4d2erkashnrnigrans.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y4d2erkashnrnigrans.png" alt="AI vs Human review — complementary, not competing" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Responding to Review Comments
&lt;/h3&gt;

&lt;p&gt;When human reviewers leave comments on merge requests, Claude reads and responds via the GitLab integration. Implement the requested changes, push, and the review loop continues — all from the terminal. The reviewer doesn't know or care that the fixes were AI-assisted. They just see responsive, well-reasoned changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost
&lt;/h2&gt;

&lt;p&gt;Honest accounting matters. Over the course of the migration — roughly 15 work days from planning through QA handoff — I spent close to &lt;strong&gt;$5,000&lt;/strong&gt; in API token costs running Claude Code through AWS Bedrock.&lt;/p&gt;

&lt;p&gt;For a migration of this scope — four repos, database migration, full-stack framework upgrades, auth model replacement, ~50 tickets, ~580 tests — that's a fraction of what the engineering time alone would cost with a traditional team across multiple sprints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons learned on cost management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clear your context frequently.&lt;/strong&gt; This is the single biggest cost lever. Claude Code caches your conversation context and re-reads it on every turn. Long marathon sessions accumulate enormous cache read and cache write charges that dwarf the actual input/output token costs. Use &lt;code&gt;/compact&lt;/code&gt; aggressively, and prefer multiple shorter focused sessions over all-day marathons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the right model for the job.&lt;/strong&gt; Opus for planning, architecture, and complex reasoning. Sonnet for routine execution, test generation, and boilerplate. Haiku for quick lookups. The cost difference between models is significant and most execution work doesn't need the most powerful model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cost is front-loaded.&lt;/strong&gt; The initial setup — CLAUDE.md files, skills, MCPs, memory rules, the design phase — was the most token-intensive period. Once configured, subsequent tickets were dramatically cheaper because the context was already built and the workflows were encoded.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;By the end of the POC, the migration had produced roughly 50 tickets, 580 passing tests, 95%+ API coverage, and zero bugs, vulnerabilities, or security hotspots at QA handoff. One engineer, half a month, one tool — configured deliberately for every phase of the work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The throughput increase didn't come from AI writing better code. It came from iterating faster, verifying more thoroughly, and managing parallel execution streams. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The building blocks made this possible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff92aw1qf96b2uw0k640m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff92aw1qf96b2uw0k640m.png" alt="The building blocks — planning, execution, review" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planning&lt;/strong&gt;: CLAUDE.md gave the AI context. Memory gave it institutional knowledge. The brainstorming skill gave it structure. The Atlassian MCP and custom Jira skill turned designs into documentation and actionable, well-scoped tickets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: worktrees enabled parallel development. Skills encoded repeatable workflows. MCPs connected Claude Code to every system in the SDLC — version control, CI/CD, code quality, design tools, project management, documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review&lt;/strong&gt;: plugins raised the floor on code review quality. Human reviewers caught the architectural and design-level issues that AI can't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer depends on the one below it. Skip the foundation — the CLAUDE.md files, the memory, the skills — and the execution layer produces mediocre results. That's what most people experience. They skip straight to "write me some code" without investing in the setup, get underwhelming output, and conclude the tool isn't useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup is the strategy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're starting from zero, here's my recommendation: write your CLAUDE.md first. Just the basics — tech stack, project structure, build commands, conventions. Then add one MCP integration for the system you context-switch to most often (probably your issue tracker). Then build one custom skill for your most repeated workflow. Build up from there. Each layer makes the next one more effective.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Control Your Tesla from the Terminal with a Kiro CLI Skill</title>
      <dc:creator>Jérôme GUYON</dc:creator>
      <pubDate>Thu, 16 Apr 2026 08:49:16 +0000</pubDate>
      <link>https://dev.to/aws-builders/control-your-tesla-from-the-terminal-with-a-kiro-cli-skill-472g</link>
      <guid>https://dev.to/aws-builders/control-your-tesla-from-the-terminal-with-a-kiro-cli-skill-472g</guid>
      <description>&lt;p&gt;🔗 &lt;a href="https://github.com/guyon-it-consulting/myteslamate-skills-and-power" rel="noopener noreferrer"&gt;https://github.com/guyon-it-consulting/myteslamate-skills-and-power&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpba05j5jp5o47rim0kbg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpba05j5jp5o47rim0kbg.gif" alt="Demo" width="708" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last Tuesday, I was deep into a CDK refactor — the kind where I have 14 files open and I'm scared to blink. Then a thought hit me: did I turn on Sentry mode? My car was parked at the train station. I could grab my phone, open the Tesla app, wait for it to wake the car, scroll to Security, check the toggle… or I could just not break my flow.&lt;/p&gt;

&lt;p&gt;What if I could ask my coding assistant instead?&lt;/p&gt;

&lt;p&gt;Turns out, I can. I built a Kiro CLI skill that lets me control my Tesla straight from the terminal. &lt;strong&gt;It was also the perfect excuse to learn how to create a Kiro CLI Skill 😊&lt;/strong&gt; — and what better way to test a new feature than with something fun? Check the battery, lock the doors, toggle Sentry mode, pull up charging stats — all without leaving my editor. The secret ingredient? An MCP server that wraps the Tesla and TeslaMate APIs so that any AI assistant can call them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MyTeslaMate?
&lt;/h2&gt;

&lt;p&gt;Before we get to the skill itself, let me introduce the engine behind it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://myteslamate.com" rel="noopener noreferrer"&gt;MyTeslaMate&lt;/a&gt; is the hosted version of &lt;a href="https://github.com/teslamate-org/teslamate" rel="noopener noreferrer"&gt;TeslaMate&lt;/a&gt;, the most popular open-source data logger for Tesla vehicles. If you own a Tesla and you haven't heard of TeslaMate, stop reading and go look at it. It continuously records every drive, charge session, sleep cycle, and software update into a PostgreSQL database, and exposes rich Grafana dashboards — battery degradation, charging curves, trip history, lifetime stats, vampire drain, efficiency trends, and more.&lt;/p&gt;

&lt;p&gt;MyTeslaMate takes all of that and hosts it for you. No Docker, no self-hosting, no database maintenance. It adds premium features like supercharger cost import, automations, fleet management, and — this is the part we care about — an MCP server.&lt;/p&gt;

&lt;p&gt;The MCP server at &lt;code&gt;https://mcp.myteslamate.com/mcp&lt;/code&gt; wraps two APIs into a single endpoint:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tesla Fleet API&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;td&gt;Vehicle commands, energy control, charging, navigation, security&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TeslaMate API&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Drive stats, charging analytics, efficiency data, trip history&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's &lt;strong&gt;100+ tools&lt;/strong&gt; accessible from any MCP-compatible AI assistant. Authentication is handled via OAuth SSO with Tesla's authorization server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwhc5uxby03ify43j94z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwhc5uxby03ify43j94z.png" alt="MyTeslaMate dashboard" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Kiro CLI Skill?
&lt;/h2&gt;

&lt;p&gt;A skill is how you teach Kiro CLI about a specific domain. It's a markdown file with YAML frontmatter that describes what the skill does and when to activate it. Think of it as a cheat sheet that Kiro loads on demand — it doesn't bloat your context until you actually need it.&lt;/p&gt;

&lt;p&gt;A skill has two parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The frontmatter&lt;/strong&gt; — metadata that tells Kiro when to load the skill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tesla-commands&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Control your Tesla vehicle and energy products via MyTeslaMate MCP server.&lt;/span&gt;
  &lt;span class="s"&gt;Use when the user asks about their car, vehicle status, lock/unlock, climate control, charging, Powerwall, solar production, energy optimization, Sentry mode, trip planning, drive statistics, or any Tesla-related query. Triggers on "tesla", "my car", "vehicle",&lt;/span&gt;
  &lt;span class="s"&gt;"charge", "battery", "climate", "powerwall", "solar", "sentry", "lock", "unlock",&lt;/span&gt;
  &lt;span class="s"&gt;"supercharger", "road trip", "energy", "charging history", "drive stats".&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The body&lt;/strong&gt; — capabilities, workflow instructions, and safety rules that guide the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Tesla Commands&lt;/span&gt;

The &lt;span class="sb"&gt;`tesla-mcp`&lt;/span&gt; server exposes &lt;span class="gs"&gt;**100+ tools**&lt;/span&gt;.

&lt;span class="gu"&gt;## Capabilities&lt;/span&gt;
&lt;span class="gu"&gt;### Vehicle Control (64 commands)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Doors &amp;amp; Access: lock, unlock, open/close trunk, open frunk
&lt;span class="p"&gt;-&lt;/span&gt; Climate: start/stop HVAC, set temps, seat heaters, steering wheel heater
&lt;span class="p"&gt;-&lt;/span&gt; Charging: start/stop charge, set charge limit, schedule charging
...

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Use @tesla-mcp tools directly for all Tesla operations.
&lt;span class="p"&gt;2.&lt;/span&gt; Check current state before making changes.
&lt;span class="p"&gt;3.&lt;/span&gt; Wake the vehicle before sending action commands if the car is asleep.

&lt;span class="gu"&gt;## Safety&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Confirm with the user before executing security-sensitive commands.
&lt;span class="p"&gt;-&lt;/span&gt; Always show current state before making changes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill lives in &lt;code&gt;~/.kiro/skills/tesla-commands/SKILL.md&lt;/code&gt;. But a skill alone isn't enough — you also need an &lt;strong&gt;agent&lt;/strong&gt; that knows how to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tesla agent configuration
&lt;/h2&gt;

&lt;p&gt;I created a dedicated agent, a JSON file that ties everything together: the skill, the MCP server, and the tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tesla"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tesla vehicle and energy control agent via MyTeslaMate MCP server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You help the user monitor and control their Tesla vehicle and energy products. Use the tesla-mcp tools for all operations. Present data in a human-readable format. Confirm before executing security-sensitive commands."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"grep"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"glob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@tesla-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowedTools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@tesla-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"skill://.kiro/skills/tesla-commands/SKILL.md"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tesla-mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.myteslamate.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"welcomeMessage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tesla control ready. What would you like to do with your car?"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;resources&lt;/code&gt; uses the &lt;code&gt;skill://&lt;/code&gt; URI scheme. This tells Kiro to load the skill's metadata at startup but defer loading the full content until it's actually needed. No wasted context. (See &lt;a href="https://kiro.dev/docs/cli/custom-agents/configuration-reference" rel="noopener noreferrer"&gt;Agent Configuration Reference&lt;/a&gt; in the Kiro docs: &lt;em&gt;"skill:// — Skills progressively loaded on demand"&lt;/em&gt;.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcpServers&lt;/code&gt; points to the remote MyTeslaMate MCP server; in this case, no local MCP server is needed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;allowedTools&lt;/code&gt; auto-approves &lt;code&gt;read&lt;/code&gt; and all &lt;code&gt;@tesla-mcp&lt;/code&gt; tools so you don't get prompted for every single API call.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to set it up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; You'll need a Tesla vehicle (obviously) and a &lt;a href="https://myteslamate.com" rel="noopener noreferrer"&gt;MyTeslaMate&lt;/a&gt; account. Sign up, link your Tesla account, and pick a subscription plan. This gives you access to the MCP server and the TeslaMate analytics dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Clone the repo and copy the files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/guyon-it-consulting/myteslamate-skills-and-power.git
&lt;span class="nb"&gt;cd &lt;/span&gt;myteslamate-skills-and-power

&lt;span class="c"&gt;# Copy the skill&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; kiro/skills/tesla-commands ~/.kiro/skills/

&lt;span class="c"&gt;# Copy the agent&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;kiro/agents/tesla.json ~/.kiro/agents/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
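&lt;p&gt;As an alternative to cloning the repo, the skill file can also be scaffolded by hand. Here's a minimal sketch, using a trimmed version of the frontmatter shown earlier (the repo's copy has the full description and body; &lt;code&gt;SKILL_DIR&lt;/code&gt; is just a convenience variable):&lt;/p&gt;

```shell
# Scaffold a minimal SKILL.md by hand. This is a trimmed sketch;
# the full skill from the repo carries the richer description,
# trigger words, and workflow/safety sections.
SKILL_DIR="${SKILL_DIR:-$HOME/.kiro/skills/tesla-commands}"
mkdir -p "$SKILL_DIR"
printf '%s\n' \
  '---' \
  'name: tesla-commands' \
  'description: Control your Tesla via the MyTeslaMate MCP server.' \
  '---' \
  '# Tesla Commands' | tee "$SKILL_DIR/SKILL.md"
```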



&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Start a new Kiro CLI session and switch to the Tesla agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kiro-cli
&lt;span class="go"&gt;
/agent swap tesla

✔ Switched to agent: tesla
Tesla control ready. What would you like to do with your car?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Authenticate with Tesla.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;/mcp&lt;/code&gt;. The &lt;code&gt;tesla-mcp&lt;/code&gt; server will show &lt;code&gt;⚠ auth-required&lt;/code&gt;. Follow the instructions to copy the OAuth URL, open it in your browser, and sign in with your Tesla account. This is a one-time step — the token persists across sessions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiobcml9paboucvq4yr6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiobcml9paboucvq4yr6m.png" alt="OAuth flow" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using it
&lt;/h2&gt;

&lt;p&gt;Now the fun part. Just talk naturally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Is Sentry mode on?

Nope — Sentry Mode is currently off on your Model Y. 
It is available though, so I can turn it on for you if you'd like.
Want me to enable it?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; What's my battery at?

Fresh data from TeslaMate:
- Battery level: 80% (hit your charge limit!)
- Estimated range: ~411 km
- Charging state: Complete — plugged in but done charging
- Energy added this session: 34.1 kWh

Looks like it finished charging since we last checked (was at 77% earlier).
You're topped up to your 80% limit.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Lock my car

Your car is already locked according to the latest status.
No action needed!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; How much did I spend on charging this month?

Here's your April charging summary (so far, through April 10):

┌──────────────────┬──────────┬───────────────┬────────────┐
│ Type             │ Sessions │ Energy        │ Cost       │
├──────────────────┼──────────┼───────────────┼────────────┤
│ Supercharger     │ 2        │ 105.0 kWh     │ €37.88     │
│ Home/Destination │ 5        │ 60.98 kWh     │ €0*        │
│ **Total**        │ **7**    │ **166.0 kWh** │ **€37.88** │
└──────────────────┴──────────┴───────────────┴────────────┘

No Data, estimating home/destination charging at ~€0.25/kWh, that's roughly €15 extra,
bringing the estimated total to around €53
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; What's my battery health status ?

Your battery is in great shape:
- Battery health: 98.6%
- Current capacity: 75.7 kWh (out of 76.8 kWh max)
- Current max range: 506 km (vs 512 km when new)
- Degradation: only ~1.4% 

Very healthy for a car with 435 drives logged.
Nothing to worry about there.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent checks the current state before making changes, asks for confirmation on security-sensitive commands, and presents data with proper units. All of that behavior comes from the SKILL.md instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to know
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;100+ tools, one endpoint.&lt;/strong&gt; The MyTeslaMate MCP server covers vehicle control, vehicle data, energy/Powerwall/solar, charging history, and TeslaMate analytics. You don't need to know which API to call — the agent figures it out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wake before you command.&lt;/strong&gt; Tesla vehicles go to sleep to save battery. The skill instructs the agent to wake the car before sending action commands. You'll see a brief delay the first time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safety checks.&lt;/strong&gt; The skill explicitly tells the agent to confirm before executing unlock, disable Sentry, remote start, or erase data. You won't accidentally unlock your car because of a typo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OAuth, not API keys.&lt;/strong&gt; Authentication goes through Tesla's OAuth flow via MyTeslaMate. The token is scoped and can be revoked from your Tesla account at any time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
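&lt;p&gt;The wake-then-confirm behavior those rules describe can be sketched as plain shell logic. This is purely illustrative: the functions below are local stubs standing in for the real &lt;code&gt;@tesla-mcp&lt;/code&gt; tool calls, and nothing here touches an actual vehicle.&lt;/p&gt;

```shell
# Hypothetical sketch of the skill's workflow rules; stubs only.
SENSITIVE="unlock disable_sentry remote_start erase_data"
VEHICLE_STATE="asleep"

vehicle_asleep() { [ "$VEHICLE_STATE" = "asleep" ]; }
wake_vehicle()   { VEHICLE_STATE="online"; }  # stub for the real wake tool

run_command() {
  cmd="$1"; confirmed="${2:-no}"
  # Rule: wake the vehicle before sending action commands.
  if vehicle_asleep; then wake_vehicle; fi
  # Rule: security-sensitive commands need explicit confirmation.
  case " $SENSITIVE " in
    *" $cmd "*)
      if [ "$confirmed" != "yes" ]; then
        echo "needs confirmation: $cmd"
        return 1
      fi ;;
  esac
  echo "executed: $cmd"
}

run_command lock              # prints: executed: lock
run_command unlock || true    # prints: needs confirmation: unlock
run_command unlock yes        # prints: executed: unlock
```

&lt;p&gt;In the real setup this gating lives in SKILL.md as natural-language instructions rather than code, which is what makes the skill portable across tools.&lt;/p&gt;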

&lt;p&gt;— Jérôme&lt;/p&gt;

</description>
      <category>kiro</category>
      <category>aws</category>
      <category>tesla</category>
      <category>myteslamate</category>
    </item>
  </channel>
</rss>
