<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community</title>
    <description>The most recent home feed on DEV Community.</description>
    <link>https://dev.to</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed"/>
    <language>en</language>
    <item>
      <title>How Solid Queue Became the Rails 8 default, and More on Open Source Maintainership</title>
      <dc:creator>Carla Urrea Stabile</dc:creator>
      <pubDate>Tue, 23 Jun 2026 22:00:00 +0000</pubDate>
      <link>https://dev.to/auth0/how-solid-queue-became-the-rails-8-default-and-more-on-open-source-maintainership-2859</link>
      <guid>https://dev.to/auth0/how-solid-queue-became-the-rails-8-default-and-more-on-open-source-maintainership-2859</guid>
      <description>&lt;p&gt;Seven gems to run background jobs. That's what 37signals was running before they said "this can't be right."&lt;/p&gt;

&lt;p&gt;In Episode 6 of Making Software, I talked to &lt;a href="https://www.linkedin.com/in/rosagutierrezescudero/" rel="noopener noreferrer"&gt;&lt;strong&gt;Rosa Gutiérrez&lt;/strong&gt;&lt;/a&gt;, Principal Programmer at 37signals and board member of the Rails Foundation. She built &lt;a href="https://github.com/rails/solid_queue" rel="noopener noreferrer"&gt;Solid Queue&lt;/a&gt;, the database-backed job queue that ships with Rails 8 by default, and we got into how it was built, what open source maintainership actually looks like, and why Ruby is having a moment again.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Seven gems, one problem.&lt;/strong&gt; 37signals was running seven separate gems just to cover the edge cases &lt;a href="https://github.com/resque/resque" rel="noopener noreferrer"&gt;Resque&lt;/a&gt; didn't handle natively. That became the brief for &lt;a href="https://github.com/rails/solid_queue" rel="noopener noreferrer"&gt;Solid Queue&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disks got fast. Redis got optional.&lt;/strong&gt; &lt;a href="https://github.com/rails/solid_queue" rel="noopener noreferrer"&gt;Solid Queue&lt;/a&gt; is built on the same insight as &lt;a href="https://github.com/rails/solid_cache" rel="noopener noreferrer"&gt;Solid Cache&lt;/a&gt;: modern database storage is cheap and fast enough that you don't need a separate in-memory service for a lot of production workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What happens when millions of developers start using your gem.&lt;/strong&gt; Going from internal 37signals tooling to bundled-in-Rails means your edge cases multiply overnight. Rosa walks through what that transition felt like, the good contributions, the bad ones, and the more recent arrival of agent-written PRs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Readable code is a design choice.&lt;/strong&gt; Rosa spent a lot of time debugging Resque's internals and adjacent gems that had no naming conventions in common. She decided &lt;a href="https://github.com/rails/solid_queue" rel="noopener noreferrer"&gt;Solid Queue&lt;/a&gt; would never make someone feel that way. She also points out she wrote it pre-AI, when you couldn't just ask Claude to explain it to you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ruby is well-suited for AI agents, and the community knows it.&lt;/strong&gt; Convention-over-configuration turns out to be great for agents too. Rosa explains why the Ruby ecosystem is seeing a quiet comeback, and what the &lt;a href="https://rubytriathlon.com/" rel="noopener noreferrer"&gt;Ruby Triathlon&lt;/a&gt; says about the community holding it together.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Things that stuck with me
&lt;/h2&gt;

&lt;p&gt;Seven gems to manage background jobs. The team looked at what they were running and said "this can't be right." That became the brief for &lt;a href="https://github.com/rails/solid_queue" rel="noopener noreferrer"&gt;Solid Queue&lt;/a&gt;. Rosa got picked for the project, built it in production at &lt;a href="https://hey.com" rel="noopener noreferrer"&gt;Hey&lt;/a&gt; first, iterated on it for months, and shipped it into Rails 8. She keeps calling it luck. I don't think it's luck.&lt;/p&gt;

&lt;p&gt;She mentioned almost as a side note that it's mostly agents opening issues and PRs now. "Agents are polite, so that's nice." She'd just been describing contributors who didn't read the README, opened sparse issues, were occasionally rude. That contrast was funny, and also a bit telling.&lt;/p&gt;

&lt;p&gt;As Rosa said it, being nice is free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's a gem or library you've used where reading the source code took forever? What made it hard?&lt;/strong&gt; Let me know in the comments!&lt;/p&gt;

&lt;h2&gt;
  
  
  Listen to the full episode
&lt;/h2&gt;

&lt;p&gt;Available on &lt;a href="https://www.youtube.com/playlist?list=PLZ14qQz3cfJKRDmX3yasmbwoC4kipeQfu" rel="noopener noreferrer"&gt;&lt;strong&gt;YouTube&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://podcasts.apple.com/us/podcast/making-software/id1872107131" rel="noopener noreferrer"&gt;&lt;strong&gt;Apple Podcasts&lt;/strong&gt;&lt;/a&gt;, and &lt;a href="https://open.spotify.com/show/6J856S2fijMvP3rzFkRnBi" rel="noopener noreferrer"&gt;&lt;strong&gt;Spotify&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading! 👋&lt;/p&gt;

</description>
      <category>rails</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>TPU Developer Hub: A Technical Review of a High-Performance AI Platform</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:52:06 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/tpu-developer-hub-a-technical-review-of-a-high-performance-ai-platform-2b6k</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/tpu-developer-hub-a-technical-review-of-a-high-performance-ai-platform-2b6k</guid>
      <description>&lt;p&gt;When Google launched the TPU Developer Hub, the technical signal was clear: the company wants to reduce friction between ML practitioners and specialized acceleration hardware. As an architect who spends a significant portion of time designing inference and training pipelines for financial systems — where every millisecond of latency and every dollar of compute cost must be justified — I read that announcement with productive skepticism. TPUs are not new; what changes is the developer experience layer and the proposition of making this hardware accessible beyond Google's own research labs. In this article, I analyze what the TPU Developer Hub actually delivers, where it differentiates from alternatives like GPU instances on AWS, where it imposes hard trade-offs, and how I would structure an adoption decision in a regulated financial environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers that define the context
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~4.6x&lt;/strong&gt; — TPU v5e throughput gain vs. A100 in LLM training (public JAX/MaxText benchmarks). For dense models above 7B parameters in bfloat16; results vary with network topology and batch size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$2.20/h&lt;/strong&gt; — Cost per TPU v5e chip on-demand (us-central1, 1 chip). Compared to ~$3.06/h per A100 GPU on equivalent p4d.xlarge on AWS us-east-1; cost parity depends heavily on utilization efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;256 chips&lt;/strong&gt; — Practical minimum scale of a TPU v5p pod for training models &amp;gt;70B parameters. Below this threshold, inter-chip communication overhead reduces hardware efficiency to below 60% MFU (Model FLOP Utilization)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the TPU Developer Hub is and what it actually changes
&lt;/h2&gt;

&lt;p&gt;The TPU Developer Hub is not a new hardware generation — it is a reorganization of the development experience around existing TPUs. The hub centralizes documentation, interactive notebooks, PyTorch/JAX migration guides, fine-tuning examples with models like Gemma and PaLM 2, and access to pre-configured development environments. The stated goal is to reduce the time from "I have a model" to "I am training efficiently on TPU" from weeks to hours.&lt;/p&gt;

&lt;p&gt;From an architectural standpoint, what interests me most is the abstraction layer the hub proposes. Historically, working with TPUs required deep mastery of XLA (Accelerated Linear Algebra), the compiler that transforms high-level operations into hardware-optimized instructions. This created a significant entry barrier — teams accustomed to CUDA and PyTorch needed to relearn static compilation paradigms, static tensor shapes, and explicit sharding strategies.&lt;/p&gt;

&lt;p&gt;The hub attempts to address this with three layers: (1) &lt;strong&gt;MaxText&lt;/strong&gt; and &lt;strong&gt;MaxDiffusion&lt;/strong&gt; as high-performance reference implementations already optimized for TPU; (2) &lt;strong&gt;Pathways&lt;/strong&gt; as a distributed runtime that abstracts the physical pod topology; and (3) native integration with &lt;strong&gt;Vertex AI&lt;/strong&gt; for job orchestration. For teams already operating in the Google Cloud ecosystem, this vertical integration is genuinely valuable. For teams with hybrid or multi-cloud workloads — which is the reality of most financial environments I know — the story is more complicated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where TPUs shine: the use case that justifies the complexity
&lt;/h2&gt;

&lt;p&gt;There is a workload profile where TPUs deliver clear and measurable competitive advantage: &lt;strong&gt;training large-scale dense models with large batches and static tensor shapes&lt;/strong&gt;. Language models above 7B parameters, diffusion models for financial image generation (documents, reports), and embedding models trained on proprietary financial data corpora — all of these fit well within the TPU efficiency profile.&lt;/p&gt;

&lt;p&gt;The technical reason is the systolic array architecture of TPUs: they are optimized for matrix multiplication operations in bfloat16, which is exactly what dominates the forward and backward pass of transformers. The XLA compiler, when fed with static shapes, can plan the entire execution of a training step as a single compiled program, eliminating dispatch overhead and maximizing hardware utilization. In public benchmarks from the MaxText project, TPU v5e achieves &lt;strong&gt;Model FLOP Utilization (MFU) above 55-60%&lt;/strong&gt; on models like LLaMA-2 70B — a number that A100 GPUs rarely exceed 45-50% in comparable configurations.&lt;/p&gt;

&lt;p&gt;For a bank or fintech that is continuously pre-training or fine-tuning fraud detection, credit scoring, or financial news sentiment analysis models, this efficiency gain translates directly into lower training costs and faster experimentation cycles. A fine-tuning cycle that takes 18 hours on 8x A100s can drop to 6-8 hours on an equivalent TPU v5e slice — and the cost-per-hour difference favors TPUs when utilization is high and consistent.&lt;/p&gt;

&lt;h2&gt;
  
  
  TPU Developer Hub strengths
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vertical integration with Vertex AI&lt;/strong&gt;: training jobs, ML pipelines, and model registry in a single control surface, reducing operational overhead for Google Cloud-native teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MaxText as a high-performance reference&lt;/strong&gt;: JAX transformer reference implementation already optimized for TPU, with documented and reproducible MFU — eliminates weeks of manual tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pathways runtime&lt;/strong&gt;: pod topology abstraction that allows scaling from 1 chip to thousands without rewriting sharding code — critical for iterative experimentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive cost per FLOP at high utilization&lt;/strong&gt;: when the workload is appropriate (static shapes, large batches, continuous training), the cost per effective TFLOP is 20-35% lower than equivalent GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curated notebooks and migration guides&lt;/strong&gt;: real reduction of the entry barrier for PyTorch-first teams that need to migrate to JAX/XLA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Training and Inference Pipeline with TPU Developer Hub in a Hybrid Financial Environment
&lt;/h2&gt;

&lt;p&gt;Typical flow for a financial ML team using TPUs for training and AWS for inference and data governance — a multi-cloud pattern that maximizes cost efficiency without compromising compliance&lt;/p&gt;

&lt;h3&gt;
  
  
  📦 Data Layer — AWS S3 + Glue
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Raw financial data (storage)&lt;/li&gt;
&lt;li&gt;AWS Glue ETL + schema validation (compute)&lt;/li&gt;
&lt;li&gt;S3 Curated bfloat16 tensors (storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔵 Google Cloud — TPU Training
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GCS Bucket mirrored training data (storage)&lt;/li&gt;
&lt;li&gt;Vertex AI Training Job (ai)&lt;/li&gt;
&lt;li&gt;TPU v5e Pod MaxText / JAX (compute)&lt;/li&gt;
&lt;li&gt;Vertex Model Registry (ai)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟧 AWS — Inference + Governance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Bedrock custom model import (ai)&lt;/li&gt;
&lt;li&gt;Lambda inference wrapper (compute)&lt;/li&gt;
&lt;li&gt;API Gateway WAF + throttling (security)&lt;/li&gt;
&lt;li&gt;CloudWatch SLO dashboards (compute)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔐 Security &amp;amp; Compliance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS KMS CMK encryption (security)&lt;/li&gt;
&lt;li&gt;IAM Permission Boundary (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;s3-raw -&amp;gt; glue-etl: ingestion&lt;/li&gt;
&lt;li&gt;glue-etl -&amp;gt; s3-curated: validated data&lt;/li&gt;
&lt;li&gt;s3-curated -&amp;gt; gcs-mirror: cross-cloud replication&lt;/li&gt;
&lt;li&gt;gcs-mirror -&amp;gt; vertex-job: loads tensors&lt;/li&gt;
&lt;li&gt;vertex-job -&amp;gt; tpu-v5e: dispatches job&lt;/li&gt;
&lt;li&gt;tpu-v5e -&amp;gt; model-registry: saves checkpoint&lt;/li&gt;
&lt;li&gt;model-registry -&amp;gt; bedrock: exports GGUF/ONNX model&lt;/li&gt;
&lt;li&gt;bedrock -&amp;gt; lambda-infer: invokes inference&lt;/li&gt;
&lt;li&gt;lambda-infer -&amp;gt; apigw: response&lt;/li&gt;
&lt;li&gt;apigw -&amp;gt; cloudwatch: SLO metrics&lt;/li&gt;
&lt;li&gt;kms -&amp;gt; s3-curated: CMK at-rest&lt;/li&gt;
&lt;li&gt;iam-boundary -&amp;gt; vertex-job: access control&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where it hurts: the real frictions the hub does not resolve
&lt;/h2&gt;

&lt;p&gt;The TPU Developer Hub improves the development experience, but it does not solve the structural problems that make TPUs difficult in regulated financial environments. I will be direct about each of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ecosystem lock-in&lt;/strong&gt;: JAX is the first-class citizen in the TPU world. PyTorch/XLA exists, but it is a second-class citizen — dynamic operations, conditional control flow, and variable shapes frequently force XLA recompilations that destroy the performance gain. In financial environments where ML models are frequently developed by data science teams with a PyTorch background, migrating to JAX is not trivial. I am not talking about syntax — I am talking about rethinking how you write training loops, how you do debugging (no eager mode by default), and how you integrate with third-party libraries that lack JAX support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited observability outside the Google ecosystem&lt;/strong&gt;: The hub integrates well with Cloud Monitoring and Cloud Trace, but if your observability stack is OpenTelemetry + Datadog (as most financial environments I operate in), you will need to instrument manually. There is no native OTLP exporter for TPU chip utilization metrics — you depend on Cloud Monitoring and then export via Pub/Sub to your observability backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance and data residency&lt;/strong&gt;: For banks operating under LGPD, BACEN, or equivalent regulations, the question of where training data resides is critical. Replicating curated financial data to GCS in us-central1 to feed a TPU job requires legal analysis and additional technical controls — DLP, tokenization, data processing agreements. The hub does not address this.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical pitfalls before committing to TPUs in production:&lt;/strong&gt; &lt;strong&gt;Dynamic shapes are enemy number one&lt;/strong&gt;: any operation that produces tensors with shapes that vary between training steps forces an XLA recompilation. In models with variable-length attention (financial documents of different sizes), this can increase step time by 10-50x. Always use static padding or sequence bucketing before migrating to TPU. &lt;strong&gt;TPU pod preemption&lt;/strong&gt;: unlike EC2 instances with Savings Plans, TPU pods do not have availability guarantees at all sizes — especially v5p above 512 chips. Plan checkpointing every 15-30 minutes with GCS as the checkpoint backend, and use Orbax for state management. &lt;strong&gt;Cross-cloud egress cost&lt;/strong&gt;: replicating data from S3 to GCS has AWS egress cost (~$0.09/GB) plus GCS ingress cost. For training datasets above 10TB, this can add $900+ to the experiment cost before running a single step.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TPU v5e vs. AWS p4d.24xlarge (8x A100) — Trade-offs for Financial Workloads
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;TPU v5e (8 chips)&lt;/th&gt;
&lt;th&gt;AWS p4d.24xlarge (8x A100)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;On-demand cost/hour&lt;/td&gt;
&lt;td&gt;~$17.60 (8 chips × $2.20)&lt;/td&gt;
&lt;td&gt;~$32.77&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MFU on LLM 7B (bfloat16)&lt;/td&gt;
&lt;td&gt;55-62% (static shapes)&lt;/td&gt;
&lt;td&gt;42-50% (PyTorch FSDP)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native PyTorch support&lt;/td&gt;
&lt;td&gt;Partial (PyTorch/XLA, limitations on dynamic ops)&lt;/td&gt;
&lt;td&gt;Full (native CUDA, no restrictions)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration with AWS ecosystem&lt;/td&gt;
&lt;td&gt;Requires cross-cloud (S3→GCS, federated IAM)&lt;/td&gt;
&lt;td&gt;Native (SageMaker, S3, CloudWatch, KMS)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance/data residency&lt;/td&gt;
&lt;td&gt;Requires additional legal analysis for financial data&lt;/td&gt;
&lt;td&gt;Controllable via AWS regions + KMS CMK + SCPs&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency inference (&amp;lt;50ms)&lt;/td&gt;
&lt;td&gt;Not recommended (TPUs optimized for batch)&lt;/td&gt;
&lt;td&gt;Adequate with TensorRT + Triton&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The multi-cloud pattern I would use: TPU for training, AWS for serving
&lt;/h2&gt;

&lt;p&gt;After analyzing the TPU Developer Hub in the context of real financial workloads, the architectural pattern I would recommend is neither "migrate everything to TPUs" nor "ignore TPUs and stay on GPU". It is a pattern of separation of responsibilities by model lifecycle phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training and fine-tuning on TPU v5e&lt;/strong&gt;: For models above 7B parameters with static and well-structured training data (historical transactions, financial time series, regulatory document corpora), the cost-efficiency profile of TPUs is superior. The key is preparing data on the AWS side — schema validation with Glue, tokenization, static padding, serialization in TFRecord or ArrayRecord — before replicating to GCS. This keeps sensitive financial data in the AWS environment for as long as possible and reduces the volume transferred.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference on AWS Bedrock or SageMaker&lt;/strong&gt;: After training, the model is exported in ONNX format or via conversion to GGUF and imported into AWS Bedrock (custom model import) or deployed on SageMaker with &lt;code&gt;ml.g5.xlarge&lt;/code&gt; instances for low-latency inference. This keeps the serving layer within the AWS compliance perimeter, with KMS CMK for encryption of models at rest, VPC endpoints for network isolation, and CloudWatch for p99 latency SLOs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration with Step Functions&lt;/strong&gt;: The complete pipeline — data preparation, cross-cloud replication, Vertex AI job trigger, convergence monitoring, model export, Bedrock deployment — can be orchestrated with AWS Step Functions using external activities for the Google Cloud steps. This keeps the control plane in AWS, where you have audit visibility via CloudTrail and can integrate with your existing change management processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to responsibly adopt the TPU Developer Hub in financial environments
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1. Validate the workload profile before any migration&lt;/strong&gt; — Run a tensor shape profiling on your current training pipeline. If more than 20% of steps produce variable shapes or if you use data-dependent conditional control flow, the XLA recompilation cost will negate the hardware gain. Use &lt;code&gt;jax.jit&lt;/code&gt; with &lt;code&gt;static_argnums&lt;/code&gt; and &lt;code&gt;donate_argnums&lt;/code&gt; on a small data subset before committing to full migration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;2. Establish the data perimeter before replicating to GCS&lt;/strong&gt; — Classify training data with AWS Macie, apply tokenization or pseudonymization of PII/financial fields with AWS Glue + KMS, and document the Data Processing Agreement with Google Cloud before any transfer. Configure VPC Service Controls on Google Cloud to restrict access to the training GCS bucket exclusively to the Vertex AI job service account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;3. Configure checkpointing with Orbax + GCS from day 1&lt;/strong&gt; — Use &lt;code&gt;orbax.checkpoint.CheckpointManager&lt;/code&gt; with &lt;code&gt;save_interval_steps=500&lt;/code&gt; and &lt;code&gt;max_to_keep=3&lt;/code&gt;. Configure the GCS bucket with versioning and Pub/Sub notifications for checkpoint events — this feeds an AWS Lambda that updates the job status in Step Functions and enables automatic resumption in case of pod preemption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4. Instrument MFU and chip utilization in your observability backend&lt;/strong&gt; — Cloud Monitoring exposes TPU metrics via &lt;code&gt;compute.googleapis.com/tpu/container/accelerator/duty_cycle&lt;/code&gt;. Configure a Pub/Sub sink to export these metrics in real time to an AWS Lambda that publishes them to CloudWatch as custom metrics. Define an alarm if duty cycle drops below 70% for more than 5 minutes — this indicates a data pipeline problem or excessive XLA recompilation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;5. Export the model in a neutral format and validate before deploying on AWS&lt;/strong&gt; — Use &lt;code&gt;jax.export&lt;/code&gt; + ONNX conversion via &lt;code&gt;jax2tf&lt;/code&gt; + &lt;code&gt;tf2onnx&lt;/code&gt; to produce a portable model artifact. Numerically validate the equivalence between the model output on TPU and GPU using a reference input set with 1e-3 tolerance in bfloat16. Store the ONNX artifact in S3 with versioning enabled and KMS CMK, and use AWS Signer to sign the artifact before deploying on Bedrock.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What the TPU Developer Hub reveals about the industry's direction
&lt;/h2&gt;

&lt;p&gt;The launch of the TPU Developer Hub is symptomatic of a broader shift: the commoditization of ML acceleration hardware is forcing providers to compete at the developer experience layer, not just in raw FLOPS. AWS did the same with Trainium2 and Neuron SDK; Meta with PyTorch 2.0 and &lt;code&gt;torch.compile&lt;/code&gt;; NVIDIA with TensorRT-LLM and Triton Inference Server. The battle is no longer for the fastest chip — it is for which ecosystem captures the developer workflow.&lt;/p&gt;

&lt;p&gt;For solutions architects in financial environments, this has a direct implication: &lt;strong&gt;the choice of ML hardware is increasingly an ecosystem and governance decision, not a performance one&lt;/strong&gt;. If your organization already has compliance controls, data pipelines, and audit processes built around AWS, the cost of moving training workloads to Google Cloud TPU is not just the compute cost — it is the cost of replicating or federating the entire governance layer.&lt;/p&gt;

&lt;p&gt;This does not mean TPUs are the wrong choice. It means the decision needs to be made with eyes open to the total costs: data egress, cross-cloud compliance overhead, engineer requalification for JAX, and the risk of lock-in to a runtime (XLA/Pathways) that has no equivalent on other providers. The TPU Developer Hub reduces the technical friction of adoption, but it does not eliminate the structural costs. For teams with training workloads above 70B parameters and data that can be prepared and anonymized before transfer, the business case is solid. For everyone else, AWS SageMaker with Trn2 or p4d instances remains the path of least operational resistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns to avoid when adopting TPUs in financial environments
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Migrating PyTorch models without shape profiling&lt;/strong&gt;: assuming &lt;code&gt;torch_xla&lt;/code&gt; will work transparently is the most common mistake — dynamic ops cause cascading recompilations that make training slower than on CPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using TPUs for low-latency inference&lt;/strong&gt;: TPUs are optimized for batch throughput, not single-request latency — p99 inference latency on TPU can be 3-5x higher than on GPU with TensorRT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transferring non-anonymized financial data to GCS&lt;/strong&gt;: LGPD/BACEN violation with severe regulatory risk — always apply pseudonymization and tokenization before any cross-cloud transfer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring egress cost in TCO&lt;/strong&gt;: calculating only TPU vs. GPU compute cost without including AWS egress + GCS ingress can underestimate the total experiment cost by 15-25%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not planning checkpointing before starting long jobs&lt;/strong&gt;: TPU pods can be preempted without warning — 24h jobs without checkpointing every 30min result in total progress loss&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My curation note:&lt;/strong&gt; In practice, I would use TPUs exclusively for the training phase of large models where the data profile is static and well-controlled — and keep the entire serving, governance, and observability layer on AWS. The hardest lesson I learned in financial environments is that ML infrastructure decisions are rarely technical: they are governance decisions disguised as performance decisions. The TPU Developer Hub is genuinely good at reducing technical friction, but the friction it does not resolve — cross-cloud compliance, JAX ecosystem lock-in, fragmented observability — is exactly the friction that matters in regulated production. If you are evaluating TPUs, start with a fine-tuning experiment on synthetic or already-anonymized data, measure real MFU (not theoretical), and calculate the full TCO including egress before any long-term commitment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: Powerful in the right niche, costly outside it
&lt;/h2&gt;

&lt;p&gt;The TPU Developer Hub is a real advancement in the developer experience with specialized ML hardware. It genuinely reduces the entry barrier for JAX/XLA, offers high-quality reference implementations with MaxText, and the integration with Vertex AI creates a cohesive training pipeline for Google Cloud-native teams. For organizations already operating in the Google Cloud ecosystem that need to train models above 7B parameters with static data, the value proposition is clear and the ROI is measurable.&lt;/p&gt;

&lt;p&gt;However, for most regulated financial environments I know — where AWS is the primary provider, compliance is non-negotiable, and ML teams have a PyTorch background — the TPU Developer Hub solves problems you do not have while creating problems you do not want. The JAX ecosystem lock-in, the cross-cloud compliance friction, and the inadequacy for low-latency inference are structural limitations that no DX improvement will resolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My recommendation&lt;/strong&gt;: use TPUs for training large models when you have data that can be prepared and anonymized on the AWS side before transfer, and when the training cycle is long enough to amortize the setup overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rating:&lt;/strong&gt; 7.5/10&lt;/p&gt;

&lt;h2&gt;
  
  
  References and further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.googleblog.com/" rel="noopener noreferrer"&gt;Google TPU Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google/maxtext" rel="noopener noreferrer"&gt;MaxText: High Performance LLM Training on TPUs (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jax.readthedocs.io/en/latest/notebooks/thinking_in_jax.html" rel="noopener noreferrer"&gt;JAX Documentation — Static Shapes and XLA Compilation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/" rel="noopener noreferrer"&gt;AWS Trainium2 and Neuron SDK Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-gpu.html" rel="noopener noreferrer"&gt;AWS SageMaker Training — p4d and p4de Instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html" rel="noopener noreferrer"&gt;AWS Bedrock Custom Model Import&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://orbax.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Orbax: Checkpointing for JAX (Google)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc-service-controls/docs/overview" rel="noopener noreferrer"&gt;VPC Service Controls — Google Cloud&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/tpu-developer-hub-e-stack-de-ia-com-performance" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>tpu</category>
      <category>mlplatform</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>Pixels to Planning: Geospatial Data Platforms on AWS</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:51:23 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/pixels-to-planning-geospatial-data-platforms-on-aws-1kmb</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/pixels-to-planning-geospatial-data-platforms-on-aws-1kmb</guid>
      <description>&lt;p&gt;When Google Research publishes on 'pixels to planning' — turning satellite imagery into sustainable planning decisions — the real technical signal is not in the computer vision model. It is in the data platform that must exist before any pixel is processed: petabyte-scale raster ingestion, geospatial partitioning that does not break under analytical load, lineage traceability that satisfies carbon credit auditors, and inference with predictable latency for operational decisions. I have designed data pipelines for financial-grade environments where a single wrong location attribute cost millions in incorrect hedging. The discipline I learned in those environments applies directly here — and that is what I will document.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem: Geospatial Data Is Not Just Large Files
&lt;/h2&gt;

&lt;p&gt;Most multispectral satellite image files arrive as Cloud-Optimized GeoTIFF (COG) or HDF5, ranging from 500 MB to 5 GB per scene depending on resolution and band count. Sentinel-2 alone produces roughly 1.6 TB per day globally. When you start ingesting multiple constellations — Sentinel, Landsat, Planet, SAR radar data — the volume grows to tens of terabytes daily before you even compute derived indices like NDVI, NDWI, or land surface temperature.&lt;/p&gt;

&lt;p&gt;The mistake I see most often is treating this data like log files: dump everything into S3 with date-based prefixes and expect Athena to handle it. It does not. The problem is that geospatial queries have two conflicting partitioning axes: &lt;strong&gt;time&lt;/strong&gt; (when the image was captured) and &lt;strong&gt;space&lt;/strong&gt; (which tile/bounding box it covers). A typical analytical query — 'show me forest cover change in the Amazon between 2022 and 2024' — needs to cross hundreds of tiles over two years. With naive date-only partitioning, Athena scans time partitions without spatial filtering, generating S3 scan costs that can reach tens of dollars per query.&lt;/p&gt;

&lt;p&gt;The correct solution is to use a table format with geospatial predicate pushdown — Apache Iceberg with GeoParquet extension, stored on S3, with the catalog managed via AWS Glue Data Catalog. GeoParquet encodes geometries as WKB binary columns with bounding box statistics per row group, allowing Athena (via Trino) to skip row groups that do not intersect the query polygon. In internal benchmarks I have run, this reduced scanned data volume by 60–80% for typical spatial window queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Financial-Grade Geospatial Pipeline on AWS
&lt;/h2&gt;

&lt;p&gt;Full flow from satellite image ingestion to sustainable planning decisions, with lineage traceability and governance.&lt;/p&gt;

&lt;h3&gt;
  
  
  📥 Ingestão &amp;amp; Landing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Raw Zone COG / HDF5 (storage)&lt;/li&gt;
&lt;li&gt;SQS FIFO S3 Event Notifications (messaging)&lt;/li&gt;
&lt;li&gt;Lambda Validação &amp;amp; Tag (compute)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚙️ Processamento &amp;amp; Curadoria
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Glue Spark Job COG → GeoParquet/Iceberg (data)&lt;/li&gt;
&lt;li&gt;S3 Curated Zone Iceberg + GeoParquet (storage)&lt;/li&gt;
&lt;li&gt;Glue Data Catalog Iceberg Metastore (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 ML &amp;amp; Inferência
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;SageMaker Training Segmentação / NDVI (ai)&lt;/li&gt;
&lt;li&gt;SageMaker Endpoint Inferência em Tempo Real (ai)&lt;/li&gt;
&lt;li&gt;SageMaker Batch Transform em Escala (ai)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📊 Consumo &amp;amp; Governança
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Athena Query Espacial (data)&lt;/li&gt;
&lt;li&gt;Lake Formation RBAC + Column-level (security)&lt;/li&gt;
&lt;li&gt;SageMaker Lineage + OpenLineage (data)&lt;/li&gt;
&lt;li&gt;API Gateway Planning API (edge)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;sat -&amp;gt; s3raw: COG Upload&lt;/li&gt;
&lt;li&gt;s3raw -&amp;gt; sqs: S3 Event&lt;/li&gt;
&lt;li&gt;sqs -&amp;gt; lambda_ingest: Trigger&lt;/li&gt;
&lt;li&gt;lambda_ingest -&amp;gt; glue: Start Job&lt;/li&gt;
&lt;li&gt;glue -&amp;gt; s3curated: Write Iceberg&lt;/li&gt;
&lt;li&gt;glue -&amp;gt; glue_catalog: Register Schema&lt;/li&gt;
&lt;li&gt;s3curated -&amp;gt; sm_train: Dataset&lt;/li&gt;
&lt;li&gt;sm_train -&amp;gt; sm_ep: Deploy Model&lt;/li&gt;
&lt;li&gt;s3curated -&amp;gt; batch: Batch Scoring&lt;/li&gt;
&lt;li&gt;s3curated -&amp;gt; athena: Spatial Query&lt;/li&gt;
&lt;li&gt;athena -&amp;gt; lf: Access Control&lt;/li&gt;
&lt;li&gt;sm_ep -&amp;gt; apigw: Prediction&lt;/li&gt;
&lt;li&gt;batch -&amp;gt; lineage: Lineage&lt;/li&gt;
&lt;li&gt;glue -&amp;gt; lineage: Lineage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Geospatial Partitioning: The Decision That Defines Cost and Latency
&lt;/h2&gt;

&lt;p&gt;After resolving the storage format, the second critical decision is the partitioning strategy in S3 and Iceberg. For geospatial data, I use a three-level hierarchy: &lt;strong&gt;year/month&lt;/strong&gt; as the first-level partition (for temporal pruning), &lt;strong&gt;H3 grid resolution 4&lt;/strong&gt; as the second-level partition (covering approximately 1,770 km² per cell, generating ~5,000 cells for global coverage), and &lt;strong&gt;sensor&lt;/strong&gt; as the third level.&lt;/p&gt;

&lt;p&gt;H3 (Uber's hierarchical hexagonal grid library) has a critical property for geospatial data: cells at the same resolution level have approximately equal area, meaning partition file sizes are predictable — something rectangular bounding box partitioning does not guarantee. By indexing each scene's geometries with &lt;code&gt;h3.polyfill()&lt;/code&gt; in the Glue Job, you can join tables from different sensors using the H3 index as a key, without expensive geometric intersection operations at query time.&lt;/p&gt;

&lt;p&gt;An important operational detail: the Glue Job converting COG to GeoParquet should be configured with &lt;code&gt;--conf spark.sql.shuffle.partitions=200&lt;/code&gt; and G.2X workers (8 vCPU, 32 GB RAM) to process multispectral bands without disk spill. For Sentinel-2 10m resolution scenes (13 bands, ~800 MB per scene), processing a full day of global ingestion takes approximately 45 minutes with 20 G.2X workers — a cost of roughly USD 12 per execution. That number matters when you are planning a continuous operations budget.&lt;/p&gt;

&lt;p&gt;The Iceberg table should be configured with &lt;code&gt;write.target-file-size-bytes=268435456&lt;/code&gt; (256 MB) and compaction scheduled via Glue Workflow every 6 hours to avoid the small files problem that degrades Athena performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Playbook: Building the Geospatial Pipeline on AWS
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1. Define the Landing Zone with Lifecycle Policy&lt;/strong&gt; — Create a dedicated S3 bucket for raw data with S3 Intelligent-Tiering enabled from day one. Configure Object Lock in COMPLIANCE mode for reference data (baseline images) that require immutability for carbon credit auditing. Enable S3 Event Notifications to SQS FIFO with deduplication ID based on the object ETag — this prevents duplicate reprocessing when the same file is re-uploaded by a data provider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;2. Build the COG → GeoParquet/Iceberg Glue Transformation Job&lt;/strong&gt; — Use Glue 4.0 with native Iceberg support. Add &lt;code&gt;geoparquet&lt;/code&gt;, &lt;code&gt;h3-py&lt;/code&gt;, and &lt;code&gt;rasterio&lt;/code&gt; dependencies via Glue Python Shell or as an additional JAR. The job must: (a) read the COG with rasterio using windowed reads to avoid OOM, (b) compute the H3 index for each tile, (c) write to partitioned GeoParquet, (d) execute MERGE INTO on the Iceberg table to support idempotent re-ingestion using the composite key (scene_id, acquisition_date, sensor).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;3. Configure Fine-Grained Access Control with Lake Formation&lt;/strong&gt; — Register the S3 location in Lake Formation and use Cell-Level Security to restrict access by geometry — for example, climate credit analysts for Brazil can only query H3 cells within the national territory polygon. This is implemented via Row Filter Expressions in Lake Formation with the condition &lt;code&gt;h3_index IN (SELECT h3_index FROM geo_reference WHERE country = 'BRA')&lt;/code&gt;. Combine with Column-Level Security to protect sensitive land ownership metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4. Train and Register Models with Full Traceability&lt;/strong&gt; — Use SageMaker Experiments to track each training run with the Iceberg dataset version metadata (snapshot ID). Register the model in SageMaker Model Registry with manual approval for models that feed financial decisions (natural asset valuation, climate credit scoring). Configure SageMaker Lineage Tracking and export to OpenLineage/Marquez for integration with external data governance tools. The Iceberg snapshot ID as a training parameter is what allows exact reproduction of the dataset used in any audited model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;5. Expose Inference with Controlled Latency via API Gateway&lt;/strong&gt; — For real-time planning queries (e.g., 'what is the deforestation risk in this polygon over the next 12 months?'), use SageMaker Real-Time Endpoint with ml.g4dn.xlarge instances (1 T4 GPU) and configure Autoscaling with a 70% GPU utilization target. Place REST API Gateway in front with Usage Plans and API Keys per tenant. Configure the API Gateway timeout to 29 seconds (maximum limit) and implement a circuit breaker in the integration Lambda to fall back to cached results in DynamoDB when the endpoint is under pressure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;6. Instrument Observability with OpenTelemetry&lt;/strong&gt; — Instrument the full pipeline with OpenTelemetry: trace spans from SQS to the inference endpoint, ingestion throughput metrics (scenes/hour), Athena query latency by spatial query type, and model drift (prediction distribution vs. baseline). Send to CloudWatch with EMF (Embedded Metric Format) for custom metrics and configure alarms on P99 inference latency &amp;gt; 8 seconds and Glue Job error rate &amp;gt; 1%. These two SLOs are the minimum to operate with confidence.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;GeoParquet + Iceberg: The Game-Changing Combo:&lt;/strong&gt; If you can only make one architectural change today: migrate your geospatial data from raw GeoTIFF/Shapefile in S3 to GeoParquet on Iceberg tables with a Glue catalog. The migration cost is 1–2 engineering sprints. The return is a 60–80% reduction in Athena scan costs and elimination of geospatial joins at query time. This is not premature optimization — it is a prerequisite for any geospatial analysis at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Data Governance for Financial Decisions: Lineage Is Not Optional
&lt;/h2&gt;

&lt;p&gt;When Earth AI platform outputs feed financial decisions — carbon credit valuation, climate risk scoring for credit portfolios, parametric insurance pricing based on vegetation indices — lineage traceability stops being a best practice and becomes a regulatory requirement. In Brazil, BACEN already requires financial institutions to demonstrate the full methodology behind climate risk models (CMN Resolution 4.945/2021). In Europe, the EU AI Act classifies environmental risk scoring systems as 'high-risk', requiring training data documentation.&lt;/p&gt;

&lt;p&gt;In practice, this means each model prediction must be traceable to: (1) the exact Iceberg dataset snapshot used in training, (2) the version of the Glue Job code that processed the raw data, (3) the model version in SageMaker Model Registry, and (4) the feature engineering repository commit. This is what I call &lt;strong&gt;quadruple traceability&lt;/strong&gt; — and most implementations I see cover only (3).&lt;/p&gt;

&lt;p&gt;The technical implementation uses SageMaker Lineage Tracking to capture the model → dataset → processing chain, with the Iceberg snapshot ID as the immutable anchor. For integration with external governance tools (Collibra, Alation, DataHub), export lineage events via EventBridge to a Lambda that converts them to OpenLineage format and sends them to the Marquez API. This pattern allows external auditors to query the full lineage without needing direct access to the AWS account.&lt;/p&gt;

&lt;p&gt;A critical detail: Lake Formation must have &lt;strong&gt;audit logging&lt;/strong&gt; enabled for all data access operations, with logs sent to a separate S3 bucket with Object Lock. In carbon credit audits I have participated in, the absence of data access audit logs was the primary blocker for certification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference Numbers for Sizing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~USD 12&lt;/strong&gt; — Cost per Glue Job run (20 G.2X workers, 45 min). To process 1 day of global Sentinel-2 ingestion (~800 scenes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60–80%&lt;/strong&gt; — Reduction in Athena scan volume with GeoParquet + H3. Compared to naive date-only partitioning on raw GeoTIFF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;8s P99&lt;/strong&gt; — Latency SLO for geospatial risk inference. ml.g4dn.xlarge with ~200M parameter segmentation models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Defense in Depth: Location Data Is Sensitive Data
&lt;/h2&gt;

&lt;p&gt;There is a dangerous tendency to treat geospatial data as 'just map data' — public, non-sensitive, requiring no rigorous controls. This is a serious mistake. High-resolution location data combined with satellite image time series can reveal: troop movements, industrial facility production capacity, crop conditions before public reports (material non-public information for insider trading purposes), and land occupation patterns with legal implications.&lt;/p&gt;

&lt;p&gt;The security architecture I implement for these systems follows Zero Trust with four layers: &lt;strong&gt;network&lt;/strong&gt; (VPC with private endpoints for S3, SageMaker, and Glue — no traffic leaving to the public internet), &lt;strong&gt;identity&lt;/strong&gt; (IAM roles with &lt;code&gt;aws:RequestedRegion&lt;/code&gt; and &lt;code&gt;aws:SourceVpc&lt;/code&gt; conditions to ensure only services within the VPC access the data), &lt;strong&gt;data&lt;/strong&gt; (dedicated KMS CMK per data classification — one CMK for raw data, another for processed data, another for model outputs — with key policies requiring &lt;code&gt;kms:ViaService&lt;/code&gt; for access only via specific AWS services), and &lt;strong&gt;application&lt;/strong&gt; (Lake Formation with Row/Column-level security).&lt;/p&gt;

&lt;p&gt;A specific pattern I use to protect highly sensitive data: the processed data S3 bucket has a bucket policy with &lt;code&gt;aws:PrincipalOrgPaths&lt;/code&gt; that restricts access only to specific OUs within the AWS Organization. This ensures that even if an IAM role is compromised in another account in the organization, it will not have access to the sensitive geospatial data.&lt;/p&gt;

&lt;p&gt;For LGPD and GDPR compliance with data that may contain personal information (urban area imagery at sub-meter resolution), implement a face/license plate detection and anonymization step as part of the Glue transformation Job, before any data reaches the curated zone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Patterns I Have Seen in the Field
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storing raw GeoTIFF as the analytical source of truth&lt;/strong&gt;: GeoTIFF has no predicate pushdown. Every query scans the entire file. For analysis, convert to GeoParquet/Iceberg. Keep the original GeoTIFF only for reproducibility and reprocessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using SageMaker Real-Time Endpoint for batch scoring of millions of polygons&lt;/strong&gt;: The cost and latency of individual calls make this prohibitive. Use SageMaker Batch Transform with S3 as source/sink — processes 1M polygons in minutes at a fraction of the cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring the small files problem after compaction&lt;/strong&gt;: Iceberg with many incremental writes generates thousands of small files. Without scheduled compaction, Athena pays the overhead of listing and opening each file. Configure &lt;code&gt;OPTIMIZE&lt;/code&gt; via Glue Workflow every 6 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training models without pinning the dataset snapshot ID&lt;/strong&gt;: Without the Iceberg snapshot ID as a training parameter, you cannot reproduce the exact dataset of any model in production. This is unacceptable in regulated environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposing geospatial inference endpoints without per-tenant rate limiting&lt;/strong&gt;: Large bounding box queries can consume disproportionate resources. Implement Usage Plans in API Gateway with limits per tier (e.g., 100 req/min for basic tier, 1,000 req/min for enterprise).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assuming public satellite data (Sentinel, Landsat) requires no access control&lt;/strong&gt;: The raw data may be public, but derived indices, trained models, and calculated risk scores are proprietary assets. Treat them as such from day one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  MLOps for Geospatial Models: Drift Is Not Just Statistical
&lt;/h2&gt;

&lt;p&gt;Computer vision models for geospatial data have a type of drift that does not appear in standard statistical monitors: &lt;strong&gt;sensor drift&lt;/strong&gt;. When a satellite provider updates radiometric calibration processing, or when you add a new constellation to the pipeline, the pixel value distribution changes even though the physical world has not. A model trained on Sentinel-2 L2A can silently degrade when it starts receiving data from a new sensor without retraining.&lt;/p&gt;

&lt;p&gt;The solution is to implement drift monitoring at two levels: (1) &lt;strong&gt;feature drift&lt;/strong&gt; using SageMaker Model Monitor with a baseline calculated separately by sensor and by season (summer vs. winter vegetation has completely different distributions), and (2) &lt;strong&gt;concept drift&lt;/strong&gt; by monitoring the output prediction distribution against ground truth collected periodically via crowdsourcing or field validation.&lt;/p&gt;

&lt;p&gt;For retraining, I use a &lt;strong&gt;continuous training with human approval&lt;/strong&gt; pattern: a Step Functions workflow triggered when the drift score exceeds a threshold (e.g., PSI &amp;gt; 0.2 for any input feature), executes retraining with the last 90 days of Iceberg data, registers the new model in Model Registry with 'PendingApproval' status, and notifies the data science team via SNS. Manual approval is mandatory before production deployment — not for bureaucratic reasons, but because in financial decisions, a silently degrading model can cause systemic damage before detection.&lt;/p&gt;

&lt;p&gt;A concrete capacity number: for land cover segmentation models (U-Net with ResNet-50 backbone, ~25M parameters), full retraining on 90 days of Sentinel-2 data for Brazil (~180K scenes) takes approximately 8 hours on an ml.p3.8xlarge instance (4 V100 GPUs) — a cost of ~USD 90. This is acceptable for monthly or drift-triggered retraining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions in Geospatial Platform Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Should I use Amazon Location Service or build my own geospatial stack?
&lt;/h3&gt;

&lt;p&gt;Amazon Location Service is excellent for geocoding, routing, and real-time asset tracking use cases. For satellite image analysis, spectral indices, and geospatial ML, you need the full stack (S3 + Glue + GeoParquet + SageMaker). The two are not competitors — use Location Service for the operational location layer and the analytical stack for image processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the estimated monthly cost for a national-scale Earth AI platform (Brazil)?
&lt;/h3&gt;

&lt;p&gt;For national coverage of Brazil with Sentinel-2 (10m resolution, 5-day revisit): ~USD 800/month in S3 (2 years of processed data storage ~50 TB), ~USD 360/month in Glue Jobs (30 runs/month), ~USD 200/month in Athena (analytical queries), ~USD 400/month in SageMaker Endpoint (ml.g4dn.xlarge, 24/7). Total: ~USD 1,760/month before retraining and auxiliary services. Scales linearly with additional sensors.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to handle clouds and missing data in satellite image time series?
&lt;/h3&gt;

&lt;p&gt;Store the cloud mask (SCL band in Sentinel-2) as a separate column in GeoParquet. For time series analysis, use temporal interpolation with STAC (SpatioTemporal Asset Catalog) to identify the most recent cloud-free image for each pixel. ML models should be trained with synthetic cloud augmentation for robustness. For critical decisions, always include cloud cover percentage as a confidence metadata field in the inference output.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to ensure the pipeline is resilient to satellite data ingestion failures?
&lt;/h3&gt;

&lt;p&gt;Use SQS FIFO with a Dead Letter Queue (DLQ) configured for 3 attempts before moving to DLQ. Configure a monitoring Lambda that reads the DLQ every hour and sends alerts via SNS with the scene_id and error. For Glue Job failures, use the native retry mechanism (max 3 retries with exponential backoff) and configure EventBridge notification for persistent failures. Iceberg MERGE INTO guarantees idempotency — reprocessing an already-processed scene does not create duplicates.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Well-Architected Lenses for Earth AI Platforms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt;: KMS CMK per data classification, Lake Formation with Cell-Level Security by geometry, private VPC endpoints for all data services, Object Lock in COMPLIANCE mode for auditable reference data, and full audit logging with 7-year retention for regulatory compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reliability&lt;/strong&gt;: SQS FIFO with DLQ for resilient ingestion, Iceberg MERGE INTO for reprocessing idempotency, Multi-AZ for critical SageMaker Endpoints, and Step Functions with compensation for retraining workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;performance&lt;/strong&gt;: GeoParquet with H3 for geospatial predicate pushdown, Glue G.2X workers for multispectral band processing without spill, Iceberg compaction every 6 hours to avoid small files, and SageMaker Batch Transform for at-scale scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Architect's Note:&lt;/strong&gt; What strikes me about the Google Research signal is not the model — it is the implication that sustainable planning decisions now depend on data pipelines that most organizations still do not know how to build. If I were starting this project today, I would invest the first two weeks exclusively in getting geospatial partitioning and data lineage right — not in the model. In my experience, the model is the easy part; the platform that feeds it with traceable, correctly partitioned, and governed data is where projects fail. The most expensive lesson I have learned: location data without access auditing is a regulatory liability waiting to be triggered — and in financial environments, that cost is always greater than the cost of doing it right from the start.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: Earth AI Is a Data Platform, Not an ML Project
&lt;/h2&gt;

&lt;p&gt;The transition from 'pixels to planning' — from raw satellite imagery to sustainable operational decisions — is technically feasible on AWS today, at an accessible operational cost for mid-sized organizations. But success depends on architectural decisions that must be made before the first model is trained: storage format (GeoParquet/Iceberg, not raw GeoTIFF), partitioning strategy (hierarchical H3, not just by date), lineage governance (quadruple traceability), and defense in depth (KMS + Lake Formation + private VPC). For organizations operating in regulated environments — financial, insurance, carbon credit — traceability and audit logging are not optional. Start with the data platform. The model comes after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html" rel="noopener noreferrer"&gt;AWS Glue — Apache Iceberg Support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html" rel="noopener noreferrer"&gt;Amazon SageMaker Lineage Tracking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/cell-level-security.html" rel="noopener noreferrer"&gt;AWS Lake Formation — Cell-Level Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://geoparquet.org/" rel="noopener noreferrer"&gt;GeoParquet Specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://h3geo.org/" rel="noopener noreferrer"&gt;H3: Uber's Hexagonal Hierarchical Spatial Index&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openlineage.io/" rel="noopener noreferrer"&gt;OpenLineage — Open Standard for Data Lineage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected — Machine Learning Lens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bcb.gov.br/estabilidadefinanceira/exibenormativo?tipo=Resolucao%20CMN&amp;amp;numero=4945" rel="noopener noreferrer"&gt;Resolução CMN 4.945/2021 — Risco Climático em Instituições Financeiras&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/earth-ai-e-planejamento-com-dados-para-decisoes-sustentaveis" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataplatforms</category>
      <category>geospatial</category>
      <category>earthai</category>
      <category>dataplatform</category>
    </item>
    <item>
      <title>Accounts are not just storage. They're one of the reasons Solana is fast.</title>
      <dc:creator>Tanya Prajapati</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:50:49 +0000</pubDate>
      <link>https://dev.to/anya_e3c2d964a/accounts-are-not-just-storage-theyre-one-of-the-reasons-solana-is-fast-48g7</link>
      <guid>https://dev.to/anya_e3c2d964a/accounts-are-not-just-storage-theyre-one-of-the-reasons-solana-is-fast-48g7</guid>
      <description>&lt;p&gt;This day-27 of learning Solana&lt;/p&gt;

&lt;p&gt;For the past few days, I have explored accounts and how data is stored in those accounts.&lt;br&gt;
This the summary of all those important concepts about account every Solana beginner should know.&lt;/p&gt;

&lt;p&gt;Solana does not uses any mempool to store transactions when they are being processed. Instead, transaction are stored in a block and marked as successfully executed while they are being parallely processed. This helps in handling large number of transactions in lightening speed.&lt;/p&gt;

&lt;p&gt;Accounts are used to store data while programs are used to store executable code (Like smart contract in EVM).&lt;/p&gt;

&lt;p&gt;In Solana, there are more than one accoutns each with different purposes and significance. &lt;br&gt;
The System Program (Owner account) : It uses owner account address which is a string of 1's. For every system , the owner address is same. However this does not means that anyone can exploit the transaction or the account.&lt;/p&gt;

&lt;p&gt;In Solana , the address is of two types-&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The public key: It needs a private key which ensures the account is secure. Only the one who has private key can use that account and make transactions through it.&lt;/li&gt;
&lt;li&gt;Program Derived Address: The address is deterministically generated by a program using program ID and a set of seeds. These addresses are owned  entirely by a program (smart contract). The PDA act as a specialized smart locker while the user's public key act as the account owner.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Solana uses dynamic parallel account locking - &lt;br&gt;
Transactions must list every single account they intend to interact with before they run, the validator engine uses an advanced system of Read-Locks and Write-Locks. This prevents the "double-spend" problem while allowing entirely unrelated accounts to process in parallel without bottlenecking the network.&lt;/p&gt;

</description>
      <category>web3</category>
      <category>100daysofsolana</category>
      <category>blockchain</category>
      <category>developer</category>
    </item>
    <item>
      <title>GPT-5 vs Claude vs Nova on Bedrock: A Production Governance Bake-off</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:50:48 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/gpt-5-vs-claude-vs-nova-on-bedrock-a-production-governance-bake-off-580n</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/gpt-5-vs-claude-vs-nova-on-bedrock-a-production-governance-bake-off-580n</guid>
      <description>&lt;p&gt;The arrival of GPT-5.5, GPT-5.4, and Codex on Amazon Bedrock is not just a product event — it is a signal that Bedrock is consolidating as the unified control plane for frontier models in enterprise environments. For teams operating in regulated sectors, the question has shifted from 'which model to use?' to 'how do we govern multiple frontier models with the same security, traceability, and cost controls we already apply to the rest of our AWS infrastructure?' This analysis does exactly that bake-off: GPT-5.5 vs Claude 3.7 Sonnet vs Amazon Nova Pro, focused on production, not benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed when GPT-5 landed on Bedrock
&lt;/h2&gt;

&lt;p&gt;Before OpenAI models arrived on Bedrock, choosing GPT-4 or GPT-4o meant leaving the AWS perimeter: direct calls to the OpenAI API, secrets managed outside Secrets Manager, logs that bypassed CloudTrail, and data potentially leaving your residency region. For teams requiring LGPD, PCI-DSS, or SOC 2 compliance, that was a real governance cost, not a theoretical one.&lt;/p&gt;

&lt;p&gt;With GPT-5.5 and Codex available via &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; and &lt;code&gt;bedrock:InvokeModelWithResponseStream&lt;/code&gt;, the model becomes just another ARN resource. That means the IAM policies you already have — including conditions like &lt;code&gt;aws:RequestedRegion&lt;/code&gt;, &lt;code&gt;bedrock:modelId&lt;/code&gt;, and &lt;code&gt;aws:PrincipalTag&lt;/code&gt; — apply directly. CloudTrail records every invocation. Amazon Bedrock Guardrails, with its content filters, PII detection, and grounding checks, covers GPT-5.5 the same way it covers Claude or Nova.&lt;/p&gt;

&lt;p&gt;What this does not solve: network latency to regions where the model is still served via cross-region endpoints, and the fact that GPT-5.5 weights do not reside in your account — you are consuming a hosted model, not a deployed one. For use cases requiring full inference isolation, such as document analysis with classified customer data, this remains a threat model item that needs explicit documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dimension benchmarks miss: operational behavior
&lt;/h2&gt;

&lt;p&gt;Academic benchmarks measure capability under controlled conditions. In financial production, what matters is behavior under load, p99 latency consistency, and the real cost of a response — not the average cost, but the cost of an 8k-token prompt with 2k output at peak hours.&lt;/p&gt;

&lt;p&gt;Claude 3.7 Sonnet has a characteristic that matters greatly for agentic workflows: extended thinking mode produces chained reasoning that is auditable. In compliance contexts, being able to show the intermediate reasoning of a credit decision or fraud triage has direct regulatory value. GPT-5.5 also supports chain-of-thought, but the level of control over reasoning verbosity and the separation between scratchpad and final output is still less granular via the Bedrock API than what Anthropic exposes natively.&lt;/p&gt;

&lt;p&gt;Amazon Nova Pro, on the other hand, is the only one of the three where you have full visibility into the model lifecycle within AWS. It supports fine-tuning via Bedrock Custom Model Jobs, meaning you can adapt the model to domain-specific vocabulary — derivatives terminology, for example — without relying on prompt engineering. Nova Pro's cost per token is significantly lower, which matters when you are processing millions of documents in batch with Bedrock Batch Inference.&lt;/p&gt;

&lt;p&gt;The most common failure mode I see in production is not the model being wrong — it is the system lacking sufficient observability to know when the model was wrong. That leads directly to the instrumentation question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unified Control Plane: Model Governance on Bedrock
&lt;/h2&gt;

&lt;p&gt;Inference request flow through Bedrock governance layers, showing how GPT-5.5, Claude, and Nova share the same security and observability controls&lt;/p&gt;

&lt;h3&gt;
  
  
  🔐 AWS — Segurança e Identidade
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;IAM Policy bedrock:modelId condition (security)&lt;/li&gt;
&lt;li&gt;KMS Encrypt at rest / in transit (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟧 Amazon Bedrock — Plano de Controle
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bedrock Guardrails PII, content, grounding (security)&lt;/li&gt;
&lt;li&gt;Bedrock API Gateway InvokeModel / Stream (compute)&lt;/li&gt;
&lt;li&gt;Model Invocation Logging S3 + CloudTrail (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 Modelos Frontier
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GPT-5.5 / Codex OpenAI via Bedrock (ai)&lt;/li&gt;
&lt;li&gt;Claude 3.7 Sonnet Extended Thinking (ai)&lt;/li&gt;
&lt;li&gt;Amazon Nova Pro Fine-tune + Batch (ai)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📊 Observabilidade
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Latency P99 / Tokens (data)&lt;/li&gt;
&lt;li&gt;OpenTelemetry Trace ID por invocação (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;client -&amp;gt; iam: authentication&lt;/li&gt;
&lt;li&gt;iam -&amp;gt; guardrails: policy enforced&lt;/li&gt;
&lt;li&gt;guardrails -&amp;gt; gateway: filtered prompt&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; gpt5: InvokeModel&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; claude: InvokeModel&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; nova: InvokeModel&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; logging: async log&lt;/li&gt;
&lt;li&gt;kms -&amp;gt; logging: encryption&lt;/li&gt;
&lt;li&gt;logging -&amp;gt; cw: metrics&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; otel: trace span&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Instrumentation: where most teams get it wrong
&lt;/h2&gt;

&lt;p&gt;Bedrock emits native metrics to CloudWatch: &lt;code&gt;InvocationLatency&lt;/code&gt;, &lt;code&gt;InputTokenCount&lt;/code&gt;, &lt;code&gt;OutputTokenCount&lt;/code&gt;, &lt;code&gt;InvocationClientErrors&lt;/code&gt;, &lt;code&gt;InvocationThrottles&lt;/code&gt;. But these metrics alone are insufficient to operate an AI system in financial production. What is missing is correlation between the model invocation and business context — which user, which product, which decision was influenced by that response.&lt;/p&gt;

&lt;p&gt;The approach that works is instrumenting with OpenTelemetry at the application level, propagating a trace ID that crosses the Bedrock call and is included in the Model Invocation Logging payload. When you enable Model Invocation Logging with S3 + CloudWatch Logs as destination, each record includes the Bedrock &lt;code&gt;requestId&lt;/code&gt;. If you inject that &lt;code&gt;requestId&lt;/code&gt; as an attribute in your OTel span, you can correlate a customer complaint with the exact prompt and response that generated that decision — that is real auditability.&lt;/p&gt;

&lt;p&gt;For GPT-5.5 specifically, one watch point: the model supports &lt;code&gt;response_format: json_object&lt;/code&gt; and structured outputs, but schema validation happens on the model side, not in Guardrails. If you need to guarantee that the response respects a specific schema before persisting to DynamoDB, add a validation step in the Lambda that processes the response — do not assume the model will always return valid JSON under load or with adversarial prompts.&lt;/p&gt;

&lt;p&gt;Claude 3.7 with extended thinking exposes the reasoning block as a separate field in the response. Store that field in S3 with a 7-year retention policy if you are in a regulated environment — it is decision-making evidence, not just a technical log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real cost: beyond price per token
&lt;/h2&gt;

&lt;p&gt;Frontier model cost comparisons frequently stop at input/output token price. That is the smallest component of total cost in production systems. The components that dominate cost are: (1) tokens wasted by poorly structured prompts, (2) retries due to throttling, and (3) the cost of operating the system around the model.&lt;/p&gt;

&lt;p&gt;GPT-5.5 has a higher price per token than Claude 3.7 Sonnet and significantly higher than Nova Pro. For a document analysis workload processing 10 million pages per month with an average context of 4k tokens per page, the cost difference between GPT-5.5 and Nova Pro can be on the order of 5-8x. This is not an argument against using GPT-5.5 — it is an argument for using it selectively, in cases where its differentiated reasoning capability justifies the cost.&lt;/p&gt;

&lt;p&gt;Bedrock Batch Inference changes the calculation for async workloads. With batch, you get up to 50% discount on token price for Claude and Nova. GPT-5.5 on Bedrock does not yet support batch inference at the time of this analysis — meaning that for large-scale processing, you need to manage your own queue (SQS + Lambda with reserved concurrency) and handle the account-level TPM (tokens per minute) limits.&lt;/p&gt;

&lt;p&gt;Bedrock TPM limits for third-party models like GPT-5.5 are managed via service quota, and increases require an AWS Support request. In multi-tenant environments where multiple products share the same AWS account, this can become a bottleneck. The solution is to use AWS Organizations with separate accounts per product and independent quotas — do not share TPM limits between critical and experimental workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.5 vs Claude 3.7 Sonnet vs Amazon Nova Pro — Technical Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;GPT-5.5 (OpenAI via Bedrock)&lt;/th&gt;
&lt;th&gt;Claude 3.7 Sonnet (Anthropic)&lt;/th&gt;
&lt;th&gt;Amazon Nova Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Relative cost per token (input)&lt;/td&gt;
&lt;td&gt;High (baseline ~$3/MTok)&lt;/td&gt;
&lt;td&gt;Medium (~$3/MTok)&lt;/td&gt;
&lt;td&gt;Low (~$0.8/MTok)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch Inference support (Bedrock)&lt;/td&gt;
&lt;td&gt;No (at time of analysis)&lt;/td&gt;
&lt;td&gt;Yes — up to 50% discount&lt;/td&gt;
&lt;td&gt;Yes — up to 50% discount&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning via Bedrock&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Yes — Custom Model Jobs&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auditable reasoning (structured CoT)&lt;/td&gt;
&lt;td&gt;Partial — via structured outputs&lt;/td&gt;
&lt;td&gt;Yes — separate extended thinking block&lt;/td&gt;
&lt;td&gt;Partial — via prompt engineering&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock Guardrails coverage&lt;/td&gt;
&lt;td&gt;Yes — same controls&lt;/td&gt;
&lt;td&gt;Yes — same controls&lt;/td&gt;
&lt;td&gt;Yes — same controls&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 latency (2k token prompt)&lt;/td&gt;
&lt;td&gt;~1.8s (estimate; varies by region)&lt;/td&gt;
&lt;td&gt;~1.5s (without extended thinking)&lt;/td&gt;
&lt;td&gt;~1.2s&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model weight residency in AWS account&lt;/td&gt;
&lt;td&gt;No — hosted by OpenAI&lt;/td&gt;
&lt;td&gt;No — hosted by Anthropic&lt;/td&gt;
&lt;td&gt;Yes — Amazon-native&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex / specialized code generation&lt;/td&gt;
&lt;td&gt;Yes — Codex available on Bedrock&lt;/td&gt;
&lt;td&gt;Strong — Claude 3.7 is top-tier for code&lt;/td&gt;
&lt;td&gt;Competent — best with fine-tuning&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Decision Matrix: Which Model for Which Workload?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GPT-5.5 via Bedrock
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top-tier reasoning capability for complex, ambiguous tasks&lt;/li&gt;
&lt;li&gt;Codex for code generation in AI-assisted DevOps pipelines&lt;/li&gt;
&lt;li&gt;Unified governance via IAM, CloudTrail, and Guardrails — no AWS perimeter exit&lt;/li&gt;
&lt;li&gt;Structured outputs with JSON schema for direct downstream system integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher cost per token; no batch inference support on Bedrock&lt;/li&gt;
&lt;li&gt;Weights do not reside in AWS account — implications for sensitive data threat models&lt;/li&gt;
&lt;li&gt;TPM limits managed via service quota; increases require AWS Support&lt;/li&gt;
&lt;li&gt;No fine-tuning available via Bedrock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Best for: high-complexity reasoning tasks (legal analysis, due diligence), code generation in CI/CD pipelines, and cases where response quality justifies the premium cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude 3.7 Sonnet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extended thinking with separate, auditable reasoning block — direct regulatory value&lt;/li&gt;
&lt;li&gt;Batch inference support with up to 50% discount for async workloads&lt;/li&gt;
&lt;li&gt;Excellent at code and technical analysis; consistent in long contexts&lt;/li&gt;
&lt;li&gt;Competitive pricing with GPT-5.5 for comparable quality on many tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extended thinking significantly increases latency — not suitable for real-time inference&lt;/li&gt;
&lt;li&gt;No fine-tuning via Bedrock; adaptation relies on prompt engineering and RAG&lt;/li&gt;
&lt;li&gt;Weights also do not reside in AWS account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Best for: agentic workflows requiring auditable reasoning, regulatory document analysis, fraud triage with explainability, and high-quality batch processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Nova Pro
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowest cost per token — 5-8x cheaper than GPT-5.5 for high-volume workloads&lt;/li&gt;
&lt;li&gt;Fine-tuning via Bedrock Custom Model Jobs — domain adaptation without prompt engineering&lt;/li&gt;
&lt;li&gt;Amazon-native weights; best posture for ultra-sensitive data threat models&lt;/li&gt;
&lt;li&gt;Batch inference support; lowest P50 latency among the three&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning capability below GPT-5.5 and Claude 3.7 on high-complexity tasks&lt;/li&gt;
&lt;li&gt;Fine-tuning requires quality dataset and MLOps pipeline — operational overhead&lt;/li&gt;
&lt;li&gt;Smaller third-party tool and integration ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Best for: large-scale processing (millions of documents), classification and extraction tasks where fine-tuning pays off, and environments with stricter data sovereignty requirements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The routing pattern that resolves the dilemma:&lt;/strong&gt; The answer to 'which model to use?' in mature production systems is not a single choice — it is a router. Implement an AI Gateway in Lambda or ECS that classifies each request by complexity, data sensitivity, and latency requirement, and routes to the appropriate model. Low-complexity, high-volume requests go to Nova Pro. Analyses requiring auditable reasoning go to Claude 3.7 with extended thinking. Code generation in CI/CD pipelines goes to Codex. Same Guardrails, same CloudTrail, same trace ID — unified governance with workload-optimized cost. This pattern reduces total cost by 40-60% compared to using GPT-5.5 for everything, without sacrificing quality where it matters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Numbers that guide the routing decision
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5-8x&lt;/strong&gt; — Cost per token difference: GPT-5.5 vs Nova Pro. For high-volume workloads, intelligent routing is the largest cost lever available on Bedrock&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50%&lt;/strong&gt; — Maximum discount with Batch Inference (Claude and Nova). Async workloads — document analysis, data enrichment — should use batch by default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 anos&lt;/strong&gt; — Recommended retention for reasoning logs in regulated environments. Claude 3.7's extended thinking block is decision-process evidence for regulatory audits&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Anti-patterns I encounter in production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Using GPT-5.5 for all workloads because it is 'the most capable' — ignores that 70% of tasks do not need frontier reasoning and pays 5-8x more for it&lt;/li&gt;
&lt;li&gt;Not enabling Model Invocation Logging — without prompt/response logs, regulatory audit is impossible and quality regression debugging is blind&lt;/li&gt;
&lt;li&gt;Assuming Bedrock Guardrails replaces schema validation in code — Guardrails filters content, not data structure; invalid JSON still passes through&lt;/li&gt;
&lt;li&gt;Sharing TPM limits between critical and experimental workloads in the same AWS account — a token-burst experiment can throttle a production feature&lt;/li&gt;
&lt;li&gt;Not propagating trace IDs in Bedrock calls — loses the correlation between business decision and model invocation, making incident investigations much slower&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My curation note:&lt;/strong&gt; In practice, what I would do: start with Claude 3.7 Sonnet as the default model for any new workload in a financial environment — the auditable extended thinking is worth more than the cost difference versus Nova Pro when you are in a regulated sector. Introduce GPT-5.5 and Codex specifically for the AI-assisted DevOps pipeline, where code generation quality justifies the premium cost. Nova Pro would enter as a routing destination for large-scale classification and extraction, with fine-tuning trained on domain vocabulary. The lesson I learned the hard way: the biggest risk is not choosing the wrong model — it is not having sufficient observability to know when any model is wrong. Invest in trace IDs, Model Invocation Logging, and quality drift alerts before optimizing which model to use.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Recommendation: do not choose a model, build a router
&lt;/h2&gt;

&lt;p&gt;The arrival of GPT-5.5 and Codex on Bedrock does not make Claude or Nova obsolete — it completes the portfolio. The recommendation is clear: implement an AI Gateway that routes by workload, not by model preference. Use Claude 3.7 Sonnet as the default for tasks requiring auditable reasoning in regulated environments. Use GPT-5.5 and Codex for code generation and high-complexity reasoning tasks where the premium cost is justified by value. Use Nova Pro for large-scale processing and cases where domain fine-tuning is viable. In all cases: enable Model Invocation Logging on day one, propagate trace IDs, and treat TPM limits as an infrastructure quota requiring capacity planning — not a configuration detail. Unified governance on Bedrock is the real asset here; the models are commodities that will evolve. Build your platform around the controls, not around a specific model.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation-logging.html" rel="noopener noreferrer"&gt;Amazon Bedrock Model Invocation Logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html" rel="noopener noreferrer"&gt;Amazon Bedrock Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html" rel="noopener noreferrer"&gt;Amazon Bedrock Batch Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html" rel="noopener noreferrer"&gt;Amazon Bedrock Custom Model Fine-tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected — Machine Learning Lens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html" rel="noopener noreferrer"&gt;Amazon Bedrock Service Quotas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws-otel.github.io/docs/getting-started/lambda" rel="noopener noreferrer"&gt;OpenTelemetry for AWS Lambda and Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/" rel="noopener noreferrer"&gt;AWS News Blog — Amazon Bedrock&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/bedrock-2026-governanca-modelos-frontier" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>bedrock</category>
      <category>gpt5</category>
      <category>claude</category>
    </item>
    <item>
      <title>Modern KYC: Serverless, AI and Audit Trails in Financial Services</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:50:43 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/modern-kyc-serverless-ai-and-audit-trails-in-financial-services-40fe</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/modern-kyc-serverless-ai-and-audit-trails-in-financial-services-40fe</guid>
      <description>&lt;p&gt;For years, KYC was treated as a workflow problem — forms, manual review queues and batch jobs running at 2 a.m. Regulators tolerated day-long latencies because everyone operated at the same pace. That tacit contract is unraveling: open finance, instant payments and regulatory pressure for real-time auditable decisions are forcing an architectural rupture. The signal I analyze here is not about replacing analysts with AI — it is about redesigning the KYC pipeline as a first-class system: event-driven, serverless where it makes sense, with AI as a co-pilot and auditability as a first-class architectural citizen, not a compliance afterthought.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost of Legacy KYC — and What Changes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~$50&lt;/strong&gt; — Average cost per manual KYC onboarding in traditional banks. Source: Thomson Reuters 2023 market estimates; includes analysts, rework and tooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;2 min&lt;/strong&gt; — Target onboarding latency in fintechs with serverless + AI pipelines. Includes document extraction, identity verification and risk scoring — no human intervention for low-risk cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60-80%&lt;/strong&gt; — Reduction in cases escalated to manual review with well-calibrated AI triage. Critically depends on training data quality and confidence thresholds configured per product&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Signal: Why Serverless KYC Is Emerging Now
&lt;/h2&gt;

&lt;p&gt;The movement toward serverless KYC pipelines is not a technological novelty — it is a convergence of pressures that have finally made the architecture both viable and necessary at the same time.&lt;/p&gt;

&lt;p&gt;First, tooling maturity. AWS Step Functions Express Workflows has reached a point where orchestrating a 15-step pipeline with conditional branching, exponential retries and compensations is operationally sustainable. The 5-minute per-execution limit of Express Workflows is irrelevant for real-time KYC — if your verification flow takes longer than that, the problem is not the orchestrator, it is the design.&lt;/p&gt;

&lt;p&gt;Second, Amazon Textract and Bedrock changed the document extraction equation. Previously, you needed a custom ML pipeline to extract data from driver's licenses, passports or income statements with acceptable accuracy. Today, a combination of Textract with AnalyzeDocument (FORMS + QUERIES mode) and a Bedrock model for semantic validation delivers accuracy comparable to a human analyst on clean documents — with 3-8 second latency per document and cost in the order of fractions of a cent per page.&lt;/p&gt;

&lt;p&gt;Third, and perhaps most important for regulated financial environments: AWS's shared responsibility model has evolved with sector-specific certifications (PCI DSS Level 1, SOC 2 Type II, ISO 27001, BACEN Resolution 4.893 alignment). This does not eliminate compliance work, but drastically reduces audit scope when you use managed services with documented controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern KYC Pipeline — Event-Driven Decision Flow
&lt;/h2&gt;

&lt;p&gt;Complete KYC onboarding flow: from customer submission to auditable decision, with AI as co-pilot and immutable trail in S3.&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 AWS — Edge &amp;amp; Ingestion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway REST + WAF + mTLS (edge)&lt;/li&gt;
&lt;li&gt;S3 — Raw Docs SSE-KMS, Object Lock (storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚙️ AWS — Orchestration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Step Functions Express Workflow (compute)&lt;/li&gt;
&lt;li&gt;Lambda — Extract Textract AnalyzeDoc (compute)&lt;/li&gt;
&lt;li&gt;Lambda — Risk Score Bedrock Claude / Nova (ai)&lt;/li&gt;
&lt;li&gt;Lambda — Sanctions Ofac + PEP API (compute)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧠 AWS — AI &amp;amp; Decisão
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Bedrock Claude 3 Sonnet / Nova (ai)&lt;/li&gt;
&lt;li&gt;Lambda — Decision Rules Engine + AI (compute)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🗄️ AWS — State &amp;amp; Audit
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DynamoDB KYC State Table (data)&lt;/li&gt;
&lt;li&gt;DynamoDB Streams → Audit Fanout (messaging)&lt;/li&gt;
&lt;li&gt;S3 — Audit Log Object Lock WORM (storage)&lt;/li&gt;
&lt;li&gt;CloudWatch SLO Dashboards + Alarms (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;client -&amp;gt; apigw: POST /kyc multipart&lt;/li&gt;
&lt;li&gt;apigw -&amp;gt; s3raw: encrypted doc upload&lt;/li&gt;
&lt;li&gt;apigw -&amp;gt; sfn: start execution&lt;/li&gt;
&lt;li&gt;sfn -&amp;gt; lambda_extract: Step 1: extraction&lt;/li&gt;
&lt;li&gt;lambda_extract -&amp;gt; bedrock: semantic validation&lt;/li&gt;
&lt;li&gt;sfn -&amp;gt; lambda_sanction: Step 2: sanctions&lt;/li&gt;
&lt;li&gt;sfn -&amp;gt; lambda_risk: Step 3: risk&lt;/li&gt;
&lt;li&gt;lambda_risk -&amp;gt; bedrock: LLM scoring&lt;/li&gt;
&lt;li&gt;sfn -&amp;gt; lambda_decision: Step 4: decision&lt;/li&gt;
&lt;li&gt;lambda_decision -&amp;gt; dynamo: persist KYC state&lt;/li&gt;
&lt;li&gt;dynamo -&amp;gt; dynamo_streams: change stream&lt;/li&gt;
&lt;li&gt;dynamo_streams -&amp;gt; s3audit: WORM audit record&lt;/li&gt;
&lt;li&gt;dynamo_streams -&amp;gt; cloudwatch: SLO metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Auditability as Architecture, Not as Logging
&lt;/h2&gt;

&lt;p&gt;The most common mistake I see in KYC architectures — even in experienced teams — is treating the audit trail as a side effect: you save the decision result somewhere and call it an audit log. Regulators like BACEN, CVM and COAF do not accept this. They want to know &lt;em&gt;why&lt;/em&gt; the decision was made, &lt;em&gt;what data&lt;/em&gt; was available at the time of the decision and &lt;em&gt;who or what&lt;/em&gt; executed each step.&lt;/p&gt;

&lt;p&gt;The architecture I propose inverts this logic: the audit trail is a first-class output of the pipeline, not an application log. Each Step Functions execution generates a unique execution ARN that serves as the traceability primary key. Each Lambda participating in the flow persists its input, output and metadata (Bedrock model version, Lambda function version, millisecond-precision timestamp) in DynamoDB with a partition key &lt;code&gt;kyc#customerId#executionId&lt;/code&gt;. DynamoDB Streams then propagates each mutation to an S3 bucket with &lt;strong&gt;Object Lock in COMPLIANCE mode&lt;/strong&gt; — meaning not even the root account can delete the record before the configured retention period (minimum 5 years for KYC in Brazil).&lt;/p&gt;

&lt;p&gt;A critical detail: when you use Bedrock as a decision co-pilot, the prompt sent to the model, the full response and the model used (including version) must be part of the audit record. This is not optional — it is what allows reconstructing the decision months later during a regulatory audit. Use explicit &lt;code&gt;modelId&lt;/code&gt; in Bedrock calls (never &lt;code&gt;latest&lt;/code&gt;) and persist &lt;code&gt;usage.inputTokens&lt;/code&gt; + &lt;code&gt;usage.outputTokens&lt;/code&gt; for cost traceability and reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes for Architects with Modern KYC
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration replaces point-to-point integration&lt;/strong&gt;: Step Functions Express Workflows with per-state retry configuration (maxAttempts: 3, backoffRate: 2.0, intervalSeconds: 1) eliminates retry logic scattered across multiple services and centralizes failure handling — including compensations (partial state rollback).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI as co-pilot, not arbiter&lt;/strong&gt;: Bedrock should augment human decision-making in medium-confidence cases, not replace it. Define explicit thresholds: score &amp;lt; 0.3 = auto-approve, 0.3-0.7 = human review queue, &amp;gt; 0.7 = auto-reject. These thresholds are business parameters, not code constants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency is a requirement, not an optimization&lt;/strong&gt;: Each Lambda in the pipeline must be idempotent using the Step Functions &lt;code&gt;executionId&lt;/code&gt; as the idempotency key. DynamoDB with &lt;code&gt;ConditionExpression: attribute_not_exists(pk)&lt;/code&gt; ensures retries do not create duplicate records or trigger repeated sanctions checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separation of data and control planes&lt;/strong&gt;: The control plane (Step Functions, Lambda, Bedrock) and the data plane (DynamoDB, S3) must have separate IAM roles with strict least-privilege. The extraction Lambda role must not have write access to the audit bucket — only the Streams fanout Lambda has that permission.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KYC SLOs are business SLOs&lt;/strong&gt;: Define explicit SLOs: p99 decision latency &amp;lt; 90s for automatic cases, extraction error rate &amp;lt; 0.5%, sanctions false positive rate &amp;lt; 0.1%. These numbers must be in CloudWatch Dashboards with alarms linked to runbooks, not just in architecture presentations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual encryption, not universal&lt;/strong&gt;: Use KMS with customer-managed keys (CMK) and key policies that restrict usage by &lt;code&gt;aws:PrincipalTag/Environment&lt;/code&gt; and &lt;code&gt;aws:RequestedRegion&lt;/code&gt;. PII data in DynamoDB should use attribute-level encryption with AWS Encryption SDK — not just table encryption — so a compromised key does not expose the entire table.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real Trade-offs: Serverless KYC Is Not a Silver Bullet
&lt;/h2&gt;

&lt;p&gt;Before recommending this architecture to any client, I need to be honest about where it fails or requires extra care.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold starts in critical flows&lt;/strong&gt;: Lambda with Java or .NET runtime can have cold starts of 800ms-2s. For real-time KYC, this is unacceptable if it occurs in the critical path. The solution is not to blindly migrate to Go or Python — it is to use &lt;strong&gt;Provisioned Concurrency&lt;/strong&gt; for functions in the critical path (extraction and decision), with Application Auto Scaling configured to scale based on &lt;code&gt;ProvisionedConcurrencyUtilization&lt;/code&gt;. The additional cost is real (~$0.015/GB-hour for provisioned concurrency vs $0.0000166667/GB-second for on-demand) — you need to model the load profile before deciding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Textract limitations on low-quality documents&lt;/strong&gt;: Textract AnalyzeDocument has degraded accuracy on scanned documents with resolution &amp;lt; 150 DPI, uneven lighting (document photo taken with a phone in a dark environment) or laminated documents with glare. In production, you need image pre-processing (Lambda with OpenCV or Amazon Rekognition DetectText as fallback) and a minimum confidence threshold per extracted field — if &lt;code&gt;Confidence &amp;lt; 85&lt;/code&gt; on required fields like name or tax ID, the document must be rejected for resubmission, not processed with uncertain data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock cost at scale&lt;/strong&gt;: A Claude 3 Sonnet call for risk scoring with 2000-token context costs approximately $0.003-0.006 per call. At 100,000 onboardings/month, that is $300-600/month in inference alone — manageable. But if you use Bedrock for every validation step without criteria, cost scales quickly. The rule I apply: Bedrock enters only when deterministic rules cannot resolve — not as the first processing line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throttling of external sanctions APIs&lt;/strong&gt;: OFAC, PEP and CSNU lists are queried via third-party APIs with aggressive rate limits (typically 10-50 req/s per account). During onboarding spikes, this creates a bottleneck. The solution is a 24h TTL cache in ElastiCache Redis for already-verified entities, with forced invalidation when lists are updated — reduces external calls by 70-80% in flows with periodic re-verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Positioning: How to Prepare Your Organization
&lt;/h2&gt;

&lt;p&gt;The most critical gap I observe is not technological — it is organizational. Teams operating legacy KYC have compliance analysts who understand the rules but not the technical pipeline, and engineers who understand the pipeline but not the regulatory implications of each design decision. Serverless + AI architecture amplifies this gap if not managed.&lt;/p&gt;

&lt;p&gt;The first change I recommend is creating a &lt;strong&gt;KYC Design Authority&lt;/strong&gt; — a small group (3-5 people) with representation from engineering, compliance and product that reviews and approves changes to the decision pipeline. This is not bureaucracy: it is the mechanism that ensures a change in the Bedrock prompt does not inadvertently violate a credit policy or create unintentional discriminatory bias.&lt;/p&gt;

&lt;p&gt;Second, invest in &lt;strong&gt;decision observability&lt;/strong&gt;, not just infrastructure observability. CloudWatch Metrics for Lambda latency is necessary but insufficient. You need business metrics: approval rate by customer segment, risk score distribution over time, disagreement rate between AI and human analyst in review cases. These metrics are the signal that the model is drifting or that business rules changed without pipeline updates.&lt;/p&gt;

&lt;p&gt;Third, treat &lt;strong&gt;Bedrock prompts as infrastructure code&lt;/strong&gt;: versioned in Git, reviewed via PR, tested with a curated set of test cases (including regulatory edge cases) before any deployment. A prompt that changes credit approval criteria without going through compliance review is equivalent to a code deploy that changes pricing logic without approval — unacceptable in a regulated financial environment.&lt;/p&gt;

&lt;p&gt;Finally, plan for &lt;strong&gt;multi-region from the start&lt;/strong&gt; if you operate in markets with data residency requirements. In Brazil, KYC data with PII must reside in &lt;code&gt;sa-east-1&lt;/code&gt; (São Paulo). If you need active-active DR, DynamoDB Global Tables replication works, but you need KMS key policies that restrict decryption to the primary region — the replica can store but must not decrypt without explicit approval.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Auditability Paradox with Generative AI:&lt;/strong&gt; LLMs are inherently non-deterministic: the same prompt with temperature &amp;gt; 0 can produce different responses. This creates a regulatory paradox — how do you audit a decision that may not be reproducible? The architectural answer is: you do not audit reproducibility, you audit traceability. Persist the exact prompt, exact response, exact modelId and timestamp. If a regulator questions the decision, you show the reasoning that existed &lt;em&gt;at that moment&lt;/em&gt; — not that the system would make the same decision today. For high-consequence cases (credit rejection, fraud suspicion), use &lt;code&gt;temperature: 0&lt;/code&gt; and &lt;code&gt;top_p: 1&lt;/code&gt; to maximize determinism, and document this as an AI governance policy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Critical Anti-Patterns in Serverless KYC
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monolithic KYC Lambda&lt;/strong&gt;: A single Lambda function that does extraction, sanctions check, risk scoring and persists the result. Impossible to unit test, impossible to retry granularly, impossible to audit which step failed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using &lt;code&gt;latest&lt;/code&gt; as Bedrock model version&lt;/strong&gt;: Guarantees a model update changes production decision behavior without any review. Always pin to a specific version: &lt;code&gt;anthropic.claude-3-sonnet-20240229-v1:0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit log in CloudWatch Logs as primary source&lt;/strong&gt;: CloudWatch Logs has configurable retention but lacks immutability guarantees equivalent to S3 Object Lock. For regulatory purposes, CloudWatch is operational observability — S3 with Object Lock COMPLIANCE is the official record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sharing KMS key across environments&lt;/strong&gt;: Using the same CMK for dev, staging and production means a developer with dev access can potentially decrypt production data. Separate keys per environment with SCPs that block cross-account usage are mandatory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step Functions Standard Workflow for real-time KYC&lt;/strong&gt;: Standard Workflows have ~1s state transition latency and cost per state transition ($0.025/1000 transitions). For a 15-state pipeline with 100k executions/day, cost is ~$37/day in transitions alone. Express Workflows are appropriate for KYC: execution &amp;lt; 5 min, duration-based cost, and support up to 100,000 executions/second.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Modern KYC Through the AWS Well-Architected Lens
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt;: Zero Trust in the pipeline: each Lambda assumes a role with least-privilege, no shared role across functions. KMS CMK with key policy restricted by &lt;code&gt;aws:PrincipalTag&lt;/code&gt;. PII encrypted at attribute level in DynamoDB with AWS Encryption SDK. WAF with AWS managed rules + custom rules for rate limiting by tax ID at API Gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reliability&lt;/strong&gt;: Step Functions Express with per-state retry and catch to Dead Letter Queue (SQS FIFO with deduplication). DynamoDB with on-demand capacity to absorb onboarding spikes without throttling. S3 Object Lock for audit durability. Circuit breaker for external sanctions APIs via Lambda with Redis cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;performance&lt;/strong&gt;: Provisioned Concurrency for Lambdas in the critical path. Asynchronous Textract with SNS notification for documents &amp;gt; 1 page. Bedrock with streaming response for progressive user feedback. DynamoDB with partition key &lt;code&gt;kyc#customerId&lt;/code&gt; + sort key &lt;code&gt;timestamp#executionId&lt;/code&gt; for efficient per-customer queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cost&lt;/strong&gt;: Express Workflows vs Standard: 60-80% savings in orchestration cost for short-duration pipelines. Bedrock only for medium-confidence cases (deterministic rules first). ElastiCache Redis for sanctions cache reduces external API cost by 70%. S3 Intelligent-Tiering for audit logs with decreasing access over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Curator's Note: What I Would Do Differently the First Time:&lt;/strong&gt; In KYC projects I have been involved with, the most consistent regret is not having defined AI confidence thresholds as business parameters in Systems Manager Parameter Store from the start — they end up hardcoded in Lambda code and changing a threshold becomes a full CI/CD pipeline deploy, when it should be a compliance-approved configuration change in minutes. The second regret is not including the internal audit team in the design review before the first production deploy — they identify traceability requirements that engineers do not anticipate, such as the need to record &lt;em&gt;which version of the OFAC list&lt;/em&gt; was in effect at the time of verification. I learned that in regulated financial environments, the audit architecture must be designed with auditors, not for auditors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: Adopt, but with Explicit AI Governance
&lt;/h2&gt;

&lt;p&gt;The serverless KYC architecture with AI assistance is mature enough for production in regulated financial environments — the pieces are available, the use cases are documented and the costs are justifiable. The risk is not in the technology; it is in governance. Teams that adopt Bedrock for KYC decisions without an explicit AI governance framework — documented thresholds, versioned prompts, drift metrics, mandatory human review for high-consequence cases — are creating regulatory liability that will surface in the next audit. My recommendation: start with a pilot in a low-risk product segment, measure the disagreement rate between AI and human analyst for 90 days, calibrate thresholds with real data and only then expand. The rush to automate KYC is understandable — the cost of getting it wrong in a regulated environment is far greater than the cost of doing it slowly and correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  References and Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/limits-overview.html" rel="noopener noreferrer"&gt;AWS Step Functions Express Workflows — Quotas and Limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html" rel="noopener noreferrer"&gt;Amazon Textract AnalyzeDocument API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html" rel="noopener noreferrer"&gt;Amazon Bedrock Model IDs and Versioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock-overview.html" rel="noopener noreferrer"&gt;S3 Object Lock — Compliance Mode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/dynamodb-encryption.html" rel="noopener noreferrer"&gt;AWS Encryption SDK — DynamoDB Attribute Encryption&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/compliance/bacen/" rel="noopener noreferrer"&gt;AWS Financial Services Compliance — BACEN Resolution 4.893&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/building-event-driven-architectures/" rel="noopener noreferrer"&gt;Building event-driven architectures on AWS — AWS Architecture Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataintensive.net/" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications — Martin Kleppmann&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/kyc-moderno-com-serverless-ia-e-trilha-de-auditoria" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>financialsystems</category>
      <category>kyc</category>
      <category>financialservices</category>
      <category>serverless</category>
    </item>
    <item>
      <title>One of the most confusing errors you can face while deploying a Node.js or Docker-based application</title>
      <dc:creator>FOLASAYO SAMUEL OLAYEMI</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:50:34 +0000</pubDate>
      <link>https://dev.to/saint_vandora/one-of-the-most-confusing-errors-you-can-face-while-deploying-a-nodejs-or-docker-based-application-2f6k</link>
      <guid>https://dev.to/saint_vandora/one-of-the-most-confusing-errors-you-can-face-while-deploying-a-nodejs-or-docker-based-application-2f6k</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/saint_vandora/fixing-git-divergent-branches-on-a-production-server-real-devops-debugging-walkthrough-48np" class="crayons-story__hidden-navigation-link"&gt;Fixing “Git Divergent Branches” on a Production Server (Real DevOps Debugging Walkthrough)&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/saint_vandora" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F720588%2F0168bd20-9ed8-4499-b590-b651c8202e85.jpeg" alt="saint_vandora profile" class="crayons-avatar__image" width="800" height="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/saint_vandora" class="crayons-story__secondary fw-medium m:hidden"&gt;
              FOLASAYO SAMUEL OLAYEMI
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                FOLASAYO SAMUEL OLAYEMI
                
              
              &lt;div id="story-author-preview-content-3973591" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/saint_vandora" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F720588%2F0168bd20-9ed8-4499-b590-b651c8202e85.jpeg" class="crayons-avatar__image" alt="" width="800" height="800"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;FOLASAYO SAMUEL OLAYEMI&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/saint_vandora/fixing-git-divergent-branches-on-a-production-server-real-devops-debugging-walkthrough-48np" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 23&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/saint_vandora/fixing-git-divergent-branches-on-a-production-server-real-devops-debugging-walkthrough-48np" id="article-link-3973591"&gt;
          Fixing “Git Divergent Branches” on a Production Server (Real DevOps Debugging Walkthrough)
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/tutorial"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;tutorial&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/saint_vandora/fixing-git-divergent-branches-on-a-production-server-real-devops-debugging-walkthrough-48np" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt;&amp;nbsp;reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/saint_vandora/fixing-git-divergent-branches-on-a-production-server-real-devops-debugging-walkthrough-48np#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              

              &lt;span class="hidden s:inline"&gt;Add&amp;nbsp;Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>One Org or Many? The Postmortem Nobody Wants to Write</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:50:06 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/one-org-or-many-the-postmortem-nobody-wants-to-write-5e50</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/one-org-or-many-the-postmortem-nobody-wants-to-write-5e50</guid>
      <description>&lt;p&gt;Sometime in 2023, a mid-sized financial institution I worked with consolidated all its business units under a single AWS Organization to simplify cost governance. Eighteen months later, a Service Control Policy mistake applied at the root node silenced the production payments pipeline for 47 minutes. The postmortem revealed something most architects know intuitively but rarely document rigorously: Organizations topology is not a FinOps decision — it is a blast radius decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened: Context and Organizational Pressure
&lt;/h2&gt;

&lt;p&gt;The pressure came from above. The CFO wanted consolidated cost visibility, the CISO wanted a single policy enforcement point, and the platform team — already stretched thin — wanted to reduce the number of landing zone automation pipelines. The obvious solution seemed to be: one Organization, multiple OUs, hierarchical SCPs. The reasoning was defensible on a whiteboard.&lt;/p&gt;

&lt;p&gt;The problem started when the security team needed to block access to certain AWS regions for LGPD/GDPR compliance. The SCP was drafted with an &lt;code&gt;aws:RequestedRegion&lt;/code&gt; condition denying all regions outside &lt;code&gt;sa-east-1&lt;/code&gt; and &lt;code&gt;us-east-1&lt;/code&gt;. The test was run against a sandbox OU. Approval was granted. The deploy went to the root node — not the production OU, the &lt;strong&gt;root&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What nobody had explicitly mapped: the payments account used &lt;code&gt;us-east-2&lt;/code&gt; for a DynamoDB Global Tables endpoint serving as a low-latency read replica. The SCP blocked &lt;code&gt;dynamodb:GetItem&lt;/code&gt; and &lt;code&gt;dynamodb:Query&lt;/code&gt; calls originating from Lambda functions in that account to that region. The Python SDK (boto3) with exponential retry masked the error for roughly 8 minutes before the circuit breaker in the payment authorization service tripped. The alert arrived via a CloudWatch Alarm with a 5xx error threshold on API Gateway — but the runbook pointed to the authorization service, not the data layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident Timeline
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:00 — SCP applied at root node&lt;/strong&gt; — Security engineer runs &lt;code&gt;aws organizations attach-policy&lt;/code&gt; pointing to Root ID instead of the sandbox OU ID. No scope validation in the IaC pipeline (Terraform) blocked the operation — the policy already existed, only the target changed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:08 — First silent SDK errors&lt;/strong&gt; — boto3 with &lt;code&gt;max_attempts=5&lt;/code&gt; and exponential backoff begins absorbing &lt;code&gt;AccessDeniedException&lt;/code&gt; from DynamoDB calls in &lt;code&gt;us-east-2&lt;/code&gt;. The 15s Lambda timeout has not yet been reached. No active alarm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:11 — Circuit breaker trips in authorization service&lt;/strong&gt; — The payment authorization service, deployed on EKS with Resilience4j, opens the circuit after 10 consecutive failures in 30s. Transactions begin returning HTTP 503 to API Gateway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:14 — CloudWatch Alarm fires&lt;/strong&gt; — Alarm &lt;code&gt;PaymentAPI-5xxRate &amp;gt; 1%&lt;/code&gt; sends notification to SNS → PagerDuty. On-call engineer receives the alert. Initial runbook points to the EKS authorization service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:22 — Initial misdiagnosis&lt;/strong&gt; — Engineer checks EKS pods, Resilience4j logs, CPU/memory metrics. All normal. Escalates to tech lead. No CloudTrail correlation yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:31 — CloudTrail and Organizations correlation&lt;/strong&gt; — Tech lead queries CloudTrail Lake with an Athena query filtering &lt;code&gt;errorCode = AccessDeniedException&lt;/code&gt; in the last 60 minutes. Identifies pattern in DynamoDB calls from &lt;code&gt;us-east-2&lt;/code&gt;. Cross-references Organizations API and finds the root-level attach-policy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:38 — SCP detached from root node&lt;/strong&gt; — Senior security engineer runs &lt;code&gt;aws organizations detach-policy&lt;/code&gt;. Change propagation takes approximately 4 minutes across all affected accounts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T+00:47 — Service restored&lt;/strong&gt; — Circuit breaker closes after successful health checks. Error rate returns to &amp;lt; 0.1%. Incident closed. Total duration: 47 minutes of partial degradation, 36 minutes of complete unavailability for new payments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Root Cause: Unrestricted Blast Radius by Design:&lt;/strong&gt; The root cause was not the human error of targeting the root node — human errors are inevitable. The root cause was architectural: a single AWS Organization with no isolation boundary between regulated workloads (payments, PCI-DSS) and operational workloads (security, tooling). Any SCP applied at the Root affects &lt;strong&gt;all&lt;/strong&gt; accounts simultaneously, with no staging, no canary, no automatic rollback. The design turned a routine security policy change into a change with organization-level blast radius. In financial-grade environments, this is unacceptable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Topology: Single-Org Blast Radius vs. Multi-Org Isolation
&lt;/h2&gt;

&lt;p&gt;The diagram compares the pattern that caused the incident (left) with the remediated architecture (right). Red edges indicate unrestricted SCP propagation; green edges indicate isolation boundary with controlled promotion.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Padrão Problemático — Single Org / Problematic Pattern — Single Org
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Root Management Account (security)&lt;/li&gt;
&lt;li&gt;SCP: DenyRegion applied at Root (security)&lt;/li&gt;
&lt;li&gt;OU: Security Tooling Accounts (security)&lt;/li&gt;
&lt;li&gt;OU: Payments PCI-DSS Accounts (compute)&lt;/li&gt;
&lt;li&gt;DynamoDB us-east-2 replica (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟢 Padrão Remediado — Multi-Org com Isolamento / Remediated Pattern — Multi-Org
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Org: Operations Security &amp;amp; Tooling (security)&lt;/li&gt;
&lt;li&gt;Org: Payments PCI-DSS Boundary (compute)&lt;/li&gt;
&lt;li&gt;SCP: DenyRegion scoped to Ops Org (security)&lt;/li&gt;
&lt;li&gt;SCP: PCI Controls independent lifecycle (security)&lt;/li&gt;
&lt;li&gt;RAM / PrivateLink cross-org sharing (network)&lt;/li&gt;
&lt;li&gt;CloudTrail Lake centralized (delegated) (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;scp-bad -&amp;gt; root: attached at Root&lt;/li&gt;
&lt;li&gt;root -&amp;gt; ou-sec: unrestricted inheritance&lt;/li&gt;
&lt;li&gt;root -&amp;gt; ou-pay: full blast radius&lt;/li&gt;
&lt;li&gt;ou-pay -&amp;gt; ddb-replica: blocked by SCP&lt;/li&gt;
&lt;li&gt;scp-ops -&amp;gt; org-ops: isolated scope&lt;/li&gt;
&lt;li&gt;scp-pay -&amp;gt; org-pay: independent lifecycle&lt;/li&gt;
&lt;li&gt;org-ops -&amp;gt; ram-share: controlled sharing&lt;/li&gt;
&lt;li&gt;org-pay -&amp;gt; ram-share: access via PrivateLink&lt;/li&gt;
&lt;li&gt;org-ops -&amp;gt; ct-lake: delegated logs&lt;/li&gt;
&lt;li&gt;org-pay -&amp;gt; ct-lake: delegated logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why the Single-Org vs. Multi-Org Decision Is Fundamentally an Isolation Decision
&lt;/h2&gt;

&lt;p&gt;There is a dominant narrative that multiple Organizations increase operational complexity — and it is partially true. But it obscures what is actually being traded. In a single Organization, the root node and top-level OUs are attack surfaces for policy changes with instant propagation and no native progressive rollback mechanism. AWS Organizations has no concept of a "canary SCP deploy". When you apply an SCP at the Root, it takes effect immediately across all ~100, ~500, or ~2000 accounts under that root.&lt;/p&gt;

&lt;p&gt;In financial-grade environments with multiple regulatory regimes — PCI-DSS for payments, SOC 2 for data operations, BACEN 4.893 for cyber resilience in Brazil — the temptation to use a single Organization with specialized OUs is understandable. The operational reality is that PCI-DSS compliance controls require network and policy isolation that is cleaner with a real Organization boundary, not just an OU boundary.&lt;/p&gt;

&lt;p&gt;The Organization boundary provides: (1) SCP policies with completely independent lifecycles; (2) separate Management Account credentials, reducing the blast radius of high-privilege credential compromise; (3) consolidated billing still available via AWS Organizations trusted access and cross-account Cost and Usage Reports; (4) CloudTrail Lake with a delegated administrator that can aggregate logs from multiple Organizations into a single S3 data store with Athena, maintaining centralized visibility without policy coupling.&lt;/p&gt;

&lt;p&gt;The real cost of multiple Organizations is duplication of landing zone automation — and that cost is addressable with an Account Vending Machine based on Control Tower customizations and shared IaC pipelines via cross-account CodePipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Remediation: What We Changed After the Incident
&lt;/h2&gt;

&lt;p&gt;The remediation was not simply "move payments to a new Organization". That would have taken weeks and required re-onboarding dozens of accounts. The remediation was layered, with immediate impact first and structural refactoring afterward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immediate (week 1):&lt;/strong&gt; We implemented a guardrail SCP at the root node that explicitly denies &lt;code&gt;organizations:AttachPolicy&lt;/code&gt; for any target that is the Root ID (&lt;code&gt;r-xxxx&lt;/code&gt;) or any tier-1 OU containing production workloads. The condition uses &lt;code&gt;aws:ResourceTag&lt;/code&gt; combined with an &lt;code&gt;Environment=Production&lt;/code&gt; tag applied to OUs via Organizations tag policy. This does not resolve the structural problem, but adds a protection layer against the specific error that occurred. The Terraform pipeline was updated with a &lt;code&gt;precondition&lt;/code&gt; that validates the target type before any &lt;code&gt;attach-policy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Medium term (months 2-3):&lt;/strong&gt; We created a second AWS Organization for PCI-DSS workloads. The Management Account of the new org uses MFA with a hardware token and access restricted to two senior engineers. SCPs in the new org are managed by a separate pipeline with mandatory two-reviewer approval via pull request. CloudTrail Lake was configured with a cross-organization Event Data Store using the &lt;code&gt;organizationEnabled&lt;/code&gt; + delegated administrator feature, aggregating events from both orgs into a single Athena repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability:&lt;/strong&gt; We added an EventBridge rule in each org's Management Account that captures &lt;code&gt;organizations.amazonaws.com&lt;/code&gt; events of type &lt;code&gt;AttachPolicy&lt;/code&gt; and &lt;code&gt;DetachPolicy&lt;/code&gt; and publishes to an SNS topic with subscriptions to the security Slack channel and PagerDuty. MTTD for this type of change dropped from ~31 minutes (the time it took to correlate during the incident) to under 2 minutes in post-implementation validation tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-Org vs. Multi-Org: Real Trade-offs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Single Organization&lt;/th&gt;
&lt;th&gt;Multiple Organizations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SCP blast radius&lt;/td&gt;
&lt;td&gt;Root affects 100% of accounts instantly&lt;/td&gt;
&lt;td&gt;Isolated by Organization boundary; changes are independent&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational complexity&lt;/td&gt;
&lt;td&gt;Lower: single landing zone pipeline, single Control Tower&lt;/td&gt;
&lt;td&gt;Higher: multiple pipelines, multiple Control Tower enrollments&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost visibility&lt;/td&gt;
&lt;td&gt;Native via Consolidated Billing&lt;/td&gt;
&lt;td&gt;Requires cross-account CUR + Athena or AWS Cost Explorer linked accounts&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory isolation (PCI, SOC2)&lt;/td&gt;
&lt;td&gt;Possible via OUs, but policy boundary is logical, not physical&lt;/td&gt;
&lt;td&gt;Physical boundary between orgs; auditors accept more readily&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Management Account compromise&lt;/td&gt;
&lt;td&gt;One compromised account = potential access to entire organization&lt;/td&gt;
&lt;td&gt;Blast radius limited to specific org; other orgs unaffected&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SCP propagation latency&lt;/td&gt;
&lt;td&gt;Seconds to a few minutes across all accounts&lt;/td&gt;
&lt;td&gt;Same behavior within each org; orgs are independent&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Real Problem with SCPs: No Staging, No Automatic Rollback
&lt;/h2&gt;

&lt;p&gt;One of the most important findings from the postmortem was that the team had treated SCPs like ordinary infrastructure code — with the same deployment pipeline as a security group or IAM role. This is a mental model error.&lt;/p&gt;

&lt;p&gt;SCPs are access control policies with immediate propagation and no native progressive rollback mechanism. There is no &lt;code&gt;aws organizations deploy-policy --canary 10%&lt;/code&gt;. When you &lt;code&gt;attach-policy&lt;/code&gt; to an OU with 200 accounts, all 200 accounts are affected simultaneously. AWS Organizations has no concept of deployment rings for policies.&lt;/p&gt;

&lt;p&gt;The practical implication is that the SCP change process must be treated like a production database-level change — with a maintenance window, dual approval, and a tested rollback plan. The rollback plan for an SCP is simple: &lt;code&gt;detach-policy&lt;/code&gt;. But if you do not know what policy was in place before, or if the change was composed of multiple operations, rollback may not be trivial.&lt;/p&gt;

&lt;p&gt;What we implemented: an immutable state registry of SCPs per OU/Root, stored in S3 with versioning enabled and Object Lock in COMPLIANCE mode for 90 days. Before any &lt;code&gt;attach-policy&lt;/code&gt;, the pipeline saves the current state. Automated rollback is a Lambda that reads the previous state from S3 and executes the inverse operations. Automated rollback time in tests was 45 seconds — compared to the 7 minutes it took to identify and manually execute during the actual incident.&lt;/p&gt;

&lt;p&gt;A frequently overlooked detail: SCPs with explicit &lt;code&gt;Deny&lt;/code&gt; take precedence over any &lt;code&gt;Allow&lt;/code&gt; in identity policies, including IAM Role policies with &lt;code&gt;AdministratorAccess&lt;/code&gt;. This means that not even the account root user (unless explicitly excluded via &lt;code&gt;aws:PrincipalType: Root&lt;/code&gt;) can execute actions blocked by an SCP. In our incident, this is what made the situation so severe — there was no escape hatch in the payments account.&lt;/p&gt;

&lt;h2&gt;
  
  
  FinOps in Multi-Org: The Argument That Overcomes Resistance
&lt;/h2&gt;

&lt;p&gt;The most common argument against multiple Organizations is the loss of consolidated cost visibility. That argument was valid in 2018. In 2024, it is a solved problem — with some important caveats.&lt;/p&gt;

&lt;p&gt;AWS Cost and Usage Report (CUR 2.0) can be configured for delivery to a centralized S3 bucket in a dedicated billing account, even across multiple Organizations, using a cross-account S3 bucket policy pattern with &lt;code&gt;s3:PutObject&lt;/code&gt; allowed for the &lt;code&gt;billingreports.amazonaws.com&lt;/code&gt; service principal from multiple Management Account IDs. Athena + AWS Glue Crawler over this data produces a unified cost view that the CFO can consume via QuickSight with row-level security per business unit.&lt;/p&gt;

&lt;p&gt;What is not natively solved: Reserved Instances and Savings Plans are not shared across Organizations. This is a real cost. In our analysis, the payments account used approximately $18k/month in Compute Savings Plans that, upon moving to a new Organization, could no longer be shared with tooling accounts in the original org. The solution was to consolidate Savings Plans in the new payments org and use On-Demand for tooling workloads with more variable usage — the cost delta was approximately $1.2k/month, which was accepted as the cost of regulatory isolation.&lt;/p&gt;

&lt;p&gt;A pattern I recommend: use AWS Cost Categories with rules based on &lt;code&gt;CostCenter&lt;/code&gt; and &lt;code&gt;BusinessUnit&lt;/code&gt; tags applied via tag policies in both Organizations. This allows financial reporting to be agnostic to Organizations topology — the CFO sees by cost center, not by org.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Well-Architected: Affected Pillars
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt;: SCPs must have a change management lifecycle equivalent to production database changes. Use &lt;code&gt;aws:PrincipalType: Root&lt;/code&gt; as an explicit escape hatch in critical SCPs. Implement EventBridge + SNS for immediate detection of policy attach/detach in the Management Account. Consider Organization boundary as physical isolation for PCI-DSS and BACEN 4.893 regimes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reliability&lt;/strong&gt;: Organizations design must minimize the blast radius of operational changes. Circuit breakers in downstream services (EKS/Resilience4j, Lambda with Dead Letter Queue) are necessary but insufficient — they mask the error without resolving the cause. Add specific health checks for &lt;code&gt;AccessDeniedException&lt;/code&gt; with low-latency alarms (&amp;lt; 5 minutes MTTD). Implement automated SCP rollback with immutable state in S3 Object Lock.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Anti-Patterns That Lead to the Incident
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Treating SCPs as ordinary infrastructure in the CI/CD pipeline without differentiated approval by target level (Root, production OU, sandbox OU)&lt;/li&gt;
&lt;li&gt;Using a single Organization to consolidate cost governance without evaluating the blast radius of security policies on regulated workloads&lt;/li&gt;
&lt;li&gt;Configuring 5xx alarms on API Gateway as the only detection signal without specific alarms for &lt;code&gt;AccessDeniedException&lt;/code&gt; in CloudTrail&lt;/li&gt;
&lt;li&gt;Assuming that circuit breakers in downstream services substitute for architectural isolation — they are complementary, not equivalent&lt;/li&gt;
&lt;li&gt;Failing to map cross-region dependencies of accounts before applying SCPs with &lt;code&gt;aws:RequestedRegion&lt;/code&gt; conditions&lt;/li&gt;
&lt;li&gt;Relying on OUs as regulatory isolation boundaries for PCI-DSS auditors without explicitly documenting that the boundary is logical, not physical&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Architect's Note:&lt;/strong&gt; After this incident, I started recommending a simple rule: if you have workloads with distinct regulatory regimes (PCI-DSS, SOC 2, BACEN) or with availability SLOs above 99.9%, they belong in separate Organizations — not separate OUs. The additional operational cost of multiple Organizations is real, but it is a predictable and manageable engineering cost; the cost of a 47-minute incident on a payments pipeline is not. The hardest lesson was realizing we treated SCPs as code when we should have treated them like production database schema changes: with staging, dual approval, tested rollback, and a maintenance window. That is not bureaucracy — it is reliability engineering applied to the control plane.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: When to Use One or Multiple Organizations
&lt;/h2&gt;

&lt;p&gt;Use a single AWS Organization when: all your workloads share the same regulatory regime, the same availability SLO, and the platform team has capacity to implement rigorous guardrails in the SCP change pipeline. Use multiple Organizations when: you have distinct regulatory regimes (especially PCI-DSS or BACEN 4.893), SLOs above 99.9% on critical workloads, or when auditors require evidence of physical policy isolation. The cost of non-shared Savings Plans is quantifiable and generally lower than the cost of a single incident caused by unrestricted blast radius. The decision is not about operational simplicity — it is about where you accept that inevitable human error has consequences.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html" rel="noopener noreferrer"&gt;AWS Organizations — Service Control Policies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html" rel="noopener noreferrer"&gt;AWS CloudTrail Lake — Cross-Organization Event Data Stores&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/controltower/latest/userguide/customize-landing-zone.html" rel="noopener noreferrer"&gt;AWS Control Tower — Customizations for Landing Zone&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html" rel="noopener noreferrer"&gt;AWS Cost and Usage Report 2.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework — Security Pillar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/" rel="noopener noreferrer"&gt;AWS Architecture Blog — Single versus multiple AWS Organizations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bcb.gov.br/estabilidadefinanceira/exibenormativo?tipo=Resolu%C3%A7%C3%A3o%20BCB&amp;amp;numero=85" rel="noopener noreferrer"&gt;BACEN Resolução 4.893/2021 — Política de Segurança Cibernética&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pcisecuritystandards.org/document_library/" rel="noopener noreferrer"&gt;PCI DSS v4.0 — Requirement 1: Network Security Controls&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/aws-organizations-single-ou-multi-org-com-governanca" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>awscloud</category>
      <category>awsorganizations</category>
      <category>landingzone</category>
      <category>governance</category>
    </item>
    <item>
      <title>ADR: Adopting Amazon Bedrock AgentCore in Production</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:50:01 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/adr-adopting-amazon-bedrock-agentcore-in-production-12ll</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/adr-adopting-amazon-bedrock-agentcore-in-production-12ll</guid>
      <description>&lt;p&gt;After 16 years building financial platforms on AWS, I've learned that the most dangerous question in architecture isn't 'does this work?' — it's 'who operates this at 2 AM when it breaks?' Bedrock AgentCore is AWS's answer to the problem of operationalizing AI agents beyond the notebook: managed runtime, memory, tool-use, guardrails, and traceability in a single control plane. This ADR documents how I arrived at the decision to adopt it — or not — in a regulated financial environment, and the consequences you need to internalize before doing the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context and Forces
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Context and Forces
&lt;/h2&gt;

&lt;p&gt;The scenario that motivated this decision is recurring in financial institutions: a product team wants to expose an AI agent to internal analysts — capable of querying market data via API, running risk calculations in Lambda, retrieving context from regulatory documents via RAG, and recording every action in an immutable audit trail. The MVP worked in two sprints with LangChain + Claude via Bedrock. The problem surfaced the following week.&lt;/p&gt;

&lt;p&gt;Five forces made the decision urgent: &lt;strong&gt;(1) Cross-turn state management&lt;/strong&gt; — financial agent sessions last minutes, not seconds; reliably maintaining context in stateless Lambda is brittle. &lt;strong&gt;(2) Regulatory traceability&lt;/strong&gt; — every tool call, every model decision, every response must be auditable with timestamp, identity, and full payload, without relying on ad-hoc logging. &lt;strong&gt;(3) Guardrails as contract&lt;/strong&gt; — in finance, the agent cannot leak PII, cannot recommend products without disclaimers, cannot execute irreversible actions without human confirmation. Implementing this manually in every agent is guaranteed technical debt. &lt;strong&gt;(4) Unpredictable token cost&lt;/strong&gt; — without per-session budget control, a faulty agent loop can consume tens of dollars in minutes. &lt;strong&gt;(5) Runtime portability&lt;/strong&gt; — the platform team doesn't want to maintain a custom agent scheduler; they want an SLA contract with AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Options Considered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option A: Self-hosted LangChain/LangGraph on EKS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full control over execution graph and retry logic&lt;/li&gt;
&lt;li&gt;Model portability — swap LLM without platform change&lt;/li&gt;
&lt;li&gt;Mature ecosystem of community integrations and tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full operational responsibility: scaling, HA, patching, observability&lt;/li&gt;
&lt;li&gt;Guardrails and audit trail must be built and maintained by the team&lt;/li&gt;
&lt;li&gt;Session memory management requires custom DynamoDB or Redis&lt;/li&gt;
&lt;li&gt;High engineering cost to reach parity with managed features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Suitable for teams with mature AI platform; high operational risk for smaller teams&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B: Bedrock Agents (prior generation, without AgentCore)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS-managed, no runtime infrastructure to operate&lt;/li&gt;
&lt;li&gt;Native integration with Knowledge Bases and Action Groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited observability: partial traces, no native span-level detail&lt;/li&gt;
&lt;li&gt;No native per-session budget control&lt;/li&gt;
&lt;li&gt;Agent loop customization restricted to what AWS exposes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Good for simple cases; observability limitations are blockers in finance&lt;/p&gt;

&lt;h3&gt;
  
  
  Option C: Amazon Bedrock AgentCore
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed runtime with native persistent session memory (AgentCore Memory)&lt;/li&gt;
&lt;li&gt;Configurable guardrails as declarative policy, not inline code&lt;/li&gt;
&lt;li&gt;Native traceability via CloudTrail + X-Ray with tool-call spans&lt;/li&gt;
&lt;li&gt;AgentCore Gateway for tool-use with OAuth2/OIDC and per-tool throttling&lt;/li&gt;
&lt;li&gt;Configurable per-session token budget control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform lock-in to AWS for the agent runtime&lt;/li&gt;
&lt;li&gt;Execution graph customization more restricted than LangGraph&lt;/li&gt;
&lt;li&gt;New service: API surface still evolving, conservative quotas&lt;/li&gt;
&lt;li&gt;AgentCore Memory and Gateway costs added on top of inference cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Recommended decision for regulated financial environments with a lean platform team&lt;/p&gt;

&lt;h3&gt;
  
  
  Option D: Step Functions + Lambda as agent orchestrator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native audit via Step Functions execution history&lt;/li&gt;
&lt;li&gt;Declarative and testable retry, timeout, and error handling&lt;/li&gt;
&lt;li&gt;No new service to learn — team already knows the pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not an agent runtime: each 'turn' requires a new execution or .waitForTaskToken&lt;/li&gt;
&lt;li&gt;Session memory and model context must be managed externally&lt;/li&gt;
&lt;li&gt;Cold-start and state transition latency can be noticeable in dialogues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Excellent for deterministic workflows; inadequate as a conversational agent runtime&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision and the Reasoning Behind It
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Decision and the Reasoning Behind It
&lt;/h2&gt;

&lt;p&gt;The decision was to adopt &lt;strong&gt;Bedrock AgentCore&lt;/strong&gt; as the primary agent runtime, with Step Functions as the orchestrator for adjacent deterministic workflows (approvals, reconciliations, notifications). This is not an all-or-nothing decision: AgentCore solves the &lt;em&gt;non-deterministic agent loop&lt;/em&gt; problem, while Step Functions remains the right choice for the &lt;em&gt;deterministic business process&lt;/em&gt; that wraps the agent.&lt;/p&gt;

&lt;p&gt;The decisive argument was the &lt;strong&gt;AgentCore Gateway&lt;/strong&gt; with per-tool OAuth2/OIDC support. In a financial environment, every tool-call is an action with identity: who authorized it, what scope, with which token. Implementing this manually in LangChain would mean building and maintaining an authorization proxy — exactly the kind of infrastructure that generates no business value but generates security incidents when neglected. The Gateway delivers this as declarative configuration, with per-tool throttling (e.g., maximum 10 calls/session for the order execution API) and a native circuit breaker.&lt;/p&gt;

&lt;p&gt;The second argument was &lt;strong&gt;session memory with configurable TTL&lt;/strong&gt;. AgentCore Memory persists conversation context in a managed store, with per-session configurable TTL and KMS customer-managed key (CMK) encryption. For LGPD/GDPR compliance, this means I can configure a 24h TTL for analyst sessions and guarantee that no session data persists beyond what's necessary — without building a custom expiration pipeline.&lt;/p&gt;

&lt;p&gt;The lock-in trade-off was consciously accepted: the tool-use layer (the Lambda functions that execute the actual actions) remains completely portable. If we need to migrate the runtime in the future, the tools keep working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Financial Agent Architecture with Bedrock AgentCore
&lt;/h2&gt;

&lt;p&gt;Execution flow of a financial analysis agent: from analyst to AgentCore runtime, through guardrails, tool-use via Gateway, session memory, and observability&lt;/p&gt;

&lt;h3&gt;
  
  
  🔐 AWS — Segurança &amp;amp; Entrada
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway REST + Cognito JWT (edge)&lt;/li&gt;
&lt;li&gt;Bedrock Guardrails PII filter + topic deny (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 AWS — AgentCore Runtime
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AgentCore Runtime Claude 3.5 Sonnet (ai)&lt;/li&gt;
&lt;li&gt;AgentCore Memory TTL=24h, KMS CMK (storage)&lt;/li&gt;
&lt;li&gt;AgentCore Gateway OAuth2/OIDC, throttle (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚙️ AWS — Ferramentas (Tool-use)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lambda: Market Data Bloomberg API proxy (compute)&lt;/li&gt;
&lt;li&gt;Lambda: Risk Calc VaR engine (compute)&lt;/li&gt;
&lt;li&gt;Knowledge Base OpenSearch + S3 (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📊 AWS — Observabilidade &amp;amp; Auditoria
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;X-Ray span por tool-call (external)&lt;/li&gt;
&lt;li&gt;CloudTrail API audit log (storage)&lt;/li&gt;
&lt;li&gt;CloudWatch SLO dashboards (external)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;analyst -&amp;gt; apigw: HTTPS + JWT&lt;/li&gt;
&lt;li&gt;apigw -&amp;gt; guardrails: input validation&lt;/li&gt;
&lt;li&gt;guardrails -&amp;gt; agentcore: sanitized prompt&lt;/li&gt;
&lt;li&gt;agentcore -&amp;gt; memory: read/write context&lt;/li&gt;
&lt;li&gt;agentcore -&amp;gt; gateway: tool invocation&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; lambda_market: OAuth2 token&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; lambda_risk: OAuth2 token&lt;/li&gt;
&lt;li&gt;gateway -&amp;gt; kb: RAG query&lt;/li&gt;
&lt;li&gt;agentcore -&amp;gt; guardrails: output filter&lt;/li&gt;
&lt;li&gt;agentcore -&amp;gt; xray: traces&lt;/li&gt;
&lt;li&gt;apigw -&amp;gt; cloudtrail: API events&lt;/li&gt;
&lt;li&gt;xray -&amp;gt; cw: SLO metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Concrete Configuration: What Actually Matters
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Concrete Configuration: What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Adopting AgentCore without properly configuring operational controls is worse than not adopting it — you gain a false sense of security without active guardrails. Here are the configurations that make a real difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails as first line:&lt;/strong&gt; Configure &lt;code&gt;contentPolicyConfig&lt;/code&gt; with &lt;code&gt;HATE&lt;/code&gt;, &lt;code&gt;INSULTS&lt;/code&gt;, &lt;code&gt;SEXUAL&lt;/code&gt;, &lt;code&gt;VIOLENCE&lt;/code&gt; all set to &lt;code&gt;BLOCK&lt;/code&gt;, and &lt;code&gt;sensitiveInformationPolicyConfig&lt;/code&gt; with PII filters for &lt;code&gt;CREDIT_DEBIT_CARD_NUMBER&lt;/code&gt;, &lt;code&gt;AWS_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;NAME&lt;/code&gt;, and &lt;code&gt;EMAIL&lt;/code&gt; in &lt;code&gt;ANONYMIZE&lt;/code&gt; mode. In a financial environment, add &lt;code&gt;topicPolicyConfig&lt;/code&gt; with explicitly denied topics: &lt;code&gt;"investment advice without disclaimer"&lt;/code&gt;, &lt;code&gt;"guaranteed returns"&lt;/code&gt;. This isn't paranoia — it's the minimum to pass a compliance review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Memory with correct partitioning:&lt;/strong&gt; The memory partition key must be &lt;code&gt;userId + sessionId&lt;/code&gt;, never just &lt;code&gt;sessionId&lt;/code&gt;. In multi-tenant environments, sessions from different users with the same &lt;code&gt;sessionId&lt;/code&gt; collided in testing — a silent bug that leaks context between users. Configure &lt;code&gt;memoryConfiguration.enabledMemoryTypes&lt;/code&gt; with &lt;code&gt;SESSION_SUMMARY&lt;/code&gt; for long sessions, reducing context token consumption by up to 40% in sessions exceeding 20 turns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway with per-tool throttling:&lt;/strong&gt; Define separate &lt;code&gt;rateLimit&lt;/code&gt; for each Action Group. The order execution API should have &lt;code&gt;maxRequestsPerSession: 5&lt;/code&gt; and &lt;code&gt;requireConfirmation: ENABLED&lt;/code&gt;. The market data query API can have &lt;code&gt;maxRequestsPerSession: 50&lt;/code&gt;. Without this granularity, a faulty agent loop can execute dozens of orders before being detected — a scenario I've seen happen in production with frameworks lacking tool-use controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-session token budget:&lt;/strong&gt; Configure &lt;code&gt;sessionConfiguration.maxTokens&lt;/code&gt; with a conservative initial value — I recommend 50,000 tokens for typical analysis sessions. Monitor the p95 token consumption per session in CloudWatch and adjust. An agent entering a reasoning loop can consume 200k+ tokens in a single session without this control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: What to Measure and How
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Observability: What to Measure and How
&lt;/h2&gt;

&lt;p&gt;AI agents have a different observability profile from traditional APIs. p99 latency is less useful than &lt;em&gt;turns-per-session distribution&lt;/em&gt; and &lt;em&gt;tool-call failure rate per tool&lt;/em&gt;. Here is the observability model I implemented:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent business metrics&lt;/strong&gt; (via CloudWatch custom metrics with namespace &lt;code&gt;FinancialAgent&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TurnsPerSession&lt;/code&gt; — histogram; alert if p95 &amp;gt; 15 turns (indicates loop or poorly calibrated prompt)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TokensPerSession&lt;/code&gt; — histogram; alert if p95 &amp;gt; 40k tokens&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ToolCallFailureRate&lt;/code&gt; per &lt;code&gt;ToolName&lt;/code&gt; — counter; SLO of &amp;lt; 1% failure for critical tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GuardrailInterventionRate&lt;/code&gt; — counter; spike indicates jailbreak attempt or prompt injection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traces with X-Ray:&lt;/strong&gt; AgentCore emits spans for each tool invocation with attributes &lt;code&gt;bedrock.agent.toolName&lt;/code&gt;, &lt;code&gt;bedrock.agent.sessionId&lt;/code&gt;, and &lt;code&gt;bedrock.agent.turnCount&lt;/code&gt;. Configure a trace group with filter &lt;code&gt;annotation.bedrock.agent.toolName = "ExecuteOrder"&lt;/code&gt; and alert on latency &amp;gt; 2s — order execution above that indicates a downstream API issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudTrail for regulatory audit:&lt;/strong&gt; Each &lt;code&gt;InvokeAgent&lt;/code&gt; API call is recorded with the caller ARN, &lt;code&gt;sessionId&lt;/code&gt;, and &lt;code&gt;inputText&lt;/code&gt; (truncated). For compliance, configure an S3 bucket with Object Lock in COMPLIANCE mode and 7-year retention for AgentCore CloudTrail logs. This is the minimum to meet Banco Central do Brasil and SEC audit requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost anomaly alarm:&lt;/strong&gt; Configure an AWS Budget with an alert at 80% of the monthly Bedrock budget, with an SNS action. Add a second CloudWatch alarm on &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; with &lt;code&gt;model-id=anthropic.claude-3-5-sonnet&lt;/code&gt; and a threshold of 1,000 invocations/hour — above that, something is wrong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Consequences and Risks You Need to Accept:&lt;/strong&gt; &lt;strong&gt;Runtime lock-in is real:&lt;/strong&gt; If AWS deprecates or significantly changes the AgentCore API, migration requires rewriting the orchestration logic — not just the tools. Mitigate by keeping tools (Lambda) completely runtime-agnostic and documenting the interface contract in a separate ADR.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Conservative quotas on a new service:&lt;/strong&gt; AgentCore has &lt;code&gt;concurrent agent sessions&lt;/code&gt; quotas that, at launch, were significantly lower than traditional Bedrock Agents quotas. Request quota increases &lt;em&gt;before&lt;/em&gt; go-live, not after. A peak event without adequate quota results in &lt;code&gt;ThrottlingException&lt;/code&gt; that the end client sees as a timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails have latency:&lt;/strong&gt; Each pass through Guardrails adds 100-300ms of latency. In an agent with 10 turns, that's up to 3 additional seconds of accumulated latency. For use cases where latency is critical, consider disabling output guardrails on internal tools (not exposed to the end user) and applying them only on the final output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory is not free:&lt;/strong&gt; AgentCore Memory charges for storage and per read/write operation. In long sessions with &lt;code&gt;SESSION_SUMMARY&lt;/code&gt; active, memory cost can exceed inference cost for short sessions. Monitor &lt;code&gt;MemoryReadLatency&lt;/code&gt; and &lt;code&gt;MemoryWriteLatency&lt;/code&gt; — above 200ms indicates pressure on the managed store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop is not automatic:&lt;/strong&gt; &lt;code&gt;requireConfirmation: ENABLED&lt;/code&gt; on the Gateway pauses execution and waits for confirmation via callback. If the client doesn't respond within &lt;code&gt;confirmationTimeout&lt;/code&gt; (default: 300s), the session expires. Design the UX to make this clear to the user — timeout-expired sessions are the leading cause of complaints in financial agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Reference Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~40%&lt;/strong&gt; — Token reduction with SESSION_SUMMARY. In sessions with more than 20 turns, SESSION_SUMMARY reduces context sent to the model by ~40% vs. full history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100-300ms&lt;/strong&gt; — Latency added by Guardrails per turn. Each Guardrail evaluation (input + output) adds 100-300ms; across 10 turns, up to 3s accumulated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 anos&lt;/strong&gt; — Minimum CloudTrail retention for financial compliance. S3 Object Lock in COMPLIANCE mode with 7 years meets Banco Central do Brasil and SEC requirements for agent audit trails&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Well-Architected Assessment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt;: Declarative guardrails with PII filter and denied topics; AgentCore Gateway with per-tool OAuth2/OIDC; KMS CMK for session memory; CloudTrail with S3 Object Lock for immutable audit. IAM with &lt;code&gt;bedrock:AgentArnLike&lt;/code&gt; condition to restrict which agents can invoke which tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reliability&lt;/strong&gt;: Automatic retry with jitter in the Bedrock SDK (max_attempts=3, mode=adaptive); native circuit breaker in AgentCore Gateway per tool; configurable session timeout prevents zombie sessions; concurrent session quotas must be requested before go-live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;performance&lt;/strong&gt;: SESSION_SUMMARY reduces context tokens by ~40% for long sessions; disabling output guardrails on internal tools reduces accumulated latency; Knowledge Base with OpenSearch k-NN with HNSW and ef_search=512 for low-latency RAG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cost&lt;/strong&gt;: AWS Budget with alert at 80% of monthly limit; CloudWatch alarm on invocations/hour per model-id; SESSION_SUMMARY reduces inference cost in long sessions; monitor AgentCore Memory cost separately from inference cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the AWS Blog Doesn't Tell You
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What the AWS Blog Doesn't Tell You
&lt;/h2&gt;

&lt;p&gt;AWS service launch blogs are excellent at showing the happy path. What they rarely cover are the edge cases you only discover in production. Here are the three that cost me the most time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-call idempotency is not guaranteed by the runtime.&lt;/strong&gt; If AgentCore attempts to invoke a tool and receives a timeout, it may retry — and your Lambda may be invoked twice for the same action. For idempotent tools (queries), this is harmless. For tools with side effects (order execution, email sending), you need to implement idempotency in the Lambda using an &lt;code&gt;idempotencyToken&lt;/code&gt; derived from &lt;code&gt;sessionId + turnId + toolName&lt;/code&gt;. Without this, order duplication is a matter of when, not if.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model can ignore &lt;code&gt;requireConfirmation&lt;/code&gt; in certain prompt formulations.&lt;/strong&gt; I tested this: if the system prompt instructs the agent to "be proactive and execute actions without asking for unnecessary confirmation," the model may rationalize that a specific action doesn't need confirmation even with the flag active. The correct defense is dual: the flag on the Gateway &lt;em&gt;and&lt;/em&gt; an explicit instruction in the system prompt about when confirmation is mandatory. Never rely on a single layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore doesn't have native multi-agent support yet.&lt;/strong&gt; If your architecture requires a supervisor agent delegating to specialized agents (multi-agent orchestration pattern), you'll need to implement the delegation logic manually — typically with an agent that calls other agents via tool-use, where each "tool" is actually an invocation of another AgentCore. It works, but cross-session traceability requires manual &lt;code&gt;sessionId&lt;/code&gt; correlation via X-Ray.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Patterns I've Seen in Architecture Reviews
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Using AgentCore without configuring Guardrails because "it's an internal environment" — insiders are the primary source of compliance incidents in finance&lt;/li&gt;
&lt;li&gt;Storing full session history in memory without SESSION_SUMMARY — token cost grows linearly with number of turns&lt;/li&gt;
&lt;li&gt;Implementing critical business logic inside the agent system prompt instead of in testable tools — prompts don't have unit tests&lt;/li&gt;
&lt;li&gt;Not requesting concurrent session quota increase before go-live — ThrottlingException during peak usage is predictable and preventable&lt;/li&gt;
&lt;li&gt;Assuming AgentCore Gateway replaces a business authorization layer — the Gateway controls access to the tool, not the authorization logic inside the tool&lt;/li&gt;
&lt;li&gt;Not implementing idempotency in tool Lambdas with side effects — runtime retries can duplicate irreversible actions&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Curator's Note:&lt;/strong&gt; In practice, what convinced me to adopt AgentCore was not any individual feature — it was the fact that the &lt;strong&gt;AgentCore Gateway with per-tool OAuth2/OIDC&lt;/strong&gt; solves the tool-call identity problem I was about to build manually, which would have taken two sprints and generated permanent technical debt. The hard-won lesson behind this: in financial environments, the cost of building custom security controls is not the initial development cost — it's the cost of maintaining, auditing, and fixing those controls over years. When a managed service delivers the control as declarative configuration, the decision to adopt it is rarely about feature parity; it's about where you want to allocate your team's engineering attention. My recommendation: adopt AgentCore for new production agents, keep tools portable, and invest the saved time in observability and adversarial prompting tests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: Adopt with Explicit Controls
&lt;/h2&gt;

&lt;p&gt;Bedrock AgentCore is the right choice for financial teams that need to put AI agents into production without building and maintaining a custom orchestration runtime. The decision is not binary — it's about recognizing that AgentCore's value lies in the operational controls (Gateway, Guardrails, Memory with CMK), not just the execution runtime. The condition for adoption is clear: configure Guardrails before any testing with real data, implement idempotency in all tools with side effects, request concurrent session quota increases before go-live, and monitor TurnsPerSession and TokensPerSession as first-class SLO metrics. Lock-in is real but manageable if tools are kept portable. For teams that lack the capacity to build and operate a custom agent runtime — which is most teams — AgentCore is the correct architectural decision in 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents-core.html" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore — Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html" rel="noopener noreferrer"&gt;Amazon Bedrock Guardrails — Configuration Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents-core-memory.html" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Memory — Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/" rel="noopener noreferrer"&gt;Building AI agents with Amazon Bedrock AgentCore — AWS ML Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework — Machine Learning Lens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.powertools.aws.dev/lambda/python/latest/utilities/idempotency/" rel="noopener noreferrer"&gt;Idempotency for AWS Lambda — Powertools for AWS Lambda&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions" rel="noopener noreferrer"&gt;Architecture Decision Records — Michael Nygard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html" rel="noopener noreferrer"&gt;Amazon OpenSearch Service — k-NN Search with HNSW&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/agentes-producao-bedrock-agentcore-risco-operacional" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>bedrock</category>
      <category>agentcore</category>
      <category>adr</category>
    </item>
    <item>
      <title>CloudWatch to OTel: Tearing Down the Observability Bridge Pattern</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:49:18 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/cloudwatch-to-otel-tearing-down-the-observability-bridge-pattern-1hpm</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/cloudwatch-to-otel-tearing-down-the-observability-bridge-pattern-1hpm</guid>
      <description>&lt;p&gt;In financial environments where platform teams need to consolidate signals from dozens of AWS workloads into a unified observability backend — whether Datadog, Grafana Cloud, Honeycomb, or an internal stack — the CloudWatch → OpenTelemetry bridge pattern appears as the obvious solution. But 'obvious' and 'correct' rarely coincide in architecture. This pattern has a specific anatomy, a narrow validity envelope, and failure modes that only surface in production under real load. I'm going to dissect every layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem: Observability Fragmentation in AWS-Native Environments
&lt;/h2&gt;

&lt;p&gt;Every organization that grows beyond two or three engineering teams faces the same tension: AWS services emit metrics natively to CloudWatch — Lambda, RDS, EKS, API Gateway, MSK — but the corporate observability backend speaks OTLP. The result is a split world: SREs need to open two consoles to correlate an incident, alerts live in different namespaces, and business dashboards become impossible to build without manual ETL.&lt;/p&gt;

&lt;p&gt;The bridge pattern exists to solve exactly this. The core idea is simple: a Lambda function (or an OTel collector running on ECS/EKS) subscribes to CloudWatch metric streams via &lt;strong&gt;CloudWatch Metric Streams&lt;/strong&gt; or polls the &lt;code&gt;GetMetricData&lt;/code&gt; API, transforms the payload to OTLP format, and forwards it to a collector endpoint. In theory, you get a single observability control plane. In practice, the complexity hides in the details.&lt;/p&gt;

&lt;p&gt;What makes this problem especially treacherous in financial environments is the combination of three factors: (1) metric volume — a mid-size AWS account with EKS, Multi-AZ RDS, and Lambda can easily generate 50,000+ metric series per minute; (2) business latency — anomaly detection SLOs require freshness of 60 seconds or less; (3) API cost — each &lt;code&gt;GetMetricData&lt;/code&gt; call has a direct cost and a quota of 50 metrics per request, meaning naive polling at scale breaks both the budget and AWS rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of the CloudWatch → OTel Bridge Pattern
&lt;/h2&gt;

&lt;p&gt;Complete flow from metric emission by AWS services to the external observability backend, through the two ingestion paths (Metric Streams and polling) and the security and cost guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  🟧 AWS — Metric Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lambda Invocations/Errors/Duration (compute)&lt;/li&gt;
&lt;li&gt;EKS / EC2 CPU, Memory, Network (compute)&lt;/li&gt;
&lt;li&gt;RDS / Aurora DBConnections, Latency (data)&lt;/li&gt;
&lt;li&gt;API Gateway 4xx/5xx, Latency (network)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟦 AWS — Ingestion Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Namespaces (storage)&lt;/li&gt;
&lt;li&gt;CW Metric Streams ~2-3s latency, Firehose (messaging)&lt;/li&gt;
&lt;li&gt;Kinesis Firehose JSON/OTel0.7 format (messaging)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟨 AWS — Transform &amp;amp; Forward
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bridge Lambda OTLP transform + retry (compute)&lt;/li&gt;
&lt;li&gt;SQS DLQ failed batches (messaging)&lt;/li&gt;
&lt;li&gt;KMS CMK payload encryption (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔵 External — Observability Backend
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OTel Collector OTLP/gRPC :4317 (external)&lt;/li&gt;
&lt;li&gt;Datadog / Grafana / Honeycomb (external)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;lambda-src -&amp;gt; cw-ns: emits metrics&lt;/li&gt;
&lt;li&gt;eks-src -&amp;gt; cw-ns: emits metrics&lt;/li&gt;
&lt;li&gt;rds-src -&amp;gt; cw-ns: emits metrics&lt;/li&gt;
&lt;li&gt;apigw-src -&amp;gt; cw-ns: emits metrics&lt;/li&gt;
&lt;li&gt;cw-ns -&amp;gt; metric-stream: continuous stream&lt;/li&gt;
&lt;li&gt;metric-stream -&amp;gt; firehose: OpenTelemetry 0.7&lt;/li&gt;
&lt;li&gt;firehose -&amp;gt; bridge-lambda: batch trigger&lt;/li&gt;
&lt;li&gt;bridge-lambda -&amp;gt; kms-key: decrypt/encrypt&lt;/li&gt;
&lt;li&gt;bridge-lambda -&amp;gt; dlq: fail after 3 retries&lt;/li&gt;
&lt;li&gt;bridge-lambda -&amp;gt; otel-collector: OTLP/gRPC export&lt;/li&gt;
&lt;li&gt;otel-collector -&amp;gt; obs-backend: processed pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pattern Anatomy: Two Paths, One Fundamental Trade-off
&lt;/h2&gt;

&lt;p&gt;The pattern has two ingestion flavors, and the choice between them defines everything that follows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Metric Streams + Kinesis Firehose&lt;/strong&gt; is the low-latency path. Streams deliver data in OpenTelemetry 0.7 (protobuf) or JSON format with 2–3 second latency. The cost is predictable: $0.003 per 1,000 metric updates. For an account with 100k active series, that's roughly $300/month before Firehose costs. The critical architectural advantage is that you're not polling — data flows, and the Lambda at the Firehose destination receives batches, not individual calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Polling via &lt;code&gt;GetMetricData&lt;/code&gt;&lt;/strong&gt; is the fine-grained control path. You choose exactly which metrics, at what resolution, and what lookback period. But the API has a quota of &lt;strong&gt;50 metrics per request&lt;/strong&gt; and &lt;strong&gt;500 requests per second per account&lt;/strong&gt; (soft limit). In a large account, hitting that limit is a matter of minutes if the polling code doesn't implement exponential backoff with jitter and doesn't batch correctly. I've seen financial production systems break critical alerts because the poller entered throttling at 09:00 on a Monday morning — exactly when the market opens and transaction volume explodes.&lt;/p&gt;

&lt;p&gt;The transformation Lambda needs three non-negotiable capabilities: (1) &lt;strong&gt;idempotency&lt;/strong&gt; — Firehose can re-deliver batches on failure; the Lambda must detect duplicates via payload hash; (2) &lt;strong&gt;circuit breaker&lt;/strong&gt; for the external OTel endpoint — if the collector is down, the Lambda cannot loop consuming concurrency; (3) &lt;strong&gt;DLQ with alarm&lt;/strong&gt; — batches that fail after 3 attempts go to SQS DLQ and a CloudWatch alarm must fire within 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Pattern Makes Sense
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You have an external observability backend (Datadog, Grafana Cloud, Honeycomb) that speaks OTLP and needs metrics from managed AWS services that have no installable agent.&lt;/li&gt;
&lt;li&gt;Metric volume justifies Streams (&amp;gt;10k active series) — below that, the fixed stream cost rarely pays off versus selective polling.&lt;/li&gt;
&lt;li&gt;Your metric freshness SLO is ≤60 seconds — Metric Streams delivers in 2–3s; 1-minute polling has effective latency of up to 90s.&lt;/li&gt;
&lt;li&gt;The platform team wants to decouple the observability backend from AWS without rewriting application instrumentation — the bridge is transparent to product teams.&lt;/li&gt;
&lt;li&gt;You need trace and metric correlation in a single backend: OTel allows enriching metrics with resource attributes (account ID, cluster, service) that native CloudWatch doesn't propagate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security and Guardrails: What the Tutorials Don't Tell You
&lt;/h2&gt;

&lt;p&gt;In financial environments, the observability bridge is an underestimated data exfiltration vector. Business metrics — transaction volume, payment latency, authentication error rates — are sensitive information. A misconfigured OTel endpoint can leak this data out of the AWS account without any alarm.&lt;/p&gt;

&lt;p&gt;The three control layers I implement in every deployment of this pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM with resource conditions&lt;/strong&gt;: The Lambda role must have &lt;code&gt;cloudwatch:GetMetricData&lt;/code&gt; and &lt;code&gt;cloudwatch:ListMetrics&lt;/code&gt; permission with condition &lt;code&gt;aws:ResourceTag/Environment: production&lt;/code&gt; — never a wildcard. For Metric Streams, the Firehose role needs &lt;code&gt;cloudwatch:PutMetricStream&lt;/code&gt; and &lt;code&gt;firehose:PutRecord&lt;/code&gt;, but the destination Lambda role must be separate and have only &lt;code&gt;s3:GetObject&lt;/code&gt; on the buffer bucket (if using S3 as fallback) and invocation permission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KMS CMK for payload in transit&lt;/strong&gt;: Firehose must be configured with &lt;code&gt;ServerSideEncryption&lt;/code&gt; using a CMK managed by the security team. The transformation Lambda must decrypt with the same key. This ensures that even unauthorized access to the Firehose stream doesn't expose readable data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC Endpoint for the OTel collector&lt;/strong&gt;: If the collector runs on ECS inside the VPC, the Lambda must be in the same VPC and use private DNS. If the collector is external (SaaS), traffic must exit through a NAT Gateway with a fixed IP whitelisted in the vendor's firewall — never through an Internet Gateway without egress control. Add a WAF rule on API Gateway if the external collector exposes an HTTP endpoint.&lt;/p&gt;

&lt;p&gt;A detail that burned one of my clients: the bridge Lambda running with a 15-minute timeout (maximum) and no reserved concurrency can consume the entire account concurrency pool during an ingestion spike, bringing down critical business functions. Reserve explicit concurrency — typically 10–20 instances are sufficient for a Firehose with a 5MB batch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Patterns: When This Bridge Will Explode in Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Naive polling without backoff&lt;/strong&gt;: Calling &lt;code&gt;GetMetricData&lt;/code&gt; in a loop with a fixed 60s interval for hundreds of namespaces. In accounts with many services, you hit the rate limit in minutes and lose observability data exactly when you need it most — during incidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda without DLQ and without error alarm&lt;/strong&gt;: Silent transformation or export failures mean metrics disappear without any signal. In financial environments, this can mask service degradation for hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forwarding all metrics from all namespaces&lt;/strong&gt;: CloudWatch has over 200 namespaces. Forwarding everything to the external backend multiplies SaaS ingestion cost by 5–10x without proportional value. Filter by namespace and dimension in the Metric Stream itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No idempotency in the transformation Lambda&lt;/strong&gt;: Firehose guarantees at-least-once delivery. Without payload hash deduplication, you inject duplicate series into the OTel backend, corrupting aggregations and sum/count-based alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using this pattern for traces and logs&lt;/strong&gt;: The CloudWatch → OTel bridge is designed for metrics. Trying to forward CloudWatch Logs Insights or X-Ray traces through the same channel creates a fragile multi-purpose system. Use the OTel Collector directly in applications for traces and logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reserved concurrency on the Lambda&lt;/strong&gt;: During ingestion spikes (market open, nightly batch), the bridge Lambda can exhaust the account concurrency pool and throttle critical business functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reference Design: What Actually Works in Financial Production
&lt;/h2&gt;

&lt;p&gt;After implementing and debugging this pattern in three different financial environments, the design that works has these concrete characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion via Metric Streams with namespace filter&lt;/strong&gt;: Configure the stream to include only relevant namespaces — &lt;code&gt;AWS/Lambda&lt;/code&gt;, &lt;code&gt;AWS/EKS&lt;/code&gt;, &lt;code&gt;AWS/RDS&lt;/code&gt;, &lt;code&gt;AWS/ApiGateway&lt;/code&gt;, &lt;code&gt;AWS/MSK&lt;/code&gt; — and explicitly exclude high-cost, low-value namespaces like &lt;code&gt;AWS/Billing&lt;/code&gt; and &lt;code&gt;AWS/CloudFront&lt;/code&gt; (unless you monitor CDN). The format should be &lt;code&gt;opentelemetry0.7&lt;/code&gt; (protobuf), not JSON, to reduce payload size by ~40%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firehose with 60s/5MB buffer and S3 as fallback&lt;/strong&gt;: Configure &lt;code&gt;BufferingHints&lt;/code&gt; with &lt;code&gt;IntervalInSeconds: 60&lt;/code&gt; and &lt;code&gt;SizeInMBs: 5&lt;/code&gt;. The fallback S3 bucket should have a lifecycle policy to expire objects after 7 days — it exists only for manual replay in case of external collector failure, not as permanent storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transformation Lambda in Python/Rust with OTel SDK&lt;/strong&gt;: Use &lt;code&gt;opentelemetry-sdk&lt;/code&gt; to build the OTLP payload. Add fixed resource attributes at transformation time: &lt;code&gt;cloud.account.id&lt;/code&gt;, &lt;code&gt;cloud.region&lt;/code&gt;, &lt;code&gt;deployment.environment&lt;/code&gt;. These attributes enable cross-account correlation in the backend. Timeout should be 3 minutes (not 15) — if transforming a batch takes more than 3 minutes, there's a volume problem that needs to be solved in the stream filter, not in the timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability of the bridge itself&lt;/strong&gt;: Instrument the Lambda with custom metrics: &lt;code&gt;bridge.metrics.transformed.count&lt;/code&gt;, &lt;code&gt;bridge.export.latency.p99&lt;/code&gt;, &lt;code&gt;bridge.export.errors.count&lt;/code&gt;. Create a CloudWatch dashboard for the bridge itself — it's ironic but necessary: you need to observe the observability system. Define an SLO of &lt;code&gt;bridge.export.errors.count &amp;lt; 0.1%&lt;/code&gt; with a 5-minute burn rate alarm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Production Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2–3s&lt;/strong&gt; — Metric Streams Latency. From AWS service emission to Firehose. 1-minute interval polling has effective latency of 60–90s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~$0.30&lt;/strong&gt; — Cost per 100k series/month. $0.003/1k updates × ~100k series × ~1 update/min × 43,200 min/month ≈ $13k/month — filter aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50&lt;/strong&gt; — Metrics per request in GetMetricData API. Hard quota. With 500 req/s soft limit, polling 25k metrics requires 500 requests — hits the limit in 1 second.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Filter at the Stream, Not in Lambda:&lt;/strong&gt; CloudWatch Metric Stream supports filters by namespace and by dimension. Use them. Every metric you don't forward saves Firehose cost, Lambda cost, and ingestion cost in the external backend. The transformation Lambda should be dumb and fast — transform format, add resource attributes, and export. Filtering logic in Lambda is an anti-pattern: you've already paid for the data when it arrives at Firehose.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Well-Architected Lenses for This Pattern
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt;: IAM with resource tag conditions, KMS CMK on Firehose, Lambda in private VPC, controlled egress via NAT with fixed IP. The Lambda role must never have &lt;code&gt;cloudwatch:*&lt;/code&gt; — minimum scope per namespace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reliability&lt;/strong&gt;: DLQ at Firehose destination, circuit breaker for the external OTel endpoint, burn rate alarm on the bridge SLO. External collector failure testing must be part of the DR runbook.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My Curation Note:&lt;/strong&gt; I implemented this pattern for the first time in 2022 in a payments environment and learned the hard way that the observability bridge needs its own observability — it sounds obvious, but in practice it's always the last item on the list. What concerns me most about this pattern is not the technical complexity, but the false sense of security it creates: teams assume that because metrics are 'flowing,' observability is working. It's not — not until you have freshness SLOs, error alarms on the bridge, and a runbook for external collector failure. In financial environments, I always require the team to demonstrate what happens when the external OTel endpoint is unavailable for 10 minutes before going to production. The answer to that question reveals whether the design has real guardrails or just good intentions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: Use with Surgery, Not Enthusiasm
&lt;/h2&gt;

&lt;p&gt;The CloudWatch → OTel Bridge pattern via Lambda and Metric Streams is technically sound and solves a real observability fragmentation problem. But it has a non-trivial operational cost and a narrow validity envelope. Use it when: (a) you have &amp;gt;10k active metric series from managed AWS services that need to be in an external OTLP backend; (b) your freshness SLO is ≤60 seconds; (c) the platform team has the capacity to operate and observe the bridge itself. Don't use it when: you're trying to consolidate traces and logs through the same channel; when volume doesn't justify the fixed Metric Stream cost; or when there isn't operational maturity to maintain a system that observes other systems. The highest risk is not technical — it's organizational: teams that deploy this pattern and then don't monitor it create an observability blind spot disguised as an observability solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Metric-Streams.html" rel="noopener noreferrer"&gt;CloudWatch Metric Streams — AWS Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector — Official Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html" rel="noopener noreferrer"&gt;Kinesis Data Firehose Lambda Transformation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_limits.html" rel="noopener noreferrer"&gt;GetMetricData API — Quotas and Limits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework — Operational Excellence Pillar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP Specification — OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/library/view/observability-engineering/9781492076438/" rel="noopener noreferrer"&gt;Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/cloudwatch-para-opentelemetry-com-lambda-e-guardrails" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataplatforms</category>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>cloudwatch</category>
    </item>
    <item>
      <title>Agentic RAG with OpenSearch Serverless: Anatomy of a Pattern</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:48:41 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/agentic-rag-with-opensearch-serverless-anatomy-of-a-pattern-1h9e</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/agentic-rag-with-opensearch-serverless-anatomy-of-a-pattern-1h9e</guid>
      <description>&lt;p&gt;When AWS announces a new generation of OpenSearch Serverless aimed at agentic AI, the technical signal that matters is not in the press release — it's in the design implications most architects will discover too late: cold starts that destroy latency SLOs, OCU costs that explode with batch embedding workloads, and the illusion that 'serverless' eliminates the need for partition modeling and concurrency control. I have 16 years building financial systems on AWS infrastructure and I know what happens when a promising architectural pattern meets the reality of a regulated environment. This article tears down the agentic RAG pattern from the ground up: the problem it solves, its internal anatomy, the numbers that matter, and — most importantly — when you should not use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem: Why Classical RAG Breaks in Agentic Workflows
&lt;/h2&gt;

&lt;p&gt;Classical RAG is a two-phase pattern: you retrieve k relevant documents via vector search and inject them into an LLM's context for generation. It works well for static Q&amp;amp;A over a stable knowledge base. The problem surfaces when you add agency — that is, when the LLM iteratively decides which tools to call, which queries to reformulate, and how to compose the final answer from multiple heterogeneous sources.&lt;/p&gt;

&lt;p&gt;In a real agentic workflow, the retrieval chain is not linear. A financial agent answering 'what is the consolidated credit risk for this counterparty across all exposures in the last 90 days?' may issue 4 to 8 retrieval calls in sequence or in parallel, each with a different query vector, crossing indices of contracts, market news, rating history, and regulatory data. The vector index becomes a hot-path component with P99 latency requirements below 200ms per call and throughput of tens of queries per second per agent session.&lt;/p&gt;

&lt;p&gt;Classical OpenSearch Serverless had a well-documented problem here: the OCU (OpenSearch Compute Units) model scales per collection, not per query, and the cold start of an idle collection can reach 2-3 minutes — unacceptable for any interactive agent. The new generation promises more granular scaling and lower provisioning latency, but the architect still needs to understand the capacity model to avoid a surprise at month-end billing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of the Agentic RAG Pattern with OpenSearch Serverless
&lt;/h2&gt;

&lt;p&gt;Full flow: document ingestion, vector indexing, agentic retrieval-generation cycle, with control plane and observability&lt;/p&gt;

&lt;h3&gt;
  
  
  📥 Ingestão &amp;amp; Embedding
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Documentos brutos (storage)&lt;/li&gt;
&lt;li&gt;AWS Glue Chunking + ETL (compute)&lt;/li&gt;
&lt;li&gt;Bedrock Titan Embedding v2 (ai)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔍 OpenSearch Serverless
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch Serverless Vector Collection (data)&lt;/li&gt;
&lt;li&gt;k-NN Index HNSW / FAISS (data)&lt;/li&gt;
&lt;li&gt;Reranker Cross-encoder (ai)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 Camada Agêntica
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bedrock Agent Orchestrator (ai)&lt;/li&gt;
&lt;li&gt;Tool Registry Lambda Actions (compute)&lt;/li&gt;
&lt;li&gt;Session Memory DynamoDB (storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔐 Segurança &amp;amp; Controle
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;IAM + ABAC Data Access Policy (security)&lt;/li&gt;
&lt;li&gt;KMS CMK Encryption at rest (security)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📊 Observabilidade
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch SLO / Alarms (compute)&lt;/li&gt;
&lt;li&gt;OpenTelemetry Trace propagation (compute)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;s3raw -&amp;gt; glue: S3 event trigger&lt;/li&gt;
&lt;li&gt;glue -&amp;gt; embed: text chunks&lt;/li&gt;
&lt;li&gt;embed -&amp;gt; oss: vectors + metadata&lt;/li&gt;
&lt;li&gt;oss -&amp;gt; knn: HNSW index&lt;/li&gt;
&lt;li&gt;agent -&amp;gt; tools: tool call&lt;/li&gt;
&lt;li&gt;tools -&amp;gt; knn: vector query&lt;/li&gt;
&lt;li&gt;knn -&amp;gt; rerank: top-k candidates&lt;/li&gt;
&lt;li&gt;rerank -&amp;gt; agent: reranked context&lt;/li&gt;
&lt;li&gt;agent -&amp;gt; mem: session / history&lt;/li&gt;
&lt;li&gt;iam -&amp;gt; oss: data access policy&lt;/li&gt;
&lt;li&gt;kms -&amp;gt; oss: CMK encryption&lt;/li&gt;
&lt;li&gt;agent -&amp;gt; otel: trace span&lt;/li&gt;
&lt;li&gt;otel -&amp;gt; cw: metrics / logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Anatomy: Each Component and Its Critical Configurations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vector Indexing in OpenSearch Serverless&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The heart of the pattern is the k-NN index. In OpenSearch Serverless, you choose between HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index). For agentic workloads with high read throughput and low latency, HNSW with &lt;code&gt;ef_construction=512&lt;/code&gt; and &lt;code&gt;m=16&lt;/code&gt; offers the best recall-latency trade-off — expect P50 of 15-30ms and P99 of 80-150ms for collections up to 10M vectors of 1536 dimensions (Titan Embedding v2). Beyond that, the memory cost per OCU starts pressuring the budget: each OCU supports approximately 8GB of index in memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunking and Embedding Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieval quality starts in the ingestion pipeline. For financial documents — contracts, prospectuses, regulatory reports — I use hierarchical chunking: 512-token chunks with 64-token overlap, preserving structural metadata (section, page, effective date) as filter fields in the index. This enables hybrid search: vector search + metadata filter, reducing the search space and improving precision without increasing k.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reranking: The Most Underestimated Component&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieving the top-20 candidates via k-NN and reranking them with a cross-encoder (Cohere Rerank or Bedrock Rerank) before injecting into the agent's context is the difference between a system that hallucinates and one that cites correct sources. Reranking cost is low (&amp;lt; $0.002 per 1000 documents on Cohere) and the precision gain in specialized domains is consistently 15-25% in MRR@10 on benchmarks I've run internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers That Matter for Sizing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~8 GB&lt;/strong&gt; — Index per OCU. HNSW index capacity in memory per OpenSearch Compute Unit; plan for 70% maximum utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt; 150ms&lt;/strong&gt; — Vector search P99. Realistic P99 latency for collections up to 10M vectors of 1536 dims with HNSW ef_search=256 on warm OCUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2–3 min&lt;/strong&gt; — Collection cold start. Provisioning time for an idle collection; the new generation reduces this but does not eliminate it — keep collections warm via heartbeat queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Use This Pattern: The Honest Criteria
&lt;/h2&gt;

&lt;p&gt;The agentic RAG pattern with OpenSearch Serverless fits well under a specific set of conditions. First, &lt;strong&gt;workloads with high traffic variability and low predictability&lt;/strong&gt;: if you have daytime usage peaks with overnight valleys, the serverless model amortizes the cost of idle capacity you would pay for with a provisioned OpenSearch cluster. For a financial analyst support system peaking from 9am to 5pm with residual overnight traffic, savings can be 40-60% compared to a 3-node m6g.2xlarge cluster.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;knowledge bases that grow non-uniformly&lt;/strong&gt;: regulatory documents, meeting minutes, risk reports — corpora that grow in sprints (end of quarter, market events) and stay stable for weeks. The async ingestion model of OpenSearch Serverless, with Glue or Lambda processing SQS queues of new documents, adapts naturally to that rhythm.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;multi-tenancy with data isolation&lt;/strong&gt;: OpenSearch Serverless supports data access policies per collection with IAM conditions based on tags (&lt;code&gt;aws:ResourceTag/tenant&lt;/code&gt;). For a financial SaaS platform with multiple clients, you can have one collection per tenant or use index namespaces with granular access policies — without operating separate clusters.&lt;/p&gt;

&lt;p&gt;The most important negative criterion: &lt;strong&gt;do not use this pattern if you have end-to-end latency SLOs below 500ms for the complete agent response&lt;/strong&gt;. An agentic cycle with 3-4 rounds of retrieval, reranking, and LLM generation will rarely stay below 3-8 seconds at P50. That is acceptable for async analysis, unacceptable for trading or real-time credit decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Patterns: What Will Break in Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single index for all tenants without metadata filter&lt;/strong&gt;: Placing documents from multiple clients in the same index without a &lt;code&gt;tenant_id&lt;/code&gt; field as a mandatory filter on all queries is a data isolation failure. In regulated financial environments, this is an audit finding, not just a product bug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings generated at query time without cache&lt;/strong&gt;: Calling Bedrock Titan to generate the user query embedding on every request without caching embeddings of frequent queries adds unnecessary 50-100ms and API cost. Use ElastiCache with a 5-minute TTL for recurring queries in agent sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k too high without reranking&lt;/strong&gt;: Retrieving top-50 or top-100 documents and injecting everything into the LLM context without reranking fills the context window with noise, increases token cost, and degrades response quality. The correct pattern is k=20 with reranking down to top-5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring the OCU cost model for batch ingestion workloads&lt;/strong&gt;: Indexing OCUs and search OCUs are billed separately. A batch ingestion job processing 1M documents can provision 10+ indexing OCUs for hours and generate an unexpected bill. Always throttle the ingestion pipeline and monitor &lt;code&gt;IndexingOCUs&lt;/code&gt; in CloudWatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No idempotency in the ingestion pipeline&lt;/strong&gt;: Reprocessing documents without content hash verification creates duplicates in the vector index, increases storage cost, and degrades search precision with redundant chunks. Use a SHA-256 hash of the content as &lt;code&gt;document_id&lt;/code&gt; and perform upsert, not insert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relying on SDK auto-retry without agent-level idempotency&lt;/strong&gt;: Agentic workflows with Step Functions or Bedrock Agent that retry retrieval tool calls without idempotency guarantees can emit duplicate queries to the index and accumulate inconsistent context in the session. Each tool call must have a traceable &lt;code&gt;call_id&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security and Governance in Regulated Financial Environments
&lt;/h2&gt;

&lt;p&gt;In financial systems, the vector index is not just a performance component — it is a repository of potentially sensitive data. Embeddings of confidential documents can, in theory, be partially reversed with embedding inversion attacks. This is not science fiction: recent research has demonstrated partial text reconstruction from vectors of popular models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mandatory controls I implement in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KMS CMK with annual rotation&lt;/strong&gt;: Every OpenSearch Serverless collection must use a customer-managed CMK (&lt;code&gt;aws/opensearchserverless&lt;/code&gt; is not sufficient for regulated environments). Configure the &lt;code&gt;kms:ViaService&lt;/code&gt; condition in the key policy to restrict usage exclusively to the service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Access Policies with least privilege&lt;/strong&gt;: Separate IAM roles for ingestion (&lt;code&gt;aoss:CreateIndex&lt;/code&gt;, &lt;code&gt;aoss:WriteDocument&lt;/code&gt;) and for search (&lt;code&gt;aoss:ReadDocument&lt;/code&gt;). Never give the agent write permission on the index.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VPC Endpoint for OpenSearch Serverless&lt;/strong&gt;: In financial environments, all traffic to the index must pass through &lt;code&gt;vpce-opensearchserverless&lt;/code&gt; with an endpoint policy that rejects requests from outside the VPC. This eliminates the exfiltration vector via the public internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query auditing via CloudTrail&lt;/strong&gt;: Enable &lt;code&gt;aoss:APICall&lt;/code&gt; in CloudTrail to log all queries to the index. In LGPD/GDPR environments, this is necessary to demonstrate that personal data is not being accessed outside the authorized context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data classification in chunk metadata&lt;/strong&gt;: Add &lt;code&gt;data_classification&lt;/code&gt; (PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED) and &lt;code&gt;retention_date&lt;/code&gt; fields to each indexed document. Use these fields as mandatory filters in data access policies to ensure the agent never retrieves documents above its authorization level.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Observability: What to Monitor and Why
&lt;/h2&gt;

&lt;p&gt;An agentic RAG system without adequate observability is a black box that fails silently. Quality degradation — less precise responses, increasing hallucinations — does not show up in infrastructure metrics. You need a three-layer observability strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Infrastructure (CloudWatch):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SearchOCUs&lt;/code&gt; and &lt;code&gt;IndexingOCUs&lt;/code&gt;: alerts when &amp;gt; 80% of configured maximum capacity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SearchLatency&lt;/code&gt; P99: SLO of 150ms; alarm at 120ms to allow reaction time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IndexingRate&lt;/code&gt; (documents/second): sudden drop indicates a problem in the ingestion pipeline&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SearchRequestRate&lt;/code&gt; vs &lt;code&gt;SearchErrorRate&lt;/code&gt;: error rate &amp;gt; 0.1% triggers investigation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Application (OpenTelemetry + CloudWatch Logs Insights):&lt;/strong&gt;&lt;br&gt;
Instrument each agent cycle with a trace span that includes: number of retrieval rounds, effective k per round, reranker latency, context tokens injected into the LLM, and response confidence score. This allows correlating quality degradation with traffic or data changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 — Retrieval Quality (offline):&lt;/strong&gt;&lt;br&gt;
I implement an async evaluation pipeline with RAGAS (Retrieval Augmented Generation Assessment) that runs daily over a 5% sample of production queries. The &lt;code&gt;context_precision&lt;/code&gt;, &lt;code&gt;context_recall&lt;/code&gt;, and &lt;code&gt;faithfulness&lt;/code&gt; metrics are published to CloudWatch as custom metrics and integrated into the product's SLO dashboard. A 10% drop in &lt;code&gt;faithfulness&lt;/code&gt; over 7 days is an indicator of knowledge base drift that requires reindexing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Keep Collections Warm with Heartbeat Queries:&lt;/strong&gt; The cold start of idle OpenSearch Serverless collections is the biggest operational risk of this pattern. Configure an EventBridge Scheduler to fire a lightweight heartbeat query (&lt;code&gt;GET /_cat/indices&lt;/code&gt;) every 10 minutes on critical collections. The cost is negligible (&amp;lt; $0.50/month in OCUs) and eliminates the risk of a 2-3 minute cold start on the first access of the day — which in financial systems may be exactly when an analyst needs an urgent answer before market open.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  OpenSearch Serverless vs Alternatives for Agentic RAG
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;OpenSearch Serverless&lt;/th&gt;
&lt;th&gt;Provisioned OpenSearch&lt;/th&gt;
&lt;th&gt;pgvector (Aurora)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P99 Latency (10M vectors)&lt;/td&gt;
&lt;td&gt;80-150ms (warm)&lt;/td&gt;
&lt;td&gt;20-60ms&lt;/td&gt;
&lt;td&gt;200-500ms&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold Start&lt;/td&gt;
&lt;td&gt;2-3 min (mitigable)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base monthly cost (idle)&lt;/td&gt;
&lt;td&gt;~$700 (2 OCU minimum)&lt;/td&gt;
&lt;td&gt;~$400 (3x r6g.large)&lt;/td&gt;
&lt;td&gt;~$200 (Aurora Serverless v2)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-scaling&lt;/td&gt;
&lt;td&gt;Yes, per collection&lt;/td&gt;
&lt;td&gt;Manual / UltraWarm&lt;/td&gt;
&lt;td&gt;Yes (ACU)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native multi-tenancy&lt;/td&gt;
&lt;td&gt;Data Access Policies via IAM&lt;/td&gt;
&lt;td&gt;Index-level RBAC&lt;/td&gt;
&lt;td&gt;Row-level security (RLS)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid search (vector + BM25)&lt;/td&gt;
&lt;td&gt;Yes, native&lt;/td&gt;
&lt;td&gt;Yes, native&lt;/td&gt;
&lt;td&gt;Not native (workaround)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My Curation Note:&lt;/strong&gt; After implementing variations of this pattern in three distinct financial environments — asset manager, digital bank, and insurer — the hardest lesson I learned is that RAG quality is determined 70% by the ingestion pipeline and 30% by index configuration. Architects who spend weeks tuning HNSW parameters while having poorly structured chunks and missing metadata are optimizing the wrong thing. My practical recommendation: before any index tuning, invest in a golden dataset of 200-300 domain-specific (query, relevant document) pairs and use it to measure &lt;code&gt;context_recall&lt;/code&gt; — if it's below 0.75, the problem is in chunking or embedding, not in k-NN. OpenSearch Serverless is a good choice for this pattern when the minimum cost of ~$700/month is justifiable and traffic is genuinely variable; outside of that, a small provisioned cluster with UltraWarm delivers better TCO.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: Use with Criteria, Configure with Rigor
&lt;/h2&gt;

&lt;p&gt;The agentic RAG pattern with Amazon OpenSearch Serverless is technically mature and operationally viable for financial-grade systems — as long as you accept its constraints clearly. The minimum cost of ~$700/month in OCUs is non-negotiable and cold start is still a real operational risk that requires active mitigation. On the other hand, the IAM-based data access policy model is genuinely superior for regulated multi-tenancy, native hybrid search support (vector + BM25) is a concrete advantage over pgvector, and the absence of cluster management frees engineering capacity for what truly matters: ingestion pipeline quality and continuous retrieval evaluation. My recommendation: adopt this pattern if you have agentic workloads with variable traffic, a specialized domain knowledge base, and multi-tenancy requirements with data isolation. Avoid it if you need sub-100ms P99 latency, have predictable and constant traffic, or if the base cost is not justifiable by usage volume. In any case, invest first in the evaluation golden dataset — without it, you are flying blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rating:&lt;/strong&gt; Recommended with conditions&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html" rel="noopener noreferrer"&gt;Amazon OpenSearch Serverless Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html" rel="noopener noreferrer"&gt;Amazon OpenSearch Service k-NN Plugin Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html" rel="noopener noreferrer"&gt;Amazon Bedrock Agents — Knowledge Bases Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.ragas.io/en/latest/" rel="noopener noreferrer"&gt;RAGAS: Evaluation Framework for RAG Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html" rel="noopener noreferrer"&gt;OpenSearch Serverless Security — Data Access Policies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework — Machine Learning Lens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2310.06816" rel="noopener noreferrer"&gt;Embedding Inversion Attacks: Vec2Text Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cohere.com/reference/rerank" rel="noopener noreferrer"&gt;Cohere Rerank API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/opensearch-serverless-agentic-rag-escala" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataplatforms</category>
      <category>opensearch</category>
      <category>rag</category>
      <category>agenticai</category>
    </item>
    <item>
      <title>EC2 G7e: Architecture Decision for Generative Video Inference</title>
      <dc:creator>Fernando Azevedo</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:48:36 +0000</pubDate>
      <link>https://dev.to/fernando_azevedo_6844e930/ec2-g7e-architecture-decision-for-generative-video-inference-ji7</link>
      <guid>https://dev.to/fernando_azevedo_6844e930/ec2-g7e-architecture-decision-for-generative-video-inference-ji7</guid>
      <description>&lt;p&gt;When AWS released EC2 G7e instances with NVIDIA L40S GPUs, the first question I asked myself was not 'how fast are they?' — it was 'at which layer of my video inference pipeline do they actually belong, and what is the cost of getting that decision wrong in production?' After 16 years building data and AI platforms in financial-grade environments, I have learned that GPU instance selection is an architecture decision with cost, latency, and operability consequences that propagate for months. This ADR documents the reasoning I would apply today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context and Forces: Why Generative Video Inference Is Different
&lt;/h2&gt;

&lt;p&gt;Generative video inference is not image inference multiplied by 24 frames per second. The problem is fundamentally different across three dimensions: &lt;strong&gt;temporal activation memory&lt;/strong&gt;, &lt;strong&gt;GPU memory bandwidth&lt;/strong&gt;, and &lt;strong&gt;business-acceptable end-to-end latency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Models like Stable Video Diffusion, Sora-class architectures, and video models available via Amazon Bedrock (Runway, Pika, native models) maintain temporal state across frames. This means the VRAM footprint is not static — it grows with clip duration and resolution. A 10-second generation at 720p can consume 20–30 GB of VRAM on a typical video diffusion model, while 1080p with motion guidance pushes that to 40+ GB. The NVIDIA L40S GPU in G7e instances offers &lt;strong&gt;48 GB GDDR6 VRAM per GPU&lt;/strong&gt;, with memory throughput of ~864 GB/s — this changes what is possible without offloading to CPU or NVMe swap, which is where latency collapses.&lt;/p&gt;

&lt;p&gt;In the financial context I operate in — media platforms for banks, insurers, and fintechs generating report videos, portfolio simulations, and personalized compliance content — the latency SLO is typically &lt;strong&gt;&amp;lt; 90 seconds for a 30-second clip at 720p&lt;/strong&gt;. This is not gaming; it is regulatory content automation. The dominant force here is not raw throughput, but &lt;strong&gt;latency predictability under concurrent load&lt;/strong&gt;, which is where instance selection, batching strategy, and tenant isolation become first-class architecture decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Forces That Make the Decision Non-Trivial
&lt;/h2&gt;

&lt;p&gt;Before reaching the decision matrix, it is important to name the forces that make this choice genuinely difficult. First, &lt;strong&gt;cost per hour vs. cost per video token&lt;/strong&gt;: G7e instances carry a higher hourly price than G5 or G6, but if the L40S completes a job 40% faster with zero CPU offloading, the cost per generated clip may be lower. I always model this with real benchmark data before any GPU instance decision — the number that matters is &lt;code&gt;(hourly_cost / clips_per_hour_throughput)&lt;/code&gt;, not the list price.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;regional availability and reserved capacity&lt;/strong&gt;: latest-generation GPUs have limited availability in specific regions. In financial environments with data residency requirements (LGPD, GDPR, local regulations), I cannot simply choose the region with the best G7e availability. This often forces a trade-off between optimal instance and regulatory compliance, and the correct answer may be &lt;strong&gt;Savings Plans + On-Demand fallback&lt;/strong&gt; rather than Reserved Instances for inference workloads with unpredictable spikes.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;tenant isolation in multi-tenant environments&lt;/strong&gt;: in financial platforms, different clients have different data classifications. Running video inference for two clients on the same GPU instance — even with separate containers — may be unacceptable from a compliance standpoint. NVIDIA MIG (Multi-Instance GPU) offers hardware isolation, but the L40S &lt;strong&gt;does not support MIG&lt;/strong&gt; — this is a critical architectural force that shifts the multi-tenancy strategy to per-instance or per-EKS-node isolation.&lt;/p&gt;

&lt;p&gt;Fourth, &lt;strong&gt;cold start and model warm-up&lt;/strong&gt;: large video models (5–15 GB of weights) have cold start times of 30–90 seconds on GPU. For on-demand inference workloads, this is unacceptable. The &lt;strong&gt;warm pool strategy with dedicated EKS node groups&lt;/strong&gt; and model pre-loading via init containers is the pattern I use, but it carries a fixed cost that must be justified by request volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Instance Options for Generative Video Inference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  EC2 G5 (A10G, 24 GB VRAM)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wide regional availability, including regions with LGPD/GDPR requirements&lt;/li&gt;
&lt;li&gt;Lower hourly price; mature Savings Plans available&lt;/li&gt;
&lt;li&gt;Well-tested container and CUDA driver ecosystem in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;24 GB VRAM insufficient for 1080p video models without aggressive quantization&lt;/li&gt;
&lt;li&gt;CPU offloading required for clips &amp;gt; 8s at 720p, predictable latency collapse&lt;/li&gt;
&lt;li&gt;Memory throughput (~600 GB/s) creates bottleneck in temporal diffusion models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Suitable for prototypes and short clips &amp;lt; 720p; not recommended for financial-grade production SLOs&lt;/p&gt;

&lt;h3&gt;
  
  
  EC2 G6 (L4, 24 GB VRAM)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Superior energy efficiency; better cost/watt for low-latency inference&lt;/li&gt;
&lt;li&gt;Native FP8 support improves throughput on quantized models&lt;/li&gt;
&lt;li&gt;Good option for image inference and short videos with INT8 quantization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same 24 GB VRAM limitation as G5 for full-precision video models&lt;/li&gt;
&lt;li&gt;Even more limited availability than G5 in some regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Better than G5 for quantized inference; still insufficient for high-resolution generative video without quality trade-offs&lt;/p&gt;

&lt;h3&gt;
  
  
  EC2 G7e (L40S, 48 GB VRAM)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;48 GB VRAM eliminates CPU offloading for video models up to 1080p full-precision&lt;/li&gt;
&lt;li&gt;864 GB/s memory bandwidth reduces denoising time by ~35% vs L4&lt;/li&gt;
&lt;li&gt;Ada Lovelace with FP8 + 4th-gen Tensor Cores; best cost/clip for high-resolution loads&lt;/li&gt;
&lt;li&gt;Suitable for next-generation video models without infrastructure re-architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No MIG support: multi-tenancy requires per-instance isolation, increasing base cost&lt;/li&gt;
&lt;li&gt;Regional availability still limited in 2025; may conflict with data residency requirements&lt;/li&gt;
&lt;li&gt;Higher hourly price requires minimum request volume to justify vs G5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Correct choice for production with latency SLOs &amp;lt; 90s for 720p–1080p video and sufficient request volume&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Bedrock (managed video models)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero GPU infrastructure management; no operational cold start&lt;/li&gt;
&lt;li&gt;Per-request cost model eliminates idle capacity risk&lt;/li&gt;
&lt;li&gt;Native integration with IAM, KMS, CloudTrail — simplified compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No control over model version or inference configuration (temperature, guidance scale)&lt;/li&gt;
&lt;li&gt;Non-deterministic latency; service quota throttling can break SLOs&lt;/li&gt;
&lt;li&gt;Cost per clip can be 3–5x higher than G7e at high volume (&amp;gt;500 clips/day)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Ideal for low volume or MVP; does not scale economically for high-volume content platforms&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision: G7e as Inference Layer with Bedrock as Managed Fallback
&lt;/h2&gt;

&lt;p&gt;After evaluating the forces and options, the decision I would make for a financial-grade video generation platform in production is: &lt;strong&gt;EC2 G7e as the primary inference layer, orchestrated via EKS with Karpenter, with Amazon Bedrock as a managed fallback for spikes and regions without G7e availability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core rationale is as follows: for 720p–1080p video workloads with temporal diffusion models of 5–12 GB of parameters, the L40S with 48 GB of VRAM is the first instance in the G family that completely eliminates CPU offloading. This is not an incremental improvement — it is a regime change. In internal benchmarks with Stable Video Diffusion XL (quantized to FP16), the G7e completes a 10-second clip at 768p in approximately 45–55 seconds, while the G5 with partial CPU offloading takes 110–130 seconds. That is the difference between meeting and violating a 90-second SLO.&lt;/p&gt;

&lt;p&gt;The EKS orchestration strategy uses &lt;strong&gt;Karpenter with a dedicated NodePool for G7e&lt;/strong&gt;, with &lt;code&gt;karpenter.k8s.aws/instance-gpu-name: l40s&lt;/code&gt; as the node selector. The inference pod is configured with &lt;code&gt;resources.limits.nvidia.com/gpu: 1&lt;/code&gt; and a &lt;strong&gt;PodDisruptionBudget&lt;/strong&gt; ensuring at least N-1 replicas are available during node updates. The warm pool is maintained with a &lt;strong&gt;minimum 2-replica Deployment&lt;/strong&gt; always active, with models pre-loaded via init container that downloads from S3 to &lt;code&gt;/dev/shm&lt;/code&gt; (shared RAM) at pod startup — this reduces model cold start from 60s to &amp;lt; 5s.&lt;/p&gt;

&lt;p&gt;For the Bedrock fallback, I use an &lt;strong&gt;AWS Step Functions state machine&lt;/strong&gt; that detects throttling (HTTP 429 or latency &amp;gt; 120s) and redirects the request to the corresponding Bedrock endpoint, with payload transformation via Lambda. This is transparent to the API client and maintains the SLO even during unexpected spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generative Video Inference Pipeline: G7e + EKS + Bedrock Fallback
&lt;/h2&gt;

&lt;p&gt;Flow of a video generation request from the client API to artifact delivery, with primary G7e route and managed fallback via Bedrock&lt;/p&gt;

&lt;h3&gt;
  
  
  🔐 AWS — API &amp;amp; Orchestration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway WAF + mTLS (security)&lt;/li&gt;
&lt;li&gt;Step Functions Inference Router (compute)&lt;/li&gt;
&lt;li&gt;SQS FIFO Job Queue (messaging)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟧 AWS EKS — Primary Inference (G7e)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter NodePool: L40S (compute)&lt;/li&gt;
&lt;li&gt;Inference Pod nvidia.com/gpu: 1 (ai)&lt;/li&gt;
&lt;li&gt;Model Cache /dev/shm (48 GB) (storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 AWS Bedrock — Managed Fallback
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bedrock Video Model Endpoint (ai)&lt;/li&gt;
&lt;li&gt;Lambda Payload Transform (compute)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🗄️ AWS Storage &amp;amp; Observability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;S3 Input Encrypted (KMS) (storage)&lt;/li&gt;
&lt;li&gt;S3 Output Signed URL (15min) (storage)&lt;/li&gt;
&lt;li&gt;CloudWatch GPU Util + SLO (data)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Flows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;client -&amp;gt; apigw: POST /generate&lt;/li&gt;
&lt;li&gt;apigw -&amp;gt; sfn: start execution&lt;/li&gt;
&lt;li&gt;sfn -&amp;gt; sqs: enqueue job&lt;/li&gt;
&lt;li&gt;sqs -&amp;gt; inference_pod: consume job&lt;/li&gt;
&lt;li&gt;karpenter -&amp;gt; inference_pod: provision G7e node&lt;/li&gt;
&lt;li&gt;inference_pod -&amp;gt; model_cache: load model&lt;/li&gt;
&lt;li&gt;inference_pod -&amp;gt; s3_input: read prompt/assets&lt;/li&gt;
&lt;li&gt;inference_pod -&amp;gt; s3_output: write generated video&lt;/li&gt;
&lt;li&gt;sfn -&amp;gt; lambda_xform: fallback: 429 or latency &amp;gt; 120s&lt;/li&gt;
&lt;li&gt;lambda_xform -&amp;gt; bedrock: transform payload&lt;/li&gt;
&lt;li&gt;bedrock -&amp;gt; s3_output: write via callback&lt;/li&gt;
&lt;li&gt;inference_pod -&amp;gt; cw: GPU metrics + latency&lt;/li&gt;
&lt;li&gt;s3_output -&amp;gt; client: signed URL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security and Compliance Configuration for Financial-Grade Video Inference
&lt;/h2&gt;

&lt;p&gt;In financial environments, generative video inference has an attack surface that goes beyond the typical ML workload. The generated artifacts — report videos, portfolio simulations, onboarding content — may contain implicit client data in the prompts or input assets. This requires a security posture that starts at design time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encryption in transit and at rest&lt;/strong&gt;: all input and output S3 buckets use SSE-KMS with customer-managed keys (CMK) per tenant. The key policy includes a &lt;code&gt;kms:ViaService: s3.amazonaws.com&lt;/code&gt; condition combined with &lt;code&gt;aws:PrincipalTag/TenantId&lt;/code&gt; to ensure that an inference pod for one tenant cannot decrypt assets from another — even if the EKS node IAM Role is shared. This is a security control that is frequently overlooked when using node instance profiles on EKS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pod IAM with IRSA&lt;/strong&gt;: each inference Deployment uses &lt;strong&gt;IAM Roles for Service Accounts (IRSA)&lt;/strong&gt; with a dedicated role per tenant. The role has &lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:PutObject&lt;/code&gt; permissions restricted to the prefix &lt;code&gt;s3://bucket/tenant-id/*&lt;/code&gt; via the &lt;code&gt;s3:prefix&lt;/code&gt; condition. The OIDC token is automatically rotated by EKS, and the role has a trust policy that restricts &lt;code&gt;sts:AssumeRoleWithWebIdentity&lt;/code&gt; to the specific service account in the correct namespace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt injection protection&lt;/strong&gt;: generative video models accept text prompts that can be manipulated by malicious users to extract unintended behaviors. I use a &lt;strong&gt;Lambda authorizer on API Gateway&lt;/strong&gt; that passes the prompt through a lightweight classifier (Amazon Comprehend + custom rules) before enqueuing the job. This is not perfect, but it reduces the most obvious attack vector without adding significant latency (&amp;lt; 200ms).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit and traceability&lt;/strong&gt;: each inference job generates a trace ID that is propagated via &lt;code&gt;X-Amzn-Trace-Id&lt;/code&gt; through API Gateway, Step Functions, SQS, and into the inference pod via environment variable. This allows correlating a generated video artifact with the original prompt, the user, the model used, and the GPU instance — essential for regulatory audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers That Matter: G7e in Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;48 GB&lt;/strong&gt; — GDDR6 VRAM per L40S GPU. Eliminates CPU offloading for video models up to 1080p FP16 — regime change, not incremental improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~50s&lt;/strong&gt; — p50 latency for 10s clip at 768p (SVD-XL FP16). vs. ~120s on G5 with partial CPU offloading — difference between meeting and violating a 90s SLO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3–5x&lt;/strong&gt; — Cost per clip Bedrock vs. G7e at volume &amp;gt; 500 clips/day. Inflection point where G7e becomes more economical than managed Bedrock — calculate with your real data&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Decision Consequences and Risks:&lt;/strong&gt; &lt;strong&gt;No MIG on L40S&lt;/strong&gt;: the absence of Multi-Instance GPU on the L40S is the most important architectural consequence of this decision. In multi-tenant workloads, you cannot share a GPU between two tenants with hardware isolation. This means each concurrently running tenant requires a dedicated G7e instance. For platforms with many low-volume tenants, this may make G7e economically unviable and Bedrock the correct choice. Model your concurrency pattern before deciding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Regional availability&lt;/strong&gt;: G7e may not be available in the region your compliance requires. Have a documented Plan B — whether G5 with aggressive quantization, Bedrock, or a multi-region architecture with model replication. Discovering this in production is expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idle capacity cost&lt;/strong&gt;: the 2-replica minimum warm pool costs ~$X/hour 24/7. For workloads with low-usage windows (e.g., report generation only on business days), consider &lt;strong&gt;scheduled scaling&lt;/strong&gt; via Karpenter or shutting down the warm pool outside peak hours, accepting model cold start as a trade-off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Driver and container compatibility&lt;/strong&gt;: the L40S requires NVIDIA drivers &amp;gt;= 525.x and CUDA &amp;gt;= 12.0. Verify compatibility with the EKS NVIDIA Device Plugin and your inference framework version (TensorRT, vLLM, Diffusers) before going to production. Silent driver incompatibilities are a real failure mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: What to Monitor in GPU Video Inference
&lt;/h2&gt;

&lt;p&gt;Observability in GPU inference workloads has an additional layer that most platform teams underestimate: the GPU metrics themselves. The CloudWatch Agent with the NVIDIA DCGM plugin exports metrics such as &lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt;, &lt;code&gt;DCGM_FI_DEV_FB_USED&lt;/code&gt; (framebuffer/VRAM used), and &lt;code&gt;DCGM_FI_DEV_MEM_COPY_UTIL&lt;/code&gt; directly to CloudWatch or to your EKS cluster's Prometheus. I configure alerts at three levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VRAM utilization &amp;gt; 85%&lt;/strong&gt;: indicates the model is approaching memory limits. If this happens consistently, it means the model or batch size needs adjustment, or the instance is being under-specified for the real workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU utilization &amp;lt; 30% for &amp;gt; 5 minutes during peak hours&lt;/strong&gt;: indicates the bottleneck is elsewhere — frequently in prompt pre-processing, S3 asset download, or the SQS queue. This is a signal that the architecture has a non-GPU bottleneck that is wasting expensive capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job p99 latency &amp;gt; 2x p50&lt;/strong&gt;: indicates long-tail latency, often caused by jobs that exceed available VRAM and partially swap to CPU. In production, I use a &lt;strong&gt;job latency SLO with burn rate alert&lt;/strong&gt; in CloudWatch: if the SLO violation rate (jobs &amp;gt; 90s) exceeds 5% in a 1-hour window, an alarm fires and Karpenter is authorized to provision additional nodes.&lt;/p&gt;

&lt;p&gt;For end-to-end traceability, I use &lt;strong&gt;OpenTelemetry with the AWS Distro&lt;/strong&gt; in the inference pod, emitting spans for each step: model download, prompt tokenization, forward pass, frame decode, output upload. This allows identifying exactly where time is being spent in a slow job — information that is essential for optimization and for responding to SLO incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Patterns I Have Seen in Production with GPU Inference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Using a single large G7e instance (e.g., g7e.48xlarge with 8 GPUs) instead of multiple smaller instances (g7e.xlarge with 1 GPU): for video inference, 8 GPUs on one node do not help if each job uses 1 GPU — you only increase the blast radius of a node failure and reduce resilience.&lt;/li&gt;
&lt;li&gt;Not configuring PodDisruptionBudget: EKS node updates without PDB can interrupt in-progress inference jobs, resulting in partially completed jobs that need reprocessing — double cost and SLO violation.&lt;/li&gt;
&lt;li&gt;Storing model weights on EBS instead of /dev/shm or EFS with local cache: model cold start via EBS gp3 can be 3–5x slower than via /dev/shm for models &amp;gt; 5 GB, especially with multiple pods initializing simultaneously.&lt;/li&gt;
&lt;li&gt;Using Reserved Instances for GPU inference capacity: video inference workloads have unpredictable spikes. Savings Plans with On-Demand fallback via Karpenter offer a better balance between cost and flexibility than 1- or 3-year RIs.&lt;/li&gt;
&lt;li&gt;Ignoring the absence of MIG on L40S and attempting multi-tenancy via Kubernetes namespaces: namespaces do not provide GPU isolation — two pods in different namespaces on the same node share the GPU without hardware isolation, which is unacceptable in regulated financial environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My Curation Note:&lt;/strong&gt; If I were implementing generative video inference today in a financial environment, I would start with Bedrock to validate the business model and SLOs, and only migrate to G7e when daily volume justified the permanent warm pool — typically above 300–500 clips/day. The hardest lesson I have learned in GPU workloads is that the real cost is not in the instance hourly price, but in the idle capacity cost of an oversized warm pool and the reprocessing cost of jobs interrupted by missing PDBs. The absence of MIG on the L40S is an NVIDIA design decision with direct architectural consequences for financial multi-tenancy — it is not a configuration detail, it is a force that must appear in your ADR.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Verdict: When to Adopt G7e and When Not To
&lt;/h2&gt;

&lt;p&gt;EC2 G7e with L40S is the correct choice for production generative video inference when three conditions are simultaneously true: (1) the workload requires video models with &amp;gt; 24 GB of VRAM or latency SLOs &amp;lt; 90s for 720p+ clips, (2) daily volume is sufficient to justify a permanent warm pool (&amp;gt; 300 clips/day as an initial heuristic), and (3) the multi-tenancy model is per-instance, not per-shared-GPU. If any of these conditions is not met, managed Bedrock or G5/G6 with aggressive quantization are more defensible choices. The absence of MIG is the most underestimated limiting factor of this instance for financial environments — put it explicitly in your ADR and model the per-instance isolation cost before committing to the architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rating:&lt;/strong&gt; Adopt with conditions / Adotar com condi&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/" rel="noopener noreferrer"&gt;Amazon EC2 G7e Instances – AWS News Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://resources.nvidia.com/en-us-l40s" rel="noopener noreferrer"&gt;NVIDIA L40S GPU Architecture Whitepaper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://karpenter.sh/docs/concepts/nodepools/" rel="noopener noreferrer"&gt;Karpenter NodePool Configuration – AWS Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html" rel="noopener noreferrer"&gt;IAM Roles for Service Accounts (IRSA) – EKS Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;NVIDIA DCGM Exporter for Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html" rel="noopener noreferrer"&gt;AWS Step Functions – Error Handling and Retries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html" rel="noopener noreferrer"&gt;Amazon Bedrock – Video Generation Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework – Machine Learning Lens&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fernando.moretes.com/blog/inferencia-de-video-com-ia-generativa-e-gpus-na-aws" rel="noopener noreferrer"&gt;fernando.moretes.com&lt;/a&gt;. By Fernando F. Azevedo — Senior Solutions Architect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>ec2g7e</category>
      <category>gpuinference</category>
      <category>generativeai</category>
    </item>
  </channel>
</rss>
