<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yoganand Govind</title>
    <description>The latest articles on DEV Community by Yoganand Govind (@yogieee).</description>
    <link>https://dev.to/yogieee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957068%2Fedb7e6c7-a634-48db-b78e-8480de082fc0.jpeg</url>
      <title>DEV Community: Yoganand Govind</title>
      <link>https://dev.to/yogieee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yogieee"/>
    <language>en</language>
    <item>
      <title>DynamoDB Single-Table Design for Multi-Tenant SaaS</title>
      <dc:creator>Yoganand Govind</dc:creator>
      <pubDate>Sun, 21 Jun 2026 21:38:04 +0000</pubDate>
      <link>https://dev.to/yogieee/dynamodb-single-table-design-for-multi-tenant-saas-3767</link>
      <guid>https://dev.to/yogieee/dynamodb-single-table-design-for-multi-tenant-saas-3767</guid>
      <description>&lt;p&gt;&lt;strong&gt;The DynamoDB Table Behind Autowired.ai — Single-Table Design for Multi-Tenant SaaS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DynamoDB single-table design has a reputation for being hard to learn and easy to get wrong. That reputation is deserved.&lt;/p&gt;

&lt;p&gt;The AWS documentation makes it sound mechanical: define your access patterns, map them to partition keys and sort keys, add GSIs where needed, and done. What it doesn't tell you is that the decisions you make before your table has a single item are effectively permanent. Changing a partition key structure in a table with production data is a migration with backfill, dual-write phases, and cutover risk. You don't iterate on the DynamoDB schema the way you iterate on application code.&lt;/p&gt;

&lt;p&gt;This post is the data model behind &lt;a href="//Autowired.ai"&gt;Autowired.ai&lt;/a&gt; – the specific key patterns, the three GSIs and why each one exists, and the decisions I'd make differently if I started over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Domain and Why Single-Table Made Sense&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entity hierarchy for Autowired.ai:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tenant
  └── Project
       └── Workflow (extraction schema + confidence thresholds)
            └── Batch (submitted document set)
                 └── Document (file + extraction result)

User (linked to Tenant)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most operations are tenant-scoped. The access patterns are predictable and unlikely to change fundamentally — list projects for a tenant, list workflows for a project sorted by date, get a batch by ID, list documents for a batch, and filter by processing status.&lt;/p&gt;

&lt;p&gt;A single-table design made sense here for the same reason it makes sense for most serverless SaaS at this scale: one table, one set of CloudWatch metrics, one backup configuration, one IAM policy. PAY_PER_REQUEST billing means you're not pre-provisioning capacity across multiple tables and paying for idle throughput on the ones that don't get hit often.&lt;/p&gt;

&lt;p&gt;The tradeoff is real, though: no ad hoc queries, no schema evolution without migration, and a data model that only makes sense if you document it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary Key Design: Tenant Isolation by Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PK/SK pattern follows the entity hierarchy directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Entity       PK                                    SK
──────────────────────────────────────────────────────────────────────
Tenant       TENANT#&amp;lt;tenantId&amp;gt;                     METADATA
User         USER#&amp;lt;userId&amp;gt;                         PROFILE
User↔Tenant  TENANT#&amp;lt;tenantId&amp;gt;                     USER#&amp;lt;userId&amp;gt;
Project      TENANT#&amp;lt;tenantId&amp;gt;                     PROJECT#&amp;lt;projectId&amp;gt;
Workflow     TENANT#&amp;lt;tenantId&amp;gt;#PROJECT#&amp;lt;projectId&amp;gt; WORKFLOW#&amp;lt;workflowId&amp;gt;
Batch        TENANT#&amp;lt;tenantId&amp;gt;#PROJECT#&amp;lt;projectId&amp;gt; BATCH#&amp;lt;batchId&amp;gt;
Document     TENANT#&amp;lt;tenantId&amp;gt;#BATCH#&amp;lt;batchId&amp;gt;     DOCUMENT#&amp;lt;documentId&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compound PK (&lt;em&gt;TENANT#&amp;lt;tenantId&amp;gt;#PROJECT#&amp;lt;projectId&amp;gt;&lt;/em&gt;) is the decision that matters most here.&lt;/p&gt;

&lt;p&gt;Tenant isolation is structural, not advisory. Because &lt;em&gt;tenantId&lt;/em&gt; is embedded in every tenant-scoped PK, and a DynamoDB Query reads exactly one partition key value, a query built for one tenant cannot return another tenant's items. There's no middleware to configure, no application-layer filter to apply consistently, and no trust that every code path remembered to add the right &lt;em&gt;FilterExpression&lt;/em&gt;. The key structure enforces it.&lt;/p&gt;
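&lt;p&gt;One way to keep that guarantee in application code is to build every key through helpers that refuse to run without the required IDs. A minimal sketch, assuming the key formats from the table above; &lt;em&gt;requireId&lt;/em&gt; and &lt;em&gt;keys&lt;/em&gt; are illustrative names, not taken from the Autowired.ai codebase:&lt;/p&gt;

```javascript
// Hypothetical key builders mirroring the PK/SK table above.
// requireId and keys are illustrative names, not from the original codebase.
function requireId(name, value) {
  if (typeof value !== "string" || value.length === 0) {
    throw new Error(`Missing ${name}`);
  }
  return value;
}

const keys = {
  tenant(tenantId) {
    return { PK: `TENANT#${requireId("tenantId", tenantId)}`, SK: "METADATA" };
  },
  project(tenantId, projectId) {
    return {
      PK: `TENANT#${requireId("tenantId", tenantId)}`,
      SK: `PROJECT#${requireId("projectId", projectId)}`,
    };
  },
  workflow(tenantId, projectId, workflowId) {
    return {
      PK: `TENANT#${requireId("tenantId", tenantId)}#PROJECT#${requireId("projectId", projectId)}`,
      SK: `WORKFLOW#${requireId("workflowId", workflowId)}`,
    };
  },
  document(tenantId, batchId, documentId) {
    return {
      PK: `TENANT#${requireId("tenantId", tenantId)}#BATCH#${requireId("batchId", batchId)}`,
      SK: `DOCUMENT#${requireId("documentId", documentId)}`,
    };
  },
};
```

&lt;p&gt;With helpers like these, a handler that forgets to pass &lt;em&gt;tenantId&lt;/em&gt; fails loudly before any request reaches DynamoDB, instead of building a malformed key.&lt;/p&gt;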

&lt;p&gt;An API handler listing workflows must supply both &lt;em&gt;tenantId&lt;/em&gt; (from the authenticated session context via Clerk) and &lt;em&gt;projectId&lt;/em&gt; (from the request path). The query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PK = :pk AND begins_with(SK, :prefix)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:pk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`TENANT#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tenantId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;#PROJECT#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:prefix&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;WORKFLOW#&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query returns only the workflows for that tenant and project. The only way to return another tenant's data is to construct the wrong PK, which requires the wrong &lt;em&gt;tenantId&lt;/em&gt; to be in the session, not merely a missing filter clause.&lt;/p&gt;

&lt;p&gt;Each workflow also stores &lt;em&gt;confidenceThreshold&lt;/em&gt; and &lt;em&gt;reviewThreshold&lt;/em&gt; — the two values that control document status transitions in the processing pipeline. A document whose extraction confidence falls below &lt;em&gt;confidenceThreshold&lt;/em&gt; is &lt;em&gt;FAILED&lt;/em&gt;. One that falls between &lt;em&gt;reviewThreshold&lt;/em&gt; and &lt;em&gt;confidenceThreshold&lt;/em&gt; is &lt;em&gt;REVIEW_REQUIRED&lt;/em&gt;. These thresholds are per workflow, stored in the workflow item, and passed through to Step Functions at batch execution time.&lt;/p&gt;
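&lt;p&gt;Read together, the two rules imply that &lt;em&gt;reviewThreshold&lt;/em&gt; sits above &lt;em&gt;confidenceThreshold&lt;/em&gt;. That ordering is my inference, not something the pipeline code confirms. A minimal sketch of the status decision under that assumption, with a hypothetical function name:&lt;/p&gt;

```javascript
// Sketch of the status decision. Assumes reviewThreshold >= confidenceThreshold,
// which is inferred from the description and not confirmed by the source.
function classifyDocument(confidence, workflow) {
  const { confidenceThreshold, reviewThreshold } = workflow;
  if (confidence >= reviewThreshold) {
    return "SUCCEEDED";
  }
  if (confidence >= confidenceThreshold) {
    return "REVIEW_REQUIRED"; // extraction worked, but a human should check it
  }
  return "FAILED";
}
```

&lt;p&gt;If the real pipeline orders the thresholds the other way round, the comparisons swap but the three-way split stays the same.&lt;/p&gt;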

&lt;p&gt;&lt;strong&gt;Three GSIs, Three Distinct Problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main table's PK structure handles hierarchical queries well. For everything else, there are GSIs. Each one exists for a specific reason.&lt;/p&gt;

&lt;p&gt;GSI1 — Email Lookup + Date-Sorted Workflow Listing&lt;/p&gt;

&lt;p&gt;GSI1 solves two access patterns that the primary key can't:&lt;/p&gt;

&lt;p&gt;User lookup by email. Clerk manages auth, but some operations (invitations, notifications, and admin lookups) need to find a user record by email rather than by Clerk's &lt;em&gt;userId&lt;/em&gt;. The main table keys users by &lt;em&gt;userId&lt;/em&gt;. GSI1 maps &lt;em&gt;email&lt;/em&gt; to &lt;em&gt;GSI1PK&lt;/em&gt; and &lt;em&gt;USER#&amp;lt;userId&amp;gt;&lt;/em&gt; to &lt;em&gt;GSI1SK&lt;/em&gt;, giving &lt;strong&gt;O(1)&lt;/strong&gt; lookup by email.&lt;/p&gt;
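&lt;p&gt;The email lookup could be expressed as a Query input like the one below. This is a sketch: it assumes &lt;em&gt;GSI1PK&lt;/em&gt; stores the lower-cased email with no prefix, which the real table may not do, and &lt;em&gt;buildEmailLookupQuery&lt;/em&gt; is an illustrative name.&lt;/p&gt;

```javascript
// Builds the Query input for the email lookup on GSI1.
// Assumes GSI1PK holds the lower-cased email with no prefix; the real table may differ.
function buildEmailLookupQuery(tableName, email) {
  return {
    TableName: tableName,
    IndexName: "GSI1",
    KeyConditionExpression: "GSI1PK = :email AND begins_with(GSI1SK, :prefix)",
    ExpressionAttributeValues: {
      ":email": email.trim().toLowerCase(),
      ":prefix": "USER#",
    },
    Limit: 1, // an email should map to at most one user item
  };
}
```

&lt;p&gt;GSI reads are eventually consistent, so a user written a moment ago may not appear in this lookup immediately.&lt;/p&gt;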

&lt;p&gt;Workflows sorted by last updated date. The frontend shows workflows in reverse chronological order. The PK query can return all workflows for a project, but only in &lt;em&gt;SK&lt;/em&gt; order (by &lt;em&gt;workflowId&lt;/em&gt;), not by date. GSI1 maps &lt;em&gt;TENANT#&amp;lt;tenantId&amp;gt;#PROJECT#&amp;lt;projectId&amp;gt;&lt;/em&gt; to &lt;em&gt;GSI1PK&lt;/em&gt; and an ISO 8601 timestamp to &lt;em&gt;GSI1SK&lt;/em&gt;, enabling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;IndexName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GSI1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GSI1PK = :pk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:pk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`TENANT#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tenantId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;#PROJECT#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// most recently updated first&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One GSI serving two patterns works here because the two item types use different &lt;em&gt;GSI1PK&lt;/em&gt; formats — there's no collision. The downside: it's not obvious why GSI1 exists when you read the CDK definition without comments. Document the patterns each GSI serves, not just the attribute names.&lt;/p&gt;

&lt;p&gt;GSI2 — Status-Based Filtering (Processing Pipeline)&lt;/p&gt;

&lt;p&gt;GSI2 exists entirely for the processing pipeline — the Step Functions state machine and Lambdas that update document status.&lt;/p&gt;

&lt;p&gt;The pipeline needs to ask, "What documents in this batch are in &lt;em&gt;PROCESSING&lt;/em&gt; status?" and "What batches for this tenant are &lt;em&gt;FAILED&lt;/em&gt;?" Neither query is possible from the main table's PK structure.&lt;/p&gt;

&lt;p&gt;GSI2 composite keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For documents:  GSI2PK = TENANT#&amp;lt;tenantId&amp;gt;#BATCH#&amp;lt;batchId&amp;gt;#STATUS#&amp;lt;status&amp;gt;
For batches:    GSI2PK = TENANT#&amp;lt;tenantId&amp;gt;#STATUS#&amp;lt;status&amp;gt;
                GSI2SK = &amp;lt;ISO 8601 timestamp&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;REVIEW_REQUIRED&lt;/em&gt; is a first-class status here — not a variant of &lt;em&gt;FAILED&lt;/em&gt;. It represents documents where the Bedrock extraction completed successfully but the combined confidence score fell below the &lt;em&gt;reviewThreshold&lt;/em&gt;. These need human review, not reprocessing. Conflating them with &lt;em&gt;FAILED&lt;/em&gt; would make it impossible to route them correctly.&lt;/p&gt;

&lt;p&gt;The sparse index pattern: Only documents in interesting statuses (&lt;em&gt;PROCESSING, FAILED, REVIEW_REQUIRED&lt;/em&gt;) write GSI2 attributes. Completed items have their GSI2 attributes removed after a retention window. Items that don't write GSI2 attributes simply don't appear in the index.&lt;/p&gt;

&lt;p&gt;This keeps GSI2 lean. In a system processing thousands of documents per day, the vast majority will be in terminal &lt;em&gt;SUCCEEDED&lt;/em&gt; status. If every document item included GSI2 attributes, the index would be orders of magnitude larger than necessary. Sparse index design means the GSI contains only the items you'd actually query.&lt;/p&gt;
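&lt;p&gt;The sparse pattern comes down to writing &lt;em&gt;GSI2PK&lt;/em&gt; and &lt;em&gt;GSI2SK&lt;/em&gt; only for the interesting statuses and removing them otherwise. A minimal sketch of the update fragments: the function name and the &lt;em&gt;status&lt;/em&gt; attribute name are assumptions, and unlike the retention window described above, it drops the index attributes as soon as a document leaves an indexed status.&lt;/p&gt;

```javascript
// Statuses that should appear in GSI2; everything else stays out of the index.
const INDEXED_STATUSES = new Set(["PROCESSING", "FAILED", "REVIEW_REQUIRED"]);

// Returns the UpdateItem fragments for a document status change.
// buildStatusUpdate and the "status" attribute name are illustrative assumptions.
function buildStatusUpdate(tenantId, batchId, status, nowIso) {
  if (INDEXED_STATUSES.has(status)) {
    return {
      UpdateExpression: "SET #status = :status, GSI2PK = :gsi2pk, GSI2SK = :gsi2sk",
      ExpressionAttributeNames: { "#status": "status" }, // STATUS is a reserved word
      ExpressionAttributeValues: {
        ":status": status,
        ":gsi2pk": `TENANT#${tenantId}#BATCH#${batchId}#STATUS#${status}`,
        ":gsi2sk": nowIso,
      },
    };
  }
  // Removing both index key attributes drops the item from the sparse index.
  return {
    UpdateExpression: "SET #status = :status REMOVE GSI2PK, GSI2SK",
    ExpressionAttributeNames: { "#status": "status" },
    ExpressionAttributeValues: { ":status": status },
  };
}
```

&lt;p&gt;An item only appears in a GSI when it carries the index key attributes, which is why the &lt;em&gt;REMOVE&lt;/em&gt; clause is enough to take it out.&lt;/p&gt;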

&lt;p&gt;GSI3 — Direct Batch Lookup by &lt;em&gt;batchId&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GSI3 exists to solve a coupling problem.&lt;/p&gt;

&lt;p&gt;The Step Functions state machine receives a &lt;em&gt;batchId&lt;/em&gt; in its execution input. That's it. It doesn't receive &lt;em&gt;tenantId&lt;/em&gt; or &lt;em&gt;projectId&lt;/em&gt;, which means it can't construct the partition key &lt;em&gt;TENANT#&amp;lt;tenantId&amp;gt;#PROJECT#&amp;lt;projectId&amp;gt;&lt;/em&gt; needed to query the main table.&lt;/p&gt;

&lt;p&gt;Without GSI3, the state machine would need to carry the full PK context in every execution payload, coupling the Step Functions definition to the DynamoDB key structure. Any change to the key structure would require updating the state machine input format.&lt;/p&gt;

&lt;p&gt;GSI3 maps &lt;em&gt;batchId&lt;/em&gt; to &lt;em&gt;GSI3PK&lt;/em&gt; with a fixed &lt;em&gt;GSI3SK = "BATCH"&lt;/em&gt; for point lookups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;IndexName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GSI3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GSI3PK = :batchId AND GSI3SK = :sk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:batchId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;batchId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:sk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;BATCH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state machine calls this lookup once at initialization, gets the full batch record (including &lt;em&gt;tenantId, projectId, workflowId, thresholds&lt;/em&gt;), and then uses the main table PK for all subsequent operations.&lt;/p&gt;

&lt;p&gt;This pattern — a GSI that gives a downstream consumer its native lookup identifier rather than forcing it to understand the data model — comes up repeatedly in event-driven architectures. The processing pipeline is a consumer of DynamoDB, not an owner of the key structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Diagram&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffqnucv5hrma6dh5f6syo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffqnucv5hrma6dh5f6syo.png" alt="diagram" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Patterns Worth Mentioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Extraction results in DynamoDB. Each Document item holds the full extraction output — field values, per-field confidence scores, and the Bedrock verification result. DynamoDB's 400KB item limit covers most extraction results. For documents with many fields or long extracted values that approach the limit, the payload goes to S3 and the Document item holds the reference key. DynamoDB remains the authoritative record for status and metadata; S3 handles the overflow.&lt;/p&gt;
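&lt;p&gt;The inline-or-S3 decision can be sketched as a size check before the write. The headroom figure, the S3 key layout, and the function name below are assumptions, and the JSON byte length is only an approximation of the stored item size:&lt;/p&gt;

```javascript
// DynamoDB's item size limit is 400 KB, including attribute names.
const ITEM_LIMIT_BYTES = 400 * 1024;
// Headroom for keys, status and other metadata; the 50 KB figure is an assumption.
const METADATA_HEADROOM_BYTES = 50 * 1024;

// Decides whether an extraction result is stored inline or offloaded to S3.
// The s3Key layout is hypothetical; JSON size only approximates DynamoDB item size.
function planExtractionStorage(tenantId, documentId, extractionResult) {
  const payload = JSON.stringify(extractionResult);
  const size = Buffer.byteLength(payload, "utf8");
  if (size > ITEM_LIMIT_BYTES - METADATA_HEADROOM_BYTES) {
    return { mode: "s3", s3Key: `extractions/${tenantId}/${documentId}.json`, size };
  }
  return { mode: "inline", extraction: extractionResult, size };
}
```

&lt;p&gt;Keeping the threshold well below the hard limit leaves room for the item's keys and index attributes.&lt;/p&gt;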

&lt;p&gt;Conditional writes for idempotency. S3 event notifications are at-least-once delivery. The &lt;em&gt;S3IngestionLambda&lt;/em&gt; uses &lt;em&gt;attribute_not_exists(PK)&lt;/em&gt; on every &lt;em&gt;PutItem&lt;/em&gt;. If the document record already exists, DynamoDB rejects the write with a &lt;em&gt;ConditionalCheckFailedException&lt;/em&gt;; the Lambda catches that error, treats it as a duplicate delivery, and exits cleanly. No duplicate records, no reprocessing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putItem&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentRecord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ConditionExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;attribute_not_exists(PK)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
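&lt;p&gt;A failed condition surfaces as a &lt;em&gt;ConditionalCheckFailedException&lt;/em&gt;, so exiting cleanly means catching that one error and letting everything else propagate. A minimal sketch, where &lt;em&gt;putIfAbsent&lt;/em&gt; and the injected &lt;em&gt;putFn&lt;/em&gt; are hypothetical names:&lt;/p&gt;

```javascript
// Wraps a conditional put so that a duplicate delivery is treated as success.
// putFn is any function that performs the PutItem call and rejects on failure.
async function putIfAbsent(putFn, params) {
  try {
    await putFn({ ...params, ConditionExpression: "attribute_not_exists(PK)" });
    return { created: true };
  } catch (err) {
    // AWS SDK for JavaScript v3 sets err.name to the service exception name.
    if (err.name === "ConditionalCheckFailedException") {
      return { created: false }; // record already exists: duplicate S3 event
    }
    throw err; // anything else is a real failure and should surface
  }
}
```

&lt;p&gt;Passing the put function in keeps the wrapper independent of the SDK client, which also makes the duplicate path easy to unit test.&lt;/p&gt;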



&lt;p&gt;&lt;strong&gt;TransactWriteItems for batch initialisation.&lt;/strong&gt; Creating a batch involves writing the batch record, writing all document records, and updating the project's batch count. TransactWriteItems ensures all writes succeed or none do. A partial write – a batch record with no documents or documents with no parent batch – would leave the pipeline in an inconsistent state that's hard to recover from.&lt;/p&gt;
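&lt;p&gt;The transaction described above can be sketched as a pure builder for the &lt;em&gt;TransactItems&lt;/em&gt; array. The &lt;em&gt;batchCount&lt;/em&gt; attribute and the function name are assumptions. The size check reflects the TransactWriteItems limit of 100 actions per request:&lt;/p&gt;

```javascript
// TransactWriteItems accepts at most 100 actions per request.
const MAX_TRANSACT_ITEMS = 100;

// Builds the TransactItems array for batch creation.
// Attribute names such as batchCount are illustrative assumptions.
function buildBatchTransaction(tableName, projectKey, batchItem, documentItems) {
  const puts = [batchItem, ...documentItems].map(function (item) {
    return {
      Put: {
        TableName: tableName,
        Item: item,
        ConditionExpression: "attribute_not_exists(PK)",
      },
    };
  });
  const counter = {
    Update: {
      TableName: tableName,
      Key: projectKey,
      UpdateExpression: "ADD batchCount :one",
      ExpressionAttributeValues: { ":one": 1 },
    },
  };
  const items = [...puts, counter];
  if (items.length > MAX_TRANSACT_ITEMS) {
    throw new Error(`Transaction needs ${items.length} actions; the limit is ${MAX_TRANSACT_ITEMS}`);
  }
  return items;
}
```

&lt;p&gt;With one action for the batch and one for the counter, this shape fits batches of up to 98 documents. Larger batches would need to be chunked, which gives up the all-or-nothing guarantee across chunks.&lt;/p&gt;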

&lt;p&gt;&lt;strong&gt;PITR is non-optional&lt;/strong&gt;. Point-in-time recovery is enabled. DynamoDB holds the authoritative record for extraction results, batch status, workflow configuration, and tenant data. A bug in the processing pipeline corrupting document records is recoverable within 35 days with PITR. Without it, it's not recoverable at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTL for ephemeral data&lt;/strong&gt;. Temporary processing state and short-lived tokens use the &lt;em&gt;ttl&lt;/em&gt; attribute. DynamoDB deletes expired items in a background process, typically within a few days of expiry, so deletion time is approximate and expired items can still show up in reads until they are removed. For compliance use cases requiring exact deletion on a schedule, a Lambda on a cron or explicit deletes are more appropriate than TTL alone.&lt;/p&gt;
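&lt;p&gt;Because deletion is not immediate, readers of ephemeral items usually need their own expiry check. A minimal sketch, assuming the attribute is named &lt;em&gt;ttl&lt;/em&gt; as above; both helper names are illustrative:&lt;/p&gt;

```javascript
// TTL values must be a Number attribute holding epoch time in seconds.
function ttlFromNow(secondsToLive, nowMs = Date.now()) {
  return Math.floor(nowMs / 1000) + secondsToLive;
}

// Expired items can linger until DynamoDB deletes them, so reads should ignore them.
// Assumes the TTL attribute is named "ttl", as in the post.
function isLive(item, nowMs = Date.now()) {
  if (typeof item.ttl !== "number") {
    return true; // no TTL set: the item never expires
  }
  return item.ttl > Math.floor(nowMs / 1000);
}
```

&lt;p&gt;The same comparison can be pushed into a &lt;em&gt;FilterExpression&lt;/em&gt; on queries, though filtering still consumes read capacity for the expired items it discards.&lt;/p&gt;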

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define all access patterns before touching a key. Not 80% — all of them. Access patterns you discover after the fact require GSIs, which have write amplification costs. Access patterns you can't fit into any GSI require scans.&lt;/p&gt;

&lt;p&gt;Encode tenant ownership in the PK. Tenant isolation via key structure is more reliable than any application-layer enforcement. The query physically cannot cross tenant boundaries if &lt;em&gt;tenantId&lt;/em&gt; is in every partition key.&lt;/p&gt;

&lt;p&gt;Sparse indexes are a feature. Only write GSI attributes on items that need to appear in that index. For a processing pipeline where most items are in terminal &lt;em&gt;SUCCEEDED&lt;/em&gt; status, keeping active-status items in a sparse index is a significant operational win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The DynamoDB schema is a public API contract&lt;/strong&gt;. Treat the first design with the same care you'd give a public API. You won't get to iterate on it cheaply.&lt;/p&gt;

</description>
      <category>dynamodb</category>
      <category>aws</category>
      <category>database</category>
      <category>architecture</category>
    </item>
    <item>
      <title>RAG Architecture for SaaS Applications Using Amazon Bedrock</title>
      <dc:creator>Yoganand Govind</dc:creator>
      <pubDate>Sun, 14 Jun 2026 17:53:28 +0000</pubDate>
      <link>https://dev.to/yogieee/rag-architecture-for-saas-applications-using-amazon-bedrock-10df</link>
      <guid>https://dev.to/yogieee/rag-architecture-for-saas-applications-using-amazon-bedrock-10df</guid>
      <description>&lt;p&gt;My first real production batch on Autowired.ai cost 3x what I'd budgeted.&lt;/p&gt;

&lt;p&gt;200 documents. One real customer test. And an AWS bill that showed me exactly how wrong my mental model was.&lt;/p&gt;

&lt;p&gt;The initial architecture felt reasonable: Textract handles OCR, Bedrock extracts all the fields from the OCR output. Simple. Straightforward. And expensive at scale because I was sending every document's full OCR text to a frontier model for every field on every page.&lt;/p&gt;

&lt;p&gt;This post is the story of how I got to ~40% cost reduction. The biggest win wasn't a configuration change. It was rethinking what Bedrock should actually be doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Was Actually Paying For&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before touching anything, I instrumented every Bedrock call to log input tokens, output tokens, model ID, cache hit/miss, and latency. One week of real data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt tokens&lt;/strong&gt; — ~35% of every input. The verification schema, field definitions, and output format instructions are completely static and present in full on every single call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full OCR context&lt;/strong&gt; — I was sending the complete Textract response to Bedrock. For an invoice targeting 10 specific fields, maybe 30% of that OCR content was actually relevant. I was paying for the other 70%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model tier mismatch&lt;/strong&gt; — Claude Sonnet for everything, including structured form extraction where fixed-field invoices have consistent, predictable layouts. Sonnet is ~5x Haiku pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No result caching&lt;/strong&gt; — Documents from the same vendor, same template, same layout — fresh Bedrock call every time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instrument first. Always. You can't optimise what you haven't measured.&lt;/p&gt;
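&lt;p&gt;For Anthropic models, the parsed response body carries a &lt;em&gt;usage&lt;/em&gt; object with token counts, which is enough to build this kind of log. A minimal sketch: the shape of the returned record and the function name are assumptions.&lt;/p&gt;

```javascript
// Summarises one Bedrock response for logging.
// The usage field names follow the Anthropic Messages response format;
// the shape of the returned log record is an assumption.
function summariseUsage(modelId, responseBody, latencyMs) {
  const usage = responseBody.usage || {};
  const cacheRead = usage.cache_read_input_tokens || 0;
  const cacheWrite = usage.cache_creation_input_tokens || 0;
  return {
    modelId,
    inputTokens: usage.input_tokens || 0,
    outputTokens: usage.output_tokens || 0,
    cacheReadTokens: cacheRead,
    cacheWriteTokens: cacheWrite,
    cacheHit: cacheRead > 0,
    latencyMs,
  };
}
```

&lt;p&gt;Emitting one such record per call as structured JSON makes it straightforward to aggregate by model and workflow later.&lt;/p&gt;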

&lt;p&gt;&lt;strong&gt;The Change That Moved the Needle Most&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original flow was: Textract OCR → send full OCR to Bedrock → Bedrock extracts all fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem with this&lt;/strong&gt;: Textract is actually very good at extracting structured fields from well-formatted documents—dates, totals, invoice numbers, and line items from tables. It's purpose-built for this. I was using a foundation model to re-derive values that a specialised OCR service had already extracted correctly.&lt;/p&gt;

&lt;p&gt;The new architecture has three stages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s8s3ouo1syamqijcdlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s8s3ouo1syamqijcdlj.png" alt="New architecture" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of sending a full 3-page invoice OCR to Bedrock for all 10 fields, you're sending the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Targeted OCR sections for the 2–3 fields Textract couldn't get (gap-fill call)&lt;/li&gt;
&lt;li&gt;The combined extraction result for validation (verification call)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For structured documents like standard invoices, Textract reliably gets 70–80% of target fields. Bedrock only handles the hard cases and confirms the final output. Token count drops significantly.&lt;/p&gt;

&lt;p&gt;For complex unstructured documents — contracts, freeform text — Textract's confidence is lower across more fields, so Bedrock handles more. The architecture adapts naturally based on Textract's per-field confidence scores.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This single shift was the largest cost reduction of everything I did.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
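&lt;p&gt;The routing between Textract and Bedrock described above can be sketched as a split on per-field confidence. The field shape, the default threshold of 90, and the function name are illustrative; Textract reports confidence on a 0 to 100 scale.&lt;/p&gt;

```javascript
// Splits target fields into those Textract extracted confidently and those left for Bedrock.
// The field shape and the default threshold of 90 are illustrative assumptions.
function splitByConfidence(targetFields, textractFields, minConfidence = 90) {
  const accepted = {};
  const missing = [];
  for (const name of targetFields) {
    const field = textractFields[name];
    const confident = field !== undefined ? field.confidence >= minConfidence : false;
    if (confident) {
      accepted[name] = field.value;
    } else {
      missing.push(name); // goes to the gap-fill call
    }
  }
  return { accepted, missing };
}
```

&lt;p&gt;An empty &lt;em&gt;missing&lt;/em&gt; list means the gap-fill call can be skipped entirely and only verification runs.&lt;/p&gt;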

&lt;p&gt;&lt;strong&gt;Prompt Caching: The Fastest Configuration Win&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both the gap-fill and verification system prompts are static per workflow type. The field schema, output format, confidence rules — none of it changes between documents in a batch.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock supports prompt caching on certain models. Marking the system prompt block as cacheable means the first call in a parallel batch wave pays the cache write cost (~25% premium on that portion), and every subsequent call in the same 5-minute window hits the cache at ~10% of the normal input token price.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
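&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const bedrockClient = new BedrockRuntimeClient({});

const response = await bedrockClient.send(
  new InvokeModelCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      // Required by Bedrock for Anthropic Messages API requests.
      anthropic_version: "bedrock-2023-05-31",
      system: [
        {
          type: "text",
          text: systemPrompt,
          // Marks the static system prompt as a cacheable prefix.
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [
        {
          role: "user",
          // Only the per-document context varies between calls.
          content: [{ type: "text", text: documentContext }],
        },
      ],
      max_tokens: 1024,
    }),
  })
);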
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
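&lt;p&gt;One caveat for parallel waves: as I understand Anthropic's prompt caching documentation, a cache entry only becomes available once the first response has started, so requests fired at the same instant may all miss. A hedged sketch of one way around that is to run the first document alone and then fan out. &lt;em&gt;runWave&lt;/em&gt; and &lt;em&gt;extractFn&lt;/em&gt; are hypothetical names:&lt;/p&gt;

```javascript
// Runs the first document alone so the cache entry exists, then the rest in parallel.
// extractFn is a hypothetical function that makes one Bedrock call per document.
async function runWave(documents, extractFn) {
  if (documents.length === 0) {
    return [];
  }
  const [first, ...rest] = documents;
  const firstResult = await extractFn(first); // pays the cache write
  const restResults = await Promise.all(rest.map(function (doc) {
    return extractFn(doc); // expected to read the cached system prompt
  }));
  return [firstResult, ...restResults];
}
```

&lt;p&gt;The cost is one call's worth of extra latency per wave, which is usually small next to the savings on the remaining calls.&lt;/p&gt;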



&lt;p&gt;For a 10-document parallel batch wave, the system prompt is written to the cache once and the other nine calls read it from cache. At 1,100-token system prompts across hundreds of documents per batch, this adds up.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Measured impact: ~20% reduction in input token costs for standard batch workloads.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
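&lt;p&gt;The arithmetic behind the cached portion is simple enough to check by hand. The sketch below uses the multipliers quoted earlier (a ~25% premium on the write, ~10% of the normal price on reads) and covers only the system prompt tokens, not the whole bill; the function name is illustrative.&lt;/p&gt;

```javascript
// Relative cost of the system-prompt tokens across a wave, versus no caching.
// Uses the multipliers quoted above: 1.25x for the cache write, 0.10x for each read.
function cachedPromptCostRatio(callsInWave, writeMultiplier = 1.25, readMultiplier = 0.1) {
  if (callsInWave === 0) {
    return 1; // nothing to cache, nothing saved
  }
  const cached = writeMultiplier + (callsInWave - 1) * readMultiplier;
  return cached / callsInWave;
}
```

&lt;p&gt;For a ten-call wave this gives (1.25 + 9 × 0.10) / 10, roughly 0.215 of the uncached system prompt cost. A single uncached-style call costs 1.25, so caching only pays off once a wave has at least two calls inside the cache window.&lt;/p&gt;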

&lt;p&gt;&lt;strong&gt;Sending the Right Context, Not All the Context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the gap-fill call, you don't need the full Textract output — just the OCR blocks covering the fields Textract couldn't extract. Textract returns confidence scores per field and per block, so you can filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
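&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Assumed shape of the normalised Textract output (not the raw Textract API response).
interface OcrBlock { text: string; fieldHints: string[]; }
interface TextractOutput { blocks: OcrBlock[]; }

// Placeholder relevance check (an assumption): a block is relevant if it is tagged
// with one of the missing fields.
function isBlockRelevantToFields(block: OcrBlock, missingFields: string[]): boolean {
  return block.fieldHints.some(hint =&gt; missingFields.includes(hint));
}

// Keeps only the OCR text covering fields Textract could not extract.
function filterOcrForMissingFields(
  textractResult: TextractOutput,
  missingFields: string[]
): string {
  return textractResult.blocks
    .filter(block =&gt; isBlockRelevantToFields(block, missingFields))
    .map(block =&gt; block.text)
    .join("\n");
}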
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the verification call, you don't need the raw OCR at all — you need the combined extraction result (Textract values + gap-fill values) structured as a validation input. Sending the full OCR here is wasted tokens.&lt;/p&gt;

&lt;p&gt;I also audited both system prompts. After removing redundant formatting instructions already encoded in the output schema, verbose field descriptions that could be tightened, and defensive edge-case handling that never triggered, each went from ~2,400 tokens to ~1,100 tokens. Zero accuracy impact.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Measured impact: ~15% reduction in per-invocation input token count.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Testing the Smaller Model — Per Task Type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not per system. Per task type.&lt;/p&gt;

&lt;p&gt;The gap-fill task and the verification task have different complexity profiles. Gap-fill on structured forms (filling in 2–3 missing fields from a standard invoice) is a simpler task than gap-fill on an unstructured contract. Verification (validating already-extracted values) is generally easier than deriving them from raw text.&lt;/p&gt;

&lt;p&gt;I tested Haiku vs. Sonnet on 50 representative documents per workflow type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured form gap-fill + verification&lt;/strong&gt;: Haiku within 2% accuracy of Sonnet. Switched to Haiku. ~5x cost reduction on these workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured document gap-fill&lt;/strong&gt;: Haiku 8–12% less accurate than Sonnet. Kept Sonnet. The quality gap mattered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification on structured forms&lt;/strong&gt;: Haiku performed well — validating extracted values is easier than extracting them. Switched.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model selection is now a per-workflow config, not a system-wide setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
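&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Picks the Bedrock model for a workflow. Structured workflows use Haiku for both tasks.
// `task` is accepted so per-task rules can be added later; it does not affect the result yet.
function selectModelForTask(workflow: Workflow, task: "gap-fill" | "verify"): string {
  if (workflow.complexityTier === "structured") {
    return "anthropic.claude-3-haiku-20240307-v1:0";
  }
  // Unstructured workflows keep Sonnet, where the accuracy gap mattered.
  return "anthropic.claude-3-5-sonnet-20241022-v2:0";
}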
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Measured impact: ~15% overall cost reduction, concentrated in high-volume structured extraction workflows.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Application-Layer Result Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond Bedrock's 5-minute prompt cache, I added document-level result caching at the application layer.&lt;/p&gt;

&lt;p&gt;Cache key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extraction schema ID + version hash&lt;/li&gt;
&lt;li&gt;Hash of normalised Textract output (whitespace-normalised)&lt;/li&gt;
&lt;li&gt;Confidence threshold settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cache hit → return the stored extraction result. No gap-fill call, no verification call.&lt;/p&gt;

&lt;p&gt;In production, cache hit rates for first-run batches are low. But during schema development — when you're running the same 20-document test set 5–10 times as you tune field definitions — the cache eliminates 80–90% of Bedrock calls.&lt;/p&gt;

&lt;p&gt;Cache storage: DynamoDB with a 24-hour TTL. A configuration hash in the key means any schema change automatically invalidates affected documents.&lt;/p&gt;
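
&lt;p&gt;A minimal sketch of how such a key could be built, assuming Node's built-in &lt;em&gt;crypto&lt;/em&gt; module. The &lt;em&gt;CacheKeyInput&lt;/em&gt; shape, the function names, and the "|" separator are assumptions made for illustration; the article only specifies which ingredients go into the key.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of the inputs that make up the cache key.
interface CacheKeyInput {
  schemaId: string;
  schemaVersionHash: string;
  textractText: string;
  confidenceThreshold: number;
}

// Collapse runs of whitespace so cosmetic OCR differences do not change the hash.
function normaliseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

function sha256(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

// Any change to the schema version, the document text, or the threshold yields a new key.
function buildCacheKey(input: CacheKeyInput): string {
  const parts = [
    input.schemaId,
    input.schemaVersionHash,
    sha256(normaliseWhitespace(input.textractText)),
    input.confidenceThreshold.toFixed(2),
  ];
  return sha256(parts.join("|"));
}

// DynamoDB TTL attributes hold an expiry time in epoch seconds.
function ttlEpochSeconds(nowMs: number, ttlHours: number): number {
  return Math.floor(nowMs / 1000) + ttlHours * 3600;
}
```

&lt;p&gt;Because the schema version hash is part of the key, a schema change needs no explicit invalidation step: old entries are never looked up again and simply expire through the TTL.&lt;/p&gt;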

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Measured impact: 5–30% cost reduction, highly workload-dependent. Highest during schema development.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The Architecture Diagram&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb2w0lto2yj1c7foxox8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb2w0lto2yj1c7foxox8.png" alt="Architecture diagram" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four layers: application cache (skip Bedrock entirely on repeat docs) → Bedrock prompt cache (cheap cache reads on static system prompts) → model selection (Haiku vs Sonnet per workflow type and task) → context filtering (gap-fill only for missing fields, structured input for verification).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Didn't Work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggressive prompt compression&lt;/strong&gt;. I stripped whitespace and punctuation from system prompts to reduce token count. Accuracy degraded measurably. Foundation models are trained on well-formatted text; stripping formatting works against the training distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared universal prompt across workflow types&lt;/strong&gt;. I tried building one system prompt with conditional sections, cacheable as a single prefix. Engineering complexity was high, the cache hit rate was lower than expected (different workflow types rarely executed close enough in time to share a cache window), and accuracy dropped on edge cases. Reverted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output post-processing to compress tokens&lt;/strong&gt;. Asking the model to output abbreviated values and expanding them in Lambda saved some output tokens, but it added execution time and application complexity. Net cost difference: negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 40% isn't one clever trick. It's an architectural shift plus four boring, measurable optimisations applied to what was left:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture first&lt;/strong&gt;: Textract for what it handles well, Bedrock only for gap-fill and verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt caching&lt;/strong&gt;: One field change, immediate impact on batch workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context filtering&lt;/strong&gt;: Send the right OCR to Bedrock, not all of it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model tiering&lt;/strong&gt;: Test per task type, not per system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result caching&lt;/strong&gt;: Highest impact during schema development&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're running production Bedrock workloads and haven't measured where your tokens are going — start there. The data will tell you which of these applies.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>aws</category>
      <category>infrastructure</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built the Document Processing Pipeline Behind Autowired.ai: S3, Lambda, Step Functions, and SQS</title>
      <dc:creator>Yoganand Govind</dc:creator>
      <pubDate>Sun, 07 Jun 2026 10:54:13 +0000</pubDate>
      <link>https://dev.to/yogieee/how-i-built-the-document-processing-pipeline-behind-autowiredai-s3-lambda-step-functions-and-3kfn</link>
      <guid>https://dev.to/yogieee/how-i-built-the-document-processing-pipeline-behind-autowiredai-s3-lambda-step-functions-and-3kfn</guid>
      <description>&lt;p&gt;Early in building &lt;a href="//autowired.ai"&gt;Autowired.ai&lt;/a&gt;, I wired up a simple flow: user submits a batch → API calls Textract → API calls Bedrock → API returns results.&lt;/p&gt;

&lt;p&gt;It worked perfectly. For one document.&lt;/p&gt;

&lt;p&gt;The moment I tested with 20 documents, the Lambda timed out. The moment I tested with a corrupted PDF, the entire batch failed. The moment I imagined what happens when Bedrock throttles mid-batch, I realised I had built a pipeline that would fall apart in production in at least five different ways.&lt;/p&gt;

&lt;p&gt;So I scrapped it and rebuilt the whole thing async. This post is what I landed on — and more importantly, why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Insight: The API Should Not Touch the Processing&lt;/strong&gt;&lt;br&gt;
The first architectural shift was the most important one: &lt;strong&gt;the API has no role in document processing&lt;/strong&gt;. Its only job is to accept the batch submission, write the initial records to DynamoDB, return a presigned S3 URL, and send a &lt;em&gt;202 Accepted&lt;/em&gt; back to the client.&lt;/p&gt;

&lt;p&gt;That's it. The API is done.&lt;/p&gt;

&lt;p&gt;Everything that follows – OCR, AI extraction, status tracking, and webhook delivery – happens completely independently, triggered by the document landing in S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom1dod0oueruzzustmer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom1dod0oueruzzustmer.png" alt="the flow" width="800" height="726"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why S3 as the trigger instead of having the API start the Step Functions execution directly?&lt;/p&gt;

&lt;p&gt;Because the S3 event path retries on its own. S3 invokes the ingestion Lambda asynchronously, so if Step Functions has a transient hiccup when the document uploads and the Lambda fails, Lambda queues the event and retries the invocation. The API already returned &lt;em&gt;202&lt;/em&gt;, so the client doesn't know or care. Processing starts once a retry succeeds.&lt;/p&gt;

&lt;p&gt;If the API triggered Step Functions directly, a transient Step Functions error at submission time would require the client to retry the entire batch upload. That's a much worse failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Gotcha I Hit With S3 Event Filters&lt;/strong&gt;&lt;br&gt;
S3 event notification suffix filters are &lt;strong&gt;case-sensitive&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I set up filters for &lt;em&gt;.pdf, .png, .jpg, .jpeg, .tiff&lt;/em&gt; and wondered why some customer uploads weren't triggering processing. Turns out files uploaded from Windows frequently come in as &lt;em&gt;.PDF, .JPG, .TIFF&lt;/em&gt;.&lt;/p&gt;

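&lt;p&gt;The fix is to register both the lowercase and the uppercase variant of every extension (mixed-case suffixes such as &lt;em&gt;.Pdf&lt;/em&gt; would still need their own entry):&lt;br&gt;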
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;supportedExtensions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.PDF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.png&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.PNG&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.jpg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.JPG&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.jpeg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.JPEG&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.tiff&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.TIFF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.tif&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.TIF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ext&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;supportedExtensions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;documentsBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventNotification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;EventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OBJECT_CREATED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;s3n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LambdaDestination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3IngestionLambda&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;s3-ingestion/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ext&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not glamorous. But the kind of thing that burns you in production if you don't know it.&lt;/p&gt;

&lt;p&gt;Also, S3 event notifications are &lt;strong&gt;at-least-once&lt;/strong&gt;, not exactly-once. &lt;em&gt;S3IngestionLambda&lt;/em&gt; checks DynamoDB before creating a record — if the document already exists, it's a no-op. Idempotency here is not optional.&lt;/p&gt;
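
&lt;p&gt;The idempotency check can be sketched as follows. This is an in-memory stand-in, not the real handler: &lt;em&gt;DocumentStore&lt;/em&gt;, &lt;em&gt;handleObjectCreated&lt;/em&gt;, and the &lt;em&gt;PENDING&lt;/em&gt; status are hypothetical names. Against a real DynamoDB table, the create-if-absent step would typically be a conditional write (for example with an &lt;em&gt;attribute_not_exists&lt;/em&gt; condition) so that the check and the write happen atomically.&lt;/p&gt;

```typescript
// In-memory stand-in for the DynamoDB documents table (assumption for the sketch).
class DocumentStore {
  private records = new Map();

  // Returns true if the record was created, false if it already existed.
  createIfAbsent(documentKey: string, record: object): boolean {
    if (this.records.has(documentKey)) {
      return false;
    }
    this.records.set(documentKey, record);
    return true;
  }
}

interface S3ObjectEvent {
  bucket: string;
  key: string;
}

// Handles one S3 event; a duplicate delivery becomes a no-op.
function handleObjectCreated(
  store: DocumentStore,
  event: S3ObjectEvent,
  startProcessing: (documentKey: string) => void
): "started" | "duplicate" {
  const documentKey = `${event.bucket}/${event.key}`;
  const created = store.createIfAbsent(documentKey, { status: "PENDING" });
  if (!created) {
    return "duplicate";
  }
  startProcessing(documentKey);
  return "started";
}
```

&lt;p&gt;Delivering the same event twice starts processing once: the second call finds the existing record and returns without side effects.&lt;/p&gt;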

&lt;p&gt;&lt;strong&gt;The State Machine: Where the Real Work Happens&lt;/strong&gt;&lt;br&gt;
Once the ingestion Lambda fires StartExecution, control passes to the Step Functions state machine. This is the &lt;strong&gt;heart of the pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvckqqybrsl42ci5vcjob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvckqqybrsl42ci5vcjob.png" alt="architecture" width="799" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm66fskqg30jwzsalopg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm66fskqg30jwzsalopg7.png" alt="the full flow - state machine" width="800" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me walk through the decisions that aren't obvious from the diagram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why maxConcurrency: 10 — Not 50, Not 100&lt;/strong&gt;&lt;br&gt;
The Map state fans out to one execution per document. &lt;em&gt;maxConcurrency: 10&lt;/em&gt; means at most 10 documents are processed simultaneously.&lt;/p&gt;

&lt;p&gt;The first time I saw this, I thought it was a conservative default I should raise. I didn't raise it — and here's why that was the right call.&lt;/p&gt;

&lt;p&gt;Textract's AnalyzeDocument API and Bedrock both have per-account concurrency limits. If you submit 50 documents simultaneously, you'll hit those limits, get throttling errors, and your retry logic will pile on. You end up processing 50 documents slower than if you'd used 10, because the retry backoff adds latency on top of the throttled calls.&lt;/p&gt;

&lt;p&gt;10 concurrent document processors is a deliberate contract with AWS service quotas. At ~15 seconds per document (Textract + Bedrock combined), a 100-document batch takes around 150 seconds — 10 waves of 10. That's completely acceptable for a background processing job.&lt;/p&gt;
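
&lt;p&gt;The arithmetic above, as a small helper. It is a rough estimate that assumes every document takes about the same time, so the Map state behaves as if it ran in clean waves; the function name is hypothetical.&lt;/p&gt;

```typescript
// Rough estimate: documents are processed in waves of `maxConcurrency`.
function estimateBatchSeconds(
  documentCount: number,
  maxConcurrency: number,
  secondsPerDocument: number
): number {
  const waves = Math.ceil(documentCount / maxConcurrency);
  return waves * secondsPerDocument;
}

console.log(estimateBatchSeconds(100, 10, 15)); // 150
```

&lt;p&gt;Plugging in a higher concurrency shows what a quota increase would buy before changing the CDK constant.&lt;/p&gt;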

&lt;p&gt;When I request higher Textract and Bedrock quotas, I'll raise this number. The point is it lives in the CDK definition as a named constant — not buried in application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-Document Error Handling: The Most Important Decision&lt;/strong&gt;&lt;br&gt;
This was the thing I got wrong first.&lt;/p&gt;

&lt;p&gt;In my initial version, if one document's processor threw an exception, the Map state failed, and the entire batch execution failed. 49 successfully processed documents, one corrupted PDF, entire batch in &lt;strong&gt;FAILED&lt;/strong&gt; state.&lt;/p&gt;

&lt;p&gt;The fix is &lt;strong&gt;addCatch&lt;/strong&gt; at the task level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;processDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;markDocumentFailed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;States.ALL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;resultPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$.error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;em&gt;DocumentProcessorLambda&lt;/em&gt; throws — for any reason — execution routes to &lt;em&gt;MarkDocumentFailed&lt;/em&gt; instead of propagating up. MarkDocumentFailed writes the error, timestamp, and document ID to DynamoDB and returns cleanly. The Map state moves to the next document.&lt;/p&gt;

&lt;p&gt;The batch ends with a mix of &lt;em&gt;SUCCEEDED&lt;/em&gt; and &lt;em&gt;FAILED&lt;/em&gt; documents. &lt;em&gt;UpdateBatchStatus&lt;/em&gt; calculates the final batch state from the individual document outcomes. Users can see which documents succeeded, which failed, and why — rather than getting a single opaque batch failure.&lt;/p&gt;
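
&lt;p&gt;A minimal sketch of how a final batch state could be derived from per-document outcomes. The batch status names (&lt;em&gt;COMPLETED&lt;/em&gt;, &lt;em&gt;PARTIALLY_FAILED&lt;/em&gt;, &lt;em&gt;FAILED&lt;/em&gt;) and the function name are assumptions for illustration; only the per-document &lt;em&gt;SUCCEEDED&lt;/em&gt; and &lt;em&gt;FAILED&lt;/em&gt; values come from the pipeline described here.&lt;/p&gt;

```typescript
type DocumentStatus = "SUCCEEDED" | "FAILED";
// Batch status names are hypothetical; the article does not list them.
type BatchStatus = "COMPLETED" | "PARTIALLY_FAILED" | "FAILED";

// Derives the batch outcome from the individual document outcomes.
function deriveBatchStatus(outcomes: DocumentStatus[]): BatchStatus {
  const failed = outcomes.filter((s) => s === "FAILED").length;
  if (failed === 0) return "COMPLETED";
  if (failed === outcomes.length) return "FAILED";
  return "PARTIALLY_FAILED";
}
```

&lt;p&gt;With this shape, 49 good documents and one corrupted PDF produce a partially failed batch rather than an opaque failure.&lt;/p&gt;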

&lt;p&gt;This is the difference between a pipeline that's usable in production and one that isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQS Queues: Two of Them, Different Purposes&lt;/strong&gt;&lt;br&gt;
The pipeline has two SQS queues, and they're not interchangeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DocumentProcessingQueue&lt;/strong&gt; — for bulk S3 ingestion paths where documents need to buffer before hitting the state machine. The critical configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;visibilityTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// Lambda timeout is 5min — always add 1&lt;/span&gt;
&lt;span class="nx"&gt;deadLetterQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentDlq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxReceiveCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The visibility timeout rule I follow everywhere: &lt;strong&gt;always set it to Lambda execution timeout + at least 60 seconds&lt;/strong&gt;. If the Lambda is mid-execution and the visibility timeout expires, SQS thinks the message was abandoned and makes it visible again. Now two Lambdas are processing the same message simultaneously. That's bad. (For queues consumed through a Lambda event source mapping, the AWS documentation recommends a larger margin still: a visibility timeout of at least six times the function timeout.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebhookDeliveryQueue&lt;/strong&gt; — for notifying customer endpoints after batch completion. Five retry attempts instead of three, because external customer endpoints are less reliable than internal Lambda functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;deadLetterQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;webhookDlq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxReceiveCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// external endpoints get more retries&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And &lt;em&gt;batchSize: 1&lt;/em&gt; on the consumer Lambda: if you process 10 webhook deliveries in one invocation and 3 fail, SQS retries all 10 unless the function reports partial batch failures. With batch size 1, each delivery succeeds or fails independently.&lt;/p&gt;

&lt;p&gt;Both DLQs retain messages for 14 days. That's the window to investigate failures. A CloudWatch alarm on DLQ depth &amp;gt; 0 means something broke and gave up — always worth looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Four Failure Modes and How Each Is Handled&lt;/strong&gt;&lt;br&gt;
I designed the failure handling before I wrote the happy path. Here's the failure map:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A single document fails&lt;/strong&gt; → &lt;em&gt;addCatch&lt;/em&gt; routes to &lt;em&gt;MarkDocumentFailed&lt;/em&gt;. The batch continues. The document is marked &lt;em&gt;FAILED&lt;/em&gt; in DynamoDB with the error reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The entire batch initialisation fails&lt;/strong&gt; → State machine transitions to &lt;strong&gt;FAILED&lt;/strong&gt;. This is the right outcome — if the batch can't be initialised, there's nothing to recover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhook delivery fails&lt;/strong&gt; → SQS retries up to 5 times. After 5 failures, the message is dead-lettered and a CloudWatch alarm fires. The customer gets no webhook, but the batch itself has already completed successfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled batch trigger fails&lt;/strong&gt; → EventBridge retries 2 times. &lt;em&gt;ScheduledBatchLambda&lt;/em&gt; is idempotent — it checks DynamoDB before starting any execution, so even a double-trigger from EventBridge won't create duplicate processing.&lt;/p&gt;

&lt;p&gt;Each failure mode is isolated. A failed webhook doesn't affect a completed batch. A failed scheduled trigger doesn't affect manually submitted batches. The pipeline degrades gracefully rather than having a single failure surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One Operational Lesson That Surprised Me&lt;/strong&gt;&lt;br&gt;
Step Functions execution history is more useful than I expected.&lt;/p&gt;

&lt;p&gt;When something goes wrong at 2am — and it will — the first place I look is the Step Functions console. The execution shows every state transition, every input payload, and every error message exactly when it occurred. No digging through CloudWatch logs across multiple Lambda invocations. No reconstructing request IDs.&lt;/p&gt;

&lt;p&gt;I added &lt;strong&gt;X-Ray&lt;/strong&gt; tracing to every Lambda in the pipeline before the first batch ever ran. The combination of Step Functions execution history and X-Ray distributed traces means I've never had a failure I couldn't diagnose within a few minutes.&lt;/p&gt;

&lt;p&gt;Add these from day one. They cost almost nothing, and they're invaluable when you need them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;br&gt;
The pipeline I built is not clever. It's S3, Lambda, Step Functions, and SQS doing exactly what they're designed to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 as a durable ingestion trigger, not just a storage layer&lt;/li&gt;
&lt;li&gt;Step Functions for orchestration with state, parallelism, and explicit failure handling&lt;/li&gt;
&lt;li&gt;SQS for buffering and decoupled delivery with the right retry budget per queue&lt;/li&gt;
&lt;li&gt;DLQs and CloudWatch alarms as the operational safety net&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 202 async pattern means the API is fast and the processing is resilient. The per-document error handling means one bad file can't kill a batch. The visibility timeout discipline means a message isn't redelivered while a Lambda is still working on it.&lt;/p&gt;

&lt;p&gt;Next I'm writing about the AI cost optimization layer — specifically how I reduced Bedrock costs by ~40% without touching extraction accuracy.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>distributedsystems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Step Functions vs EventBridge vs SQS — I Use All Three in the Same System. Here's why.</title>
      <dc:creator>Yoganand Govind</dc:creator>
      <pubDate>Sun, 31 May 2026 12:03:18 +0000</pubDate>
      <link>https://dev.to/yogieee/step-functions-vs-eventbridge-vs-sqs-i-use-all-three-in-the-same-system-heres-why-4njb</link>
      <guid>https://dev.to/yogieee/step-functions-vs-eventbridge-vs-sqs-i-use-all-three-in-the-same-system-heres-why-4njb</guid>
      <description>&lt;p&gt;When I started building &lt;a href="//autowired.ai"&gt;Autowired.ai&lt;/a&gt;, an AI document extraction SaaS, one of the earliest decisions I had to make was which AWS messaging and orchestration services to use for the processing pipeline.&lt;/p&gt;

&lt;p&gt;My first instinct was to reach for SQS everywhere. I knew it well. It's simple, it's cheap, it's reliable. But as the pipeline grew more complex (document uploads triggering extraction workflows, batches processing dozens of files in parallel, webhook notifications delivered to customer endpoints), I kept running into the edges of what a queue alone can do.&lt;/p&gt;

&lt;p&gt;I ended up using all three services: SQS, EventBridge, and Step Functions. Not because I wanted complexity, but because each one fits a specific job that the others don't.&lt;/p&gt;

&lt;p&gt;This post walks through that decision for each service: what it does, where it falls short, and why I chose it for a specific part of the Autowired pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem With Reaching for One Tool&lt;/strong&gt;&lt;br&gt;
Here's the antipattern I see a lot: an engineer learns SQS, it works for the first async use case, and then SQS becomes the default for everything async. Queue → Lambda → done. Repeat.&lt;/p&gt;

&lt;p&gt;This works fine until your workflow has multiple steps. Then you start chaining queues: Queue A → Lambda B writes to Queue C → Lambda D writes to Queue E. You've now built a state machine out of SQS queues — without any of the state, visibility, error handling, or branching logic that a state machine provides.&lt;/p&gt;

&lt;p&gt;When something breaks in that chain, you're reconstructing what happened from CloudWatch logs across five different Lambda invocations with different request IDs, trying to figure out which step failed and what the data looked like when it did.&lt;/p&gt;

&lt;p&gt;I've been there. It's not fun.&lt;/p&gt;

&lt;p&gt;The right answer isn't "use Step Functions for everything" either. Step Functions has overhead — cost per state transition, latency per step, and operational complexity that's overkill for simple async tasks.&lt;/p&gt;

&lt;p&gt;The answer is using each service for what it's actually designed for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Three Services, Simply Put&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;SQS&lt;/strong&gt; is a queue. It buffers work between a producer and a consumer, handles retries, and dead-letters failed messages. It knows nothing about workflow state — only whether a message was processed or needs to be retried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EventBridge&lt;/strong&gt; is an event router. It receives events and routes them to one or more consumers based on rules. Its superpower is loose coupling — the producer doesn't know who's listening, and you can add new consumers without touching the producer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step Functions&lt;/strong&gt; is a workflow orchestrator. It manages multi-step processes with persistent state, branching, parallelism, and error handling. Every step's input, output, and failure is recorded in the execution history.&lt;/p&gt;

&lt;p&gt;None of these is a substitute for the others. They solve different problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Autowired Uses All Three&lt;/strong&gt;&lt;br&gt;
Here's the actual service topology in the Autowired processing pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx70pdl8xm7dy4w5299jf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx70pdl8xm7dy4w5299jf.png" alt=" " width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me explain why each service is where it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EventBridge: For the Scheduled Trigger&lt;/strong&gt;&lt;br&gt;
Every 5 minutes, Autowired checks whether any batches have a scheduled execution time that's due. This is handled by a Lambda (&lt;em&gt;ScheduledBatchLambda&lt;/em&gt;) triggered by an EventBridge rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scheduledBatchRule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ScheduledBatchRule&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;ruleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`autowire-scheduled-batch-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Schedule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Triggers scheduled batch processing check every 5 minutes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;scheduledBatchRule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addTarget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LambdaFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scheduledBatchLambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;retryAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why EventBridge here and not a CloudWatch Events cron or a polling Lambda?&lt;/p&gt;

&lt;p&gt;Because EventBridge is the native AWS scheduling primitive. The rule is declarative, versioned in CDK, has built-in retry logic (&lt;em&gt;retryAttempts: 2&lt;/em&gt;), and integrates cleanly with Lambda. There's no infrastructure to manage, no polling loop to maintain.&lt;/p&gt;

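&lt;p&gt;The &lt;em&gt;retryAttempts: 2&lt;/em&gt; matters: if EventBridge cannot deliver the event to the Lambda (throttling or a transient service error), it retries twice and then gives up, rather than following the default policy of retrying for up to 24 hours. Two retries are enough to ride out a brief blip, and the next scheduled check is only five minutes away.&lt;/p&gt;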

&lt;p&gt;&lt;strong&gt;Step Functions: For the Document Processing Workflow&lt;/strong&gt;&lt;br&gt;
When a document batch is submitted — either via S3 upload or scheduled trigger — it starts a Step Functions execution. This is the core of the pipeline, and it's where the complexity lives.&lt;/p&gt;

&lt;p&gt;The state machine looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1019vqxt77gfyxlqxm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1019vqxt77gfyxlqxm6.png" alt=" " width="799" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I want to highlight a few specific decisions here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Map state with &lt;em&gt;maxConcurrency: 10&lt;/em&gt;&lt;/strong&gt;. Each batch can have dozens or hundreds of documents. The Map state fans out to one execution per document, running up to 10 in parallel. Why 10 and not 50? Because Textract and Bedrock both have per-account concurrency limits. Unconstrained parallelism would exhaust those limits, trigger throttling, and ironically make the batch slower. 10 is a deliberate contract with AWS service quotas, not a performance guess.&lt;/p&gt;
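
&lt;p&gt;For reference, this is roughly what such a Map state looks like in Amazon States Language, written here as a TypeScript object. It is a sketch rather than the deployed definition: the state names, the &lt;em&gt;$.documents&lt;/em&gt; and &lt;em&gt;$.results&lt;/em&gt; paths, and the Lambda ARN are placeholders assumed for illustration.&lt;/p&gt;

```typescript
// Sketch of a Map state in Amazon States Language (ASL).
// State names, paths, and the ARN are placeholders for illustration.
const processDocumentsMapState = {
  Type: "Map",
  ItemsPath: "$.documents", // assumed location of the document list in the input
  MaxConcurrency: 10,       // cap on iterations running in parallel
  ItemProcessor: {
    ProcessorConfig: { Mode: "INLINE" },
    StartAt: "ProcessDocument",
    States: {
      ProcessDocument: {
        Type: "Task",
        Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:PLACEHOLDER",
        End: true,
      },
    },
  },
  ResultPath: "$.results",
  Next: "UpdateBatchStatus",
};

// Lower bound on how many documents the busiest parallel slot must handle
// back to back. Assumes maxConcurrency is a positive integer.
function minimumWaves(documentCount: number, maxConcurrency: number): number {
  return Math.ceil(documentCount / maxConcurrency);
}
```

&lt;p&gt;With 200 documents and a cap of 10, &lt;em&gt;minimumWaves(200, 10)&lt;/em&gt; is 20, so batch duration is at least about 20 times the per-document processing time. That is the price paid for staying inside the service quotas.&lt;/p&gt;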

&lt;p&gt;&lt;strong&gt;Per-document error handling with &lt;em&gt;addCatch&lt;/em&gt;&lt;/strong&gt;. This was one of the most important design decisions. If one corrupted PDF crashes the document processor, it should not abort the other 49 documents in the batch. The &lt;em&gt;addCatch&lt;/em&gt; on the &lt;em&gt;processDocument&lt;/em&gt; step routes failures to &lt;em&gt;MarkDocumentFailed&lt;/em&gt;, which writes the error to DynamoDB and lets the Map state continue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;processDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;markDocumentFailed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;States.ALL&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;resultPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$.error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, a single bad document would fail the entire batch execution. That's the wrong failure mode.&lt;/p&gt;
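
&lt;p&gt;To make the &lt;em&gt;resultPath: "$.error"&lt;/em&gt; part concrete: when the catch fires, Step Functions keeps the original state input and attaches the error details (an object with &lt;em&gt;Error&lt;/em&gt; and &lt;em&gt;Cause&lt;/em&gt; fields) under &lt;em&gt;error&lt;/em&gt;, so the failure handler still knows which document it is dealing with. The sketch below imitates that merge in plain TypeScript. &lt;em&gt;DocumentInput&lt;/em&gt;, &lt;em&gt;documentId&lt;/em&gt;, &lt;em&gt;s3Key&lt;/em&gt;, and &lt;em&gt;runWithCatch&lt;/em&gt; are assumed names for illustration, not Autowired's code.&lt;/p&gt;

```typescript
// Hypothetical input for one Map iteration.
interface DocumentInput {
  documentId: string;
  s3Key: string;
}

// Shape of the error object Step Functions hands to a catcher.
interface CaughtError {
  Error: string;
  Cause: string;
}

// Input as the failure handler would see it: original fields plus the error.
interface FailedDocumentInput extends DocumentInput {
  error: CaughtError;
}

type StepOutcome =
  | { next: "Success"; output: DocumentInput }
  | { next: "MarkDocumentFailed"; output: FailedDocumentInput };

// Imitates a Task state with a catch-all catcher and resultPath "$.error".
function runWithCatch(
  input: DocumentInput,
  processDocument: (doc: DocumentInput) => void
): StepOutcome {
  try {
    processDocument(input);
    return { next: "Success", output: input };
  } catch (err) {
    const name = err instanceof Error ? err.name : "UnknownError";
    const cause = err instanceof Error ? err.message : String(err);
    // The original input is preserved; the error is added beside it.
    return {
      next: "MarkDocumentFailed",
      output: { ...input, error: { Error: name, Cause: cause } },
    };
  }
}
```

&lt;p&gt;A processor that throws on a corrupted PDF produces an outcome routed to &lt;em&gt;MarkDocumentFailed&lt;/em&gt; with &lt;em&gt;documentId&lt;/em&gt; intact, which is what lets the handler write the failure against the right DynamoDB item.&lt;/p&gt;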

&lt;p&gt;&lt;strong&gt;Why not Lambda chains?&lt;/strong&gt; I tried a simpler version of this early on — chaining Lambda invocations directly. The moment I needed to fan out across multiple documents and then aggregate results (to determine overall batch status), Lambda chains couldn't express it. There's no fan-in primitive. Step Functions' Map state handles this natively.&lt;/p&gt;
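
&lt;p&gt;The fan-in half is the part that is easy to underestimate. Once the Map state has collected one result per document, a single step can reduce them to a batch status. A minimal sketch of that reduction follows; the status names and the &lt;em&gt;DocumentResult&lt;/em&gt; shape are assumptions for illustration, not Autowired's actual schema.&lt;/p&gt;

```typescript
// Hypothetical per-document result collected by the Map state.
interface DocumentResult {
  documentId: string;
  status: "SUCCEEDED" | "FAILED";
}

// Hypothetical batch-level statuses.
type BatchStatus = "COMPLETED" | "PARTIALLY_FAILED" | "FAILED";

// Fan-in step: reduce all per-document results to one batch status.
// An empty batch is treated as COMPLETED in this sketch.
function summarizeBatch(results: DocumentResult[]): BatchStatus {
  const failed = results.filter((r) => r.status === "FAILED").length;
  if (failed === 0) return "COMPLETED";
  if (failed === results.length) return "FAILED";
  return "PARTIALLY_FAILED";
}
```

&lt;p&gt;With chained Lambdas, something has to know when the last document has finished before this function can run. The Map state gives you that barrier for free.&lt;/p&gt;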

&lt;p&gt;The other thing Lambda chains can't give you: &lt;strong&gt;execution history&lt;/strong&gt;. When a batch fails, the first thing I do is open the Step Functions console and look at the execution. Every state, every input, every error is right there. That debugging visibility has saved me hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQS: For Webhook Delivery&lt;/strong&gt;&lt;br&gt;
After a batch completes, if the customer has webhook delivery configured, Autowired sends a notification to their endpoint. This is handled via SQS, not via a direct Lambda call from inside the state machine, and not via EventBridge.&lt;/p&gt;

&lt;p&gt;Here's why SQS specifically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webhookDeliveryQueue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;WebhookDeliveryQueue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;queueName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`autowire-webhook-queue-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;visibilityTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;deadLetterQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;webhookDlq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;maxReceiveCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The decoupling is the point&lt;/strong&gt;. Customer webhook endpoints are unreliable — they go down, they time out, they return 500s for unrelated reasons. If webhook delivery is inside the Step Functions execution, a failed webhook delivery blocks or fails the batch. Those are completely unrelated concerns.&lt;/p&gt;

&lt;p&gt;By routing to SQS, the Step Functions execution completes cleanly. Webhook delivery has its own retry budget (&lt;em&gt;maxReceiveCount: 5&lt;/em&gt;, more than the document processor queue's 3, because external endpoints are flakier than internal Lambdas). Failures dead-letter after 5 attempts and trigger a separate alarm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;batchSize: 1&lt;/em&gt; on the consumer&lt;/strong&gt;. The &lt;em&gt;WebhookDeliveryLambda&lt;/em&gt; processes one message at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;webhookDeliveryLambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;lambdaEventSources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SqsEventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webhookDeliveryQueue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;batchSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



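&lt;p&gt;If you process 10 webhook deliveries in one Lambda invocation and 3 fail, the invocation fails as a whole by default, so all 10 messages go back to the queue and are retried, including the 7 that already succeeded. (Partial batch responses can avoid that, at the cost of extra handler code.) With &lt;em&gt;batchSize: 1&lt;/em&gt;, each webhook delivery retries independently. The blast radius of a failed delivery is exactly one customer, not ten.&lt;/p&gt;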
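
&lt;p&gt;The arithmetic behind that blast radius can be modelled in a few lines. This is a simplified sketch, assuming the handler throws when any delivery in its batch fails and that partial batch responses are not enabled; the function names are made up for illustration.&lt;/p&gt;

```typescript
// Which message ids return to the queue after one Lambda invocation?
// Models whole-batch failure: if any message in the batch fails,
// every message in that batch becomes visible again and is retried.
function redeliveredAfterInvocation(batchIds: string[], failedIds: string[]): string[] {
  const anyFailed = batchIds.some((id) => failedIds.indexOf(id) !== -1);
  return anyFailed ? batchIds.slice() : [];
}

// Split message ids into invocation batches. batchSize must be positive.
function toBatches(ids: string[], batchSize: number): string[][] {
  const batches: string[][] = [];
  let start = 0;
  while (ids.length > start) {
    batches.push(ids.slice(start, start + batchSize));
    start += batchSize;
  }
  return batches;
}

// All message ids that get redelivered for a given batch size.
function totalRedelivered(ids: string[], failedIds: string[], batchSize: number): string[] {
  let redelivered: string[] = [];
  for (const batch of toBatches(ids, batchSize)) {
    redelivered = redelivered.concat(redeliveredAfterInvocation(batch, failedIds));
  }
  return redelivered;
}
```

&lt;p&gt;For ten messages where three deliveries fail, a batch size of 10 redelivers all ten, while a batch size of 1 redelivers only the three that failed. The trade-off is more Lambda invocations, which is usually acceptable for webhook volumes.&lt;/p&gt;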

&lt;p&gt;&lt;strong&gt;The Decision Framework I Actually Use&lt;/strong&gt;&lt;br&gt;
After building this, here's how I'd frame the decision for any new async requirement:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reach for SQS when&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to buffer work between a producer and consumer running at different rates&lt;/li&gt;
&lt;li&gt;The work is a single step — queue in, Lambda processes, done&lt;/li&gt;
&lt;li&gt;You need per-message acknowledgment and dead-lettering&lt;/li&gt;
&lt;li&gt;The work items are independent of each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reach for EventBridge when&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need a scheduled trigger (rate or cron)&lt;/li&gt;
&lt;li&gt;One event needs to fan out to multiple independent consumers&lt;/li&gt;
&lt;li&gt;You want the producer to be decoupled from who consumes its events&lt;/li&gt;
&lt;li&gt;You're routing events across services or accounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reach for Step Functions when&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your workflow has multiple steps where each depends on the previous step's output&lt;/li&gt;
&lt;li&gt;You need to fan out over a collection and wait for all items to complete (Map state)&lt;/li&gt;
&lt;li&gt;You need conditional branching based on data in the execution context&lt;/li&gt;
&lt;li&gt;You need execution history for debugging and operations&lt;/li&gt;
&lt;li&gt;The workflow can run for minutes to hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the pattern to avoid: using SQS to chain multi-step workflows. If you find yourself writing a Lambda that reads from Queue A and writes to Queue B to trigger the next step, stop and reach for Step Functions instead.&lt;/p&gt;
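
&lt;p&gt;If it helps to see the framework as code, here is one rough way to encode it. Treat it as a mnemonic rather than a rule engine: the &lt;em&gt;AsyncRequirement&lt;/em&gt; fields and the ordering of the checks are my simplification of the lists above, and real systems (this one included) combine all three services.&lt;/p&gt;

```typescript
// Hypothetical description of an async requirement; field names are illustrative.
interface AsyncRequirement {
  multiStep: boolean;         // steps depend on earlier steps' output
  fanOutAndWait: boolean;     // fan out over a collection, then aggregate
  scheduled: boolean;         // rate or cron trigger
  multipleConsumers: boolean; // one event, several independent consumers
}

type Service = "Step Functions" | "EventBridge" | "SQS";

function chooseService(req: AsyncRequirement): Service {
  // Orchestration needs are checked first: queues should not be chained into workflows.
  if (req.multiStep || req.fanOutAndWait) return "Step Functions";
  // Scheduling and one-to-many routing come next.
  if (req.scheduled || req.multipleConsumers) return "EventBridge";
  // Default: independent, single-step work buffered between producer and consumer.
  return "SQS";
}
```

&lt;p&gt;Run against the three jobs in this post, it returns EventBridge for the 5-minute batch check, Step Functions for document processing, and SQS for webhook delivery.&lt;/p&gt;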

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;br&gt;
The combination of EventBridge for scheduling, Step Functions for orchestration, and SQS for decoupled delivery isn't overengineering. It's three services doing the jobs they were designed for.&lt;/p&gt;

&lt;p&gt;The Autowired pipeline is more debuggable, more resilient, and operationally cleaner because each service is in its right place — not because I used the most services, but because I used the right ones.&lt;/p&gt;

&lt;p&gt;Next I'm writing about the full event-driven document processing pipeline: how S3 uploads trigger the Step Functions execution, how the DLQs are wired, and the failure handling that makes it production-grade.&lt;/p&gt;

&lt;p&gt;Follow along if that's useful.&lt;/p&gt;

&lt;p&gt;This is part of a 10-post series on the architecture behind &lt;a href="//autowired.ai"&gt;Autowired.ai&lt;/a&gt; — an AI document extraction SaaS I built solo on AWS serverless.&lt;/p&gt;

&lt;p&gt;← &lt;a href="https://dev.to/yogieee/what-ive-been-building-for-the-last-several-months-and-why-im-finally-writing-about-it-3i9m"&gt;Intro post: What I've been building&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>distributedsystems</category>
      <category>serverless</category>
      <category>architecture</category>
    </item>
    <item>
      <title>What I've Been Building for the Last Several Months and Why I'm Finally Writing About It</title>
      <dc:creator>Yoganand Govind</dc:creator>
      <pubDate>Thu, 28 May 2026 17:45:41 +0000</pubDate>
      <link>https://dev.to/yogieee/what-ive-been-building-for-the-last-several-months-and-why-im-finally-writing-about-it-3i9m</link>
      <guid>https://dev.to/yogieee/what-ive-been-building-for-the-last-several-months-and-why-im-finally-writing-about-it-3i9m</guid>
      <description>&lt;p&gt;I've been quietly heads-down building something outside of work for the past several months. No posts, no updates, no "excited to share" announcements. Just building.&lt;br&gt;
Today I'm breaking that silence, and this is the first post in a series where I'll share everything I've learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Thing I Built&lt;/strong&gt;&lt;br&gt;
&lt;a href="//autowired.ai"&gt;Autowired.ai&lt;/a&gt; — an AI-powered document extraction SaaS.&lt;br&gt;
The idea is straightforward: businesses deal with mountains of documents — invoices, purchase orders, contracts, insurance forms — and extracting structured data from them is still mostly manual or brittle rule-based OCR that breaks the moment the template changes.&lt;/p&gt;

&lt;p&gt;Autowired lets you define a visual extraction template on a canvas (you draw fields over a sample document), submit a batch of documents, and get back structured JSON with the extracted values. No code, no regex, no fragile parsers.&lt;/p&gt;

&lt;p&gt;Sounds simple. The engineering is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I Built It, Solo&lt;/strong&gt;&lt;br&gt;
I have 11 years of software engineering experience. Enterprise Java, government systems, insurance platforms, cloud architecture. I've worked on large teams, gone through full SDLC processes, dealt with change advisory boards and SLA contracts.&lt;/p&gt;

&lt;p&gt;What I hadn't done was build something entirely from scratch, make every architectural decision myself, and take it all the way to production — solo.&lt;/p&gt;

&lt;p&gt;So that's what I did. This project is my proving ground for everything I know about cloud-native architecture and AI systems applied without the safety net of a team.&lt;/p&gt;

&lt;p&gt;It's ~90% complete and currently in beta. And the lessons have been hard-earned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Stack (and Why)&lt;/strong&gt;&lt;br&gt;
Before I dive into any specific post, here's the full picture of what's running under the hood:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;: AWS CDK (TypeScript). Everything is code, nothing is clicked into existence in the console. Six separate CDK stacks: database, storage, processing, Bedrock, API, and monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: DynamoDB single-table design with three GSIs. Multi-tenant data isolation is baked into the partition key structure, not enforced by application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document processing pipeline&lt;/strong&gt;: S3 event notifications trigger a Lambda, which starts a Step Functions state machine. The state machine runs up to 10 documents in parallel, handles per-document failures independently, updates batch status, and optionally delivers webhook notifications via a separate SQS queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI extraction&lt;/strong&gt;: Amazon Bedrock Data Automation (BDA) for intelligent field extraction. Amazon Textract for OCR preprocessing. Bedrock Guardrails for safety filtering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth&lt;/strong&gt;: Clerk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js product app and marketing site.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime&lt;/strong&gt;: ARM64 Lambdas on Node.js 20, X-Ray tracing across the pipeline, DLQs on every queue.&lt;/p&gt;

&lt;p&gt;Every piece of that list came with decisions, tradeoffs, and at least one thing I got wrong the first time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Coming in This Series&lt;/strong&gt;&lt;br&gt;
Over the next 10 weeks, I'm writing about each layer of this system in depth — not tutorials, not beginner walkthroughs, but the actual engineering reasoning behind the decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step Functions vs EventBridge vs SQS&lt;/strong&gt; — when to use each, and how I use all three in the same system for different jobs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building the event-driven document processing pipeline&lt;/strong&gt; — S3, SQS, Lambda, Step Functions, and the failure handling that makes it production-grade&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How I reduced Bedrock AI costs by ~40%&lt;/strong&gt; — prompt caching, model tiering, token optimisation, result caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB single-table design for multi-tenant SaaS&lt;/strong&gt; — real partition key patterns, GSI design decisions, and the tradeoffs nobody mentions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The full multi-tenant SaaS architecture&lt;/strong&gt; — tenant isolation, async processing, API design, and how all the stacks fit together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform vs AWS CDK&lt;/strong&gt; — a practical comparison from someone who's used both in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG architecture on Amazon Bedrock&lt;/strong&gt; - embeddings, chunking strategy, tenant-aware retrieval, hallucination reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designing high-availability systems at enterprise scale&lt;/strong&gt; - what I carried over from government and insurance engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI architecture patterns from a real product&lt;/strong&gt; — observability, confidence thresholds, prompt versioning, output validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;From Enterprise Java engineer to AI Platform engineer&lt;/strong&gt; — what the transition actually looks like.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you work in cloud, AI infrastructure, or platform engineering, or you're just curious how a solo engineer structures a production-grade SaaS, this series is for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Note on Why I'm Sharing This&lt;/strong&gt;&lt;br&gt;
I'm not building in public for the sake of building in public. I'm sharing this because the content I wish had existed when I was making these decisions (about DynamoDB key design, about when Step Functions is overkill, about how to actually reduce Bedrock costs in a real workload) mostly doesn't exist at the depth it should.&lt;/p&gt;

&lt;p&gt;Most AWS content is either too beginner or too abstract. There's not enough "here's a real system, here's why it's designed this way, here's what broke."&lt;/p&gt;

&lt;p&gt;That's what I'm trying to write.&lt;/p&gt;

&lt;p&gt;— Yoganand (Yogi)&lt;/p&gt;

&lt;p&gt;Follow me here on Dev.to or connect on &lt;a href="https://www.linkedin.com/in/yoganandgovind/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to get each post as it drops.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>ai</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
