Yoganand Govind

Posted on Jun 7

How I Built the Document Processing Pipeline Behind Autowired.ai: S3, Lambda, Step Functions, and SQS

#aws #distributedsystems #serverless #architecture

Early in building Autowired.ai, I wired up a simple flow: user submits a batch → API calls Textract → API calls Bedrock → API returns results.

It worked perfectly. For one document.

The moment I tested with 20 documents, the Lambda timed out. The moment I tested with a corrupted PDF, the entire batch failed. The moment I imagined what happens when Bedrock throttles mid-batch, I realised I had built a pipeline that would fall apart in production in at least five different ways.

So I scrapped it and rebuilt the whole thing async. This post is what I landed on — and more importantly, why.

The Core Insight: The API Should Not Touch the Processing
The first architectural shift was the most important one: the API has no role in document processing. Its only job is to accept the batch submission, write the initial records to DynamoDB, return a presigned S3 URL, and send a 202 Accepted back to the client.

That's it. The API is done.

Everything that follows – OCR, AI extraction, status tracking, and webhook delivery – happens completely independently, triggered by the document landing in S3.

Why S3 as the trigger instead of having the API start the Step Functions execution directly?

Because S3 event notifications are durable. If Step Functions has a transient service hiccup when the document uploads, the S3 event queues and retries. The API already returned 202 — the client doesn't know or care. The processing will start when the event delivers.

If the API triggered Step Functions directly, a transient Step Functions error at submission time would require the client to retry the entire batch upload. That's a much worse failure mode.

The Gotcha I Hit With S3 Event Filters
S3 event notification suffix filters are case-sensitive.

I set up filters for .pdf, .png, .jpg, .jpeg, .tiff and wondered why some customer uploads weren't triggering processing. Turns out files uploaded from Windows frequently come in as .PDF, .JPG, .TIFF.

The fix is to register both variants for every extension:

const supportedExtensions = [
  ".pdf", ".PDF",
  ".png", ".PNG",
  ".jpg", ".JPG",
  ".jpeg", ".JPEG",
  ".tiff", ".TIFF",
  ".tif", ".TIF",
];

for (const ext of supportedExtensions) {
  this.documentsBucket.addEventNotification(
    s3.EventType.OBJECT_CREATED,
    new s3n.LambdaDestination(this.s3IngestionLambda),
    { prefix: "s3-ingestion/", suffix: ext }
  );
}

Not glamorous. But the kind of thing that burns you in production if you don't know it.

Also, S3 event notifications are at-least-once, not exactly-once. S3IngestionLambda checks DynamoDB before creating a record — if the document already exists, it's a no-op. Idempotency here is not optional.

The State Machine: Where the Real Work Happens
Once the ingestion Lambda fires StartExecution, control passes to the Step Functions state machine. This is the heart of the pipeline.

The full flow:

Let me walk through the decisions that aren't obvious from the diagram.

Why maxConcurrency: 10 — Not 50, Not 100
The Map state fans out to one execution per document. maxConcurrency: 10 means at most 10 documents are processed simultaneously.

The first time I saw this, I thought it was a conservative default I should raise. I didn't raise it — and here's why that was the right call.

Textract's AnalyzeDocument API and Bedrock both have per-account concurrency limits. If you submit 50 documents simultaneously, you'll hit those limits, get 429 throttling responses, and your retry logic will pile on. You end up processing 50 documents slower than if you'd used 10, because the retry backoff adds latency on top of the throttled calls.

10 concurrent document processors is a deliberate contract with AWS service quotas. At ~15 seconds per document (Textract + Bedrock combined), a 100-document batch takes around 150 seconds — 10 waves of 10. That's completely acceptable for a background processing job.

When I request higher Textract and Bedrock quotas, I'll raise this number. The point is it lives in the CDK definition as a named constant — not buried in application code.

Per-Document Error Handling: The Most Important Decision
This was the thing I got wrong first.

In my initial version, if one document's processor threw an exception, the Map state failed, and the entire batch execution failed. 49 successfully processed documents, one corrupted PDF, entire batch in FAILED state.

The fix is addCatch at the task level:

processDocument.addCatch(markDocumentFailed, {
  errors: ["States.ALL"],
  resultPath: "$.error",
});

When DocumentProcessorLambda throws — for any reason — execution routes to MarkDocumentFailed instead of propagating up. MarkDocumentFailed writes the error, timestamp, and document ID to DynamoDB and returns cleanly. The Map state moves to the next document.

The batch ends with a mix of SUCCEEDED and FAILED documents. UpdateBatchStatus calculates the final batch state from the individual document outcomes. Users can see which documents succeeded, which failed, and why — rather than getting a single opaque batch failure.

This is the difference between a pipeline that's usable in production and one that isn't.

SQS Queues: Two of Them, Different Purposes
The pipeline has two SQS queues, and they're not interchangeable.

DocumentProcessingQueue — for bulk S3 ingestion paths where documents need to buffer before hitting the state machine. The critical configuration:

visibilityTimeout: cdk.Duration.minutes(6), // Lambda timeout is 5min — always add 1
deadLetterQueue: {
  queue: documentDlq,
  maxReceiveCount: 3,
},

The visibility timeout rule I follow everywhere: always set it to Lambda execution timeout + at least 60 seconds. If the Lambda is mid-execution and the visibility timeout expires, SQS thinks the message was abandoned and makes it visible again. Now two Lambdas are processing the same message simultaneously. That's bad.

WebhookDeliveryQueue — for notifying customer endpoints after batch completion. Five retry attempts instead of three, because external customer endpoints are less reliable than internal Lambda functions:

deadLetterQueue: {
  queue: webhookDlq,
  maxReceiveCount: 5, // external endpoints get more retries
},

And batchSize: 1 on the consumer Lambda — if you process 10 webhook deliveries in one invocation and 3 fail, SQS retries all 10. With batch size 1, each delivery fails independently.

Both DLQs retain messages for 14 days. That's the window to investigate failures. A CloudWatch alarm on DLQ depth > 0 means something broke and gave up — always worth looking at.

The Four Failure Modes and How Each Is Handled
I designed the failure handling before I wrote the happy path. Here's the failure map:

A single document fails → addCatch routes to MarkDocumentFailed. The batch continues. The document is marked FAILED in DynamoDB with the error reason.

The entire batch initialisation fails → State machine transitions to FAILED. This is the right outcome — if the batch can't be initialised, there's nothing to recover.

Webhook delivery fails → SQS retries up to 5 times. After 5 failures, message dead letters. CloudWatch alarm fires. The customer gets no webhook — but the batch itself already completed successfully.

Scheduled batch trigger fails → EventBridge retries 2 times. ScheduledBatchLambda is idempotent — it checks DynamoDB before starting any execution, so even a double-trigger from EventBridge won't create duplicate processing.

Each failure mode is isolated. A failed webhook doesn't affect a completed batch. A failed scheduled trigger doesn't affect manually submitted batches. The pipeline degrades gracefully rather than having a single failure surface.

One Operational Lesson That Surprised Me
Step Functions execution history is more useful than I expected.

When something goes wrong at 2am — and it will — the first place I look is the Step Functions console. The execution shows every state transition, every input payload, and every error message exactly when it occurred. No digging through CloudWatch logs across multiple Lambda invocations. No reconstructing request IDs.

I added X-ray tracing to every Lambda in the pipeline before the first batch ever ran. The combination of Step Functions execution history and X-Ray distributed traces means I've never had a failure I couldn't diagnose within a few minutes.

Add these from day one. They cost almost nothing, and they're invaluable when you need them.

Wrapping Up
The pipeline I built is not clever. It's S3, Lambda, Step Functions, and SQS doing exactly what they're designed to do:

S3 as a durable ingestion trigger, not a storage layer
Step Functions for orchestration with state, parallelism, and explicit failure handling
SQS for buffering and decoupled delivery with the right retry budget per queue
DLQs and CloudWatch alarms as the operational safety net

The 202 async pattern means the API is fast and the processing is resilient. The per-document error handling means one bad file can't kill a batch. The visibility timeout discipline means no message is processed twice.

Next I'm writing about the AI cost optimization layer — specifically how I reduced Bedrock costs by ~40% without touching extraction accuracy.

DEV Community

How I Built the Document Processing Pipeline Behind Autowired.ai: S3, Lambda, Step Functions, and SQS

Top comments (0)