Hello World 👋
I’m Vasil, a DevOps Engineer with a passion for building reliable, scalable, and well-architected cloud platforms. With hands-on experience across cloud infrastructure, CI/CD, observability, and platform engineering, I enjoy turning complex operational challenges into clean, automated solutions.
I’ve been working with AWS Cloud for over 5 years, and I believe it’s high time I start exploring AI on AWS more deeply. Through these posts, I plan to share practical learnings, real-world experiences, and honest perspectives from my journey in DevOps, Cloud, and now AI.
Without further delay — let’s dive in 🚀
Introduction
Processing large volumes of documents like PDFs, scanned images, invoices, or contracts is a common challenge in many businesses. AWS provides managed services that can significantly reduce the manual effort required to extract meaningful text and context from uncategorized documents.
In this article, we’ll walk through a practical pattern for document processing using AI services on AWS:
Amazon Textract — for optical character recognition (OCR) and structured text extraction
Amazon Bedrock — for applying foundation models for understanding and summarization
Amazon S3 — for document storage
This pattern is useful for building Intelligent Document Processing (IDP) pipelines that go beyond simple OCR, adding structure, insights, and context using AI.
Prerequisites
To follow this article, you need:
- AWS account access
- AWS CLI configured (I’ll be using Cloudshell)
- Permissions for:
  - Amazon S3
  - Amazon Textract
  - Amazon Bedrock
- A document (PDF or image) stored in an S3 bucket
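Before diving in, a quick sanity check can save time. This is a minimal sketch, assuming your CLI profile and default region are already configured; it only confirms which identity you are authenticated as and which Meta models Bedrock exposes in the region (some models also need access to be enabled in the Bedrock console before you can invoke them).
# Confirm which identity the CLI is using
aws sts get-caller-identity
# List the Meta foundation models visible in this region
aws bedrock list-foundation-models --region us-east-1 --by-provider meta --query 'modelSummaries[].modelId'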
Architecture Overview
Here’s the high-level flow we’ll follow:
- Upload document to Amazon S3
- Use Amazon Textract to extract raw text and layout from the document
- Feed the extracted content into Amazon Bedrock foundation models for summarization or structured extraction (note: you could automate this step with a Lambda function, but we'll skip that in this article to keep it simple)
- Store or present the processed results
Step-by-Step Implementation
Step 1: Create an S3 Bucket for Documents
We’ll start by creating an S3 bucket to store input documents and AI outputs.
aws s3 mb s3://aws-textract-demo-vasil
(Replace the suffix with your own name, since bucket names must be globally unique)
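If you prefer, you can pin the bucket to a specific region and block public access while you're at it. This is an optional variation, not something the walkthrough depends on:
# Create the bucket in an explicit region
aws s3 mb s3://aws-textract-demo-vasil --region us-east-1
# Block all public access on the bucket (a sensible default for document storage)
aws s3api put-public-access-block \
  --bucket aws-textract-demo-vasil \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true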
Step 2: Upload Document to S3
Upload your document (PDF, PNG, JPG, etc.) to an S3 bucket.
aws s3 cp path/to/local/document.pdf s3://your-bucket/
Note: If you're using CloudShell like me, you'll first need to upload the file into your CloudShell environment: click Actions in the top-right corner, then 'Upload file'.
For example:
aws s3 cp invoice.pdf s3://aws-textract-demo-vasil/invoice.pdf
You can upload any invoice you have or download a sample invoice pdf/image from Google. For this example, I used my own invoice that I got after a purchase.
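Before moving on, it's worth confirming the object actually landed in the bucket:
# Verify the upload
aws s3 ls s3://aws-textract-demo-vasil/
# Optionally inspect the object's size and content type
aws s3api head-object --bucket aws-textract-demo-vasil --key invoice.pdf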
Step 3: Extract Text with Amazon Textract
Textract converts documents into machine-readable text.
aws textract detect-document-text \
  --region us-east-1 \
  --document '{
    "S3Object": {
      "Bucket": "<your-bucket-name>",
      "Name": "invoice.pdf"
    }
  }' > textract-output.json
In my case:
aws textract detect-document-text \
  --region us-east-1 \
  --document '{
    "S3Object": {
      "Bucket": "aws-textract-demo-vasil",
      "Name": "invoice.pdf"
    }
  }' > textract-output.json
And… drumroll 🥁… we hit our first error!
An error occurred (AccessDeniedException) when calling the DetectDocumentText operation: User: arn:aws:iam::393078901895:user/kk_labs_user_465822 is not authorized to perform: textract:DetectDocumentText with an explicit deny in a service control policy
What’s Happening Here?
In a standard AWS account, this error can be resolved by granting the required Textract permissions via IAM.
However, in managed sandbox environments (such as training labs), Textract APIs are often explicitly restricted using service or identity-based policies to control cost and data exposure.
Because of this, the following steps assume a standard AWS account with appropriate permissions.
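For reference, on a personal account the quickest (if coarse) fix is attaching the AWS managed policies to your identity. A rough sketch, assuming an IAM user; swap in least-privilege policies for anything beyond a demo, and note that no IAM grant can override an explicit SCP deny like the one above:
# Demo-level permissions for Textract and Bedrock (not least-privilege)
aws iam attach-user-policy \
  --user-name <your-iam-user> \
  --policy-arn arn:aws:iam::aws:policy/AmazonTextractFullAccess
aws iam attach-user-policy \
  --user-name <your-iam-user> \
  --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess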
Note: The outputs shown below are representative of real Amazon Textract responses.
Sample Textract output
{
  "Blocks": [
    {
      "BlockType": "LINE",
      "Text": "Invoice Number: 12345",
      "Confidence": 99.2
    },
    {
      "BlockType": "LINE",
      "Text": "Date: 2025-12-27",
      "Confidence": 98.7
    },
    {
      "BlockType": "LINE",
      "Text": "Total Amount: $1,234.56",
      "Confidence": 97.9
    }
  ]
}
This produces structured JSON containing:
- detected text
- layout information
- confidence scores
Textract doesn’t just do OCR — it understands document structure. That makes it a much better input for LLMs than raw text extraction tools.
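One practical caveat: the synchronous DetectDocumentText call works on single-page documents. For multi-page PDFs you would typically switch to Textract's asynchronous flow, roughly like this (the job ID placeholder is whatever the first call returns):
# Start an asynchronous text-detection job for a multi-page PDF in S3
aws textract start-document-text-detection \
  --region us-east-1 \
  --document-location '{"S3Object": {"Bucket": "aws-textract-demo-vasil", "Name": "invoice.pdf"}}'
# Once the job succeeds, fetch the results using the returned JobId
aws textract get-document-text-detection --region us-east-1 --job-id <job-id> > textract-output.json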
Step 4: Prepare Text for Bedrock
At this stage, you have raw text from Textract which may include:
- words
- line breaks
- positional information
- nested JSON structures
Your goal is to extract the plain, relevant text and concatenate multiple pages into a single text blob. LLMs work best with clean, structured input, so we focus only on the meaningful lines.
You can do this with a Python script, a CLI tool like jq, or, if you prefer automation, a Lambda function that processes new documents as they arrive (Lambda is optional and not required for this guide). We'll use jq here; a confidence-filtered variation is shown after the output below.
jq -r '.Blocks[] | select(.BlockType=="LINE") | .Text' textract-output.json > cleaned_text.txt
Check the content:
cat cleaned_text.txt
Output:
Invoice Number: 12345
Date: 2025-12-27
Total Amount: $1,234.56
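As a small optional refinement of the jq command above, you can also drop lines Textract was less sure about; the 90% threshold here is arbitrary:
# Keep only LINE blocks with reasonably high confidence
jq -r '.Blocks[] | select(.BlockType=="LINE" and .Confidence > 90) | .Text' textract-output.json > cleaned_text.txt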
Step 5: Feed Textract Output into Amazon Bedrock
We'll use Meta Llama 3.2 (3B Instruct) via Amazon Bedrock to interpret the document.
What We’ll Ask the Model
“Summarize this invoice and extract key details.”
Bedrock CLI Invocation
# Safely read the cleaned text and escape it for JSON
PROMPT=$(jq -Rs . < cleaned_text.txt)
# Build the full prompt with instruction
FULL_PROMPT=$(jq -Rn --argjson txt "$PROMPT" '$txt + "\n\nPlease summarize this invoice and provide only the key details in a structured format (Invoice Number, Date, Total Amount)."')
# Invoke Bedrock
aws bedrock-runtime invoke-model \
  --region us-east-1 \
  --model-id us.meta.llama3-2-3b-instruct-v1:0 \
  --content-type application/json \
  --accept application/json \
  --cli-binary-format raw-in-base64-out \
  --body "{
    \"prompt\": $FULL_PROMPT,
    \"max_gen_len\": 300,
    \"temperature\": 0.3
  }" \
  response.json
A quick breakdown of what's happening here:
- jq -Rs . < cleaned_text.txt escapes all newlines and quotes in your text.
- jq -Rn --argjson txt "$PROMPT" parses that escaped text back into a plain string and appends your instruction safely as a single JSON string.
- --body ends up as valid JSON, with "prompt" carrying both the document text and your instruction.
This approach avoids bash quoting headaches and ensures the model receives the full instruction intact.
Model Response
Finally, run cat response.json to see the response (a jq one-liner for pulling out just the generated text follows the notes below). Don't be surprised if some of it looks fairly generic. That generic output comes from the model itself, and it happens because:
- The content in cleaned_text.txt is very small or too generic (in our example, just a few lines like Invoice Number: 12345)
- The Llama 3.2 Instruct model interprets short prompts in a “template/illustrative” style and fills in plausible generic text.
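If you'd rather see only the generated text instead of the full JSON envelope, a one-liner does it. This assumes the standard Llama response shape on Bedrock, where the output lives in a generation field alongside token counts and a stop reason:
# Print just the model's generated text
jq -r '.generation' response.json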
What We Just Built
✅ Stored documents in S3
✅ Extracted structured data with Textract
✅ Used an LLM to reason over documents
✅ Built a logical AI pipeline — step by step
All without deploying a single Lambda function.
Where This Goes in the Real World
With automation, this same pipeline can:
- Process invoices automatically
- Extract contract clauses
- Power document search
- Feed ERP or finance systems
- Trigger workflows based on document content
Add Lambda or Step Functions later; the core AI flow stays the same.
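As a taste of that automation, here is a rough, hypothetical sketch of pointing S3 at a processing Lambda whenever a new document arrives. The function name and ARN below are made up, and you would also need to create the function and grant S3 permission to invoke it:
# Hypothetical: invoke a document-processing Lambda on every new upload
aws s3api put-bucket-notification-configuration \
  --bucket aws-textract-demo-vasil \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-document",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'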
Final Thoughts
Well done. Seriously — give yourself a pat 👏
At this point, you’ve successfully used generative AI on AWS:
- No apps
- No infrastructure
- Just clean, composable AI services
And yes — DO NOT forget to clean up resources that you created during this walkthrough to avoid unexpected cost.
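For this walkthrough that mostly means emptying and deleting the demo bucket, since Textract and Bedrock on-demand usage is billed per request:
# Empty the bucket, then delete it
aws s3 rm s3://aws-textract-demo-vasil --recursive
aws s3 rb s3://aws-textract-demo-vasil
If you attached the managed IAM policies earlier, you can detach them the same way with aws iam detach-user-policy.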