
Building an Intelligent Document Processing Pipeline with AWS: A Journey from Idea to Production

TL;DR - Want to Skip the Story?

Fork the repo, deploy in 10 minutes, start processing documents:

git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing
sam build && sam deploy --guided

Everything is production-ready. Just add your AWS credentials and email. Keep reading if you want to know how it works and what I learned building it.


Why I Built This (And Why You Might Need It Too)

Picture this: You're drowning in invoices, receipts, tax forms, and contracts. Your team is manually typing data from PDFs into spreadsheets. Hours wasted. Errors everywhere. Sound familiar?

I built an automated document processing system using AWS services that:

  • Extracts text, tables, and form data from uploaded documents automatically
  • Processes documents in real-time as soon as they're uploaded
  • Stores structured data ready for analysis or integration
  • Sends notifications when processing completes (or fails)
  • Scales effortlessly from 10 to 10,000 documents

The best part? Once deployed, it costs pennies per document and requires zero maintenance. And you don't have to build it from scratch - just fork, deploy, and use.

The Architecture: Simple but Powerful

Here's what I built:

Document Upload (S3) 
    ↓
Automatic Trigger (S3 Event)
    ↓
AI Processing (Lambda + Textract)
    ↓
Structured Storage (DynamoDB)
    ↓
Notification (SNS Email)

The magic happens in seconds:

  1. Drop a document in S3
  2. Lambda wakes up automatically
  3. Textract extracts everything (text, tables, forms)
  4. Data lands in DynamoDB, perfectly structured
  5. You get an email notification

No servers to manage. No infrastructure to maintain. Just pure serverless goodness.
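Under the hood, the Lambda glue is only a few boto3 calls. Here's a minimal sketch of the shape of it — the table name, topic ARN, and helper names are illustrative, not the repo's actual code:

```python
import urllib.parse


def parse_s3_event(event):
    """Pull (bucket, key) pairs out of an S3 event payload; keys arrive URL-encoded."""
    return [
        (r["s3"]["bucket"]["name"],
         urllib.parse.unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def handler(event, context):
    import boto3  # provided by the Lambda runtime
    textract = boto3.client("textract")
    table = boto3.resource("dynamodb").Table("ProcessedDocuments")  # assumed table name
    sns = boto3.client("sns")

    for bucket, key in parse_s3_event(event):
        # Ask Textract for raw text plus table and form structure
        result = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
        )
        table.put_item(Item={
            "document_id": key,
            "status": "COMPLETED",
            "block_count": len(result["Blocks"]),
        })
        sns.publish(  # in a real deployment the topic ARN comes from an env var
            TopicArn="arn:aws:sns:us-east-1:123456789012:doc-processing",
            Message=f"Processed {key}: {len(result['Blocks'])} blocks",
        )
```

That's the whole pipeline: S3 hands the event to `handler`, Textract does the heavy lifting, and everything else is bookkeeping.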

What Makes This Cool?

1. It Actually Understands Your Documents

This isn't just OCR. Amazon Textract uses machine learning to understand document structure:

  • Tables: Extracts rows and columns with relationships intact
  • Forms: Identifies key-value pairs (Invoice #: 12345, Date: Dec 23, etc.)
  • Text: Gets every word with confidence scores

I tested it with a business invoice containing 4 line items, calculations, and company details. Textract extracted 155 text blocks and perfectly reconstructed the entire table structure. Here's what it found:

{
  "fields": {
    "Invoice #": "INV-2024-001234",
    "Date": "December 23, 2024",
    "Total": "$30,922.50"
  },
  "tables": [
    {
      "rows": 8,
      "columns": 4,
      "data": [
        ["Description", "Quantity", "Price", "Total"],
        ["AWS Cloud Setup", "1", "$5,000", "$5,000"],
        ["DevOps Consulting (80 hrs)", "80", "$175/hr", "$14,000"],
        ...
      ]
    }
  ]
}
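Textract doesn't hand you that JSON directly: forms come back as `KEY_VALUE_SET` blocks linked by IDs, and you reassemble the pairs yourself. A minimal sketch of that walk (real responses carry more per block, like geometry and confidence scores):

```python
def extract_fields(blocks):
    """Reassemble form key-value pairs from Textract KEY_VALUE_SET blocks."""
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block):
        # Concatenate the WORD children of a KEY or VALUE block
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]
                          if by_id[i]["BlockType"] == "WORD"]
        return " ".join(words)

    fields = {}
    for block in blocks:
        if (block["BlockType"] == "KEY_VALUE_SET"
                and "KEY" in block.get("EntityTypes", [])):
            value_text = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    value_text = " ".join(text_of(by_id[i]) for i in rel["Ids"])
            fields[text_of(block)] = value_text
    return fields
```

Tables work the same way: `TABLE` blocks point at `CELL` children, which point at `WORD` children.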

2. Zero Infrastructure Management

Remember the days of provisioning servers, configuring load balancers, and setting up auto-scaling? Yeah, me neither. With this serverless architecture:

  • Lambda handles compute (only runs when needed)
  • S3 triggers events automatically
  • DynamoDB scales on demand
  • CloudWatch monitors everything

I literally deployed this, uploaded a document, and walked away. It just works.

3. Cost-Effective at Any Scale

Let's talk money:

  • Textract: $1.50 per 1,000 pages for text detection (first 1M pages/month); forms and tables analysis is priced higher per page
  • Lambda: First 1M requests free, then $0.20 per 1M
  • S3: $0.023 per GB
  • DynamoDB: First 25 GB free

Real example: text detection on 1,000 single-page invoices per month costs about $1.50. Even with forms and tables analysis on top, you're paying cents per document.

4. Production-Ready Error Handling

Things fail. Networks hiccup. Documents are corrupted. I built in:

  • Dead Letter Queue for failed processing
  • CloudWatch Alarms for error monitoring
  • SNS notifications for both success and failure
  • Automatic retries with exponential backoff
  • Status tracking in DynamoDB

When something breaks, you know immediately.
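Lambda's built-in async retries and the DLQ cover most failure paths, but application-level calls (to Textract, for instance) benefit from the same pattern. A minimal backoff helper in the spirit of the above — the names are mine, not the repo's:

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=0.5, retriable=(Exception,)):
    """Retry fn with exponential backoff plus jitter, re-raising once exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise
            # Sleep 0.5s, 1s, 2s, ... plus jitter so retries don't stampede
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In practice you'd narrow `retriable` to throttling and transient network errors; retrying a genuinely corrupt document just burns invocations.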

The Challenges (And How I Solved Them)

Challenge #1: Circular Dependencies in CloudFormation

My first SAM deployment failed with:

Circular dependency between resources: [DocumentBucket, DocumentProcessorFunction, DocumentProcessorFunctionPermission]

The Problem: The S3 bucket's event configuration needed the Lambda function's ARN, the Lambda function needed the bucket name, and the invoke permission needed both. Classic chicken-and-egg.

The Solution:

  1. Deploy infrastructure without S3 event notifications
  2. Add S3 notifications separately using AWS CLI
  3. Keep Lambda permission with explicit DependsOn

Lesson learned: Sometimes you need to break CloudFormation into multiple steps.

Challenge #2: PDF Format Compatibility Hell

This was the big one. I generated beautiful PDFs using Python's reportlab library. They looked perfect. Textract rejected every single one.

UnsupportedDocumentException: Request has unsupported document format

Even the official IRS Form 1040 PDF failed!

The Investigation:

  • Textract supports PDFs (per AWS docs) ✅
  • My PDFs were valid (opened fine in Preview) ✅
  • File size was under 5MB ✅
  • But still... rejected ❌

The Breakthrough: Not all PDFs are created equal. Textract's synchronous APIs were built around PNG and JPEG images, and where PDFs are accepted at all, Textract is strict about their internal structure. PDFs generated by reportlab, fpdf, and even some government forms use constructs Textract doesn't support.

The Solution: Convert PDFs to PNG/JPG images first.

# Using poppler's pdftoppm
pdftoppm -png -singlefile document.pdf output
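To convert a whole folder before upload, that same poppler call can be scripted. A small sketch (assumes `pdftoppm` is on your PATH; 300 DPI keeps fine print legible for OCR):

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf, out_stem, dpi=300):
    """Build the pdftoppm invocation for a single-page PNG render."""
    return ["pdftoppm", "-png", "-singlefile", "-r", str(dpi), str(pdf), str(out_stem)]


def convert_all(folder):
    # Render every PDF in `folder` to a sibling PNG (pdftoppm writes <stem>.png)
    for pdf in Path(folder).glob("*.pdf"):
        subprocess.run(pdftoppm_cmd(pdf, pdf.with_suffix("")), check=True)
```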

Results:

  • IRS Form 1040 as PDF: ❌ Failed
  • Same form as PNG: ✅ 1,412 blocks extracted
  • Reportlab invoice as PDF: ❌ Failed
  • Same invoice as PNG: ✅ 155 blocks extracted

Key Takeaway: If you're programmatically generating documents for Textract, create them as PNG/JPG from the start. Save yourself the headache.

Challenge #3: IAM Permissions Maze

SAM deployment needs a surprising number of AWS permissions:

  • CloudFormation (to create stacks)
  • Lambda (to deploy functions)
  • IAM (to create execution roles)
  • S3 (to store artifacts and documents)
  • DynamoDB (to create tables)
  • SNS (to create topics)
  • CloudWatch (to create log groups and alarms)
  • X-Ray (for tracing)

Best Practice: Create an IAM user group with all required managed policies, then add your deployment user to it. This makes permission management cleaner and reusable.

Real Results: What It Actually Extracted

Test 1: Business Invoice

Input: PNG image with company header, bill-to info, 4 line items, and calculations

Extracted:

  • 8 key-value pairs (Invoice #, dates, addresses)
  • Complete 8x4 table with all line items
  • Subtotal, tax, and total calculations
  • 155 total text blocks

Processing time: ~2 seconds

Test 2: IRS Form 1040 (Tax Form)

Input: Official IRS form converted to PNG

Extracted:

  • 1,412 text blocks
  • All form field labels
  • Form structure and layout
  • Every piece of text on the form

Processing time: ~3 seconds

Deployment: From Zero to Production in 10 Minutes

Here's what it actually took:

# 1. Build the application
sam build

# 2. Deploy to AWS
sam deploy --stack-name doc-processing-dev \
  --region us-east-1 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides NotificationEmail=your@email.com \
  --resolve-s3 \
  --no-confirm-changeset

# 3. Configure S3 event notifications
aws s3api put-bucket-notification-configuration \
  --bucket your-bucket-name \
  --notification-configuration file://s3-notification-config.json

# Done!
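For reference, the notification configuration passed in step 3 looks roughly like this — substitute your function's ARN; the incoming/ prefix matches where documents get uploaded:

```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:doc-processor",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {"Name": "prefix", "Value": "incoming/"}
          ]
        }
      }
    }
  ]
}
```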

CloudFormation created:

  • 1 S3 bucket
  • 1 Lambda function
  • 1 DynamoDB table
  • 1 SNS topic with email subscription
  • 2 CloudWatch alarms
  • 1 SQS dead letter queue
  • All IAM roles and permissions

Total deployment time: 3 minutes

What I'd Do Differently

  1. Start with PNG/JPG from day one - Would have saved hours of PDF debugging
  2. Add async processing earlier - For multi-page documents, async Textract jobs are better
  3. Include cost alerts - Set up billing alarms before deploying
  4. Add API Gateway - Would make it easy to trigger processing via HTTP
  5. Implement document versioning - Track changes to processed documents
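On point 2: the async flow means starting a Textract job and polling for it (or, better, subscribing to its SNS completion topic). A simplified sketch of the polling version:

```python
import time


def collect_blocks(pages):
    """Flatten paginated get_document_analysis responses into one block list."""
    blocks = []
    for page in pages:
        blocks.extend(page.get("Blocks", []))
    return blocks


def analyze_async(bucket, key):
    import boto3  # provided by the Lambda runtime
    textract = boto3.client("textract")
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    pages, token = [], None
    while True:
        kwargs = {"JobId": job["JobId"]}
        if token:
            kwargs["NextToken"] = token
        resp = textract.get_document_analysis(**kwargs)
        if resp["JobStatus"] == "IN_PROGRESS":
            time.sleep(2)  # plain polling keeps the sketch short
            continue
        if resp["JobStatus"] == "FAILED":
            raise RuntimeError(resp.get("StatusMessage", "Textract job failed"))
        pages.append(resp)
        token = resp.get("NextToken")
        if not token:
            return collect_blocks(pages)
```

The async APIs also accept multi-page PDFs from S3, which neatly sidesteps Challenge #2 for documents you don't control.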

The Bottom Line

Building this taught me that serverless isn't just a buzzword - it's genuinely transformative for document processing workflows.

What you get:

  • Automatic document processing in seconds
  • AI-powered data extraction (not just OCR)
  • Zero server management
  • Pay-per-use pricing
  • Production-ready error handling
  • Effortless scalability

What it costs:

  • ~$1.50 per 1,000 documents
  • A few hours to set up
  • Zero ongoing maintenance

What you save:

  • Hundreds of hours of manual data entry
  • Countless transcription errors
  • Server infrastructure costs
  • DevOps overhead

Ready to Use It?

The entire project is production-ready and open source on GitHub: aws-intelligent-document-processing

Quick Start:

# 1. Clone the repo
git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing

# 2. Deploy to AWS (takes ~3 minutes)
sam build
sam deploy --guided

# 3. Upload a document
aws s3 cp your-document.png s3://your-bucket-name/incoming/

# 4. Check DynamoDB for extracted data
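For step 4, if you query with the low-level DynamoDB client you'll need to unwrap its attribute-value format. A small sketch — the table name and key schema here are assumptions, so check the repo's template for the real ones:

```python
def plain(value):
    """Convert a DynamoDB attribute value ({"S": ...}, {"N": ...}, ...) to plain Python."""
    (tag, v), = value.items()
    if tag == "S":
        return v
    if tag == "N":
        return float(v) if "." in v else int(v)
    if tag == "M":
        return {k: plain(x) for k, x in v.items()}
    if tag == "L":
        return [plain(x) for x in v]
    return v  # BOOL, NULL, etc. pass through for this sketch


def fetch_result(document_id, table_name="ProcessedDocuments"):
    import boto3  # table name and key schema are illustrative
    client = boto3.client("dynamodb")
    resp = client.get_item(TableName=table_name,
                           Key={"document_id": {"S": document_id}})
    return {k: plain(v) for k, v in resp["Item"].items()}
```

(The higher-level `boto3.resource("dynamodb")` interface does this unwrapping for you; the sketch just shows what's happening underneath.)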

Everything you need is included:

  • ✅ Lambda function code
  • ✅ CloudFormation templates
  • ✅ Sample documents (invoice + tax form)
  • ✅ IAM permission templates
  • ✅ Step-by-step deployment guide

Pro tip: Start with PNG images, not PDFs. Trust me on this one.


Have questions or built something similar? I'd love to hear about it! Drop a comment or reach out.
