
Building an Intelligent Document Processing Pipeline with AWS: A Journey from Idea to Production

TL;DR - Want to Skip the Story?

Fork the repo, deploy in 10 minutes, start processing documents:

git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing
sam build && sam deploy --guided

Everything is production-ready. Just add your AWS credentials and email. Keep reading if you want to know how it works and what I learned building it.


Why I Built This (And Why You Might Need It Too)

Picture this: You're drowning in invoices, receipts, tax forms, and contracts. Your team is manually typing data from PDFs into spreadsheets. Hours wasted. Errors everywhere. Sound familiar?

I built an automated document processing system using AWS services that:

  • Extracts text, tables, and form data from uploaded documents automatically
  • Processes documents in real-time as soon as they're uploaded
  • Stores structured data ready for analysis or integration
  • Sends notifications when processing completes (or fails)
  • Scales effortlessly from 10 to 10,000 documents

The best part? Once deployed, it costs pennies per document and requires zero maintenance. And you don't have to build it from scratch - just fork, deploy, and use.

The Architecture: Simple but Powerful

Here's what I built:

Document Upload (S3) 
    ↓
Automatic Trigger (S3 Event)
    ↓
AI Processing (Lambda + Textract)
    ↓
Structured Storage (DynamoDB)
    ↓
Notification (SNS Email)

The magic happens in seconds:

  1. Drop a document in S3
  2. Lambda wakes up automatically
  3. Textract extracts everything (text, tables, forms)
  4. Data lands in DynamoDB, perfectly structured
  5. You get an email notification

No servers to manage. No infrastructure to maintain. Just pure serverless goodness.
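Under the hood, the Lambda glue is only a few boto3 calls. Here's a minimal sketch of the shape of it — the table name, topic ARN, and helper names are illustrative, not the repo's actual code:

```python
import urllib.parse


def parse_s3_event(event):
    """Pull (bucket, key) pairs out of an S3 event payload; keys arrive URL-encoded."""
    return [
        (r["s3"]["bucket"]["name"],
         urllib.parse.unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def handler(event, context):
    import boto3  # provided by the Lambda runtime
    textract = boto3.client("textract")
    table = boto3.resource("dynamodb").Table("ProcessedDocuments")  # assumed table name
    sns = boto3.client("sns")

    for bucket, key in parse_s3_event(event):
        # Ask Textract for raw text plus table and form structure
        result = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
        )
        table.put_item(Item={
            "document_id": key,
            "status": "COMPLETED",
            "block_count": len(result["Blocks"]),
        })
        sns.publish(  # in a real deployment the topic ARN comes from an env var
            TopicArn="arn:aws:sns:us-east-1:123456789012:doc-processing",
            Message=f"Processed {key}: {len(result['Blocks'])} blocks",
        )
```

That's the whole pipeline: S3 hands the event to `handler`, Textract does the heavy lifting, and everything else is bookkeeping.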

What Makes This Cool?

1. It Actually Understands Your Documents

This isn't just OCR. Amazon Textract uses machine learning to understand document structure:

  • Tables: Extracts rows and columns with relationships intact
  • Forms: Identifies key-value pairs (Invoice #: 12345, Date: Dec 23, etc.)
  • Text: Gets every word with confidence scores

I tested it with a business invoice containing 4 line items, calculations, and company details. Textract extracted 155 text blocks and perfectly reconstructed the entire table structure. Here's what it found:

{
  "fields": {
    "Invoice #": "INV-2024-001234",
    "Date": "December 23, 2024",
    "Total": "$30,922.50"
  },
  "tables": [
    {
      "rows": 8,
      "columns": 4,
      "data": [
        ["Description", "Quantity", "Price", "Total"],
        ["AWS Cloud Setup", "1", "$5,000", "$5,000"],
        ["DevOps Consulting (80 hrs)", "80", "$175/hr", "$14,000"],
        ...
      ]
    }
  ]
}
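Textract doesn't hand you that JSON directly: forms come back as `KEY_VALUE_SET` blocks linked by IDs, and you reassemble the pairs yourself. A minimal sketch of that walk (real responses carry more per block, like geometry and confidence scores):

```python
def extract_fields(blocks):
    """Reassemble form key-value pairs from Textract KEY_VALUE_SET blocks."""
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block):
        # Concatenate the WORD children of a KEY or VALUE block
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]
                          if by_id[i]["BlockType"] == "WORD"]
        return " ".join(words)

    fields = {}
    for block in blocks:
        if (block["BlockType"] == "KEY_VALUE_SET"
                and "KEY" in block.get("EntityTypes", [])):
            value_text = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    value_text = " ".join(text_of(by_id[i]) for i in rel["Ids"])
            fields[text_of(block)] = value_text
    return fields
```

Tables work the same way: `TABLE` blocks point at `CELL` children, which point at `WORD` children.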

2. Zero Infrastructure Management

Remember the days of provisioning servers, configuring load balancers, and setting up auto-scaling? Yeah, me neither. With this serverless architecture:

  • Lambda handles compute (only runs when needed)
  • S3 triggers events automatically
  • DynamoDB scales on demand
  • CloudWatch monitors everything

I literally deployed this, uploaded a document, and walked away. It just works.

3. Cost-Effective at Any Scale

Let's talk money:

  • Textract: $1.50 per 1,000 pages for text detection (first 1M pages/month); forms and tables analysis is priced higher per page
  • Lambda: First 1M requests free, then $0.20 per 1M
  • S3: $0.023 per GB
  • DynamoDB: First 25 GB free

Real example: text detection on 1,000 single-page invoices per month costs about $1.50. Even with forms and tables analysis on top, you're paying cents per document.

4. Production-Ready Error Handling

Things fail. Networks hiccup. Documents are corrupted. I built in:

  • Dead Letter Queue for failed processing
  • CloudWatch Alarms for error monitoring
  • SNS notifications for both success and failure
  • Automatic retries with exponential backoff
  • Status tracking in DynamoDB

When something breaks, you know immediately.
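Lambda's built-in async retries and the DLQ cover most failure paths, but application-level calls (to Textract, for instance) benefit from the same pattern. A minimal backoff helper in the spirit of the above — the names are mine, not the repo's:

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=0.5, retriable=(Exception,)):
    """Retry fn with exponential backoff plus jitter, re-raising once exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise
            # Sleep 0.5s, 1s, 2s, ... plus jitter so retries don't stampede
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In practice you'd narrow `retriable` to throttling and transient network errors; retrying a genuinely corrupt document just burns invocations.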

The Challenges (And How I Solved Them)

Challenge #1: Circular Dependencies in CloudFormation

My first SAM deployment failed with:

Circular dependency between resources: [DocumentBucket, DocumentProcessorFunction, DocumentProcessorFunctionPermission]

The Problem: The S3 bucket's event configuration needed the Lambda function's ARN, the Lambda function needed the bucket name, and the invoke permission needed both. Classic chicken-and-egg.

The Solution:

  1. Deploy infrastructure without S3 event notifications
  2. Add S3 notifications separately using AWS CLI
  3. Keep Lambda permission with explicit DependsOn

Lesson learned: Sometimes you need to break CloudFormation into multiple steps.

Challenge #2: PDF Format Compatibility Hell

This was the big one. I generated beautiful PDFs using Python's reportlab library. They looked perfect. Textract rejected every single one.

UnsupportedDocumentException: Request has unsupported document format

Even the official IRS Form 1040 PDF failed!

The Investigation:

  • Textract supports PDFs (per AWS docs) ✅
  • My PDFs were valid (opened fine in Preview) ✅
  • File size was under 5MB ✅
  • But still... rejected ❌

The Breakthrough: Not all PDFs are created equal. Textract's synchronous APIs were built around PNG and JPEG images, and where PDFs are accepted at all, Textract is strict about their internal structure. PDFs generated by reportlab, fpdf, and even some government forms use constructs Textract doesn't support.

The Solution: Convert PDFs to PNG/JPG images first.

# Using poppler's pdftoppm
pdftoppm -png -singlefile document.pdf output
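To convert a whole folder before upload, that same poppler call can be scripted. A small sketch (assumes `pdftoppm` is on your PATH; 300 DPI keeps fine print legible for OCR):

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf, out_stem, dpi=300):
    """Build the pdftoppm invocation for a single-page PNG render."""
    return ["pdftoppm", "-png", "-singlefile", "-r", str(dpi), str(pdf), str(out_stem)]


def convert_all(folder):
    # Render every PDF in `folder` to a sibling PNG (pdftoppm writes <stem>.png)
    for pdf in Path(folder).glob("*.pdf"):
        subprocess.run(pdftoppm_cmd(pdf, pdf.with_suffix("")), check=True)
```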

Results:

  • IRS Form 1040 as PDF: ❌ Failed
  • Same form as PNG: ✅ 1,412 blocks extracted
  • Reportlab invoice as PDF: ❌ Failed
  • Same invoice as PNG: ✅ 155 blocks extracted

Key Takeaway: If you're programmatically generating documents for Textract, create them as PNG/JPG from the start. Save yourself the headache.

Challenge #3: IAM Permissions Maze

SAM deployment needs a surprising number of AWS permissions:

  • CloudFormation (to create stacks)
  • Lambda (to deploy functions)
  • IAM (to create execution roles)
  • S3 (to store artifacts and documents)
  • DynamoDB (to create tables)
  • SNS (to create topics)
  • CloudWatch (to create log groups and alarms)
  • X-Ray (for tracing)

Best Practice: Create an IAM user group with all required managed policies, then add your deployment user to it. This makes permission management cleaner and reusable.

Real Results: What It Actually Extracted

Test 1: Business Invoice

Input: PNG image with company header, bill-to info, 4 line items, and calculations

Extracted:

  • 8 key-value pairs (Invoice #, dates, addresses)
  • Complete 8x4 table with all line items
  • Subtotal, tax, and total calculations
  • 155 total text blocks

Processing time: ~2 seconds

Test 2: IRS Form 1040 (Tax Form)

Input: Official IRS form converted to PNG

Extracted:

  • 1,412 text blocks
  • All form field labels
  • Form structure and layout
  • Every piece of text on the form

Processing time: ~3 seconds

Deployment: From Zero to Production in 10 Minutes

Here's what it actually took:

# 1. Build the application
sam build

# 2. Deploy to AWS
sam deploy --stack-name doc-processing-dev \
  --region us-east-1 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides NotificationEmail=your@email.com \
  --resolve-s3 \
  --no-confirm-changeset

# 3. Configure S3 event notifications
aws s3api put-bucket-notification-configuration \
  --bucket your-bucket-name \
  --notification-configuration file://s3-notification-config.json

# Done!
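For reference, the notification configuration passed in step 3 looks roughly like this — substitute your function's ARN; the incoming/ prefix matches where documents get uploaded:

```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:doc-processor",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {"Name": "prefix", "Value": "incoming/"}
          ]
        }
      }
    }
  ]
}
```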

CloudFormation created:

  • 1 S3 bucket
  • 1 Lambda function
  • 1 DynamoDB table
  • 1 SNS topic with email subscription
  • 2 CloudWatch alarms
  • 1 SQS dead letter queue
  • All IAM roles and permissions

Total deployment time: 3 minutes

What I'd Do Differently

  1. Start with PNG/JPG from day one - Would have saved hours of PDF debugging
  2. Add async processing earlier - For multi-page documents, async Textract jobs are better
  3. Include cost alerts - Set up billing alarms before deploying
  4. Add API Gateway - Would make it easy to trigger processing via HTTP
  5. Implement document versioning - Track changes to processed documents
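On point 2: the async flow means starting a Textract job and polling for it (or, better, subscribing to its SNS completion topic). A simplified sketch of the polling version:

```python
import time


def collect_blocks(pages):
    """Flatten paginated get_document_analysis responses into one block list."""
    blocks = []
    for page in pages:
        blocks.extend(page.get("Blocks", []))
    return blocks


def analyze_async(bucket, key):
    import boto3  # provided by the Lambda runtime
    textract = boto3.client("textract")
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    pages, token = [], None
    while True:
        kwargs = {"JobId": job["JobId"]}
        if token:
            kwargs["NextToken"] = token
        resp = textract.get_document_analysis(**kwargs)
        if resp["JobStatus"] == "IN_PROGRESS":
            time.sleep(2)  # plain polling keeps the sketch short
            continue
        if resp["JobStatus"] == "FAILED":
            raise RuntimeError(resp.get("StatusMessage", "Textract job failed"))
        pages.append(resp)
        token = resp.get("NextToken")
        if not token:
            return collect_blocks(pages)
```

The async APIs also accept multi-page PDFs from S3, which neatly sidesteps Challenge #2 for documents you don't control.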

The Bottom Line

Building this taught me that serverless isn't just a buzzword - it's genuinely transformative for document processing workflows.

What you get:

  • Automatic document processing in seconds
  • AI-powered data extraction (not just OCR)
  • Zero server management
  • Pay-per-use pricing
  • Production-ready error handling
  • Effortless scalability

What it costs:

  • ~$1.50 per 1,000 documents
  • A few hours to set up
  • Zero ongoing maintenance

What you save:

  • Hundreds of hours of manual data entry
  • Countless transcription errors
  • Server infrastructure costs
  • DevOps overhead

Ready to Use It?

The entire project is production-ready and open source on GitHub: aws-intelligent-document-processing

Quick Start:

# 1. Clone the repo
git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing

# 2. Deploy to AWS (takes ~3 minutes)
sam build
sam deploy --guided

# 3. Upload a document
aws s3 cp your-document.png s3://your-bucket-name/incoming/

# 4. Check DynamoDB for extracted data
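For step 4, if you query with the low-level DynamoDB client you'll need to unwrap its attribute-value format. A small sketch — the table name and key schema here are assumptions, so check the repo's template for the real ones:

```python
def plain(value):
    """Convert a DynamoDB attribute value ({"S": ...}, {"N": ...}, ...) to plain Python."""
    (tag, v), = value.items()
    if tag == "S":
        return v
    if tag == "N":
        return float(v) if "." in v else int(v)
    if tag == "M":
        return {k: plain(x) for k, x in v.items()}
    if tag == "L":
        return [plain(x) for x in v]
    return v  # BOOL, NULL, etc. pass through for this sketch


def fetch_result(document_id, table_name="ProcessedDocuments"):
    import boto3  # table name and key schema are illustrative
    client = boto3.client("dynamodb")
    resp = client.get_item(TableName=table_name,
                           Key={"document_id": {"S": document_id}})
    return {k: plain(v) for k, v in resp["Item"].items()}
```

(The higher-level `boto3.resource("dynamodb")` interface does this unwrapping for you; the sketch just shows what's happening underneath.)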

Everything you need is included:

  • ✅ Lambda function code
  • ✅ CloudFormation templates
  • ✅ Sample documents (invoice + tax form)
  • ✅ IAM permission templates
  • ✅ Step-by-step deployment guide

Pro tip: Start with PNG images, not PDFs. Trust me on this one.


Have questions or built something similar? I'd love to hear about it! Drop a comment or reach out.
