TL;DR - Want to Skip the Story?
Fork the repo, deploy in 10 minutes, start processing documents:
git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing
sam build && sam deploy --guided
Everything is production-ready. Just add your AWS credentials and email. Keep reading if you want to know how it works and what I learned building it.
Why I Built This (And Why You Might Need It Too)
Picture this: You're drowning in invoices, receipts, tax forms, and contracts. Your team is manually typing data from PDFs into spreadsheets. Hours wasted. Errors everywhere. Sound familiar?
I built an automated document processing system using AWS services that:
- Extracts text, tables, and form data from any document automatically
- Processes documents in real-time as soon as they're uploaded
- Stores structured data ready for analysis or integration
- Sends notifications when processing completes (or fails)
- Scales effortlessly from 10 to 10,000 documents
The best part? Once deployed, it costs pennies per document and requires zero maintenance. And you don't have to build it from scratch - just fork, deploy, and use.
The Architecture: Simple but Powerful
Here's what I built:
Document Upload (S3)
↓
Automatic Trigger (S3 Event)
↓
AI Processing (Lambda + Textract)
↓
Structured Storage (DynamoDB)
↓
Notification (SNS Email)
The magic happens in seconds:
- Drop a document in S3
- Lambda wakes up automatically
- Textract extracts everything (text, tables, forms)
- Data lands in DynamoDB, perfectly structured
- You get an email notification
No servers to manage. No infrastructure to maintain. Just pure serverless goodness.
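The first hop in that flow is the S3 event that wakes the Lambda function. Here's a minimal sketch of pulling the bucket and key out of that event, assuming the standard S3 notification payload shape. The function name `extract_s3_objects` and the sample event are illustrative and not taken from the repo:

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        raw_key = s3_info.get("object", {}).get("key")
        if bucket and raw_key:
            # S3 URL-encodes object keys in event payloads (spaces arrive as "+").
            objects.append((bucket, unquote_plus(raw_key)))
    return objects


# Hand-written sample event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "your-bucket-name"},
                "object": {"key": "incoming/my+invoice.png"},
            }
        }
    ]
}

print(extract_s3_objects(sample_event))
```

Decoding the key matters in practice: a file uploaded with a space in its name would otherwise be looked up under the wrong key.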
What Makes This Cool?
1. It Actually Understands Your Documents
This isn't just OCR. Amazon Textract uses machine learning to understand document structure:
- Tables: Extracts rows and columns with relationships intact
- Forms: Identifies key-value pairs (Invoice #: 12345, Date: Dec 23, etc.)
- Text: Gets every word with confidence scores
I tested it with a business invoice containing 4 line items, calculations, and company details. Textract extracted 155 text blocks and perfectly reconstructed the entire table structure. Here's what it found:
{
  "fields": {
    "Invoice #": "INV-2024-001234",
    "Date": "December 23, 2024",
    "Total": "$30,922.50"
  },
  "tables": [
    {
      "rows": 8,
      "columns": 4,
      "data": [
        ["Description", "Quantity", "Price", "Total"],
        ["AWS Cloud Setup", "1", "$5,000", "$5,000"],
        ["DevOps Consulting (80 hrs)", "80", "$175/hr", "$14,000"],
        ...
      ]
    }
  ]
}
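For anyone curious how a "fields" map like this can be derived, here's a rough sketch that walks the `Blocks` list of a Textract AnalyzeDocument response and pairs each KEY block with its VALUE block. The helper names and the tiny hand-written sample response are assumptions for illustration; the repo's actual parsing code may differ:

```python
def _child_text(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_fields(response):
    """Map each form key to its value using KEY_VALUE_SET blocks."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        key_text = _child_text(block, blocks_by_id)
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # A KEY block points at its VALUE block(s) by Id.
                for value_id in rel["Ids"]:
                    value_parts.append(_child_text(blocks_by_id[value_id], blocks_by_id))
        fields[key_text] = " ".join(value_parts)
    return fields


# Hand-written sample with only the attributes this sketch reads.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "#"},
        {"Id": "w3", "BlockType": "WORD", "Text": "INV-2024-001234"},
    ]
}

print(extract_fields(sample_response))
```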
2. Zero Infrastructure Management
Remember the days of provisioning servers, configuring load balancers, and setting up auto-scaling? Yeah, me neither. With this serverless architecture:
- Lambda handles compute (only runs when needed)
- S3 triggers events automatically
- DynamoDB scales with load, with no servers to size
- CloudWatch monitors everything
I literally deployed this, uploaded a document, and walked away. It just works.
3. Cost-Effective at Any Scale
Let's talk money:
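- Textract: $1.50 per 1,000 pages for plain text detection (first 1M pages/month); table and form extraction through AnalyzeDocument is billed at a higher per-page rate
- Lambda: First 1M requests per month free, then $0.20 per 1M requests (plus compute duration)
- S3: $0.023 per GB per month (Standard storage)
- DynamoDB: First 25 GB of storage free
Real example: Processing 1,000 single-page invoices per month costs about $1.50 in Textract text-detection charges. Table and form extraction raises that figure, so check the current Textract pricing for the features you turn on.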
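To make the arithmetic explicit, here's a small back-of-the-envelope estimator. It is a sketch under simplifying assumptions: one Lambda invocation per page, a flat Textract rate with no volume tiers, and no Lambda duration, S3, or DynamoDB charges. The default rates are the figures quoted above; the function name is made up for this example, and you should pass in the per-1,000-page rate for whichever Textract features you actually use:

```python
def estimate_monthly_cost(pages,
                          textract_rate_per_1000=1.50,
                          lambda_free_requests=1_000_000,
                          lambda_rate_per_million=0.20):
    """Rough monthly estimate in USD for a given number of single-page documents."""
    textract_cost = pages / 1000 * textract_rate_per_1000
    # Only requests beyond the monthly free allowance are billed.
    billable_requests = max(0, pages - lambda_free_requests)
    lambda_cost = billable_requests / 1_000_000 * lambda_rate_per_million
    return {
        "textract": textract_cost,
        "lambda_requests": lambda_cost,
        "total": textract_cost + lambda_cost,
    }


print(estimate_monthly_cost(1_000))
```

At 1,000 pages the Lambda request charge is zero because the volume sits inside the free allowance, so the Textract rate dominates the bill.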
4. Production-Ready Error Handling
Things fail. Networks hiccup. Documents are corrupted. I built in:
- Dead Letter Queue for failed processing
- CloudWatch Alarms for error monitoring
- SNS notifications for both success and failure
- Automatic retries with exponential backoff
- Status tracking in DynamoDB
When something breaks, you know immediately.
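As an illustration of the retry idea, here's a generic exponential backoff helper. This is a sketch, not the repo's code: how the project actually retries (for example, through Lambda's built-in retry behavior for asynchronous invocations) isn't shown here, and the names `backoff_delays` and `call_with_retries` are hypothetical:

```python
import random
import time


def backoff_delays(max_attempts, base_seconds=1.0, cap_seconds=30.0,
                   jitter=True, rng=random.random):
    """Yield the wait before each retry: base * 2^n, capped, optionally jittered."""
    for attempt in range(max_attempts - 1):
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        # "Full jitter": pick a random point in [0, delay) to spread out retries.
        yield delay * rng() if jitter else delay


def call_with_retries(func, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    """Call func(), retrying on any exception until max_attempts is reached."""
    delays = backoff_delays(max_attempts, **backoff_kwargs)
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            # Real code should catch only retryable errors (throttling, timeouts).
            if attempt == max_attempts:
                raise
            sleep(next(delays))


print(list(backoff_delays(4, jitter=False)))
```

Passing `sleep` and `rng` as parameters keeps the helper easy to test without actually waiting.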
The Challenges (And How I Solved Them)
Challenge #1: Circular Dependencies in CloudFormation
My first SAM deployment failed with:
Circular dependency between resources: [DocumentBucket, DocumentProcessorFunction, DocumentProcessorFunctionPermission]
The Problem: The bucket's event notification referenced the Lambda function, the function referenced the bucket name, and the Lambda permission referenced both. Classic chicken-and-egg.
The Solution:
- Deploy infrastructure without S3 event notifications
- Add S3 notifications separately using AWS CLI
- Keep Lambda permission with explicit DependsOn
Lesson learned: Sometimes you need to break CloudFormation into multiple steps.
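For the second step, the AWS CLI expects a JSON notification configuration. The article doesn't show the contents of that file, so the following is a sketch of how one could be generated. The ARN is a placeholder, the `incoming/` prefix is assumed from the upload path used in the quick start, and `build_notification_config` is a hypothetical helper:

```python
import json


def build_notification_config(lambda_arn, prefix="incoming/",
                              events=("s3:ObjectCreated:*",)):
    """Build the document passed to put-bucket-notification-configuration."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": list(events),
                # Only objects under this key prefix trigger the function.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


# Placeholder ARN; substitute the function ARN from your own stack outputs.
config = build_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:example-document-processor"
)
print(json.dumps(config, indent=2))
```

Saving the printed JSON to a file gives you something to pass with `--notification-configuration file://...`. S3 validates that it is allowed to invoke the function when the configuration is applied, so the Lambda permission has to exist first.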
Challenge #2: PDF Format Compatibility Hell
This was the big one. I generated beautiful PDFs using Python's reportlab library. They looked perfect. Textract rejected every single one.
UnsupportedDocumentException: Request has unsupported document format
Even the official IRS Form 1040 PDF failed!
The Investigation:
- Textract supports PDFs (per AWS docs) ✅
- My PDFs were valid (opened fine in Preview) ✅
- File size was under 5MB ✅
- But still... rejected ❌
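The Breakthrough: Not all PDFs are created equal. Textract is picky about what it accepts. In my tests, PDFs generated by reportlab and even the official government form were rejected, and I'd expect other generators such as fpdf to hit the same wall. One likely contributor worth ruling out: the synchronous Textract operations accept only single-page PDFs, so a multi-page file like Form 1040 needs either the asynchronous API or a per-page conversion.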
The Solution: Convert PDFs to PNG/JPG images first.
# Using poppler's pdftoppm (-singlefile renders only the first page, written to output.png)
pdftoppm -png -singlefile document.pdf output
Results:
- IRS Form 1040 as PDF: ❌ Failed
- Same form as PNG: ✅ 1,412 blocks extracted
- Reportlab invoice as PDF: ❌ Failed
- Same invoice as PNG: ✅ 155 blocks extracted
Key Takeaway: If you're programmatically generating documents for Textract, create them as PNG/JPG from the start. Save yourself the headache.
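If you'd rather script the conversion than run it by hand, here's a minimal Python wrapper around the same pdftoppm command. It assumes poppler is installed and on the PATH. The 300 DPI resolution is my assumption rather than a setting from the article, and both function names are illustrative:

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the pdftoppm argument list for a first-page PNG render."""
    return [
        "pdftoppm",
        "-png",          # write PNG output
        "-singlefile",   # first page only, no page-number suffix
        "-r", str(dpi),  # render resolution in DPI
        str(pdf_path),
        str(output_prefix),
    ]


def pdf_first_page_to_png(pdf_path, dpi=300):
    """Render the first page of a PDF next to the source file and return the PNG path."""
    pdf_path = Path(pdf_path)
    output_prefix = pdf_path.with_suffix("")
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
    # pdftoppm appends ".png" to the prefix it was given.
    return Path(str(output_prefix) + ".png")


print(" ".join(build_pdftoppm_command("document.pdf", "output")))
```

For multi-page documents you would drop `-singlefile` and upload each page image separately, since this sketch keeps only page one.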
Challenge #3: IAM Permissions Maze
SAM deployment needs a surprising number of AWS permissions:
- CloudFormation (to create stacks)
- Lambda (to deploy functions)
- IAM (to create execution roles)
- S3 (to store artifacts and documents)
- DynamoDB (to create tables)
- SNS (to create topics)
- CloudWatch (to create log groups and alarms)
- X-Ray (for tracing)
Best Practice: Create an IAM user group with all required managed policies, then add your deployment user to it. This makes permission management cleaner and reusable.
Real Results: What It Actually Extracted
Test 1: Business Invoice
Input: PNG image with company header, bill-to info, 4 line items, and calculations
Extracted:
- 8 key-value pairs (Invoice #, dates, addresses)
- Complete 8x4 table with all line items
- Subtotal, tax, and total calculations
- 155 total text blocks
Processing time: ~2 seconds
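Here's a sketch of how a table like that can be rebuilt from the raw response: each TABLE block lists its CELL children, and every CELL carries a 1-based `RowIndex` and `ColumnIndex`. The output shape mirrors the "rows / columns / data" structure shown earlier, but the function itself and the hand-written 2x2 sample are assumptions, not the repo's implementation:

```python
def reconstruct_tables(response):
    """Rebuild each TABLE block into a row-major grid of cell text."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] != "CELL":
                    continue
                words = [
                    blocks_by_id[word_id]["Text"]
                    for cell_rel in cell.get("Relationships", [])
                    if cell_rel["Type"] == "CHILD"
                    for word_id in cell_rel["Ids"]
                    if blocks_by_id[word_id]["BlockType"] == "WORD"
                ]
                cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        # Missing cells become empty strings so every row has the same width.
        data = [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                for r in range(1, n_rows + 1)]
        tables.append({"rows": n_rows, "columns": n_cols, "data": data})
    return tables


def _cell(cell_id, row, col, word_ids):
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


# Hand-written sample with only the attributes this sketch reads.
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        _cell("c1", 1, 1, ["w1"]),
        _cell("c2", 1, 2, ["w2"]),
        _cell("c3", 2, 1, ["w3", "w4", "w5"]),
        _cell("c4", 2, 2, ["w6"]),
        {"Id": "w1", "BlockType": "WORD", "Text": "Description"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Total"},
        {"Id": "w3", "BlockType": "WORD", "Text": "AWS"},
        {"Id": "w4", "BlockType": "WORD", "Text": "Cloud"},
        {"Id": "w5", "BlockType": "WORD", "Text": "Setup"},
        {"Id": "w6", "BlockType": "WORD", "Text": "$5,000"},
    ]
}

print(reconstruct_tables(sample_response))
```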
Test 2: IRS Form 1040 (Tax Form)
Input: Official IRS form converted to PNG
Extracted:
- 1,412 text blocks
- All form field labels
- Form structure and layout
- Every piece of text on the form
Processing time: ~3 seconds
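Block counts like these are easy to compute from the response. The sketch below tallies blocks by type and averages word confidence, returning the average as a `Decimal` because boto3's DynamoDB resource API rejects Python floats. The field names in the summary are made up for illustration and are not the repo's table schema:

```python
from collections import Counter
from decimal import Decimal


def summarize_blocks(response):
    """Summarize a Textract response into a DynamoDB-friendly dict."""
    blocks = response.get("Blocks", [])
    counts = Counter(block["BlockType"] for block in blocks)
    confidences = [
        block["Confidence"]
        for block in blocks
        if block["BlockType"] == "WORD" and "Confidence" in block
    ]
    average = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "total_blocks": len(blocks),
        "blocks_by_type": dict(counts),
        # Convert through str() so the Decimal holds the rounded value exactly.
        "avg_word_confidence": Decimal(str(round(average, 2))),
    }


# Hand-written sample; real responses carry many more attributes per block.
sample_blocks = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Confidence": 99.0},
        {"BlockType": "WORD", "Confidence": 99.5},
        {"BlockType": "WORD", "Confidence": 98.5},
    ]
}

print(summarize_blocks(sample_blocks))
```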
Deployment: From Zero to Production in 10 Minutes
Here's what it actually took:
# 1. Build the application
sam build
# 2. Deploy to AWS
sam deploy --stack-name doc-processing-dev \
  --region us-east-1 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides NotificationEmail=your@email.com \
  --resolve-s3 \
  --no-confirm-changeset
# 3. Configure S3 event notifications
aws s3api put-bucket-notification-configuration \
  --bucket your-bucket-name \
  --notification-configuration file://s3-notification-config.json
# Done!
CloudFormation created:
- 1 S3 bucket
- 1 Lambda function
- 1 DynamoDB table
- 1 SNS topic with email subscription
- 2 CloudWatch alarms
- 1 SQS dead letter queue
- All IAM roles and permissions
Total deployment time: 3 minutes
What I'd Do Differently
- Start with PNG/JPG from day one - Would have saved hours of PDF debugging
- Add async processing earlier - For multi-page documents, async Textract jobs are better
- Include cost alerts - Set up billing alarms before deploying
- Add API Gateway - Would make it easy to trigger processing via HTTP
- Implement document versioning - Track changes to processed documents
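On the async point, here's roughly what that could look like with Textract's asynchronous text detection calls. This is a sketch that polls for simplicity; a production setup would more likely rely on the job's SNS completion notification. The function names are illustrative, and `client` is expected to be a boto3 Textract client (for example `boto3.client("textract")`), passed in so the logic can be tested with a stub:

```python
import time


def start_text_detection(client, bucket, key):
    """Start an async Textract job for a document already in S3; return its JobId."""
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def collect_job_blocks(client, job_id, poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Poll until the job finishes, then gather blocks from every result page."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "IN_PROGRESS":
            sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # FAILED and PARTIAL_SUCCESS are both treated as errors in this sketch.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        blocks = list(response.get("Blocks", []))
        # Large results are paginated; follow NextToken until it disappears.
        while "NextToken" in response:
            response = client.get_document_text_detection(
                JobId=job_id, NextToken=response["NextToken"]
            )
            blocks.extend(response.get("Blocks", []))
        return blocks
    raise TimeoutError(f"Textract job {job_id} still running after {max_polls} polls")
```

Unlike the synchronous calls, the async API reads multi-page documents directly from S3, which is what makes it a better fit for longer files.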
The Bottom Line
Building this taught me that serverless isn't just a buzzword - it's genuinely transformative for document processing workflows.
What you get:
- Automatic document processing in seconds
- AI-powered data extraction (not just OCR)
- Zero server management
- Pay-per-use pricing
- Production-ready error handling
- Automatic scaling with load
What it costs:
- ~$1.50 per 1,000 pages for text detection (more with table and form extraction)
- A few hours to set up
- Zero ongoing maintenance
What you save:
- Hundreds of hours of manual data entry
- Countless transcription errors
- Server infrastructure costs
- DevOps overhead
Ready to Use It?
The entire project is production-ready and open source on GitHub: aws-intelligent-document-processing
Quick Start:
# 1. Clone the repo
git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing
# 2. Deploy to AWS (takes ~3 minutes)
sam build
sam deploy --guided
# 3. Upload a document
aws s3 cp your-document.png s3://your-bucket-name/incoming/
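# 4. Check DynamoDB for extracted data (replace your-table-name with the table your stack created)
aws dynamodb scan --table-name your-table-name --max-items 5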
Everything you need is included:
- ✅ Lambda function code
- ✅ CloudFormation templates
- ✅ Sample documents (invoice + tax form)
- ✅ IAM permission templates
- ✅ Step-by-step deployment guide
Pro tip: Start with PNG images, not PDFs. Trust me on this one.
Have questions or built something similar? I'd love to hear about it! Drop a comment or reach out.