Introduction
This guide provides detailed instructions on building a serverless document processing pipeline using AWS Lambda and Amazon Bedrock Data Automation (BDA). We will create an automated document processing system where Lambda functions are triggered when documents are uploaded to S3, using BDA to extract structured information from documents.
What You Will Learn
After completing this guide, you will be able to:
- Create and configure Lambda functions through AWS Console
- Integrate Lambda with Amazon Bedrock Data Automation
- Set up S3 Event triggers for Lambda
- Automatically process documents and save results
- Monitor and troubleshoot Lambda functions
Solution Architecture
At a high level, the pipeline is event-driven: a document uploaded to an input S3 bucket triggers a Lambda function, which calls BDA to extract structured information and writes the results to an output S3 bucket. The flow is described step by step in the Detailed Workflow section below.
Knowledge Requirements
- Basic understanding of AWS Console
- Basic Python knowledge
- Understanding of S3 and Lambda (basic level)
Article Structure
This article is divided into 5 main parts:
| Part | Content |
|---|---|
| Part 1 | Environment Setup (Enable Models, S3, IAM, BDA Project) |
| Part 2 | Create and Configure Lambda Function |
| Part 3 | Set up S3 Trigger |
| Part 4 | Testing and Deployment |
| Part 5 | Real-world Examples and Best Practices |
Prerequisites
Before starting, ensure you have:
- AWS Account with IAM permissions for Bedrock, Lambda, S3, IAM
- Access to AWS Console
- Selected a region that supports Amazon Bedrock Data Automation (recommended: us-east-1 or us-west-2)
- Modern browser (Chrome, Firefox, Safari, or Edge)
Note about Region: Amazon Bedrock Data Automation is not available in all regions. Check Bedrock endpoints and quotas to see supported regions.
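If you want to confirm programmatically that BDA is reachable in your chosen region before starting, an optional check like the sketch below can help. It assumes a recent boto3 release that includes the bedrock-data-automation client; the list call and its response key are as I understand the current API, so verify against your boto3 version.

```python
# Optional: quick sanity check that Bedrock Data Automation is reachable
# in your chosen region (assumes a recent boto3 with the
# 'bedrock-data-automation' client).
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def check_bda_region(region_name="us-east-1"):
    try:
        bda = boto3.client("bedrock-data-automation", region_name=region_name)
        # Listing projects is a lightweight call; an empty list still proves access.
        projects = bda.list_data_automation_projects().get("projects", [])
        print(f"BDA reachable in {region_name}; found {len(projects)} project(s).")
        return True
    except (ClientError, EndpointConnectionError) as err:
        print(f"BDA not usable in {region_name}: {err}")
        return False

if __name__ == "__main__":
    check_bda_region("us-east-1")
```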
Lambda Function Code Reference
Before setting up the environment, here is the complete Lambda function code that you will use later in Part 2. Having this code available upfront allows you to understand what we're building toward.
File: lambda_function.py
```python
import json
import boto3
import os
import time
from datetime import datetime
from urllib.parse import unquote_plus

# Initialize AWS clients
s3_client = boto3.client('s3')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
sts_client = boto3.client('sts')

# Get configuration from environment variables
BDA_PROJECT_ARN = os.environ['BDA_PROJECT_ARN']
OUTPUT_BUCKET = os.environ['OUTPUT_BUCKET']

# Get region and account ID dynamically (Lambda provides the region automatically)
AWS_REGION = os.environ['AWS_REGION']
account_id = sts_client.get_caller_identity()['Account']

# Construct BDA Profile ARN dynamically
BDA_PROFILE_ARN = f"arn:aws:bedrock:{AWS_REGION}:{account_id}:data-automation-profile/us.data-automation-v1"


def lambda_handler(event, context):
    """
    Lambda handler function to process documents with BDA.
    Triggered by S3 upload events.
    """
    try:
        # Extract S3 bucket and key from the event
        s3_event = event['Records'][0]['s3']
        input_bucket = s3_event['bucket']['name'].lower()  # Ensure lowercase

        # URL-decode the key (S3 events deliver URL-encoded keys)
        document_key = unquote_plus(s3_event['object']['key'])

        print(f"Processing document: s3://{input_bucket}/{document_key}")

        # Validate bucket name format (BDA requirement)
        if not input_bucket.replace('-', '').replace('.', '').isalnum():
            raise ValueError(
                f"Invalid bucket name format: {input_bucket}. "
                "Must contain only lowercase letters, numbers, dots, and hyphens."
            )

        # Construct S3 URIs
        input_s3_uri = f"s3://{input_bucket}/{document_key}"

        # Generate output path with timestamp
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        file_name = document_key.split('/')[-1]
        output_prefix = f"processed/{timestamp}_{file_name}"
        output_s3_uri = f"s3://{OUTPUT_BUCKET}/{output_prefix}"

        # Invoke BDA to process the document
        print("Invoking Bedrock Data Automation...")
        response = bda_runtime_client.invoke_data_automation_async(
            inputConfiguration={
                's3Uri': input_s3_uri
            },
            outputConfiguration={
                's3Uri': output_s3_uri
            },
            dataAutomationConfiguration={
                'dataAutomationProjectArn': BDA_PROJECT_ARN,
                'stage': 'LIVE'
            },
            dataAutomationProfileArn=BDA_PROFILE_ARN
        )

        invocation_arn = response['invocationArn']
        print(f"BDA Invocation ARN: {invocation_arn}")

        # Wait for processing to complete (with timeout)
        max_wait_time = 240  # 4 minutes (leaves ~1 minute of the 5-minute Lambda timeout as headroom)
        wait_interval = 10
        elapsed_time = 0

        while elapsed_time < max_wait_time:
            status_response = bda_runtime_client.get_data_automation_status(
                invocationArn=invocation_arn
            )
            status = status_response['status']
            print(f"Current status: {status}")

            if status == 'Success':
                result_s3_uri = status_response['outputConfiguration']['s3Uri']
                print("Processing completed successfully!")
                print(f"Results saved to: {result_s3_uri}")
                return {
                    'statusCode': 200,
                    'body': json.dumps({
                        'message': 'Document processed successfully',
                        'input': input_s3_uri,
                        'output': result_s3_uri,
                        'invocationArn': invocation_arn
                    })
                }
            elif status in ['ClientError', 'ServiceError']:
                error_msg = f"Processing failed with status: {status}"
                print(error_msg)
                return {
                    'statusCode': 500,
                    'body': json.dumps({
                        'error': error_msg,
                        'invocationArn': invocation_arn
                    })
                }

            # Wait before checking again
            time.sleep(wait_interval)
            elapsed_time += wait_interval

        # Timeout reached
        print(f"Processing still in progress after {max_wait_time} seconds")
        return {
            'statusCode': 202,
            'body': json.dumps({
                'message': 'Processing initiated but not completed within Lambda timeout',
                'invocationArn': invocation_arn,
                'note': 'Check BDA console for final status'
            })
        }

    except Exception as e:
        error_msg = f"Error processing document: {str(e)}"
        print(error_msg)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': error_msg
            })
        }
```
Key Features of This Code:
- Automatically converts bucket names to lowercase (BDA requirement)
- URL decodes file names from S3 events
- Validates bucket name format
- Dynamically constructs BDA Profile ARN
- Polls for completion status with timeout
- Comprehensive error handling and logging
When you reach Step 2.3 in Part 2, simply copy this code into your Lambda function.
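If you'd like to exercise the handler locally before deploying, you can feed it a minimal, hand-built S3 event as in the sketch below. The bucket names, key, and project ARN are placeholders for illustration; set them to your own resources, save the code above as lambda_function.py, and note that running this will actually invoke BDA in your account.

```python
# Local smoke test for lambda_handler (illustrative; values are placeholders).
# BDA_PROJECT_ARN and OUTPUT_BUCKET must be set before importing the handler,
# and AWS credentials with Bedrock/S3 access must be available in your environment.
import os

os.environ.setdefault("BDA_PROJECT_ARN", "arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz")
os.environ.setdefault("OUTPUT_BUCKET", "bda-workshop-output-demo-xyz789")
os.environ.setdefault("AWS_REGION", "us-east-1")  # set automatically inside Lambda, not locally

from lambda_function import lambda_handler

sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "bda-workshop-input-demo-xyz789"},
                # Keys arrive URL-encoded in real S3 events; unquote_plus handles this.
                "object": {"key": "documents/document-test.pdf"},
            }
        }
    ]
}

print(lambda_handler(sample_event, None))
```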
Introduction to Amazon Bedrock Data Automation
Intelligent Document Processing with Generative AI
Generative AI not only drives innovation through ideation and content creation but also optimizes operational processes and increases productivity across various domains. Amazon Bedrock Data Automation (BDA) is a fully managed service that provides Intelligent Document Processing (IDP) capabilities enhanced by generative AI.
Why Choose BDA?
1. Complete Automation
Enterprises can extract significant value from IDP enhanced with generative AI. Integrating generative AI capabilities into an IDP solution enables:
- Advanced Document Understanding: Deep analysis of structure and semantics
- Structured Data Extraction: Automatic transformation from unstructured to structured data
- Automatic Classification: Document type recognition and appropriate routing
- Information Retrieval: Search and retrieve information from unstructured text
2. BDA vs Direct Bedrock API
| Aspect | Direct Bedrock API | BDA (Managed Service) |
|---|---|---|
| Complexity | Need to build your own pipeline | Fully managed, integrated |
| Document Understanding | Requires prompt engineering | Built-in document intelligence |
| Multimodal | Limited to text and images (model-dependent) | Supports PDF, images, audio, and video |
| Output Format | Custom JSON | Standardized JSON + Markdown/HTML/CSV |
| Bounding Boxes | Not available | Element positions provided automatically |
| Scalability | Self-managed | Auto-scaling |
| Best For | Custom IDP solutions | Enterprise document processing |
Business Value and Use Cases
Government & Public Sector
Process and extract data from:
- Birth certificate applications, ID/Passport applications
- Immigration and visa records
- Legal contracts and government forms
Benefits: Reduce processing time from days to minutes, improve citizen service quality.
Healthcare
Extract and organize information from:
- Electronic medical records
- Insurance claim requests
- Prescriptions and test results
- Clinical trial records
Benefits: Improve data accuracy, increase information accessibility for better patient care.
Finance & Banking
Automate processing:
- Loan and credit applications
- Financial reports and tax documents
- Contracts and agreements
- Compliance documentation
Benefits: Reduce manual work, increase operational efficiency, reduce compliance risks.
Logistics & Supply Chain
Process:
- Shipping documents and invoices
- Purchase orders and warehouse receipts
- Supplier contracts
- Customs certificates
Benefits: Optimize processes, increase supply chain visibility.
Retail & E-commerce
Automate:
- Customer orders
- Product catalogs and descriptions
- Invoices and receipts
- Marketing documents
Benefits: Personalize customer experience, process orders quickly and efficiently.
Why Combine BDA with Lambda?
Combining BDA with AWS Lambda creates a powerful serverless IDP pipeline:
- Event-Driven: Automatically processes when new documents arrive
- Scalable: Automatically scales with document volume
- Cost-Effective: Pay only when processing
- Low Maintenance: No server management needed
- Integration Ready: Easy to integrate with existing systems
Statistics and Impact
According to AWS, organizations deploying IDP with generative AI have achieved:
- 90%+ reduction in document processing time
- 60-70% cost savings in operations
- 95%+ accuracy in data extraction
- 80% reduction in manual intervention needs
AWS Services in This Guide
This guide uses the following AWS services to build a serverless IDP pipeline:
Amazon Bedrock Data Automation
Fully managed service providing intelligent document processing capabilities with generative AI. BDA automatically processes documents and extracts structured information without needing to orchestrate complex tasks.
Key Features:
- Document classification and extraction
- Multi-granularity analysis (document, page, element level)
- Generative summaries and descriptions
- Support for diverse formats: PDF, images, audio, video
AWS Lambda
Serverless computing service that allows running code without managing servers. Lambda automatically scales and you only pay for compute time used.
Role in solution:
- Triggered when new document uploaded
- Calls BDA API to process document
- Handles and saves results
Amazon S3 (Simple Storage Service)
Highly scalable, durable, and secure object storage service.
Role in solution:
- Store input documents
- Store output results from BDA
- Trigger S3 events for Lambda
Amazon CloudWatch
Monitoring and observability service for AWS resources and applications.
Role in solution:
- Collect logs from Lambda execution
- Monitor metrics (invocations, errors, duration)
- Troubleshooting and debugging
AWS IAM (Identity and Access Management)
Securely manage access and permissions for AWS resources.
Role in solution:
- Create execution role for Lambda
- Grant permissions to S3, Bedrock, CloudWatch
- Security and access control
Detailed Workflow
1. USER uploads document
│
▼
2. S3 Event Notification
│
▼
3. Lambda Function triggered
│
├─▶ Read document from S3
│
├─▶ Call BDA InvokeDataAutomationAsync API
│ │
│ ▼
│ BDA Processing:
│ ├─ Document ingestion
│ ├─ Structure analysis
│ ├─ Content extraction (text, tables, figures)
│ ├─ Semantic enrichment (AI summaries)
│ └─ Result formation (JSON, Markdown, CSV)
│
├─▶ Poll for completion status
│
└─▶ Save results to S3 output bucket
Difference from Traditional IDP
| Traditional IDP | BDA-Powered IDP |
|---|---|
| Rule-based extraction | AI-powered understanding |
| Template dependency | Template-free processing |
| Manual training needed | Pre-trained models |
| Limited format support | Multi-format support |
| No semantic understanding | Deep semantic analysis |
| Fixed output structure | Flexible, rich output |
Part 1: Environment Setup
Step 1.1: Enable Model Access in Amazon Bedrock
Purpose: Enable access to foundation models required for BDA.
Important: This is a mandatory step before using Amazon Bedrock Data Automation. Without enabling model access, BDA cannot process documents.
Models to Enable:
We need to enable the following models for this guide:
- Amazon Models: all models
- Claude Models (by Anthropic):
  - Claude 3.5 Haiku
  - Claude 3 Sonnet
  - Claude 3.5 Sonnet
- Cohere Models:
  - Cohere Rerank 3.5
Implementation Steps:
1. Open Amazon Bedrock Console
   - Search for Bedrock in the top search bar
   - Click on Amazon Bedrock
2. Access Model Access
   - In the left navigation menu, click on Model access
   - Click the Modify model access button (or Request model access if this is your first time)
3. Select Models
   In the models list, select the following:
   - Amazon Models: check all Amazon models
   - Claude Models (by Anthropic): check Claude 3.5 Haiku, Claude 3 Sonnet, and Claude 3.5 Sonnet
   - Cohere Models: check Cohere Rerank 3.5
4. Submit Request
   - Click the Next button at the bottom right
   - Review the selected models
   - Click Submit to request access
5. Wait for Approval
   - Most models are approved instantly (status: Access granted)
   - Some models may take a few minutes to provision
   - Refresh the page to see status updates
When all models show status "Access granted", you are ready to continue.
Step 1.2: Create S3 Buckets
Purpose: Create S3 buckets to store input documents and output results.
- Log in to AWS Management Console
- Search for and select S3 service
- Click Create bucket button
Create Input Bucket:
- Enter information:
  - Bucket name: bda-workshop-input-demo-xyz789 (replace xyz789 with your own random string to ensure the bucket name is unique)
  - Block Public Access settings: keep the default (Block all public access)
- Scroll down and click Create bucket

Create Output Bucket:
- Repeat steps 3-5 with:
  - Bucket name: bda-workshop-output-demo-xyz789 (use the same suffix as the input bucket)
  - Keep all other configurations at their defaults
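The console steps above are all you need, but if you prefer to script this part, a boto3 equivalent might look like the sketch below. The bucket names are placeholders, and buckets created outside us-east-1 need a LocationConstraint.

```python
# Optional: create the input and output buckets with boto3 instead of the console.
# Bucket names are placeholders; replace the suffix with your own unique string.
import boto3

REGION = "us-east-1"
s3 = boto3.client("s3", region_name=REGION)

for bucket in ["bda-workshop-input-demo-xyz789", "bda-workshop-output-demo-xyz789"]:
    kwargs = {"Bucket": bucket}
    # Outside us-east-1 you must specify a LocationConstraint.
    if REGION != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": REGION}
    s3.create_bucket(**kwargs)
    # Keep "Block all public access" on, matching the console default.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    print(f"Created {bucket}")
```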
Step 1.3: Create IAM Role for Lambda
Purpose: Create IAM role with sufficient permissions for Lambda to access S3 and Bedrock.
- In AWS Console, search for and select IAM
- In the left menu, select Roles, then click Create role
- Select Trusted Entity:
  - Trusted entity type: select AWS service
  - Use case: select Lambda
- Assign Permissions:
  - In the policy search box, find and check the following policies:
    - AWSLambdaBasicExecutionRole (for CloudWatch Logs)
    - AmazonS3FullAccess (to read/write S3)
    - AmazonBedrockFullAccess (to use Bedrock BDA)
  - Security Note: In production environments, create custom policies with only the minimum necessary permissions instead of using FullAccess policies (see the sketch after these steps).
- Name and Create Role:
  - Role name: BDA-Lambda-ExecutionRole
  - Description: Execution role for BDA document processing Lambda function
- Scroll down and click Create role
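As a starting point for the tighter policy mentioned in the security note, the sketch below attaches an inline policy scoped to the two workshop buckets and the BDA runtime calls the function makes. The bucket names are placeholders and the Bedrock action names are assumptions; verify them against the Service Authorization Reference before relying on this in production.

```python
# Optional: a tighter inline policy than the FullAccess managed policies used above.
# Bucket names and the Bedrock action names are illustrative -- verify the BDA
# actions against the Service Authorization Reference.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {  # CloudWatch Logs (what AWSLambdaBasicExecutionRole grants)
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",
        },
        {  # Read input documents
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::bda-workshop-input-demo-xyz789/*",
        },
        {  # Write and read processed output
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::bda-workshop-output-demo-xyz789",
                "arn:aws:s3:::bda-workshop-output-demo-xyz789/*",
            ],
        },
        {  # BDA runtime calls made by the Lambda function (action names assumed)
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeDataAutomationAsync",
                "bedrock:GetDataAutomationStatus",
            ],
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName="BDA-Lambda-ExecutionRole",
    PolicyName="BDA-Lambda-ScopedPolicy",
    PolicyDocument=json.dumps(policy),
)
```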
---
Step 1.4: Create BDA Project
Purpose: Create BDA project to configure document processing.
- In AWS Console, search for and select Amazon Bedrock
- In left menu, select Data Automation > Set-up Project
- Click Create project button
Configure Project:
- Enter information:
  - Project name: document-processing-project
  - Click Create project
- In the Standard output configuration section, configure:
  - Click Edit
  - Granularity types:
    - Check DOCUMENT
    - Check PAGE
    - Check ELEMENT
  - Bounding box: select ENABLED
  - Generative field: select ENABLED
  - Text format types:
    - Check MARKDOWN
  - Additional file format: select ENABLED
  - Wait for the status message "Changes saved successfully."
- Copy and save the Project ARN (it will be used in the Lambda configuration step):
  - Click on the project name to view its details
  - Copy the ARN from the project details page
  - Example ARN: arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz
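If you prefer to create the project programmatically, a sketch mirroring the console settings above might look like this. The request shape follows my understanding of the CreateDataAutomationProject API; verify the field names against the current API reference and your boto3 version.

```python
# Optional: create the BDA project with boto3 instead of the console.
# The request shape mirrors the console settings above; field names should be
# verified against the CreateDataAutomationProject API reference.
import boto3

bda = boto3.client("bedrock-data-automation", region_name="us-east-1")

response = bda.create_data_automation_project(
    projectName="document-processing-project",
    projectDescription="BDA project for the serverless document pipeline",
    projectStage="LIVE",
    standardOutputConfiguration={
        "document": {
            "extraction": {
                "granularity": {"types": ["DOCUMENT", "PAGE", "ELEMENT"]},
                "boundingBox": {"state": "ENABLED"},
            },
            "generativeField": {"state": "ENABLED"},
            "outputFormat": {
                "textFormat": {"types": ["MARKDOWN"]},
                "additionalFileFormat": {"state": "ENABLED"},
            },
        }
    },
)

print("Project ARN:", response["projectArn"])
```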
Part 2: Create Lambda Function
Step 2.1: Create Lambda Function
- In AWS Console, search for and select Lambda
- Click Create function and select Author from scratch
- Enter information:
  - Function name: BDA-Document-Processor
  - Runtime: select Python 3.12
  - Architecture: select x86_64
- Permissions:
  - In the Permissions section, select Use an existing role
  - Existing role: select BDA-Lambda-ExecutionRole (the role created in step 1.3)
- Click Create function
Step 2.2: Configure Lambda Function
Increase Timeout and Memory:
- In Lambda function page, select Configuration tab
- Select General configuration > Click Edit
- Configure:
  - Memory: 512 MB
  - Timeout: 5 min (5 minutes)
  - Description: Processes documents using Bedrock Data Automation
- Click Save
- Still in Configuration tab, select Environment variables
- Click Edit > Click Add environment variable
- Add the following variables:
| Key | Value | Description |
|---|---|---|
| BDA_PROJECT_ARN | The ARN saved in step 1.4 | ARN of the BDA project you created |
| OUTPUT_BUCKET | bda-workshop-output-demo-xyz789 | Output bucket name (replace with your bucket name) |
- Click Save
Step 2.3: Write Lambda Code
- Return to Code tab
- In editor, delete sample code and replace with the complete Lambda function code provided in the Lambda Function Code Reference section above (before Part 1).
- Click Deploy to save code
The code handles S3 events, validates bucket names, invokes BDA, and polls for completion status with comprehensive error handling.
Part 3: Configure S3 Trigger
Step 3.1: Add S3 Trigger for Lambda
Purpose: Configure Lambda to automatically run when new files are uploaded to S3.
- In the Lambda function page, select the Configuration tab
- Select Triggers in the left menu, then click Add trigger
- Configure the trigger:
  - Select a source: select S3
  - Bucket: select bda-workshop-input-demo-xyz789 (the input bucket created in step 1.2)
  - Event type: select All object create events
  - Prefix (optional): leave blank to process all files, or enter documents/ to process only files in the documents folder
  - Suffix (optional): enter .pdf to process only PDF files
- Check the checkbox I acknowledge that using the same S3 bucket...
- Click Add
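If you want to script this wiring instead, it takes two calls: grant S3 permission to invoke the function, then attach the notification configuration to the bucket. The account ID, names, and statement ID below are placeholders.

```python
# Optional: configure the S3 trigger with boto3 instead of the console.
# Account ID, bucket, and function names are placeholders.
import boto3

REGION = "us-east-1"
ACCOUNT_ID = "111111111111"
FUNCTION_NAME = "BDA-Document-Processor"
INPUT_BUCKET = "bda-workshop-input-demo-xyz789"

lambda_client = boto3.client("lambda", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# 1) Allow S3 to invoke the Lambda function.
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="s3-invoke-bda-processor",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{INPUT_BUCKET}",
    SourceAccount=ACCOUNT_ID,
)

# 2) Send "object created" events (filtered to .pdf here) to the function.
s3.put_bucket_notification_configuration(
    Bucket=INPUT_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{FUNCTION_NAME}",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".pdf"}]}},
            }
        ]
    },
)
```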
Part 4: Testing and Deployment
Step 4.1: Prepare Test Document
- Download a sample PDF file to your computer (or use an existing file)
Important - File Naming:
- File names should only contain: letters (a-z, A-Z), numbers (0-9), hyphens (-), underscores (_), dots (.)
- Should NOT use: spaces, special characters, or accented characters (e.g., Vietnamese diacritics)
- Good file name examples: document-test.pdf, report_2024.pdf
- File names to avoid: report document.pdf, document (1).pdf
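If you want to enforce these naming rules automatically instead of relying on convention, a small helper like the following (hypothetical, not part of the pipeline) can sanitize keys before upload:

```python
# Hypothetical helper: sanitize a file name to the safe character set described above.
import re

def sanitize_file_name(name: str) -> str:
    # Replace anything outside letters, digits, dot, hyphen, underscore with an underscore.
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)

print(sanitize_file_name("report document (1).pdf"))  # -> report_document__1_.pdf
```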
Step 4.2: Upload and Test
- Open S3 Console
- Go to bucket bda-workshop-input-demo-xyz789 (your input bucket)
- Click Upload
- Click Add files and select test PDF file
- Click Upload
Step 4.3: Check Lambda Execution
- Return to Lambda Console
- Select function BDA-Document-Processor
- Select the Monitor tab, then click View CloudWatch logs
- Click on the latest log stream (most recent timestamp)
- View the logs to verify that:
  - Lambda was triggered
  - The document is being processed
  - The BDA invocation succeeded
  - Processing completed
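Instead of clicking through log streams, you can also pull recent log lines for the function with boto3; the log group name follows Lambda's standard /aws/lambda/&lt;function-name&gt; convention.

```python
# Fetch recent log events for the function (log group name follows Lambda's convention).
import boto3

logs = boto3.client("logs", region_name="us-east-1")

events = logs.filter_log_events(
    logGroupName="/aws/lambda/BDA-Document-Processor",
    limit=50,
)
for event in events["events"]:
    print(event["message"].rstrip())
```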
Step 4.4: Check Output
- Open S3 Console
- Go to bucket bda-workshop-output-demo-xyz789 (your output bucket)
- Go to processed/ folder
- You will see folder with timestamp and file name
- Inside that folder will have:
  - job_metadata.json - job metadata
  - standard-output.json - main processing results
  - Other files such as Markdown or CSV (if any)
- Download standard-output.json and view the results:
  - Document summary
  - Extracted tables
  - Figures
  - Page-level information
  - Element details
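To inspect the results programmatically, you can list the processed/ prefix and pull the newest standard-output.json. The structure inside the JSON depends on your document and BDA configuration, so treat the field access below as illustrative.

```python
# Download and inspect the newest standard-output.json in the output bucket.
# The structure of the JSON depends on your BDA configuration; the key access
# below is illustrative only.
import json
import boto3

OUTPUT_BUCKET = "bda-workshop-output-demo-xyz789"
s3 = boto3.client("s3")

objects = s3.list_objects_v2(Bucket=OUTPUT_BUCKET, Prefix="processed/").get("Contents", [])
result_keys = [o for o in objects if o["Key"].endswith("standard-output.json")]
if not result_keys:
    raise SystemExit("No standard-output.json found yet; BDA may still be processing.")

latest = max(result_keys, key=lambda o: o["LastModified"])
body = s3.get_object(Bucket=OUTPUT_BUCKET, Key=latest["Key"])["Body"].read()
output = json.loads(body)

print(f"Inspecting {latest['Key']}")
print("Top-level keys:", list(output.keys()))
```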
Real-World Example: Government Birth Certificate Processing
Use Case from AWS Blog
A real-world example from AWS Machine Learning Blog illustrates how IDP with generative AI solves real problems:
The Problem
Government agency issuing birth certificates receives applications through multiple channels:
- Online applications
- Forms completed at physical locations
- Mailed paper applications
Current Manual Process:
- Scan paper applications
- Staff manually read applications and enter the data into the system
- Manual checks and validation
- Save to database
Issues:
- Very time-consuming
- High labor costs
- Prone to manual data entry errors
- Not scalable when volume increases
- Complex if forms are in multiple languages (English, Spanish, etc.)
Solution with BDA
With architecture similar to this guide, but with additions:
- SQS Queue: Buffer to process messages reliably
- DynamoDB: Store extracted data
- Multi-language Support: Automatically translate and extract
Upload Form → S3 → Lambda → BDA → SQS → Lambda → DynamoDB
↓
Auto-detect language
Extract all fields
Translate if needed
Results Achieved
Before:
- 15-20 minutes/application (manual)
- High labor costs
- 85-90% accuracy (human error)
After (with BDA):
- < 1 minute/application (automated)
- 60-70% cost savings
- 95%+ accuracy
- Process multiple languages
- Scale to thousands of applications/day
Fields Extracted Automatically
BDA can extract complex information:
- Applicant information (name, address, contact)
- Birth certificate recipient information
- Parent information
- Fee payment information
- Signatures and dates
- Bonus: Automatically translate from Spanish to English
Extending the Solution
You can extend this solution in the following directions:
- Add SQS Queue: Buffer processing and retry logic
- Add DynamoDB: Store structured data
- Custom Extraction: Define fields to extract for specific domain
- Multi-language: Process documents in multiple languages
- Human-in-the-loop: Validation for critical data
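As a concrete illustration of the DynamoDB extension, a second Lambda function (triggered by the output bucket or by an SQS message) could persist a summary record like the sketch below. The table name, key schema, and the fields pulled from the BDA output are assumptions for illustration.

```python
# Sketch of a downstream Lambda that stores a summary record in DynamoDB.
# Table name, key schema, and the fields read from the BDA output are assumptions.
import json
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("bda-extracted-documents")

def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])
    if not key.endswith("standard-output.json"):
        return {"statusCode": 200, "body": "skipped"}

    output = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    table.put_item(Item={
        "document_id": key,                       # partition key (assumed schema)
        "source": f"s3://{bucket}/{key}",
        "summary": json.dumps(output.get("document", {}))[:4000],  # truncated payload
    })
    return {"statusCode": 200, "body": "stored"}
```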
Cleanup Resources (Optional)
If you do not want to continue using and avoid incurring costs:
Delete Lambda Function:
- Lambda Console > Select function > Actions > Delete
Delete S3 Buckets:
- S3 Console > Select bucket > Empty (delete all objects)
- Then select bucket > Delete
Delete IAM Role:
- IAM Console > Roles > Select role > Delete
Delete BDA Project:
- Bedrock Console > Data Automation > Projects > Select project > Delete
Reference Documentation
AWS Documentation:
- AWS Lambda Developer Guide - Complete Lambda guide
- Amazon Bedrock Data Automation - Official BDA documentation
- Amazon Bedrock User Guide - Bedrock overview
- S3 Event Notifications - S3 trigger configuration
- Standard Output in BDA - Output configuration details
API References:
- BDA CreateDataAutomationProject API
- BDA InvokeDataAutomationAsync API
- BDA GetDataAutomationStatus API
- Lambda Invoke API
AWS Blogs and Case Studies:
BDA-Specific:
- Simplify multimodal generative AI with Amazon Bedrock Data Automation - BDA introduction
- Unleashing the multimodal power of Amazon Bedrock Data Automation - Advanced BDA use cases
- Get insights from multimodal content with Amazon Bedrock Data Automation - GA announcement
IDP with Generative AI:
- Intelligent document processing using Amazon Bedrock and Anthropic Claude - Real-world IDP example (direct Bedrock API approach)
- Scalable intelligent document processing using Amazon Bedrock - Scalability patterns
- Building serverless document processing workflows - Serverless architecture patterns
Advanced Topics:
- Building a multimodal RAG application with BDA - RAG with BDA
- New Amazon Bedrock capabilities enhance data processing - Latest features
Solution Guidance:
- Guidance for Multimodal Data Processing Using Amazon Bedrock Data Automation - AWS Solutions Library
Product Pages:
- Amazon Bedrock Data Automation Product Page - Features and pricing
- Amazon Bedrock - Main Bedrock page
- AWS Lambda - Lambda features
Video Resources:
- AWS re:Invent Sessions on Bedrock - Conference talks
- AWS Online Tech Talks - Technical webinars
Conclusion
Congratulations! You have successfully built a serverless document processing pipeline with AWS Lambda and Amazon Bedrock Data Automation!
Summary of What You Learned
In this article, we covered:
- Understanding BDA and IDP: Learned about Intelligent Document Processing with generative AI and the benefits of Amazon Bedrock Data Automation
- Building Infrastructure: Created S3 buckets, IAM roles, and a BDA project through AWS Console
- Developing Lambda Function: Wrote Python code to integrate with the BDA API
- Event-Driven Architecture: Set up S3 event triggers for automated processing
- Testing and Monitoring: Deployed, tested, and monitored the solution with CloudWatch
- Troubleshooting: Covered common issues and best practices
Next Steps
Now that you have a solid foundation, continue exploring:
- Expand to multimodal: Try processing images, audio, and video with BDA
- Integrate downstream systems: Connect with DynamoDB, API Gateway, or business applications
- Advanced patterns: Implement human-in-the-loop workflows, model evaluation
- Production readiness: Error handling, DLQ, cost optimization, security hardening
- Explore RAG: Build multimodal RAG applications with BDA and Knowledge Bases
Acknowledgments
Thank you for reading! We hope this guide helps you start your journey with Amazon Bedrock Data Automation, and we wish you success in building intelligent document processing solutions.