Lam Bùi
Build Intelligent Document Processing with AWS Lambda and Bedrock Data Automation

Introduction

This guide provides detailed instructions on building a serverless document processing pipeline using AWS Lambda and Amazon Bedrock Data Automation (BDA). We will create an automated document processing system where Lambda functions are triggered when documents are uploaded to S3, using BDA to extract structured information from documents.

What You Will Learn

After completing this guide, you will be able to:

  • Create and configure Lambda functions through AWS Console
  • Integrate Lambda with Amazon Bedrock Data Automation
  • Set up S3 Event triggers for Lambda
  • Automatically process documents and save results
  • Monitor and troubleshoot Lambda functions

Solution Architecture

The pipeline is event-driven: a document uploaded to the input S3 bucket triggers a Lambda function, which invokes Bedrock Data Automation to extract structured information and writes the results to an output S3 bucket.

Knowledge Requirements

  • Basic understanding of AWS Console
  • Basic Python knowledge
  • Understanding of S3 and Lambda (basic level)

Article Structure

This article is divided into 5 main parts:

| Part | Content |
| --- | --- |
| Part 1 | Environment Setup (Enable Models, S3, IAM, BDA Project) |
| Part 2 | Create and Configure Lambda Function |
| Part 3 | Set up S3 Trigger |
| Part 4 | Testing and Deployment |
| Part 5 | Real-world Examples and Best Practices |

Prerequisites

Before starting, ensure you have:

  • AWS Account with IAM permissions for Bedrock, Lambda, S3, IAM
  • Access to AWS Console
  • Selected a region supporting Amazon Bedrock Data Automation (recommended: us-east-1 or us-west-2)
  • Modern browser (Chrome, Firefox, Safari, or Edge)

Note about Region: Amazon Bedrock Data Automation is not available in all regions. Check Bedrock endpoints and quotas to see supported regions.


Lambda Function Code Reference

Before setting up the environment, here is the complete Lambda function code that you will use later in Part 2. Having this code available upfront allows you to understand what we're building toward.

File: lambda_function.py

import json
import boto3
import os
import time
from datetime import datetime
from urllib.parse import unquote_plus

# Initialize AWS clients
s3_client = boto3.client('s3')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
sts_client = boto3.client('sts')

# Get configuration from environment variables
BDA_PROJECT_ARN = os.environ['BDA_PROJECT_ARN']
OUTPUT_BUCKET = os.environ['OUTPUT_BUCKET']

# Get region and account ID dynamically (Lambda provides these automatically)
AWS_REGION = os.environ['AWS_REGION']  # Lambda provides this automatically
account_id = sts_client.get_caller_identity()['Account']

# Construct BDA Profile ARN dynamically
BDA_PROFILE_ARN = f"arn:aws:bedrock:{AWS_REGION}:{account_id}:data-automation-profile/us.data-automation-v1"

def lambda_handler(event, context):
    """
    Lambda handler function to process documents with BDA.
    Triggered by S3 upload events.
    """

    try:
        # Extract S3 bucket and key from event
        s3_event = event['Records'][0]['s3']
        input_bucket = s3_event['bucket']['name'].lower()  # Ensure lowercase
        # URL decode the key (S3 event has URL-encoded keys)
        document_key = unquote_plus(s3_event['object']['key'])

        print(f"Processing document: s3://{input_bucket}/{document_key}")

        # Validate bucket name format (BDA requirement)
        if not input_bucket.replace('-', '').replace('.', '').isalnum():
            raise ValueError(f"Invalid bucket name format: {input_bucket}. Must contain only lowercase letters, numbers, dots, and hyphens.")

        # Construct S3 URIs
        input_s3_uri = f"s3://{input_bucket}/{document_key}"

        # Generate output path with timestamp
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        file_name = document_key.split('/')[-1]
        output_prefix = f"processed/{timestamp}_{file_name}"
        output_s3_uri = f"s3://{OUTPUT_BUCKET}/{output_prefix}"

        # Invoke BDA to process the document
        print("Invoking Bedrock Data Automation...")
        response = bda_runtime_client.invoke_data_automation_async(
            inputConfiguration={
                's3Uri': input_s3_uri
            },
            outputConfiguration={
                's3Uri': output_s3_uri
            },
            dataAutomationConfiguration={
                'dataAutomationProjectArn': BDA_PROJECT_ARN,
                'stage': 'LIVE'
            },
            dataAutomationProfileArn=BDA_PROFILE_ARN
        )

        invocation_arn = response['invocationArn']
        print(f"BDA Invocation ARN: {invocation_arn}")

        # Wait for processing to complete (with timeout)
        max_wait_time = 240  # 4 minutes (leave 1 min for cleanup)
        wait_interval = 10
        elapsed_time = 0

        while elapsed_time < max_wait_time:
            status_response = bda_runtime_client.get_data_automation_status(
                invocationArn=invocation_arn
            )

            status = status_response['status']
            print(f"Current status: {status}")

            if status == 'Success':
                result_s3_uri = status_response['outputConfiguration']['s3Uri']
                print(f"Processing completed successfully!")
                print(f"Results saved to: {result_s3_uri}")

                return {
                    'statusCode': 200,
                    'body': json.dumps({
                        'message': 'Document processed successfully',
                        'input': input_s3_uri,
                        'output': result_s3_uri,
                        'invocationArn': invocation_arn
                    })
                }

            elif status in ['ClientError', 'ServiceError']:
                error_msg = f"Processing failed with status: {status}"
                print(error_msg)
                return {
                    'statusCode': 500,
                    'body': json.dumps({
                        'error': error_msg,
                        'invocationArn': invocation_arn
                    })
                }

            # Wait before checking again
            time.sleep(wait_interval)
            elapsed_time += wait_interval

        # Timeout reached
        print(f"Processing still in progress after {max_wait_time} seconds")
        return {
            'statusCode': 202,
            'body': json.dumps({
                'message': 'Processing initiated but not completed within Lambda timeout',
                'invocationArn': invocation_arn,
                'note': 'Check BDA console for final status'
            })
        }

    except Exception as e:
        error_msg = f"Error processing document: {str(e)}"
        print(error_msg)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': error_msg
            })
        }


Key Features of This Code:

  • Automatically converts bucket names to lowercase (BDA requirement)
  • URL decodes file names from S3 events
  • Validates bucket name format
  • Dynamically constructs BDA Profile ARN
  • Polls for completion status with timeout
  • Comprehensive error handling and logging

When you reach Step 2.3 in Part 2, simply copy this code into your Lambda function.
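
If you want to sanity-check the handler before wiring up the S3 trigger, a minimal local test like the sketch below simulates the event Lambda will receive. The bucket names, key, and ARN are the placeholders used throughout this guide; note that lambda_function.py reads its environment variables and calls STS at import time, so valid AWS credentials are required.

# Local smoke test for lambda_handler (placeholder values; requires AWS credentials).
import os

# lambda_function.py reads these at import time, so set them first.
os.environ.setdefault('BDA_PROJECT_ARN', 'arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz')
os.environ.setdefault('OUTPUT_BUCKET', 'bda-workshop-output-demo-xyz789')
os.environ.setdefault('AWS_REGION', 'us-east-1')

from lambda_function import lambda_handler

# Shape mirrors the S3 "ObjectCreated" event the trigger will deliver.
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'bda-workshop-input-demo-xyz789'},
            'object': {'key': 'documents/document-test.pdf'}
        }
    }]
}

print(lambda_handler(test_event, context=None))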

Introduction to Amazon Bedrock Data Automation

Intelligent Document Processing with Generative AI

Generative AI not only drives innovation through ideation and content creation but also optimizes operational processes and increases productivity across various domains. Amazon Bedrock Data Automation (BDA) is a fully managed service that provides Intelligent Document Processing (IDP) capabilities enhanced by generative AI.

Why Choose BDA?

1. Complete Automation

Enterprises can extract significant value from IDP enhanced with generative AI. By integrating generative AI capabilities into IDP solutions, organizations can:

  • Advanced Document Understanding: Deep analysis of structure and semantics
  • Structured Data Extraction: Automatic transformation from unstructured to structured data
  • Automatic Classification: Document type recognition and appropriate routing
  • Information Retrieval: Search and retrieve information from unstructured text

2. BDA vs Direct Bedrock API

| Aspect | Direct Bedrock API | BDA (Managed Service) |
| --- | --- | --- |
| Complexity | You build the pipeline yourself | Fully managed, integrated |
| Document Understanding | Requires prompt engineering | Built-in document intelligence |
| Multimodal | Text and images only (model-dependent) | PDF, images, audio, video |
| Output Format | Custom JSON | Standardized JSON + Markdown/HTML/CSV |
| Bounding Boxes | Not available | Element positions provided automatically |
| Scalability | Self-managed | Auto-scaling |
| Best For | Custom IDP solutions | Enterprise document processing |

Business Value and Use Cases

Government & Public Sector

Process and extract data from:

  • Birth certificate applications, ID/Passport applications
  • Immigration and visa records
  • Legal contracts and government forms

Benefits: Reduce processing time from days to minutes, improve citizen service quality.

Healthcare

Extract and organize information from:

  • Electronic medical records
  • Insurance claim requests
  • Prescriptions and test results
  • Clinical trial records

Benefits: Improve data accuracy, increase information accessibility for better patient care.

Finance & Banking

Automate processing:

  • Loan and credit applications
  • Financial reports and tax documents
  • Contracts and agreements
  • Compliance documentation

Benefits: Reduce manual work, increase operational efficiency, reduce compliance risks.

Logistics & Supply Chain

Process:

  • Shipping documents and invoices
  • Purchase orders and warehouse receipts
  • Supplier contracts
  • Customs certificates

Benefits: Optimize processes, increase supply chain visibility.

Retail & E-commerce

Automate:

  • Customer orders
  • Product catalogs and descriptions
  • Invoices and receipts
  • Marketing documents

Benefits: Personalize customer experience, process orders quickly and efficiently.

Why Combine BDA with Lambda?

Combining BDA with AWS Lambda creates a powerful serverless IDP pipeline:

  1. Event-Driven: Automatically processes when new documents arrive
  2. Scalable: Automatically scales with document volume
  3. Cost-Effective: Pay only when processing
  4. Low Maintenance: No server management needed
  5. Integration Ready: Easy to integrate with existing systems

Statistics and Impact

According to AWS, organizations deploying IDP with generative AI have achieved:

  • 90%+ reduction in document processing time
  • 60-70% cost savings in operations
  • 95%+ accuracy in data extraction
  • 80% reduction in manual intervention needs

AWS Services in This Guide

This guide uses the following AWS services to build a serverless IDP pipeline:

Amazon Bedrock Data Automation

Fully managed service providing intelligent document processing capabilities with generative AI. BDA automatically processes documents and extracts structured information without needing to orchestrate complex tasks.

Key Features:

  • Document classification and extraction
  • Multi-granularity analysis (document, page, element level)
  • Generative summaries and descriptions
  • Support for diverse formats: PDF, images, audio, video

AWS Lambda

Serverless computing service that allows running code without managing servers. Lambda automatically scales and you only pay for compute time used.

Role in solution:

  • Triggered when new document uploaded
  • Calls BDA API to process document
  • Handles and saves results

Amazon S3 (Simple Storage Service)

Highly scalable, durable, and secure object storage service.

Role in solution:

  • Store input documents
  • Store output results from BDA
  • Trigger S3 events for Lambda

Amazon CloudWatch

Monitoring and observability service for AWS resources and applications.

Role in solution:

  • Collect logs from Lambda execution
  • Monitor metrics (invocations, errors, duration)
  • Troubleshooting and debugging

AWS IAM (Identity and Access Management)

Securely manage access and permissions for AWS resources.

Role in solution:

  • Create execution role for Lambda
  • Grant permissions to S3, Bedrock, CloudWatch
  • Security and access control

Detailed Workflow

1. USER uploads document
   │
   ▼
2. S3 Event Notification
   │
   ▼
3. Lambda Function triggered
   │
   ├─▶ Read document from S3
   │
   ├─▶ Call BDA InvokeDataAutomationAsync API
   │   │
   │   ▼
   │   BDA Processing:
   │   ├─ Document ingestion
   │   ├─ Structure analysis
   │   ├─ Content extraction (text, tables, figures)
   │   ├─ Semantic enrichment (AI summaries)
   │   └─ Result formation (JSON, Markdown, CSV)
   │
   ├─▶ Poll for completion status
   │
   └─▶ Save results to S3 output bucket

Difference from Traditional IDP

| Traditional IDP | BDA-Powered IDP |
| --- | --- |
| Rule-based extraction | AI-powered understanding |
| Template dependency | Template-free processing |
| Manual training needed | Pre-trained models |
| Limited format support | Multi-format support |
| No semantic understanding | Deep semantic analysis |
| Fixed output structure | Flexible, rich output |

Part 1: Environment Setup

Step 1.1: Enable Model Access in Amazon Bedrock

Purpose: Enable access to foundation models required for BDA.

Important: This is a mandatory step before using Amazon Bedrock Data Automation. Without enabling model access, BDA cannot process documents.

Models to Enable:

We need to enable the following models for this guide:

  1. Amazon Models: All models
  2. Claude Models:
    • Claude 3.5 Haiku
    • Claude 3 Sonnet
    • Claude 3.5 Sonnet
  3. Cohere Models:
    • Cohere Rerank 3.5

Implementation Steps:

  1. Open Amazon Bedrock Console
    • Search for Bedrock in the top search bar
    • Click on Amazon Bedrock
  2. Access Model Access

    • In the left menu (or navigation menu), click on Model access
    • Click Modify model access button (or Request model access if first time)
  3. Select Models
    In the models list, select the following:

    Amazon Models:

    • Check all Amazon models

    Claude Models (by Anthropic):

    • Check Claude 3.5 Haiku
    • Check Claude 3 Sonnet
    • Check Claude 3.5 Sonnet

    Cohere Models:

    • Check Cohere Rerank 3.5
  4. Submit Request

    • Click Next button at bottom right
    • Review selected models
    • Click Submit to request access
  5. Wait for Approval

    • Most models will be approved instantly (status: Access granted)
    • Some models may take a few minutes for provisioning
    • Refresh page to see status updates

When all models show status "Access granted", you are ready to continue.
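
Optional: you can also check from code which foundation models are visible in your region. A small boto3 sketch follows; listing models shows regional availability, while the Model access page remains the source of truth for grant status.

import boto3

# List Anthropic models available in the region. Appearing in this list does
# not by itself confirm that access has been granted to your account.
bedrock = boto3.client('bedrock', region_name='us-east-1')
for model in bedrock.list_foundation_models(byProvider='Anthropic')['modelSummaries']:
    print(model['modelId'], '-', model['modelName'])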


Step 1.2: Create S3 Buckets

Purpose: Create S3 buckets to store input documents and output results.

  1. Log in to AWS Management Console
  2. Search for and select S3 service
  3. Click Create bucket button

Create Input Bucket:

  1. Enter information:
    • Bucket name: bda-workshop-input-demo-xyz789 (replace xyz789 with your own random string to ensure the bucket name is unique)
    • Block Public Access settings: Keep default (Block all public access)
  2. Scroll down and click Create bucket

Create Output Bucket:

  1. Repeat the steps above with:
    • Bucket name: bda-workshop-output-demo-xyz789 (use the same suffix as the input bucket)
    • Keep all other configurations at their defaults
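
If you prefer to script this step, a minimal boto3 sketch is shown below. The bucket names are the placeholders from this guide, and us-east-1 is assumed; any other region requires a CreateBucketConfiguration with a LocationConstraint.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Placeholder names from this guide; replace xyz789 with your own suffix.
for bucket in ['bda-workshop-input-demo-xyz789', 'bda-workshop-output-demo-xyz789']:
    # In us-east-1, create_bucket takes no CreateBucketConfiguration; in other
    # regions pass CreateBucketConfiguration={'LocationConstraint': '<region>'}.
    s3.create_bucket(Bucket=bucket)
    print(f"Created s3://{bucket}")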

Step 1.3: Create IAM Role for Lambda

Purpose: Create IAM role with sufficient permissions for Lambda to access S3 and Bedrock.

  1. In AWS Console, search for and select IAM
  2. In left menu, select Roles
  3. Click Create role button

    Select Trusted Entity:

  4. Trusted entity type: Select AWS service

  5. Use case: Select Lambda

  6. Click Next

    Assign Permissions:

  7. In the policy search box, find and select the following policies (check the checkbox):

    • AWSLambdaBasicExecutionRole (for CloudWatch Logs)
    • AmazonS3FullAccess (to read/write S3)
    • AmazonBedrockFullAccess (to use Bedrock BDA)

    Security Note: In production environments, create custom policies with the minimum necessary permissions instead of using FullAccess policies.
  8. Click Next

Name and Create Role:

  1. Role name: BDA-Lambda-ExecutionRole
  2. Description: Execution role for BDA document processing Lambda function
  3. Scroll down and click Create role
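
The console steps above can also be scripted. A minimal boto3 sketch, using the role name and managed policies from this guide:

import json
import boto3

iam = boto3.client('iam')

# Lambda must be allowed to assume this role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}

role = iam.create_role(
    RoleName='BDA-Lambda-ExecutionRole',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='Execution role for BDA document processing Lambda function'
)

# Same managed policies as the console steps; in production, prefer a
# least-privilege custom policy as noted above.
for policy_arn in [
    'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole',
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AmazonBedrockFullAccess',
]:
    iam.attach_role_policy(RoleName='BDA-Lambda-ExecutionRole', PolicyArn=policy_arn)

print(role['Role']['Arn'])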

Step 1.4: Create BDA Project

Purpose: Create BDA project to configure document processing.

  1. In AWS Console, search for and select Amazon Bedrock
  2. In left menu, select Data Automation > Set-up Project
  3. Click Create project button

Configure Project:

  1. Enter information:

    • Project name: document-processing-project
    • Click Create project
  2. In Standard output configuration section, configure:

    • Click Edit
    • Granularity types:
      • Check DOCUMENT
      • Check PAGE
      • Check ELEMENT
    • Bounding box: Select ENABLED
    • Generative field: Select ENABLED
    • Text format types:
      • Check MARKDOWN
    • Additional file format: Select ENABLED
  3. Click Save changes

  4. Wait for the confirmation message Changes saved successfully.

  5. Copy and save Project ARN (will be used in Lambda configuration step)

    • Click on project name to view details
    • Copy ARN from project details page
    • Example ARN: arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz
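
For reference, the project can also be created programmatically with the bedrock-data-automation client. The sketch below mirrors the console settings above; the exact shape of standardOutputConfiguration can vary between SDK versions, so treat this as an assumption to verify against your boto3 documentation, with the console flow as the authoritative path.

import boto3

bda = boto3.client('bedrock-data-automation', region_name='us-east-1')

# Mirrors the console choices: all granularities, bounding boxes, generative
# fields, Markdown text output, and additional file formats enabled.
response = bda.create_data_automation_project(
    projectName='document-processing-project',
    standardOutputConfiguration={
        'document': {
            'extraction': {
                'granularity': {'types': ['DOCUMENT', 'PAGE', 'ELEMENT']},
                'boundingBox': {'state': 'ENABLED'}
            },
            'generativeField': {'state': 'ENABLED'},
            'outputFormat': {
                'textFormat': {'types': ['MARKDOWN']},
                'additionalFileFormat': {'state': 'ENABLED'}
            }
        }
    }
)
print(response['projectArn'])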

Part 2: Create Lambda Function

Step 2.1: Create Lambda Function

  1. In AWS Console, search for and select Lambda
  2. Click Create function button

  3. Select Author from scratch

  4. Enter information:

    • Function name: BDA-Document-Processor
    • Runtime: Select Python 3.12
    • Architecture: Select x86_64

    Permissions:
  5. In Permissions section, select Use an existing role

  6. Existing role: Select BDA-Lambda-ExecutionRole (role created in step 1.3)

  7. Click Create function

Step 2.2: Configure Lambda Function

Increase Timeout and Memory:

  1. In Lambda function page, select Configuration tab
  2. Select General configuration > Click Edit
  3. Configure:
    • Memory: 512 MB
    • Timeout: 5 min (5 minutes)
    • Description: Processes documents using Bedrock Data Automation
  4. Click Save
  5. Still in Configuration tab, select Environment variables
  6. Click Edit > Click Add environment variable
  7. Add the following variables:

| Key | Value | Description |
| --- | --- | --- |
| BDA_PROJECT_ARN | Paste the ARN saved in step 1.4 | ARN of the created BDA project |
| OUTPUT_BUCKET | bda-workshop-output-demo-xyz789 | Output bucket name (replace with yours) |

  8. Click Save
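
If you script your deployments, the same settings can be applied with boto3. A minimal sketch, assuming the function name from Step 2.1 and the placeholder ARN from Step 1.4:

import boto3

lambda_client = boto3.client('lambda')

# Apply the memory, timeout, and environment settings from Step 2.2.
# (AWS_REGION is reserved and set by Lambda itself, so it is not included.)
lambda_client.update_function_configuration(
    FunctionName='BDA-Document-Processor',
    MemorySize=512,
    Timeout=300,  # seconds (5 minutes)
    Environment={
        'Variables': {
            'BDA_PROJECT_ARN': 'arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz',
            'OUTPUT_BUCKET': 'bda-workshop-output-demo-xyz789'
        }
    }
)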

Step 2.3: Write Lambda Code

  1. Return to Code tab
  2. In editor, delete sample code and replace with the complete Lambda function code provided in the Lambda Function Code Reference section above (before Part 1).
  3. Click Deploy to save code

The code handles S3 events, validates bucket names, invokes BDA, and polls for completion status with comprehensive error handling.

Part 3: Configure S3 Trigger

Step 3.1: Add S3 Trigger for Lambda

Purpose: Configure Lambda to automatically run when new files are uploaded to S3.

  1. In Lambda function page, select Configuration tab
  2. Select Triggers in left menu
  3. Click Add trigger

  4. Select a source: Select S3

  5. Bucket: Select bda-workshop-input-demo-xyz789 (input bucket created in step 1.2)

  6. Event type: Select All object create events

  7. Prefix (optional): Leave blank (process all files) or enter documents/ to process only files in the documents folder

  8. Suffix (optional): Enter .pdf to process only PDF files

  9. Check the checkbox I acknowledge that using the same S3 bucket...

  10. Click Add
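
Under the hood, the Add trigger button does two things: it grants S3 permission to invoke the function and writes a notification configuration on the bucket. A boto3 sketch of the same wiring, using the names from this guide:

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

bucket = 'bda-workshop-input-demo-xyz789'
account_id = boto3.client('sts').get_caller_identity()['Account']
function_arn = lambda_client.get_function(
    FunctionName='BDA-Document-Processor'
)['Configuration']['FunctionArn']

# 1) Allow S3 (this bucket in this account only) to invoke the function.
lambda_client.add_permission(
    FunctionName='BDA-Document-Processor',
    StatementId='s3-invoke-bda-processor',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{bucket}',
    SourceAccount=account_id
)

# 2) Route object-create events (filtered to .pdf here) to the function.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.pdf'}]}}
        }]
    }
)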


Part 4: Testing and Deployment

Step 4.1: Prepare Test Document

  1. Download a PDF sample file to your computer (or use existing file)

Important - File Naming:

  • File names should only contain: letters (a-z, A-Z), numbers (0-9), hyphens (-), underscores (_), dots (.)
  • Should NOT use: spaces, special characters, or non-ASCII characters (e.g., Vietnamese diacritics)
  • Good file name examples: document-test.pdf, report_2024.pdf
  • File names to avoid: report document.pdf, document (1).pdf
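
A quick way to apply these rules before uploading is a small regular-expression check, sketched here:

import re

# Allowed: letters, digits, hyphens, underscores, dots (the rule above).
SAFE_NAME = re.compile(r'^[A-Za-z0-9._-]+$')

for name in ['document-test.pdf', 'report_2024.pdf', 'report document.pdf', 'document (1).pdf']:
    print(name, '->', 'OK' if SAFE_NAME.match(name) else 'rename before uploading')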

Step 4.2: Upload and Test

  1. Open S3 Console
  2. Go to bucket bda-workshop-input-demo-xyz789 (your input bucket)
  3. Click Upload
  4. Click Add files and select test PDF file
  5. Click Upload
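
The upload can also be done from Python; a boto3 sketch, assuming a local file named document-test.pdf:

import boto3

s3 = boto3.client('s3')

# Uploading under documents/ keeps an optional prefix filter (Step 3.1) matching.
s3.upload_file(
    Filename='document-test.pdf',
    Bucket='bda-workshop-input-demo-xyz789',
    Key='documents/document-test.pdf'
)
print('Uploaded; the S3 trigger should invoke the Lambda function within seconds.')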

Step 4.3: Check Lambda Execution

  1. Return to Lambda Console
  2. Select function BDA-Document-Processor
  3. Select Monitor tab
  4. Click View CloudWatch logs

  5. Click on latest log stream (most recent time)

  6. View logs to verify:

    • Lambda was triggered
    • Document being processed
    • BDA invocation successful
    • Processing completed

Step 4.4: Check Output

  1. Open S3 Console
  2. Go to bucket bda-workshop-output-demo-xyz789 (your output bucket)
  3. Go to processed/ folder
  4. You will see folder with timestamp and file name
  5. Inside that folder will have:
    • job_metadata.json - job metadata
    • standard-output.json - main processing results
    • Other files like markdown, CSV (if any)
  6. Download standard-output.json and view results:
    • Document summary
    • Extracted tables
    • Figures
    • Page-level information
    • Element details
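
To inspect results programmatically, a sketch like the one below downloads and summarizes a result file. The top-level keys ('document', 'pages', 'elements') are assumptions about the standard output layout, which varies with your output configuration, so adapt them after inspecting a real file.

import json
import boto3

s3 = boto3.client('s3')

# Replace with the actual key shown in your output bucket.
bucket = 'bda-workshop-output-demo-xyz789'
key = 'processed/20240101_120000_document-test.pdf/standard-output.json'

result = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

# Assumed sections; .get() keeps this tolerant of layout differences.
print('Pages:', len(result.get('pages', [])))
print('Elements:', len(result.get('elements', [])))
print('Document summary:', result.get('document', {}).get('summary', 'n/a'))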

Real-World Example: Government Birth Certificate Processing

Use Case from AWS Blog

A real-world example from AWS Machine Learning Blog illustrates how IDP with generative AI solves real problems:

The Problem

Government agency issuing birth certificates receives applications through multiple channels:

  • Online applications
  • Forms completed at physical locations
  • Mailed paper applications

Current Manual Process:

  1. Scan paper applications
  2. Staff manually read and enter into system
  3. Checking and validation
  4. Save to database

Issues:

  • Very time-consuming
  • High labor costs
  • Prone to manual data entry errors
  • Not scalable when volume increases
  • Complex if forms are in multiple languages (English, Spanish, etc.)

Solution with BDA

With architecture similar to this guide, but with additions:

  • SQS Queue: Buffer to process messages reliably
  • DynamoDB: Store extracted data
  • Multi-language Support: Automatically translate and extract


Upload Form → S3 → Lambda → BDA → SQS → Lambda → DynamoDB

  • Auto-detect language
  • Extract all fields
  • Translate if needed

Results Achieved

Before:

  • 15-20 minutes/application (manual)
  • High labor costs
  • 85-90% accuracy (human error)

After (with BDA):

  • < 1 minute/application (automated)
  • 60-70% cost savings
  • 95%+ accuracy
  • Process multiple languages
  • Scale to thousands of applications/day

Fields Extracted Automatically

BDA can extract complex information:

  • Applicant information (name, address, contact)
  • Birth certificate recipient information
  • Parent information
  • Fee payment information
  • Signatures and dates
  • Bonus: Automatically translate from Spanish to English

Extending the Solution

You can extend this solution in the following directions:

  1. Add SQS Queue: Buffer processing and retry logic
  2. Add DynamoDB: Store structured data (see the sketch after this list)
  3. Custom Extraction: Define fields to extract for specific domain
  4. Multi-language: Process documents in multiple languages
  5. Human-in-the-loop: Validation for critical data
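
A minimal sketch of directions 1 and 2, assuming a hypothetical queue named bda-results-queue and a hypothetical DynamoDB table named ExtractedDocuments: the processing Lambda publishes the result location to SQS, and a second Lambda consumes the queue and persists the data.

import json
import boto3

sqs = boto3.client('sqs')
dynamodb = boto3.resource('dynamodb')

# Producer side (called at the end of the processing Lambda): enqueue the
# result location so downstream processing is buffered and retryable.
def publish_result(queue_url: str, invocation_arn: str, output_s3_uri: str) -> None:
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({'invocationArn': invocation_arn, 'outputUri': output_s3_uri})
    )

# Consumer side (a second Lambda triggered by SQS): persist extracted data.
def consumer_handler(event, context):
    table = dynamodb.Table('ExtractedDocuments')  # hypothetical table name
    for record in event['Records']:
        message = json.loads(record['body'])
        table.put_item(Item={
            'documentId': message['invocationArn'],
            'outputUri': message['outputUri']
        })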

Cleanup Resources (Optional)

If you do not want to continue using and avoid incurring costs:

Delete Lambda Function:

  1. Lambda Console > Select function > Actions > Delete

Delete S3 Buckets:

  1. S3 Console > Select bucket > Empty (delete all objects)
  2. Then select bucket > Delete

Delete IAM Role:

  1. IAM Console > Roles > Select role > Delete

Delete BDA Project:

  1. Bedrock Console > Data Automation > Projects > Select project > Delete


Conclusion

Congratulations! You have successfully built a serverless document processing pipeline with AWS Lambda and Amazon Bedrock Data Automation!

Summary of What You Learned

In this article, we covered:

  • Understanding BDA and IDP: Learned about Intelligent Document Processing with generative AI and the benefits of Amazon Bedrock Data Automation
  • Building Infrastructure: Created S3 buckets, IAM roles, and a BDA project through the AWS Console
  • Developing the Lambda Function: Wrote Python code to integrate with the BDA API
  • Event-Driven Architecture: Set up S3 event triggers for automated processing
  • Testing and Monitoring: Deployed, tested, and monitored the solution with CloudWatch
  • Troubleshooting: Handled common issues and applied best practices

Next Steps

Now that you have a solid foundation, continue exploring:

  1. Expand to multimodal: Try processing images, audio, and video with BDA
  2. Integrate downstream systems: Connect with DynamoDB, API Gateway, or business applications
  3. Advanced patterns: Implement human-in-the-loop workflows, model evaluation
  4. Production readiness: Error handling, DLQ, cost optimization, security hardening
  5. Explore RAG: Build multimodal RAG applications with BDA and Knowledge Bases

Acknowledgments

Thank you for reading this article! Hope this guide helps you start your journey with Amazon Bedrock Data Automation. Wish you success in building intelligent document processing solutions!
