Lam Bùi
Build Intelligent Document Processing with AWS Lambda and Bedrock Data Automation

Introduction

This guide provides detailed instructions on building a serverless document processing pipeline using AWS Lambda and Amazon Bedrock Data Automation (BDA). We will create an automated document processing system where Lambda functions are triggered when documents are uploaded to S3, using BDA to extract structured information from documents.

What You Will Learn

After completing this guide, you will be able to:

  • Create and configure Lambda functions through AWS Console
  • Integrate Lambda with Amazon Bedrock Data Automation
  • Set up S3 Event triggers for Lambda
  • Automatically process documents and save results
  • Monitor and troubleshoot Lambda functions

Solution Architecture

The pipeline is event-driven: a document uploaded to the input S3 bucket triggers a Lambda function, which invokes Bedrock Data Automation to extract structured information and writes the results to an output S3 bucket.

Knowledge Requirements

  • Basic understanding of AWS Console
  • Basic Python knowledge
  • Understanding of S3 and Lambda (basic level)

Article Structure

This article is divided into 5 main parts:

| Part | Content |
| --- | --- |
| Part 1 | Environment Setup (Enable Models, S3, IAM, BDA Project) |
| Part 2 | Create and Configure Lambda Function |
| Part 3 | Set up S3 Trigger |
| Part 4 | Testing and Deployment |
| Part 5 | Real-world Examples and Best Practices |

Prerequisites

Before starting, ensure you have:

  • AWS Account with IAM permissions for Bedrock, Lambda, S3, IAM
  • Access to AWS Console
  • Selected a region supporting Amazon Bedrock Data Automation (recommended: us-east-1 or us-west-2)
  • Modern browser (Chrome, Firefox, Safari, or Edge)

Note about Region: Amazon Bedrock Data Automation is not available in all regions. Check Bedrock endpoints and quotas to see supported regions.


Lambda Function Code Reference

Before setting up the environment, here is the complete Lambda function code that you will use later in Part 2. Having this code available upfront allows you to understand what we're building toward.

File: lambda_function.py

import json
import boto3
import os
import time
from datetime import datetime
from urllib.parse import unquote_plus

# Initialize AWS clients
s3_client = boto3.client('s3')
bda_runtime_client = boto3.client('bedrock-data-automation-runtime')
sts_client = boto3.client('sts')

# Get configuration from environment variables
BDA_PROJECT_ARN = os.environ['BDA_PROJECT_ARN']
OUTPUT_BUCKET = os.environ['OUTPUT_BUCKET']

# Get region and account ID dynamically (Lambda provides these automatically)
AWS_REGION = os.environ['AWS_REGION']  # Lambda provides this automatically
account_id = sts_client.get_caller_identity()['Account']

# Construct BDA Profile ARN dynamically
BDA_PROFILE_ARN = f"arn:aws:bedrock:{AWS_REGION}:{account_id}:data-automation-profile/us.data-automation-v1"

def lambda_handler(event, context):
    """
    Lambda handler function to process documents with BDA.
    Triggered by S3 upload events.
    """

    try:
        # Extract S3 bucket and key from event
        s3_event = event['Records'][0]['s3']
        input_bucket = s3_event['bucket']['name'].lower()  # Ensure lowercase
        # URL decode the key (S3 event has URL-encoded keys)
        document_key = unquote_plus(s3_event['object']['key'])

        print(f"Processing document: s3://{input_bucket}/{document_key}")

        # Validate bucket name format (BDA requirement)
        if not input_bucket.replace('-', '').replace('.', '').isalnum():
            raise ValueError(f"Invalid bucket name format: {input_bucket}. Must contain only lowercase letters, numbers, dots, and hyphens.")

        # Construct S3 URIs
        input_s3_uri = f"s3://{input_bucket}/{document_key}"

        # Generate output path with timestamp
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        file_name = document_key.split('/')[-1]
        output_prefix = f"processed/{timestamp}_{file_name}"
        output_s3_uri = f"s3://{OUTPUT_BUCKET}/{output_prefix}"

        # Invoke BDA to process the document
        print("Invoking Bedrock Data Automation...")
        response = bda_runtime_client.invoke_data_automation_async(
            inputConfiguration={
                's3Uri': input_s3_uri
            },
            outputConfiguration={
                's3Uri': output_s3_uri
            },
            dataAutomationConfiguration={
                'dataAutomationProjectArn': BDA_PROJECT_ARN,
                'stage': 'LIVE'
            },
            dataAutomationProfileArn=BDA_PROFILE_ARN
        )

        invocation_arn = response['invocationArn']
        print(f"BDA Invocation ARN: {invocation_arn}")

        # Wait for processing to complete (with timeout)
        max_wait_time = 240  # 4 minutes (leave 1 min for cleanup)
        wait_interval = 10
        elapsed_time = 0

        while elapsed_time < max_wait_time:
            status_response = bda_runtime_client.get_data_automation_status(
                invocationArn=invocation_arn
            )

            status = status_response['status']
            print(f"Current status: {status}")

            if status == 'Success':
                result_s3_uri = status_response['outputConfiguration']['s3Uri']
                print(f"Processing completed successfully!")
                print(f"Results saved to: {result_s3_uri}")

                return {
                    'statusCode': 200,
                    'body': json.dumps({
                        'message': 'Document processed successfully',
                        'input': input_s3_uri,
                        'output': result_s3_uri,
                        'invocationArn': invocation_arn
                    })
                }

            elif status in ['ClientError', 'ServiceError']:
                error_msg = f"Processing failed with status: {status}"
                print(error_msg)
                return {
                    'statusCode': 500,
                    'body': json.dumps({
                        'error': error_msg,
                        'invocationArn': invocation_arn
                    })
                }

            # Wait before checking again
            time.sleep(wait_interval)
            elapsed_time += wait_interval

        # Timeout reached
        print(f"Processing still in progress after {max_wait_time} seconds")
        return {
            'statusCode': 202,
            'body': json.dumps({
                'message': 'Processing initiated but not completed within Lambda timeout',
                'invocationArn': invocation_arn,
                'note': 'Check BDA console for final status'
            })
        }

    except Exception as e:
        error_msg = f"Error processing document: {str(e)}"
        print(error_msg)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': error_msg
            })
        }


Key Features of This Code:

  • Automatically converts bucket names to lowercase (BDA requirement)
  • URL decodes file names from S3 events
  • Validates bucket name format
  • Dynamically constructs BDA Profile ARN
  • Polls for completion status with timeout
  • Comprehensive error handling and logging

When you reach Step 2.3 in Part 2, simply copy this code into your Lambda function.
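
If you want to sanity-check the handler before wiring up the S3 trigger, a minimal local test like the sketch below simulates the event Lambda will receive. The bucket names, key, and ARN are the placeholders used throughout this guide; note that lambda_function.py reads its environment variables and calls STS at import time, so valid AWS credentials are required.

# Local smoke test for lambda_handler (placeholder values; requires AWS credentials).
import os

# lambda_function.py reads these at import time, so set them first.
os.environ.setdefault('BDA_PROJECT_ARN', 'arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz')
os.environ.setdefault('OUTPUT_BUCKET', 'bda-workshop-output-demo-xyz789')
os.environ.setdefault('AWS_REGION', 'us-east-1')

from lambda_function import lambda_handler

# Shape mirrors the S3 "ObjectCreated" event the trigger will deliver.
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'bda-workshop-input-demo-xyz789'},
            'object': {'key': 'documents/document-test.pdf'}
        }
    }]
}

print(lambda_handler(test_event, context=None))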

Introduction to Amazon Bedrock Data Automation

Intelligent Document Processing with Generative AI

Generative AI not only drives innovation through ideation and content creation but also optimizes operational processes and increases productivity across various domains. Amazon Bedrock Data Automation (BDA) is a fully managed service that provides Intelligent Document Processing (IDP) capabilities enhanced by generative AI.

Why Choose BDA?

1. Complete Automation

Enterprises can extract significant value from IDP enhanced with generative AI. By integrating generative AI capabilities into IDP solutions, organizations can:

  • Advanced Document Understanding: Deep analysis of structure and semantics
  • Structured Data Extraction: Automatic transformation from unstructured to structured data
  • Automatic Classification: Document type recognition and appropriate routing
  • Information Retrieval: Search and retrieve information from unstructured text

2. BDA vs Direct Bedrock API

| Aspect | Direct Bedrock API | BDA (Managed Service) |
| --- | --- | --- |
| Complexity | You build the pipeline yourself | Fully managed, integrated |
| Document Understanding | Requires prompt engineering | Built-in document intelligence |
| Multimodal | Text and images only (model-dependent) | PDF, images, audio, video |
| Output Format | Custom JSON | Standardized JSON + Markdown/HTML/CSV |
| Bounding Boxes | Not available | Element positions provided automatically |
| Scalability | Self-managed | Auto-scaling |
| Best For | Custom IDP solutions | Enterprise document processing |

Business Value and Use Cases

Government & Public Sector

Process and extract data from:

  • Birth certificate applications, ID/Passport applications
  • Immigration and visa records
  • Legal contracts and government forms

Benefits: Reduce processing time from days to minutes, improve citizen service quality.

Healthcare

Extract and organize information from:

  • Electronic medical records
  • Insurance claim requests
  • Prescriptions and test results
  • Clinical trial records

Benefits: Improve data accuracy, increase information accessibility for better patient care.

Finance & Banking

Automate processing:

  • Loan and credit applications
  • Financial reports and tax documents
  • Contracts and agreements
  • Compliance documentation

Benefits: Reduce manual work, increase operational efficiency, reduce compliance risks.

Logistics & Supply Chain

Process:

  • Shipping documents and invoices
  • Purchase orders and warehouse receipts
  • Supplier contracts
  • Customs certificates

Benefits: Optimize processes, increase supply chain visibility.

Retail & E-commerce

Automate:

  • Customer orders
  • Product catalogs and descriptions
  • Invoices and receipts
  • Marketing documents

Benefits: Personalize customer experience, process orders quickly and efficiently.

Why Combine BDA with Lambda?

Combining BDA with AWS Lambda creates a powerful serverless IDP pipeline:

  1. Event-Driven: Automatically processes when new documents arrive
  2. Scalable: Automatically scales with document volume
  3. Cost-Effective: Pay only when processing
  4. Low Maintenance: No server management needed
  5. Integration Ready: Easy to integrate with existing systems

Statistics and Impact

According to AWS, organizations deploying IDP with generative AI have achieved:

  • 90%+ reduction in document processing time
  • 60-70% cost savings in operations
  • 95%+ accuracy in data extraction
  • 80% reduction in manual intervention needs

AWS Services in This Guide

This guide uses the following AWS services to build a serverless IDP pipeline:

Amazon Bedrock Data Automation

Fully managed service providing intelligent document processing capabilities with generative AI. BDA automatically processes documents and extracts structured information without needing to orchestrate complex tasks.

Key Features:

  • Document classification and extraction
  • Multi-granularity analysis (document, page, element level)
  • Generative summaries and descriptions
  • Support for diverse formats: PDF, images, audio, video

AWS Lambda

Serverless computing service that allows running code without managing servers. Lambda automatically scales and you only pay for compute time used.

Role in solution:

  • Triggered when new document uploaded
  • Calls BDA API to process document
  • Handles and saves results

Amazon S3 (Simple Storage Service)

Highly scalable, durable, and secure object storage service.

Role in solution:

  • Store input documents
  • Store output results from BDA
  • Trigger S3 events for Lambda

Amazon CloudWatch

Monitoring and observability service for AWS resources and applications.

Role in solution:

  • Collect logs from Lambda execution
  • Monitor metrics (invocations, errors, duration)
  • Troubleshooting and debugging

AWS IAM (Identity and Access Management)

Securely manage access and permissions for AWS resources.

Role in solution:

  • Create execution role for Lambda
  • Grant permissions to S3, Bedrock, CloudWatch
  • Security and access control

Detailed Workflow

1. USER uploads document
   │
   ▼
2. S3 Event Notification
   │
   ▼
3. Lambda Function triggered
   │
   ├─▶ Read document from S3
   │
   ├─▶ Call BDA InvokeDataAutomationAsync API
   │   │
   │   ▼
   │   BDA Processing:
   │   ├─ Document ingestion
   │   ├─ Structure analysis
   │   ├─ Content extraction (text, tables, figures)
   │   ├─ Semantic enrichment (AI summaries)
   │   └─ Result formation (JSON, Markdown, CSV)
   │
   ├─▶ Poll for completion status
   │
   └─▶ Save results to S3 output bucket

Difference from Traditional IDP

| Traditional IDP | BDA-Powered IDP |
| --- | --- |
| Rule-based extraction | AI-powered understanding |
| Template dependency | Template-free processing |
| Manual training needed | Pre-trained models |
| Limited format support | Multi-format support |
| No semantic understanding | Deep semantic analysis |
| Fixed output structure | Flexible, rich output |

Part 1: Environment Setup

Step 1.1: Enable Model Access in Amazon Bedrock

Purpose: Enable access to foundation models required for BDA.

Important: This is a mandatory step before using Amazon Bedrock Data Automation. Without enabling model access, BDA cannot process documents.

Models to Enable:

We need to enable the following models for this guide:

  1. Amazon Models: All models
  2. Claude Models:
    • Claude 3.5 Haiku
    • Claude 3 Sonnet
    • Claude 3.5 Sonnet
  3. Cohere Models:
    • Cohere Rerank 3.5

Implementation Steps:

  1. Open Amazon Bedrock Console
    • Search for Bedrock in the top search bar
    • Click on Amazon Bedrock
  2. Access Model Access

    • In the left menu (or navigation menu), click on Model access
    • Click Modify model access button (or Request model access if first time)
  3. Select Models
    In the models list, select the following:

    Amazon Models:

    • Check all Amazon models

    Claude Models (by Anthropic):

    • Check Claude 3.5 Haiku
    • Check Claude 3 Sonnet
    • Check Claude 3.5 Sonnet

    Cohere Models:

    • Check Cohere Rerank 3.5
  4. Submit Request

    • Click Next button at bottom right
    • Review selected models
    • Click Submit to request access
  5. Wait for Approval

    • Most models will be approved instantly (status: Access granted)
    • Some models may take a few minutes for provisioning
    • Refresh page to see status updates

When all models show status "Access granted", you are ready to continue.
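
Optional: you can also check from code which foundation models are visible in your region. A small boto3 sketch follows; listing models shows regional availability, while the Model access page remains the source of truth for grant status.

import boto3

# List Anthropic models available in the region. Appearing in this list does
# not by itself confirm that access has been granted to your account.
bedrock = boto3.client('bedrock', region_name='us-east-1')
for model in bedrock.list_foundation_models(byProvider='Anthropic')['modelSummaries']:
    print(model['modelId'], '-', model['modelName'])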


Step 1.2: Create S3 Buckets

Purpose: Create S3 buckets to store input documents and output results.

  1. Log in to AWS Management Console
  2. Search for and select S3 service
  3. Click Create bucket button

Create Input Bucket:

  1. Enter information:
    • Bucket name: bda-workshop-input-demo-xyz789 (replace xyz789 with your own random string to ensure the bucket name is unique)
    • Block Public Access settings: Keep default (Block all public access)
  2. Scroll down and click Create bucket

Create Output Bucket:

  1. Repeat the steps above with:
    • Bucket name: bda-workshop-output-demo-xyz789 (use the same suffix as the input bucket)
    • Keep all other configurations at their defaults
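
If you prefer to script this step, a minimal boto3 sketch is shown below. The bucket names are the placeholders from this guide, and us-east-1 is assumed; any other region requires a CreateBucketConfiguration with a LocationConstraint.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Placeholder names from this guide; replace xyz789 with your own suffix.
for bucket in ['bda-workshop-input-demo-xyz789', 'bda-workshop-output-demo-xyz789']:
    # In us-east-1, create_bucket takes no CreateBucketConfiguration; in other
    # regions pass CreateBucketConfiguration={'LocationConstraint': '<region>'}.
    s3.create_bucket(Bucket=bucket)
    print(f"Created s3://{bucket}")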

Step 1.3: Create IAM Role for Lambda

Purpose: Create IAM role with sufficient permissions for Lambda to access S3 and Bedrock.

  1. In AWS Console, search for and select IAM
  2. In left menu, select Roles
  3. Click Create role button

    Select Trusted Entity:

  4. Trusted entity type: Select AWS service

  5. Use case: Select Lambda

  6. Click Next

    Assign Permissions:

  7. In the policy search box, find and select the following policies (check the checkbox):

    • AWSLambdaBasicExecutionRole (for CloudWatch Logs)
    • AmazonS3FullAccess (to read/write S3)
    • AmazonBedrockFullAccess (to use Bedrock BDA)

    Security Note: In production environments, create custom policies with the minimum necessary permissions instead of using FullAccess policies.
  8. Click Next

Name and Create Role:

  1. Role name: BDA-Lambda-ExecutionRole
  2. Description: Execution role for BDA document processing Lambda function
  3. Scroll down and click Create role
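
The console steps above can also be scripted. A minimal boto3 sketch, using the role name and managed policies from this guide:

import json
import boto3

iam = boto3.client('iam')

# Lambda must be allowed to assume this role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}

role = iam.create_role(
    RoleName='BDA-Lambda-ExecutionRole',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='Execution role for BDA document processing Lambda function'
)

# Same managed policies as the console steps; in production, prefer a
# least-privilege custom policy as noted above.
for policy_arn in [
    'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole',
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AmazonBedrockFullAccess',
]:
    iam.attach_role_policy(RoleName='BDA-Lambda-ExecutionRole', PolicyArn=policy_arn)

print(role['Role']['Arn'])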

Step 1.4: Create BDA Project

Purpose: Create BDA project to configure document processing.

  1. In AWS Console, search for and select Amazon Bedrock
  2. In left menu, select Data Automation > Set-up Project
  3. Click Create project button

Configure Project:

  1. Enter information:

    • Project name: document-processing-project
    • Click Create project
  2. In Standard output configuration section, configure:

    • Click Edit
    • Granularity types:
      • Check DOCUMENT
      • Check PAGE
      • Check ELEMENT
    • Bounding box: Select ENABLED
    • Generative field: Select ENABLED
    • Text format types:
      • Check MARKDOWN
    • Additional file format: Select ENABLED
  3. Click Save changes

  4. Wait for the confirmation message Changes saved successfully.

  5. Copy and save Project ARN (will be used in Lambda configuration step)

    • Click on project name to view details
    • Copy ARN from project details page
    • Example ARN: arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz
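
For reference, the project can also be created programmatically with the bedrock-data-automation client. The sketch below mirrors the console settings above; the exact shape of standardOutputConfiguration can vary between SDK versions, so treat this as an assumption to verify against your boto3 documentation, with the console flow as the authoritative path.

import boto3

bda = boto3.client('bedrock-data-automation', region_name='us-east-1')

# Mirrors the console choices: all granularities, bounding boxes, generative
# fields, Markdown text output, and additional file formats enabled.
response = bda.create_data_automation_project(
    projectName='document-processing-project',
    standardOutputConfiguration={
        'document': {
            'extraction': {
                'granularity': {'types': ['DOCUMENT', 'PAGE', 'ELEMENT']},
                'boundingBox': {'state': 'ENABLED'}
            },
            'generativeField': {'state': 'ENABLED'},
            'outputFormat': {
                'textFormat': {'types': ['MARKDOWN']},
                'additionalFileFormat': {'state': 'ENABLED'}
            }
        }
    }
)
print(response['projectArn'])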

Part 2: Create Lambda Function

Step 2.1: Create Lambda Function

  1. In AWS Console, search for and select Lambda
  2. Click Create function button

  3. Select Author from scratch

  4. Enter information:

    • Function name: BDA-Document-Processor
    • Runtime: Select Python 3.12
    • Architecture: Select x86_64

    Permissions:
  5. In Permissions section, select Use an existing role

  6. Existing role: Select BDA-Lambda-ExecutionRole (role created in step 1.3)

  7. Click Create function

Step 2.2: Configure Lambda Function

Increase Timeout and Memory:

  1. In Lambda function page, select Configuration tab
  2. Select General configuration > Click Edit
  3. Configure:
    • Memory: 512 MB
    • Timeout: 5 min (5 minutes)
    • Description: Processes documents using Bedrock Data Automation
  4. Click Save
  5. Still in Configuration tab, select Environment variables
  6. Click Edit > Click Add environment variable
  7. Add the following variables:

| Key | Value | Description |
| --- | --- | --- |
| BDA_PROJECT_ARN | Paste the ARN saved in step 1.4 | ARN of the created BDA project |
| OUTPUT_BUCKET | bda-workshop-output-demo-xyz789 | Output bucket name (replace with yours) |

  8. Click Save
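
If you script your deployments, the same settings can be applied with boto3. A minimal sketch, assuming the function name from Step 2.1 and the placeholder ARN from Step 1.4:

import boto3

lambda_client = boto3.client('lambda')

# Apply the memory, timeout, and environment settings from Step 2.2.
# (AWS_REGION is reserved and set by Lambda itself, so it is not included.)
lambda_client.update_function_configuration(
    FunctionName='BDA-Document-Processor',
    MemorySize=512,
    Timeout=300,  # seconds (5 minutes)
    Environment={
        'Variables': {
            'BDA_PROJECT_ARN': 'arn:aws:bedrock:us-east-1:111111111111:data-automation-project/abc123xyz',
            'OUTPUT_BUCKET': 'bda-workshop-output-demo-xyz789'
        }
    }
)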

Step 2.3: Write Lambda Code

  1. Return to Code tab
  2. In editor, delete sample code and replace with the complete Lambda function code provided in the Lambda Function Code Reference section above (before Part 1).
  3. Click Deploy to save code

The code handles S3 events, validates bucket names, invokes BDA, and polls for completion status with comprehensive error handling.

Part 3: Configure S3 Trigger

Step 3.1: Add S3 Trigger for Lambda

Purpose: Configure Lambda to automatically run when new files are uploaded to S3.

  1. In Lambda function page, select Configuration tab
  2. Select Triggers in left menu
  3. Click Add trigger

  4. Select a source: Select S3

  5. Bucket: Select bda-workshop-input-demo-xyz789 (input bucket created in step 1.2)

  6. Event type: Select All object create events

  7. Prefix (optional): Leave blank (process all files) or enter documents/ to process only files in the documents folder

  8. Suffix (optional): Enter .pdf to process only PDF files

  9. Check the checkbox I acknowledge that using the same S3 bucket...

  10. Click Add
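
Under the hood, the Add trigger button does two things: it grants S3 permission to invoke the function and writes a notification configuration on the bucket. A boto3 sketch of the same wiring, using the names from this guide:

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

bucket = 'bda-workshop-input-demo-xyz789'
account_id = boto3.client('sts').get_caller_identity()['Account']
function_arn = lambda_client.get_function(
    FunctionName='BDA-Document-Processor'
)['Configuration']['FunctionArn']

# 1) Allow S3 (this bucket in this account only) to invoke the function.
lambda_client.add_permission(
    FunctionName='BDA-Document-Processor',
    StatementId='s3-invoke-bda-processor',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{bucket}',
    SourceAccount=account_id
)

# 2) Route object-create events (filtered to .pdf here) to the function.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.pdf'}]}}
        }]
    }
)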


Part 4: Testing and Deployment

Step 4.1: Prepare Test Document

  1. Download a PDF sample file to your computer (or use existing file)

Important - File Naming:

  • File names should only contain: letters (a-z, A-Z), numbers (0-9), hyphens (-), underscores (_), dots (.)
  • Should NOT use: spaces, special characters, or non-ASCII characters (e.g., Vietnamese diacritics)
  • Good file name examples: document-test.pdf, report_2024.pdf
  • File names to avoid: report document.pdf, document (1).pdf
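
A quick way to apply these rules before uploading is a small regular-expression check, sketched here:

import re

# Allowed: letters, digits, hyphens, underscores, dots (the rule above).
SAFE_NAME = re.compile(r'^[A-Za-z0-9._-]+$')

for name in ['document-test.pdf', 'report_2024.pdf', 'report document.pdf', 'document (1).pdf']:
    print(name, '->', 'OK' if SAFE_NAME.match(name) else 'rename before uploading')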

Step 4.2: Upload and Test

  1. Open S3 Console
  2. Go to bucket bda-workshop-input-demo-xyz789 (your input bucket)
  3. Click Upload
  4. Click Add files and select test PDF file
  5. Click Upload
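
The upload can also be done from Python; a boto3 sketch, assuming a local file named document-test.pdf:

import boto3

s3 = boto3.client('s3')

# Uploading under documents/ keeps an optional prefix filter (Step 3.1) matching.
s3.upload_file(
    Filename='document-test.pdf',
    Bucket='bda-workshop-input-demo-xyz789',
    Key='documents/document-test.pdf'
)
print('Uploaded; the S3 trigger should invoke the Lambda function within seconds.')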

Step 4.3: Check Lambda Execution

  1. Return to Lambda Console
  2. Select function BDA-Document-Processor
  3. Select Monitor tab
  4. Click View CloudWatch logs

  5. Click on latest log stream (most recent time)

  6. View logs to verify:

    • Lambda was triggered
    • Document being processed
    • BDA invocation successful
    • Processing completed

Step 4.4: Check Output

  1. Open S3 Console
  2. Go to bucket bda-workshop-output-demo-xyz789 (your output bucket)
  3. Go to processed/ folder
  4. You will see folder with timestamp and file name
  5. Inside that folder will have:
    • job_metadata.json - job metadata
    • standard-output.json - main processing results
    • Other files like markdown, CSV (if any)
  6. Download standard-output.json and view results:
    • Document summary
    • Extracted tables
    • Figures
    • Page-level information
    • Element details
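
To inspect results programmatically, a sketch like the one below downloads and summarizes a result file. The top-level keys ('document', 'pages', 'elements') are assumptions about the standard output layout, which varies with your output configuration, so adapt them after inspecting a real file.

import json
import boto3

s3 = boto3.client('s3')

# Replace with the actual key shown in your output bucket.
bucket = 'bda-workshop-output-demo-xyz789'
key = 'processed/20240101_120000_document-test.pdf/standard-output.json'

result = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

# Assumed sections; .get() keeps this tolerant of layout differences.
print('Pages:', len(result.get('pages', [])))
print('Elements:', len(result.get('elements', [])))
print('Document summary:', result.get('document', {}).get('summary', 'n/a'))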

Real-World Example: Government Birth Certificate Processing

Use Case from AWS Blog

A real-world example from AWS Machine Learning Blog illustrates how IDP with generative AI solves real problems:

The Problem

Government agency issuing birth certificates receives applications through multiple channels:

  • Online applications
  • Forms completed at physical locations
  • Mailed paper applications

Current Manual Process:

  1. Scan paper applications
  2. Staff manually read and enter into system
  3. Checking and validation
  4. Save to database

Issues:

  • Very time-consuming
  • High labor costs
  • Prone to manual data entry errors
  • Not scalable when volume increases
  • Complex if forms are in multiple languages (English, Spanish, etc.)

Solution with BDA

With architecture similar to this guide, but with additions:

  • SQS Queue: Buffer to process messages reliably
  • DynamoDB: Store extracted data
  • Multi-language Support: Automatically translate and extract


Upload Form → S3 → Lambda → BDA → SQS → Lambda → DynamoDB

  • Auto-detect language
  • Extract all fields
  • Translate if needed

Results Achieved

Before:

  • 15-20 minutes/application (manual)
  • High labor costs
  • 85-90% accuracy (human error)

After (with BDA):

  • < 1 minute/application (automated)
  • 60-70% cost savings
  • 95%+ accuracy
  • Process multiple languages
  • Scale to thousands of applications/day

Fields Extracted Automatically

BDA can extract complex information:

  • Applicant information (name, address, contact)
  • Birth certificate recipient information
  • Parent information
  • Fee payment information
  • Signatures and dates
  • Bonus: Automatically translate from Spanish to English

Extending the Solution

You can extend this solution in the following directions:

  1. Add SQS Queue: Buffer processing and retry logic
  2. Add DynamoDB: Store structured data (see the sketch after this list)
  3. Custom Extraction: Define fields to extract for specific domain
  4. Multi-language: Process documents in multiple languages
  5. Human-in-the-loop: Validation for critical data
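
A minimal sketch of directions 1 and 2, assuming a hypothetical queue named bda-results-queue and a hypothetical DynamoDB table named ExtractedDocuments: the processing Lambda publishes the result location to SQS, and a second Lambda consumes the queue and persists the data.

import json
import boto3

sqs = boto3.client('sqs')
dynamodb = boto3.resource('dynamodb')

# Producer side (called at the end of the processing Lambda): enqueue the
# result location so downstream processing is buffered and retryable.
def publish_result(queue_url: str, invocation_arn: str, output_s3_uri: str) -> None:
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({'invocationArn': invocation_arn, 'outputUri': output_s3_uri})
    )

# Consumer side (a second Lambda triggered by SQS): persist extracted data.
def consumer_handler(event, context):
    table = dynamodb.Table('ExtractedDocuments')  # hypothetical table name
    for record in event['Records']:
        message = json.loads(record['body'])
        table.put_item(Item={
            'documentId': message['invocationArn'],
            'outputUri': message['outputUri']
        })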

Cleanup Resources (Optional)

If you do not want to continue using and avoid incurring costs:

Delete Lambda Function:

  1. Lambda Console > Select function > Actions > Delete

Delete S3 Buckets:

  1. S3 Console > Select bucket > Empty (delete all objects)
  2. Then select bucket > Delete

Delete IAM Role:

  1. IAM Console > Roles > Select role > Delete

Delete BDA Project:

  1. Bedrock Console > Data Automation > Projects > Select project > Delete


Conclusion

Congratulations! You have successfully built a serverless document processing pipeline with AWS Lambda and Amazon Bedrock Data Automation!

Summary of What You Learned

In this article, we covered:

  • Understanding BDA and IDP: Learned about Intelligent Document Processing with generative AI and the benefits of Amazon Bedrock Data Automation
  • Building Infrastructure: Created S3 buckets, IAM roles, and a BDA project through the AWS Console
  • Developing the Lambda Function: Wrote Python code to integrate with the BDA API
  • Event-Driven Architecture: Set up S3 event triggers for automated processing
  • Testing and Monitoring: Deployed, tested, and monitored the solution with CloudWatch
  • Troubleshooting: Handled common issues and applied best practices

Next Steps

Now that you have a solid foundation, continue exploring:

  1. Expand to multimodal: Try processing images, audio, and video with BDA
  2. Integrate downstream systems: Connect with DynamoDB, API Gateway, or business applications
  3. Advanced patterns: Implement human-in-the-loop workflows, model evaluation
  4. Production readiness: Error handling, DLQ, cost optimization, security hardening
  5. Explore RAG: Build multimodal RAG applications with BDA and Knowledge Bases

Acknowledgments

Thank you for reading this article! Hope this guide helps you start your journey with Amazon Bedrock Data Automation. Wish you success in building intelligent document processing solutions!
