Document Processing Using Amazon Bedrock Data Automation (BDA)

Amazon Bedrock Data Automation (BDA) is a cloud-based service designed to make it easier to get insights from unstructured data such as documents, images, video, and audio.

Here are some example use cases:

  • Document processing: BDA helps automate intelligent document processing (IDP) at scale without requiring complex steps such as document classification, data extraction, normalization, or validation.
  • Media analysis: BDA enriches unstructured video content by generating scene-level summaries, detecting unsafe or explicit material, extracting on-screen text, and classifying content based on advertisements or brands.
  • Generative AI assistants: BDA improves retrieval-augmented generation (RAG)–based question-answering systems by supplying detailed, specific information extracted from documents, images, video, and audio.

In this blog post, I want to walk through the AWS BDA workshop.

Here is the official workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/c64e3606-ab68-4521-81ea-b2eb36c993b9/en-US

And here is my forked repo, which has the updated template and notebooks with all the results: https://github.com/Hung-00/sample-document-processing-with-amazon-bedrock-data-automation

Because the original template in the workshop has some problems when deployed (it references an outdated LLM), I updated my own template, which you can use instead of the original: https://github.com/Hung-00/sample-document-processing-with-amazon-bedrock-data-automation/blob/main/bda.yaml

You should also complete the workshop with the updated notebooks in my forked repo, since they include some changes to the LLMs used.

Understanding the Core Concepts

BDA's standard output feature provides immediate value with minimal configuration. Simply send your file to BDA, and it returns commonly required information based on the data type:

  • Documents: Page-level text extraction, element detection (tables, figures, charts), structural analysis with markdown formatting, and document summaries
  • Images: Content moderation, text detection, and image summaries
  • Video: Scene summaries, transcripts, and content moderation
  • Audio: Transcriptions and audio summaries

What makes standard output powerful is its flexibility. You can configure the following, as sketched in the example after this list:

  • Response Granularity: Choose from document, page, element, line, or word-level extraction
  • Text Formats: Get results in plaintext, markdown, HTML, or CSV
  • Bounding Boxes: Extract precise element locations on pages
  • Generative Fields: Enable AI-generated summaries and descriptions for figures and charts
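
To make that concrete, here is a rough sketch of a document standard-output configuration: page- and element-level extraction in markdown, with bounding boxes and generative fields enabled. The key names follow the BDA project API used in the workshop notebooks, so treat the exact shape as approximate and verify it against the docs:

# Passed as standardOutputConfiguration when creating a BDA project
# (for example via the bedrock-data-automation client's create_data_automation_project)
standard_output_config = {
    "document": {
        "extraction": {
            "granularity": {"types": ["PAGE", "ELEMENT"]},  # response granularity
            "boundingBox": {"state": "ENABLED"},            # element locations on each page
        },
        "generativeField": {"state": "ENABLED"},            # AI-generated summaries/descriptions
        "outputFormat": {
            "textFormat": {"types": ["MARKDOWN"]},          # plaintext, markdown, HTML, or CSV
            "additionalFileFormat": {"state": "DISABLED"},
        },
    }
}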

And when you need specific information extracted from documents or images, custom output with blueprints is your solution. A blueprint is essentially a schema that defines exactly what fields you want to extract, their data types, and validation rules.

Key Features of Blueprints:

  • Catalog Blueprints: Pre-built blueprints for common documents like forms, paystubs, receipts, driver's licenses, bank statements, and medical insurance cards
  • Custom Blueprints: Define your own schemas with fields, groups, and tables
  • Automatic Matching: When processing files with multiple document types, BDA automatically matches each document to the appropriate blueprint
  • Normalization: Apply natural language context for data validation and normalization

Explore the notebooks in the repository:

  • 11_getting_started_with_bda.ipynb: Learn BDA basics and API workflow
  • 12_standard_output_extended.ipynb: Deep dive into standard output configuration
  • 13_custom_outputs_and_blueprints.ipynb: Master custom blueprints and projects
  • 21_mortgage_and_lending.ipynb: Build a mortgage document processing solution
  • 22_medical_claims_processing.ipynb: Create an end-to-end claims processing workflow

I will go through two real-world use cases from the workshop.

Mortgage and Lending: Accelerating Loan Processing

The mortgage industry handles massive volumes of documentation for each loan application. A typical lending package includes:

  • Identity verification documents (driver's licenses, passports)
  • Financial documents (bank statements, W-2 forms, paystubs, checks)
  • Property documents (homeowner insurance applications, appraisals)

The Challenge: Manual review of these documents is slow, expensive, and error-prone. Loan officers spend hours verifying information across multiple document types.

The BDA Solution: By creating a project with multiple blueprints (both catalog and custom), BDA can:

  1. Automatically split multi-page PDF packages into individual documents
  2. Classify each document type (driver's license, bank statement, W-2, etc.)
  3. Match documents to the appropriate blueprint
  4. Extract structured data from each document
  5. Validate information consistency across documents

Each document is processed with its specific blueprint, extracting exactly the fields needed for loan verification. Processing time drops from hours to minutes, with higher accuracy and consistency.

We use this custom blueprint to process a Homeowner Insurance Form:

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "This blueprint will process a homeowners insurance application form",
    "class": "default",
    "type": "object",
    "properties": {
        "Insured Name": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "Insured's Name"
        },
        "Insurance Company": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "insurance company name"
        },
        "Insured Address": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "the address of the insured property"
        },
        "Email Address": {
            "type": "string",
            "inferenceType": "explicit",
            "instruction": "the primary email address"
        }
    }
}
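Before invoking BDA with this schema, it has to be registered as a blueprint. A minimal sketch, assuming the boto3 bedrock-data-automation build-time client, an illustrative blueprint name, and that blueprint_schema holds the JSON shown above:

import json
import boto3

bda_client = boto3.client("bedrock-data-automation")

# Register the schema above as a custom blueprint (the name is illustrative)
response = bda_client.create_blueprint(
    blueprintName="homeowner-insurance-application",
    type="DOCUMENT",
    blueprintStage="LIVE",
    schema=json.dumps(blueprint_schema),  # the schema is passed as a JSON string
)
blueprint_arn = response["blueprint"]["blueprintArn"]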

Invoke data automation with code like this:

import boto3

# Runtime client for starting Bedrock Data Automation jobs
run_client = boto3.client("bedrock-data-automation-runtime")

response = run_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': f"s3://{bucket_name}/{object_name}"},
    outputConfiguration={'s3Uri': f"s3://{bucket_name}/{output_name}"},
    blueprints=[{'blueprintArn': blueprint_arn, 'stage': 'LIVE'}],
    dataAutomationProfileArn=dataAutomationProfileArn)
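The call is asynchronous, so you poll the job status before reading the results from S3. A rough sketch, assuming the GetDataAutomationStatus runtime API and the invocationArn returned by the call above (field names are worth double-checking in the notebooks):

import time

invocation_arn = response["invocationArn"]

# Poll until the asynchronous job leaves the Created/InProgress states
while True:
    status = run_client.get_data_automation_status(invocationArn=invocation_arn)
    if status["status"] not in ("Created", "InProgress"):
        break
    time.sleep(5)

print(status["status"])                        # e.g. "Success"
print(status["outputConfiguration"]["s3Uri"])  # where BDA wrote the result files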

A lending package is a single PDF file that contains multiple documents needed to apply for a loan, and BDA can handle that as well; a project-level sketch of that setup follows the page breakdown below.

When processing a 50-page lending package, BDA automatically detects:

  • A driver's license on pages 1-2
  • Bank statements on pages 3-15
  • W-2 forms on pages 16-18
  • Paystubs on pages 19-30
  • A check image on page 31
  • Insurance documents on pages 32-50
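
To handle a package like this, you attach every blueprint the package might need to a project and enable the document splitter, then point the invocation at the project instead of a single blueprint. A hedged sketch of that setup (the project name and the blueprint ARN variables are placeholders, and the exact parameter shapes come from the BDA project API, so verify them against the workshop notebooks):

import boto3

bda_client = boto3.client("bedrock-data-automation")  # build-time (control-plane) client

response = bda_client.create_data_automation_project(
    projectName="lending-package-project",  # illustrative name
    projectStage="LIVE",
    # Minimal standard output; see the earlier configuration sketch for more options
    standardOutputConfiguration={
        "document": {
            "extraction": {"granularity": {"types": ["PAGE"]},
                           "boundingBox": {"state": "DISABLED"}},
            "generativeField": {"state": "DISABLED"},
            "outputFormat": {"textFormat": {"types": ["MARKDOWN"]},
                             "additionalFileFormat": {"state": "DISABLED"}},
        }
    },
    # Attach catalog and custom blueprints (the ARN variables are placeholders)
    customOutputConfiguration={
        "blueprints": [
            {"blueprintArn": drivers_license_blueprint_arn, "blueprintStage": "LIVE"},
            {"blueprintArn": bank_statement_blueprint_arn, "blueprintStage": "LIVE"},
            {"blueprintArn": homeowner_insurance_blueprint_arn, "blueprintStage": "LIVE"},
        ]
    },
    # Split the multi-document PDF into individual documents before blueprint matching
    overrideConfiguration={"document": {"splitter": {"state": "ENABLED"}}},
)
project_arn = response["projectArn"]

The invoke call would then reference the project (via dataAutomationConfiguration with the project ARN) rather than listing blueprints inline.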

Medical Claims Processing

Healthcare organizations process millions of insurance claims annually. Each claim involves multiple documents, data validation, and policy verification.

The medical claims solution demonstrates BDA's power when integrated with Amazon Bedrock Agents and Knowledge Bases (an agent-invocation sketch follows these steps):

  1. Document Ingestion: Medical claim forms (CMS 1500) are submitted and stored in S3
  2. BDA Processing: A custom blueprint extracts all claim fields including patient information, provider details, diagnosis codes, procedure codes, and charges
  3. Agent Orchestration: A Bedrock Agent receives the extracted data and orchestrates the verification workflow
  4. Action Groups: The agent uses Lambda-backed action groups to:
    • Query member and patient information from Aurora PostgreSQL
    • Validate coverage eligibility
    • Check claim data consistency
  5. Knowledge Base Integration: The agent queries a Bedrock Knowledge Base containing Evidence of Coverage (EoC) documents to verify:
    • Treatment coverage under the patient's plan
    • Policy limits and exclusions
    • Pay requirements
  6. Report Generation: The agent generates a comprehensive verification report and stores it in S3
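
As a rough illustration of steps 3 onward, invoking the Bedrock Agent with the extracted claim data might look like the sketch below. The agent ID, alias ID, and claim_fields variable are placeholders, not values from the workshop:

import json
import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# claim_fields stands in for the structured data BDA extracted from the CMS 1500 form
prompt = ("Verify this medical claim against the member's coverage and "
          "produce a verification report:\n" + json.dumps(claim_fields))

response = agent_runtime.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",
    agentAliasId="AGENT_ALIAS_ID_PLACEHOLDER",
    sessionId=str(uuid.uuid4()),
    inputText=prompt,
)

# The agent's completion is streamed back as chunks of bytes
report = "".join(event["chunk"]["bytes"].decode("utf-8")
                 for event in response["completion"] if "chunk" in event)
print(report)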

Key Innovation: By combining BDA's extraction capabilities with Bedrock Agents' orchestration and Knowledge Bases' RAG capabilities, the solution provides:

  • High accuracy in field extraction
  • Automated verification against policy documents
  • Consistent decision-making based on documented coverage rules
  • Complete audit trails for compliance

Evidence of Coverage documents are ingested directly into the Knowledge Base.
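
One way to do this is to point a Knowledge Base data source at the S3 prefix holding the EoC PDFs and trigger a sync. A small sketch, assuming the bedrock-agent client and placeholder IDs:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Sync the S3 data source that holds the Evidence of Coverage documents
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    dataSourceId="DATA_SOURCE_ID_PLACEHOLDER",
)
print(response["ingestionJob"]["status"])  # STARTING / IN_PROGRESS / COMPLETE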

From this health insurance claim form:

BDA can extract information based on the blueprint:

Now we invoke the AI agent for claim verification, and everything works pretty much perfectly.

Conclusion

I think Amazon Bedrock Data Automation represents an innovative shift in how organizations handle unstructured data. By combining powerful extraction capabilities with flexible configuration options and integration with other AWS services, BDA enables:

  • Mortgage lenders to process loan applications 10x faster
  • Healthcare providers to automate claims processing with agent-based verification
  • Financial institutions to extract insights from complex reports
  • Legal teams to process and analyze large document sets
  • ... and many more use cases.

I think this is amazing. Go have a look at this service.
