Muhammed Ashraf for AWS Community Builders

Posted on Sep 14

Using Amazon Textract to Analyze and Extract Text from Documents, Part 1

#ai #aws

Amazon Textract is a powerful machine learning service that analyzes scanned documents and extracts printed text and handwriting from them.

It can be used to build different solutions for different use cases such as:

Financial Services
Health Care

In this article we walk through how to build a solution that extracts text from PDF files, analyzes it, and then stores it in DynamoDB for further analysis.

We will use a mix of AWS services to build our solution. Below is a breakdown of these services and what each one is used for:

S3 Buckets: storage for the raw PDF files and the extracted JSON files
Lambda: invokes Amazon Textract through the StartDocumentAnalysis & GetDocumentAnalysis APIs and stores the JSON files in S3
SNS: used for async communication, to invoke the Lambda that gets the results and stores them in the final bucket
EventBridge: triggers the Lambda function once a file is uploaded to S3
AWS Glue: used as a batch job that iterates over the S3 bucket, converts the files, and ingests them into DynamoDB
DynamoDB: storage for the extracted data

Sequence Flow

Functional Requirements

Users should be able to upload PDF files to an S3 bucket

Non-Functional Requirements

The solution must be highly available
The solution should be reliable

Block Diagram

Our block diagram shows the components that will be used to build our solution

High Level Design

The high-level design shows the services used to build our solution, focusing on ingesting, analyzing & storing the results in DynamoDB

We will break down our solution into different aspects

High availability:
- Storage: our storage services, S3 & DynamoDB, offer high availability. You can find the details in the documentation for each service (S3, DynamoDB)
- Lambda Function: see Resilience in Lambda
- SNS: see Resilience in SNS
- Amazon Textract: see Resilience in Amazon Textract

Amazon Textract APIs

We will use the StartDocumentAnalysis API and the GetDocumentAnalysis API.

The StartDocumentAnalysis API supports several feature types (Tables, Forms, Queries, Signatures, and Layout). We will use Queries to extract specific data from the statements, such as card number, client name, and new charges.
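As a rough illustration of a Queries request, the sketch below builds the arguments for StartDocumentAnalysis. The query texts, the aliases, and the helper names `build_start_request` and `start_analysis` are assumptions made for this example, not the code used in the article. The Textract client is passed in, so it can be created elsewhere with boto3.

```python
from typing import Any

# Hypothetical queries and aliases; the article does not list its exact query text.
STATEMENT_QUERIES = [
    {"Text": "What is the card number?", "Alias": "CARD_NUMBER"},
    {"Text": "What is the card holder name?", "Alias": "CARD_HOLDER_NAME"},
    {"Text": "What are the new charges?", "Alias": "NEW_CHARGES"},
]


def build_start_request(bucket: str, key: str, topic_arn: str, role_arn: str) -> dict:
    """Build keyword arguments for Textract's StartDocumentAnalysis call."""
    return {
        # The PDF is read directly from S3.
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Only the Queries feature is requested.
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": STATEMENT_QUERIES},
        # Textract publishes the job completion status to this SNS topic.
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }


def start_analysis(textract_client: Any, **request: Any) -> str:
    """Start the asynchronous job and return its JobId."""
    response = textract_client.start_document_analysis(**request)
    return response["JobId"]
```

In a Lambda function this could be called as `start_analysis(boto3.client("textract"), **build_start_request(bucket, key, topic_arn, role_arn))`, where the ARNs come from your own setup.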

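The GetDocumentAnalysis API retrieves the results of a job started with StartDocumentAnalysis, which runs asynchronously. We will use SNS to decouple our Lambda functions: Textract publishes a completion notification to the SNS topic, and that notification invokes the Lambda that fetches the results.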
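A minimal sketch of the receiving side, assuming the second Lambda is subscribed to the SNS topic: it reads the job ID from the notification and then pages through GetDocumentAnalysis using `NextToken`. The function names are hypothetical, and the client is injected rather than created here.

```python
import json
from typing import Any


def job_from_sns_event(event: dict) -> tuple:
    """Return (job_id, status) from the first SNS record of a Lambda event."""
    # The SNS message body is a JSON string published by Textract.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(textract_client: Any, job_id: str) -> list:
    """Collect the blocks from every page of GetDocumentAnalysis results."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        # Results for larger documents are split across several pages.
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```

A handler would typically call `fetch_all_blocks` only when the reported status is `SUCCEEDED`.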

Below are screenshots of the setup.

We have two Lambda functions, as below:

The trigger_lambda_put function is triggered once a file is uploaded to S3 and calls the Amazon Textract API
The other function gets the results and filters them down to the required parameters
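For the first function, a small sketch of how the uploaded file could be located in the incoming event. It assumes the event has the shape of an S3 "Object Created" event delivered through EventBridge, with the bucket under `detail.bucket.name` and the key under `detail.object.key`. The helper name is hypothetical.

```python
def pdf_location_from_event(event: dict):
    """Return (bucket, key) for an uploaded PDF, or None for anything else."""
    detail = event.get("detail", {})
    bucket = detail.get("bucket", {}).get("name")
    key = detail.get("object", {}).get("key")
    # Skip events without a location and objects that are not PDF files.
    if not bucket or not key or not key.lower().endswith(".pdf"):
        return None
    return bucket, key
```

Returning `None` for non-PDF objects lets the handler exit early instead of starting a Textract job that cannot succeed.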

We also have two S3 buckets, one for the input files and one for the output files.

SNS

The output will be as below:

The file name is the Textract job ID followed by .json. This can be changed in the Lambda function code.
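One way the naming could look in code, as a sketch only: the `prefix` parameter and both helper names are assumptions, and the S3 client is passed in by the caller.

```python
import json
from typing import Any


def output_key(job_id: str, prefix: str = "") -> str:
    """Name the output object after the Textract job ID."""
    return f"{prefix}{job_id}.json"


def save_results(s3_client: Any, bucket: str, job_id: str, blocks: list, prefix: str = "") -> str:
    """Write the collected blocks to the output bucket and return the object key."""
    key = output_key(job_id, prefix)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps({"Blocks": blocks}).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

Changing `output_key` is enough to switch to a different naming scheme, for example one based on the original PDF name.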

The final output contains the card holder name, since I defined it as a filter parameter in the Lambda function code.
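The filtering step can be sketched as follows, assuming the queries were submitted with aliases. In a Queries response, each `QUERY` block points to its `QUERY_RESULT` blocks through a relationship of type `ANSWER`. The alias `CARD_HOLDER_NAME` and the function name are hypothetical.

```python
def answers_by_alias(blocks: list, wanted_aliases: set) -> dict:
    """Map each wanted query alias to the text of its first answer."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        alias = block.get("Query", {}).get("Alias")
        if alias not in wanted_aliases:
            continue
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "ANSWER":
                continue
            for result_id in relationship.get("Ids", []):
                result = by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    # Keep the first answer only; later ones are ignored.
                    answers.setdefault(alias, result.get("Text", ""))
    return answers
```

Calling `answers_by_alias(blocks, {"CARD_HOLDER_NAME"})` keeps only that field, and queries without an answer are simply absent from the result.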

In part two we will discuss the AWS Glue job that processes multiple files and stores them in DynamoDB, and we will cover the cost of each component.
