Amazon Textract is a powerful machine learning service that analyzes documents and extracts printed text and handwriting from scanned documents.
It can be used to build solutions for a range of use cases, such as:
- Financial Services
- Health Care
In this article, we walk through how to build a solution that extracts text from PDF files, analyzes it, and then stores it in DynamoDB for further analysis.
We will use a mix of AWS services to build our solution. Below is a breakdown of these services and how each one is used:
- S3 buckets: storage for the raw PDF files and the extracted JSON files
- Lambda: invokes Amazon Textract through the StartDocumentAnalysis and GetDocumentAnalysis APIs and stores the JSON files in S3
- SNS: used for asynchronous communication; it invokes the Lambda function that gets the results and stores them in the final bucket
- EventBridge: triggers the Lambda function once a file is uploaded to S3
- AWS Glue: used as a batch job that iterates over the S3 bucket, converts the files, and ingests them into DynamoDB
- DynamoDB: storage for the extracted data
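To make the S3-to-Lambda trigger more concrete, here is a minimal sketch of an EventBridge event pattern that matches PDF uploads to one bucket. It assumes EventBridge notifications are enabled on the bucket; the bucket name `my-input-bucket` and the helper name are placeholders, not values from this setup.

```python
import json


def s3_pdf_created_pattern(bucket_name: str) -> dict:
    """Build an EventBridge pattern for S3 "Object Created" events on one bucket.

    The ".pdf" suffix filter is an assumption; drop it to match every upload.
    """
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket_name]},
            "object": {"key": [{"suffix": ".pdf"}]},
        },
    }


# Placeholder bucket name, used only to show the resulting JSON.
pattern = s3_pdf_created_pattern("my-input-bucket")
print(json.dumps(pattern, indent=2))
```

The printed JSON can be used as the event pattern of an EventBridge rule whose target is the Lambda function.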
Sequence Flow
Functional Requirements
- Users should be able to upload PDF files to an S3 bucket
Non-Functional Requirements
- The solution must be highly available
- The solution should be reliable
Block Diagram
Our block diagram shows the components that will be used to build our solution
High Level Design
The high-level design shows the services used to build our solution, focusing on ingesting, analyzing, and storing the results in DynamoDB.
We will break down our solution into different aspects:
- High availability:
- Storage: our storage services, S3 and DynamoDB, offer high availability. The details for each service can be found in its documentation: S3, DynamoDB
- Lambda Function: Resilience in Lambda
- SNS: Resilience in SNS
- Amazon Textract: Resilience in Amazon Textract
Amazon Textract APIs
We will use the StartDocumentAnalysis and GetDocumentAnalysis APIs.
StartDocumentAnalysis supports several feature types (Tables, Forms, Queries, Signatures, and Layout). We will use Queries to extract specific data from the statements, such as the card number, client name, and new charges.
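As an illustration of the Queries feature, the sketch below builds the `QueriesConfig` structure that StartDocumentAnalysis accepts. The question wording, the aliases, and the helper name are assumptions made for this example, not the exact queries used in this solution.

```python
def build_queries_config(questions: dict) -> dict:
    """Turn an {alias: question} mapping into a Textract QueriesConfig dict."""
    return {
        "Queries": [
            {"Text": text, "Alias": alias} for alias, text in questions.items()
        ]
    }


# Hypothetical questions for a card statement.
STATEMENT_QUESTIONS = {
    "CARD_NUMBER": "What is the card number?",
    "CLIENT_NAME": "What is the card holder name?",
    "NEW_CHARGES": "What are the new charges?",
}

queries_config = build_queries_config(STATEMENT_QUESTIONS)
```

The resulting dictionary is passed as the `QueriesConfig` argument together with `FeatureTypes=["QUERIES"]`. Giving each query an alias makes the answers easier to look up later.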
GetDocumentAnalysis retrieves the results of the asynchronous job started by StartDocumentAnalysis. We will use SNS to decouple our Lambda functions.
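The asynchronous hand-off can be sketched as follows: the second Lambda reads the job ID from the SNS message that Textract publishes, then pages through GetDocumentAnalysis until no `NextToken` is returned. The function names are hypothetical, and the `textract` argument is assumed to be a boto3 Textract client (for example `boto3.client("textract")`).

```python
import json


def job_from_sns_event(event: dict) -> tuple:
    """Return (job_id, status) from the SNS record that invoked the Lambda."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(textract, job_id: str) -> list:
    """Collect the Blocks from every page of a GetDocumentAnalysis result."""
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

A handler would typically call `fetch_all_blocks` only when the status is `SUCCEEDED`, and log or alert on any other status.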
Below are screenshots of the setup.
We have two Lambda functions:
The trigger_lambda_put function is triggered once a file is uploaded to S3 and calls the Amazon Textract API.
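A minimal sketch of what such a trigger function could look like is shown below. This is not the exact code behind trigger_lambda_put: the environment variable names `SNS_TOPIC_ARN` and `TEXTRACT_ROLE_ARN`, the helper `start_analysis`, and the single example query are all assumptions. It expects the EventBridge form of the S3 event, where the bucket and key are under `detail`.

```python
import os


def start_analysis(textract, event: dict, queries: list) -> str:
    """Start an async Textract job for the object named in an EventBridge S3 event."""
    detail = event["detail"]
    response = textract.start_document_analysis(
        DocumentLocation={
            "S3Object": {
                "Bucket": detail["bucket"]["name"],
                "Name": detail["object"]["key"],
            }
        },
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": queries},
        # Textract publishes the job status to this topic when the job finishes.
        NotificationChannel={
            "SNSTopicArn": os.environ["SNS_TOPIC_ARN"],      # hypothetical variable name
            "RoleArn": os.environ["TEXTRACT_ROLE_ARN"],      # hypothetical variable name
        },
    )
    return response["JobId"]


def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime

    queries = [{"Text": "What is the card holder name?", "Alias": "CARD_HOLDER_NAME"}]
    return {"JobId": start_analysis(boto3.client("textract"), event, queries)}
```

Passing the client into `start_analysis` keeps the Textract call easy to test with a stub, without touching AWS.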
The other function gets the results and filters them down to the required parameters.
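One way to do this filtering, sketched under the assumption that the job used the Queries feature: each `QUERY` block in the response points to its `QUERY_RESULT` blocks through an `ANSWER` relationship. The function name and the `wanted_aliases` parameter are hypothetical.

```python
def extract_query_answers(blocks: list, wanted_aliases=None) -> dict:
    """Map each query alias (or query text, if no alias) to its answer text.

    If a query has several answers, the last one seen wins in this sketch.
    """
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block["Query"]
        name = query.get("Alias") or query["Text"]
        if wanted_aliases is not None and name not in wanted_aliases:
            continue
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for result_id in relationship["Ids"]:
                result = by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answers[name] = result.get("Text", "")
    return answers
```

Calling it with `wanted_aliases={"CARD_HOLDER_NAME"}` would keep only that one field, assuming the query was submitted with that alias.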
We also have two S3 buckets, one for the input files and one for the output files.
SNS
The output will look as shown below.
The file name is the Textract job ID with a .json extension; this can be changed in the Lambda function code.
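The naming scheme can be sketched like this. The helper name `save_result` is hypothetical, and the `s3` argument is assumed to be a boto3 S3 client (for example `boto3.client("s3")`); changing the `key` expression is where a different file name would be set.

```python
import json


def save_result(s3, bucket: str, job_id: str, payload: dict) -> str:
    """Write the payload as JSON to <job_id>.json in the output bucket and return the key."""
    key = f"{job_id}.json"
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```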
The final output should look as shown below, since I have defined the card holder name as a filter parameter in the Lambda function code.
In part two, we will discuss the AWS Glue job that processes multiple files and stores them in DynamoDB, and we will cover the cost of each component.