Using Amazon Textract to analyze and extract text from Documents Part 1

Amazon Textract is a powerful machine learning service that analyzes scanned documents and extracts printed text or handwriting from them.

It can be used to build different solutions for different use cases such as:

  • Financial Services
  • Health Care

In this article we walk through building a solution that extracts text from PDF files, analyzes it, and stores the results in DynamoDB for further analysis.

We will use a mix of AWS services to build our solution. Below is a breakdown of these services and how each one is used:

  • S3 buckets: storage for the raw PDF files and the extracted JSON files
  • Lambda: invokes Amazon Textract through the StartDocumentAnalysis and GetDocumentAnalysis APIs and stores the JSON files in S3
  • SNS: asynchronous messaging that invokes the Lambda function which fetches the results and stores them in the final bucket
  • EventBridge: triggers the Lambda function once a file is uploaded to S3
  • AWS Glue: runs as a batch job that iterates over the S3 bucket, converts the files, and ingests them into DynamoDB
  • DynamoDB: storage for the extracted data

Sequence Flow

Functional Requirements

  • Users should be able to upload PDF files to an S3 bucket

Non-Functional Requirements

  • The solution must be highly available
  • The solution should be reliable

Block Diagram

Our block diagram shows the components that will be used to build our solution.

High Level Design

The high-level design shows the services used to build our solution, focusing on ingesting and analyzing the documents and storing the results in DynamoDB.

We will break down our solution into different aspects.

Amazon Textract APIs

We will use the StartDocumentAnalysis and GetDocumentAnalysis APIs.

The StartDocumentAnalysis API supports several feature types (Tables, Forms, Queries, Signatures, and Layout). We will use Queries to extract specific data from the statements, such as the card number, client name, new charges, etc.
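As a rough illustration of the Queries feature, here is a minimal sketch of how a job could be started with boto3. The function names, the alias names, and the question wording are assumptions for illustration only and are not taken from the Lambda code in the screenshots. The `textract` argument is expected to be a client created with `boto3.client("textract")`.

```python
def build_queries_config(questions):
    """Turn a {alias: question} mapping into Textract's QueriesConfig shape."""
    return {
        "Queries": [
            {"Text": text, "Alias": alias} for alias, text in questions.items()
        ]
    }


def start_statement_analysis(textract, bucket, key, sns_topic_arn, role_arn, questions):
    """Start an asynchronous analysis job and return its job ID.

    All argument values (bucket, key, ARNs, questions) are placeholders
    supplied by the caller; none of them come from the original article.
    """
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["QUERIES"],
        QueriesConfig=build_queries_config(questions),
        # Textract publishes the job completion status to this SNS topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```

The aliases are handy later on, because each answer can be looked up by a short, stable name instead of the full question text.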

The GetDocumentAnalysis API returns the results of a job started with StartDocumentAnalysis. Because the analysis runs asynchronously, we will use SNS to decouple our Lambda functions.
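To make the asynchronous flow more concrete, the sketch below shows one way the second Lambda function could read the job ID from the SNS notification and then page through the results. The helper names are hypothetical, and `textract` is again assumed to be a `boto3.client("textract")` instance.

```python
import json


def job_from_sns_event(event):
    """Extract the job ID and status from the SNS message sent by Textract."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(textract, job_id):
    """Collect the blocks from every result page of a finished job."""
    blocks = []
    token = None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        # Large documents are returned in several pages linked by NextToken.
        token = page.get("NextToken")
        if not token:
            return blocks
```

A handler would typically call `fetch_all_blocks` only when the reported status is `SUCCEEDED`.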

Below you can find screenshots of the setup.

We have two Lambda functions, as shown below:

The trigger_lambda_put function is triggered once a file is uploaded to S3 and calls the Amazon Textract API.
The other function retrieves the results and keeps only the required parameters.
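For the first function, a minimal sketch of the event handling might look like the following. It assumes the function receives the EventBridge "Object Created" event from S3, where the bucket and key sit under `detail`. The helper names and the injected `start_job` callable are hypothetical and only stand in for the real Textract call.

```python
def s3_object_from_event(event):
    """Return (bucket, key) from an EventBridge S3 'Object Created' event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]


def is_pdf(key):
    """Accept only PDF uploads, ignoring case in the extension."""
    return key.lower().endswith(".pdf")


def handle_upload(event, start_job):
    """Start an analysis job for PDF uploads and return its job ID.

    start_job is any callable taking (bucket, key); in a real Lambda it
    would wrap the StartDocumentAnalysis call.
    """
    bucket, key = s3_object_from_event(event)
    if not is_pdf(key):
        return None  # Skip anything that is not a PDF.
    return start_job(bucket, key)
```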

We also have two S3 buckets, one for the input files and one for the output files.

SNS

The output will be as shown below.

The file name is the Textract job ID with a .json extension; this can be changed in the Lambda function code.
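The naming convention above could be implemented roughly as follows. This is only a sketch: the `prefix` parameter and the shape of the stored JSON are assumptions, and `s3` is expected to be a `boto3.client("s3")` instance.

```python
import json


def output_key(job_id, prefix=""):
    """Build the object key: the Textract job ID plus a .json extension."""
    return f"{prefix}{job_id}.json"


def save_result(s3, bucket, job_id, blocks, prefix=""):
    """Write the collected blocks to the output bucket and return the key."""
    key = output_key(job_id, prefix)
    body = json.dumps({"JobId": job_id, "Blocks": blocks}).encode("utf-8")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return key
```

Changing `output_key` is the only step needed to switch to a different naming scheme, for example one based on the original PDF name.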

The final output should be as shown below, since I have defined the card holder name as a filter parameter in the Lambda function code.
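The filtering step can be sketched as below. In the Textract response, each `QUERY` block points to its `QUERY_RESULT` block through an `ANSWER` relationship, so the answers can be collected by following those IDs. The function name and the `CARD_HOLDER` alias in the usage note are hypothetical; the actual filter in the Lambda code may look different.

```python
def query_answers(blocks, wanted=None):
    """Map each query's alias (or its text) to the answer Textract found.

    wanted is an optional collection of aliases to keep; None keeps all.
    """
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        name = query.get("Alias") or query.get("Text")
        if wanted is not None and name not in wanted:
            continue
        for relation in block.get("Relationships", []):
            if relation.get("Type") != "ANSWER":
                continue
            for result_id in relation.get("Ids", []):
                result = by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answers[name] = result.get("Text")
    return answers
```

Calling `query_answers(blocks, wanted={"CARD_HOLDER"})` would then return only the card holder entry, provided the job was started with a query using that alias.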

In part two we will look at the AWS Glue job that processes multiple files and stores them in DynamoDB, and we will cover the cost of each component.