Extracting Text from Documents Using Amazon Textract (AI series)

#aws #cloud #devops #ai

Our last stop was Amazon Rekognition, where we saw how easy AWS makes image analysis, even when you are just getting your feet wet.

Now we go a notch further into a problem everybody runs into: extracting text and information from documents.

Need for Textract :

Scanned PDFs, invoices, application forms, ID proofs, and bank statements are still everywhere. Extracting information from them used to mean either typing it in by hand or running a series of complex OCR pipelines. Amazon Textract removes that pain with AI.

Most documents are simply messy. People can read a scanned invoice or form easily, but machines struggle to tell what is a field, what is a table, and what is just filler.

Simple OCR tools can pick up text, but they have no sense of context. They cannot reliably tell whether something is a heading or a value, or whether a chunk of text is a paragraph or a table cell.

Textract goes beyond plain OCR: it also understands the layout of a document.

Okay, so what exactly is Amazon Textract ?

Textract is an AI service that automatically extracts text, key-value pairs, and tables from scanned documents and PDFs. It is built on machine learning models trained to read document layouts.

Instead of returning only raw text, Textract returns useful structure such as form fields, table rows, and the links between labels and values.

This is why it is a big win for businesses that depend on documents.

How does it work ?

When you give Textract a document, it first runs OCR to identify all the text. Deep-learning models then analyze layout patterns such as alignment, spacing, and grouping.

Based on that, Textract identifies structured items such as form fields (e.g. Name: John Doe) and the rows and columns of tables. The final output is clean JSON that any developer can parse.
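To make the "rows and columns" part a little more concrete, here is a minimal sketch of how a table could be rebuilt from that JSON. It assumes the block shape Textract uses for tables (a TABLE block pointing at CELL blocks, each with a RowIndex and ColumnIndex and WORD children). The sample data is hand-written for illustration, and real responses carry extra fields such as geometry and confidence. The helper name table_to_rows is made up for this post.

```python
def table_to_rows(table_block, blocks_by_id):
    """Rebuild a table as a list of rows from a TABLE block and its CELL children."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # A cell's text is the WORD blocks listed under its CHILD relationship.
            words = [
                blocks_by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    # RowIndex and ColumnIndex start at 1; missing cells become empty strings.
    n_rows = max((row for row, _ in cells), default=0)
    n_cols = max((col for _, col in cells), default=0)
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


# Hand-written sample blocks (illustrative only, not real Textract output).
sample_table_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
    {"Id": "w4", "BlockType": "WORD", "Text": "2.50"},
]

table_index = {block["Id"]: block for block in sample_table_blocks}
table_rows = table_to_rows(table_index["t1"], table_index)
print(table_rows)  # [['Item', 'Price'], ['Pen', '2.50']]
```

This sketch ignores merged cells, which a production version would need to handle.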

As far as your app is concerned, it is just an API call, but a lot of intelligence sits behind it.

Uses :

Textract is useful in any business that handles a lot of documents. Banks pull information out of statements and loan files. Insurers process claim forms automatically. HR departments streamline the handling of resumes and onboarding documents.

Students and startups can also use Textract to build document automation without building their own OCR stack.

The simplest way to try Textract is through the AWS Console.

Example :

Log in to AWS and select Textract from the list of services. The console includes a demo where you can upload sample documents. After you upload your file, you can choose to view the raw text, forms, or tables.

Within seconds, the console displays the text blocks, key-value pairs, and table structures it has found. This visual check lets even non-technical users see exactly how Textract interprets the document.

The following Python snippet uses the AWS SDK (boto3) to analyze a PDF stored in an S3 bucket and extract form data.

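import boto3

# Create a Textract client (uses your default AWS credentials and region)
textract = boto3.client('textract')

# Analyze a document stored in S3, asking for form and table data
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'my-documents-bucket',
            'Name': 'application_form.pdf'
        }
    },
    FeatureTypes=['FORMS', 'TABLES']
)

# Print every key-value block found in the response
for block in response['Blocks']:
    if block['BlockType'] == 'KEY_VALUE_SET':
        print(block)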

This script sends a document stored in S3 to Textract and prints the key-value blocks it detects.

Here we call the analyze_document API rather than simple text detection (detect_document_text). This lets Textract return structured data instead of just raw text.
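For comparison, here is a rough sketch of what the plain text route gives you. It assumes you already have the Blocks list from a response and simply keeps the LINE blocks, optionally dropping low-confidence ones. The helper name extract_lines and the sample blocks are made up for illustration.

```python
def extract_lines(blocks, min_confidence=0.0):
    """Return the text of LINE blocks, in the order they appear in the response."""
    return [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-written sample blocks (illustrative only, not real Textract output).
sample_text_blocks = [
    {"Id": "p1", "BlockType": "PAGE"},
    {"Id": "l1", "BlockType": "LINE", "Text": "Invoice #1001", "Confidence": 99.1},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice", "Confidence": 99.3},
    {"Id": "l2", "BlockType": "LINE", "Text": "Total: 42.00", "Confidence": 91.0},
]

print(extract_lines(sample_text_blocks))
# ['Invoice #1001', 'Total: 42.00']

# With a real client, the blocks would come from something like:
#   response = textract.detect_document_text(
#       Document={'S3Object': {'Bucket': 'my-documents-bucket', 'Name': 'scan.png'}}
#   )
#   lines = extract_lines(response['Blocks'], min_confidence=90)
# The bucket and file name above are placeholders.
```

You get the words on the page, but nothing tells you that "Total:" is a label and "42.00" is its value. That gap is what the FORMS and TABLES features fill.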

The response is made up of blocks of types such as WORD, LINE, TABLE, CELL, and KEY_VALUE_SET. It can look overwhelming at first, but once you understand how the block IDs and relationships link together, it is not very difficult.
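To show how the IDs and relationships fit together, here is a minimal sketch that turns KEY_VALUE_SET blocks into a plain Python dict. It assumes the usual layout: a KEY block has a VALUE relationship pointing at its value block, and both have CHILD relationships pointing at WORD blocks. The helper names and the sample blocks are hypothetical, written just for this post.

```python
def get_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed under a block's CHILD relationship."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes and radio buttons have a status instead of text.
                parts.append("[x]" if child["SelectionStatus"] == "SELECTED" else "[ ]")
    return " ".join(parts)


def extract_key_values(blocks):
    """Map each form label to its value by following KEY -> VALUE relationships."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY block
        key_text = get_text(block, blocks_by_id)
        value_parts = [
            get_text(blocks_by_id[value_id], blocks_by_id)
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        pairs[key_text] = " ".join(value_parts)
    return pairs


# Hand-written sample blocks (illustrative only, not real Textract output).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

print(extract_key_values(sample_blocks))  # {'Name:': 'John Doe'}
```

With a real response you would call extract_key_values(response['Blocks']). If the same label appears twice in a form, this sketch keeps only the last value, so a real app may want a list per key.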

Synchronous vs Asynchronous :

Textract provides two ways of running things. The synchronous APIs suit small documents and on-the-fly usage (a PDF sent to them is limited to a single page). The asynchronous APIs are meant for large, multi-page PDFs and batches: you start a job on a document in S3 and fetch the results once the job finishes.

Most people start with the synchronous APIs and switch to asynchronous as their application grows.
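As a rough sketch of the asynchronous flow, the function below starts a job, polls until it finishes, and then collects all result pages. It uses the boto3 calls start_document_analysis and get_document_analysis. The function name, the polling interval, and the retry limit are assumptions chosen for illustration, and the client is passed in so the function can be tried with a stub.

```python
import time


def analyze_large_document(textract, bucket, name, poll_seconds=5, max_polls=60):
    """Run an asynchronous Textract analysis and return all blocks (polling sketch)."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["FORMS", "TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(result.get("StatusMessage", "Textract job failed"))
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Large results are paginated; follow NextToken until it is gone.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


# Usage with a real client (bucket and file name are placeholders):
#   import boto3
#   blocks = analyze_large_document(
#       boto3.client("textract"), "my-documents-bucket", "big_report.pdf"
#   )
```

Polling is the simplest thing to get started with. For heavier workloads, Textract can instead publish a completion notification to an SNS topic, which avoids the waiting loop.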

Pricing :

Textract charges per page. The rate depends on whether you are only detecting plain text or also extracting forms and tables. For learning or small projects it stays cheap, particularly when you are only processing a few documents.

There is also a free tier with a limited usage allowance, so you can explore without spending money.

When to use ?

Use Textract when you need structured data from documents and not just raw text. If you want to automate document workflows, process forms, or extract data from scanned documents, Textract is a sound choice.

If all you need is plain text from images, basic OCR may be enough, but Textract really shines when layout matters.

Conclusion :

Textract shows that AWS AI is not just about the cool things you see in sci-fi movies; it addresses real-life problems. By converting messy documents into clean data, it removes a massive burden from many business workflows.

For a beginner, Textract is a great demonstration of how AI services can be plugged into apps with minimal effort and big returns.

In the next post, we will explore Amazon Comprehend, the AI service that reads text and pulls out sentiment and insights.