Jayesh Bapu Ahire

Posted on Apr 5, 2020 • Edited on Apr 7, 2020

Data extraction from documents made easy with Amazon Textract

#machinelearning #tutorial #aws #ai

Artificial Intelligence has found use cases in almost every industry. Many complicated problems we used to face in our day-to-day work are now being solved with AI. Some solutions do not yet match human-level results, but improvements in the underlying algorithms and optimizations keep moving us toward that standard. In this article we will look at one such important problem: text extraction from documents. For many years, companies have tackled this problem with manual techniques, rule-based methods or customized OCR, all of which are time consuming and complicated.

One important point here is documents are important! How? Let's see!

Documents are the primary tool for keeping records. A large amount of data is stored in structured or unstructured documents. They are also important for communicating, collaborating and transacting across industries such as medical, legal, business management, finance, education, tax management and many more.

What are the types of documents we are looking at?

We are looking at scanned documents, digital documents, forms, tables, contracts and many others.

I mentioned some classical techniques above. What is the problem with them? The major problems with manual techniques are that they are expensive, error prone and time consuming, because they involve human intervention.

Let's see the problems with each technique:

1. Manual processing (humans):

When we depend on humans to process the documents, we may run into issues like

Variable output
Inconsistent results
Reviews for consensus

In the example below, different people can process and interpret these blocks differently, depending on a variety of factors.

2. Customized OCR is a better solution than manual extraction, but it has its own problems:

Paragraph detection (You can code this, but manual intervention comes in again. You can annotate a sample set and train an ML model on it to get separated paragraphs, and there are also some unsupervised methods, but either way ML comes into play.)
No rotated text and stylized text detection
No multi-column detection
Table Extraction

You can obviously add these features, but if you want to do it without ML you have to maintain a separate code template for each document (and templates are brittle), which is time consuming. Consider a tax form for any country: there will be different variations for different job categories, and you have to maintain a different template and rule set for each of them, which is a nightmare.

So how can we build a robust text extraction solution without complicating our lives further? Amazon Textract comes in handy and solves many of the problems we have seen! Its tagline says: extract text and data from virtually any document!

Let's jump into details!

What can Amazon Textract do?

Let's first list some things you can achieve with Amazon Textract, on its own or combined with other AWS services, and then look at the core features in detail:

Text detection from documents
Multi-column detection and reading order
Natural language processing and document classification
Natural language processing for medical documents
Document translation
Search and discovery
Form extraction and processing
Compliance control with document redaction
Table extraction and processing
PDF document processing

How does Textract work?

The Amazon Textract API accepts a document stored in S3 and uses built-in ML models to extract text, tables or any fields of interest from it. We can then either store the extracted data in some other format or stack other services on top to process the output further. For example, we can use Elasticsearch to index the data and build a search application around it, or we can use Amazon Comprehend to apply Natural Language Processing to it.
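As a rough sketch of that stacking idea, the snippet below hands text lines that Textract has already extracted to Amazon Comprehend's `detect_entities` call. The helper name `entities_from_lines`, the stub class and the sample values are made up for illustration. The client is passed in as a parameter so the function can be tried with a stub before wiring in a real `boto3.client("comprehend")`.

```python
def entities_from_lines(lines, comprehend_client, language_code="en"):
    """Join extracted lines and return (entity type, entity text) pairs."""
    text = "\n".join(lines)
    result = comprehend_client.detect_entities(Text=text, LanguageCode=language_code)
    return [(entity["Type"], entity["Text"]) for entity in result["Entities"]]


class StubComprehend:
    """Stand-in for boto3.client("comprehend"); returns a canned response."""

    def detect_entities(self, Text, LanguageCode):
        return {"Entities": [{"Type": "ORGANIZATION", "Text": "Amazon", "Score": 0.99}]}


if __name__ == "__main__":
    print(entities_from_lines(["Invoice issued by Amazon"], StubComprehend()))
```

With real credentials you would pass `boto3.client("comprehend")` instead of the stub; the rest of the function stays the same.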

We can also use services like Amazon Comprehend Medical, which uses advanced machine learning models to accurately and quickly identify medical information, such as medical conditions and medications, and determine their relationship to each other, for instance, medicine dosage and strength. Amazon Comprehend Medical can also link the detected information to medical ontologies such as ICD-10-CM or RxNorm. And if you are not interested in all this fancy stuff, you can just store your data in a database with a pre-defined schema and use it in your application! The self-explanatory diagram above, taken from the documentation, should make all of this a little easier to understand!

Before going ahead, let's look at the request and response format of the Textract AnalyzeDocument API.

1. Request Syntax:

{
   "Document": { 
      "Bytes": blob,
      "S3Object": { 
         "Bucket": "string",
         "Name": "string",
         "Version": "string"
      }
   },
   "FeatureTypes": [ "string" ],
   "HumanLoopConfig": { 
      "DataAttributes": { 
         "ContentClassifiers": [ "string" ]
      },
      "FlowDefinitionArn": "string",
      "HumanLoopName": "string"
   }
}

Here, Document is the input document, given either as base64-encoded bytes or as an Amazon S3 object, and it is required. FeatureTypes is the list of features you want to extract (TABLES, FORMS) and it is also required. HumanLoopConfig lets you configure a human review workflow and is optional.
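To make the request shape concrete, here is a minimal sketch that builds the same structure for a document stored in S3 and passes it to boto3's `analyze_document`. The function names, the bucket and key values, and the default region are assumptions for illustration, and the validation only allows the two feature types covered in this article.

```python
ALLOWED_FEATURES = {"TABLES", "FORMS"}


def build_analyze_request(bucket, name, feature_types):
    """Build the keyword arguments for an AnalyzeDocument call on an S3 object."""
    features = [f.upper() for f in feature_types]
    unknown = set(features) - ALLOWED_FEATURES
    if not features or unknown:
        raise ValueError(f"FeatureTypes must be a non-empty subset of {sorted(ALLOWED_FEATURES)}")
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": name}},
        "FeatureTypes": features,
    }


def analyze(bucket, name, feature_types, region="us-east-1"):
    """Send the request; needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so build_analyze_request works without boto3

    client = boto3.client("textract", region_name=region)
    return client.analyze_document(**build_analyze_request(bucket, name, feature_types))


if __name__ == "__main__":
    # Hypothetical bucket and key, only to show the resulting structure.
    print(build_analyze_request("my-bucket", "forms/application.png", ["forms", "tables"]))
```

Keeping the request builder separate from the network call makes it easy to inspect or log the payload before anything is sent.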

2. Response Syntax:

{
   "AnalyzeDocumentModelVersion": "string",
   "Blocks": [ 
      { 
         "BlockType": "string",
         "ColumnIndex": number,
         "ColumnSpan": number,
         "Confidence": number,
         "EntityTypes": [ "string" ],
         "Geometry": { 
            "BoundingBox": { 
               "Height": number,
               "Left": number,
               "Top": number,
               "Width": number
            },
            "Polygon": [ 
               { 
                  "X": number,
                  "Y": number
               }
            ]
         },
         "Id": "string",
         "Page": number,
         "Relationships": [ 
            { 
               "Ids": [ "string" ],
               "Type": "string"
            }
         ],
         "RowIndex": number,
         "RowSpan": number,
         "SelectionStatus": "string",
         "Text": "string"
      }
   ],
   "DocumentMetadata": { 
      "Pages": number
   },
   "HumanLoopActivationOutput": { 
      "HumanLoopActivationConditionsEvaluationResults": "string",
      "HumanLoopActivationReasons": [ "string" ],
      "HumanLoopArn": "string"
   }
}

Here, AnalyzeDocumentModelVersion tells you the version of the model used, and Blocks contains all the detected items. DocumentMetadata gives additional information about the document (the number of pages), and HumanLoopActivationOutput reports the outcome of the human review evaluation.

Now that we know what Textract can do and how it works, let's look at its core features and capabilities in detail:
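A quick way to get a feel for a response is to count the blocks by type and index them by Id, since Relationships refer to other blocks by Id. The sketch below does that on a small hand-written response; `SAMPLE_RESPONSE` and the function names are illustrative, not part of any AWS library.

```python
from collections import Counter

# Hand-written response with the same shape as the syntax above (most fields omitted).
SAMPLE_RESPONSE = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Hello world", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Hello", "Confidence": 99.5},
        {"BlockType": "WORD", "Id": "w2", "Page": 1, "Text": "world", "Confidence": 98.7},
    ],
}


def summarize(response):
    """Return the page count and how many blocks of each type were detected."""
    counts = Counter(block["BlockType"] for block in response["Blocks"])
    return response["DocumentMetadata"]["Pages"], dict(counts)


def index_by_id(response):
    """Map block Id -> block so Relationships can be followed by lookup."""
    return {block["Id"]: block for block in response["Blocks"]}


if __name__ == "__main__":
    print(summarize(SAMPLE_RESPONSE))
    print(index_by_id(SAMPLE_RESPONSE)["w2"]["Text"])
```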

Core Features:

You can try all of this directly from the Amazon Textract Console!

1. Table Extraction:

Amazon Textract can extract tables from a given document and provide them in the format we want, including CSV or spreadsheet, and we can even load the extracted data automatically into a database using a pre-defined schema.
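In the raw response, a table is a TABLE block whose CHILD relationships point to CELL blocks, and each CELL carries a RowIndex and ColumnIndex and points to its WORD blocks. The sketch below rebuilds the rows and writes them as CSV. The sample blocks and function names are invented for illustration, and merged cells (RowSpan or ColumnSpan greater than 1) are not handled.

```python
import csv
import io

# Hand-written blocks shaped like an AnalyzeDocument response with FeatureTypes=["TABLES"].
SAMPLE_BLOCKS = [
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"BlockType": "WORD", "Id": "w1", "Text": "Item"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Price"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Green"},
    {"BlockType": "WORD", "Id": "w4", "Text": "tea"},
]


def child_ids(block):
    """Ids of the blocks this block points to through CHILD relationships."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD" for i in rel["Ids"]]


def table_to_rows(table, by_id):
    """Rebuild a 2D list of cell texts; RowIndex and ColumnIndex are 1-based."""
    cells = [by_id[i] for i in child_ids(table) if by_id[i]["BlockType"] == "CELL"]
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    rows = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [by_id[i]["Text"] for i in child_ids(cell)
                 if by_id[i]["BlockType"] == "WORD"]
        rows[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return rows


def tables_to_csv(blocks):
    """Return one CSV string per TABLE block."""
    by_id = {b["Id"]: b for b in blocks}
    outputs = []
    for block in blocks:
        if block["BlockType"] == "TABLE":
            buffer = io.StringIO()
            csv.writer(buffer, lineterminator="\n").writerows(table_to_rows(block, by_id))
            outputs.append(buffer.getvalue())
    return outputs


if __name__ == "__main__":
    print(tables_to_csv(SAMPLE_BLOCKS)[0])
```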

Let's consider one document and see how Textract works for that!

Here are the results, which are really promising!

2. Form Extraction:

Amazon Textract can extract data from forms as key-value pairs, which we can use in various applications. For example, if you want to set up an automated process that accepts a scanned bank account opening application, fills the required data into the system and creates the account, you can do that using Amazon Textract form extraction.

Let's try this on the document below:
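In the raw response, form fields arrive as KEY_VALUE_SET blocks: a block with EntityTypes KEY points to its VALUE block through a VALUE relationship, and both point to their words through CHILD relationships. The following sketch pairs them up on a tiny hand-written sample; the sample data and helper names are assumptions for illustration.

```python
# Hand-written blocks shaped like an AnalyzeDocument response with FeatureTypes=["FORMS"].
SAMPLE_BLOCKS = [
    {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Full"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Name:"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Jane"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Doe"},
]


def related_ids(block, relation_type):
    """Ids referenced by the block's relationships of the given type."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == relation_type for i in rel["Ids"]]


def text_of(block, by_id):
    """Concatenate the text of a block's WORD children (checkboxes report their status)."""
    parts = []
    for child_id in related_ids(block, "CHILD"):
        child = by_id[child_id]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif child["BlockType"] == "SELECTION_ELEMENT":
            parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(blocks):
    """Return {key text: value text} for every KEY block in the response."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            value_text = " ".join(text_of(by_id[i], by_id) for i in related_ids(block, "VALUE"))
            pairs[text_of(block, by_id)] = value_text
    return pairs


if __name__ == "__main__":
    print(key_value_pairs(SAMPLE_BLOCKS))
```

A plain dictionary keeps only the last value when two fields share the same key text, so a list of pairs may be a better fit for forms with repeated labels.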

Here are the results:

Let's look at a harder problem, with a document like this:

Here's what we got:

3. Text Extraction:

Amazon Textract combines OCR with ML (some people like to call it OCR++) to detect printed text and numbers in a scan or rendering of a document. This can be used for medical reports and financial reports, or, when paired with Amazon Comprehend, for applications like clause extraction in legal documents.
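For plain text, the `detect_document_text` operation returns PAGE, LINE and WORD blocks, and collecting the LINE blocks is usually enough. Here is a minimal sketch; the function names, the default region and the sample response are assumptions, and the actual AWS call needs boto3 and credentials, so only the offline helper is exercised.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by page, in the order Textract returned them."""
    pages = defaultdict(list)
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            # Default to page 1 when the Page field is absent.
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


def detect_text(bucket, name, region="us-east-1"):
    """Call DetectDocumentText on an S3 object; needs boto3 and AWS credentials."""
    import boto3  # lazy import keeps lines_by_page usable offline

    client = boto3.client("textract", region_name=region)
    return client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": name}}
    )


if __name__ == "__main__":
    # Hand-written sample instead of a live call.
    sample = {"Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Patient report"},
        {"BlockType": "LINE", "Id": "l2", "Page": 1, "Text": "Date: 2020-04-05"},
    ]}
    print(lines_by_page(sample))
```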

Let's try to extract text from this document:

Here are the results:

Along with these 3 core features, Textract also provides a bunch of other features like Bounding Boxes, Adjustable Confidence Thresholds and a Built-in Human Review Workflow.

So, how can we use the Textract API with Python?

Let's build a very simplified upload-and-analyze pipeline based on Amazon Textractor.

Pipeline: First, we will upload the document to S3 and then use Amazon Textractor to extract the fields we want from it.
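Two of these are easy to apply yourself on the raw response: every block carries a Confidence score between 0 and 100, and its BoundingBox is expressed as ratios of the page width and height. The sketch below filters blocks by a threshold and converts a box to pixel coordinates. The threshold, the image size and the function names are arbitrary choices for illustration.

```python
def filter_by_confidence(blocks, min_confidence=90.0):
    """Keep blocks whose Confidence (0-100) meets the threshold."""
    return [b for b in blocks if b.get("Confidence", 0.0) >= min_confidence]


def to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based BoundingBox to (left, top, right, bottom) in pixels."""
    left = bounding_box["Left"] * image_width
    top = bounding_box["Top"] * image_height
    right = left + bounding_box["Width"] * image_width
    bottom = top + bounding_box["Height"] * image_height
    return tuple(round(v) for v in (left, top, right, bottom))


if __name__ == "__main__":
    blocks = [
        {"BlockType": "WORD", "Text": "Total", "Confidence": 99.2,
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}}},
        {"BlockType": "WORD", "Text": "T0tal", "Confidence": 41.0,
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.75, "Width": 0.5, "Height": 0.125}}},
    ]
    for block in filter_by_confidence(blocks):
        print(block["Text"], to_pixels(block["Geometry"]["BoundingBox"], 1000, 800))
```

Low-confidence blocks dropped by such a filter are natural candidates for a human review step.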

import argparse
import subprocess
import sys

from s3_upload import upload  # local helper module, shown below


def run_pipeline(source_file, bucket_name, object_key, flags):
    # Upload the document, then point textractor at its S3 URL.
    upload(source_file, bucket_name, object_key)
    url = f"s3://{bucket_name}/{object_key}"

    # Pass the arguments as a list so no shell quoting is involved.
    command = [sys.executable, "textractor.py", "--documents", url, *flags]
    subprocess.run(command, check=True)


def main():
    parser = argparse.ArgumentParser(
        epilog='Pass at least one of --text, --forms and --tables after the '
               'positional arguments. You can use a combination of all three.')
    parser.add_argument('source_file', help='The path and name of the source file to upload.')
    parser.add_argument('bucket_name', help='The name of the destination bucket.')
    parser.add_argument('object_key', help='The key of the destination object.')
    # Options argparse does not know (--text, --forms, --tables) are forwarded to textractor.
    args, flags = parser.parse_known_args()
    if not flags:
        parser.error('at least one of --text, --forms and --tables is required')

    run_pipeline(args.source_file, args.bucket_name, args.object_key, flags)


if __name__ == "__main__":
    main()

Here, we provide the local file path, the S3 bucket we want to upload the file to, and the object key (name) of the file, along with what we want to extract.

Upload file to S3: uploading a file to S3 is really easy:

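import boto3


def upload(source_file, bucket_name, object_key):
    # Upload a local file to s3://bucket_name/object_key.
    s3 = boto3.resource('s3')
    try:
        s3.Bucket(bucket_name).upload_file(source_file, object_key)
    except Exception as e:
        # Report the failure, then re-raise so the pipeline does not analyze a missing object.
        print(e)
        raise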

Textractor: Textractor is a ready-to-use solution made by Amazon that helps speed up PoCs. It can convert the output into different formats, including raw JSON, JSON for each page in the document, text, text in reading order, key/values exported as CSV and tables exported as CSV. It can also generate insights or translate detected text by using Amazon Comprehend, Amazon Comprehend Medical and Amazon Translate.

This is how Textractor uses the response parser library, which helps process the JSON returned from Amazon Textract. See the repo and documentation for more details.

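# DocumentProcessor and Document ship with textractor (tdp.py and trp.py in the repo)
from tdp import DocumentProcessor
from trp import Document

# Call Amazon Textract and get JSON response
docproc = DocumentProcessor(bucketName, filePath, awsRegion, detectText, detectForms, tables)
response = docproc.run()

# Get DOM
doc = Document(response)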

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}--{}".format(line.text, line.confidence))
        for word in line.words:
            print("Word: {}--{}".format(word.text, word.confidence))

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}-{}".format(r, c, cell.text, cell.confidence))

    # Print fields (a key detected without a value has field.value set to None)
    for field in page.form.fields:
        value_text = field.value.text if field.value else ""
        print("Field: Key: {}, Value: {}".format(field.key.text, value_text))

    # Get field by key
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))

    # Search fields by key
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))

This is what the output looks like!

What's next

We went through the various features and capabilities Textract provides! It is a ready-to-use solution that can simplify some very complicated problems we face while building business applications around documents. It is not 100% accurate or directly usable for every case, but some small tweaks here and there should make it usable for most use cases. In the next article, we will see how to use it in some business applications, and we will also try to build an end-to-end pipeline using various AWS services.

Until then, let me know in the comments if you have use cases where you are already using Amazon Textract or are planning to use it. If you have any questions or want to discuss any use cases, ping me on Twitter.

Stay safe!

References:

Amazon Textract: https://aws.amazon.com/textract/
Amazon Textract Console: https://console.aws.amazon.com/textract/home?region=us-east-1#/
Amazon Blogs: https://aws.amazon.com/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/
Amazon Textract Documentation: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
Amazon Textract Textractor
