Jones Zachariah Noel for AWS Community Builders

Posted on Oct 17, 2021

Amazon Textract with expense analyzing

#textract #aws #machinelearning #tutorial

Amazon Textract now supports receipts and invoices processing which makes expense management systems analyze better with only receipt's or invoice's image or document.
Read more about the announcement.

Key takeaways from the blog

What is Textract?
How Textract processes receipts and invoices
Implementing Textract with NodeJS SDK

What is Textract?

Amazon Textract is a fully-managed Machine Learning service which extract textual information from documents and images. The Textract DetectDocumentText API is capable of detecting and extracting textual data which are handwritten or typed present either as texts, forms or tables in the document or image.

Common use-cases of Textract are -

Data capture from forms.
Automating certain processes similar to KYC process. And more use-cases available on Amazon Textract documentation.

Textract exposes the following SDK APIs for developers to integrate -

AnalyzeDocument (documentation)
AnalyzeExpense (documentation)
DetectDocumentText (documentation)
GetDocumentAnalysis (documentation)
GetDocumentTextDetection (documentation)
StartDocumentAnalysis (documentation)
StartDocumentTextDetection (documentation)

With Textract, the processing of images or documents can be handled synchronously or asynchronous.
StartDocumentAnalysis / GetDocumentAnalysis and StartDocumentTextDetection / GetDocumentTextDetection are the asynchronous implementation of Amazon Textract and whenever the action start (StartDocumentAnalysis and StartDocumentTextDetection) is executed, it returns a JobID which is referred to when getting the data.

Textract APIs are flexible to take either document/image buffer data or the object stored on S3 to process and extract textual information.

Python samples are available in - GitHub/awsdocs

aws-samples / amazon-textract-code-samples

Amazon Textract Code Samples

This repository contains example code snippets showing how Amazon Textract and other AWS services can be used to get insights from documents.

Usage

python3 01-detect-text-local.py

For examples that use S3 bucket, upload sample images to an S3 bucket and update variable "s3BucketName" in the example before running it.

Python Samples

Argument	Description
01-detect-text-local.py	Example showing processing a document on local machine.
02-detect-text-s3.py	Example showing processing a document in Amazon S3 bucket.
03-reading-order.py	Example showing printing document in reading order.
04-nlp-comprehend.py	Example showing detecting entities and sentiment.
05-nlp-medical.py	Example showing detecting medical entities.
06-translate.py	Example showing translation of documents.
07-search.py	Example showing document indexing in Elasticsearch.
08-forms.py	Example showing form (key/value) processing.
09-forms-redaction.py	Example showing redacting information in document.
10-tables.py	Example showing table processing.
11-tables-expense.py	Example showing validation of table data.
12-pdf-text.py	Example showing PDF document processing.

.NET Usage

Usage: dotnet run [--switch]
To run this console

…

View on GitHub

NodeJS samples are in an open pull request -

Added NodeJS samples #18

zachjonesnoel posted on Sep 18, 2021

Issue #, if available:

Description of changes: Added NodeJS text detect samples.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

View on GitHub

How Textract processes receipts and invoices

Textract AnalyzeExpense API processes the data and extracts the key information from the document such as Vendor names, Receipt number or Invoice number. AnalyzeExpense API is available only in the latest version of SDK, as this is one of the new functions available now. So ensure, you have updated your SDK.

Starting today, Amazon Textract adds the following capabilities for receipts and invoices: 1) Identifies Vendor Name - Amazon Textract can find the vendor name on a receipt even if it's only indicated within a logo on the page without an explicit label called “vendor”. It can also find and extract item, quantity, and prices that are not labeled with column headers for line items, 2) Enables consolidation of output from many documents - Textract normalizes keynames and column headers when extracting data from invoices and receipts, into a standard taxonomy. For example, it detects that “invoice no.” “invoice number” and “receipt #” are identical and outputs “INVOICE_RECEIPT_ID,” so that downstream applications can easily compare output from many documents, and 3) Extracts line item details, even when the column headers are missing - Textract extracts line items including items, quantities, and prices of individual goods purchased from an invoice or a receipt. If the table of line items does not include column headers, Textract now infers what the column headers are meant to be based on the table content.

As described in the announcement.

This makes it easier for systems which are integrating Textract to manage expense analysis with the consolidated information.

Implementing Textract with NodeJS SDK

In this walkthrough, we will be using the AnalyzeExpense and AnalyzeDocument API from Textract.

To get started, you can navigate to Amazon Textract AWS Console from where you will be able to run Textract on sample documents and view the response pretty-formatted on the console.

Image used for the demo -

When using AnalyzeDocument from the console/ SDK API, you would have to use what type of feature you want to extract. From SDK API you would have to pass the input for FeatureTypes as TABLES or FORMS, if you are trying this from the console, you can additionally extract all the text as RAW TEXT also.
Raw text -

Forms -

Tables -

   var params = {
        Document: { 
            S3Object: {
                Bucket: 'xxxxxx',
                Name: 'download.jfif',
            }

        },
        FeatureTypes: ["TABLES", "FORMS"]
    };
    let response = await textract.analyzeDocument(params).promise()
    return response

Response available on GitHub Gist

For AnalyzeExpense API, from the console and SDK API, you will get the response with both SummaryFields and LineItemFields.
Summary fields -

Line item fields

    var params = {
        Document: { 
            S3Object: {
                Bucket: 'xxxxxx',
                Name: 'download.jfif',
            }

        },
    };
    let response = await textract.analyzeExpense(params).promise()
    return response

Response available on GitHub Gist

Pricing

Amazon Textract's pricing varies upon the API that is executed as internally it uses OCR technology to process and extract textual information. Also based on the feature type, it focuses to extract either FORM or TABLE data.
Detailed information on pricing is available on Textract pricing page.

Conclusion

Amazon Textract enables applications to integrate with SDK APIs so that the documents or images with textual data from various representations of text in form of raw text, forms, tables are easily extratable. Now with the expense analysis support, Textract goes a level ahead to consolidate the items and also extract key information from the invoice or receipts. Textract also provides the confidence level / percentage of the extracted text making it a choice for the integrating applications to either consider it or neglect it.