DroWzYOzIl

How to use AWS Textract in Python

1. Introduction

Amazon Textract is a machine learning service that extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to recognize, comprehend, and extract data from forms and tables. Today, many businesses either extract data manually from scanned documents such as PDFs, pictures, tables, and forms, or use basic OCR software that needs manual configuration (which often must be updated when the form changes). Textract replaces these time-consuming and expensive operations by using ML to read and process many kinds of documents, extracting text, handwriting, tables, and other data without manual effort.

2. What is AWS Textract?

AWS Textract, to put it simply, is a deep learning-based service that automatically extracts text and data from document images and turns many types of documents into an editable format. Consider the situation where we have hard copies of invoices from several businesses and keep all their important data in Excel or other spreadsheets. Most of the time, we rely on data entry workers to enter the data manually, which is chaotic, time-consuming, and prone to error. With Textract, all we have to do is upload our invoices, and it returns all the text, forms, key-value pairs, and tables in a better organised form.

2.1. Strong and Normalized Data Capture

With Amazon Textract, text and tabular data can be extracted from a range of documents, including financial records, scientific articles, and medical notes. Extracting unstructured and structured data from your documents is much simpler thanks to these pre-trained (non-custom) APIs, which are described as continuously learning from large quantities of data.

2.2. Building a smart search index

You can build text libraries from images and PDF files using Amazon Textract. Its text extraction returns text as words and lines, which makes a convenient input for natural language processing (NLP). If table analysis is turned on, the text is also organised by table cell. This gives you control over how text is organised when using Amazon Textract as an NLP input.

2.3. How Does Amazon Textract Work?

This part goes over how AWS Textract works. The underlying models are not open source, so their internals cannot be examined, but we know that powerful AI and ML algorithms are behind them. I'll attempt to outline how the service works by summarising the available documentation.

When a scanned document is sent to Textract, the first thing it does is generate a list of Block objects for all the text it identifies. For instance, if a bill contains 100 words, Textract creates a block for each of those 100 words, alongside blocks for the lines and the page that contain them. Each block describes a detected item, its location on the page, and the confidence Amazon Textract has in the accuracy of the detection.

Most documents typically consist of the following building blocks:

  • Text lines and words per page
  • Form data (key-value pairs)
  • Tables, cells, and selection elements
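
The block list described above can be explored with a few lines of Python. The following is a minimal sketch, not AWS's own tooling: the function name summarize_blocks and the sample dictionary are made up for illustration, and the sample only mimics the general shape of a Textract response (a "Blocks" list whose items have "BlockType", "Text", and "Confidence" fields).

```python
from collections import defaultdict


def summarize_blocks(response, min_confidence=0.0):
    """Group the text of Textract blocks by BlockType, skipping low-confidence ones."""
    grouped = defaultdict(list)
    for block in response.get("Blocks", []):
        # Blocks without a Confidence field are kept (treated as 100 here).
        if block.get("Confidence", 100.0) < min_confidence:
            continue
        grouped[block["BlockType"]].append(block.get("Text", ""))
    return dict(grouped)


# Hand-written sample in the shape of a Textract response (not real output).
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 42", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.4},
        {"BlockType": "WORD", "Id": "w2", "Text": "42", "Confidence": 61.0},
    ]
}

summary = summarize_blocks(sample, min_confidence=90.0)
print(summary["LINE"])  # ['Invoice 42']
print(summary["WORD"])  # ['Invoice']
```

Filtering on confidence like this is one simple way to decide which detections to trust and which to send for manual review.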

By connecting Textract with other Amazon Web Services, you can automate the workflow of extracting, processing, and storing the relevant data.

2.4. OCR Textract

Textract OCR is built on deep learning-based neural networks; its job is to read a document and extract all of the data it contains. Without such a service, companies have typically employed data entry operators for simple document filling and database entry. Textract cannot be fully customized or trained on a specific dataset. It does, however, support a human-in-the-loop workflow in which a person verifies the extracted information. Textract outperforms Tesseract on tasks such as table extraction and key-value pair extraction, but it is limited to a few languages and document types.

Here are some of the document types that AWS Textract can process:

  • Regular Bills / Invoices
  • Financial Records
  • Medical Records
  • Documents written by hand
  • Paystubs or Personnel Records

With AWS Textract, there are three primary kinds of result we can obtain. The first is the extracted content as raw text. The second is the key-value pairs found in the document. The third is the table data. Amazon Textract also lets us build text libraries from image and PDF files.

This article covers two ways to extract the text:
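
As a rough sketch of how these three kinds of result map onto the Textract API: raw text comes from the detect_document_text operation, while key-value pairs and tables come from analyze_document with the FORMS and TABLES feature types. The helper name build_textract_request below is hypothetical and exists only for illustration; it builds the request without contacting AWS.

```python
def build_textract_request(image_bytes, forms=False, tables=False):
    """Return (operation_name, kwargs) for a boto3 Textract client call."""
    document = {"Bytes": image_bytes}
    feature_types = []
    if forms:
        feature_types.append("FORMS")   # key-value pairs
    if tables:
        feature_types.append("TABLES")  # table and cell data
    if not feature_types:
        # Raw text only: lines and words.
        return "detect_document_text", {"Document": document}
    return "analyze_document", {"Document": document, "FeatureTypes": feature_types}


operation, kwargs = build_textract_request(b"fake image bytes", forms=True, tables=True)
print(operation, kwargs["FeatureTypes"])
# With a real client: getattr(boto3.client("textract"), operation)(**kwargs)
```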

  • AWS CLI
  • Python API

2.5. AWS CLI to extract data

The AWS Command Line Interface (AWS CLI) is a unified tool for managing your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts.

We can use the Textract service through the AWS Console, the AWS CLI, the Textract API, or programmatically through a compatible client SDK. In this section, you'll use the AWS CLI to extract text from images, following the steps below.

2.5.1. Step 1:

Use the commands below to create (mkdir) a new directory and change (cd) into it in your computer's terminal. The directory can have any name you like; for this demo it is called textract-extraction. It will hold the images from which Textract extracts text.

mkdir textract-extraction && cd textract-extraction

2.5.2. Step 2:

Use the wget command below to download the public Test.jpg image. You will use the Textract service to extract the handwritten sentence from this image.

wget https://raw.Images.com/adam-the-automator/awsocr/Test.jpg

2.5.3. Step 3:

Run the textract command below to extract the text from the Test.jpg image (detect-document-text) and print the result in JSON format (--output json). The --document flag takes a Bytes value holding the base64 representation of the Test.jpg file.

aws textract detect-document-text --document Bytes=$( base64 ./Test.jpg ) --output json

Once the command completes, the JSON output is printed to the terminal, and we can extract data from it as required.

[Figure: JSON output of the detect-document-text command]

To find the next layer scanned in the document, take note of the first value in the Ids array (highlighted in the original screenshot).
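
To make the role of the Ids array more concrete, here is a minimal Python sketch that follows CHILD relationships from each LINE block to its WORD blocks. It assumes the CLI output has the usual Textract shape (a "Blocks" list with "Id", "BlockType", "Text", and "Relationships" fields); the function name words_per_line and the embedded JSON are hand-written stand-ins, not real output from Test.jpg.

```python
import json


def words_per_line(response):
    """Map each LINE's text to the WORD texts it points to via CHILD Ids."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        result[block["Text"]] = [blocks_by_id[i]["Text"] for i in child_ids]
    return result


# Stand-in for the saved CLI output; in practice, load it with json.load().
cli_output = json.loads("""
{"Blocks": [
  {"BlockType": "PAGE", "Id": "p1",
   "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
  {"BlockType": "LINE", "Id": "l1", "Text": "Hello world",
   "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
  {"BlockType": "WORD", "Id": "w1", "Text": "Hello"},
  {"BlockType": "WORD", "Id": "w2", "Text": "world"}
]}
""")

lines = words_per_line(cli_output)
print(lines)  # {'Hello world': ['Hello', 'world']}
```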

2.6. Using Python API to extract data

We can use the Amazon Textract API from a variety of programming languages. In this section we'll examine a code block for key-value extraction using Python and Textract. Check out the AWS documentation for more details on language and API support.

The snippet below shows how to extract key-value pairs from documents using the Textract API from Python. To make this work, we also need AWS credentials (API keys) configured for our account.

We import the packages required for sending documents to AWS and handling the response.

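import json
import sys

import boto3


def analyze_form(file_name):
    # Read the image as raw bytes; boto3 handles the encoding for the request.
    with open(file_name, 'rb') as file:
        bytes_test = file.read()
    print('Image loaded', file_name)
    # Credentials and region come from the standard AWS configuration.
    client = boto3.client('textract')
    return client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])


if __name__ == '__main__':
    response = analyze_form(sys.argv[1])
    print(json.dumps(response, indent=2))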

Once the process completes, the JSON result is returned, and we can extract data from it as required, for example by using the json package to pick specific information out of the returned result.
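
As an illustration of picking key-value pairs out of the returned result, the sketch below walks the KEY_VALUE_SET blocks of a FORMS response: a KEY block points to its VALUE block through a VALUE relationship, and both point to their words through CHILD relationships. The function names get_text and extract_key_values are hypothetical, and the sample response is hand-written to mimic that structure rather than taken from a real call.

```python
def get_text(block, blocks_by_id):
    """Join the WORD children of a block; selection elements become [X] or [ ]."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                selected = child.get("SelectionStatus") == "SELECTED"
                parts.append("[X]" if selected else "[ ]")
    return " ".join(parts)


def extract_key_values(response):
    """Return a dict of key text -> value text from a FORMS analysis response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs[get_text(block, blocks_by_id)] = value_text
    return pairs


# Hand-written sample mimicking a FORMS response (not real Textract output).
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}

pairs = extract_key_values(sample_response)
print(pairs)  # {'Name:': 'Jane Doe'}
```

The same function could be applied to the response returned by analyze_document, since it only relies on the "Blocks" list.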

3. IronOCR — Introduction and Features

IronOCR for .NET is a library from Iron Software that lets engineers read text content from images and PDFs in .NET apps and websites. It supports many different world languages, scans images for text and barcodes, and can output either plain text or structured data. Iron Software's OCR library can be used in MVC, web, console, and desktop .NET applications. The development team directly assists with licensing for commercial deployments.

Using the latest Tesseract 5 engine, IronOCR reads text, barcodes, and QR codes from any picture or PDF file. The library allows desktop, console, and web applications to incorporate OCR quickly.

IronOCR supports 127 international languages. Word lists and custom languages are also supported.

IronOCR can scan more than 20 different barcode and QR code types.

3.1. OCR Using IronOCR

The code below illustrates the Tesseract 5 API, which lets us turn image files into text. First we create an IronTesseract object. We then create an OcrInput object, which can hold one or more image files. When calling the OcrInput object's AddImage method, we pass the image's path; you can add as many images as you like. Calling the Read method on the IronTesseract object we created earlier parses the image files and returns an OCR result, from which the extracted text is available as a string.

var Ocr = new IronTesseract();            
Ocr.Language = OcrLanguage.EnglishBest;                                     
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;            
using (var Input = new OcrInput())      
{          
    Input.AddImage(@"3.png");         
    var R = Ocr.Read(Input);       
    Console.WriteLine(R.Text);        
    Console.ReadKey();          
}

4. Conclusion

Compared to other providers, setting up Textract with another AWS service is simple. For instance, with an add-on it is possible to save extracted document information to Amazon DynamoDB or S3. Amazon Textract follows the AWS shared responsibility model, which sets out rules and norms for data protection. On the downside, AWS OCR is costly, and it always needs an Internet connection to work.

Tesseract is simple and quick to use through IronOCR in the .NET Framework environment. IronOCR offers numerous ways to load images and PDF files, and a number of parameters to enhance the functionality of the Tesseract OCR library. Besides supporting many languages, it can use multiple languages simultaneously. Visit their website to learn more about the Tesseract OCR.

In this comparison, IronOCR comes out ahead of AWS OCR. AWS OCR is expensive, there is no development version, and registering an AWS account requires a valid payment card. IronOCR, on the other hand, is less expensive and offers a development edition. It also lets us detect barcodes and read barcode values from pictures. The IronOCR bundle includes a lifetime license with no continuing fees, and a single purchase covers numerous systems. To learn more about IronOCR pricing, visit the IronOCR website.
