Elliot Wong for Oursky

Posted on Feb 10, 2021 • Originally published at code.oursky.com

Receipt Data Extraction with OCR, Regex and AI

#ai #machinelearning #python

Optical Character Recognition (OCR) is often the default option when it comes to document data extraction. Still, an OCR receipt scanner on its own cannot yield accurate enough results, so we added Regular Expressions (Regex) and some Artificial Intelligence (AI) models to the formula.

This article records our journey developing this final solution, which is now branded under the name FormX.

Let’s start with a successful case first – FormX played an important role in streamlining the vetting process of a non-government organization’s disbursement program by enabling them to digitize data from images, forms, and physical documents from 43,000 applications.

We will dive deep into parts where data is captured and extracted. While FormX can pull data off all kinds of physical forms and documents, for the sake of readability, general receipts will be used as a primary example throughout this article.

Photo by Carli Jeen on Unsplash

Proposed Stages to Solve the Problem

The foremost problem we want to solve here is how to extract {amount}, {date}, and {time} from various receipts.

Receipts come in all sorts of layouts, which makes it challenging to extract just the amount, date and time. We came up with a solution that has four main stages:

1. Get text data out of receipt images with OCR technology.
2. Filter outliers and group text data into horizontal lines.
3. Find candidates from horizontal lines.
4. Classify candidates with AI models and return positive ones.

Note that while Google Vision and other OCR providers consistently do a great job of turning a document image into an array of strings, accurate receipt data extraction requires a few more steps. Regex patterns filter out “candidates”, and the AI models then pick only the most likely candidate for each target { date, time, total amount }.

We spent a considerable amount of time on the fourth stage, experimenting with AI models and tweaking parameters. However, we’d like to emphasize that pre-processing (stages 1 to 3) is equally important. It improves the quality of the text data, which, in turn, improves the final classification result.

Receipt OCR via Google Vision

This is the first stage where a receipt image is converted to a collection of text with the aid of Google Vision API.

Whether an image is used for training the AI models or is an actual receipt awaiting extraction, it is always passed to Google’s Text Detection API to have its text recognized. It’s worth mentioning that, to enhance OCR accuracy, every image goes through image warping first.

The returned result is organized into five hierarchy levels, in descending order of scale: Page, Block, Paragraph, Word, and Symbol.

Each entity, no matter which level it belongs to, contains text data and a bounding box (a collection of four vertices with x and y coordinates).

We only used the two most basic ones, Word and Symbol. A Word is an array of Symbols, while a Symbol represents a single character or punctuation mark. You can find more detailed definitions of these levels in Google’s official documentation.
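As a rough illustration of walking that hierarchy, here is a minimal sketch that flattens a response into Words. It assumes the response is available as a Python dict shaped like the JSON form of the full text annotation (pages, blocks, paragraphs, words, symbols, each with a boundingBox). The function name extract_words and the sample data are made up for this example and are not FormX code.

```python
def extract_words(full_text_annotation):
    """Flatten a Vision-style annotation dict into a list of Words.

    Each returned item holds the Word's text (its Symbols joined together)
    and its four bounding-box vertices as (x, y) tuples.
    """
    words = []
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    symbols = word.get("symbols", [])
                    text = "".join(s.get("text", "") for s in symbols)
                    box = word.get("boundingBox", {}).get("vertices", [])
                    # Assumption: a coordinate equal to 0 may be missing
                    # from the JSON, so default it to 0.
                    vertices = [(v.get("x", 0), v.get("y", 0)) for v in box]
                    words.append({"text": text, "vertices": vertices})
    return words


# Hypothetical, hand-written sample with a single Word "$12".
sample = {
    "pages": [{"blocks": [{"paragraphs": [{"words": [{
        "boundingBox": {"vertices": [
            {"y": 5}, {"x": 30, "y": 5}, {"x": 30, "y": 15}, {"y": 15},
        ]},
        "symbols": [{"text": "$"}, {"text": "1"}, {"text": "2"}],
    }]}]}]}]
}

words = extract_words(sample)
```

Only the Word and Symbol levels are read here; the upper levels are just containers that get traversed.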

Line Orientation Estimation

By this time we have the texts from receipt images stored under the Word and Symbol entities.

We will now group them into horizontal lines relative to the receipt and store the lines as an array, sorted by each line’s vertical offset from the top of the receipt. Here’s the rationale behind it:

Information in receipts is almost always horizontally printed. Text items on the same horizontal line are much more likely to be related.
It removes Words that aren’t horizontal enough. The output from OCR can sometimes contain some vertical items, which aren’t our target data.
Different combinations of Words result in different meanings. Putting them together allows us to iterate through all possible ones.
Spacing between Words or Symbols is important. Once they are grouped within the same data instance, calculating the space length between them becomes easier.
Adjacent lines are also more likely to be related. To access them, we can simply move indices up and down as they are sorted instead of comparing the distance between a set of Words with another.
The images we receive can be captured at tilted angles, like the one below.

Figure 1. Receipt Data Extraction from Relatively Horizontal Lines

Let’s take the green lines shown in Figure 1 as an example. Apart from the lines being relatively horizontal, the date and time on each receipt are on the same line. Of course, this isn’t the case for every receipt.

As a disclaimer, the example above is just a random image. In real life, receipts can be nowhere near as clean and legible as we’d like them to be. For example, the receipt on the right is partially covered. While we can accommodate tilted angles, we cannot see through covered information.

Grouping Words into Horizontal Lines, with RANSAC

Each instance of Word comes with a set of four vertices, from which we can derive a vector that carries the Word’s direction. It can be calculated as follows:

Figure 2. Vector Direction of a Bounding Box
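The exact formula from the figure is not reproduced in the text, so the following is only one plausible way to get a direction vector from a bounding box. It assumes the four vertices are ordered top-left, top-right, bottom-right, bottom-left relative to the text, and averages the top and bottom edges. The name word_direction is hypothetical.

```python
import math


def word_direction(vertices):
    """Return a unit vector pointing along the reading direction of a Word.

    Assumption: vertices are ordered top-left, top-right, bottom-right,
    bottom-left relative to the text itself.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = vertices
    # Average the top edge (v0 -> v1) and the bottom edge (v3 -> v2).
    dx = ((x1 - x0) + (x2 - x3)) / 2
    dy = ((y1 - y0) + (y2 - y3)) / 2
    length = math.hypot(dx, dy)
    if length == 0:
        return (0.0, 0.0)  # degenerate box, no usable direction
    return (dx / length, dy / length)


horizontal = word_direction([(0, 0), (10, 0), (10, 4), (0, 4)])
vertical = word_direction([(5, 0), (5, 10), (1, 10), (1, 0)])
```

Normalizing to unit length keeps long and short Words comparable when their directions are later measured against each other.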

All the Words’ vectors are computed and stored as a matrix. Now we need to determine whether they are horizontally on the same line. Calculating the distance between each Word’s vector and the average vector from all Words seems a good approach. If the distance lies within a threshold, it is horizontal enough; otherwise, the Word is thrown away. Once all the words are checked, the valid ones can be grouped into lines sorted with their vertical offsets (i.e., y coordinates).

Although this method would filter out Words that are not horizontal enough, they may have already contaminated the calculation of the average vector. The filtering may end up being pointless, as the result wouldn’t be accurate.

Fortunately, there is a saying: when we see outliers, we RANSAC them! RANdom SAmple Consensus (RANSAC) is an algorithm for robustly fitting a model in the presence of outliers, which, when implemented, will take them out (i.e., Words that don’t fit). To run RANSAC, we take the vector of each Word as one data item.

Let’s say there’s a 70% chance of picking an inlier (a value that fits the pattern) when choosing randomly among all Words. We want to be 99.999% sure that at least one iteration picks only inliers, and the number of iterations needed follows from this formula:

Figure 3. Formula for Picking Inliers

In the formula in Figure 3:

C is the required confidence = 99.999%
r is the inlier chance = 70%
k is the number of samples needed to fit a model, which is a vector in each run (i.e., one in each iteration)
n is the number of iterations needed to attain the required confidence

To visualize the formula better, put the numbers in and do the math. You will see that the number of iterations (n) needed to reach the required confidence (C) is at least 10.

In fact, 70% of inlier chance is pessimistic as the majority of Words on a receipt are horizontally printed. Setting this lower than the actual value ensures the outliers are eliminated. Plus, since we are picking one Word each time to check if it’s an inlier, k = 1.
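To check the arithmetic, here is a small sketch that assumes the figure shows the standard RANSAC iteration bound, n >= log(1 - C) / log(1 - r^k). The function name ransac_iterations is ours.

```python
import math


def ransac_iterations(confidence, inlier_ratio, sample_size):
    """Smallest n such that at least one of n random samples is
    all-inlier with probability >= confidence."""
    if not (0 < confidence < 1 and 0 < inlier_ratio < 1 and sample_size >= 1):
        raise ValueError("confidence and inlier_ratio must be in (0, 1)")
    all_inlier_sample = inlier_ratio ** sample_size
    return math.ceil(math.log(1 - confidence) / math.log(1 - all_inlier_sample))


# C = 99.999%, r = 70%, k = 1: log(0.00001) / log(0.3) is about 9.56.
n = ransac_iterations(confidence=0.99999, inlier_ratio=0.7, sample_size=1)
print(n)  # 10
```

Lowering the assumed inlier ratio only pushes n up, which is why a pessimistic estimate is the safe choice.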

Based on the n value computed above, we ran 10 iterations of RANSAC over the unprocessed Words, which gives us 99.999% confidence that the resulting set of Words was fitted around an inlier. The average vector can then be calculated from that set.
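A minimal sketch of that loop, under our own assumptions, could look like the following: each iteration picks one Word’s direction as the model (k = 1), collects every Word whose unit vector lies within a threshold of it, and keeps the largest such set. The threshold value, the function names and the sample vectors are illustrative only.

```python
import math
import random


def normalize(vector):
    x, y = vector
    length = math.hypot(x, y)
    return (x / length, y / length) if length else (0.0, 0.0)


def ransac_mean_direction(vectors, iterations=10, threshold=0.1, seed=None):
    """Return (mean unit direction of the best consensus set, its indices)."""
    if not vectors:
        raise ValueError("need at least one vector")
    rng = random.Random(seed)
    units = [normalize(v) for v in vectors]
    best = []
    for _ in range(iterations):
        model = rng.choice(units)  # k = 1: one Word defines the model
        consensus = [
            i for i, u in enumerate(units)
            if math.hypot(u[0] - model[0], u[1] - model[1]) <= threshold
        ]
        if len(consensus) > len(best):
            best = consensus
    mean = normalize((
        sum(units[i][0] for i in best),
        sum(units[i][1] for i in best),
    ))
    return mean, best


# Four roughly horizontal Words and one vertical outlier (index 3).
vectors = [(10, 0.2), (8, -0.1), (12, 0.3), (0.1, 9), (9, 0)]
mean, inliers = ransac_mean_direction(vectors, iterations=50, seed=1)
```

Because the mean is computed from the consensus set only, the vertical Word never gets a chance to skew it.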

Now we have an accurate average vector. Along with a threshold, we can calculate the distance of each Word’s vector against it to decide whether it is an inlier. Then all the inliers are grouped into horizontal lines with their y-axis values.
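The grouping step can be sketched as follows, assuming each inlier Word is reduced to its text and the center of its bounding box. Centers are projected onto the axis perpendicular to the average direction to get a vertical offset relative to the receipt, and a new line starts whenever that offset jumps by more than a gap. The line_gap value and the function name are assumptions for illustration.

```python
def group_into_lines(words, direction, line_gap=10.0):
    """Group (text, (cx, cy)) Words into lines, top to bottom.

    direction is the unit vector along the text lines. Within each line,
    Words are ordered along that direction.
    """
    dx, dy = direction
    projected = []
    for text, (cx, cy) in words:
        along = cx * dx + cy * dy      # position along the line
        across = -cx * dy + cy * dx    # offset perpendicular to the line
        projected.append((across, along, text))
    projected.sort()

    lines = []
    previous_across = None
    for across, along, text in projected:
        if previous_across is None or across - previous_across > line_gap:
            lines.append([])
        lines[-1].append((along, text))
        previous_across = across
    return [[text for _, text in sorted(line)] for line in lines]


words = [
    ("TOTAL", (10, 100)), ("$12.50", (80, 102)),
    ("21/01/2020", (10, 20)), ("10:45", (90, 21)),
]
lines = group_into_lines(words, direction=(1.0, 0.0))
```

With the lines stored in sorted order, reaching an adjacent line is just a matter of moving the index up or down by one.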

Shortlisting Candidates with Amount, Time and Date regex

Before we pass data to the AI classifier, we need to extract Candidates from the horizontal lines, mainly with regular expressions (regex). In this case, any text pattern that looks like a price, date, or time will be considered a candidate. Below is an example of a regex for finding amount (price) candidates:

(?=((?:[^$0-9]|^)\$?(?:[1-9]\d{2}|[1-9]\d|[1-9])(?:,?\d{3})?(?:.\d{1,2})?))
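Here is a short sketch of how such a pattern could be applied to one horizontal line in Python. The pattern is copied verbatim from above, while the helper name amount_candidates is ours. Since the whole pattern sits inside a lookahead, matches may overlap, so a single line can yield several candidates.

```python
import re

# Pattern copied verbatim from the article.
AMOUNT_PATTERN = re.compile(
    r"(?=((?:[^$0-9]|^)\$?(?:[1-9]\d{2}|[1-9]\d|[1-9])(?:,?\d{3})?(?:.\d{1,2})?))"
)


def amount_candidates(line):
    """Return every (possibly overlapping) amount-like substring in a line.

    The capture includes the single character that precedes the amount,
    so surrounding whitespace is stripped here.
    """
    return [m.group(1).strip() for m in AMOUNT_PATTERN.finditer(line)]


found = amount_candidates("TOTAL $12.50")
```

The pattern is deliberately permissive: it is fine for it to return fragments that are not real amounts, because the classifier in the next stage decides which candidate to keep.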

Candidates are also built by joining adjacent Words, both with and without a space in between. Let’s say there are two adjacent Words, 12/20 and 21/01/2020, in a horizontal line. The no-space candidate from concatenating the two is 12/2021/01/2020, which looks like a really messed up date, and no one can tell which part is the year. If any part of this is the date we are seeking, we might end up missing it. The with-space version 12/20 21/01/2020 ensures the AI receives the separated Words, which improves the chance of landing a match.

At this stage, we realized regex can be a very handy tool to net some candidates. Consequently, a regex builder is available on FormX’s portal to help users come up with a correct regex for their target document.
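The two variants for each adjacent pair can be produced with a few lines. This is a sketch with a hypothetical helper name, not the FormX implementation.

```python
def joined_candidates(words):
    """For each pair of adjacent Words, return the no-space and
    with-space concatenations, in that order."""
    candidates = []
    for left, right in zip(words, words[1:]):
        candidates.append(left + right)
        candidates.append(left + " " + right)
    return candidates


pairs = joined_candidates(["12/20", "21/01/2020"])
# pairs == ["12/2021/01/2020", "12/20 21/01/2020"]
```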

Data Extraction with AI Binary Classifiers

Three models have been trained for our respective needs: price, date, and time.

Addressing the Flood of Useless Metadata

Receipts often contain unwanted metadata like the grocery store’s name and quantities of items purchased. Because of this, negative samples vastly outnumber positive ones in an unprocessed dataset, and a classification model trained on it will be extremely biased towards negative results. To balance the dataset, we duplicate the positive samples of amount, date, and time until positive and negative results reach a 1:1 ratio.
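One simple way to reach a 1:1 ratio is random oversampling of the positive class. The sketch below assumes each sample is a (features, label) pair with label 1 for positive and 0 for negative. The function name and the toy data are ours.

```python
import random


def oversample_positives(samples, seed=0):
    """Duplicate randomly chosen positive samples until positives and
    negatives are equal in number. Returns a new shuffled list."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(samples)  # nothing to balance
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    balanced = negatives + positives + extra
    rng.shuffle(balanced)
    return balanced


# Toy dataset: one real total among five unrelated candidates.
raw = [("$12.50", 1), ("2", 0), ("3", 0), ("1.99", 0), ("4.50", 0), ("15", 0)]
balanced = oversample_positives(raw)
```

Oversampling should be applied to the training split only, so that duplicated positives do not leak into evaluation data.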

Bag-of-Words (BoW) Model

A BoW model is employed to first classify texts. In a BoW model, a dictionary is built from words that appear in the receipt’s training dataset. If there are n unique words, the BoW model will be a vector with n dimensions.

Normally, a BoW model records how often each word occurs, but we don’t track counts in our case. Every word in the classification data (i.e., receipt image copy) will be matched against the BoW model. If the word can’t be found in that dictionary, it will be ignored.

For price data, the surrounding text on the same line will be computed against a BoW dictionary. If a candidate’s surrounding text doesn’t match the dictionary, the candidate will be marked as false. For the others, the +/-1 lines are taken into account, as data on the date or time can reside across them.
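A minimal sketch of such a presence-only BoW encoding follows. Whitespace tokenization and lowercasing are our simplifying assumptions, as are the function names.

```python
def build_vocabulary(training_texts):
    """Map each unique word in the training texts to a vector index."""
    vocabulary = {}
    for text in training_texts:
        for token in text.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(text, vocabulary):
    """Return an n-dimensional 0/1 vector marking which dictionary
    words appear in the text. Unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = 1
    return vector


vocabulary = build_vocabulary(["total amount due", "date of purchase"])
encoded = encode("Total due today", vocabulary)
# "today" is not in the dictionary, so it contributes nothing.
```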

Amount Classifier

The model we used for this is logistic regression (a model that estimates the probability of a binary outcome, such as pass/fail or win/lose, from a set of input features). These are the input parameters we used:

Position in Receipt. The Words and Symbols come with a bounding box property. With that we can determine the candidate’s line position, divided by the total number of lines. It’s less likely to have a price right at the top of a receipt, so candidates at lower positions are more likely to be the amount.

Has Symbols. For candidates, we check that symbols indicating price-related data exist in a pattern, such as “$”, “.”, and “,”.

Range Checking. The numeric values in candidates are checked against a set of ranges, like <10, >=10 and <100, or an extreme one, like >=10000000. Biases will be given based on the matching ranges. This can be tweaked based on the receipt. For example, if we are extracting the amount from a bunch of receipts from a luxury brand, the range should be on the upper side of the scale.
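To make these inputs concrete, here is a sketch of how one candidate might be turned into a feature vector before being handed to a logistic regression model. The range boundaries, the feature order and the function name are our assumptions, not the values used in FormX.

```python
# Hypothetical buckets: <10, 10-100, 100-10,000,000, >=10,000,000.
RANGES = [(0, 10), (10, 100), (100, 10_000_000), (10_000_000, float("inf"))]


def amount_features(candidate, line_index, total_lines):
    """Return [position, has "$", has ".", has ",", one flag per range]."""
    position = line_index / total_lines if total_lines else 0.0
    symbols = [int(s in candidate) for s in ("$", ".", ",")]

    cleaned = candidate.replace("$", "").replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        value = None  # not parseable as a number, no range matches
    buckets = [
        int(value is not None and low <= value < high)
        for low, high in RANGES
    ]
    return [position, *symbols, *buckets]


features = amount_features("$12.50", line_index=8, total_lines=10)
# position 0.8, "$" and "." present, value falls in the 10-100 bucket
```

For receipts that tend to carry large totals, the bucket boundaries are the part one would adjust.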

Date Classifier

The model we used for this is random forest (an ensemble of randomized decision trees) with the number of estimators at 300. These are the input parameters we used:

Position in receipt. This is calculated the same way as in the Amount Classifier. The date usually shows up at the top or bottom, so candidates in a more central position get a lower likelihood score.

Has Symbols. We check for symbols that imply date-related data, such as a slash (/) or period (.). Having fewer than two occurrences of these improves the candidate’s probability of being a date. Having a full year is also an advantage. A candidate that has “2019” in it, for example, is more likely to be a date than one that has only “19”. An English month name is also a good indicator, and a fully spelled-out month, like “September”, is a plus.

Time and date are often printed on the same line or adjacent to each other, which we also take into consideration. Candidates with inconsistent delimiters get penalized, for example 11/04-2019 as opposed to 11/04/2019. Some of the other factors we look at are:

1/(current year – extracted year + 1)
If the time candidate is on the same line or +/- one line
If different separators are used
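The factors above could be computed along these lines. This is a sketch under our own assumptions: the year is detected as a four-digit number starting with 19 or 20, a year in the future scores 0, and separators are only counted between digits. The function and key names are hypothetical.

```python
import re

MONTHS = (
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
)
MONTH_ABBREVIATIONS = tuple(month[:3] for month in MONTHS)


def date_features(candidate, current_year, time_nearby):
    """Return a dict of date-related features for one candidate.

    time_nearby: whether a time candidate sits on the same line or
    within one line of this candidate.
    """
    text = candidate.lower()
    tokens = re.findall(r"[a-z]+", text)
    separators = set(re.findall(r"(?<=\d)[/.\-](?=\d)", text))
    year_match = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", text)

    recency = 0.0
    if year_match and int(year_match.group()) <= current_year:
        # 1 / (current year - extracted year + 1)
        recency = 1 / (current_year - int(year_match.group()) + 1)

    return {
        "year_recency": recency,
        "has_full_year": year_match is not None,
        "time_nearby": bool(time_nearby),
        "mixed_separators": len(separators) > 1,
        "has_month_name": any(
            t in MONTHS or t in MONTH_ABBREVIATIONS for t in tokens
        ),
        "has_full_month_name": any(t in MONTHS for t in tokens),
    }


mixed = date_features("11/04-2019", current_year=2021, time_nearby=True)
spelled = date_features("21 September 2020", current_year=2021, time_nearby=False)
```

The recency term reaches 1 for a date in the current year and shrinks for older years, which favors recent dates over other year-like numbers on the receipt.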

Time Classifier

The model we used for this is random forest with the number of estimators at 300. These are the input parameters we used:

Position in receipt. This is calculated the same way as in the Date Classifier. Like the date, the time usually shows up at the top or bottom, so candidates in a more central position get a lower likelihood score.

Has Symbols. Candidates with fewer than two occurrences of “:” and empty space are more likely to be a time. The ones with am or pm are also prime candidates. Similar to how Date is classified, candidates with Words that imply time-related data will get extra marks.
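As with the other classifiers, these inputs can be written down as a small feature function. The sketch below records raw counts rather than interpreting the "fewer than two" rule, and measures how close a candidate is to the top or bottom edge. All names are hypothetical.

```python
import re


def time_features(candidate, line_index, total_lines):
    """Return a dict of time-related features for one candidate."""
    text = candidate.lower()
    position = line_index / total_lines if total_lines else 0.0
    return {
        # 0.0 at the very top or bottom, 0.5 in the middle of the receipt
        "distance_from_edge": min(position, 1 - position),
        "colon_count": text.count(":"),
        "space_count": text.count(" "),
        "has_am_pm": re.search(r"(?<![a-z])(?:am|pm)(?![a-z])", text) is not None,
    }


top = time_features("10:45 PM", line_index=0, total_lines=10)
middle = time_features("10:45:30", line_index=5, total_lines=10)
```

A random forest can pick its own split points on the counts, which is why they are passed through unthresholded here.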

Photo by Alex on Unsplash

Future Plans for FormX

Much like everything we do at Oursky, we are always on the lookout for improving our solutions and processes. Some are already in the works. We are looking to train the models with different inputs and parameters to improve accuracy. We also plan to expand our dataset. We’re currently collecting standard forms around the world, like insurance forms in the U.S.

There are tons of very helpful AI research efforts and projects all over the world, and the amount of investment in them is awe-inspiring. We will definitely keep an eye on them and integrate them if they prove to be innovative and outperform the current models.

We’ll continue improving FormX so stay tuned for more of our explorations into the wonderful world of AI!

Addendum:

This article was updated on May 23, 2020, 5:25 p.m. HKT with key updates on the introduction, line orientation estimation, AI binary classifiers, and future plans for FormX. The updates are in line with our presentation of this topic in the Google Developer Group Hong Kong ML Series 2020, an online event and series of learning sessions on machine learning. The webinar, titled “How to extract 𝓧 from receipts?”, was held on May 23, 2020.
This article was updated on October 15, 2020, 3:36 HKT with our official FormX brand/name.
