TL;DR: This post shows how to extract data from table images into pandas and Excel using the Azure Form Recognizer Service in Python.
What is Azure Form Recognizer Service?
The Azure Form Recognizer is a Cognitive Service that uses machine learning technology to identify and extract text, key/value pairs and table data from form documents. It ingests text from forms and outputs structured data that includes the relationships in the original file.
There is a free tier of the service that provides up to 500 calls a month, which is more than enough to run this demo.
If you are new to Azure, you can get started with a free subscription using the link below.
The Azure Form Recognizer Service can be consumed using a REST API or the following code in Python.
However, this code returns the result in JSON format, with a lot of additional information that is not relevant to the actual processing of the form data.
The following code sample will show you how to reformat this JSON with Python into a pandas DataFrame so it can be processed in a traditional data science pipeline or even exported to Excel.
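To make the reformatting step concrete, here is a minimal sketch of turning table cells from a JSON result into a DataFrame. It is not the code from the gist. The `sample_result` dictionary is a hypothetical, trimmed-down response, and the key names (`analyzeResult`, `pageResults`, `tables`, `cells`, `rowIndex`, `columnIndex`, `text`) are assumptions about the response shape; the exact layout depends on the API version and model you call, so check them against your own output. The helper names `table_to_dataframe` and `result_to_dataframes` are made up for this illustration.

```python
import pandas as pd

# Hypothetical, trimmed-down response. A real response carries many more
# fields (bounding boxes, confidence scores, page metadata, and so on).
sample_result = {
    "analyzeResult": {
        "pageResults": [
            {
                "page": 1,
                "tables": [
                    {
                        "rows": 3,
                        "columns": 2,
                        "cells": [
                            {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
                            {"rowIndex": 0, "columnIndex": 1, "text": "Price"},
                            {"rowIndex": 1, "columnIndex": 0, "text": "Coffee"},
                            {"rowIndex": 1, "columnIndex": 1, "text": "3.50"},
                            {"rowIndex": 2, "columnIndex": 0, "text": "Bagel"},
                            {"rowIndex": 2, "columnIndex": 1, "text": "2.25"},
                        ],
                    }
                ],
            }
        ]
    }
}


def table_to_dataframe(table, header=True):
    """Place each cell's text at (rowIndex, columnIndex) in a grid."""
    grid = [[""] * table["columns"] for _ in range(table["rows"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    if header and grid:
        # Treat the first row as column names (an assumption about the table).
        return pd.DataFrame(grid[1:], columns=grid[0])
    return pd.DataFrame(grid)


def result_to_dataframes(result):
    """Collect one DataFrame per table found across all pages."""
    pages = result.get("analyzeResult", {}).get("pageResults", [])
    return [table_to_dataframe(t) for page in pages for t in page.get("tables", [])]


df = result_to_dataframes(sample_result)[0]
print(df)
```

Everything outside the cell text is dropped here, which is the point: the DataFrame keeps only the values a downstream pipeline needs.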
We will use the pre-trained receipt model for this tutorial. The end-to-end code can be found in the following gist.
- Replace the file path placeholder with the file path of your form or table (for example, C:\temp\file.pdf). This can also be the URL of a remote file.
- Replace the endpoint and subscription key placeholders with the values that you obtained with your Form Recognizer subscription. You can find these on your Form Recognizer resource Overview tab, pictured below.
- Replace the file type placeholder with the file type. Supported types: application/pdf, image/jpeg, image/png, image/tiff.
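As a rough illustration of how these values fit together, the sketch below picks the file type from a local file's extension and builds the request headers. It is not the code from the gist. The helper names `content_type_for` and `build_headers` are hypothetical, and the sketch assumes the service expects the subscription key in the `Ocp-Apim-Subscription-Key` header, as Azure Cognitive Services REST APIs generally do. It handles local file names only; a remote URL would need different handling.

```python
from pathlib import Path

# The file types listed above as supported by the service.
SUPPORTED_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def content_type_for(source):
    """Map a file name to one of the supported file types."""
    suffix = Path(source).suffix.lower()
    if suffix not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")
    return SUPPORTED_TYPES[suffix]


def build_headers(subscription_key, source):
    """Build the request headers (header names are an assumption, see above)."""
    return {
        "Content-Type": content_type_for(source),
        "Ocp-Apim-Subscription-Key": subscription_key,
    }


# "my-key" is a dummy value; use the key from your resource's Overview tab.
headers = build_headers("my-key", "receipt.PNG")
print(headers)
```

Failing early on an unsupported extension avoids spending one of the free-tier calls on a request the service would reject.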
Once your data is in a pandas DataFrame, it can be converted to CSV for processing with Excel in just one line of code.
df.to_csv("form_data.csv")  # can now be processed with Excel
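One detail worth knowing: by default `to_csv` also writes the DataFrame index as an extra first column, which shows up in Excel as an unnamed column of row numbers. The short sketch below, using made-up sample data and an in-memory buffer instead of a file, passes `index=False` to leave it out.

```python
import io

import pandas as pd

# Made-up sample data standing in for the extracted form values.
df = pd.DataFrame({"Item": ["Coffee", "Bagel"], "Price": ["3.50", "2.25"]})

# Write to an in-memory buffer so the example leaves no file behind;
# pass a file name such as "form_data.csv" to write to disk instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row-number column
csv_text = buffer.getvalue()
print(csv_text)
```

If you prefer a native workbook over CSV, pandas also offers `DataFrame.to_excel`, which needs an Excel writer package such as openpyxl installed.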
Hope you enjoyed this demo of the power of the Azure Form Recognizer Cognitive Service. Check out the next steps to see how to train your own custom models and then use this code to extract their results to pandas and/or Excel.
- Train a Custom Model on your own Form/Table Data
- Link your model to a logic app to create an end-to-end data processing pipeline
Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.