Extracting Form Data to JSON, Excel & Pandas with Azure Form Recognizer

PythicCoder (aribornstein) ・ Originally published at Medium ・ 3 min read

TLDR; This post shows how to extract data from table images for pandas and Excel using the Azure Form Recognizer Service in Python.

What is Azure Form Recognizer Service?

The Azure Form Recognizer is a Cognitive Service that uses machine learning technology to identify and extract text, key/value pairs and table data from form documents. It ingests text from forms and outputs structured data that includes the relationships in the original file.

An example of a receipt that can be processed with the Azure Form Recognizer Service

There is a free tier of the service that provides up to 500 calls a month, which is more than enough to run this demo.

If you are new to Azure, you can get started with a free subscription using the link below.

Create your Azure free account today | Microsoft Azure

How to Consume the Azure Form Recognizer Service

The Azure Form Recognizer Service can be consumed using a REST API or the Python code in the following quickstart.

Quickstart: Extract receipt data using Python - Form Recognizer - Azure Cognitive Services
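As a rough sketch of the REST flow, the snippet below builds the analyze request and polls for the result using only the Python standard library. It assumes the v2.0 prebuilt receipt route (`/formrecognizer/v2.0/prebuilt/receipt/analyze`); the route may differ for the API version of your resource. The helper names (`build_analyze_request`, `submit`, `poll_result`) are hypothetical and are not part of any SDK.

```python
import json
import time
import urllib.request

# Assumption: v2.0 prebuilt receipt route; adjust to your resource's API version.
ANALYZE_PATH = "/formrecognizer/v2.0/prebuilt/receipt/analyze"


def build_analyze_request(endpoint, subscription_key, body, content_type):
    """Build the POST request that submits a document (raw bytes) for analysis."""
    url = endpoint.rstrip("/") + ANALYZE_PATH
    headers = {
        "Content-Type": content_type,  # e.g. application/pdf or image/jpeg
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def submit(request):
    """Send the request and return the Operation-Location URL to poll."""
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def poll_result(operation_url, subscription_key, fetch=None,
                wait_seconds=1.0, max_tries=30):
    """Poll the operation URL until the analysis has succeeded or failed.

    `fetch` is a hypothetical hook (URL -> parsed JSON dict) so the polling
    logic can be exercised without a network connection.
    """
    def http_get(url):
        req = urllib.request.Request(
            url, headers={"Ocp-Apim-Subscription-Key": subscription_key}
        )
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read())

    fetch = fetch or http_get
    for _ in range(max_tries):
        result = fetch(operation_url)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(wait_seconds)
    raise TimeoutError("analysis did not finish in time")
```

With a real resource, the calls would be chained as `poll_result(submit(build_analyze_request(endpoint, key, file_bytes, "application/pdf")), key)`, where `endpoint`, `key` and `file_bytes` are your own values.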

However, this code returns the result in JSON format with a lot of additional information that is not relevant to the actual processing of the form data.

Pictured Example JSON Response

The following code sample will show you how to reformat this JSON with Python into a pandas DataFrame so it can be processed in a traditional data science pipeline or even exported to Excel.

Form Data formatted in a tabular Pandas DataFrame

Prerequisites

We will use the pre-trained receipt model for this tutorial. The end-to-end code can be found in the following gist.

  • Replace the file path placeholder with the file path of your form or table (for example, C:\temp\file.pdf). This can also be the URL of a remote file.
  • Replace the endpoint and subscription key placeholders with the values that you obtained with your Form Recognizer subscription. You can find these on your Form Recognizer resource Overview tab, pictured below.

  • Replace the file type placeholder with the file type. Supported types: application/pdf, image/jpeg, image/png, image/tiff.

Code
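A minimal sketch of the reformatting step, assuming the v2.0 receipt response shape in which recognized fields live under `analyzeResult.documentResults[].fields` and each field carries a typed `value...` key plus `text` and `confidence`. The function names and the row layout (`document`, `field`, `value`, `confidence`) are illustrative choices, not necessarily the ones used in the original gist.

```python
def field_value(field):
    """Return the typed value (valueString, valueNumber, ...) or fall back to the raw text."""
    for key, value in field.items():
        if key.startswith("value"):
            return value
    return field.get("text")


def make_row(doc_index, name, field):
    """Build one flat row for a single recognized field."""
    return {
        "document": doc_index,
        "field": name,
        "value": field_value(field),
        "confidence": field.get("confidence"),
    }


def receipt_to_rows(response):
    """Flatten the recognized fields into one row per field, expanding line items."""
    rows = []
    documents = response.get("analyzeResult", {}).get("documentResults", [])
    for doc_index, document in enumerate(documents):
        for name, field in document.get("fields", {}).items():
            if field is None:  # fields the model could not find
                continue
            if field.get("type") == "array":
                # Line items: an array of objects, each with its own sub-fields.
                for item_index, item in enumerate(field.get("valueArray", [])):
                    for sub_name, sub_field in item.get("valueObject", {}).items():
                        if sub_field is not None:
                            label = f"{name}[{item_index}].{sub_name}"
                            rows.append(make_row(doc_index, label, sub_field))
            else:
                rows.append(make_row(doc_index, name, field))
    return rows
```

The resulting list of dictionaries can be passed straight to `pandas.DataFrame(rows)` to get a tabular DataFrame with one column per key.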

Exporting to Excel

Once your data is in a pandas DataFrame, it can be converted to CSV to process with Excel in just one line of code.

df.to_csv("form_data.csv") # can now be processed with Excel
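For illustration, here is a small self-contained version of the export step. The two rows are made-up sample values, not real service output, and the column names are assumptions chosen for this sketch.

```python
import pandas as pd

# Made-up sample rows standing in for the extracted form fields.
rows = [
    {"field": "MerchantName", "value": "Contoso", "confidence": 0.9},
    {"field": "Total", "value": 12.5, "confidence": 0.8},
]
df = pd.DataFrame(rows)

# index=False leaves out the DataFrame's row numbers, which Excel would
# otherwise show as an unnamed first column.
# Called without a path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name, as in `df.to_csv("form_data.csv", index=False)`, writes the same content to disk. If a native workbook is preferred, `df.to_excel("form_data.xlsx")` also exists, but it needs an Excel writer engine such as openpyxl to be installed.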

Hope you enjoyed this demo of the power of the Azure Form Recognizer Cognitive Service. Check out the next steps to see how to train your own custom models and then use this code to extract their results to pandas and/or Excel.

Next Steps

  • Train a Custom Model on your own Form/Table Data

Quickstart: Label forms, train a model, and analyze a form using the sample labeling tool - Form Recognizer - Azure Cognitive Services

  • Link your model to a logic app to create an end-to-end data processing pipeline

Tutorial: Use Form Recognizer with Azure Logic Apps to analyze invoices - Form Recognizer - Azure Cognitive Services

About the Author

Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies and computational medicine. As an Open Source Engineer at Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.

