Shittu Olumide
Extract Tables From Images in Python

Extracting tables from images can be a tedious and time-consuming task, especially if you have a large number of images to process. However, with the right tools and techniques, you can automate this process and extract tables from images quickly and easily.

In this article, we will explore how to extract tables from images using Python. We will cover a library that can be used to identify and extract tables from images, along with sample code and explanations. Whether you are working with scanned documents, photos, or other types of images, this article will provide you with the tools and knowledge you need to extract tables efficiently and accurately.

What is img2table?

img2table is a straightforward, user-friendly Python library for table identification and extraction. It is based on OpenCV image processing and supports most popular image file formats as well as PDF files.

Thanks to this design, it offers a useful, lighter-weight alternative to neural-network-based solutions, especially for CPU-only environments.

It supports the following file formats:

  • JPEG files - *.jpeg, *.jpg, *.jpe

  • Portable Network Graphics - *.png

  • JPEG 2000 files - *.jp2

  • Windows bitmaps - *.bmp, *.dib

  • WebP - *.webp

  • Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm

  • PFM files - *.pfm

  • OpenEXR Image files - *.exr

img2table Features

  • Table cell-level bounding boxes and table identification for images and PDF files.

  • Dealing with intricate table structures, like merged cells.

  • Extraction of table titles.

  • Extraction of table content, with support for multiple OCR tools and services.

  • Extracted tables are returned as simple objects, including a Pandas DataFrame representation.

  • Export of extracted tables to an Excel file, preserving their original structure.

The package is simple compared to deep learning solutions and needs little to no training. It still has some limitations, though: identifying more complex borderless tables is not yet supported and may call for CNN-based approaches.

Implementation

Installation

Like most Python packages, img2table can be installed via pip.

pip install img2table

Working with Images

from img2table.document import Image

image = Image(src, dpi=200, detect_rotation=False)

We instantiate Image, where src is the path to the image (required), dpi is an optional int used to adapt the OpenCV algorithm parameters (default 200), and detect_rotation is a boolean indicating whether skew or rotation of the image should be detected and corrected (default False).

Let's have an example:

from img2table.document import Image

# Instantiation of the image
img = Image(src="image.jpg")

# Table identification
image_tables = img.extract_tables()

# Result of table identification
image_tables

# Output
[ExtractedTable(title=None, bbox=(10, 8, 745, 314), shape=(6, 3)),
 ExtractedTable(title=None, bbox=(936, 9, 1129, 111), shape=(2, 2))]

Working with PDF

from img2table.document import PDF

pdf = PDF(src, dpi=200, pages=[0, 2])

This works the same way as with images, except for the new pages parameter, which is a list of PDF page indexes to process. If no indexes are specified, all pages are processed.
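As a quick illustration, here is a minimal sketch that restricts extraction to the first and third pages (the file name document.pdf is just a placeholder):

from img2table.document import PDF

# Process only the first and third pages (page indexes are zero-based)
pdf = PDF(src="document.pdf", pages=[0, 2])

# Table identification on the selected pages only
selected_tables = pdf.extract_tables()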

Working with OCR

To parse the content of tables, img2table offers an interface for various OCR tools and services.

from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, lang="eng", tessdata_dir="...")

Here, n_threads is the number of concurrent threads used to call Tesseract (an int, default 1), lang is the language Tesseract uses for text extraction (optional), and tessdata_dir is the directory containing the Tesseract traineddata files.

Note: Usage of Tesseract-OCR requires prior installation.

Let's have a look at an example.

from img2table.document import PDF
from img2table.ocr import TesseractOCR

# Instantiation of the pdf
pdf = PDF(src="tablesfile.pdf")

# Instantiation of the OCR, Tesseract, which requires prior installation
ocr = TesseractOCR(lang="eng")

# Table identification and extraction
pdf_tables = pdf.extract_tables(ocr=ocr)

# We can also create an excel file with the tables
pdf.to_xlsx('tables.xlsx', ocr=ocr)
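Note that for a PDF, extract_tables returns its results grouped by page index, and each extracted table exposes a Pandas DataFrame of its content through the df attribute. A minimal sketch of inspecting the result, reusing the pdf_tables variable from above (attribute names as described in the project's documentation), could look like this:

# Iterate over the identified tables, page by page
for page_idx, tables in pdf_tables.items():
    for table in tables:
        print(f"Page {page_idx} - title: {table.title}, bbox: {table.bbox}")
        # Table content parsed by the OCR, as a Pandas DataFrame
        print(table.df)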

Extracting Multiple Tables

The extract_tables method of a document allows multiple tables to be extracted simultaneously from a PDF page or an image.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=False,
                                      min_confidence=50)

Most of the parameters were discussed earlier when working with images and PDFs, but there are a few new ones: ocr is the instance used to parse document text, implicit_rows is a boolean indicating whether implicit rows should be identified, borderless_tables indicates whether borderless tables should be extracted, and min_confidence is the minimum OCR confidence level required to process text, from 0 (worst) to 99 (best).
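For an image, extract_tables simply returns a list of ExtractedTable objects, so you can loop over them and work with their DataFrames directly. The sketch below builds on the extracted_tables variable from the example above (the title, bbox and df attributes are the ones mentioned earlier):

import pandas as pd

# Inspect every table found in the document
for idx, table in enumerate(extracted_tables):
    print(f"Table {idx}: title={table.title}, bbox={table.bbox}")
    print(table.df.head())

# Optionally combine all tables into a single DataFrame for further processing
if extracted_tables:
    combined = pd.concat([t.df for t in extracted_tables], ignore_index=True)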

Conclusion

All of the image processing is done with OpenCV, via the opencv-python library. The algorithm is built on the Hough Transform, which detects lines in an image and lets us identify its horizontal and vertical lines. There really isn't much more to the library, because the intention was to keep it as straightforward as possible and avoid the complications that can come with other approaches.
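To make that idea concrete, here is a rough, illustrative sketch of detecting horizontal and vertical lines with OpenCV's probabilistic Hough Transform. This is not img2table's actual internal code, just a demonstration of the underlying technique (the file name and thresholds are arbitrary):

import cv2
import numpy as np

# Load the image in grayscale and detect edges
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Probabilistic Hough Transform: returns line segments as (x1, y1, x2, y2)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=50, maxLineGap=10)

horizontal, vertical = [], []
for line in lines if lines is not None else []:
    x1, y1, x2, y2 = line[0]
    if abs(y2 - y1) <= 2:    # nearly horizontal segment
        horizontal.append((x1, y1, x2, y2))
    elif abs(x2 - x1) <= 2:  # nearly vertical segment
        vertical.append((x1, y1, x2, y2))

print(f"{len(horizontal)} horizontal and {len(vertical)} vertical lines found")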

View the project's documentation on GitHub.

Let's connect on Twitter and on LinkedIn. You can also subscribe to my YouTube channel.

Happy Coding!

Top comments (6)

Vinoth Saravanan

Hi buddy, hope you are doing well. I'm facing an issue while calling TesseractOCR. The scope of my task is to extract tables from a PDF file containing scanned images. I have attached the sample file and the error page below. Can you please help me resolve this?

Shittu Olumide

Hi, what operating system are you using?

You can check this out: stackoverflow.com/questions/509519...

This should solve it for you.

Vinoth Saravanan

I'm using a Windows machine, and I'm following the same steps to call pytesseract locally that I use with other libraries. Is it mandatory to run 'pip install tesseract-ocr' as well? I tried to install tesseract-ocr, but there is a Visual Studio tools dependency, and the problem is its size: 7.0 GB.

Shittu Olumide

You should pip install tesseract-ocr
Let me check and give you feedback.

Dev

This is a problem of not installing the additional local packages in Program Files. If you want to use the extraction features with Tesseract-OCR, you first need to download the Tesseract engine from digi.bib.uni-mannheim.de/tesseract... and install it. Then open your environment variables (search for "edit environment variables" in the PC's search bar), select Path, click Edit, and paste the path to the Tesseract-OCR folder from the C drive's Program Files (or wherever you installed Tesseract). Then try it again; it should work.

farooq9786

I am getting output similar to the one shown above:

[ExtractedTable(title=None, bbox=(10, 8, 745, 314), shape=(6, 3)),
 ExtractedTable(title=None, bbox=(936, 9, 1129, 111), shape=(2, 2))]

How can we get these in table format?