How to Use Tesseract OCR to Convert PDFs to Text

#tesseract #ocr #pdf #convert

This is a cross-post from my blog, Arcadian.Cloud. Go there to see the original post.

I have some PDFs that I need converted into editable text. I decided to go with Tesseract OCR as it seems to be the best tool for the job. Here are the steps to use Tesseract OCR to convert PDFs to text.

Installation

First things first, get the Tesseract CLI installed. Follow the instructions here, which are linked from the official Tesseract docs.

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get update
sudo apt install tesseract-ocr tesseract-ocr-eng
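
After installing, it can help to confirm that the binary is on your PATH and that the English data is visible before moving on. This is only a minimal sketch: require_tool is a hypothetical helper name of mine, while --version and --list-langs are standard Tesseract flags.

```shell
# require_tool is a hypothetical helper, not part of Tesseract.
# It succeeds only if the named command is found on PATH.
require_tool() {
    command -v "$1" >/dev/null 2>&1
}

if require_tool tesseract; then
    tesseract --version       # prints the installed version
    tesseract --list-langs    # "eng" should appear in this list
else
    echo "tesseract not found on PATH; re-check the install steps above" >&2
fi
```

If eng is missing from the language list, the troubleshooting section at the bottom of this article applies.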

Note: the package didn’t properly place the eng.traineddata file for me. If you get an error about this, refer to the troubleshooting steps at the bottom of this article.

Usage

In the CLI, cd into the directory with the images or PDFs you want to convert.

Remember, Tesseract cannot read PDFs directly, so first we must convert the PDF to a .tiff file. Then we can run Tesseract on the .tiff to get the text.

# Convert the PDF to a .tiff file; replace the file names at the end of this command with your own
# Note: if you get an error about the security policy, check the troubleshooting section below
convert -fill white -draw 'rectangle 10,10 20,20' -background white +matte -density 300 Loring-Lombard-Autobiogrphy-Pages1-10.pdf Loring-Lombard-Autobiogrphy-Pages1-10.tiff

#Tesseract will add .txt to the end of the new file name
tesseract Loring-Lombard-Autobiogrphy-Pages1-10.tiff Loring-Lombard-Autobiogrphy-Pages1-10

I was able to safely ignore the errors printed during the conversion. Once the PDF to .tiff conversion finished, I ran the tesseract command to create the text file.

You should now have a text file. It really is that easy to use Tesseract OCR to convert PDFs to text files.
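
If you have more than one PDF in the directory, the two commands above can be wrapped in a loop. The following is a sketch rather than something from my original workflow: ocr_pdf is a hypothetical helper name, and it reuses the convert options shown above unchanged.

```shell
# ocr_pdf is a hypothetical helper wrapping the two commands from this post.
# Usage: ocr_pdf some-file.pdf   (writes some-file.tiff and some-file.txt)
ocr_pdf() {
    src=$1
    base=${src%.pdf}    # strip the .pdf extension to get the base name
    # Same convert options as the command earlier in this post
    convert -fill white -draw 'rectangle 10,10 20,20' -background white +matte \
        -density 300 "$src" "$base.tiff" || return 1
    # Tesseract appends .txt to the output base name
    tesseract "$base.tiff" "$base"
}

# Convert every PDF in the current directory
for pdf in *.pdf; do
    [ -e "$pdf" ] || continue    # skip the literal "*.pdf" when nothing matches
    ocr_pdf "$pdf"
done
```

Quoting the file names keeps the loop working for PDFs with spaces in their names.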

Troubleshooting

Missing Language Training Data

If you see something like the error message below, it means the English training data is not installed where Tesseract expects it.

Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Simply install the tesseract-ocr-eng package with the command below:

sudo apt install tesseract-ocr-eng
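
If the error persists after installing the package, it may be worth checking whether eng.traineddata exists anywhere on disk. A minimal sketch, with some assumptions: find_traineddata is a hypothetical helper name, and the search directory is taken from the error message above, so it may differ on your system.

```shell
# find_traineddata is a hypothetical helper, not part of Tesseract.
# Usage: find_traineddata LANG DIR...
# Prints every LANG.traineddata file found under the given directories.
find_traineddata() {
    lang=$1
    shift
    find "$@" -type f -name "$lang.traineddata" 2>/dev/null
}

# /usr/share/tesseract-ocr is taken from the error message above.
find_traineddata eng /usr/share/tesseract-ocr ||
    echo "could not search /usr/share/tesseract-ocr" >&2

# If the file turns up in a different tessdata directory, point Tesseract at it.
# The path below is only a placeholder:
# export TESSDATA_PREFIX=/path/to/tessdata
```

As the error message says, TESSDATA_PREFIX should be set to the tessdata directory itself.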

If this doesn’t fix it then check out this GitHub issue for more troubleshooting steps.

Convert Tool Security Policy Error

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
convert-im6.q16: no images defined `converted-pdf.tiff' @ error/convert.c/ConvertImageCommand/3229.

To fix the above error, you need to edit or remove the ImageMagick security policy. The simplest solution is to temporarily rename the policy file, but this may be dangerous if you forget to put it back. Instead, I recommend editing the policy file and removing only the offending rule.

sudo sed -i 's/^.*policy.*coder.*none.*PDF.*//' /etc/ImageMagick-6/policy.xml

Check out this StackOverflow post for more details on working around this error.
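
The sed command above edits the policy in place without keeping a copy. If you would rather rehearse the change first, something like the following sketch may help. remove_pdf_rule is a hypothetical helper name, /tmp/policy-test.xml is an arbitrary scratch path, and the pattern is the one from the sed command above, used here to delete the line instead of blanking it.

```shell
# remove_pdf_rule is a hypothetical helper.
# Usage: remove_pdf_rule FILE
# Keeps the original as FILE.bak, then deletes the rule denying the PDF coder.
remove_pdf_rule() {
    cp "$1" "$1.bak" || return 1
    sed '/policy.*coder.*none.*PDF/d' "$1.bak" > "$1"
}

# Rehearse on a copy so the real policy stays untouched
src=/etc/ImageMagick-6/policy.xml
if [ -r "$src" ]; then
    cp "$src" /tmp/policy-test.xml
    remove_pdf_rule /tmp/policy-test.xml
    # diff exits non-zero when the files differ, which is expected here
    diff "$src" /tmp/policy-test.xml || true
fi
```

Once the diff shows only the PDF rule being removed, the same two steps (cp for the backup, then sed) can be run against the real file with sudo.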
