PDF OCR Text Extraction

IronTesseract can read PDF documents as well as many image formats. The free Tesseract engine does not accept PDF input directly; pages have to be converted to images first.

OcrInput can also automatically correct PDF pages when the scans are of poor quality.

Developers may choose to read an entire PDF, a selection of pages, or a single crop area.

C#:

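using System;
using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
    // OCR entire document (the second argument is the PDF password)
    Input.AddPdf("example.pdf", "password");

    // Alternatively OCR selected page numbers.
    // Use this instead of AddPdf above, otherwise those pages are added twice.
    // Input.AddPdfPages("example.pdf", new[] { 1, 2, 3 }, "password");

    // Run OCR on everything added to the input and print the recognized text
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}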

VB:

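Imports System
Imports IronOcr

Dim Ocr = New IronTesseract()

Using Input = New OcrInput()
    ' OCR entire document (the second argument is the PDF password)
    Input.AddPdf("example.pdf", "password")

    ' Alternatively OCR selected page numbers.
    ' Use this instead of AddPdf above, otherwise those pages are added twice.
    ' Input.AddPdfPages("example.pdf", { 1, 2, 3 }, "password")

    ' Run OCR on everything added to the input and print the recognized text
    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using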
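
The correction of bad quality scans mentioned earlier can be sketched as follows. This is a minimal example, not a definitive recipe: it assumes the Deskew and DeNoise filters on OcrInput are available in your IronOcr version, and it reuses the same placeholder file name and password as above. Whether either filter helps depends on the scan, so it is worth comparing results with and without them.

C#:

```csharp
using System;
using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
    // Same placeholder document and password as in the example above
    Input.AddPdf("example.pdf", "password");

    // Assumed filters: straighten rotated pages, then reduce scanner noise.
    // Apply them after adding the PDF and before calling Read.
    Input.Deskew();
    Input.DeNoise();

    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
```

Filters add processing time to each page, so for clean, digitally generated PDFs they can usually be left out.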
