DEV Community

IronSoftware

Posted on • Originally published at ironsoftware.com

PDF OCR Text Extraction

Iron Tesseract can read PDF documents as well as many image formats. Conventional free Tesseract engines do not offer PDF input.

OcrInput can automatically correct PDF pages when the scans are of poor quality.

Developers can choose to read an entire PDF, a selection of pages, or a single crop area.

C#:

using System;
using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
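    // OCR the entire document (the second argument is the PDF password)
    Input.AddPdf("example.pdf", "password");

    // Alternatively, OCR only selected pages. Use this instead of AddPdf above;
    // calling both would add the selected pages to the input twice.
    // Input.AddPdfPages("example.pdf", new[] { 1, 2, 3 }, "password");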

    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

VB:

Imports System
Imports IronOcr

Dim Ocr = New IronTesseract()

Using Input = New OcrInput()
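    ' OCR the entire document (the second argument is the PDF password)
    Input.AddPdf("example.pdf", "password")

    ' Alternatively, OCR only selected pages. Use this instead of AddPdf above;
    ' calling both would add the selected pages to the input twice.
    ' Input.AddPdfPages("example.pdf", { 1, 2, 3 }, "password")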

    Dim Result = Ocr.Read(Input)
    Console.WriteLine(Result.Text)
End Using
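
The automatic correction of poor-quality scans mentioned earlier is not shown in the samples above. A minimal C# sketch follows, assuming the installed IronOcr version exposes the `Deskew()` and `DeNoise()` filters on `OcrInput`. The file name and password are the same placeholders used above.

```csharp
using System;
using IronOcr;

var Ocr = new IronTesseract();

using (var Input = new OcrInput())
{
    Input.AddPdf("example.pdf", "password");

    // Assumption: these filters exist on OcrInput in your IronOcr version.
    Input.Deskew();   // straighten pages that were scanned at an angle
    Input.DeNoise();  // reduce speckle and digital noise

    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
```

Filters add processing time, so it may be worth applying them only when plain OCR gives poor results.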