DEV Community

Mehr Muhammad Hamza
How to use Tesseract OCR in C#

Introduction

This article aims to give you an understanding of OCR, with an emphasis on how to extract text from images in C# using Tesseract and IronOCR. After reading this article, you will be able to extract text from images using C# in Windows Forms or ASP.Net. You will then be able to use that text for any purpose.

Contents:

  1. What is OCR?
  2. What is Tesseract OCR?
  3. What is IronOCR?
  4. Extract text from an image using IronOcr - A Step-by-Step Guide.
     a. Create Project
     b. Install Nuget Package for IronOcr
     c. Design Windows Form
     d. Write Code
     e. Run the Solution
  5. Extract text from the images in different languages using IronOcr.
  6. Extract text from the image using Tesseract - A Step-by-Step Guide.
     a. Install Nuget Package for Tesseract
     b. Write Code
     c. Run the Solution
  7. A Fair Comparison between Tesseract and IronOcr
  8. Why IronOcr?
  9. Summary

What is OCR (Optical Character Recognition)?

OCR stands for “Optical Character Recognition”. It is a technology that recognizes text within a digital image, and is commonly used to recognize text in scanned documents and images.
OCR software can be used to convert a physical paper document or image into an accessible electronic version with text. For example, if you scan a paper document or photograph with a scanner or multifunction printer, the device will most likely create a file with a digital image in it. The file could be a JPG, TIFF or PDF, but it is still only an image of the original document. You can then load this scanned file into an OCR program, which will recognize the text and convert the document into an editable text file.

What is Tesseract OCR?

Tesseract is an open-source optical character recognition (OCR) engine used to convert scanned paper documents, PDF files, and images into searchable text data. The engine detects the characters in the image and assembles them into words, enabling developers to search and edit the content of the document.

What is IronOCR?

IronOcr is another Optical Character Recognition technology. It is a .Net Library that converts images into editable and readable text, which lets us read text from images in our C# Application. The library supports more than 100 languages, so you can extract text from images in most languages, from English to Persian.
Here is a step-by-step guide for using IronOCR to extract text from images.

Step # 1: Open Visual Studio and Create Project

Open Visual Studio. I am using Visual Studio 2019, but you can use any version.
Select “Create New Project”, then select the Windows Form Application template.
Click “Next”. Name the Project, select Location, and click “Next”.
Select the “target framework”. I have chosen .Net 5.0, but you can choose your preferred option. Click “Finish”. The Windows Form Application will be created.
Before proceeding further, we need to install the Nuget Package for IronOCR.

Step # 2: Install Nuget Package IronOcr

Open the Nuget Package Manager Console from Tools > Nuget Package Manager > Package Manager Console.
Type “Install-Package IronOcr” in the console and press “Enter”.
IronOCR will begin installing in your project. After installation is complete, open your Windows Form and design your Application.

Step # 3: Design Windows Form

Open the Tool Box and drag the following controls onto the form: one label (for labelling our program), two buttons (one for selecting an image, and another for converting the image into text), one text box to display the image path, one picture box to display the image, and one Rich Text Box to display the extracted text.
Design the form as per your choice. The code in the next step assumes the buttons are named SelectImage and ConvertToText, the text box ImagePath, the picture box pictureBox1, and the Rich Text Box richTextBox1.
Let’s look at the code behind the buttons to see how easy it is to extract the text from an image using IronOcr.

Step # 4: Writing the Code behind the Buttons

Double-click on the “Select Image” button.
The following code will appear:

private void SelectImage_Click(object sender, EventArgs e)
{

}

Write the following code inside this function:

private void SelectImage_Click(object sender, EventArgs e)
{
    // OpenFileDialog is disposable, so wrap it in a using block
    using (OpenFileDialog open = new OpenFileDialog())
    {
        // Restrict the dialog to common image formats
        open.Filter = "Image Files(*.jpg; *.jpeg; *.gif; *.bmp)|*.jpg;*.jpeg;*.gif;*.bmp";
        if (open.ShowDialog() == DialogResult.OK)
        {
            // Display the selected image in the picture box
            pictureBox1.Image = new Bitmap(open.FileName);
            // Show the image file path in the text box
            ImagePath.Text = open.FileName;
        }
    }
}
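Since the path in the text box can be edited by hand, it may help to check it before running OCR. The following is only a minimal sketch and not part of the original project: the class name ImagePathCheck and the method IsSupportedImage are hypothetical, and the extension list simply mirrors the filter used in the dialog above.

```csharp
using System;
using System.IO;

public static class ImagePathCheck
{
    // Extensions mirrored from the OpenFileDialog filter
    // (assumption: the two lists are kept in sync by hand).
    private static readonly string[] Allowed = { ".jpg", ".jpeg", ".gif", ".bmp" };

    // Returns true when the path ends in one of the allowed image extensions.
    // This only inspects the file name; it does not check that the file exists.
    public static bool IsSupportedImage(string path)
    {
        if (string.IsNullOrWhiteSpace(path))
        {
            return false;
        }

        string extension = Path.GetExtension(path);
        return Array.Exists(Allowed,
            allowed => string.Equals(allowed, extension, StringComparison.OrdinalIgnoreCase));
    }
}
```

In the “Convert to Text” handler, a call such as ImagePathCheck.IsSupportedImage(ImagePath.Text) could be used to skip the OCR call and show a message when the path is empty or has an unexpected extension.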

Next, double-click on the “Convert to Text” button and the following code will appear:

private void ConvertToText_Click(object sender, EventArgs e)
{

}

Add the following namespace at the top of the file:

using IronOcr;

Next, add the following code inside the ConvertToText_Click() function:

private void ConvertToText_Click(object sender, EventArgs e)
{
    // Create the OCR engine
    IronTesseract ocr = new IronTesseract();
    // Read the image at the selected path
    var result = ocr.Read(ImagePath.Text);
    // Show the recognized text
    richTextBox1.Text = result.Text;
}

As you can see, we only needed to write three lines of code to perform this major task, all thanks to IronOcr.

Step # 5: Run the Project

Press Ctrl + F5 to run the Project.
Click on the “Select Image” button and select an image of your choice. I am selecting a snapshot of an article, but you can select any image.
Next, click the “Convert to Text” button to extract all the text from the image. The extracted text appears in the Rich Text Box.
You can see that I have easily extracted text from an image of the article. The result is accurate and ready to use for any further purpose. IronOcr has made this job incredibly easy.

Using IronOcr to Extract Text in Different Languages

As mentioned, IronOcr supports more than 100 languages. Let’s repeat the test with the Chinese language.
To extract text in a language other than English, you need to install the Nuget Package for that particular language.

Step # 1: Install the Nuget Package for the Specific Language

Run the following command in the Nuget Package Manager Console of Visual Studio:
Install-Package IronOcr.Languages.Chinese
Then set the Language property of the IronTesseract object to OcrLanguage.ChineseSimplified before calling Read, as follows:

private void ConvertToText_Click(object sender, EventArgs e)
{
    IronTesseract ocr = new IronTesseract();
    // Select the language installed by the Chinese language package
    ocr.Language = OcrLanguage.ChineseSimplified;
    var result = ocr.Read(ImagePath.Text);
    richTextBox1.Text = result.Text;
}

Let’s do the test again.

Step # 2: Run the Project

We can see that we have converted our Chinese language image into text with just one additional line of code. The IronOcr .Net library provides accuracy, efficiency, and an easy method to employ with our .Net Application.

How to Extract Text from the Image using Traditional Tesseract: A Step-by-Step Guide

Let’s look at the following example to see how we can achieve the same goal using Tesseract OCR. We can keep the same Windows Form as in the previous example and just change the code behind the “Convert to Text” button (the ConvertToText_Click function). Everything else will remain the same as before.

Step # 1: Install Nuget Package for Tesseract

Write the following command in the Nuget Package Manager Console.

Install-Package Tesseract
After installing the Nuget Package, you must install the language files manually in the project folder. One could say that this is a drawback of this particular library. Download the language files from the following link. Unzip the download and copy the tessdata folder into the debug folder of your project.
Next, add the Tesseract namespace at the top of the file (using Tesseract;) and write the following code inside the ConvertToText_Click function:

private void ConvertToText_Click(object sender, EventArgs e)
{
    // The engine, image and page hold native resources, so dispose them with using
    using (var ocrengine = new TesseractEngine(@".\tessdata", "eng", EngineMode.Default))
    using (var img = Pix.LoadFromFile(ImagePath.Text))
    using (var res = ocrengine.Process(img))
    {
        richTextBox1.Text = res.GetText();
    }
}

Step # 2: Run the Project

Press Ctrl + F5 to run the project. Select the image file you want to convert. I have selected the same English-language file as in the previous example. Click the “Convert to Text” button to extract the text from the image. The extracted text appears in the Rich Text Box.
Tesseract also supports images in other languages. However, we have to add a separate language file to the tessdata folder for each language.
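Because Tesseract loads each language from a file named after its language code (for example eng.traineddata), a missing file is a common reason for the engine failing to start. The sketch below shows one way to check the tessdata folder before creating the engine. It is an illustration only: the class name TessdataCheck and the method MissingLanguages are hypothetical, and it assumes the “+” syntax (such as "eng+chi_sim") for combining language codes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TessdataCheck
{
    // Returns the language codes whose <code>.traineddata file is missing from tessdataDir.
    // languageSpec is a single code ("eng") or several codes joined with "+" ("eng+chi_sim").
    public static List<string> MissingLanguages(string tessdataDir, string languageSpec)
    {
        var missing = new List<string>();
        string[] codes = languageSpec.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawCode in codes)
        {
            string code = rawCode.Trim();
            string file = Path.Combine(tessdataDir, code + ".traineddata");
            if (!File.Exists(file))
            {
                missing.Add(code);
            }
        }

        return missing;
    }
}
```

For example, TessdataCheck.MissingLanguages(@".\tessdata", "eng+chi_sim") returns an empty list only when both eng.traineddata and chi_sim.traineddata are present, so the result can be shown to the user before the engine is created.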
It is now becoming clear that the IronOcr .Net Library is easier to set up and use, because its language data is installed through Nuget rather than copied by hand.

A Fair Comparison Between IronOcr and Tesseract

Interoperability

With Tesseract, most of the work is done by a native C++ library. Interoperability with .Net is limited, and cross-platform compatibility, including with Azure, is poor. We have to choose the bitness of our application, meaning that we may only deploy to either 32-bit or 64-bit targets. The Visual C++ runtimes are also required to run Tesseract.
With IronOcr, complete installation, including languages, is done using the Nuget Package manager. We do not need to install a native exe or dll. Everything is handled by a single .Net component library.
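When a native library has to match the bitness of the application, it can be useful to log which mode the process is actually running in. This is a small sketch using only the standard Environment and IntPtr types; the class name ProcessBitness and the method Describe are hypothetical names chosen for illustration.

```csharp
using System;

public static class ProcessBitness
{
    // Returns "64-bit" or "32-bit" for the current process.
    // A 32-bit process can run on a 64-bit operating system,
    // so this reports the process, not the machine.
    public static string Describe()
    {
        return Environment.Is64BitProcess ? "64-bit" : "32-bit";
    }
}
```

Calling ProcessBitness.Describe() at startup, for example in the form's constructor, shows which set of native binaries the application will need.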

Up-to-Date and Well-Maintained

The latest builds of Tesseract 5 are not designed to compile on Windows out of the box. Installing Tesseract 5 for C# for free requires manually modifying and compiling Leptonica and Tesseract for Windows. In addition, free C# API wrappers on GitHub may be years behind the engine or incompatible with it.
We can run IronOcr on Windows, MacOS, Linux, Azure, AWS, Lambda, Mono and Xamarin Mac with little or no configuration. There are no native binaries to manage, and it is compatible with both .Net Framework and .Net Core.

Why IronOcr?

IronOcr is the best tool for Tesseract Management for the following reasons:

  1. It works seamlessly across .Net.
  2. You do not have to install Tesseract on your machine.
  3. It allows you to run the latest engines, such as Tesseract.
  4. It is available for all .Net Projects such as: .Net Framework 4.5, .Net Standard 2, .Net Core 2 and 3, and .Net 5.
  5. It offers greater accuracy, speed, efficiency and performance.
  6. It supports the latest technologies such as Xamarin, Mono, Azure and Docker.
  7. It manages the complex Tesseract dictionary system using the Nuget Package.
  8. It supports Pdf, MultiFrame Tiff and all major image formats without any configuration.
  9. It can correct low quality scans of documents or images and get the best results from Tesseract.
  10. Only a few lines of code are needed to use it with our Application.

Summary:

IronOcr is the most up-to-date and well-maintained character-recognition library for .Net. It provides accuracy, speed, simplicity and usability. You can download this product from here. For a more in-depth and advanced study of IronOcr, please refer to this link.

So, what are you waiting for? A 30-day free trial is available. You can obtain the License here and begin straightaway.
I hope that you have found this article useful. If you have any questions, please post them in the comments section below.

Top comments (3)

Tomasz Mętek • Edited

Title is about Tesseract, article not really...
Seems like clickbait.

Mohamed Elfar

nice content <3

Mehr Muhammad Hamza

Thank you