Kaddour Alaa

Extract characters from image using tesseract.js (OCR)

Hello 👋🏻.

Welcome to my first post here. Over the past couple of years I have read many posts on this website, and I find it very useful to share information with others and to hear different opinions on many tech subjects.
My name is Alaa. I am a web developer and a 'Webmaster' graduate of the Faculty of Economics and Management of Nabeul, and a 2nd-year computer science engineering student specializing in web technologies at the Private School of Engineering and Technologies (Esprit).
What is OCR? OCR (optical character recognition) is a technique for extracting characters from an image: the algorithm is taught what the shape of each character looks like in terms of pixels.
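To make the "shape of a character in pixels" idea concrete, here is a toy sketch that matches a tiny bitmap against known shapes. It is only an illustration: the 3x3 glyphs and the function names are made up for this example, and real engines such as Tesseract use far more sophisticated models than this.

```javascript
// Toy illustration only, not how tesseract.js works internally.
// Each hypothetical glyph is a 3x3 bitmap: "1" = ink, "0" = background.
const TEMPLATES = {
  I: ["010", "010", "010"],
  L: ["100", "100", "111"],
  T: ["111", "010", "010"],
};

// Count how many pixels differ between two same-sized bitmaps.
function pixelDistance(a, b) {
  let diff = 0;
  for (let row = 0; row < a.length; row++) {
    for (let col = 0; col < a[row].length; col++) {
      if (a[row][col] !== b[row][col]) diff++;
    }
  }
  return diff;
}

// Return the template character whose shape is closest to the input bitmap.
function recognizeGlyph(bitmap) {
  let best = null;
  let bestDistance = Infinity;
  for (const [char, template] of Object.entries(TEMPLATES)) {
    const distance = pixelDistance(bitmap, template);
    if (distance < bestDistance) {
      best = char;
      bestDistance = distance;
    }
  }
  return best;
}

// A slightly noisy "T" (one extra pixel in the bottom-right corner).
console.log(recognizeGlyph(["111", "010", "011"])); // prints "T"
```

The noisy input differs from the "T" template by one pixel, from "I" by three and from "L" by five, so the closest shape wins.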
We are going to use the tesseract.js (OCR) package to extract the words from an image, together with a data file that contains the character shapes used for recognition.
To run tesseract.js properly, serve the .html file we are going to create from a web server (http://) instead of opening it directly from the local file system.
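If you do not already have a web server at hand, any static file server should do. As a minimal sketch (assuming Node.js is installed; the function names and port 8080 are my own choices, not part of tesseract.js), something like this can serve the project folder for local testing:

```javascript
// Minimal static file server sketch for local testing (not production-ready).
const http = require("node:http");
const fs = require("node:fs");
const path = require("node:path");

// Content types for the kinds of files this tutorial uses.
const CONTENT_TYPES = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".wasm": "application/wasm",
  ".png": "image/png",
  ".gz": "application/gzip",
};

function contentTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return CONTENT_TYPES[ext] || "application/octet-stream";
}

// Build a server that serves files from rootDir; "/" maps to index.html.
function createStaticServer(rootDir) {
  const root = path.resolve(rootDir);
  return http.createServer((req, res) => {
    let urlPath;
    try {
      urlPath = decodeURIComponent(req.url.split("?")[0]);
    } catch (err) {
      res.writeHead(400);
      res.end("Bad request");
      return;
    }
    const filePath = path.join(root, urlPath === "/" ? "index.html" : urlPath);
    // Refuse paths that escape the root directory.
    if (filePath !== root && !filePath.startsWith(root + path.sep)) {
      res.writeHead(403);
      res.end("Forbidden");
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end("Not found");
        return;
      }
      res.writeHead(200, { "Content-Type": contentTypeFor(filePath) });
      res.end(data);
    });
  });
}

// Usage: createStaticServer(__dirname).listen(8080);
// then open http://localhost:8080/ in the browser.
```

The reason for serving over HTTP, as far as I understand it, is that browsers tend to restrict Web Workers and file loading on pages opened through file://, and tesseract.js relies on both.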

  1. Create an HTML file named index.html:
        <!-- the tesseract.js script -->
        <script src="js/tesseract.min.js"></script>

        <script>
          console.log("Processing");
          // Recognize the English text in OCR.png using the locally hosted files.
          Tesseract.recognize("OCR.png", "eng", {
            workerPath: "js/worker.min.js",
            langPath: "langs-folder/",
            corePath: "js/tesseract-core.wasm.js",
          })
            .then(function (result) {
              // The recognized text is available in result.data.text.
              console.log(result.data.text);
              // alert(result.data.text);
            })
            .catch(function (error) {
              // Report problems such as a missing image or data file.
              console.error(error);
            });
        </script>

  2. Create a directory named js in your project root and put the JavaScript files in it.
     Download the files: https://github.com/geekalaa/OCRJS/tree/main/js
  3. Create a directory named 'langs-folder' and download the data files into it: https://github.com/geekalaa/OCRJS/tree/main/langs-folder
     The global language data directory: https://github.com/tesseract-ocr/langdata
  4. We are going to use this image for the test: https://github.com/geekalaa/OCRJS/blob/main/OCR.png
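Before running the page, it can help to check that everything referenced in index.html is in place. Here is a small Node.js sketch for that (assuming Node.js is available; `findMissing` and `EXPECTED_PATHS` are hypothetical names, and the list only covers the paths used in step 1, not the individual language data files):

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Paths referenced by index.html in step 1, relative to the project root.
const EXPECTED_PATHS = [
  "index.html",
  "js/tesseract.min.js",
  "js/worker.min.js",
  "js/tesseract-core.wasm.js",
  "langs-folder",
  "OCR.png",
];

// Return the expected paths that do not exist under rootDir.
function findMissing(rootDir, expected = EXPECTED_PATHS) {
  return expected.filter((rel) => !fs.existsSync(path.join(rootDir, rel)));
}

// Usage: console.log(findMissing(process.cwd()));
```

An empty array means all the listed files and folders were found.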

Execution:

[Screenshot of the execution result]
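The call in step 1 only prints the final text. Recognition can take a few seconds, so here is a hedged variation that also reports progress. It assumes your tesseract.js version accepts a `logger` option whose messages carry `status` and `progress` fields (this may differ between versions); `formatProgress` is a helper name I made up for the example.

```javascript
// Turn a tesseract.js logger message into a short progress line.
// Assumes messages carry `status` (string) and `progress` (0..1) fields.
function formatProgress(message) {
  const percent = Math.round((message.progress || 0) * 100);
  return `${message.status}: ${percent}%`;
}

// Only run in the browser, after js/tesseract.min.js has been loaded.
if (typeof Tesseract !== "undefined") {
  Tesseract.recognize("OCR.png", "eng", {
    workerPath: "js/worker.min.js",
    langPath: "langs-folder/",
    corePath: "js/tesseract-core.wasm.js",
    // Assumption: the logger option is supported by the bundled version.
    logger: (m) => console.log(formatProgress(m)),
  })
    .then((result) => console.log(result.data.text))
    .catch((err) => console.error("Recognition failed:", err));
}
```

Keeping the formatting in its own function makes it easy to send the progress to a page element later instead of the console.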

I used the same script, with more advanced features, in my online tool. Try it: Character Count

Top comments (3)

MeLad Kamari

Why do developers use this?
It does not support many languages.

Kaddour Alaa

Good point. I just added the link to the global language data: github.com/tesseract-ocr/langdata

Kaddour Alaa • Edited

Because I think it's the easiest way to extract text from an image without using much RAM or processing power.