DEV Community

Extract characters from image using tesseract.js (OCR)

geekalaa ・2 min read

Hello 👋🏻.

Welcome to my first post here. Over the past couple of years I have read many posts on this website, and I find it very useful to share information with others and to hear different opinions on many tech subjects.
My name is Alaa. I am a web developer and a 'Webmaster' graduate of the Faculty of Economics and Management of Nabeul, and a 2nd-year computer science engineering student specializing in web technologies at the Private School of Engineering and Technologies (Esprit).
What is OCR? It stands for optical character recognition: an algorithm that extracts characters from an image, after being taught to recognize the shape of each character in terms of pixels.
We are going to use the tesseract.js (OCR) package to extract the words from an image, together with a data file that contains the character shapes used for recognition.
To run tesseract.js properly, you should serve the .html file that we are going to create from a web server rather than opening it directly as a local file.
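
If you do not have a web server at hand, here is a minimal sketch of one, assuming Node.js is installed. Nothing in it comes from tesseract.js: the file name server.js, the port 8080, and the helper names contentTypeFor, resolveRequestPath and createStaticServer are hypothetical choices for illustration, and any other static web server should work just as well.

```javascript
// Minimal static file server sketch (Node.js standard library only).
const http = require("http");
const fs = require("fs");
const path = require("path");

// Content types for the kinds of files this tutorial uses.
const CONTENT_TYPES = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".wasm": "application/wasm",
  ".png": "image/png",
  ".gz": "application/gzip",
};

// Pick a Content-Type value from the file extension.
function contentTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return CONTENT_TYPES[ext] || "application/octet-stream";
}

// Map a request URL to a file inside rootDir.
// Returns null when the URL is malformed or points outside rootDir.
function resolveRequestPath(rootDir, requestUrl) {
  let pathname;
  try {
    pathname = decodeURIComponent(requestUrl.split("?")[0]);
  } catch (error) {
    return null;
  }
  const relative = pathname === "/" ? "index.html" : pathname.replace(/^\/+/, "");
  const root = path.resolve(rootDir);
  const target = path.resolve(root, relative);
  return target.startsWith(root + path.sep) ? target : null;
}

// Build an HTTP server that serves files from rootDir (it does not listen yet).
function createStaticServer(rootDir) {
  return http.createServer((req, res) => {
    const filePath = resolveRequestPath(rootDir, req.url);
    if (!filePath) {
      res.writeHead(403);
      res.end("Forbidden");
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end("Not found");
        return;
      }
      res.writeHead(200, { "Content-Type": contentTypeFor(filePath) });
      res.end(data);
    });
  });
}

// Start only when asked to: node server.js serve
if (process.argv[2] === "serve") {
  createStaticServer(process.cwd()).listen(8080, () => {
    console.log("Open http://localhost:8080/ in your browser");
  });
}
```

Save it next to index.html, run `node server.js serve` from that directory, and open http://localhost:8080/ in the browser.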

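  1. Create an HTML file named index.html:
        <!-- Load the tesseract.js library -->
        <script src="js/tesseract.min.js"></script>

        <script>
        console.log("Processing");
        // Recognize the English text in OCR.png using the locally hosted files
        Tesseract.recognize("OCR.png", "eng", {
            workerPath: "js/worker.min.js",        // web worker script
            langPath: "langs-folder/",             // folder with the language data
            corePath: "js/tesseract-core.wasm.js", // OCR core script
        }).then(function (result) {
            // Print the extracted text
            console.log(result.data.text);
            // alert(result.data.text);
        }).catch(function (error) {
            // Report loading or recognition failures
            console.error(error);
        });
        </script>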

2. Create a directory named js in your project root and put the JavaScript files in it.
Download the files: https://github.com/geekalaa/OCRJS/tree/main/js
3. Create a directory named 'langs-folder' and download the data files into it: https://github.com/geekalaa/OCRJS/tree/main/langs-folder
The global language data directory: https://github.com/tesseract-ocr/langdata
4. We are going to use this image for the test: https://github.com/geekalaa/OCRJS/blob/main/OCR.png
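
Before opening the page, the layout from the steps above can be checked with a short Node.js script. This is only a sketch: the script and the names EXPECTED_FILES and findMissingFiles are hypothetical, and the list is taken from the paths used in index.html. It only checks that the 'langs-folder' directory exists, not which data files are inside it.

```javascript
// Sketch: report which of the tutorial's files are missing from the project root.
const fs = require("fs");
const path = require("path");

// Paths referenced by index.html in step 1, relative to the project root.
const EXPECTED_FILES = [
  "index.html",
  "OCR.png",
  "js/tesseract.min.js",
  "js/worker.min.js",
  "js/tesseract-core.wasm.js",
  "langs-folder",
];

// Return the expected paths that do not exist under rootDir.
// `exists` can be swapped out to try the function without touching the disk.
function findMissingFiles(rootDir, exists = fs.existsSync) {
  return EXPECTED_FILES.filter((file) => !exists(path.join(rootDir, file)));
}

// Check the directory the script is run from.
const missing = findMissingFiles(process.cwd());
if (missing.length === 0) {
  console.log("All expected files are in place.");
} else {
  console.log("Missing: " + missing.join(", "));
}
```

Run it from the project root with Node.js; it prints either a confirmation or the list of missing paths.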

Execution:

[Image: screenshot of the execution result]
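
Recognition can take a few seconds, so it may help to print progress while the script from step 1 runs. The sketch below is a variant of that script written with async/await. It assumes the tesseract.js build in the js directory accepts a `logger` option and calls it with messages that have `status` and `progress` fields, which is how the tesseract.js documentation describes it; check this against the version you downloaded. The names formatProgress and extractText are hypothetical.

```javascript
// Turn a logger message into a readable line, e.g. "recognizing text: 50%".
// Assumes the message has a `status` string and a `progress` number from 0 to 1.
function formatProgress(message) {
  const percent = Math.round((message.progress || 0) * 100);
  return `${message.status}: ${percent}%`;
}

// Same recognition call as in step 1, written with async/await.
async function extractText(image) {
  const result = await Tesseract.recognize(image, "eng", {
    workerPath: "js/worker.min.js",
    langPath: "langs-folder/",
    corePath: "js/tesseract-core.wasm.js",
    // Print each progress update while the worker is busy
    logger: (message) => console.log(formatProgress(message)),
  });
  return result.data.text;
}

// Run only in the browser, after js/tesseract.min.js has been loaded.
if (typeof Tesseract !== "undefined") {
  extractText("OCR.png")
    .then((text) => console.log(text))
    .catch((error) => console.error("OCR failed:", error));
}
```

To try it, put this code in the second script tag of index.html in place of the original call.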

I used the same script, with more advanced features, in my online tool. Try it: character count

Discussion (3)

MeLad Kamari

Why do developers use this?
It does not support many languages.

geekalaa Author • Edited

Because I think it's the easiest way to extract text from an image without using much RAM or processing power.

geekalaa Author

Good point there. I just added the link for the global lang data: github.com/tesseract-ocr/langdata