DEV Community

Kaddour Alaa


Extract characters from image using tesseract.js (OCR)

Hello 👋.

Welcome to my first post here. Over the past couple of years I have read many posts on this website, and I find it very useful to share information with others and to hear different opinions on many tech subjects.
My name is Alaa. I am a web developer and a 'Webmaster' graduate of the Faculty of Economics and Management of Nabeul, and a 2nd-year computer science engineering student specializing in web technologies at the Private School of Engineering and Technologies (Esprit).
What is OCR? OCR (optical character recognition) is a technique for extracting characters from an image: the algorithm is taught to recognize the shape of each character from its pixels.
We are going to use the tesseract.js package to extract the words from an image, together with a data file that contains the character shapes used for the recognition.
To run tesseract.js properly, serve the .html file that we are going to create from a web server (over HTTP) instead of opening it directly from the local file system.
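If you do not have a web server at hand, a few lines of Node.js are enough for testing. This is only a minimal sketch, assuming Node.js is installed; the function names (`contentTypeFor`, `resolveRequestPath`, `createStaticServer`) and port 8080 are my own choices for illustration and are not part of tesseract.js. Any other static server should work just as well.

```javascript
// Minimal static file server sketch (Node.js standard library only).
const http = require("http");
const fs = require("fs");
const path = require("path");

// Content types for the kinds of files used in this post; extend as needed.
const CONTENT_TYPES = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".wasm": "application/wasm",
  ".png": "image/png",
  ".gz": "application/gzip",
};

function contentTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return CONTENT_TYPES[ext] || "application/octet-stream";
}

// Map a request URL to a file inside rootDir.
// Returns null if the URL is malformed or points outside rootDir.
function resolveRequestPath(rootDir, requestUrl) {
  let pathname;
  try {
    pathname = decodeURIComponent(new URL(requestUrl, "http://localhost").pathname);
  } catch (err) {
    return null;
  }
  const relative = pathname === "/" ? "index.html" : pathname.replace(/^\/+/, "");
  const root = path.resolve(rootDir);
  const resolved = path.resolve(root, relative);
  const inside = resolved === root || resolved.startsWith(root + path.sep);
  return inside ? resolved : null;
}

function createStaticServer(rootDir) {
  return http.createServer((req, res) => {
    const filePath = resolveRequestPath(rootDir, req.url);
    if (!filePath) {
      res.writeHead(403);
      res.end("Forbidden");
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end("Not found");
        return;
      }
      res.writeHead(200, { "Content-Type": contentTypeFor(filePath) });
      res.end(data);
    });
  });
}

// Example usage (the port is an arbitrary choice):
// createStaticServer(__dirname).listen(8080);
```

Save it next to index.html, uncomment the last line, run it with node, and open http://localhost:8080/ in the browser.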

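1. Create an HTML file named index.html:
        <!-- the tesseract.js library -->
        <script src="js/tesseract.min.js"></script>

        <script>
            console.log("Processing");
            // Recognize the English text in OCR.png using the locally
            // hosted worker, core, and language data files.
            Tesseract.recognize("OCR.png", "eng", {
                workerPath: "js/worker.min.js",
                langPath: "langs-folder/",
                corePath: "js/tesseract-core.wasm.js",
            }).then(function (result) {
                // The recognized text is in result.data.text
                console.log(result.data.text);
                // alert(result.data.text);
            }).catch(function (error) {
                // Report failures, e.g. a missing image or data file
                console.error(error);
            }).finally(function () {
                console.log("Done");
            });
        </script>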

2. Create a directory named js in your project root and put the JavaScript files in it.
Download the files: https://github.com/geekalaa/OCRJS/tree/main/js
3. Create a directory named 'langs-folder' and download the data files into it: https://github.com/geekalaa/OCRJS/tree/main/langs-folder
The global language data directory: https://github.com/tesseract-ocr/langdata
4. We are going to use this image for the test: https://github.com/geekalaa/OCRJS/blob/main/OCR.png
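Once the files from the steps above are in place, the same call can also be written with async/await. This is only a sketch under a few assumptions: the folder layout is the one from steps 1 to 3, `extractText` and `OCR_OPTIONS` are names I made up for illustration, and the `logger` option (a callback for progress messages) is available in the tesseract.js build you downloaded. The Tesseract object is passed in as a parameter so the function can also be tried with a stub.

```javascript
// Paths from steps 1 to 3; adjust them if your folders are named differently.
const OCR_OPTIONS = {
  workerPath: "js/worker.min.js",
  langPath: "langs-folder/",
  corePath: "js/tesseract-core.wasm.js",
};

// `tesseract` is the global Tesseract object loaded by js/tesseract.min.js.
// Passing it as a parameter is an illustrative choice, not a library requirement.
async function extractText(tesseract, image, lang = "eng", onProgress) {
  const options = { ...OCR_OPTIONS };
  if (onProgress) {
    // Assumption: the library calls this with status/progress messages.
    options.logger = onProgress;
  }
  const result = await tesseract.recognize(image, lang, options);
  // Trim the whitespace around the recognized text.
  return result.data.text.trim();
}

// In the browser, after tesseract.min.js has loaded:
// extractText(Tesseract, "OCR.png", "eng", console.log)
//   .then(console.log)
//   .catch(console.error);
```

Keeping the paths in one object makes it easier to move the js and langs-folder directories later without touching the recognition code.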

Execution:

[Screenshot of the execution result]

I used the same script, with more advanced features, in my online tool. Try it: Character Count
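The code of that tool is not shown in this post. Just to give a rough idea of what counting on top of the recognized text could look like, here is a small sketch; `countText` is a hypothetical helper written for this example and is not taken from the tool.

```javascript
// Hypothetical helper: basic counts for a recognized text string.
function countText(text) {
  const trimmed = text.trim();
  return {
    // Array.from counts Unicode code points rather than UTF-16 units.
    characters: Array.from(text).length,
    charactersNoSpaces: Array.from(text.replace(/\s/g, "")).length,
    words: trimmed === "" ? 0 : trimmed.split(/\s+/).length,
    lines: trimmed === "" ? 0 : trimmed.split(/\r?\n/).length,
  };
}

// Example usage with the OCR result:
// console.log(countText(result.data.text));
```

It could be called inside the `then` callback of the script from step 1, where `result.data.text` is available.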

Top comments (3)

MeLad Kamari

Why do developers use this?
It does not support many languages.

Kaddour Alaa

Because I think it's the easiest way to extract text from an image without using much RAM or processing power.

Kaddour Alaa

Good point there. I just added the link for the global language data: github.com/tesseract-ocr/langdata
