Using Google's OCR API with Puppeteer for Visual Testing

Scott White ・7 min read

Puppeteer is a Node.js library that can be used for automated browser testing. It provides high-level access to Chromium-based browsers through the DevTools Protocol. In automated tests, you can use assertion libraries like Chai or Should.js to make assertions about elements or objects that should appear in an application.

One of the most common scenarios in automated testing is asserting that a word or phrase is displayed in a web application. There are a few methods available for a developer to accomplish this task.

  • One of them is to iterate through all the elements in a web page and look for the matching word. However, this method is inefficient and error-prone: querying every element is time-consuming, and because no part of the page is scoped, the word could match in unexpected locations and produce incorrect assertions.
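          // Iterate through all elements
          const contents = await page.$$('*');

          for (let i = 0; i < contents.length; i++) {
              const text_value = await (await contents[i].getProperty('innerText')).jsonValue();
              // innerText can be undefined for some nodes (for example SVG elements), so guard before matching
              if (typeof text_value === 'string' && /Hello/.test(text_value)) {
                  console.log(text_value);
              }
          }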
  • A better method is to use an XPath expression to select the element that contains the specified word. This is far simpler and more efficient than iterating through all the elements. However, without a defined scope, this method will also return results from the whole web page.
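          // Using XPath
          // Note: page.$x() is deprecated and removed in newer Puppeteer releases, where page.$$('xpath/...') is the equivalent
          const title = await page.$x("//h1[contains(text(), 'Domain')]");
          if (title.length > 0) {
              const text = await page.evaluate(h1 => h1.textContent, title[0]);
              console.log(text);
          }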
  • The most effective method would be to capture a specific part of the page as an image and use optical character recognition (OCR) to search for the word within that scope. Using OCR allows you to utilize a powerful character recognition algorithm without having to go through HTML elements.

What is OCR?

Optical Character Recognition, or OCR, is a technology that enables you to convert images of text content into machine-readable, digital data. OCR can convert any type of text content, whether it’s handwritten or printed, into editable and searchable data using an image of that text content. This is done by processing the image through a machine-learning algorithm to clean and recognize each character of a specific image.

There are multiple open-source OCR tools, like pytesseract or EasyOCR, that can be used to integrate OCR functionality into a program. However, these tools require significant configuration before they provide results with an acceptable accuracy level.

Google provides a ready-made solution for integrating OCR functionality into an application through its Vision API. It will be used in this tutorial since it abstracts away most of the model fine-tuning. The Vision API provides two distinct detection types to extract text from images, as described below.

TEXT_DETECTION - This will detect and extract text from any provided image. The resulting JSON will include the extracted string and individual words with their bounding boxes.

DOCUMENT_TEXT_DETECTION - This will also detect and extract text from any image but is optimized to be used for dense text and documents. Its JSON output will feature all the details, including extracted strings plus page, block, paragraph, word, and break information.

Let’s take a look at the following image, which contains a dense text block.
Text overlaid on an image of a computer with app icons

The following code can be used with either detection type to extract the text in that image, but there will be some differences in the output. If you want to use “DOCUMENT_TEXT_DETECTION”, uncomment that line and comment out the line type: 'TEXT_DETECTION'.

    const [result] = await client.annotateImage({
        image: {
            content: imageContent,
        },
        features: [{
            type: 'TEXT_DETECTION',
            // type: 'DOCUMENT_TEXT_DETECTION',
        }]
    });

    const textAnno = result.textAnnotations;
    textAnno.forEach(text => console.log(text));

TEXT_DETECTION Output
text detection from a document

DOCUMENT_TEXT_DETECTION Output
text detection from a document

As you can see, while TEXT_DETECTION simply extracted the text from the image, DOCUMENT_TEXT_DETECTION also identified the period (.) punctuation mark as a separating character. This is because DOCUMENT_TEXT_DETECTION is geared towards denser text blocks like documents and can extract more intricate details, while TEXT_DETECTION is better suited to large text objects like street signs and billboards.

Google Vision API with Puppeteer

In this section, you will be using DOCUMENT_TEXT_DETECTION as you are working with web content. Now, let’s start incorporating Vision API (OCR) within a Puppeteer program.

First, capture a screenshot of the area you need to extract text from, using Puppeteer's page.screenshot() function.

        const puppeteer = require('puppeteer');

        const browser = await puppeteer.launch({headless:true});
        const page = await browser.newPage();
        await page.goto('https://www.example.com');

        const imageFilePath = 'ocr-text-block.png';

        // Take a screenshot of the region that contains the text
        await page.screenshot({
            path: imageFilePath,
            clip: { x: 0, y: 0, width: 800, height: 400 },
            omitBackground: true,
        });

Then you have to encode the image to base64 format in order to pass the resulting data to the Vision API.

        // Read the image contents and encode them as base64
        const fs = require('fs');
        const imageContent = fs.readFileSync(imageFilePath).toString('base64');
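
If the screenshot file itself is not needed, the round trip through the file system can be skipped. The following is a minimal sketch rather than part of the original tutorial: it relies on the encoding: 'base64' option of page.screenshot(), and the helper name captureRegionAsBase64 is hypothetical.

```javascript
// Hypothetical helper: capture a clipped region of the page and return it as a
// base64 string that can be used as image.content in a Vision API request.
async function captureRegionAsBase64(page, clip) {
  // With encoding: 'base64', page.screenshot() resolves to a base64 string
  // instead of a Buffer, so no temporary file has to be written or read.
  return page.screenshot({ clip, encoding: 'base64' });
}
```

Inside the test it could be called as const imageContent = await captureRegionAsBase64(page, { x: 0, y: 0, width: 800, height: 400 });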

Next, pass the data to the Vision API and you will get the following result.

        // Pass the content (requires the @google-cloud/vision package and Google Cloud credentials)
        const vision = require('@google-cloud/vision');
        const client = new vision.ImageAnnotatorClient();
        const [documentAnnotateResult] = await client.annotateImage({
            image: {
                content: imageContent,
            },
            features: [{
                type: 'DOCUMENT_TEXT_DETECTION',
            }]
        });

        // Get the annotated result
        console.log(documentAnnotateResult.fullTextAnnotation);

Once you have obtained the OCR result without any errors, compare the image you captured with the corresponding output.

Captured image (ocr-text-block.png)
example domain written on a white background

OCR Result
terminal output using google ocr api
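
The fullTextAnnotation object nests its data as pages, blocks, paragraphs, words, and symbols. As a rough sketch of how that hierarchy can be flattened into a simple word list, consider the helper below. The name collectWords is hypothetical, and the function assumes the response shape described for DOCUMENT_TEXT_DETECTION.

```javascript
// Hypothetical helper: flatten a Vision API fullTextAnnotation into a list of
// words with their bounding boxes. Each word is rebuilt from its symbols.
function collectWords(fullTextAnnotation) {
  const words = [];
  // fullTextAnnotation can be empty when no text was detected.
  if (!fullTextAnnotation) return words;
  for (const page of fullTextAnnotation.pages || []) {
    for (const block of page.blocks || []) {
      for (const paragraph of block.paragraphs || []) {
        for (const word of paragraph.words || []) {
          const text = (word.symbols || []).map((symbol) => symbol.text).join('');
          const vertices = word.boundingBox ? word.boundingBox.vertices : [];
          words.push({ text, vertices });
        }
      }
    }
  }
  return words;
}
```

Calling collectWords(documentAnnotateResult.fullTextAnnotation) yields entries such as { text: 'Example', vertices: [...] }, which are easier to search than the nested structure.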

That’s it! Congratulations, you have now successfully integrated the OCR functionality. The next step is to iterate over the text annotations and find a match for the string you are searching for. In this instance, you will be searching for the string “permission”.

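        let point_x = 0;
        let point_y = 0;
        // Iterate over the annotations: the first entry holds the full text, the rest are individual words
        const textAnno = documentAnnotateResult.textAnnotations;
        textAnno.forEach(text => {
            if (text.description === 'permission') {
                // First vertex of the matched word's bounding box (top-left for upright text)
                point_x = text.boundingPoly.vertices[0].x;
                point_y = text.boundingPoly.vertices[0].y;
            }
        });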

Iterate through the results using a forEach loop and if a matching text is found, obtain its coordinates from boundingPoly.vertices and store them in point_x and point_y variables.
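
The loop above uses the first vertex of the bounding box, which works here because the screenshot clip starts at (0, 0). A slightly more defensive variation is sketched below: it targets the center of the box, shifts by the clip origin, and accounts for the device scale factor. The helper name boundingPolyCenter and its parameters are hypothetical, and the sketch assumes the page has not been scrolled, so page and viewport coordinates coincide.

```javascript
// Hypothetical helper: convert an OCR bounding polygon (image pixels) into a
// coordinate that can be passed to document.elementFromPoint().
function boundingPolyCenter(vertices, clip = { x: 0, y: 0 }, scale = 1) {
  // Treat a missing coordinate as 0 (a defensive assumption about the response).
  const xs = vertices.map((v) => v.x ?? 0);
  const ys = vertices.map((v) => v.y ?? 0);
  const centerX = (Math.min(...xs) + Math.max(...xs)) / 2;
  const centerY = (Math.min(...ys) + Math.max(...ys)) / 2;
  // Convert image pixels to CSS pixels, then shift by the clip origin.
  return { x: clip.x + centerX / scale, y: clip.y + centerY / scale };
}
```

Aiming at the center rather than a corner reduces the chance of landing on a neighbouring element when the box edge sits exactly on an element boundary.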

Then, you have to invoke the document.elementFromPoint function to get the DOM element that contains the matched text.

        // Find the element using elementFromPoint
        const foundElement = await page.evaluateHandle((x, y) => document.elementFromPoint(x, y), point_x, point_y);
        // _remoteObject is an internal Puppeteer field and may change between versions
        console.log(foundElement._remoteObject.description);

Since the word “permission” is inside a paragraph tag (<p>), the console output will show a DOM object description attribute like the following.

ocr javascript file

Limitations of OCR

OCR is a powerful tool when it comes to identifying web content. However, it is not foolproof. Its main weakness is the accuracy of character recognition, which typically shows up as:

  • Missing some characters entirely.
  • Mixing up similar characters (6 to b, o to 0, I to l, etc.)
  • Messing up spaces in documents.

Any of these errors can make the OCR result inaccurate. Consider the following image and its OCR result.

Image
The words Los Altos on a white background

OCR Result
OCR result of predicting Los Altos image

You can see that the text in the image is “Los Altos”, but the OCR function mistakenly detects the letter “t” as a plus (+) sign. This not only misrecognizes the character but also breaks the word Altos into three separate segments (AL, + and OS). These kinds of issues are very common when dealing with handwritten text in images or documents.

To mitigate these issues and keep the Puppeteer test resilient, you need to leave some room for variance in OCR results. The simplest method would be to handle the commonly mistaken characters. For example, if you encounter a 6 that doesn't match your string, substitute it with b and check again. However, this method quickly becomes cumbersome and is not flexible enough to handle multiple inconsistencies.
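
To make the substitution idea concrete, here is a minimal sketch that maps each commonly confused character to a single canonical form before comparing. The mapping table and the names CONFUSABLES, canonicalize, and looselyEquals are illustrative assumptions, not an exhaustive or standard list.

```javascript
// Illustrative table: each commonly confused character maps to one canonical
// representative. Extend it as new OCR mix-ups are observed.
const CONFUSABLES = { '0': 'o', '6': 'b', '1': 'l', 'i': 'l' };

// Lowercase the text and replace every confusable character.
function canonicalize(text) {
  return text
    .toLowerCase()
    .split('')
    .map((ch) => CONFUSABLES[ch] || ch)
    .join('');
}

// Two strings match loosely if they are identical after canonicalization.
function looselyEquals(ocrText, expected) {
  return canonicalize(ocrText) === canonicalize(expected);
}
```

The trade-off is visible in the table itself: every new pair widens what counts as a match, so distinct words such as "lo" and "io" become indistinguishable.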

Another option would be to implement a word distance algorithm based on a string metric like Levenshtein distance, which measures the difference between two sequences. You could specify a threshold that allows for some variance in the results while still getting the confidence that the desired text was found in the specified area.
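
A threshold check along these lines could be sketched as follows. The Levenshtein implementation is the standard dynamic-programming version. The helper name containsApprox and the default threshold of 1 are assumptions chosen for illustration.

```javascript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn string a into string b.
function levenshtein(a, b) {
  // prev holds the distances for the previous row of the DP table.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: true if any OCR word is within maxDistance edits of the
// target, ignoring case.
function containsApprox(words, target, maxDistance = 1) {
  const wanted = target.toLowerCase();
  return words.some((word) => levenshtein(word.toLowerCase(), wanted) <= maxDistance);
}
```

With the earlier example, containsApprox(['Los', 'AL+OS'], 'Altos') accepts the misread word. When OCR splits a word into separate segments, as it did with AL, + and OS, joining the segments before comparing is more forgiving than matching segment by segment.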

Going beyond visual testing

Visual testing is an essential piece of the testing toolkit. It's a great way to ensure you aren't introducing visual anomalies as you make changes to your app, and it can be helpful in making simple assertions like the presence of text or images.

However, image-based testing is not one-size-fits-all. To successfully end-to-end test your product, you'll likely need to support a much wider range of assertions and actions (including uploading files, verifying that text messages or emails are received, etc.).

To support the automation of testing these harder assertions and experiences, you'll need to expand your toolkit to go beyond visual testing.

Looking to go beyond visual testing?

walrus.ai provides visual assertions and handles everything mentioned in this article out of the box. Beyond that, walrus.ai can be used to test your hardest user experiences end-to-end. Drop us a line if you'd like to see it in action!
