Introduction
Automating user interfaces has come a long way—but there are still situations where traditional methods fall flat. One of the biggest challenges arises when working with canvas-based applications, where no DOM elements exist for key interactive components. This makes it nearly impossible for standard test frameworks to simulate interactions like clicks, taps, or hovers using selectors alone.
This blog post introduces a novel solution to this problem: the wdio-visual-click-service, a new plugin for WebdriverIO that allows test scripts to interact with UI components using image matching instead of DOM queries.
The problem
In modern UI automation, developers and software engineers in test often run into limitations when trying to interact with components that don’t expose reliable DOM selectors—especially in canvas-based interfaces like lottery games, drawing tools, or dynamic third-party widgets. Traditional approaches using CSS or XPath selectors fall short in these scenarios.
Consider a fictional arcade game called Whack a Guacamole. It's a lighthearted twist on the classic whack-a-mole—but with avocados instead of moles.
Avocados pop up at random positions.
Your objective is to click on as many avocados as possible before time runs out.
Occasionally, a pufferfish appears as a trap: clicking it costs you 10 points.
Simple concept. Complex automation.
When you inspect the DOM while the game is running, you’ll notice something alarming for any automation engineer: no individual HTML elements represent the avocados or the pufferfish. All visual components are drawn directly onto the canvas using JavaScript’s rendering context.
Standard testing tools like WebdriverIO rely on querying the DOM to locate elements. In the case of Guacamole, trying to write a selector such as:
$('img[src="avocado.png"]')
…will yield nothing.
That’s because the avocado isn’t an <img> or a <div>; it’s just a group of pixels rendered directly on the canvas.
The Core Question
How can we verify click functionality or automate interactions with components that don’t exist in the DOM at all?
This is where the wdio-visual-click-service (VCS) comes in. Instead of relying on the DOM, this service uses visual data—scanning the screen for a reference image and simulating a click at the detected location.
What It Supports
The VCS supports two image-matching engines:
OpenCV: For robust, multi-scale template matching using grayscale comparison
Pixelmatch (via Jimp): A lighter, pixel-by-pixel fallback engine
Usage
Once the plugin is installed, it automatically registers a new browser command:
browser.clickByMatchingImage(referenceImagePath, options?);
You do not need to register this manually in a hook. Just enable the service in your wdio.conf.ts:
export const config: WebdriverIO.Config = {
  services: ['visual-click'],
};
Then, in your test, call:
await browser.clickByMatchingImage('./images/avocado.png');
The plugin takes care of everything else—from taking a screenshot to matching it with the reference image, to simulating the click.
Under the Hood: How It Works
The wdio-visual-click-service defines a WebdriverIO service that registers a new command in the before() lifecycle hook. This command, clickByMatchingImage, can be invoked in your test scripts to locate a reference image on screen and perform a click at the match location.
The plugin attempts to load the @u4/opencv4nodejs module. If OpenCV is available, it uses it for precise and scalable image recognition. If not, it gracefully falls back to a lighter image comparison engine using Jimp and Pixelmatch.
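The load-then-fall-back decision described above can be sketched as follows. This is an illustration only: selectEngine and its loadOpenCv parameter are hypothetical names, and the service's real internals may be organized differently.

```typescript
// Hypothetical sketch: names are illustrative, not the plugin's real internals.
type EngineName = 'opencv' | 'pixelmatch';

// `loadOpenCv` stands in for something like `() => require('@u4/opencv4nodejs')`.
// It is injected here so the decision logic can be exercised without the native module.
function selectEngine(loadOpenCv: () => unknown): EngineName {
  try {
    loadOpenCv(); // throws if the optional native module is not installed
    return 'opencv';
  } catch {
    return 'pixelmatch'; // fall back to the Jimp + Pixelmatch engine
  }
}
```

Keeping the loader behind a function means a missing optional dependency surfaces as an ordinary exception rather than a crash at import time.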
OpenCV Engine: Scalable, Precise Matching
When OpenCV is available, the plugin uses template matching to scan the screenshot for the reference image.
At a high level, the process works as follows:
A screenshot of the browser viewport is captured.
The reference image (e.g., an avocado) is resized to multiple scales (e.g., 1.0, 0.9, 0.8) to account for potential visual differences in size.
Here’s the key snippet:
const matched = grayScreenshot.matchTemplate(resizedRef, cv.TM_CCOEFF_NORMED);
const { maxVal, maxLoc } = matched.minMaxLoc();
This does three important things:
matchTemplate() produces a correlation map: a matrix where each cell contains a similarity score representing how well the reference matches that region of the screenshot.
cv.TM_CCOEFF_NORMED is the matching method used. It is OpenCV's normalized correlation coefficient method, which gives a match score between -1 and 1. A score of 1 means a perfect match.
minMaxLoc() then retrieves the best match from that matrix: maxVal is the confidence score of the best match, and maxLoc is the top-left coordinate where that best match was found.
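To make the score more concrete, here is a small pure-TypeScript sketch of what a normalized correlation coefficient computes for one candidate position, plus a brute-force search for the best position. It is not OpenCV's implementation (which is heavily optimized), and the names GrayImage, ccoeffNormed, and bestMatch are invented for this illustration. Returning 0 for a completely flat patch is a choice made here, not documented OpenCV behavior.

```typescript
// Grayscale image stored row by row, one intensity value per pixel.
interface GrayImage {
  width: number;
  height: number;
  data: number[];
}

// Correlation-coefficient score of `tpl` placed with its top-left corner at (x0, y0).
function ccoeffNormed(img: GrayImage, tpl: GrayImage, x0: number, y0: number): number {
  const n = tpl.width * tpl.height;
  const patch: number[] = [];
  for (let y = 0; y < tpl.height; y++) {
    for (let x = 0; x < tpl.width; x++) {
      patch.push(img.data[(y0 + y) * img.width + (x0 + x)]);
    }
  }
  const meanI = patch.reduce((a, b) => a + b, 0) / n;
  const meanT = tpl.data.reduce((a, b) => a + b, 0) / n;

  let num = 0;
  let varI = 0;
  let varT = 0;
  for (let i = 0; i < n; i++) {
    const di = patch[i] - meanI; // patch intensity relative to its mean
    const dt = tpl.data[i] - meanT; // template intensity relative to its mean
    num += di * dt;
    varI += di * di;
    varT += dt * dt;
  }
  const denom = Math.sqrt(varI * varT);
  return denom === 0 ? 0 : num / denom; // flat regions: treat as "no correlation"
}

// Brute-force equivalent of matchTemplate() followed by minMaxLoc().
function bestMatch(img: GrayImage, tpl: GrayImage): { maxVal: number; maxLoc: { x: number; y: number } } {
  let maxVal = -Infinity;
  let maxLoc = { x: 0, y: 0 };
  for (let y = 0; y + tpl.height <= img.height; y++) {
    for (let x = 0; x + tpl.width <= img.width; x++) {
      const score = ccoeffNormed(img, tpl, x, y);
      if (score > maxVal) {
        maxVal = score;
        maxLoc = { x, y };
      }
    }
  }
  return { maxVal, maxLoc };
}
```

Because both the patch and the template are mean-centered before comparison, a uniform brightness shift between screenshot and reference does not change the score.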
If maxVal exceeds the confidence threshold (e.g., 0.7), the plugin computes the center point of the match and simulates a click at that location.
This process is repeated across different scales of the reference image, ensuring reliable matches even if the UI is resized or rendered differently.
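The multi-scale loop, the threshold check, and the center-point calculation can be sketched like this. The sketch assumes the best score across all scales wins and that a score equal to the threshold is accepted; the plugin may instead stop at the first scale that passes. findClickPoint and the matchAtScale callback are hypothetical names; the callback stands in for "resize the reference by this factor, run template matching, return maxVal and maxLoc".

```typescript
interface MatchResult {
  maxVal: number;
  maxLoc: { x: number; y: number };
}

interface ClickPoint {
  x: number;
  y: number;
  score: number;
  scale: number;
}

// Hypothetical helper: tries every scale and keeps the best-scoring match.
function findClickPoint(
  refWidth: number,
  refHeight: number,
  scales: number[],
  confidence: number,
  matchAtScale: (scale: number) => MatchResult,
): ClickPoint {
  let best: ClickPoint | undefined;
  for (const scale of scales) {
    const { maxVal, maxLoc } = matchAtScale(scale);
    if (!best || maxVal > best.score) {
      best = {
        // maxLoc is the top-left corner, so add half of the scaled reference size.
        x: Math.round(maxLoc.x + (refWidth * scale) / 2),
        y: Math.round(maxLoc.y + (refHeight * scale) / 2),
        score: maxVal,
        scale,
      };
    }
  }
  if (!best || best.score < confidence) {
    throw new Error(`No match reached the confidence threshold of ${confidence}`);
  }
  return best;
}
```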
Pixelmatch Fallback Engine: Lightweight but Effective
If OpenCV is not available, the plugin falls back to a custom pixel comparison engine built on Jimp and Pixelmatch.
This approach involves:
- Iteratively cropping and comparing regions of the screenshot with the reference image
- Using a configurable stride to balance performance and granularity
- Calculating a match confidence as the ratio of identical pixels
- Refining the match by scanning a smaller area near the best initial result
Though not as fast or robust as OpenCV, this fallback engine still provides accurate results for most use cases—particularly when the screen resolution and content are relatively stable.
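The coarse-then-refine idea behind the fallback engine can be sketched in plain TypeScript. This is a simplified illustration, not the plugin's code: it uses one value per pixel instead of RGBA, compares pixels for exact equality rather than using Pixelmatch's color-difference threshold, and all names (Bitmap, matchRatio, scan, findByPixelRatio) are invented here.

```typescript
// One value per pixel for brevity; real screenshots carry RGBA channels.
interface Bitmap {
  width: number;
  height: number;
  data: number[];
}

interface Hit {
  x: number;
  y: number;
  confidence: number;
}

// Ratio of identical pixels when `ref` is placed at (x0, y0) on the screenshot.
function matchRatio(shot: Bitmap, ref: Bitmap, x0: number, y0: number): number {
  let same = 0;
  for (let y = 0; y < ref.height; y++) {
    for (let x = 0; x < ref.width; x++) {
      if (shot.data[(y0 + y) * shot.width + (x0 + x)] === ref.data[y * ref.width + x]) same++;
    }
  }
  return same / (ref.width * ref.height);
}

// Scans a rectangle of candidate positions with the given step and keeps the best.
function scan(shot: Bitmap, ref: Bitmap, xFrom: number, xTo: number, yFrom: number, yTo: number, step: number): Hit {
  let best: Hit = { x: xFrom, y: yFrom, confidence: -1 };
  for (let y = yFrom; y <= yTo; y += step) {
    for (let x = xFrom; x <= xTo; x += step) {
      const confidence = matchRatio(shot, ref, x, y);
      if (confidence > best.confidence) best = { x, y, confidence };
    }
  }
  return best;
}

// Coarse pass with `stride`, then a pixel-by-pixel pass around the coarse winner.
function findByPixelRatio(shot: Bitmap, ref: Bitmap, stride: number): Hit {
  const maxX = shot.width - ref.width;
  const maxY = shot.height - ref.height;
  const coarse = scan(shot, ref, 0, maxX, 0, maxY, stride);
  return scan(
    shot,
    ref,
    Math.max(0, coarse.x - stride),
    Math.min(maxX, coarse.x + stride),
    Math.max(0, coarse.y - stride),
    Math.min(maxY, coarse.y + stride),
    1,
  );
}
```

A larger stride makes the coarse pass cheaper, but it only works when partially overlapping positions still score above the background, which is one reason this approach is more sensitive to visual noise than template matching.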
Click Accuracy: Handling Screen Resolution
Whether using OpenCV or the fallback engine, the final match coordinates are adjusted based on:
- The current device pixel ratio (DPR)
- The browser viewport dimensions
This is handled by the internal clickAt(x, y) function, which scales coordinates appropriately and simulates the click using WebDriver's native pointer actions. It ensures that the click is placed exactly where a human would expect it, regardless of display density or zoom level.
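As a rough sketch of the coordinate adjustment, assume the screenshot is captured at device-pixel resolution while pointer actions expect CSS pixels relative to the viewport. Under that assumption, the match point is divided by the device pixel ratio and clamped to the viewport. toViewportPoint is a hypothetical helper, not the plugin's clickAt function.

```typescript
interface Point {
  x: number;
  y: number;
}

interface Viewport {
  width: number;
  height: number;
}

// Converts a point in screenshot (device) pixels to viewport (CSS) pixels.
function toViewportPoint(match: Point, dpr: number, viewport: Viewport): Point {
  const clamp = (value: number, max: number): number => Math.min(Math.max(value, 0), max);
  return {
    x: clamp(Math.round(match.x / dpr), viewport.width - 1),
    y: clamp(Math.round(match.y / dpr), viewport.height - 1),
  };
}
```

In recent WebdriverIO versions, the resulting point could then be handed to a pointer action along the lines of browser.action('pointer').move({ x, y }).down().up().perform().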
Configuration Options
To give Software Engineers in Test flexibility and precision, the clickByMatchingImage command supports an optional options object. This allows you to control how aggressively and accurately the service searches for a match. Here's what you can configure:
await browser.clickByMatchingImage('images/avocado.png', {
  scales: [1.0, 0.95, 0.9],
  confidence: 0.75
});
scales: Control Matching Resilience to Size Changes
Type: number[]
Default: [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
The scales array determines how many different sizes of the reference image are tried during the matching phase. This is particularly useful when:
The same UI element may appear larger or smaller depending on screen size or resolution.
The canvas is rendered at different sizes in different test environments (e.g., mobile vs. desktop).
The browser zoom level or device pixel ratio affects the apparent size of the image.
By default, the plugin tries 1.0 (full size), then scales down in steps as low as 0.3. This wide range ensures high robustness but may increase execution time. If you know what size to expect, you can limit the array to just a few values for faster tests:
e.g.
scales: [1.0, 0.95, 0.9] // Faster but still tolerant to slight resizing
This level of configurability helps tailor matching performance to your environment's predictability.
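If you prefer to describe the range rather than list every value, a small helper can generate the array. buildScales is a hypothetical convenience function, not part of the plugin; it rounds to two decimals to avoid floating-point noise such as 0.7000000000000001.

```typescript
// Hypothetical helper: builds a descending list of scales from `max` down to `min`.
function buildScales(max: number, min: number, step: number): number[] {
  if (step <= 0 || min > max) {
    throw new RangeError('Expected step > 0 and min <= max');
  }
  // The small epsilon guards against (max - min) / step landing just below an integer.
  const count = Math.floor((max - min) / step + 1e-9) + 1;
  return Array.from({ length: count }, (_, i) => Number((max - i * step).toFixed(2)));
}
```

For example, buildScales(1.0, 0.3, 0.1) reproduces the default list shown above, and buildScales(1.0, 0.9, 0.05) gives the shorter three-value list.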
confidence: Set the Minimum Match Quality
Type: number
Default: 0.7
The confidence setting determines the minimum similarity score required for a match to be accepted. The score ranges from 0 to 1, where:
1 means a perfect match
0 means no similarity
This threshold is critical for avoiding false positives:
A higher value like 0.9 ensures that only highly accurate matches are accepted—ideal for static, predictable UIs.
A lower value like 0.6 can help in visually noisy or dynamically styled applications, where minor differences (e.g., shadows, gradients, or anti-aliasing) could otherwise block the match.
Here's how it might look in use:
await browser.clickByMatchingImage('images/target.png', { confidence: 0.85 });
If the best match on screen doesn’t reach the specified confidence, the command will throw an error—indicating that no satisfactory match was found.
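Because a failed match surfaces as a thrown error, tests that treat a missing target as a normal outcome (for example, "click the avocado if one is currently visible") may want to wrap the call. The helper below is a hypothetical sketch. It swallows every error, since the exact error type or message the service throws is not assumed here, so use it only where a missed click is acceptable.

```typescript
// Hypothetical wrapper. `click` is any function that performs the visual click, e.g.
// () => browser.clickByMatchingImage('images/target.png', { confidence: 0.85 })
async function clickIfPresent(click: () => Promise<unknown>): Promise<boolean> {
  try {
    await click();
    return true; // a match was found and clicked
  } catch {
    return false; // no satisfactory match (or another failure) occurred
  }
}
```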
Real-World Examples
In a lottery scratch card UI where card pieces appear in slightly different positions and sizes due to animation, you'd want a broader scale range (e.g., scales: [1.0, 0.95, 0.9, 0.85]) and a moderate confidence (confidence: 0.75).
For a CAPTCHA click test, where visual accuracy is paramount, you'd use a tighter scale range and a high confidence threshold (confidence: 0.9) to avoid false clicks.
In a responsive game like Whack a Guacamole, where avocados may scale down on smaller screens, a wider scale range is essential, but confidence could remain at a medium level depending on how stylized the visuals are.
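Expressed as reusable option objects, the three scenarios might look like the sketch below. The scratch card values come from the example above; the CAPTCHA scale list and the game values are illustrative guesses to tune for your own application, and ClickOptions is a local stand-in for the plugin's options type.

```typescript
// Local stand-in for the options accepted by clickByMatchingImage.
interface ClickOptions {
  scales?: number[];
  confidence?: number;
}

const presets: Record<'scratchCard' | 'captcha' | 'responsiveGame', ClickOptions> = {
  // Slight size and position changes from animation: broader scales, moderate confidence.
  scratchCard: { scales: [1.0, 0.95, 0.9, 0.85], confidence: 0.75 },
  // Accuracy matters most: tight scales (assumed values), high confidence.
  captcha: { scales: [1.0, 0.95], confidence: 0.9 },
  // Sprites shrink on small screens: wide scales (assumed values), medium confidence.
  responsiveGame: { scales: [1.0, 0.9, 0.8, 0.7, 0.6, 0.5], confidence: 0.7 },
};
```

A test could then call browser.clickByMatchingImage('./images/avocado.png', presets.responsiveGame).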
Closing Thoughts
Automating canvas-based interfaces has long been a gap in the test automation landscape. With the introduction of the wdio-visual-click-service, you can now simulate human-like interactions in scenarios where DOM-based selectors fail. Whether you’re testing mini-games, dynamic visualizations, or embedded third-party tools, this plugin offers a powerful new way to bring reliability and precision to your tests.
The future of UI automation isn’t just in the DOM—it’s on the screen. And with visual matching, you’re one step closer to full coverage.
Repository
You can find the source code, installation instructions, and usage examples in the GitHub repository:
wdio-visual-click-service
In addition, the Whack a Guacamole game example shown above can be found here.