Fast Shape-based Template Matching
While there has been tremendous progress in image and object recognition using deep learning and large-scale language models in recent years, this article describes fast shape-based template matching using classical image processing. The algorithm is based on the following paper, with our own improvements, and the implemented application is available on GitHub.
Paper:
https://stefan-hinterstoisser.com/papers/hinterstoisser2011linemod.pdf
GitHub:
https://github.com/AkitoArai709/HighSpeedShapeBaseTemplateMattching
Background and Algorithm Overview
The main methods used to detect objects in images are “template matching” and “statistical methods/Deep Learning”. Each has the following characteristics.
- Template matching requires no training data and is simple to set up. However, it is easily affected by changes in lighting and background, and naive implementations can be too slow for real-time processing.
- Statistical methods and deep learning provide robust detection that is less affected by lighting and background. However, they require large amounts of training data and substantial computational resources.
The paper proposes an efficient and robust method for detecting textureless objects that addresses these problems.
In the paper's approach, template data is created from an image using only the gradient orientations of the object as features, and regions similar to the template are searched for in the test image. Because only the gradient orientation is used, without the gradient norm (magnitude), the method is robust to contrast changes. Similarity is computed from the cosine of the difference between gradient orientations at the feature points, which measures agreement with the template data.
However, since the cosine is computationally expensive to evaluate repeatedly, a lookup table indexed by quantized gradient orientation is used to reduce the computation cost, and SSE instructions are used to speed up the process.
Image Processing Approach
Compute the edge (gradient) orientations from the template image to be searched and create template data with those orientations as features. Scan the template data across the test image to obtain the similarity at each position.
When the template image is O and the test image is I, the similarity is calculated by the following formula.
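Reconstructed here from Eq. (1) of the paper, writing the template as $\mathcal{T} = (O, P)$, where $P$ is the list of feature locations $r$ (we use $\mathcal{T}$ to avoid confusion with the neighborhood size T used below):

$$\varepsilon(I, \mathcal{T}, c) = \sum_{r \in P} \left( \max_{t \in R(c+r)} \left| \cos\big(\mathrm{ori}(O, r) - \mathrm{ori}(I, t)\big) \right| \right)$$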
Here ori denotes the gradient orientation, and R(c+r) denotes the neighborhood of size T centered at position c+r in the input image. The cosine evaluates how well two gradient orientations agree. In other words, for each template feature point, the maximum agreement (cosine) within the neighborhood R(c+r) is taken, and these maxima are summed over all feature points to give the similarity.
Calculation of gradient direction
Compute the luminance gradient from the image using a Sobel filter to obtain the gradient orientation. For RGB color images, compute the gradient of each color channel separately and take the orientation of the channel with the largest gradient magnitude, as in the following formula.
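The formula, reconstructed from the paper, where $\mathrm{ori}(C, x)$ is the gradient orientation of channel $C$ at location $x$:

$$\mathrm{ori}(I, x) = \mathrm{ori}\big(\hat{C}(x), x\big), \qquad \hat{C}(x) = \operatorname*{arg\,max}_{C \in \{R, G, B\}} \left\| \frac{\partial C}{\partial x} \right\|$$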
The gradient orientation is quantized into 8 directions so that the response maps can be computed in a later step. To make the result robust to noise, each location is assigned the quantized orientation that appears most frequently in its 3x3 neighborhood, considering only gradients whose norm is above a threshold. Let L be the quantized gradient orientation map.
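As a rough sketch of this step in C++ with OpenCV (the function name is ours; for brevity it works on a grayscale conversion instead of the per-channel maximum described above, and omits the 3x3 neighborhood vote):

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>

// Sketch: quantize gradient orientations into 8 bins, ignoring gradient sign.
// Each output pixel holds a one-hot byte (one bit per orientation), or 0
// where the gradient is too weak to be a feature.
cv::Mat quantizedOrientations(const cv::Mat& src, float magThreshold)
{
    const float kPi = 3.14159265f;
    cv::Mat gray, dx, dy;
    cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
    cv::Sobel(gray, dx, CV_32F, 1, 0, 3);   // horizontal gradient
    cv::Sobel(gray, dy, CV_32F, 0, 1, 3);   // vertical gradient

    cv::Mat quantized = cv::Mat::zeros(src.size(), CV_8U);
    for (int y = 0; y < src.rows; ++y) {
        for (int x = 0; x < src.cols; ++x) {
            float gx = dx.at<float>(y, x), gy = dy.at<float>(y, x);
            if (gx * gx + gy * gy < magThreshold * magThreshold)
                continue;                    // weak gradient: no feature
            // Map the angle into [0, 180): opposite directions fall into the
            // same bin, which is what makes the feature contrast-invariant.
            float angle = std::fmod(
                std::atan2(gy, gx) * 180.0f / kPi + 180.0f, 180.0f);
            int bin = static_cast<int>(angle * 8.0f / 180.0f) % 8;
            quantized.at<uchar>(y, x) = static_cast<uchar>(1 << bin);
        }
    }
    return quantized;
}
```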
Diffusion in gradient direction
To avoid evaluating the cosine and the max operator in (1) for every template position in the test image, a new binary representation (J) of the gradients around each image location is introduced. This binary representation and a lookup table are used to efficiently precompute the result of the max operation.
First, an illustration of the binary representation (J) with the gradient orientations diffused is shown below.
The quantized gradient orientation is encoded in a binary representation, with each bit assigned to one orientation. These orientation bits are diffused to the neighborhood to create the binary representation (J).
With this binary representation, J can be computed by OR-ing together copies of the bit image shifted over the specified range. The range of shifts is the neighborhood of size T indicated in (2).
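A minimal sketch of the diffusion by shift-and-OR, assuming the one-hot orientation image from the previous sketch (for simplicity the T x T range here starts at each pixel instead of being centered on it):

```cpp
#include <opencv2/opencv.hpp>

// Sketch: spread the one-hot orientation bits over a T x T neighborhood by
// OR-ing shifted copies of the image into the result.
cv::Mat spreadOrientations(const cv::Mat& quantized, int T)
{
    cv::Mat spread = cv::Mat::zeros(quantized.size(), CV_8U);
    for (int dy = 0; dy < T; ++dy) {
        for (int dx = 0; dx < T; ++dx) {
            // OR the copy shifted by (dx, dy) into the result.
            for (int y = 0; y < quantized.rows - dy; ++y) {
                const uchar* src = quantized.ptr<uchar>(y + dy);
                uchar* dst = spread.ptr<uchar>(y);
                for (int x = 0; x < quantized.cols - dx; ++x)
                    dst[x] |= src[x + dx];
            }
        }
    }
    return spread;
}
```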
Calculating the response map
Next, the similarity of the binary representation (J) to each quantized orientation is precomputed using a lookup table. The precomputed result is called the response map. An illustration of the response map is shown below.
One response map is created for each quantized orientation. The idea is to compute and store, for each location, the maximum similarity to that particular orientation (image on the right). This is done by preparing a lookup table that stores the similarity for every combination of orientations, indexed by the spread orientation bits (image on the left). In this way, response maps for all 8 quantized orientations are created.
For reference, an example of the lookup tables for orientations 0 degrees (→) and 90 degrees (↑) is shown below. When the orientations coincide, cos(0) = 1 is obtained, and you can see that the similarity decreases as the orientations deviate.
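A sketch of how such lookup tables could be precomputed, one 256-entry table per quantized orientation (names and the float scoring are our choices; real implementations typically quantize the scores to small integers):

```cpp
#include <array>
#include <algorithm>
#include <cmath>

// Sketch: tau[i][m] holds, for quantized orientation i, the best |cos|
// agreement with any orientation bit set in the spread byte m = J(c).
std::array<std::array<float, 256>, 8> buildLookupTables()
{
    std::array<std::array<float, 256>, 8> tau{};
    const float kPi = 3.14159265f;
    const float binAngle = 180.0f / 8.0f;          // 22.5 degrees per bin
    for (int i = 0; i < 8; ++i) {
        for (int m = 0; m < 256; ++m) {
            float best = 0.0f;
            for (int l = 0; l < 8; ++l) {
                if (!(m & (1 << l))) continue;     // orientation l not present
                float diff = (i - l) * binAngle * kPi / 180.0f;
                best = std::max(best, std::abs(std::cos(diff)));
            }
            tau[i][m] = best;                      // 0 when no bits are set
        }
    }
    return tau;
}
```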
For each orientation i, the value at each position c in the response map Si is calculated as follows, where τ denotes the lookup table.
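Reconstructed from the paper, with $L(c)$ the list of orientations encoded in the byte $J(c)$:

$$S_i(c) = \max_{l \in L(c)} \left| \cos(i - l) \right| = \tau_i\big[L(c)\big]$$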
The final similarity calculation is as follows.
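Reconstructed from the paper: with the response maps precomputed, the max operation in (1) becomes a simple table read, and the similarity reduces to a sum of response map values:

$$\varepsilon(I, \mathcal{T}, c) = \sum_{r \in P} S_{\mathrm{ori}(O, r)}(c + r)$$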
Memory linearization for parallelization
Modern processors do not read single values from main memory one at a time; they fetch a whole block, called a cache line, at once. If memory is accessed at random locations, cache misses occur and computation slows down.
On the other hand, accessing multiple values from the same cache line is very efficient. Therefore, storing data in the order in which it will be accessed can speed up the computation.
Since the aforementioned diffusion of orientations to a neighborhood of size T makes it sufficient to evaluate every T-th pixel, the response maps are stored in memory in a cache-friendly order, as shown in the following figure.
Each response map is reassembled so that values T pixels apart on the X axis are placed next to each other in memory. When the current row is finished, it proceeds to the row T pixels further along the Y axis.
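A sketch of this reordering (the container choice is ours; the response map is assumed to be a CV_32F image):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: reorder one response map into T*T "linear memories" so that the
// values visited when evaluating every T-th pixel are contiguous in memory.
std::vector<std::vector<float>> linearizeResponseMap(const cv::Mat& response, int T)
{
    std::vector<std::vector<float>> linear(T * T);
    for (int oy = 0; oy < T; ++oy) {
        for (int ox = 0; ox < T; ++ox) {
            std::vector<float>& mem = linear[oy * T + ox];
            // Collect every T-th pixel starting at offset (ox, oy), row major.
            for (int y = oy; y < response.rows; y += T)
                for (int x = ox; x < response.cols; x += T)
                    mem.push_back(response.at<float>(y, x));
        }
    }
    return linear;
}
```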
The linear memories for the different orientations of the template (black arrows in the figure below) are shifted by offsets according to the relative positions (Rx, Ry)^T of the template features with respect to the anchor (search reference) coordinates, and then added together to obtain the similarity of the template over the input image. These additions can be accelerated by executing them with parallel SSE instructions.
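A sketch of the SSE accumulation, assuming the response map values have been quantized to 8-bit integers beforehand (as the paper does), so that 16 scores are added per instruction with saturation:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

// Sketch: accumulate one feature's (already offset) linear memory into the
// running score buffer, 16 scores per saturating 8-bit add.
void accumulateScores(uint8_t* scores, const uint8_t* linearMemory, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i acc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(scores + i));
        __m128i val = _mm_loadu_si128(reinterpret_cast<const __m128i*>(linearMemory + i));
        acc = _mm_adds_epu8(acc, val);          // saturating add, no overflow
        _mm_storeu_si128(reinterpret_cast<__m128i*>(scores + i), acc);
    }
    for (; i < n; ++i) {                        // scalar tail
        unsigned s = scores[i] + linearMemory[i];
        scores[i] = static_cast<uint8_t>(s > 255 ? 255 : s);
    }
}
```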
The above is the content described in the paper. The following is a verification of the implementation.
Diffusion in a T-neighborhood
As mentioned above, this algorithm diffuses the features to a T-neighborhood and evaluates every T-th pixel to make the computation more efficient.
Regarding the size of T, the detection results at T=8, T=4, and T=1 are shown below.
(Detection result images at T=8, T=4, and T=1.)
We can see that at T=8 the detected position deviates considerably from the object, and that the fit improves as T is decreased. This is because, when features are diffused over a neighborhood of size T, all positions within that neighborhood get the same evaluation value, so the best-fitting position cannot be narrowed down further.
This is not a problem if you only want approximate positions, but for precise positioning you need to reduce T. However, reducing T slows down detection, so there is a tradeoff between speed and accuracy.
In the application published on GitHub, both high speed and high accuracy are achieved by first performing a coarse search using T-neighborhood diffusion and pyramid processing, followed by a detailed search at T=1.
Evaluation with various images
Below are the results of evaluation with various images. We used a dataset published by the Hashimoto Laboratory of Chukyo University for the evaluation.
http://isl.sist.chukyo-u.ac.jp/Archives/archives.html
Shogi Piece
- Template image size: 149x131
- Test image size: 600x400
- Detection time: 56 [ms]
Puzzle Piece
- Template image size: 153x115
- Test image size: 600x400
- Detection time: 27 [ms]
T-shaped Plate
- Template image size: 107x121
- Test image size: 600x400
- Detection time: 380 [ms]!
Discussion of evaluation results
While the shogi piece and the puzzle piece were matched relatively quickly and with high similarity, the T-shaped plate took much longer to match.
This is likely because many objects in the scene have similar features, so many candidates were found in the initial coarse search, and the detailed search then took a long time. It is also possible that the contours were not detected well because the search target and the surrounding objects are similar in color. Adjusting the contour detection sensitivity might raise the detection similarity, but it would also make the method more vulnerable to noise.
Source Code
The complete project for the application published on GitHub is available for sale at the following site. If you would like to inspect and run the source code of the algorithms introduced here, please consider purchasing it.