1 Introduction
Due to the proliferation of smartphones, video cameras and storage devices, as well as the development of social networks, large volumes of multimedia data associated with text, image, video and audio are generated by users every day. For instance, it is reported that there are over 95 million geotagged photos on Flickr, with a daily growth rate of around 500,000 new geotagged photos. As a result, in recent years a variety of multimedia information retrieval applications have emerged and become more and more popular.
As a vital component of multimedia information retrieval applications, near duplicate detection, also known as similarity search, is a well established research topic and still attracts much attention. An element is called a near-duplicate of a reference element if it is close, according to some defined measure, to the reference element. For example, given an image and a collection, we want to find those candidates in the collection which are most similar to the input image based on their Jaccard similarity.
As an efficient solution for multimedia near duplicate detection, the MinWise approach has received substantial attention in the recent research literature. For example, in text near duplicate detection, independent random permutations are generated to convert the words of each document into a set of fingerprints to accelerate similarity search; similarly, in image near duplicate detection, independent random permutations are applied to convert the visual words of each image into a set of fingerprints.
Though the MinWise method is successful in large scale similar multimedia search, it still faces two critical challenges. First, the preprocessing cost of MinWise can be very expensive: to ensure its accuracy, hundreds of independent random permutations have to be generated. Second, MinWise is time consuming, because it has to compare all fingerprints to calculate the similarity.
To reduce the preprocessing time of MinWise, we first introduce a hashing method, OPH, which is widely used in the text search area, to multimedia near duplicate detection. Unlike MinWise, which requires hundreds of random permutations, OPH requires only a single permutation, which significantly reduces the time spent on random permutation. Based on OPH, we further divide the fingerprints into several groups and bring in the concept of a small probability event. With the assistance of the small probability event, our proposed GOPH can terminate the comparison earlier, but with similar accuracy. Meanwhile, we observe that in real applications most of the data are not similar to the input. Based on this observation, we propose a new hashing method named HOPH to further reduce the comparison time.
Contributions. The principle contributions of this paper are summarized as follows.

OPH is introduced to avoid the expensive preprocessing step of MinWise. Based on OPH, an efficient group based algorithm called GOPH is developed to reduce the comparison time.

To further accelerate the search speed, we propose a new hashing method, HOPH.

Comprehensive experiments on real and synthetic text and image datasets demonstrate that our proposed hashing method HOPH achieves substantial improvements over MinWise and OPH with similar accuracy.
Roadmap. The rest of the paper is organized as follows. Section 2 formally defines the problem of multimedia near duplicate detection, followed by the introduction of the related work. Section 3 describes OPH, GOPH and GOPH’s image application. Section 4 presents HOPH, some important theoretical analysis of HOPH and its image application. Extensive experiments are depicted in Section 5. Finally, Section 6 concludes the paper.
2 Related Work
In this section, we present some existing techniques, such as MinWise and b-bit MinWise, for the problem of text near-duplicate detection. Then we also present some general techniques related to image similarity computation.
2.1 Near-duplicate detection for text
Near-duplicate detection is one of the key problems in the areas of databases and information retrieval. It aims to detect groups of documents with almost the same contents in a document collection. Two documents with a great number of shared attributes do not necessarily count as near-duplicates. For the near-duplicate detection problem, Yang et al. DBLP:conf/sigir/YangC06 proposed an instance-level constrained clustering approach. Their framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Yang et al. DBLP:journals/jasis/HoadZ03 ; YangINF2013 emphasized that the gap between the term frequencies of rare words in two documents should be smaller than that between common words, and their best ranking is given by a term weighting function biased towards rare terms. Hassanianesfahani et al. DBLP:journals/eswa/Hassanianesfahani18 proposed a Sectional MinWise (SMinWise) for the detection of near-duplicate documents, which enhances the MinWise data structure with information about the location of the attributes in the document.
MinWise DBLP:journals/jcss/BroderCFM00 ; LINYANG13 ; DBLP:journals/siamdm/Vsemirnov04 is a locality sensitive hashing for the Jaccard similarity, and one of the most popular techniques for efficiently computing text similarity. It has a wide range of applications such as duplicate detection DBLP:conf/sigir/Henzinger06 , nearest neighbor search DBLP:conf/stoc/IndykM98 , large-scale learning DBLP:conf/nips/LiSMK11 , all-pairs similarity DBLP:conf/www/BayardoMS07 , etc. Broder et al. DBLP:journals/cn/BroderGMZ97 invented the MinWise algorithm for near-duplicate web page detection and clustering. Li et al. DBLP:conf/nips/LiSMK11 presented a simple, effective solution to large-scale learning in massive and extremely high-dimensional datasets, and indicated that b-bit MinWise is significantly more accurate than Vowpal Wabbit on binary data. The major defect of the MinWise algorithm and b-bit MinWise is that they require an expensive preprocessing step, conducting permutations on the entire dataset Li2012One . Pagh et al. DBLP:conf/pods/PaghSW14 studied and addressed the question of how many bits must be allocated to each summary in order to get an estimate with a given relative error.
2.2 Similarity computation for image
Similarity computation DBLP:conf/cikm/WangLZ13 ; DBLP:conf/mm/WangLWZZ14 for images is another important technique, which has attracted a great many researchers. The first significant problem concerning efficient similarity measures is image representation DBLP:conf/ijcai/WangZWLFP16 . The Scale Invariant Feature Transform (SIFT for short) DBLP:journals/ijcv/Lowe04 proposed by Lowe is a classical approach in the image recognition and computer vision area. It detects and describes local features in images DBLP:conf/iccv/Lowe99 . The SIFT descriptor is invariant to uniform scaling, orientation and illumination changes, and partially invariant to affine distortion. Many studies of image processing DBLP:conf/pakdd/WangLZW14 ; NNLS2018 and retrieval DBLP:conf/mm/WangLWZ15 use it as one of the basic methods. In order to identify and remove most of the false positive matches, Zhou et al. DBLP:conf/icimcs/ZhouLWLT12 proposed to generate binary SIFT descriptors for a given pair of images from the original SIFT descriptors. Zhou et al. DBLP:conf/icip/ZhouLZ16 extended SIFT-based match kernels by integrating the match functions for SIFT and CNN features DBLP:journals/cviu/WuWGHL18 ; DBLP:journals/pr/WuWGL18 ; TC2018 ; DBLP:journals/pr/WuWLG18 . In order to improve the performance of image retrieval, they proposed a thresholded exponential match kernel for CNN features to filter out the images whose semantic similarity is lower than the threshold. To obtain enhanced performance, Zhang et al. DBLP:journals/pami/ZhangYWLT15 utilized 1000 semantic attributes to revise the vocabulary tree of SIFT descriptors in the Bag-of-Words model (BoW for short), which stores only occurrence counts of vector quantized features DBLP:conf/icimcs/QuSYL13 . The Bag-of-Visual-Words model (BoVW for short) represents an image by a set of visual words, applying the SIFT descriptor and the K-means clustering algorithm DBLP:conf/cvpr/NisterS06 ; DBLP:conf/cvpr/PhilbinCISZ07 . BoVW with weighting DBLP:conf/iccv/SivicZ03 has proven to be a very successful approach for image and particular object retrieval. Nister et al. proposed a recognition scheme applying this model which scales efficiently to a large number of objects. There are two types of searching methods based on the BoVW model, namely the exact inverted file methods DBLP:journals/csur/ZobelM06 and the hashing methods DBLP:conf/civr/ChumPIZ07 . Hashing methods DBLP:conf/sigir/WangLWZZ15 ; DBLP:journals/ivc/WuW17 are used to solve the problems of image similarity measurement DBLP:journals/corr/abs170802288 and multimedia retrieval DBLP:journals/tip/WangLWZ17 ; DBLP:journals/tip/WangLWZZH15 ; DBLP:journals/tnn/WangZWLZ17 , which have drawn much attention from the community. The spectral hashing method proposed by Shao et al. DBLP:journals/prl/ShaoWOZ12 and the locality sensitive hashing method belong to the unsupervised hashing methods. Jain et al. DBLP:conf/cvpr/JainKG08 proposed a method that applies a Mahalanobis distance function that captures the images' underlying relationships well. This approach combined the MinWise method with distance metric learning. Li et al. DBLP:journals/tmm/LiWCXL13 presented a method to directly optimize the graph Laplacian by using spectral hashing combined with distance learning. For the problem of detecting near-duplicate images, Chum et al. DBLP:conf/bmvc/ChumPZ08 proposed an efficient approach based on a MinWise method, which uses a visual vocabulary of vector quantized local feature descriptors and exploits enhanced MinWise techniques for retrieval. Zhang et al. DBLP:conf/mm/ZhangC04 applied a parts-based representation of each scene by building Attributed Relational Graphs (ARG) between interest points. Based on stochastic attributed relational graph matching, they compared the similarity of two images. Torralba et al. DBLP:conf/cvpr/TorralbaFW08 proposed a method to learn short descriptors to retrieve similar images from a huge database, which is based on a dense 128-D global image descriptor. Jain et al. DBLP:conf/cvpr/JainKG08 introduced a method for an efficient extension of the Locality Sensitive Hashing scheme to the Mahalanobis distance. Apparently, both of the above approaches use bit strings as a fingerprint of the image. Philbin et al. DBLP:conf/cvpr/PhilbinCISZ07 proposed a large-scale object retrieval system and compared different scalable methods for building a vocabulary. Besides, they introduced a novel quantization method based on randomized trees to enhance the performance of image-feature vocabulary construction.
3 Basic Approach
In this section, we first formally define the problem of multimedia near duplicate detection, and then review the hashing method OPH, which is widely used in the text search area, in Section 3.2. We propose an advanced hashing method called GOPH to improve the performance of OPH in Section 3.3. Section 3.4 presents the image application of GOPH. Table 1 below summarizes the mathematical notations used throughout this paper.
Notation  Definition 

a multimedia data (query)  
the multimedia dataset  
the similarity threshold  
a random permutation  
the whole vocabulary space  
the number of “matched bins”  
the number of “empty bins”  
unbiased estimator  
variance of OPH  
the bin number of the comparison part  
the number of times that the fingerprints are equal  
the number of matched bins after comparisons  
the error tolerance  
the image dataset  
the number of matched bins of HOPH  
the number of empty bins of HOPH  
the unbiased estimator of HOPH  
the empty bin of OPH and HOPH 
3.1 Problem Definition
Near Duplicate. Given a multimedia datum , any multimedia datum such that is a near duplicate of , where is a similarity threshold. Among the available similarity comparison functions sim which might be exploited, the Jaccard similarity has been chosen in this work, since it has been widely used in different applications. Without loss of generality, this paper mainly considers two types of multimedia data: text and images.
3.2 One Permutation Hashing Review
To reduce the number of random permutations required by MinWise, Ping Li et al. proposed a signature scheme named one permutation hashing in Li2012One . To generate hundreds of samples, the traditional MinWise signature requires hundreds of random permutations; however, a single permutation is enough for OPH, which significantly reduces the time spent on random permutation.
One Permutation Hashing. First, a random permutation is generated, and for each document a one permutation hash is recorded. Consider three sets drawn from the space {0, 1, …, 15}. Assume = {1, 2, 5, 10, 12, 15}, = {1, 2, 6, 10, 12, 14}, = {2, 9, 10, 12, 14}; we apply the permutation on the three sets and present the corresponding , , as binary (0/1) vectors, as shown in Fig. 1.
Then, we divide the whole space into bins and select the first nonzero element as a sample for each bin. In the special case where there is no nonzero element in a bin, is used to represent the empty bin. In this way, a single random permutation of OPH can generate samples through the bins. Assume =4; the sample selected from is [1, 5, 10, 12], from is [1, 6, 10, 12], and from is [2, , 9, 12]. Because we only need to compare the samples within the same bin, we can use the smallest possible representation instead of the actual value. For example, for , the final representation is [1, 1, 2, 0]; for , the final representation is [1, 2, 2, 0]; similarly, for , the final representation is [2, , 1, 0].
From the above example, the sets and have 3 identical smallest possible representations and the estimated similarity will be 0.75, while the exact similarity is 0.5. The sets and share one smallest possible representation and their similarity estimate is 0.333 (0.375 is exact).
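The construction and estimation above can be sketched in Python. This is our own minimal sketch, not the paper's implementation: the identity permutation stands in for a random one so that the running example is reproduced exactly, and all function and variable names are ours.

```python
def one_permutation_hash(s, universe, k, perm):
    """One permutation hashing: permute the universe [0, universe), split it
    into k equal bins, and keep the smallest permuted value of each bin,
    re-indexed from the bin start (the "smallest possible representation").
    None marks an empty bin."""
    bin_size = universe // k
    bins = [None] * k
    for x in s:
        v = perm[x]
        j = v // bin_size          # which bin the permuted value falls in
        offset = v % bin_size      # position inside the bin
        if bins[j] is None or offset < bins[j]:
            bins[j] = offset
    return bins

def oph_estimate(h1, h2):
    """Unbiased OPH estimator: matched bins / (k - jointly empty bins)."""
    matched = sum(1 for a, b in zip(h1, h2) if a is not None and a == b)
    empty = sum(1 for a, b in zip(h1, h2) if a is None and b is None)
    return matched / (len(h1) - empty)

# Reproducing the running example with the identity permutation:
perm = list(range(16))
h1 = one_permutation_hash({1, 2, 5, 10, 12, 15}, 16, 4, perm)  # [1, 1, 2, 0]
h2 = one_permutation_hash({1, 2, 6, 10, 12, 14}, 16, 4, perm)  # [1, 2, 2, 0]
sim = oph_estimate(h1, h2)                                     # 0.75
```

With k = 4 bins over a universe of 16 elements, this reproduces the representations [1, 1, 2, 0] and [1, 2, 2, 0] and the 0.75 similarity estimate from the example above.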
Finally, we introduce some important properties of OPH. Without loss of generality, we consider two documents $S_1$ and $S_2$. First come the two fundamental quantities of OPH, $N_{mat}$ and $N_{emp}$, which represent the number of “matched bins” and the number of “empty bins” respectively:

$N_{mat} = \sum_{j=1}^{k} I_{mat,j}$ (1)

$N_{emp} = \sum_{j=1}^{k} I_{emp,j}$ (2)

where $I_{mat,j}$ and $I_{emp,j}$ are defined for the $j$-th bin as

$I_{mat,j} = 1$ if both $\pi(S_1)$ and $\pi(S_2)$ are non-empty in the $j$-th bin and their smallest elements are equal, and $I_{mat,j} = 0$ otherwise (3)

$I_{emp,j} = 1$ if both $\pi(S_1)$ and $\pi(S_2)$ are empty in the $j$-th bin, and $I_{emp,j} = 0$ otherwise (4)

Denote a random permutation $\pi: \Omega \rightarrow \Omega$. The hashed values are the two smallest possible representation sets after applying the permutation on $S_1$ and $S_2$. The probability at which the two hashed values are equal is

$\Pr\big(\min(\pi(S_1)) = \min(\pi(S_2))\big) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R$ (5)

Then the unbiased estimator of one permutation hashing is

$\hat{R} = \frac{N_{mat}}{k - N_{emp}}$ (6)

$E(\hat{R}) = R$ (7)

Based on the unbiased estimator, assume $f = |S_1 \cup S_2|$; the variance of OPH is

$Var(\hat{R}) = R(1-R)\left(E\left(\frac{1}{k - N_{emp}}\right)\left(1 + \frac{1}{f-1}\right) - \frac{1}{f-1}\right)$ (8)
3.3 Group Based One Permutation Hashing
Motivation.
While calculating the similarity, one permutation hashing has to compare the fingerprints one by one, which is time consuming. Meanwhile, one permutation hashing is claimed to satisfy the binomial distribution in Li2012One . Thus, based on this property, we design a new algorithm named group based one permutation hashing (GOPH) to reduce the comparison time.
Basic Idea. After applying one permutation to the documents, we first aggregate the generated fingerprints into groups. Then, from the first group to the last, we progressively compute the similarity between the corresponding groups, and bring in the concept of a small probability event as a filter to accelerate the comparison for the remaining groups.
Assume is the bin number of the compared part and is the number of times that the fingerprints are equal in the compared part, denoted as
(9) 
Assume is the estimator of the number of matched bins after comparisons. Because each comparison has only two possible outcomes, “match” or “unmatch”, which are mutually exclusive, and each comparison result is independent of the results of the other comparisons, obviously satisfies the binomial distribution, denoted as $T$. Thus, the distribution function of the variable $T$ is as follows:

$F(x) = P(T \le x) = \sum_{i=0}^{\lfloor x \rfloor} \binom{n}{i} p^i (1-p)^{n-i}$ (10)
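The binomial distribution function above can be evaluated by direct summation of the probability mass function; a minimal Python sketch (the function name is ours):

```python
from math import comb

def binom_cdf(n, p, x):
    """P(T <= x) for T ~ Binomial(n, p), by direct summation of the pmf."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(int(x) + 1))

# e.g. for T ~ B(4, 0.5): P(T <= 2) = (1 + 4 + 6) / 16 = 0.6875
```

With the parameters used later in Example 1 (100 bins, similarity threshold 0.6), values such as P(T ≤ 54) ≈ 0.1311 line up with the entries of Table 2.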
In the following, we first introduce the concept of a small probability event, and then introduce how the small probability event is used to avoid unnecessary comparisons for the remaining groups.
Definition 1 (Small Probability Event)
Given an error tolerance , let be the distribution function of the variable ; an event is called a small probability event if and only if its probability does not exceed .
Assume indicates the expected average matched bins for each group, and represents the minimum average of the remaining groups if the final average matched bins are to be no less than according to the current binomial distribution. Assume the event indicates that can eventually reach ; if the probability of the event is less than the error tolerance , we can regard the event as a small probability event and terminate the subsequent calculation after the corresponding processing. More specifically, while comparing the values of and , we have the following two cases:
i) If , for the remaining groups, we compute whether is smaller than . If it is, we consider that it is almost impossible for these two documents to meet the similarity requirement, label them as dissimilar documents, and terminate the algorithm; otherwise, we take the next group into further consideration.
ii) If , for the remaining groups, we still compare the values of and . If is smaller than , we consider that these two documents should be a similar document pair, add them to the result set, and terminate the algorithm; otherwise, we take the next group into further consideration.
Algorithm. Algorithm 1 illustrates the implementation details of the document pair comparison based on GOPH. In Line 1, we first initialize the current group number to 1 and the current matched bins to 0, and calculate from the input values , , . Then, for each group of the document pair , we gradually calculate their similarity. From Line 3 to Line 8, we update based on the current group's information. After computing the value of in Line 9, we compare its value with . From Line 10 to Line 17, if , we further compare with ; if its value is larger than , we continue to compare the similarity of the next group; otherwise, we consider the current document pair a similar pair and break the algorithm. From Line 18 to Line 24, if and is larger than , we continue to compare the similarity of the next group; otherwise, we break the algorithm and consider the current document pair not similar. The following is an example of the GOPH comparison algorithm.
Example 1
Given a document pair , assume =100, =10, =0.6, and . Then, the values of for different are shown in Table 2 and the value of is 60. Assume there are 65 bins matching in the 1st comparison; the value of is 59.4. Because , and is larger than , we continue to compare the next group. Assume only 5 bins match in the 2nd comparison; the value of is 66.25. Because , and is larger than , we continue to compare the next group. Assume that in the 3rd comparison there are 10 bins matching; similarly, the value of is 74.28. Because , and is larger than , we continue to compare the next group. In the 4th comparison, assume there are 10 bins matching; the value of is 85. Obviously, , so we continue to compare the value of and . Because the probability of equals 5.0732E-08, which is smaller than , the event that the document pair is a similar pair is a small probability event. Thus, we terminate the algorithm and consider the current document pair not a similar pair.
x  P(T < x)  P(T ≥ x)  x  P(T < x)  P(T ≥ x) 

5  3.27948E-33  1  55  0.131090453  0.868909547 

10  1.25639E-26  1  60  0.456705514  0.543294486 

15  2.31928E-21  1  65  0.820530647  0.179469353 

20  5.56419E-17  1  70  0.975217177  0.024782823 

25  2.71442E-13  1  75  0.998810999  0.001189001 

30  3.46422E-10  1  80  0.999983588  1.64119E-05 

35  1.35466E-07  0.999999865  85  0.999999949  5.0732E-08 

40  1.80415E-05  0.999981959  90  1  2.33876E-11 

45  0.000881808  0.999118192  95  1  7.01609E-16 

50  0.016761687  0.983238313  100  1  6.53319E-23 
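Example 1's stopping rule can be sketched end-to-end. The following is our own illustrative Python sketch of a group-based comparison in the spirit of GOPH, not the authors' Algorithm 1: it scans the fingerprints group by group and stops as soon as reaching (or missing) the required number of matched bins becomes a small probability event under the binomial model. All names and the rounding of the threshold are our assumptions.

```python
from math import comb

def tail_ge(n, p, x):
    """P(T >= x) for T ~ Binomial(n, p)."""
    x = max(0, x)
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(x, n + 1))

def goph_compare(h1, h2, group_size, threshold, eps):
    """Compare two OPH fingerprint vectors group by group and stop early
    once the final outcome is (almost) certain.  None marks an empty bin.
    Returns True if the pair is judged similar, False otherwise."""
    k = len(h1)
    need = int(round(threshold * k))      # matched bins needed overall
    matched = 0
    for start in range(0, k, group_size):
        matched += sum(1 for a, b in zip(h1[start:start + group_size],
                                         h2[start:start + group_size])
                       if a is not None and a == b)
        compared = min(start + group_size, k)
        remaining = k - compared
        if remaining == 0:
            break
        if matched >= need:               # already over the threshold
            return True
        if need - matched > remaining:    # can never reach the threshold
            return False
        p = matched / compared            # observed per-bin match rate
        reach = tail_ge(remaining, p, need - matched)
        if reach < eps:                   # reaching `need` is a small probability event
            return False
        if 1.0 - reach < eps:             # missing `need` is a small probability event
            return True
    return matched >= need
```

For identical fingerprints the loop stops after the first group with a "similar" verdict; for disjoint fingerprints it stops just as quickly with "dissimilar", which is the effect the early-termination filter is after.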
3.4 GOPH for Image Near-duplicate Detection
In this subsection, we describe how we extend the GOPH method, originally developed for text near-duplicate detection, to image near-duplicate detection. We obtain it by using visual words in place of the textual words used above.
Two images are near duplicates if the similarity sim(,) is higher than a given similarity threshold . The goal is to retrieve all images in the database that are similar to a query image. The outline of the image near-duplicate detection algorithm is as follows. First, a list of visual words is extracted from each image. A visual word is a single number having the property that two images , have the same value of a visual word with probability equal to their similarity sim(,). To efficiently compare the visual words, one permutation is used to evenly divide these visual words into bins and retrieve the smallest possible representations as fingerprints. Then, we adopt the GOPH algorithm to compute the similarity between the two fingerprint sets.
How does it work? Consider image = arg min . Since is a one permutation, each fingerprint of has the same probability of being the smallest possible fingerprint. Hence, can be constructed from . If is a fingerprint of both and , i.e. , then min = = . Otherwise, if , then ; if , then . Thus, for a one permutation it follows
(11) 
To enhance the efficiency of comparison, the fingerprints are grouped into tuples. Similar to text comparison, from the first summary to the last, we gradually compute the similarity between the corresponding summaries and estimate whether the remaining summaries will trigger the small probability event. If they will, the algorithm terminates; otherwise, we continue to calculate the fingerprints of the next summary.
4 Our Strategy: Hierarchical One Permutation Hashing
This section presents Hierarchical One Permutation Hashing (HOPH) for efficient multimedia near duplicate detection. Section 4.1 first explains the basic idea of HOPH. Then, Section 4.2 presents some theoretical analysis of HOPH, and Section 4.3 discusses the image implementation of HOPH.
4.1 Hierarchical One Permutation Hashing Overview
Unlike traditional OPH, which evenly divides the whole space into buckets, the HOPH scheme first groups the original data entries into two groups in each iteration, namely a permutation group and a division group. The space of each permutation group is evenly divided into parts, while the space of each division group is further divided into a permutation group and a division group sequentially, if the sub-group size is greater than . More specifically, assume , HOPH divides the space into two groups with proportion , namely (a:b)HOPH. Thus, after the first iteration, the whole space is divided into two groups and with sizes , respectively. Then, we divide the space of the first group evenly into parts. For the second group , we check whether the numbers of data entries of its sub-groups are larger than after division; if so, HOPH continues to divide the second group into two parts with ; otherwise, HOPH terminates the division. Figure 2 shows the example of (1:1)HOPH.
Similar to OPH, a random permutation is generated first, and for each document a one permutation hash is recorded. Consider three sets drawn from the space {0, 1, …, 15}. Assume = {1, 2, 5, 10, 12, 15}, = {1, 2, 6, 10, 12, 14}, = {2, 9, 10, 12, 14}, and and . Then, we divide the space with (1:1)HOPH into three groups. After that, we apply the permutation on the three groups and evenly divide the space into buckets for each group, selecting the smallest nonzero element in each bucket as the sample. For example, [[1, 5], [, 10], [12, 15]], [[1, 6], [, 10], [12, 14]], and [[2, ], [9, 10], [12, 14]]. We use to denote an empty bin, which occurs rarely when the number of nonzeros is large compared to the number of bins.
In the example in Figure 2 (which includes 3 documents), the sample selected from is [[1, 5], [, 10], [12, 15]]. We reindex the elements of each bucket to use the smallest possible representations, because only elements with the same bin number need to be compared. For example, for , after reindexing, the sample [[1, 5], [, 10], [12, 15]] becomes [[1, 1], [, 0], [0, 1]]. Similarly, for , the original sample [[1, 6], [, 10], [12, 14]] becomes [[1, 2], [, 0], [0, 0]], etc.
From the above example, the sets and have one identical smallest possible representation in each subgroup and the estimated similarity will be 0.625, while the exact similarity is 0.5. The sets and share one smallest possible representation and their similarity estimate is 0.375 (0.375 is exact).
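The recursive (a:b) space division described above can be sketched as follows. This is our own illustrative sketch, not the authors' code: the function name, the `min_size` termination bound and the integer splitting are assumptions.

```python
def hoph_partition(size, a, b, min_size):
    """Sketch of the (a:b)-HOPH space division.

    Each iteration splits the remaining space into a front (permutation)
    part and a back (division) part with ratio a:b.  The front part becomes
    a finished group (later divided evenly into bins); the back part is
    split again while both new parts stay above min_size, otherwise it is
    kept whole as the last group."""
    groups = []
    remaining = size
    while True:
        front = remaining * a // (a + b)
        back = remaining - front
        if front <= min_size or back <= min_size:
            groups.append(remaining)      # too small to split further
            return groups
        groups.append(front)
        remaining = back

# (1:1)-HOPH on a space of 16 elements yields three groups, as in Figure 2:
# hoph_partition(16, 1, 1, 2) -> [8, 4, 4]
```

Note that the group sizes always sum back to the original space size, so every element of the vocabulary still lands in exactly one bucket.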
Algorithm. Algorithm 2 illustrates the implementation details of the (a:b)HOPH comparison for the input document pair. For simplicity of presentation, we assume there are groups in each document, represents the average matched probability that should be achieved, represents the expected average matched probability for the groups that have not been compared yet, and an array is used to store the number of matched bins for each group in Line 1. For each group of the document pair , we gradually calculate their similarity. From Line 3 to Line 8, we update based on the th group's information. Then, we compute the value of in Line 11 and compare it with . From Line 12 to Line 19, if , we further compare with ; if its value is larger than , we continue to compare the similarity of the next group; otherwise, we consider the current document pair a similar pair and break the algorithm. From Line 20 to Line 26, if and is larger than , we continue to compare the similarity of the next group; otherwise, we break the algorithm and consider the current document pair not similar. The following is an example of the HOPH comparison algorithm.
Example 2
Given a document pair , assume =100, =10, =0.6, and . Then, the values of for different are shown in Table 2 and the value of is 60. Assume there are only 40 bins matching in the 1st comparison; then the value of is 0.8. Because , we further compare the value of and . Because is 80, and the probability of equals 1.64119E-05, which is smaller than , the event that the document pair is a similar pair is a small probability event. Thus, we terminate the algorithm and consider the current pair not a similar pair.
In Algorithm 2, Lines 3–8 compute the number of the bins which have equal values, and the cost is . Lines 11–25 search for the similar document pairs by the probabilities and ; the cost is . Assume ; the inverse function of is denoted as , thus . According to the aforementioned theory, , and we can compute . Therefore, the total cost of Algorithm 2 is .
4.2 Theoretical Analysis of HOPH
In the following, we introduce some theoretical analysis of HOPH, such as the number of matched bins, the number of empty bins, the unbiased estimator and so on.
Assume ; HOPH first divides the whole space into groups with lengths , and respectively, then divides the corresponding spaces evenly into bins, as shown in Fig. 3. Based on the above assumptions, HOPH has the following properties:
Lemma 1
The number of matched bins of HOPH is
Proof
As shown in Fig. 3, assume HOPH consists of groups from different one permutation hashes, which are evenly divided into bins respectively.
Since,
Then,
Since,
We obtain,
thus completing the proof.
Lemma 2
The number of empty bins of HOPH is
Lemma 3
The unbiased estimator of HOPH is
4.3 HOPH for Image Nearduplicate Detection
In terms of image near-duplicate detection, the construction and comparison methods of HOPH are similar to those of GOPH. Given an image collection, we first extract the visual word list from each image. In the construction stage, we first generate a random permutation , then apply the HOPH scheme to recursively divide the whole visual word space into two groups in each iteration. For the front group, we apply the permutation to it and evenly divide the space into buckets. For the latter group, we further divide the space into two groups again if its sub-group size is greater than . In the comparison stage, given two HOPH groups, we gradually compute the similarity between the corresponding groups from the first to the last, and estimate whether the remaining part will trigger the small probability event. If it will, the algorithm terminates and outputs the result; otherwise, the fingerprints of the next group will be calculated for further evaluation.
5 Performance Evaluation
In this section, we present the results of a comprehensive performance study to evaluate the efficiency and scalability of the techniques proposed in the paper. In our implementation, we evaluate the effectiveness of the following hashing techniques.

MinWise. MinWise hashing, which is a natural implementation of the method in DBLP:journals/jcss/BroderCFM00 .

OPH. One permutation hashing, which is a natural implementation of the method in Li2012One .

GOPH. The group based one permutation hashing technique proposed in Section 3.3.

(1:1)HOPH. The hierarchical one permutation hashing whose ratio of a to b equals 1:1.

(1:2)HOPH. The hierarchical one permutation hashing whose ratio of a to b equals 1:2.

(2:1)HOPH. The hierarchical one permutation hashing whose ratio of a to b equals 2:1.
Environment Settings. Experiments are run on a PC with an Intel i7-6700HQ 2.60GHz CPU and 16GB memory running Ubuntu 16.04 LTS. All algorithms in the experiments are implemented in Java.
Workload. A workload for this experiment consists of 100 input queries, and the precision, recall and response time are employed to evaluate the performance of the algorithms. By default, we set the error tolerance , the user-preferred similarity threshold , and the data number .
Performance metrics. The objective evaluation of the proposed approach is carried out based on precision and recall. Precision measures the accuracy of the retrieval: it is the ratio of retrieved documents or images that are relevant to the query. Recall measures the robustness of the retrieval: it is defined as the ratio of relevant items in the database that are retrieved in response to a query.
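These two measures can be pinned down in code; a small self-contained sketch (the function and variable names are ours):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. retrieving {1, 2, 3, 4} when {2, 3, 5} are relevant:
# precision = 2/4 = 0.5, recall = 2/3
```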
5.1 Evaluation on Text Dataset (FS)
Dataset. The performance of the various algorithms is evaluated on the real dataset FundSet (FS). FS is obtained from the large-scale document database of NSFC, in which each document is an NSFC proposal in PDF format; some fund proposal documents are taken as the data source.
Evaluating the training group number. First, we train the group number of text documents . Fig. (a) shows that as increases, the response time is reduced, but the downtrend slows down, and at the response time is minimal. This indicates that the performance cannot be boosted indefinitely by gradually increasing . On the other hand, as shown in Fig. (b), no matter what the value is, the precision is very high, nearly 100%. Thus, we choose as the default value for GOPH.
Evaluation on different data numbers. We investigate the response time, precision and recall in Fig. against the dataset FS, where the other parameters are set to their default values. Fig. (a) depicts the precision of GOPH, 1:2HOPH, 1:1HOPH, 2:1HOPH, MinHash and OPH. Obviously, MinHash and OPH have the highest precision. When the data number is larger than , the growth of precision slows down. Fig. (b) demonstrates that the recalls of all of them climb with the increase of the data number, and there is little difference between them. Fig. (c) shows that with the increase of the data number, the response times of these 6 algorithms gradually rise. At data number , all of the values are in the range between 150 and 250. But when the data number increases to 100, we can see that the performance of 1:1HOPH and 2:1HOPH is much better than that of MinHash and OPH. In particular, 2:1HOPH has the smallest response time among the algorithms, because the number of filter segments is the largest in 2:1HOPH. On the other hand, 1:1HOPH is the most suitable because of its balance of precision, recall and speed. As the above evaluation shows, in terms of efficiency 1:1HOPH is higher than 1:2HOPH. Besides, the response times of 1:1HOPH and 2:1HOPH are almost the same, but the precision of 1:1HOPH is higher than the other. Meanwhile, as the accuracy of GOPH is close to OPH and its response time is two or three times faster than OPH, we select GOPH and 1:1HOPH to conduct the following comparison evaluation.
Evaluation on different threshold. Fig.(a) depicts that as the threshold rises, the precision of MinHash, GOPH and 1:1HOPH slowly increases; all of them exceed 98% when the threshold is larger than 0.7. The precision of MinHash is slightly higher than that of 1:1HOPH. Fig.(b) illustrates that the three recalls follow the same trend and are near 100% when the threshold is larger than 0.8. As shown in Fig.(c), GOPH and 1:1HOPH demonstrate superior performance in comparison with MinHash, and their response times decline as the threshold climbs. Obviously, 1:1HOPH has the smallest response time of the three.
Evaluation on Small Probability. Fig.(a) reports the precision of MinHash, GOPH and 1:1HOPH. The precision of MinHash is invariant while the precision of GOPH and 1:1HOPH grows slowly; all of them are very high, with MinHash nearly 99% and the other two above 98%. Fig.(b) shows the recall of GOPH, MinHash and 1:1HOPH; there is little difference between them, and all gradually increase to nearly 100% when the error tolerance equals 1.00E-05. In Fig.(c), the performance of 1:1HOPH is nearly 5 times that of MinHash. When we change the error tolerance to a smaller value, the improvement in the performance of GOPH and 1:1HOPH is not obvious.
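The error tolerance swept here is the probability bound behind GOPH's early termination: once the matches observed so far would constitute a small-probability event for any truly similar pair, the remaining groups need not be compared. A minimal sketch of that idea, using a binomial tail as the small-probability event (our assumption for illustration; the exact stopping rule of GOPH is not reproduced here):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def grouped_compare(sig_q, sig_c, group_size, threshold, eps):
    """Compare two equal-length signatures group by group.
    Stop early once seeing so few matches is a small-probability event
    (< eps) under the hypothesis that the true similarity >= threshold.
    Returns (estimated_similarity_or_None, slots_compared)."""
    matches = seen = 0
    n = len(sig_q)
    for start in range(0, n, group_size):
        for u, v in zip(sig_q[start:start + group_size],
                        sig_c[start:start + group_size]):
            matches += u == v
            seen += 1
        # tail probability of the observed match count so far
        if binom_cdf(matches, seen, threshold) < eps:
            return None, seen  # reject early: cannot reach the threshold
    return matches / n, n
```

A clearly dissimilar candidate is rejected after a single group in this sketch, which mirrors why a looser error tolerance trades a little accuracy for substantially fewer fingerprint comparisons.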
5.2 Evaluation on Image Dataset (IS)
Dataset. Our empirical studies evaluate the performance of the filter against a subset of the ImageNet dataset. ImageNet is the largest image dataset for image processing and computer vision. It is organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. This dataset includes: (1) 14,197,122 images; (2) 1,034,908 images with bounding box annotations; (3) 1000 synsets with SIFT features; (4) 1.2 million images with SIFT features. We generate an image dataset (IS) by selecting 1 million images from ImageNet. On IS, we evaluate the precision and efficiency of GOPH, 1:2HOPH, 1:1HOPH, 2:1HOPH, MinHash and OPH.
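On IS, each image must first be reduced to a set of visual words before any of these hashing schemes apply: local SIFT descriptors are quantized against a codebook, and the resulting word indices form the set whose Jaccard similarity is estimated. A minimal sketch of that quantization step (the brute-force nearest-centroid search and plain-list codebook are illustrative simplifications of a real vocabulary):

```python
def quantize(descriptors, codebook):
    """Map each local descriptor to the index of its nearest codeword
    (visual word). Returns the SET of visual words, ready for Jaccard /
    MinHash-style comparison. A production system would use an
    approximate nearest-neighbor index instead of a linear scan."""
    def nearest(d):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(d, codebook[i])))
    return {nearest(d) for d in descriptors}
```

Note that the output is a set, not a multiset: repeated occurrences of the same visual word collapse to one element, which is exactly what set-based similarity measures assume.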
Evaluating training group number. Firstly, we train the group number. Fig.(a) illustrates that as the group number rises, the response time goes down gradually, but the downtrend slows until the response time reaches its minimum; performance cannot be boosted indefinitely by increasing the group number. On the other hand, as shown in Fig.(b), the precision increases step by step with fluctuations as the group number rises, reaching nearly 99.4%. Hence we choose the value at which the response time is minimal as the default for GOPH.
Evaluation on different data number. We evaluate the precision, recall and response time of the six algorithms against IS. The precision is shown in Fig.(a). Apparently, all the precisions fluctuate in the range of 97.1% to 98.7% as Data Num increases, and the precision of MinHash and OPH is higher than the others. Fig.(b) shows that there is little difference in recall over the six algorithms, and all of them approach 100% as the number of data increases. Fig.(c) demonstrates that the response-time trends of MinHash and OPH are almost the same and much higher than the others. In particular, 2:1HOPH and 1:1HOPH significantly outperform the other four algorithms. 1:1HOPH dominates 2:1HOPH in precision, while its efficiency is not lower than that of 2:1HOPH. On the other hand, the response time of 1:2HOPH is higher than that of 1:1HOPH. Furthermore, the efficiency of GOPH is much higher than that of OPH. Therefore, in the evaluations on different thresholds and small probability, we compare the two algorithms mentioned above (GOPH and 1:1HOPH) with MinHash.
Evaluation on different threshold. The precision of GOPH, MinHash and 1:1HOPH under different thresholds is shown in Fig.(a). All precisions fluctuate in the interval of 97.9% to 98.7%. The precision of MinHash is slightly higher than that of 1:1HOPH and GOPH. As shown in Fig.(b), the recalls of these algorithms follow the same trend, ascending gradually as the threshold increases from 0.5 to 0.7. Fig.(c) shows that the response times of GOPH and 1:1HOPH decline step by step while that of MinHash remains unchanged. As expected, 1:1HOPH performs best among them, and when the threshold is smaller than 0.8 its response time remains the lowest.
Evaluation on Small Probability. In Fig.(a), we evaluate the precision of GOPH, MinHash and 1:1HOPH. The precision of MinHash stays at a very high level, a little higher than the others, which rise slowly as the error tolerance varies from 1.00E-03 to 1.00E-04 and are almost invariant afterwards. On the other hand, as shown in Fig.(b), the recalls of these algorithms go up moderately and are nearly 100% when the error tolerance is 1.00E-05. We can see from Fig.(c) that, as the error tolerance changes, the performance of MinHash remains the same while the others change only slightly. As on dataset FS, the performance of 1:1HOPH is much better than that of the other two.
6 Conclusion
The problem of multimedia near duplicate detection is important due to the increasing amount of multimedia data collected in a wide spectrum of applications. In this paper, we introduce OPH to reduce the costly preprocessing time. Based on OPH, we propose GOPH to accelerate the comparison. We then design a novel hashing method, HOPH, to further improve performance. Both GOPH and HOPH can be easily extended to image near duplicate detection. Finally, our comprehensive experiments convincingly demonstrate the efficiency of our techniques.
Acknowledgments: This work was supported in part by the National Natural Science Foundation of China (61379110, 61472450, 61702560), the Key Research Program of Hunan Province (2016JC2018), and projects (2016JC2011, 2018JJ3691) of the Science and Technology Plan of Hunan Province.
References
 (1) Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pp. 131–140 (2007)
 (2) Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)
 (3) Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Computer Networks 29(8-13), 1157–1166 (1997)
 (4) Chum, O., Philbin, J., Isard, M., Zisserman, A.: Scalable near identical image and shot detection. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR 2007, Amsterdam, The Netherlands, July 9-11, 2007, pp. 549–556 (2007)
 (5) Chum, O., Philbin, J., Zisserman, A.: Near duplicate image detection: min-hash and tf-idf weighting. In: Proceedings of the British Machine Vision Conference 2008, Leeds, September 2008, pp. 1–10 (2008)
 (6) Hassanian-esfahani, R., Kargar, M.J.: Sectional minhash for near-duplicate detection. Expert Syst. Appl. 99, 203–212 (2018)
 (7) Henzinger, M.R.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006, pp. 284–291 (2006)
 (8) Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarized documents. JASIST 54(3), 203–215 (2003)

 (9) Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, pp. 604–613 (1998)
 (10) Jain, P., Kulis, B., Grauman, K.: Fast image search for learned metrics. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA (2008)
 (11) Li, P., Owen, A., Zhang, C.H.: One permutation hashing for efficient search and learning. Mathematics (2012)
 (12) Li, P., Shrivastava, A., Moore, J.L., König, A.C.: Hashing algorithms for large-scale learning. In: Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pp. 2672–2680 (2011)
 (13) Li, P., Wang, M., Cheng, J., Xu, C., Lu, H.: Spectral hashing with semantically consistent graph for image indexing. IEEE Trans. Multimedia 15(1), 141–152 (2013)
 (14) Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999)
 (15) Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
 (16) Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17-22 June 2006, New York, NY, USA, pp. 2161–2168 (2006)
 (17) Pagh, R., Stöckel, M., Woodruff, D.P.: Is min-wise hashing optimal for summarizing set intersection? In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, pp. 109–120 (2014)
 (18) Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA (2007)
 (19) Qu, Y., Song, S., Yang, J., Li, J.: Spatial minhash for similar image search. In: International Conference on Internet Multimedia Computing and Service, ICIMCS '13, Huangshan, China, August 17-19, 2013, pp. 287–290 (2013)
 (20) Shao, J., Wu, F., Ouyang, C., Zhang, X.: Sparse spectral hashing. Pattern Recognition Letters 33(3), 271–277 (2012)
 (21) Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, pp. 1470–1477 (2003)
 (22) Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for recognition. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA (2008)
 (23) Vsemirnov, M.: Automorphisms of projective spaces and min-wise independent sets of permutations. SIAM J. Discrete Math. 18(3), 592–607 (2004)
 (24) Wang, Y., Huang, X., Wu, L.: Clustering via geometric median shift over Riemannian manifolds. Information Sciences 220, 292–305 (2013)
 (25) Wang, Y., Lin, X., Wu, L., Zhang, W.: Effective multi-query expansions: Robust landmark retrieval. In: Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, Brisbane, Australia, October 26-30, 2015, pp. 79–88 (2015)
 (26) Wang, Y., Lin, X., Wu, L., Zhang, W.: Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval. IEEE Trans. Image Processing 26(3), 1393–1404 (2017)
 (27) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: Exploiting correlation consensus: Towards subspace clustering for multi-modal data. In: Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03-07, 2014, pp. 981–984 (2014)
 (28) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: LBMCH: learning bridging mapping for cross-modal hashing. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pp. 999–1002 (2015)
 (29) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q., Huang, X.: Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Trans. Image Processing 24(11), 3939–3949 (2015)
 (30) Wang, Y., Lin, X., Zhang, Q.: Towards metric fusion on multi-view data: a cross-view based graph random walk approach. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 – November 1, 2013, pp. 805–810 (2013)
 (31) Wang, Y., Lin, X., Zhang, Q., Wu, L.: Shifting hypergraphs by probabilistic voting. In: Advances in Knowledge Discovery and Data Mining, 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part II, pp. 234–246 (2014)

 (32) Wang, Y., Wu, L.: Beyond low-rank representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering. Neural Networks 103, 1–8 (2018)
 (33) Wang, Y., Wu, L., Lin, X., Gao, J.: Multi-view spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Networks and Learning Systems (2018)
 (34) Wang, Y., Zhang, W., Wu, L., Lin, X., Fang, M., Pan, S.: Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 2153–2159 (2016)
 (35) Wang, Y., Zhang, W., Wu, L., Lin, X., Zhao, X.: Unsupervised metric fusion over multi-view data by graph random walk-based cross-view diffusion. IEEE Trans. Neural Netw. Learning Syst. 28(1), 57–70 (2017)
 (36) Wu, L., Wang, Y.: Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions. Image Vision Comput. 57, 58–66 (2017)
 (37) Wu, L., Wang, Y., Gao, J., Li, X.: Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognition 73, 275–288 (2018)

 (38) Wu, L., Wang, Y., Ge, Z., Hu, Q., Li, X.: Structured deep hashing with convolutional neural networks for fast person re-identification. Computer Vision and Image Understanding 167, 63–73 (2018)
 (39) Wu, L., Wang, Y., Li, X., Gao, J.: Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Trans. Cybernetics (2018)
 (40) Wu, L., Wang, Y., Li, X., Gao, J.: What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition 76, 727–738 (2018)
 (41) Wu, L., Wang, Y., Shepherd, J.: Efficient image and tag co-ranking: a bregman divergence optimization method. In: ACM Multimedia (2013)
 (42) Yang, H., Callan, J.P.: Near-duplicate detection by instance-level constrained clustering. In: SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006, pp. 421–428 (2006)
 (43) Zhang, D., Chang, S.: Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In: Proceedings of the 12th ACM International Conference on Multimedia, New York, NY, USA, October 10-16, 2004, pp. 877–884 (2004)
 (44) Zhang, S., Yang, M., Wang, X., Lin, Y., Tian, Q.: Semantic-aware co-indexing for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 37(12), 2573–2587 (2015)
 (45) Zhou, D., Li, X., Zhang, Y.: A novel CNN-based match kernel for image retrieval. In: 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016, pp. 2445–2449 (2016)
 (46) Zhou, W., Li, H., Wang, M., Lu, Y., Tian, Q.: Binary SIFT: towards efficient feature matching verification for image search. In: The 4th International Conference on Internet Multimedia Computing and Service, ICIMCS '12, Wuhan, China, September 9-11, 2012, pp. 1–6 (2012)
 (47) Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38(2), 6 (2006)