
Video Query Reformulation for Near-Duplicate Detection

Chih-Yi Chiu, Member, IEEE, Sheng-Yang Li, and Cheng-Yu Hsieh

Abstract—In this paper, we present a novel near-duplicate video detection approach based on video query reformulation to expedite the video subsequence search process. The proposed video query reformulation method addresses two key issues: (1) how to efficiently skip unnecessary subsequence matches and (2) how to effectively increase the skip probability. First, we present an incremental update mechanism that rapidly estimates the similarity between two video subsequences to skip unnecessary matches. Second, we formulate subsequence partition as an optimization problem that increases the skip probability; a trust-region-based gradient descent algorithm is applied to solve it. Extensive experiments cover various feature representations, subsequence granularities, and baseline methods; the results demonstrate that the proposed query reformulation method is robust and efficient in dealing with a variety of near-duplicates in a large-scale video dataset.

Index Terms—content-based retrieval, near-duplicate detection

I. INTRODUCTION


Near-duplicate video detection (NDVD) techniques have attracted considerable research attention in recent years. With the rapid growth of the multimedia and network industries, video content can be easily copied, edited, and disseminated over the Internet. NDVD techniques help content owners search for and manage video copies and near-duplicates in various applications, such as copyright protection, data mining, commercial detection, topic tracking, piracy removal, tag suggestion, and search result reranking and clustering [6][15][19][21]. NDVD techniques can be generally classified into two categories: whole-video search and subsequence search. Whole-video search summarizes the full content of a video clip into a compact signature. Although whole-video search can be accomplished very efficiently, its robustness to various video transformations can be fragile, in particular under partial-temporal transformations that alter the frame context of the source video (e.g., random insertion/deletion operations). An alternative approach is to partition the video clip into shorter subsequences and use them to search for the snippet of source content. Basically, the subsequence granularity is a tradeoff between robustness and efficiency: using a number of shorter subsequences improves the probability of finding the source snippet but increases the computation overhead.

Manuscript received ......; revised ...... This work was supported in part by the National Science Council of Taiwan under Grant NSC 100-2221-E-415-017-. The authors are with the Department of Computer Science and Information Engineering, National Chiayi University, Chiayi City, 60004, Taiwan (phone: +886-5-2717228; fax: +886-5-2717705; e-mail: [email protected], [email protected], [email protected]).

In this paper, we propose a novel subsequence search approach to address the efficiency issue. Given a query video, it is partitioned into non-overlapped query subsequences. As these query subsequences usually contain similar content, we leverage this characteristic to derive similarity upper bounds among them, which are exploited to prune unnecessary subsequence matches in the search process. Further, the partition task is formulated as an optimization problem of minimizing the difference between query subsequences. We present a trust-region-based gradient descent algorithm to iteratively update the match priority and boundaries of the query subsequences; this process is called video query reformulation. With the proposed query reformulation method, the added computation overhead is limited. Extensive experiments covering different feature representations, subsequence granularities, and baseline methods demonstrate that the proposed video query reformulation method is robust and efficient in dealing with a variety of near-duplicates in a large-scale video dataset.

The proposed query reformulation method makes our approach different from video shot search, which partitions a video clip at shot boundaries, i.e., the abrupt and gradual transition points. Each shot is an aggregation of contiguous frames, and its content should be clearly distinguishable from that of adjacent shots. However, under our query reformulation concept, more computation cost can be saved when the difference between query subsequences is smaller; this contradicts the basic definition of shots. Besides, if no shot boundary is found in the query clip, shot search degenerates into whole-video search.

The remainder of this paper is organized as follows. Section II reviews related work in NDVD. Section III describes the proposed NDVD framework in detail. We show and discuss the experiment results in Section IV, and summarize our conclusions in Section V.

II. RELATED WORK

Whole-video search generally adopts a compact representation to shorten the feature dimensions, making the matching process more efficient. For example, Cheung and Zakhor [2] defined the video similarity of two video clips as the intersection of their frame clusters; a video clip was represented by randomly sampling a small set of frames, and the video similarity was calculated by counting the number of sampled frame pairs that were similar enough. Shen et al. [13] extended the same idea to characterize a video clip by a hypersphere, and the similarity of two video clips was defined as the intersection volume of their hyperspheres. Wu et al. [19] extracted the middle frames of video shots from a video clip. These keyframes were characterized by the global color histogram and the PCA-SIFT descriptor for coarse filtering and


fine matching, respectively. Shang et al. [12] presented a spatiotemporal feature by modeling the ordinal relations and the shingling properties of video frames; a video clip was then represented as the histogram distribution of its frames' features. Song et al. [16] trained multiple feature hashing functions to map each video clip into a series of binary codes, and used the bit XOR operation in the Hamming space to perform quick matching. Although the compact feature representation effectively accelerates whole-video search, it discards some useful information (e.g., temporal context) and suffers from weaker discriminative power.

On the other hand, subsequence matching usually leverages the inherent temporal information to enhance the video discriminative power. For example, Hua et al. [7] applied a dynamic programming algorithm to handle minor temporal distortion for video sequence alignment. Chiu et al. [3] defined a transition graph of two video sequences; the matching was transformed into a shortest-path finding problem that can be accelerated by heuristic rules. Shen et al. [14] modeled the matching problem as a bipartite graph and solved it by the maximum size matching algorithm. Tan et al. [17] constructed a temporal network of the top-k similar frames and linked the frames with edges based on several temporal constraints; the frame matching task was thus transformed into finding the maximum flow path in the network. Chiu et al. [4] proposed a spatiotemporal matching method to find specific patterns exhibited on a 2D intensity map, a visualization of frame pair similarities between two video sequences; a pattern indicates a set of consecutive frame pairs with high similarities, i.e., a near-duplicate candidate. Yeh and Cheng [20] built a dot plot of two video sequences, where a dot represents a highly similar frame pair, and identified a near-duplicate candidate by finding a long approximate diagonal on the dot plot. Subsequence matching can incur a substantial computation cost, particularly for matching long and heavily distorted video sequences. In the following, we address the efficiency issue of subsequence matching and present methods to alleviate this problem.

III. THE PROPOSED FRAMEWORK

Our NDVD framework follows a coarse-to-fine search strategy, as shown in Figure 1. Suppose a query Q is issued against a dataset of target video clips, denoted {T}dataset. First, {T}dataset is organized by inverted indexing for whole-video search, which provides fast filtering of {T}dataset to generate a set of candidate target clips {T}candidate that are approximately similar to Q. Then, in subsequence search, the proposed query reformulation method is applied to examine each candidate target clip and search for a set of near-duplicate subsequence pairs {(Qr, Ts)}near-duplicate, where Qr ⊆ Q and Ts ⊆ T are considered near-duplicates if their similarity is greater than a predefined threshold θ. The details are presented in the following subsections.


A. Whole-Video Search: Inverted Indexing

Let Q = {qi | i = 1, 2, ..., nQ} be a query clip with nQ frames, where qi is the i-th query frame, and let T = {tj | j = 1, 2, ..., nT} be a target clip with nT frames, where tj is the j-th target frame. The i-th query frame is represented by a signature f(qi) ∈ {0,1}^D, a D-dimensional binary vector. We regard qi as a bag of visual words and its signature as an indicator of the occurrence of visual words: if the d-th visual word occurs in qi, the signature's d-th element fd(qi) = 1; otherwise fd(qi) = 0. Thus, the signature of the query clip is defined as the bitwise OR of its frame signatures:

fd(Q) = fd(q1) ∨ fd(q2) ∨ … ∨ fd(qnQ), (1)

for d = 1, 2, ..., D. Clearly, fd(Q) = 1 indicates that the d-th visual word occurs in Q. Each target clip T is represented in the same manner.

We employ an inverted index structure to maintain the dataset of target video clips. An inverted index structure X contains D cells corresponding to the D visual words. Each cell stores a linked list of target video clips. A target video clip T is indexed according to its signature f(T): if fd(T) = 1, T's file ID is inserted into the d-th cell X(d). Figure 2 shows an example of an inverted index structure. When a query clip Q is submitted, it is processed in a similar manner: if fd(Q) = 1 and X(d) contains T, we increment T's count by one (the count is initialized to zero). When T's count exceeds a predefined threshold ϕ, T is considered a candidate clip and added to the set {T}candidate.
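To make the indexing and filtering steps concrete, the following minimal sketch (Python; the container layout and variable names are our own assumptions, not the paper's C++ implementation) builds the inverted index X and filters candidates by the count threshold ϕ:

    from collections import defaultdict

    def build_inverted_index(target_signatures):
        """target_signatures: dict mapping file ID -> set of visual-word
        indices d with fd(T) = 1. Returns X: cell d -> list of file IDs."""
        index = defaultdict(list)
        for file_id, words in target_signatures.items():
            for d in words:
                index[d].append(file_id)
        return index

    def filter_candidates(index, query_words, phi):
        """Count shared visual words per target clip; a target whose count
        exceeds phi becomes a candidate, as described above."""
        counts = defaultdict(int)
        for d in query_words:                 # bits with fd(Q) = 1
            for file_id in index.get(d, ()):  # traverse the cell X(d)
                counts[file_id] += 1
        return {f for f, c in counts.items() if c > phi}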

Fig. 2. An example of an inverted index structure used to organize target video clips.

Fig. 1. An overview of the proposed framework.


B. Subsequence Search: Query Reformulation

The next stage performs fine matching between Q and each candidate target clip T ∈ {T}candidate. To deal with temporal-based transformations, we employ a sliding window to scan over T. The segment framed by the sliding window is denoted as a target sequence Ts = {tj | s ≤ j ≤ s+w−1}, s ∈ [1, nT − w + 1], where w is the length of the sliding window. Q is initially partitioned into a set of R non-overlapped query segments {Qr | r = 1, 2, ..., R}, where R = ⌊nQ/w⌋ and ⌊⋅⌋ denotes the floor function. We define the similarity between two video segments Qr and Ts by the Jaccard coefficient:

sim(Qr, Ts) = |f(Qr) ∧ f(Ts)| / |f(Qr) ∨ f(Ts)|, (2)

where |A| returns the cardinality of vector A. Since f(Qr) and f(Ts) are binary vectors, the cardinality of their intersection is equivalent to their inner product. Therefore, we sometimes express Equation (2) as:

sim(Qr, Ts) = ⟨f(Qr), f(Ts)⟩ / |f(Qr) ∨ f(Ts)|, (3)

where ⟨A, B⟩ returns the inner product of vectors A and B. If sim(Qr, Ts) > θ, we output the segment pair (Qr, Ts) to the near-duplicate set.

Rather than exhaustively computing the similarity between each query segment and a target segment, we prune some query segments to avoid unnecessary matches according to the relations among the query segments. Moreover, query segment partition can be formulated as an optimization problem that maximizes the pruning probability.
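Because the signatures are binary, Equations (2) and (3) reduce to population counts over bitwise AND and OR. A short sketch, assuming each signature is packed into a Python integer acting as a D-bit vector (a representation chosen here for illustration):

    def jaccard(sig_q, sig_t):
        """Equation (2)/(3) on packed binary signatures."""
        inter = bin(sig_q & sig_t).count("1")  # |f(Qr) AND f(Ts)| = inner product
        union = bin(sig_q | sig_t).count("1")  # |f(Qr) OR f(Ts)|
        return inter / union if union else 0.0

For example, sig_q = 0b1011 and sig_t = 0b0011 share two set bits out of three in their union, giving a similarity of 2/3.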

• Query segment pruning

Query segment pruning allows us to skip the computation of sim(Qr, Ts) for some query segments and thus accelerate the search process. We employ a lightweight test to determine whether Qr can be pruned during the similarity computation phase. Let Qr* be the pivot query segment selected from {Qr}; the selection mechanism is detailed in later paragraphs. We first calculate the similarity between Qr* and Ts, i.e., sim(Qr*, Ts). Then the similarity between Qr and Ts, r ≠ r*, satisfies the following inequality:

sim(Qr, Ts) ≤ (⟨f(Qr*), f(Ts)⟩ + |f(Qr) \ f(Qr*)|) / |f(Qr) ∨ f(Ts)|, (4)

where A \ B returns the set difference of vectors A and B. If each query subsequence's set difference to the pivot query subsequence, f(Qr) \ f(Qr*), is pre-calculated, the numerator of Equation (4) can be obtained in O(1) time. However, the denominator still requires O(D) time. We present two approaches for reducing the computation cost of the denominator.

The first approach applies the set relation |f(Qr) ∨ f(Ts)| ≥ |f(Qr*) ∨ f(Ts)| − |f(Qr*) \ f(Qr)| to obtain a similarity upper bound. We can rewrite Equation (4) as:

sim(Qr, Ts) ≤ simupbnd(Qr, Ts) = (⟨f(Qr*), f(Ts)⟩ + |f(Qr) \ f(Qr*)|) / (|f(Qr*) ∨ f(Ts)| − |f(Qr*) \ f(Qr)|). (5)

The second is an incremental update approach. For each target frame tj, we pre-calculate its Hamming distance to its previous frame tj−1:

HDIST(tj) = Σd fd(tj) ⊕ fd(tj−1), (6)

where ⊕ denotes the exclusive-or operation of two bits. Based on Equations (4) and (6), we derive the following inequality for the similarity between Qr and Ts+1:

sim(Qr, Ts+1) ≤ simupbnd(Qr, Ts+1) = (⟨f(Qr*), f(Ts+1)⟩ + |f(Qr) \ f(Qr*)|) / (|f(Qr) ∨ f(Ts)| − HDIST(ts+w)). (7)


Note that the inequality holds for any two adjacent target subsequences Ts and Ts+1. The reason is that when we shift the window forward one frame in T, frame ts slides out of the window and frame ts+w slides into the window; the maximum decrement from |f(Qr) ∨ f(Ts)| to |f(Qr) ∨ f(Ts+1)| is thus HDIST(ts+w). Equations (5) and (7) both take O(1) time to derive the similarity upper bound between the query and target subsequences. In Section IV.C, our experimental results show that the incremental update approach of Equation (7) is more effective than the set relation approach of Equation (5). To facilitate the subsequent discussion, we adopt the incremental update approach to compute the similarity upper bound hereinafter.

The similarity upper bound is used to prune query subsequences as follows. Suppose that we have pre-calculated the Hamming distances HDIST(tj), the set differences between Qr and the pivot Qr*, and the cardinalities of Qr for r = 1, 2, ..., R. First, we obtain the exact values of sim(Qr, Ts) for r = 1, 2, ..., R at the first window position. Then, for the next target sequence, simupbnd(Qr, Ts+1) can be derived immediately according to Equations (6) and (7). If simupbnd(Qr, Ts+1) is not greater than θ, we do not need to calculate the actual similarity sim(Qr, Ts+1) and consider the pair not near-duplicate; otherwise, we calculate sim(Qr, Ts+1) using Equation (2). Compared with Equation (2), which takes O(D) time, the overhead of Equation (7) is relatively lightweight as it only takes O(1) time. Thus, the computation cost of {sim(Qr, Ts+1) | r = 1, 2, ..., R} is reduced whenever query subsequences whose similarity upper bounds are not greater than θ are pruned.

Algorithm 1 summarizes the proposed query subsequence pruning. In Line 5, we initialize two arrays INTERSECT and UNION of R elements: INTERSECT(r) = |f(Qr) ∧ f(Ts)| and UNION(r) = |f(Qr)| + |f(Ts)| − INTERSECT(r), r = 1, 2, ..., R. They store the cardinalities of the intersection and union between Qr and Ts, respectively. Lines 8 to 9 compute the similarity between the pivot query subsequence and the current target sequence. If this similarity is greater than θ, we compute the other query subsequences' actual similarities sim(Qr, Ts), as shown in Lines 10 to 12; otherwise, we examine the similarity upper bounds in Lines 14 to 17. The function ComputeActualSimilarity(Qr, Ts) computes the actual values of INTERSECT(r) = |f(Qr) ∧ f(Ts)| and UNION(r) = |f(Qr) ∨ f(Ts)|.

The time complexity of the online computation between a query clip Q and a target clip T is analyzed as follows. The similarity computation sim(Qr, Ts) in Equation (2) takes O(D) time; the similarity upper bound computation simupbnd(Qr, Ts) in Equation (7) takes O(1) time. Thus, the total time complexity of matching Q against T is about (nT − w + 1)·(O(D) + R·(δ·O(1) + (1 − δ)·O(D))), where 0 ≤ δ ≤ 1 is the probability of pruning a query segment.


Algorithm 1 Computation of the Similarity between the Query Clip Q and the Target Clip T
1: Set HDIST(t1) = 0 and HDIST(tj) = Σd fd(tj) ⊕ fd(tj−1) for j = 2, 3, ..., nT.
2: Partition T into target sequences Ts = {tj | s ≤ j ≤ s+w−1}, s = 1, 2, ..., nT − w + 1, where w is the length of each Ts.
3: Partition Q into R query segments {Qr | r = 1, 2, ..., R}, where R = ⌊nQ/w⌋.
4: Compute |f(Qr)| and |f(Qr) \ f(Qr*)| for r = 1, 2, ..., R, where Qr* is the pivot query segment.
5: For s = 1, initialize two arrays: INTERSECT(r) = |f(Qr) ∧ f(T1)| and UNION(r) = |f(Qr)| + |f(T1)| − INTERSECT(r), r = 1, 2, ..., R.
6: for s = 2 to nT − w + 1 do
7:   Ts = {tj | s ≤ j ≤ s+w−1}.
8:   INTERSECT(r*) = ⟨f(Qr*), f(Ts)⟩; UNION(r*) = |f(Qr*) ∨ f(Ts)|.
9:   sim(Qr*, Ts) = INTERSECT(r*) / UNION(r*).
10:  if sim(Qr*, Ts) > θ
11:    for r = 1 to R, r ≠ r*, do
12:      ComputeActualSimilarity(Qr, Ts).
13:  else
14:    for r = 1 to R, r ≠ r*, do
15:      INTERSECT(r) = INTERSECT(r*) + |f(Qr) \ f(Qr*)|; UNION(r) = UNION(r) − HDIST(ts+w−1).
16:      if INTERSECT(r) / UNION(r) > θ
17:        ComputeActualSimilarity(Qr, Ts).
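A compact Python sketch of the pruning loop follows. It mirrors the reconstruction of Algorithm 1 above, so the exact bookkeeping details (e.g., which HDIST entry corresponds to the frame entering the window) should be read as assumptions rather than the paper's definitive implementation:

    def subsequence_search(q_sigs, r_star, set_diff, hdist, t_sigs, w, theta):
        """q_sigs: R packed query-segment signatures; r_star: pivot index;
        set_diff[r]: precomputed |f(Qr) \ f(Qr*)|; hdist[j]: Eq. (6) distances;
        t_sigs[s]: packed signature of window T_s. Returns matched (r, s) pairs."""
        pop = lambda x: bin(x).count("1")
        R = len(q_sigs)
        # exact initialization at the first window position
        inter = [pop(q & t_sigs[0]) for q in q_sigs]
        union = [pop(q | t_sigs[0]) for q in q_sigs]
        matches = [(r, 0) for r in range(R) if union[r] and inter[r] / union[r] > theta]
        for s in range(1, len(t_sigs)):
            ts = t_sigs[s]
            # the pivot's similarity is always computed exactly (the O(D) step)
            inter[r_star] = pop(q_sigs[r_star] & ts)
            union[r_star] = pop(q_sigs[r_star] | ts)
            pivot_hit = inter[r_star] / union[r_star] > theta
            if pivot_hit:
                matches.append((r_star, s))
            for r in range(R):
                if r == r_star:
                    continue
                if not pivot_hit:
                    # O(1) upper-bound test of Eq. (7); frame t_{s+w-1} enters
                    inter[r] = inter[r_star] + set_diff[r]
                    union[r] = max(union[r] - hdist[s + w - 1], 1)
                    if inter[r] / union[r] <= theta:
                        continue                      # pruned: skip O(D) work
                # bound exceeded theta (or pivot matched): compute the actual similarity
                inter[r] = pop(q_sigs[r] & ts)
                union[r] = pop(q_sigs[r] | ts)
                if inter[r] / union[r] > theta:
                    matches.append((r, s))
        return matches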


• Query partition optimization

The objective of the optimization task is to find the best strategy to partition a query clip so that the computation cost is minimized. Intuitively, this can be achieved by lowering the similarity upper bound: the lower the similarity upper bound of a query segment, the higher the probability that we can skip its similarity computation. We formulate the objective function as follows:

{b2, …, bR}* = argmin{b2,…,bR} Σr≠r* simupbnd(Qr, Ts)
             = argmin{b2,…,bR} Σr≠r* (⟨f(Qr*), f(Ts)⟩ + |f(Qr) \ f(Qr*)|) / (|f(Qr) ∨ f(Ts)| − HDIST(ts+w))
             = argmin{b2,…,bR} Σr≠r* |f(Qr) \ f(Qr*)|
             = argmin{b2,…,bR} Σr≠r* ⟨f(Qr), ¬f(Qr*)⟩, (8)

where the second line follows from Equation (7), the third line follows by linearity after dropping the terms related to T, and ¬f is the complement of f. Solving Equation (8) corresponds to finding the query segment set that has the smallest summation of similarity upper bounds. Suppose that the target sequences' signatures are randomly distributed and independent of those of the query segments. We can then ignore the terms related to T and focus on the query segments' relation, i.e., Σr≠r* |f(Qr) \ f(Qr*)|, in the summation. To minimize it, f(Qr) \ f(Qr*) should be as small as possible; in other words, f(Qr) should be close to f(Qr*). Thus, the problem of minimizing the computation cost is transformed into finding the set of query segments {Qr | r = 1, 2, …, R} by partitioning the query clip such that each f(Qr), r ≠ r*, is as close to f(Qr*) as possible.

To solve the optimization problem, we apply a gradient descent algorithm guided by a trust-region method [5]. The trust-region method is an iterative algorithm that searches for a local solution of the objective function within a certain region called a trust region. In each iteration, a model that approximates the objective within the trust region is built to assess the local solution. If the ratio between the achieved reduction in the objective function and the predicted reduction in the model is sufficiently good, we increase the trust region's radius for the next iteration; otherwise, we reduce the radius. The trust-region method has been widely applied in large-scale numerical optimization and logistic regression for its fast convergence [10]. It is effective for general bound-constrained optimization problems such as ours, where the query subsequence boundaries are updated within certain intervals.

Consider a query subsequence Qr and its boundaries br and br+1, where b1 = 1, bR+1 = nQ + 1, and Qr = {qi | br ≤ i < br+1}, as shown in Figure 3. Our goal is to (1) select a suitable pivot query subsequence Qr* and (2) find appropriate boundaries for {Qr | r = 1, 2, …, R}, so that each f(Qr) is close to f(Qr*) and the computation cost is thereby minimized. The optimization process is summarized in Algorithm 2. In Line 1, we initialize the boundaries br and their trust-region radiuses ℜr. Lines 3 to 22 form the main loop of the optimization process. In Line 4, we select the pivot query subsequence based on the expression:

r* = argminr̂ Σr |f(Qr) \ f(Qr̂)|, r̂ ∈ {1, 2, …, R}. (9)

For each query segment Qr̂, the above equation sums up its set difference cardinalities with all other query segments; the one with the minimum summation is selected as the pivot query segment Qr*. Next, we consider the following cases for Qr* and Qr with respect to the d-th dimension: (1) If fd(Qr*) = 0 and fd(Qr) = 1, the pivot does not contain the d-th visual word, but Qr does. To make the signature of Qr closer to that of Qr*, Qr should be shrunk so that the d-th visual word can be removed from Qr; this is achieved by moving br and br+1 closer to each other. (2) If fd(Qr*) = 1 and fd(Qr) = 0, the pivot contains the d-th visual word, but Qr does not. Qr should be enlarged so that the d-th visual word can be included in Qr; this is achieved by moving br and br+1 away from each other. (3) If fd(Qr*) = fd(Qr), the two segments have the same property at the d-th dimension, and it is not necessary to modify br and br+1.

Fig. 3. A query clip and its partitioned query segments.


Algorithm 2 Optimization for Partitioning Query Clip Q into R Segments {Qr | r = 1, 2, …, R}
1: Initialize the boundaries br and trust-region radiuses ℜr for r = 2, …, R, with b1 = 1 and bR+1 = nQ + 1.
2: iter = 1.
3: do
4:   Select Qr* by r* = argminr̂ Σr |f(Qr) \ f(Qr̂)|, r̂ ∈ {1, 2, …, R}.
5:   Reset Δbr(d) = 0 for r = 1, 2, …, R and d = 1, 2, …, D.
6:   for r = 1 to R do
7:     for d = 1 to D do
8:       if fd(Qr*) = 0 and fd(Qr) = 1
9:         Δbr(d) = +count(br, br + ℜr − 1, d, '0').
10:        Δbr+1(d) = −count(br+1 − ℜr+1, br+1 − 1, d, '0').
11:      else if fd(Qr*) = 1 and fd(Qr) = 0
12:        Δbr(d) = −count(br − ℜr, br − 1, d, '1').
13:        Δbr+1(d) = +count(br+1, br+1 + ℜr+1 − 1, d, '1').
14:  Update br = br + κ·Σd Δbr(d) for r = 1, 2, …, R.
15:  Construct Mr = |f(Qr') \ f(Qr*)| − |f(Qr) \ f(Qr*)| for r = 1, 2, …, R.
16:  for r = 1 to R do
17:    if Mr < 0
18:      ℜr = η+×ℜr ; ℜr+1 = η+×ℜr+1.
19:    else
20:      ℜr = η−×ℜr ; ℜr+1 = η−×ℜr+1.
21:  iter = iter + 1.
22: while Σr Σd |Δbr(d)| > ρ and iter ≤ iterMAX

Let Δbr = {Δbr(d) | d = 1, 2, …, D} be br's gradient over the trust region ℜr; it reflects the moving vector for updating br. Cases (1) and (2) correspond to Lines 8 to 10 and Lines 11 to 13, respectively. The function count(left, right, d, symbol) returns the number of occurrences of symbol ('0' or '1') in the set {fd(qi) | i = left, left+1, …, right}; it represents br's gradient magnitude on the d-th dimension over the frame sequence qleft, …, qright. We apply the approximate Cauchy point to update br based on the dimension gradients Δbr(d) and a controlling coefficient κ through the Jacobi method, as shown in Line 14. Then, in Line 15, we construct the trust-region model for the r-th query segment:

Mr = |f(Qr') \ f(Qr*)| − |f(Qr) \ f(Qr*)|, (10)

where Qr' is the updated r-th query segment. If Mr < 0, the update reduces the objective and is acceptable, so we enlarge the boundary trust regions ℜr and ℜr+1 by an updating coefficient η+; otherwise, we shrink the trust regions by η−, as shown in Lines 16 to 20, where η+ > 1 and η− < 1. The algorithm continues to update the boundaries until the update magnitude is less than a predefined threshold ρ or the iteration count exceeds the maximum iterMAX.
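The following simplified sketch performs one boundary-update step in the spirit of Algorithm 2. The per-dimension count-based gradients of Lines 8 to 13 are collapsed into a single signed step per boundary, so the exact signs, scan ranges, and step sizes here are approximations, not the paper's precise rule:

    def update_boundaries(frame_words, bounds, radii, kappa, r_star):
        """frame_words[i]: set of visual words of query frame q_i;
        segment Q_r spans frames bounds[r] .. bounds[r+1]-1 (0-based);
        radii[r]: trust-region radius of boundary b_r."""
        R = len(bounds) - 1
        seg_words = [set().union(*frame_words[bounds[r]:bounds[r + 1]])
                     for r in range(R)]
        pivot = seg_words[r_star]
        new_bounds = bounds[:]
        for r in range(1, R):  # interior boundaries b_2 .. b_R
            # surplus words push the boundary to shrink the segment (case 1);
            # missing pivot words push it to enlarge the segment (case 2)
            grad = len(seg_words[r] - pivot) - len(pivot - seg_words[r])
            step = int(round(kappa * grad))
            step = max(-radii[r], min(radii[r], step))   # clamp to trust region
            lo, hi = new_bounds[r - 1] + 1, new_bounds[r + 1] - 1
            new_bounds[r] = max(lo, min(hi, bounds[r] + step))
        return new_bounds

Iterating this step while growing or shrinking radii according to Equation (10) reproduces the overall shape of the trust-region optimization.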

IV. EXPERIMENTS

A. Video Dataset

We evaluated the proposed framework on the CC_WEB_VIDEO [1] and TRECVID [18] collections. The CC_WEB_VIDEO collection contains 24 folders with a total of 12,890 video clips downloaded from video sharing websites; the total video length is approximately 732 hours. Each folder contains a major group of near-duplicate video clips, which differ mainly in visual quality, compression codec, frame resolution, cropping,


subtitling, frame rate, and so on. For the TRECVID collection, we downloaded the IACC.1.A video data used for the content-based copy detection task, with a total of 8,175 video clips; the video length is approximately 225 hours. All video clips were converted into a uniform format of 320×240-pixel frame resolution at 1 frame per second (fps). Together they served as the target dataset, with a total of 21,065 files, 3,448,191 frames, and approximately 957 hours of video.

To accelerate the search process, we preprocessed and stored the window subsequences rather than individual target frames. A window subsequence was represented by a D-dimensional binary vector, which required D/8 bytes of storage regardless of the window size. For example, when D = 1024, our target dataset contained 3,047,956 window subsequences and consumed 369.49 MB of storage.

To compile a query dataset, we selected the first video clip of each folder of the CC_WEB_VIDEO dataset; each is denoted a full-duplicate query clip. We also generated partial-duplicate query clips by randomly excerpting two snippets of 15 and 30 seconds from the full-duplicate query clips. Each snippet was inserted into an unrelated video clip at a random position to form a partial-duplicate query clip. In total, we compiled 24 full-duplicate and 48 partial-duplicate query clips, each truncated to 60 seconds, as shown in Figure 4. Four undergraduate students annotated the ground truth for every target video clip with the tags "full near-duplicate," "partial near-duplicate (with timestamps)," and "not near-duplicate."

Fig. 4. Examples of (a) a full-duplicate query clip and (b) a partial-duplicate query clip. The crosses indicate that the corresponding video content has been deleted.

B. Feature Extraction

In the experiments, both global-based and local-based signatures were modeled in a BoVW form with different codebook sizes to observe their respective performance. We employed the LBP-based ordinal relation feature [12] to produce the global-based feature for a frame f. Each frame was first divided into 3×3 non-overlapping blocks, and the intensity ranks of these blocks were computed. For a 1024-codeword codebook, the feature was expressed as a vector of 10 ordinal relation functions:

[R((2,2),(1,2)), R((2,2),(2,3)), R((2,2),(3,2)), R((2,2),(2,1)), R((1,1),(1,3)), R((1,3),(3,3)), R((3,3),(3,1)), R((3,1),(1,1)), R((1,1),(3,3)), R((1,3),(3,1))], (11)

where the function R(s, t) is an indicator that returns 1 if the rank of block s is greater than that of block t, and 0 otherwise. Figure 5 gives an example of the global-based feature. Since the resulting feature was equivalent to a nonnegative integer in the interval [0, 1023], it was treated as a visual word code and represented as a 1024-dimensional binary vector, where all bins


were set to zero except for the element corresponding to the feature's value, which was set to one. For the other codebooks of 2048, 4096, and 8192 codewords, the features were expressed by appending one more ordinal relation function at a time:

[f1024, R((3,2),(3,3))]; [f2048, R((2,1),(3,1))]; [f4096, R((3,1),(3,2))], (12)

respectively, where f1024, f2048, and f4096 denote the 1024-, 2048-, and 4096-codeword feature vectors. The ordinal relation functions used in Equations (11) and (12) were selected based on the spatial topology and information entropy criteria proposed by Shang et al. [12].

To generate the local-based signature, we extracted SIFT descriptors [11] from each frame and quantized them by a codebook of D codewords, D ∈ {1024, 2048, 4096, 8192}. Each SIFT descriptor was quantized to the nearest codeword and assigned to the corresponding histogram bin. The BoVW form of a video frame f was therefore expressed as a signature of a D-dimensional binary vector, as defined in Section III.A.
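The 1024-codeword global feature can be computed in a few lines. This sketch (Python with NumPy) follows Equation (11); the grayscale conversion and the tie-breaking of equal ranks are our own assumptions:

    import numpy as np

    # the ten block pairs of Equation (11); blocks indexed (row, col) in the 3x3 grid
    PAIRS = [((2,2),(1,2)), ((2,2),(2,3)), ((2,2),(3,2)), ((2,2),(2,1)),
             ((1,1),(1,3)), ((1,3),(3,3)), ((3,3),(3,1)), ((3,1),(1,1)),
             ((1,1),(3,3)), ((1,3),(3,1))]

    def ordinal_codeword(gray):
        """Map a grayscale frame to its 10-bit ordinal codeword in [0, 1023]."""
        h, w = gray.shape
        means = np.array([[gray[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3].mean()
                           for j in range(3)] for i in range(3)])
        rank = means.ravel().argsort().argsort().reshape(3, 3)  # intensity ranks
        code = 0
        for (si, sj), (ti, tj) in PAIRS:
            bit = 1 if rank[si-1, sj-1] > rank[ti-1, tj-1] else 0  # R((s),(t))
            code = (code << 1) | bit
        return code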

Fig. 5. An example of the 3×3 ordinal relations of a video frame and its global-based feature (1, 0, 1, 1, 1, 0, 1, 0, 0, 0).

C. Performance for the Full-Duplicate Query Dataset

• Preliminary Study of Parameters

The parameters used in this study were set as follows. For feature representation, we investigated different numbers of codewords D = {1024, 2048, 4096, 8192}; the similarity thresholds θ = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} for the global-based feature and θ = {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5} for the local-based signature; and the window size of the target sequence w = {5, 10, 15, 20, 30}. In the query partition optimization described in Section III.B, the maximum iteration count iterMAX = 10, and the trust-region parameters were ρ = 0, η+ = 1.5, and η− = 0.5; the controlling coefficient κ was set separately for the global-based feature and the local-based signature. For whole-video search (Section III.A), the count threshold ϕ = 0.5×max_count, where max_count is the maximum of the inverted indexing counts over all target clips. A detection result was considered correct if it had any overlap with the region from which the query was extracted. The recall and precision rates were used to evaluate the accuracy of the detection results:

recall = TP / (TP + FN), precision = TP / (TP + FP), (13)

where true positives (TP) refer to the number of positive examples correctly labeled as positives; false negatives (FN) refer to the number of positive examples incorrectly labeled as


negatives; and false positives (FP) refer to the number of negative examples incorrectly labeled as positives. We also used the F-measure:

F-measure = (2 × recall × precision) / (recall + precision). (14)

We first consider the full-duplicate query dataset. Tables I and II list the experiment results in terms of the recall and precision rates for the global-based and local-based signatures, respectively, where query partition optimization was applied and w = 20. As the number of codewords D grows, the recall rate decreases and the precision rate increases, and the overall F-measure improves. This shows that using more codewords strengthens the feature's discriminative power.

To assess the effectiveness of query segment pruning, we define the query segment pruning ratio (QSPR) as:

QSPR = (number of pruned query subsequences) / R, (15)

for matching R query subsequences against a target subsequence. QSPR reflects the proportion of the query subsequences that are pruned without computing the Jaccard similarity with the target subsequence. Figure 6 shows the QSPR under different θ and D. The QSPR increases as θ and D grow. A larger θ widens the gap between the threshold and the similarity upper bound, so more query subsequences can be pruned. A larger D generally lowers the similarity; it likewise enlarges the gap between θ and the similarity and thus increases the QSPR. We also examine the pruned subsequences for the ratios of correct pruning (pruned subsequences that are actually not near-duplicates) and incorrect pruning (pruned subsequences that are actually near-duplicates). Setting θ = 0.4, D = 1024, and w = 20, the global-based feature yields a 92.71% correct-pruning rate and a 7.29% incorrect-pruning rate; the local-based signature yields a 98.17% correct-pruning rate and a 1.83% incorrect-pruning rate. Note that multiplying the QSPR by the incorrect-pruning rate gives the proportion of false negatives contributed by the proposed pruning scheme; in the above case, the false negative proportions are 2.91% (0.3993 × 7.29%) for the global-based feature and 0.72% (0.3949 × 1.83%) for the local-based signature. To strike a balance between accuracy and efficiency, we fix θ = 0.4 and D = 1024 for both the global-based and local-based signatures in all subsequent experiments.

• Query Partition Optimization

Tables III and IV list the recall and precision rates under different window sizes and optimization configurations, with w = {5, 10, 15, 20, 30}. Three optimization configurations are compared: naïve, pivot selection, and pivot selection with optimization. The naïve configuration applies neither pivot selection nor optimization; the pivot selection configuration uses Equation (9) to select the pivot from the query segments; and the pivot selection with optimization configuration first selects the pivot segment and then performs the trust-region-based gradient descent algorithm.




Table I. The recall and precision rates of the global-based feature under different similarity thresholds and feature dimensions (opt, w = 20).

Global-based    θ=0.1   θ=0.2   θ=0.3   θ=0.4   θ=0.5   θ=0.6   θ=0.7   θ=0.8   θ=0.9
D=1024   R     0.9812  0.9656  0.9656  0.9500  0.9437  0.6250  0.5406  0.4000  0.2500
         P     0.6243  0.8374  0.9169  0.9441  0.9497  0.9434  0.9454  0.9275  0.9195
D=2048   R     0.9750  0.9656  0.9656  0.9396  0.6601  0.5375  0.3882  0.3043  0.2375
         P     0.6842  0.8983  0.9208  0.9454  0.9470  0.9350  0.9397  0.9271  0.9157
D=4096   R     0.9750  0.9656  0.9656  0.9375  0.6062  0.5062  0.3406  0.2781  0.2375
         P     0.7140  0.9062  0.9279  0.9464  0.9463  0.9364  0.9327  0.9271  0.9319
D=8192   R     0.9719  0.9625  0.9563  0.6344  0.4469  0.2656  0.0063  0.0040  0.0015
         P     0.8639  0.9448  0.9503  0.9486  0.9408  0.9239  1.0000  1.0000  1.0000

Table II. The recall and precision rates of the local-based signature under different similarity thresholds and feature dimensions (opt, w = 20).

Local-based     θ=0.1   θ=0.15  θ=0.2   θ=0.25  θ=0.3   θ=0.35  θ=0.4   θ=0.45  θ=0.5
D=1024   R     0.9844  0.9781  0.9781  0.9750  0.9688  0.9656  0.9500  0.6094  0.3531
         P     0.5130  0.5681  0.6576  0.7411  0.8356  0.9169  0.9530  0.9512  0.9339
D=2048   R     0.9781  0.9781  0.9719  0.9656  0.9531  0.9125  0.4344  0.3000  0.2406
         P     0.5796  0.7098  0.8163  0.9115  0.9531  0.9545  0.9456  0.9231  0.9167
D=4096   R     0.9781  0.9719  0.9656  0.9500  0.9125  0.3844  0.2687  0.2344  0.2188
         P     0.6819  0.8338  0.9224  0.9530  0.9511  0.9318  0.9247  0.9259  0.9211
D=8192   R     0.9750  0.9719  0.9531  0.9125  0.4156  0.3156  0.2437  0.2344  0.1656
         P     0.8211  0.9284  0.9502  0.9511  0.9366  0.9182  0.9070  0.9259  0.9138


Fig. 6. The QSPR of (a) the global-based feature and (b) the local-based signature. Query partition optimization is applied and the window size w = 20 for both features.

Table III. The recall and precision rates of the global-based feature under different window sizes and optimization configurations (θ = 0.4, D = 1024).

Global-based        w=5     w=10    w=15    w=20    w=30
Naïve        R    0.9688  0.9656  0.9563  0.9500  0.9469
             P    0.7635  0.8983  0.9329  0.9412  0.9469
Pivot        R    0.9688  0.9656  0.9563  0.9500  0.9469
             P    0.7692  0.8983  0.9329  0.9412  0.9469
Pivot + Opt  R    0.9688  0.9688  0.9656  0.9500  0.9500
             P    0.7769  0.8883  0.9307  0.9441  0.9530

Table IV. The recall and precision rates of the local-based signature under different window sizes and optimization configurations (θ = 0.4, D = 1024).

Local-based         w=5     w=10    w=15    w=20    w=30
Naïve        R    0.9344  0.9500  0.9531  0.9531  0.9625
             P    0.9432  0.9530  0.9502  0.9502  0.9536
Pivot        R    0.9344  0.9500  0.9531  0.9531  0.9625
             P    0.9432  0.9530  0.9502  0.9502  0.9536
Pivot + Opt  R    0.9313  0.9500  0.9531  0.9500  0.9656
             P    0.9460  0.9500  0.9502  0.9530  0.9421

The recall and precision rates vary only slightly among these optimization configurations, which shows that the proposed optimization algorithm does not significantly influence the accuracy. For different window sizes, a short query segment (small w) yields a high recall rate, while a long query segment (large w) yields a high precision rate. In addition, the local-based signature yields a higher and more stable precision rate than the global-based feature for small w. Generally, a video segment that is long or characterized by the local-based signature contains more feature descriptors, which constitutes a stronger discriminative power.

Figure 7 shows the QSPR under different window sizes and optimization configurations. The result demonstrates that by applying the pivot selection method and the optimization algorithm, we can skip more unnecessary matches between query segments and target sequences. For the naïve configuration, a larger w has a lower QSPR for two reasons: the number of query segments is smaller, and the query segments are more diverse. However, with optimization, a larger w gains much more improvement in the QSPR. Since the proposed optimization algorithm makes each query segment close to the pivot segment, the total similarity upper bound defined in Equation (7) is reduced, and the probability of query segment pruning is thus raised.

Fig. 7. The QSPR under different window sizes and optimization configurations of (a) the global-based feature and (b) the local-based signature.

• Subsequence Pruning Effectiveness

Recall that Equations (5) and (7) both serve as similarity upper bounds with about O(1) computation cost. We compare their QSPRs and average similarity upper bounds, which reflect their effectiveness in segment pruning, in Tables V and VI. The incremental update approach, i.e., Equation (7), yields a higher QSPR and a lower average similarity upper bound than the set relation approach, i.e., Equation (5). The result also supports our assumption that a segment with a lower similarity upper bound has a higher probability of being pruned. Based on this observation, we choose the incremental update approach for query segment pruning.

Table V. The QSPRs and average similarity upper bounds of two query pruning approaches for the global-based feature (pivot+opt, θ = 0.4, D = 1024).

Global-based                        w=5     w=10    w=15    w=20    w=30
Incremental Update  QSPR           0.3663  0.3739  0.3425  0.3993  0.6710
                    avg. simupbnd  0.5403  0.5205  0.5138  0.4760  0.3499
Set Relation        QSPR           0.3603  0.3238  0.2664  0.3278  0.6433
                    avg. simupbnd  0.7880  0.7412  0.7677  0.6635  0.4239

Table VI. The QSPRs and average similarity upper bounds of two query pruning approaches for the local-based signature (pivot+opt, θ = 0.4, D = 1024).

Local-based                         w=5     w=10    w=15    w=20    w=30
Incremental Update  QSPR           0.0574  0.1843  0.2265  0.3949  0.3851
                    avg. simupbnd  0.6222  0.5319  0.5027  0.4444  0.4026
Set Relation        QSPR           0.0898  0.1286  0.1882  0.3382  0.3185
                    avg. simupbnd  0.9421  0.7278  0.6764  0.5690  0.5181

• Computation Time

Table VII summarizes the computation time of the proposed method. The whole-video search row lists the time spent on inverted indexing over the 21,065 target files; the subsequence search rows list the three optimization configurations for comparison. The computation time is roughly inversely proportional to the QSPR: a configuration with a higher QSPR (e.g., pivot+opt) takes less search time than one with a lower QSPR (e.g., naïve). The computation time of the global-based feature is less than that of the local-based signature since the compact global-based feature is computation-efficient. The program was implemented in C++ and ran on a PC with a 2.8 GHz CPU and 4 GB RAM.

Table VII. Summary of the computation time (in seconds) (pivot+opt, θ = 0.4, D = 1024, w = 20).

                                  Global-based   Local-based
Whole-video search                   0.001          0.028
Subsequence search  Naïve            0.035          0.205
                    Pivot            0.033          0.181
                    Pivot + Opt      0.030          0.153

To further investigate the efficiency issue, we implemented two methods for the intersection function of the query and target subsequences, which is the main overhead of the similarity computation: one uses the AND operation and the other uses inverted indexing. In the AND operation method, a video subsequence is stored in a binary array of D bits, and the intersection is accomplished by applying the AND operation to the two arrays. The method is efficient since the AND operation is a fast machine instruction; the computation cost is about O(D) time. In the inverted indexing method, we generate the set {d | d ∈ [1, D], fd(Qr) = 1} for the subsequence Qr (and likewise for Ts). The set is maintained as a linked list in ascending order, and the intersection of two linked lists can be completed in O(L) time, where L is the length of a linked list. We designed an experiment to compare the two intersection methods with the following configuration: θ = 0.4, D = 1024, w = 20, with pivot optimization applied. The execution time to search a 60-second query video in our target dataset is listed in Table VIII. The average length of a subsequence's linked list is 8 for the global-based feature and 163 for the local-based signature. Since the global-based feature space is very sparse, the inverted indexing implementation of the intersection function is more efficient; on the other hand, the AND operation method is more appropriate for the local-based signature. We adopt the AND operation for the intersection function throughout the experiments.

Table VIII. The computation time for two intersection methods (pivot+opt, θ = 0.4, D = 1024, w = 20).

                                Global-based   Local-based
AND operation (seconds)            0.030          0.153
Inverted indexing (seconds)        0.010          0.303
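The two intersection implementations compared in Table VIII can be sketched as follows (Python; the packed integer stands in for a D-bit array, and the sorted list stands in for the ascending linked list):

    def intersect_and(sig_a, sig_b):
        """Bit-array method: AND two D-bit signatures, then popcount.
        O(D) bit operations, executed as fast machine-word instructions."""
        return bin(sig_a & sig_b).count("1")

    def intersect_lists(words_a, words_b):
        """Inverted-list method: merge two ascending lists of set-bit
        positions. O(L) where L is the list length; wins when sparse."""
        i = j = count = 0
        while i < len(words_a) and j < len(words_b):
            if words_a[i] == words_b[j]:
                count += 1
                i += 1
                j += 1
            elif words_a[i] < words_b[j]:
                i += 1
            else:
                j += 1
        return count

The crossover in Table VIII matches the list lengths reported above: with an average list length of 8, the merge beats the 1024-bit AND; at 163, it does not.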

D. Performance for the Partial-Duplicate Query Dataset

In this subsection, four state-of-the-art search methods, proposed by Shang et al. [12], Tan et al. [17], Huang et al. [8], and Zhou and Chen [21], were implemented as baselines.

Shang et al.'s method follows the whole-video search scheme. The ordinal relation functions described in Section IV.B were used to represent each frame. To model the spatiotemporal property of a frame, the visual shingle concept, which utilizes the ordinal relations of continuous frames, was applied. The visual shingle feature was regarded as a visual word, and the video clip was represented as an aggregation of all frames in a signature of the bag-of-visual-words form. The similarity between two clips was computed by the Jaccard coefficient of their signatures. In our implementation, the LBP-based ordinal relation feature [12] was used and the shingle size was set to 3.

In Tan et al.'s method, a query video and a target video were aligned by constructing a temporal network between their frame pairs. For each query frame, the top-k similar frames were retrieved from the target video. Edges were established between the retrieved frames based on several heuristic temporal constraints, including the temporal distortion level wnd and the minimal length ratio Lmin of the near-duplicate subsequence. A network flow algorithm was employed to find the best path in the graph. In our implementation, Tan et al.'s method used the same local-based signature, and the related parameters were set as wnd = 5, k = 1, and Lmin = 0.1.

Huang et al. proposed a frame-skipping algorithm for speedup. The algorithm combined the histogram pruning algorithm [9] with a temporal order check. The temporal order check examines whether the first frame of the window subsequence is among the first p frames of the query subsequence; otherwise, the weighted edit similarity between the two subsequences cannot be greater than θ. p was calculated by:

p = ⌈|Qr|·(1 − θ)⌉, (16)

where ⌈x⌉ rounds x to the smallest integer greater than or equal to x. If the check condition did not hold, we did not need to compute the weighted edit similarity between the query and target subsequences; we could skip Ts and check the next window sequence Ts+1 directly. The weighted edit similarity adopted the dynamic time warping algorithm to compute the similarity between two sequences. Our implementation also used the same local-based signature.
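Our reading of the skip test of Equation (16) amounts to the following check (Python; the 0-based position convention is an assumption):

    import math

    def temporal_order_check(window_first_pos, qr_length, theta):
        """Huang et al.'s skip test as described above: the window
        subsequence can match only if its first frame occurs among
        the first p frames of the query subsequence."""
        p = math.ceil(qr_length * (1 - theta))   # Equation (16)
        return window_first_pos < p              # False -> skip T_s entirely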


Zhou and Chen characterized a frame by variable-size video cuboid descriptors and computed the subsequence similarity by the Earth mover's distance. In our implementation, a frame was partitioned into 3×3 non-overlapped blocks, and two adjacent blocks were merged to form a bigger block if their intensity difference was small. Each merged block then calculated the intensity differences with its temporally adjacent blocks on the next two keyframes. Thus, a video cuboid descriptor consisted of a 3-gram intensity difference vector and the block size, and a frame was represented by a set of video cuboid descriptors. The similarity between two sequences was calculated by the Earth mover's distance. For acceleration, the incremental signature construction and similarity computation techniques were implemented by leveraging the spatial and temporal locality of adjacent frames.

Table IX compares the performance of the baseline methods and our methods. The full-duplicate and partial-duplicate query datasets are evaluated separately. For the full-duplicate query dataset, all of these methods achieve satisfactory accuracy. However, for the partial-duplicate query dataset, the accuracy of Shang et al.'s method degrades sharply. This shows that the whole-video search scheme cannot handle partial-temporal transformations well, although its compact representation makes the search process very efficient. In contrast, the other baseline methods and our methods exhibit relatively slight degradation under partial-temporal transformations, which manifests that the subsequence search scheme is more accurate and robust than the whole-video search scheme. The price, however, is a higher computation cost. For example, Tan et al. applied the kNN search algorithm to build the temporal network of similar frames; Huang et al. employed the dynamic time warping algorithm to measure the frame order and similarity between two subsequences; and Zhou and Chen extracted the complicated video cuboid signature and calculated the similarity by the time-consuming Earth mover's distance. Although these baseline methods include some speedup techniques, their core algorithms still suffer a high computation complexity in analyzing the spatiotemporal relation. In our method, we utilize set operations to calculate the subsequence similarity. The set operations are computation-efficient compared with kNN search, dynamic time warping, and the Earth mover's distance. In addition, the set operations have a stronger tolerance to partial-temporal transformations, whereas the baseline algorithms enhance the discrimination of a video by leveraging the temporal information.

Table IX. The recall and precision rates and computation time of the baseline and the proposed methods.

                   Full-duplicate      Partial-duplicate    Computation
                   query dataset       query dataset        time (seconds)
                     R       P           R       P
Shang et al.       0.9412  0.9566      0.4861  0.5520          0.007
Tan et al.         0.9500  0.9490      0.8417  0.9785         78.547
Huang et al.       0.9292  0.9557      0.9509  0.7002          0.799
Zhou and Chen      0.9589  0.9135      0.9768  0.7116          3.376
Global-based       0.9500  0.9441      0.7412  0.9613          0.030
Local-based        0.9500  0.9530      0.7484  0.9730          0.153

E. Discussion

The key factor of the proposed video query reformulation method is the choice of the window size, i.e., the subsequence granularity, which also determines the number of partitioned query subsequences. A large window size usually obtains a high QSPR and precision rate, while a small window size yields a high recall rate; the choice should be determined case by case. Compared with the global-based feature, the local-based signature performs relatively stably under various window sizes, especially small ones. However, since the global-based feature generally yields a higher recall rate, it is particularly useful for finding as many positive near-duplicates as possible. In addition, the global-based feature spends less indexing and matching time due to its compact representation. Table X summarizes the memory consumption of the intermediate values used in Algorithm 1; it shows that the proposed method consumes only a modest amount of storage even for a large-scale video dataset.

V. CONCLUSION

In this paper, we presented a query reformulation method to address the robustness and efficiency issues in NDVD. The proposed method partitions a query clip into short query subsequences through an optimization algorithm that maximizes the probability of pruning unnecessary matches. The proposed framework adopts a coarse-to-fine search strategy by integrating inverted file indexing with subsequence search and pruning. We implemented both global-based and local-based versions as well as several baseline methods in the experiments. The results demonstrate that the video query reformulation method expedites the search process efficiently and yields robust performance in dealing with a variety of near-duplicates in a large-scale video dataset.

Table X. The memory consumption of intermediate values used in Algorithm 1.

             HDIST(tj)   |Ts|        |Qr|       |f(Qr)\f(Qr*)|  INTERSECT(r)  UNION(r)
Size         3,047,956   3,047,956   R          R               R             R
Data type    Integer (4 bytes) for all columns
Memory       11.63 MB    11.63 MB    4·R bytes  4·R bytes       4·R bytes     4·R bytes



REFERENCES

[1] CC_WEB_VIDEO: Near-Duplicate Web Video Dataset. http://vireo.cs.cityu.edu.hk/webvideo/
[2] S. C. Cheung and A. Zakhor, "Efficient video similarity measurement with video signature," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 1, pp. 59-74, 2003.
[3] C. Y. Chiu, C. S. Chen, and L. F. Chien, "A framework for handling spatiotemporal variations in video copy detection," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 18, No. 3, pp. 412-417, 2008.
[4] C. Y. Chiu, H. M. Wang, and C. S. Chen, "Fast min-hashing indexing and robust spatio-temporal matching for detecting video copies," ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 6, No. 2, pp. 10:1-23, 2010.
[5] A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust-Region Methods, MPS-SIAM Series on Optimization, 2000.
[6] M. M. Esmaeili, M. Fatourechi, and R. K. Ward, "A robust and fast video copy detection system using content-based fingerprinting," IEEE Transactions on Information Forensics and Security, Vol. 6, No. 1, pp. 213-226, 2011.
[7] X. S. Hua, X. Chen, and H. J. Zhang, "Robust video signature based on ordinal measure," in Proceedings of the IEEE International Conference on Image Processing (ICIP), Singapore, Oct. 24-27, 2004.
[8] Z. Huang, H. T. Shen, J. Shao, B. Cui, and X. Zhou, "Practical online near-duplicate subsequence detection for continuous video streams," IEEE Transactions on Multimedia, Vol. 12, No. 5, pp. 386-398, 2010.
[9] K. Kashino, T. Kurozumi, and H. Murase, "A quick search method for audio and video signals based on histogram pruning," IEEE Transactions on Multimedia, Vol. 5, No. 3, pp. 348-357, 2003.
[10] C. J. Lin, R. C. Weng, and S. S. Keerthi, "Trust region Newton method for large-scale logistic regression," in Proceedings of the International Conference on Machine Learning (ICML), Corvallis, USA, Jun. 20-24, 2007.
[11] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[12] L. Shang, L. Yang, F. Wang, K. P. Chan, and X. S. Hua, "Real time large scale near-duplicate web video retrieval," in Proceedings of the ACM International Conference on Multimedia (ACM-MM), Firenze, Italy, Oct. 25-29, 2010.
[13] H. T. Shen, B. C. Ooi, and X. Zhou, "Towards effective indexing for very large video sequence databases," in Proceedings of the ACM International Conference on Management of Data (SIGMOD), Baltimore, USA, Jun. 14-16, 2005.
[14] H. T. Shen, J. Shao, Z. Huang, and X. Zhou, "Effective and efficient query processing for video subsequence identification," IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 3, pp. 321-334, 2009.
[15] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Nice, France, Oct. 14-17, 2003.
[16] J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong, "Multiple feature hashing for real-time large scale near-duplicate video retrieval," in Proceedings of the ACM International Conference on Multimedia (ACM-MM), Scottsdale, USA, Nov. 28-Dec. 1, 2011.
[17] H. K. Tan, C. W. Ngo, R. Hong, and T. S. Chua, "Scalable detection of partial near-duplicate videos by visual temporal consistency," in Proceedings of the ACM International Conference on Multimedia (ACM-MM), Beijing, China, Oct. 19-24, 2009.
[18] TRECVID 2011 Guidelines. http://www-nlpir.nist.gov/projects/tv2011/
[19] X. Wu, A. G. Hauptmann, and C. W. Ngo, "Practical elimination of near-duplicates from web video search," in Proceedings of the ACM International Conference on Multimedia (ACM-MM), Augsburg, Germany, Sep. 23-28, 2007.
[20] M. C. Yeh and K. T. Cheng, "Fast visual retrieval using accelerated sequence matching," IEEE Transactions on Multimedia, Vol. 13, No. 2, pp. 320-329, 2011.
[21] X. Zhou and L. Chen, "Monitoring near duplicates over video streams," in Proceedings of the ACM International Conference on Multimedia (ACM-MM), Firenze, Italy, Oct. 25-29, 2010.
Ngo, “Practical elimination of near-duplicates from web video search,” In Proceedings of ACM International Conference on Multimedia (ACM-MM), Augsburg, Germany, Sep. 23-28, 2007. M. C. Yeh and K. T. Cheng, “Fast visual retrieval using accelerated sequence matching,” IEEE Transactions on Multimedia, Vol. 13, No. 2, pp. 320-329, 2011. X. Zhou and L. Chen, “Monitoring near duplicates over video streams,” In Proceedings of ACM International Conference on Multimedia (ACM-MM), Firenze, Italy, Oct. 25-29, 2010.


Chih-Yi Chiu (M'10) received the B.S. degree in information management from National Taiwan University, Taiwan, in 1997, the M.S. degree in computer science from National Taiwan University, Taiwan, in 1999, and the Ph.D. degree in computer science from National Tsing Hua University, Taiwan, in 2004. From January 2005 to July 2009, he was with Academia Sinica as a Postdoctoral Fellow. In August 2009, he joined National Chiayi University, Taiwan, as an assistant professor in the Department of Computer Science and Information Engineering. His current research interests include multimedia retrieval and human-computer interaction.

Sheng-Yang Li received the B.S. degree in computer science and information engineering from Chang Jung Christian University, Taiwan, in 2010, and the M.S. degree in computer science and information engineering from National Chiayi University, Taiwan, in 2012. In September 2012, he joined Acer Inc. as an R&D engineer. His current research interests include multimedia retrieval and human-computer interaction.

Cheng-Yu Hsieh received the B.S. and M.S. degrees in computer science and information engineering from National Chiayi University, Taiwan, in 2010 and 2012, respectively. In September 2012, he joined the Chung-Shan Institute of Science and Technology as an R&D engineer. His current research interests include multimedia retrieval and computer graphics.

