Performance Bottlenecks in Manycore Systems: A Case Study on Large Scale Feature Matching within Image Collections

Xiaoxin Tang*, Steven Mills†, David Eyers†, Kai-Cheung Leung†, Zhiyi Huang† and Minyi Guo*
* Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science, Shanghai Jiao Tong University, China
† Department of Computer Science, University of Otago, New Zealand

Abstract—In memory-intensive algorithms, the problem size is often so large that it cannot fit into the cache of a CPU, and this may result in an excessive number of cache misses, a bottleneck that can easily make seemingly embarrassingly parallel algorithms such as feature matching unscalable on manycore systems. To address this bottleneck, this paper proposes a general Divide-and-Merge methodology, which divides the feature space into several small sub-spaces so that the demand on shared resources in each sub-space can be satisfied without causing bottlenecks. Experimental results show that the Divide-and-Merge methodology reduces L3 cache misses and the time spent on memory-allocation-related system calls, resulting in a 211% performance improvement on an AMD 64-core machine, and 57% and 16% performance improvements on AMD and Intel 16-core machines respectively. Performance results on a modern GPU also show that a well-tuned algorithm with time complexity O(F^2) is able to outperform a state-of-the-art O(F^1.5) algorithm by 188% on our real-world dataset, which again highlights the huge performance impact of the memory system.

Keywords—Parallel computing; Feature matching; Memory wall; Divide-and-Merge.

I. INTRODUCTION

Multicore and manycore systems have been a disruptive technology in chip manufacturing, and have almost eradicated single-core CPUs. Sequential programs can no longer get a free ride from the continual increases in CPU frequency provided in the past, due to the well-known power wall that chip manufacturers cannot overcome. Software performance can now only be improved through parallel programming. Due to this inevitable trend, multicore and manycore CPUs have come to dominate the desktop and server markets.

Many programming languages and frameworks have been developed to enable parallel computing on manycore architectures [1]–[4]. To ease parallel programming, languages such as OpenMP [5], Cilk [6], OpenACC [7], and X10 [8] have been proposed. They aim to allow programmers to focus on data structures and algorithms, while performance is guaranteed through compiler optimization and runtime support that are transparent to users. An even easier way to utilize parallel architectures is to employ pre-built libraries written by domain experts in parallel computing. For example, LINPACK [9] and LAPACK [10] are popular software libraries for numerical linear algebra, and cuFFT [11] and cuBLAS [12] are popular GPU-accelerated libraries.

Through use of these mature, well-tuned libraries, parallel computing becomes much easier for new programmers. However, achieving good performance in parallel computing is still hard, for both theoretical and practical reasons. Amdahl's law [13], as expressed in Equation (1), indicates that when the number of processing units (p) is large enough, the overall speedup of a parallel program is not decided by its parallelizable portion but limited by its serial portion (f), the part of the computation that cannot be parallelized efficiently. This factor has inspired many researchers to further parallelize their programs to achieve better performance. However, it also raises a doubt: is it worth spending so much time parallelizing a program that can only achieve a limited speedup?

\[ S_{\mathrm{Amdahl}} = \frac{1}{f + \frac{1-f}{p}} \;\longrightarrow\; \frac{1}{f} \quad \text{as } p \to \infty \qquad (1) \]
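For illustration only (not from the paper), a small numeric sketch of Equation (1); the serial fraction f = 0.05 is an assumed value:

#include <cstdio>

// Amdahl's law (Equation 1): speedup with serial fraction f on p processing units.
static double amdahl(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main() {
    const double f = 0.05;  // assumed serial fraction of 5%
    for (int p : {1, 16, 64, 1024}) {
        std::printf("p = %4d  speedup = %.2f\n", p, amdahl(f, p));
    }
    std::printf("limit as p -> infinity: %.2f\n", 1.0 / f);  // 20x for f = 0.05
    return 0;
}

Even with 95% of the work parallelizable, the speedup saturates at 20x no matter how many cores are added.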

In practice, as more processing units are used to solve larger and more complex problems, the serial portion of a program becomes relatively smaller. Based on this observation, Gustafson's law has been proposed [14]. Its basic assumption is that the parallelizable portion of a program grows in proportion to the number of processing units (p) used. Thus, as Equation (2) shows, the speedup is not only decided by the serial portion (f), but also by the problem size, which is proportional to the number of processing units (p). When the problem size is large enough, proportional speedup (p) can be achieved.

\[ S_{\mathrm{Gustafson}} = \frac{f + (1-f) \cdot p}{f + \frac{(1-f) \cdot p}{p}} = f + (1-f) \cdot p \qquad (2) \]

In the "big data" era, data parallelism is becoming more popular. In reality, the amount of data is often much larger than the available number of processing units. Thus, embarrassing data-level parallelism can reduce the serial portion f almost to zero. In this case, under both Amdahl's law and Gustafson's law, we have S ≈ p. However, given all the reported difficulties with practical parallel programming, this result seems too good to be true. In this paper, we show the performance bottlenecks of manycore systems running parallel programs with sufficiently embarrassing data-level parallelism. We find that, though

embarrassingly parallel programs have no communication or synchronization overhead between threads and their serial portion is close to zero, shared hardware and software resources in manycore systems, such as memory, the last-level cache, and the OS kernel, may become the new serial portion of parallel programs. When these shared resources cannot satisfy the needs of the massive number of processing units, they limit the speedup of embarrassingly parallel programs that are supposed to have good speedup according to both Amdahl's law and Gustafson's law. Therefore, we hope this paper will serve as a footnote to both Amdahl's law and Gustafson's law, and may help programmers to better optimize their programs.

We use SIFT-based feature matching [15] as a case study to show the performance bottlenecks in manycore systems. The performance is evaluated on several manycore machines: one AMD 16-core, one Intel 16-core, one AMD 64-core and one NVIDIA 448-core GPU. Two k-Nearest Neighbors (k-NN) [16] search algorithms are used for image feature matching in our performance evaluation. These algorithms are widely used in related research and in OpenCV [17] for image matching. To demonstrate the performance of GPGPU, we also evaluate two GPU versions of k-NN algorithms [18], [19]. On both 16-core and 64-core machines, our experimental results show that the memory wall and system calls to the OS kernel may become performance bottlenecks. Since images with a large number of features will not fit into the shared last-level cache during processing, they result in an increasing number of memory accesses, which in turn makes memory bandwidth a bottleneck and leads to poor speedup of the algorithm. Large feature spaces also require more per-thread heap space to store intermediate results, which requires frequent system calls to allocate memory. Since OS kernels usually use per-process spinlocks to control access to the virtual memory areas, many memory allocation operations cause serious contention between threads.

To solve these performance bottlenecks, we propose a Divide-and-Merge methodology by which the input data (here, the SIFT feature space) is split into several smaller sub-spaces. We will demonstrate that this optimization brings three advantages. First, the feature data and index structures of these smaller sub-spaces can fit into the last-level cache, which greatly reduces the number of memory accesses. Second, the runtime per-thread memory usage is reduced due to the smaller problem size, which avoids the contention caused by system calls that allocate heap memory. Third, a smaller problem size makes it possible to implement efficient algorithms on modern GPGPUs that have limited device memory. Thus, this method extends traditional Divide-and-Conquer to manycore computing and improves the scalability of k-NN algorithms on manycore systems.

Experimental results show that the proposed Divide-and-Merge methodology can efficiently improve the speedup of k-NN algorithms on modern manycore systems. With this methodology, the speedup of k-NN search on an AMD 64-core machine is increased from 9.49× to 29.5×, which is a 211% improvement. The speedups on AMD 16-core

and Intel 16-core machines are also improved, by 57% and 16% respectively. Experimental results on manycore GPGPU devices show that a finely-tuned brute-force algorithm with a time complexity of O(F^2) is able to outperform a state-of-the-art algorithm with a complexity of O(F^1.5) on our real-world, large-scale dataset. The former is faster than the latter by 188%, which shows that low-level bottlenecks in the memory system become key factors when optimizing for GPGPUs. In summary, these results show that our Divide-and-Merge methodology is applicable to a wide variety of manycore systems.

The contributions of this paper are as follows:

• We have conducted a wide range of experiments using large-scale image matching on several modern manycore systems. The results show that the speedup of a typical embarrassingly parallel program is highly dependent on its scalability on the underlying manycore system.

• We have comprehensively evaluated the impact of the memory wall and system calls on four popular k-NN algorithms. These results and analyses are useful for programmers to understand these performance bottlenecks and to optimize their programs accordingly.

• We have proposed a Divide-and-Merge methodology that can efficiently improve the speedup of k-NN algorithms for image feature matching on manycore systems. It is general enough for k-NN problems that other related applications may also benefit from this work.

This paper is organized as follows: Section II introduces the case study and our motivation. Section III presents the k-NN problem and some popular algorithms. Section IV presents our Divide-and-Merge methodology and explains why it is able to improve performance. Section V shows the experimental results of our performance evaluation and performance tuning on several popular manycore systems. Section VI discusses related work. Finally, Section VII draws conclusions and sheds light on our future work.

II. MOTIVATION

Images are one of the most important forms of media that people use to store their data. Every day, all around the world, people are creating new images. The large number of available images makes computer vision a very interesting and challenging research field. However, how to process such a large number of images efficiently and accurately remains a challenge.

In this paper, a widely used computer vision algorithm, SIFT-based image feature matching, is studied on manycore machines with up to 64 CPU cores and 448 GPU cores. The algorithm can be used to find out which images have overlapping areas. It mainly involves two steps:

1) Use the SIFT [15] algorithm to generate features for each image.
2) Use a k-NN matching algorithm to find similar features between images.

As a result of the first step, the black circles in Figure 1 indicate the locations of features that SIFT has extracted.

Fig. 1. Typical images from the 'Manawatu' dataset. SIFT features are marked as black circles. The lines connect matched features on the two images.

Feature descriptors are organized in a high-dimensional space. Each feature descriptor is represented by a 128-element array in which each element represents one dimension. The SIFT library describes features in a way that is invariant to image translation, scaling, and rotation. Euclidean distance is frequently used to represent the similarity between two features.

In the second step, the k-NN matching algorithm is used to find similar features between images. Given F reference features and one query feature, k-NN picks the k features from the reference set that are nearest to the query feature according to their distance. For example, the lines in Figure 1 connect matched features in the two images. The k-NN matching stage consumes the most computational time in this feature-matching process. Suppose that it takes constant time to generate features for each image with SIFT and to get matched results for a pair of images with k-NN. The time complexity is then O(n) for SIFT and O(n^2) for k-NN, where n represents the number of images. If n is very large, k-NN will dominate the time consumed during image matching. Note that the nearest neighbor problem is also a very time-consuming procedure in several other applications, including pattern recognition, chemical similarity analysis, and statistical classification (see http://en.wikipedia.org/wiki/Nearest_neighbor_search for more information). Thus, it is very worthwhile optimizing this step, as our work can potentially be applied in these other applications.

We choose a set of real-world images as our test set. The aerial images were taken by cameras on a plane flying over an area of the Manawatu District in very clear weather. There are 81 images in this dataset. Each image has a size of the order of 2300×1500 pixels (without any compression). We believe that this dataset is sufficient for performance evaluation on a single machine as it can provide up to 6400 image pairs. In summary, our case study has rich content that can help us to identify the factors that may influence the performance and scalability of feature matching.

III. NEAREST NEIGHBOR SEARCHING

We can define a feature f as an array that contains D elements:

\[ f = [e_1, e_2, \ldots, e_D] \qquad (3) \]

Feature space Fs is defined as a set of features:

\[ F_s = \{f_1, f_2, \ldots, f_F\} \qquad (4) \]

Euclidean distance is often used to measure the distance between two features:

\[ \|f_i - f_j\| = \sqrt{\sum_{m=1}^{D} \left(f_i[m] - f_j[m]\right)^2} \qquad (5) \]
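A minimal sketch of Equations (3)–(5) in C++, representing a SIFT descriptor as a 128-element float array (the element type is our assumption, not specified by the paper):

#include <array>
#include <cmath>
#include <vector>

constexpr int D = 128;                       // SIFT descriptor dimensionality
using Feature = std::array<float, D>;        // Equation (3): f = [e1, ..., eD]
using FeatureSpace = std::vector<Feature>;   // Equation (4): Fs = {f1, ..., fF}

// Equation (5): Euclidean distance between two features.
double distance(const Feature& fi, const Feature& fj) {
    double sum = 0.0;
    for (int m = 0; m < D; ++m) {
        const double diff = fi[m] - fj[m];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}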

Based on the above definitions, the k-NN problem can be formally described as: given a query feature f_q, find the k features in Fs that have the shortest distances to f_q. When D is small, for example D = 1, the problem reduces to a binary search with a time complexity of O(log F). However, D is usually very large in many applications. For example, the SIFT algorithm generates features with D = 128. The high feature dimension brings two major challenges. First, it increases the time taken to calculate the distances. Second, it brings more uncertainty and randomness into the feature distribution, which often makes existing k-NN algorithms inefficient.

There are two main classes of k-NN algorithms. One is called "accurate algorithms", which return the exact k nearest neighbors. Due to the lack of efficient k-NN algorithms for high-dimensional spaces, accurate algorithms cost more time. Since exact results are not necessary for applications like image matching, "approximate algorithms" are also used. Instead of returning exact results, they return k neighbors that have a high probability of being the k nearest neighbors. The two classes of algorithms represent a trade-off between performance and accuracy. In this section, we introduce four k-NN algorithms: two accurate algorithms and two approximate algorithms. We analyze their shortcomings when running on manycore systems, and then present our Divide-and-Merge methodology and explain why it is able to improve their performance on manycore systems.

Fig. 2. From binary search tree to K-Dimensional tree.

A. Accurate algorithms

1) Brute-force algorithm on GPU: A brute-force k-NN algorithm has two steps: first, calculate the distances between the query feature and all features in the reference set; second, use a minimum heap of size k to find the k nearest features. Suppose, in our feature-matching problem, that there is a query set and a reference set, both of which have F features. Then it takes O(F^2 · D) time to calculate the distances between the two sets and O(F^2 · log k) time to find the k nearest neighbors for all query features. As D is usually much larger than log k, the overall time complexity of the brute-force algorithm is O(F^2 · D), which shows that the distance calculation in the first step dominates the processing time.

If we organize the query set and the reference set as two matrices, as in Equation (6), then the distance calculation can be transformed into a matrix multiplication problem:

\[ \underbrace{\begin{pmatrix} q_1 \\ \vdots \\ q_F \end{pmatrix}}_{\text{query}} \times \underbrace{\begin{pmatrix} f_1^T, & \ldots, & f_F^T \end{pmatrix}}_{\text{reference}} = \underbrace{\begin{pmatrix} d_{11} & \ldots & d_{1F} \\ \vdots & & \vdots \\ d_{F1} & \ldots & d_{FF} \end{pmatrix}}_{\text{distances}} \qquad (6) \]

Each row of the query matrix represents one query feature, while each column of the reference matrix represents one reference feature. Each row of the resulting distance matrix contains the distances between one query feature and all features in the reference set.
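The distance matrix in Equation (6) can be obtained with a single matrix multiplication by expanding the squared distance as ||q − r||² = ||q||² + ||r||² − 2·(q · r). The following is our own minimal CPU sketch of this decomposition (plain loops stand in for the GEMM call that a cuBLAS-based implementation would use; none of the names come from [18]):

#include <vector>

// Squared-distance matrix between query set Q (F x D) and reference set R (F x D),
// using ||q - r||^2 = ||q||^2 + ||r||^2 - 2 * (q . r).  The inner-product term is
// exactly the matrix product Q * R^T, which is the part handed to the BLAS library.
std::vector<float> distanceMatrix(const std::vector<float>& Q,
                                  const std::vector<float>& R,
                                  int F, int D) {
    std::vector<float> qNorm(F, 0.0f), rNorm(F, 0.0f);
    std::vector<float> dist(static_cast<std::size_t>(F) * F);
    for (int i = 0; i < F; ++i)
        for (int m = 0; m < D; ++m) {
            qNorm[i] += Q[i * D + m] * Q[i * D + m];
            rNorm[i] += R[i * D + m] * R[i * D + m];
        }
    for (int i = 0; i < F; ++i)          // rows: query features
        for (int j = 0; j < F; ++j) {    // columns: reference features
            float dot = 0.0f;            // (Q * R^T)[i][j]
            for (int m = 0; m < D; ++m) dot += Q[i * D + m] * R[j * D + m];
            dist[i * F + j] = qNorm[i] + rNorm[j] - 2.0f * dot;
        }
    return dist;  // row i holds squared distances from query i to all references
}

Because the cross-term loop is a dense matrix product, it maps directly onto a highly tuned GEMM routine on the GPU, which is where most of the speedup of this formulation comes from.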

With a smart transformation developed by Garcia et al. [18], we are able to use cuBLAS [12], a state-of-the-art GPU-accelerated library, to calculate the distance matrix more efficiently. Since this highly optimized GPU algorithm takes O(F^2) space to store the distance matrix, the device memory of the GPU may not be large enough. For example, if F = 40,000, the distance matrix alone takes 40,000^2 × 4 bytes ≈ 5.96 GiB of memory. While the latest high-end GPUs have 6 GiB of device memory, most ordinary GPUs only have 1–2 GiB. Note that the largest number of features in our test image set is 56,000. This observation shows that the GPU brute-force algorithm can easily be limited by device memory. We will show that our Divide-and-Merge methodology solves this problem.

Fig. 3. Data structure for random ball cover algorithm.

2) KD-tree: K-Dimensional trees (KD-trees) are a general form of binary search tree. As Figure 2 shows, there are two main differences between a binary search tree and a KD-tree: first, only leaves are used to store data in the KD-tree, while all nodes of a binary tree can store data; second, at each level the KD-tree uses a different dimension as the cut plane, while there is only one dimension to split on in a binary tree. For example, in Figure 2, at the root level the first dimension is selected to split the space into two halves, while at the second level dimension 2 is selected. A total of K dimensions can be selected for indexing the high-dimensional space.

When searching the KD-tree, the algorithm recursively traverses the tree to find the nearest neighbors, and branches that cannot contain the nearest neighbors are pruned. In low-dimensional spaces, KD-trees work very well. However, in high-dimensional spaces a KD-tree may perform no better than the brute-force algorithm described above [16]. As different queries may search different branches, resulting in many random accesses to both the tree structure and the feature data, this algorithm cannot be implemented efficiently on a GPU with a SIMD architecture. Even on a CPU these random accesses may cause a large number of L3 misses, hitting the memory wall. Again, our Divide-and-Merge methodology is applied to this algorithm to reduce memory accesses.

B. Approximate algorithms

1) Randomized KD-trees (RKD-trees): To reduce the time complexity of searching a single KD-tree, the RKD-trees algorithm randomly creates multiple KD-trees and searches them concurrently for nearest neighbors. During the search, the best candidate nodes, those with possibly the smallest distances to the query feature, are put into a priority queue. Each time a leaf node is reached and its distance is calculated, the best candidate node from the priority queue is picked and the search continues from that node. Instead of searching the whole feature space, the algorithm repeats this procedure only C times, where C is chosen by the user. It cannot guarantee that all of the actual k nearest neighbors will be found in the C trials. Therefore, the RKD-trees algorithm is an approximate algorithm, with a time complexity of O(C · F · log F · D). Just like KD-trees, RKD-trees can hit the memory wall. Another issue with RKD-trees is that, at runtime, it needs a system call to allocate a large heap buffer for the per-thread priority queue. This memory allocation hits the bottleneck of memory management in the OS, which is quite difficult for users to diagnose. Once again, our Divide-and-Merge methodology helps to avoid this performance bottleneck.

2) Random Ball Cover (RBC) on GPU: The RBC [19] algorithm uses a clustering technique to reduce the time complexity of searching a high-dimensional space. As Figure 3 shows, the feature space is divided into √F groups, each of which has √F features. A group i has a feature r_i as the

representative feature of the group. These representative features are put into a special representative group. Given a query feature, the RBC algorithm uses a brute-force search to find the nearest feature r_j in the representative group. Then the corresponding group j is searched, again using a brute-force search, to find the k nearest neighbors of the query feature. The time complexity for searching for F query features is O(F · √F · D). For high-dimensional spaces, it is very hard to group the reference features in a way that satisfies the above grouping requirement. Since the nearest neighbors for a query feature may appear in multiple groups, the algorithm may not find the actual nearest neighbor set. Therefore, the RBC algorithm is an approximate algorithm.

The idea behind the RBC algorithm is very close to Divide-and-Merge. Both of them divide the original problem into smaller ones so that the best performance on manycore systems can be achieved. However, the RBC algorithm is highly dependent on its grouping technique and cannot be applied to other k-NN algorithms; our Divide-and-Merge methodology is more general. We also find that the RBC algorithm is not optimized very well. As different queries may be assigned to different feature groups, distance calculations cannot be made as efficiently as with the matrix multiplication. Another problem is that, instead of using a minimum heap as in the brute-force algorithm above, RBC uses inter-warp sorting, which causes more data movement and puts more pressure on the device memory. Our experimental results will show that although RBC has a better theoretical complexity of O(F^1.5 · D), its performance on a GPU is worse than that of the O(F^2 · D) brute-force GPU algorithm.

IV. DIVIDE-AND-MERGE

In order to improve the scalability of k-NN algorithms on manycore systems, we propose a Divide-and-Merge methodology that is applied to k-NN algorithms. Its pseudocode is presented in Algorithm 1. Input data are given in Lines 1, 2 and 3. In Line 4, max_size represents the threshold size for each sub-space; its value is decided by experiment, and different processors and algorithms may require different values. Line 6 calculates s, the number of sub-spaces. Line 7 divides the feature space of the reference image into several smaller sub-spaces; Line 8 initializes the index structures for the k-NN algorithm; Line 9 initializes the structures that store the k-NN results. Lines 10 to 12 build the index structures for each sub-space, which the k-NN algorithms require for searching. Note that the index structures only need to be built once and can be shared by all subsequent queries. Lines 14 to 16 perform parallel k-NN searching of the query feature space within each sub-space. In Line 19, for each query feature in the query space, all the resulting k-NN features from the s sub-spaces (a total of s × k features) are sorted according to their distances. Finally, Line 20 returns the k nearest features as the results.

The key to this method is to find the threshold size for each sub-space so that the best speedup can be achieved. For algorithms on GPUs, the size is determined by the capacity of available device memory: given the numbers of query and reference features, we need to guarantee that their distance matrix will not exceed the capacity of device memory.

Algorithm 1 Divide-and-Merge
 1: Fr: {Feature space of the reference image}
 2: Fq: {Feature space of the query image}
 3: k: {Number of nearest neighbors}
 4: max_size: {Maximum size of sub-spaces}
 5: s: {Number of sub-spaces}
 6: s ← Fr.size ÷ max_size
 7: divide Fr into s sub-spaces: Fr[1..s]
 8: initialize index[1..s]
 9: initialize result[1..s]
10: for i = 1 to s do
11:     index[i] ← kNN.buildIndex(Fr[i])
12: end for
13: {Divide stage starts}
14: for i = 1 to s do
15:     result[i] ← index[i].knnParallelSearch(Fq, k)
16: end for
17: {Divide stage ends}
18: {Merge stage starts}
19: sort the features in result[1..s]
20: return the smallest k features in result[1..s]
21: {Merge stage ends}
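A minimal C++ sketch of Algorithm 1 follows. KnnIndex and its build/search interface are hypothetical stand-ins for whichever k-NN algorithm is being wrapped (brute force, KD-tree, RKD-trees); they are not an API from the paper or from any particular library.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Neighbor { int id; float dist; };           // one k-NN candidate
struct KnnIndex {                                   // hypothetical per-sub-space index
    void build(const std::vector<std::vector<float>>& refs);
    // one row of k candidates per query feature; searched in parallel internally
    std::vector<std::vector<Neighbor>> search(
        const std::vector<std::vector<float>>& queries, int k) const;
};

std::vector<std::vector<Neighbor>> divideAndMerge(
        const std::vector<std::vector<float>>& Fr,   // reference feature space
        const std::vector<std::vector<float>>& Fq,   // query feature space
        int k, std::size_t maxSize) {
    const std::size_t s = (Fr.size() + maxSize - 1) / maxSize;  // number of sub-spaces
    std::vector<std::vector<std::vector<Neighbor>>> partial;    // s result sets

    // Divide stage: build an index per sub-space and search it with all queries.
    for (std::size_t i = 0; i < s; ++i) {
        const std::size_t lo = i * maxSize, hi = std::min(Fr.size(), lo + maxSize);
        std::vector<std::vector<float>> sub(Fr.begin() + lo, Fr.begin() + hi);
        KnnIndex index;
        index.build(sub);                  // built once, reused by all queries
        partial.push_back(index.search(Fq, k));
    }

    // Merge stage: for each query, sort the s*k candidates and keep the k nearest.
    std::vector<std::vector<Neighbor>> result(Fq.size());
    for (std::size_t q = 0; q < Fq.size(); ++q) {
        std::vector<Neighbor> all;
        for (const auto& p : partial) all.insert(all.end(), p[q].begin(), p[q].end());
        std::sort(all.begin(), all.end(),
                  [](const Neighbor& a, const Neighbor& b) { return a.dist < b.dist; });
        all.resize(std::min<std::size_t>(k, all.size()));
        result[q] = std::move(all);
    }
    return result;
}

Note that this sketch rounds the number of sub-spaces up so that every reference feature is covered, whereas Line 6 of Algorithm 1 uses integer division; either convention works as long as the sub-spaces partition Fr.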

A. Related questions

The idea of Divide-and-Merge is not hard to understand. Nevertheless, to understand why it is able to improve the performance of k-NN algorithms on manycore systems, the following questions need to be answered.

1) How does the memory wall become a bottleneck?: In the single-core age, the memory wall is mainly caused by the gap between the speeds of memory (RAM) and the CPU. Memory is usually much slower than the CPU, so the CPU needs to wait for the memory. Usually a hierarchical cache system is used to bridge this gap so that the CPU can get data quickly from memory via the cache system. In the current manycore age, even with a large cache system, the memory wall can still be a performance bottleneck due to contention on the shared cache. As the last-level cache is shared by more and more cores, cache efficiency can drop dramatically: the cache capacity is limited while a large amount of memory is accessed by each core, which results in more cache misses. The memory requests that miss in the cache system are sent to main memory. Since the memory bus (e.g., the northbridge) is shared by all available cores, cores have to wait for each other on memory accesses. Therefore, when there are more cores in the system, the memory wall becomes a more serious bottleneck due to the shared cache and memory bus.

2) How could the system calls be a bottleneck?: As the operating system is responsible for managing all hardware and software resources, it can be a performance bottleneck if many cores are requesting services at the same time. For example, the system calls "brk" and "mmap" are used to manage per-process virtual memory. Our tests show that when a thread tries to allocate a heap buffer smaller than 128 KiB, it uses "brk" to allocate memory from an existing memory management area (MMA), which causes no contention in the OS. However, when the requested memory space is larger than 128 KiB, "mmap" has to be used to create a new MMA. In this case, a per-process spinlock is used to protect the integrity of the MMA list, which is common practice in OSs such as Linux. When many threads are trying to create a new MMA, they have to acquire this lock first, which becomes a bottleneck. Though recent research [20] has addressed this issue, existing Linux kernels still suffer from this bottleneck. In addition, similar bottlenecks exist in other parts of a system, such as device drivers and libraries, depending on their implementations. This kind of bottleneck is very tricky for new programmers, as they often fail to notice these issues.
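One way to observe (and, for a specific allocation pattern, work around) the 128 KiB behaviour described above is glibc's mallopt interface; 128 KiB is glibc's default M_MMAP_THRESHOLD, and the exact policy is allocator-specific. This is our own illustration and is not a technique used in the paper; the paper instead shrinks the allocations themselves via Divide-and-Merge.

#include <cstdlib>
#include <malloc.h>   // mallopt, M_MMAP_THRESHOLD (Linux/glibc assumed)

int main() {
    // With the default threshold (~128 KiB), each request at or above it is served
    // by mmap(), whose bookkeeping is serialized per process.  Raising the threshold
    // lets moderately large per-thread buffers come from the existing heap instead.
    // The difference can be observed with: strace -c -e trace=mmap,brk ./a.out
    mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);   // allow up to 4 MiB from the heap

    void* buf = std::malloc(512 * 1024);          // 512 KiB: mmap by default, heap now
    std::free(buf);
    return 0;
}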

3) Why can Divide-and-Merge avoid the bottlenecks?: We believe that when there are more cores in the system, a program is more likely to meet bottlenecks due to the shared resources. Programs with a large problem size often consume more resources and are therefore more likely to hit those bottlenecks. The principle behind Divide-and-Merge is that if we can control the problem size, we can control the contention on the shared resources. A small problem size requires fewer shared resources and is more likely to be satisfied by the system. Through Divide-and-Merge, we are able to divide the original problem into smaller sub-problems. Since a program with a smaller problem size allows the data used by different cores to fit into the shared last-level cache, cache misses are reduced and overall performance is improved. Similarly, a program with a small problem size also places less demand on the memory management of the OS; therefore, the contention on the MMA list is relieved and the bottleneck of system calls is avoided. Though specific optimizations can be made to particular algorithms, our Divide-and-Merge methodology is more general and can be applied to a range of k-NN algorithms.

4) What are the differences between Divide-and-Merge and other methods?: Divide-and-Merge is not an entirely new idea. Researchers have already proposed and used similar methods such as Divide-and-Conquer and MapReduce. Divide-and-Conquer is frequently used in designing new algorithms. Its basic principle is that a large problem can be divided into smaller problems, since it is easier to solve the smaller problems first and then combine their results. However, its shortcoming is that it is applied to only a single problem at a time. If it were applied to k-NN algorithms straightforwardly, for each query, it would use the same amount of cache resource and hit the same memory wall as the original algorithms, because each query would have to access the whole feature space during its own Divide-and-Conquer. In Divide-and-Merge, by contrast, each query only accesses a smaller feature sub-space, together with the other queries. After all queries are finished in one sub-space, they move to another sub-space

until all sub-spaces have been explored by all queries in this way. Finally, the results from the different sub-spaces are merged for each individual query. Since the sub-space in the shared cache can be reused by many queries, the number of cache misses can be greatly reduced.

MapReduce is often used to schedule independent tasks onto computing resources. It first maps all independent tasks onto different processing units, and then collects their results through the "reduce" phase. This idea has proved to be simple and efficient in handling a large number of tasks. However, its shortcoming is that it is only applicable to independent tasks; it is hard to handle tasks that are related to one another. For example, the results of the same query from different sub-spaces should be merged together in Divide-and-Merge, which is unwieldy for MapReduce.

In fact, Divide-and-Merge combines the ideas of both Divide-and-Conquer and MapReduce. It distributes tasks as MapReduce does, but it collects the results for each query in the same way as Divide-and-Conquer.

B. Time Complexity Analysis

In this section, we analyze the impact of Divide-and-Merge on the time complexity of the k-NN algorithms. The question is, will it change the time complexity of the algorithm it is applied to? The answer is "yes", but the impact can be confined to a limited scope, e.g., only a small constant factor.

1) Accurate algorithms: As mentioned in Section III-A, no efficient, accurate k-NN algorithms better than brute force are available for high-dimensional problems [16]. Thus, the time complexity of the brute-force algorithm, given in Equation (7), becomes an upper bound for other accurate algorithms; that is, other accurate algorithms should not perform worse than the brute-force algorithm.

\[ T_{\mathrm{acc}} = O(F^2 \cdot D) \qquad (7) \]

Suppose the Divide-and-Merge methodology divides the feature space into s sub-spaces. Since it repeats the feature matching s times, the time complexity of Divide-and-Merge applied to accurate algorithms is shown in Equation (8).

\[ T_{\mathrm{acc\text{-}dm}} = s \cdot O\!\left(F \cdot \frac{F}{s} \cdot D\right) = O(F^2 \cdot D) \qquad (8) \]

The equation indicates that the time complexity of accurate algorithms is not changed when Divide-and-Merge is applied.

2) Approximate algorithms: Approximate algorithms have different time complexities, usually lower than those of accurate algorithms. Let us take the RKD-trees algorithm as an example. Its time complexity is shown in Equation (9).

\[ T_{\mathrm{rkd}} = O(C \cdot F \cdot \log F \cdot D) \qquad (9) \]

When Divide-and-Merge is applied to RKD-trees, the time complexity is shown in Equation (10). Here log(F/s) = log F − log s; as s is very small compared to F, log s can be ignored.

\[ T_{\mathrm{rkd\text{-}dm}} = s \cdot O\!\left(C \cdot F \cdot \log\frac{F}{s} \cdot D\right) \approx s \cdot O(C \cdot F \cdot \log F \cdot D) \qquad (10) \]

TABLE I
SAMPLE IMAGES FROM THE MANAWATU IMAGE SET.

Image ID                          0      1      2      3      4      5      6      7
Feature number, sample set (1)    8053   9102   10164  11577  13481  14958  16375  17343
Feature number, sample set (2)    5091   6912   8053   9102   11578  21900  29929  56074

TABLE II
HARDWARE AND SOFTWARE CONFIGURATION.

AMD 16-core (AMD16): AMD Opteron Processor 8380, 4 cores × 4 @ 2.5 GHz; cache L1: 128 KiB, L2: 512 KiB, L3: 6144 KiB; memory 16 GiB DDR2 800 MHz (bandwidth 12.8 GiB/s); Ubuntu 12.04.1 LTS, Linux 2.6.38; g++-4.4.

Intel 16-core (Intel16): Intel Xeon Processor E5-2665, 8 cores × 2 @ 2.4 GHz; cache L1: 64 KiB, L2: 256 KiB, L3: 20480 KiB; memory 128 GiB DDR3 1600 MHz (bandwidth 25.6 GiB/s); Ubuntu 10.04 LTS, Linux 2.6.32; g++-4.4.

AMD 64-core (AMD64): AMD Opteron Processor 6276, 8 cores × 8 @ 2.3 GHz; cache L1: 48 KiB, L2: 1000 KiB, L3: 16384 KiB; memory 64 GiB DDR3 1333 MHz (bandwidth 21.32 GiB/s); Ubuntu 12.04.1 LTS, Linux 2.6.38; g++-4.4.

NVIDIA 448-core (GPU): NVIDIA Tesla GPU C2075, 32 cores × 14 @ 1.15 GHz; L1: 16 KiB, shared: 48 KiB, L2: 768 KiB; memory 6 GiB GDDR5 1566 MHz (bandwidth 144 GiB/s); Ubuntu 10.04 LTS, Linux 2.6.32; CUDA 5.0.

Equation (10) indicates that the time complexity of Divide-and-Merge on RKD-trees has increased by a factor of s. The increased time complexity may overshadow the benefits of Divide-and-Merge. To reduce it, we may reduce C (the number of searched nodes) in the algorithm. If we substitute C/s for C in Equation (10), we can keep the time complexity of Divide-and-Merge on RKD-trees the same as that of the original RKD-trees, as shown in Equation (11).

\[ T_{\mathrm{rkd\text{-}dm\text{-}opt}} = s \cdot O\!\left(\frac{C}{s} \cdot F \cdot \log F \cdot D\right) \approx O(C \cdot F \cdot \log F \cdot D) \qquad (11) \]

Equation (11) indicates that the number of searched nodes in each sub-space is C/s, so the total number of searched nodes in the whole feature space is still C, the same as in the original RKD-trees. If the division of the feature space is random enough, the results of Divide-and-Merge should be statistically the same as those of RKD-trees. In the following section, we show that our Divide-and-Merge methodology efficiently improves performance on manycore systems when it is applied to various k-NN algorithms.

V. EVALUATION

In this section, we evaluate the performance of several k-NN algorithms and of our Divide-and-Merge approach on multiple manycore systems. Some of these k-NN algorithms are taken from the FLANN library [16], which is used by OpenCV [17]; the others are state-of-the-art GPU algorithms [18], [19]. Scalability is used frequently in our performance evaluation: with the same algorithm, input data and processor, scalability shows how the speedup changes when more cores are used in the system. Low-level hardware performance counters are also used to explain the causes of bottlenecks. L3-related events such as the L3 miss rate and L3 latency are used to demonstrate the memory wall, and the CPU unhalted rate is used to demonstrate the bottleneck caused by system calls.


We find that different problem sizes may achieve different speedups on manycore systems. To better demonstrate this, we pick two sample sets from the Manawatu image set, each containing eight images. As Table I shows, set (1) is used for the KD-tree algorithm and set (2) for the RKD-trees algorithm. Each image is represented by an ID, and different images have varying numbers of features. Using different images as the reference image, with the others as query images, provides a better way to show the varied performance within our test set. The hardware and software configurations are listed in Table II. Four common manycore platforms are used in our evaluation, and in this section we use the names listed in the table to refer to the corresponding platforms.

A. KD-tree

As described in Section III-A2, the KD-tree is an accurate algorithm and may explore the whole feature space of the reference image to find nearest neighbors. We find that this algorithm may hit the memory wall on manycore CPUs. To better demonstrate these factors, sample image set (1) is used in this evaluation. Figure 4 shows the experimental results on the AMD 64-core machine.

In Figure 4a, each line in the graph represents the speedup when the corresponding image is used as the reference image. The highest speedup (near 54× when 64 cores are used) is achieved for images 0, 1 and 2. However, the speedup is not stable across reference images. For example, the speedup is reduced to 50× for image 3, 33× for image 4, and 26× for image 5. Images 6 and 7 achieve the lowest speedup, which is only 21×. The speedup decreases from image 0 to image 7. The results in Figure 4b indicate that the poor scalability has a strong relationship with the L3 cache. The L3 miss rate for images 0, 1 and 2 is only 0.6%. As performance decreases, the L3 miss rate increases to 3.61%, 7.56% and 10.56% for images 3, 4 and 5 respectively.

Fig. 4. Experimental results of the single KD-tree algorithm on the AMD 64-core machine within sample image set (1): (a) scalability, (b) L3 miss rate and (c) L3 latency for the original algorithm; (d) scalability, (e) L3 miss rate and (f) L3 latency with Divide-and-Merge applied.

Images 6 and 7 have the highest L3 miss rates, 12.74% and 14.03% respectively. The L3 miss rate increases from image 0 to image 7. The increasing L3 miss rate also influences the L3 latency, as Figure 4c shows. Here "L3 latency" is the average number of cycles taken by the L3 cache to serve each L3 access request. It shows that the average response time of the L3 cache also increases from image 0 to image 7, which means the cores spend more time waiting for responses from the L3 cache. These results indicate that different reference images achieve different speedups because of varying L3 cache utilization, which can be explained by the different number of features in each image. Images with fewer features (images 0, 1 and 2) are more likely to utilize the L3 cache well, as their read-only data can be shared by the other cores on the same CPU socket. However, when the data exceeds the capacity of the L3 cache (images 3, 4 and 5), cores need to access memory frequently, which greatly reduces performance. In the worst case, the time spent waiting for responses from memory dominates the execution time (images 6 and 7). In this case, the program hits the memory wall.
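A back-of-the-envelope check of this explanation (our own arithmetic, assuming 4-byte descriptor elements and ignoring the KD-tree index nodes, which add further overhead):

#include <cstdio>

// Raw descriptor footprint for a reference image: features x 128 dims x 4 bytes.
// Compare against the shared L3 sizes in Table II (roughly 6-20 MiB); once the
// working set approaches or exceeds that, misses and memory traffic grow quickly.
int main() {
    const int featureCounts[] = {8053, 10164, 13481, 17343};  // images 0, 2, 4, 7 of set (1)
    for (int f : featureCounts) {
        const double mib = f * 128.0 * 4.0 / (1024.0 * 1024.0);
        std::printf("%5d features -> %.1f MiB of descriptor data\n", f, mib);
    }
    return 0;
}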


Fig. 5. Experimental results of the single KD-tree algorithm on AMD 16-core and Intel 16-core machines using sample image set (1).

Another interesting observation concerns the relationship between the number of cores used and the shared resources. By using more cores, shared resources like the system bus can be kept busy, which leads to better utilization. Utilization of the L3 cache also improves when more CPU sockets are used, as more L3 cache in those sockets becomes available. These factors lead to a reduced L3 latency in Figure 4c. However, due to contention among different cores, the L3 cache miss rate is higher when more cores are used; it increases markedly when 16 cores are used compared with 8 cores. In this case, reducing the usage of shared resources is more important.

The number of features in image 2 in sample set (1) (i.e., 10164) is selected as the threshold size for sub-spaces (max_size), as the images that contain more features achieve poor scalability on the AMD64. After applying our Divide-and-Merge, the use of shared resources is kept within a small footprint. Figure 4d shows that all the sample images now achieve good scalability. In Figure 4e, our optimization successfully keeps the L3 miss rate within 4%. This leads to a smaller L3 latency in Figure 4f for some large images.

On the AMD 16-core machine, we get similar results. To avoid presenting them redundantly, here we only give the scalability. As Figure 5 shows, the AMD16 also suffers from a reduced speedup, which is 11.5× when 16 cores are used. After applying our Divide-and-Merge, the speedup is increased to 15.2×. Note that in the figure we use the suffix "DM" to represent the results when Divide-and-Merge is applied. However, on the Intel 16-core machine we did not hit the same memory wall. As Figure 5 shows, its speedup is 14.2× when 16 cores are used, which is quite high. This is due to the enhanced bandwidth in that system. The Intel Xeon E5-2665 has 20 MiB of L3 cache per 8 cores, which is substantially larger than on both the AMD16 and AMD64 machines. Its memory bandwidth of 25.6 GiB/s is also higher than that of the AMD machines.

Fig. 6. Experimental results of the Randomized KD-trees algorithm on the AMD 64-core machine using sample image set (2): (a) scalability, (b) CPU unhalted rate and (c) L3 miss rate for the original algorithm; (d) scalability, (e) CPU unhalted rate and (f) L3 miss rate with reduced system calls; (g) scalability, (h) CPU unhalted rate and (i) L3 miss rate with Divide-and-Merge applied.

Another innovative improvement in its CPU architecture is that when one CPU misses in its own L3 cache, it can request that data not only from memory but also from the L3 caches of other CPUs through high-bandwidth channels between CPU sockets. This design increases the amount of L3 cache effectively available to each core, which may greatly reduce memory accesses. Although enhanced bandwidth and this innovative design can efficiently mitigate the problem, it is unclear what will happen when more cores are used with this Intel architecture. A larger L3 cache also takes more on-chip space, which may lead to a higher price and fewer cores per socket. In the future, we will try to conduct more experiments on this interesting issue.

B. Randomized KD-trees

As accurate algorithms for k-NN problems are very time-consuming, approximate algorithms are preferred in our image-matching case. We choose the RKD-trees algorithm from OpenCV and evaluate its performance on manycore CPUs. We find that although this algorithm has a lower time complexity, it also faces the challenge of the memory wall. Moreover, due to its large runtime memory usage, it also meets bottlenecks caused by system calls. To better demonstrate these factors, sample image set (2) is used in this evaluation. Figure 6 shows the experimental results on the AMD 64-core machine.

As Figure 6a shows, unlike the results in Figure 4a, the speedup appears to be polarized: the highest speedup is close to 50× when 64 cores are used, while the others are only around 10×. The CPU unhalted rate in Figure 6b gives the answer. Here "CPU unhalted rate" represents the percentage of time that the CPU is not in a halted state during execution. From the figure, we can determine that for images 3 to 7 the unhalted rates are all lower than 60%, which means that most of the time the CPU is not doing useful processing. The low unhalted rate does not actually mean that the CPU is halted. During the execution of system calls, the CPU may leave user mode and enter kernel mode; the more time spent in kernel mode, the less time is spent on useful user-mode workloads. The reason for this reduced unhalted rate is discussed in Section IV-A2. In detail, the RKD-trees algorithm uses a priority queue to store the best candidates during searching. For larger images, the algorithm creates a larger queue by default, which uses the system call "mmap" to create an extra memory management area (MMA). When this step is executed frequently for each query, it becomes a bottleneck.

One way to solve this problem is to reduce the size of the priority queue for this specific algorithm. When the size is small, the costly system call is not used. After applying this method, as Figure 6e shows, the unhalted rates of all sample images are very close to 100%, which means the CPU is doing useful work in user mode.

Fig. 7. Experimental results of the Randomized KD-trees algorithm on AMD 16-core and Intel 16-core machines using sample image set (2).

Fig. 8. Experimental results of the Randomized KD-trees algorithm on different manycore machines using the whole Manawatu image set.

Comparing the L3 cache miss rate in Figure 6f with that in Figure 6c, the L3 miss rate is also reduced, as the spinlock in the system call can itself cause many misses while the lock is repeatedly acquired. However, as Figure 6d shows, the bottleneck caused by the memory wall is still not solved.

The other way is to apply our Divide-and-Merge approach. The number of features in image 2 in sample set (2), which is 8053, is selected as the threshold size (max_size) for sub-spaces. As Figures 6g, 6h and 6i show, the speedup is improved due to a lower L3 miss rate and a higher unhalted rate. Though the bottleneck caused by the memory wall is not totally solved, as some images have feature sets much larger than the capacity of the L3 cache, Divide-and-Merge still achieves 131% more speedup than the original execution for image 7, the worst case. We also get similar results on both the AMD16 and Intel16 machines. As Figure 7 shows, the speedup on the AMD16 machine when 16 cores are used is increased from 10.2× to 13.6×, while a similar improvement from 12.7× to 15.6× is achieved on the Intel16 machine.

C. Test on the Manawatu image set

To make a fair comparison between the CPU and the GPU, we select representative algorithms for this evaluation. The RKD-trees algorithm is selected as the representative algorithm for manycore CPU platforms, as it has a lower time complexity than the KD-tree and can still achieve good results. A well-tuned GPU brute-force algorithm, called GPU Linear, is selected to demonstrate the power of the GPU. As mentioned in Section III-A1, our Divide-and-Merge is also used to reduce its device memory usage. The max_size for each sub-space is 25,000, as this supports up to 62,000 query features at the same time (a 25,000 × 62,000 distance matrix of 4-byte entries occupies roughly 5.8 GiB, just within the 6 GiB of device memory), which can fully utilize the GPU card. Another state-of-the-art algorithm for manycore systems, RBC, is also selected, as it has a lower time complexity than the brute-force algorithm.

As Figure 8 shows, the whole Manawatu image set is used to evaluate the performance. The time consumed by the serial (single-threaded) RKD-trees algorithm on the AMD 16-core machine is used as the baseline. After applying Divide-and-Merge, the best performance is achieved on the AMD64, which reaches a speedup of 29.5×. Speedup is improved by 211% on the AMD64, 57% on the AMD16 and 16% on the Intel16. The Intel16 also achieves 62% better performance than the AMD16. Though the AMD64 has four times as many cores as the Intel16, its optimized performance is only 23% better. These results show that an enhanced memory system is crucial to performance on future manycore CPU platforms. On the GPU, the best speedup is achieved by GPU Linear, which is 8.1×. As this brute-force algorithm provides accurate results, the lower speedup is reasonable given its higher time complexity. The RBC algorithm, however, only achieves a speedup of 2.8×. We believe this is caused by poor code optimization; similar memory wall problems may also appear on the GPU, as it has many more cores and a smaller shared cache.

In summary, we conducted a comprehensive evaluation of image feature matching in this section. Four popular k-NN algorithms were evaluated on four common manycore platforms. To demonstrate the bottlenecks in manycore systems, hardware performance counters were used to show the details during execution. Our Divide-and-Merge approach proved simple to incorporate and efficient at mitigating the bottlenecks that appear in k-NN problems.

VI. RELATED WORK

In exploring parallelism and improving performance in all aspects, we have developed methods and algorithms that optimize the large-scale processing of image features. In this section, we put our algorithms in the context of other similar computer vision research, though as far as we know there are not many works that investigate performance tuning for feature matching as thoroughly as we have done.

SIFT [15] is a widely used algorithm for image matching. In many cases, a KD-tree based search is used to find approximate nearest neighbors [21]. This reduces the cost to O(F log F), but may not be suitable for all architectures. Garcia et al. [22], for example, show that a brute-force GPU

implementation can outperform a CPU approach that uses KD-trees; Frahm et al. re-cast the matching problem as a large matrix multiplication task suitable for GPU implementation [23]; and Cayton [24] proposes a new data structure, the random ball cover, which can reduce the time complexity of the brute-force algorithm while still being suited to manycore implementation.

For large data sets, even state-of-the-art techniques place huge demands on a typical workstation. While Frahm et al. [23] use a single workstation in their work, the other research described here uses clusters of machines for the feature matching computation. These clusters are typically composed of nodes that have dual or quad-core processors, and provide from 200 to 500 processing cores [25]–[27]. The single workstation used by Frahm et al. had two quad-core Xeon processors at 3.33 GHz and four NVIDIA 295GTX graphics cards. Despite these resources, computation times are still measured in hours or days. In order to effectively make use of large numbers of processors, it is important to understand how the task of feature matching scales with the computational resources available. Based on our experience, performance tuning for increased scalability has turned out to be a very demanding endeavor.

VII. CONCLUSIONS

We have demonstrated the importance and challenges of performance tuning on manycore systems. Shared resources in both hardware and software may become serious bottlenecks on manycore systems. Two factors, the memory wall and system calls, are identified as particularly affecting performance. Using SIFT-based image feature matching as a case study, the impacts of these factors have been comprehensively evaluated. Since these factors involve three different levels of computer systems, the application level, the system software level and the architecture level, performance tuning on manycore systems has been shown to be a challenging task. For example, poor L3 cache utilization may cause serious scalability problems that are especially harmful to performance on manycore machines, as our experimental results show. Based on the evaluation and optimization results achieved in this paper, a particular aspect of future work will be to address performance issues in distributed systems, so that we are able to tackle feature-matching problems at even larger scales. Performance tuning in distributed environments is likely to be even more interesting and challenging, since having to coordinate data movement through the network will pose additional challenges to parallel computing.

ACKNOWLEDGMENT

We would like to thank the anonymous reviewers for their valuable comments. We would also like to thank Hawkeye UAV Ltd. and Areograph Ltd. for providing the Manawatu dataset used in this evaluation. Xiaoxin Tang would like to thank the University of Otago for hosting his PhD internship during the course of this research. This work is also partially supported by the National High-Tech R&D Program of China (863 Program) under grant No. 2011AA01A202.

REFERENCES

[1] D. R. Butenhof, Programming with POSIX Threads. Addison-Wesley Professional, 1997.
[2] NVIDIA, "Programming guide," 2008.
[3] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, vol. 1. MIT Press, 1999.
[4] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, p. 66, 2010.
[5] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, vol. 5, no. 1, pp. 46–55, Jan.–Mar. 1998.
[6] R. D. Blumofe, C. F. Joerg, et al., Cilk: An Efficient Multithreaded Runtime System, vol. 30. ACM, 1995.
[7] CAPS Enterprise, Cray Inc., NVIDIA, and the Portland Group, "The OpenACC application programming interface, v1.0," November 2011.
[8] P. Charles, C. Grothoff, V. Saraswat, et al., "X10: an object-oriented approach to non-uniform cluster computing," SIGPLAN Not., vol. 40, pp. 519–538, Oct. 2005.
[9] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide. No. 8, Society for Industrial Mathematics, 1987.
[10] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, et al., LAPACK Users' Guide, vol. 9. Society for Industrial Mathematics, 1987.
[11] NVIDIA, "CUFFT library," 2010.
[12] NVIDIA, "CUBLAS library," NVIDIA Corporation, Santa Clara, California, vol. 15, 2008.
[13] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), (New York, NY, USA), pp. 483–485, ACM, 1967.
[14] J. L. Gustafson, "Reevaluating Amdahl's law," Commun. ACM, vol. 31, pp. 532–533, May 1988.
[15] D. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, pp. 1150–1157, 1999.
[16] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Applications (VISAPP '09), pp. 331–340, INSTICC Press, 2009.
[17] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, 2008.
[18] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud, "K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching," in Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3757–3760, 2010.
[19] L. Cayton, "Accelerating nearest neighbor search on manycore systems," Parallel and Distributed Processing Symposium, International, vol. 0, pp. 402–413, 2012.
[20] A. T. Clements, M. F. Kaashoek, and N. Zeldovich, "Scalable address spaces using RCU balanced trees," SIGARCH Comput. Archit. News, vol. 40, pp. 199–210, Mar. 2012.
[21] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching," Journal of the ACM, vol. 45, pp. 891–923, 1998.
[22] V. Garcia, E. Debreuve, and M. Barlaud, "Fast k nearest neighbor search using GPU," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2008.
[23] J.-M. Frahm, P. Georgel, D. Gallup, et al., "Building Rome on a cloudless day," in European Conference on Computer Vision (ECCV), 2010.
[24] L. Cayton, "Accelerating nearest neighbour search on manycore systems," in IEEE Int. Parallel and Distributed Processing Symposium (IPDPS), 2012.
[25] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, "Building Rome in a day," in Int. Conf. Computer Vision (ICCV), 2009.
[26] D. Crandall, A. Owens, N. Snavely, and D. Huttenlocher, "Discrete-continuous optimization for large-scale structure from motion," in Computer Vision and Pattern Recognition (CVPR), 2011.
[27] Y. Lou, N. Snavely, and J. Gehrke, "MatchMiner: Efficient spanning structure mining in large image collections," in European Conference on Computer Vision (ECCV), 2012.
