GFT: GPU Fast Triangulation of 3D Points ´ Jairo R. S´anchez, Hugo Alvarez, and Diego Borro CEIT and Tecnun (University of Navarra), Manuel de Lardiz´ abal 15 20018 San Sebasti´ an Spain, {jrsanchez, halvarez, dborro}@ceit.es
Abstract. Traditionally triangulating 3D points from image features is a complex task that involves non-linear optimization techniques that are computationally very expensive. This work proposes an algorithm based on Monte Carlo simulations that fits well on the graphics hardware and can perform the triangulation in real time. Results are compared against the well known Levenberg-Mardquart using real video sequences, showing that it achieves the same precision but in much less time.
1
Introduction
Structure From Motion (SFM) problem is a subset of the Simultaneous Localisation And Mapping (SLAM) problem, where the objective is to solve the 3D structure of the scene and/or the camera motion relying on image data. This work addresses only the reconstruction part, and extends the work of Escudero et al. [1] that solves the motion estimation using similar principles. Traditionally SFM is addressed using recursive estimations (probabilistic maps) or batch reconstructions (minimization techniques). The first option is fast but not as accurate as one would wish. The second option is very accurate but not suitable for real time operation. Recursive estimators try to compute the reconstruction using Bayesian estimators, typically Extended Kalman Filters (EKF) [2]. We are interested in predicting the sequence of the state of our world Xk = {x1 , . . . , xk } having a set of observations Yk = {y 1 , . . . , y k }. The problem is formulated as the conditional probability of the current state given the entire set of observations: P (xk |Yk ). The key is that the current state probability can be expanded using the Bayes’s rule. Usually the state xk is composed by the camera pose and the 3D structure, and the observation y k is image measurements. Now the problem is how to approximate these probability densities. Authors like [3] use sequential Monte Carlo methods. The difference with the method proposed here is that they only take into account the pose information of the last frame (Markov sequence). Global minimization is used when real time operation is not needed. The problem is solved using non-linear minimization like Levenberg-Marquardt (LM) [4] and Bundle Adjustment (BA) [5]. The objective is to minimize the reprojection error of the 3D points given the measurements on the image. These methods produce good results if the initial value of the parameters is close to the global
optimum. Authors, like [6], use the L∞ norm and convex optimization. However, as the author states, the method is very sensitive to outliers. Recent works [7] are capable to reconstruct huge scenes using BA, but not in real time. Citing from [7]: “3D reconstruction took 27 hours on 496 compute cores.” Authors like [8] use local BA to get the optimization work in real time, but only use the last five frames. In this work, an alternative minimization technique is proposed based on Monte Carlo simulations [9] that can run on parallel systems, like GPUs. The advantage of using the GPU is that it has a lot of computing units, called shader processors, that can run simultaneously the same task on different data following the SIMD (Single Instruction Multiple Data) computational model [10]. The benefit of using the GPU comes when the size of the input data is large, making it necessary to adapt existing algorithms to fit this computational model. The main contribution of this paper is the design of a 3D triangulation algorithm based on this paradigm and an implementation that runs on middle-end graphics adapters. Other authors have used Monte Carlo simulations on GPUs for other topics, like quantum mechanics [11]. However, these algorithms are very specific for each application domain and cannot be compared between them. Section 2 describes the proposed method and Section 3 its implementation on the GPU. Section 4 evaluates performance and precision of the GPU implementation vs. the widely used LM algorithm given in [12].
2
Problem Description
Let Ik be the image of the frame k. Each image has a set of features associated to it given by a 2D feature tracker Yk = y k1 , . . . , y kn . Feature y ki has uki , vik coordinates. The 3D motion tracker calculates the camera motion for each frame as a rotation matrix and a translation vector xk = [Rk |tk ], where the set of all computed cameras up to frame m is Xm = {x1 , . . . , xm }. The problem consists of estimating a set of 3D points Z = {z 1 , . . . , z n } that satisfies: y ki = Π(Rk z i + tk )
∀i ≤ n, ∀k ≤ m
(1)
where Π is the pinhole projection function. The calibration matrix can be obviated if 2D points are represented in normalized coordinates. The proposed triangulation performs a global minimization over all available images using a probabilistical approach based on Monte Carlo simulations. This consists of sampling inputs randomly from the domain of the problem and weighting them using a goal function that depends on the measurement obtained from the system. As the number of samples increases, the triangulated point will be closer to the optimum. This strategy leads to very computationally intensive implementations that makes it unusable for real time operation. Unlike other minimization methods, each possible solution is computed independently from others, making it optimal for parallel architectures, like GPUs. In this work, a bottom-up SFM approach is used, but the proposed triangulation algorithm is valid for top-down approaches as well. In the first stage
the feature tracker finds new features and tracks them using optical flow and Kalman filtering [13]. Each feature has a SIFT descriptor [14] used to detect lost features. These features are used by a particle filtering based 3D motion tracker that uses the capabilities of the GPU as well [1]. Finally, all this information is used to reconstruct new 3D points. This last phase is the one covered in this work. The triangulation algorithm follows three steps for each point to be triangulated: initialization, random sampling and weighting. 2.1
Initialization
New points are initialized via linear triangulation. The results are usually far from the optimum, but it is a computationally cheap starting point. For each image k, a 2D feature point y ki can be described as y ki × (xk z i ) = 0, being xk the pose for the frame k and z i the coordinates of the wanted 3D point (points are expressed in homogeneous coordinates). Having two of these realations, an equation in the form Az i = 0 can be written and solved using DLT [15]. This stage is implemented in the CPU since it runs very fast, even when triangulating many points. 2.2
Sampling Points
o n (1) (q) is generated For each point z i , a set of random samples Si = z i , . . . , z i around its neighborhood using a uniform random walk around its initial position: (j)
zi
= f (z i , nj ) = z i + nj ,
nj ∼ U3 (−s, s)
(2)
where nj is a 3-dimensional uniform distribution having minimum in -s and maximum in s that is chosen to be proportional to the prior reprojection error of the point being sampled. In this way, the optimization behaves adaptively avoiding local minimums and handling well points far from the optimum. It is important to scale and translate points before doing the sampling so that √ the centroid is at the origin and the RMS distance of points to the origin is 3. If not, the parameter s should be conveniently scaled for each point. Moreover, floating point units work better with centered and scaled points. Another good choice for the sampling function would be a Gaussian distribution with mean z i and the uncertainty of the point as the standard deviation. However we will leave this as future work. 2.3
Evaluating Samples
All the set Si for every point z i is evaluated in this stage. The objective function is to minimize the L2 norm of the Equation 1 applied to every hyphotesis for every available frame. In other words, the objective is to solve: m r X (j) (j) (j) Π Rk z i + tk − y ki . (3) argmin wi where wi = j
k=1
The computational complexity of the problem is O(n ∗ m ∗ q), where m is the number of frames, n is the number of 3D points and q is the number of hypotheses per point. However, Equation 3 satisfies the independence needed in stream processing, since each hypothesis is independent from others.
3
GPU Implementation
General Purpose GPU (GPGPU) programming consists of using the programmable 3D graphics pipeline to perform general purpose computing tasks. The algorithm has been implemented using three shader programs written in GLSL 1.2 [16]. These programs run sequentially for each point to be triangulated. The first shader will generate all the hypotheses for a single point location, the second shader will compute the function Equation 3 and the third shader will choose the best hypothesis. Figure 1 shows an overview of the proposed method.
Fig. 1. Overview of the proposed algorithm
Since the GPU is a hardware designed to work with graphics, the data is loaded using image textures. A texture is a matrix of RGBA values, thus each element (s, t) of the matrix, called texel, can be up to a 4-vector of floating point values. The output is the image to be rendered, so the way to obtain the results from it is using a special texture called framebuffer and the render-to-texture mode. It is very important to choose good memory structures since the transfers between the main memory and the GPU memory are very slow. 3.1
Sampling Points
In this stage hypotheses are generated from the 3D point to be optimized using Equation 2. The data needed are the initial position of the 3D point and a texture with uniformly distributed random numbers. Since GLSL lacks random number generating functions, this texture is computed in the preprocessing stage and due to performance optimizations remains constant, converting this method into a pseudo-stochastic algorithm [1]. Therefore, the only datum transfered is the 3D point. The output is a texture containing the coordinates for all hypotheses in the RGB triplet. The alpha channel is used as the value of the objective function, so in this stage, it is initialized to 0. It is not necessary to download the framebuffer
with the generated hypotheses to the main memory, because they are only going to be used by the shader that evaluates the samples. The computational cost of this function is O(q), where q is the total hypotheses count. Currently available GPUs can execute more than 200 hypotheses simultaneously, so this stage runs very fast. Algorithm 1 summarizes the code executed by each stream j.
Algorithm 1 Point Sampling. Require: zˆi → Initial point location Require: randomT ex → Random numbers [s, t] = GetStreamCoordinates() [dx, dy, dz] = ReadFromTexture(randomT ex, [s, t]) (j) z i = zˆi + [dx, dy, dz]
3.2
Evaluating Samples
All generated hypotheses are weighted applying Equation 3. This shader runs once for each projection y ki using texture ping-pong [17]. This is a common resource in GPGPU that uses two textures as input and output. The texture that acts as input in a render pass, acts as output in the next pass of the shader, and the one that acts as output, acts as input. It is necessary to do it because a texture cannot act as an input and output at the same time due to hardware restrictions. These two textures are generated by the shader described in the previous section. The only data needed to be transferred are the camera pose and the projection of the 3D point for each frame. Another advantage of using one rendering pass per projection, is that in each pass only one point projection and one camera pose is loaded into the GPU, avoiding memory limitations in the algorithm. In this way, the only limitation in the number of frames to optimize is the computational power of the GPU, making this solution very scalable. The computational cost of this function is O(m ∗ q) where m is the number of frames to be optimized and q the number of hypotheses per point. Algorithm 2 shows the program executed by each stream. This shader program must be executed m times for each 3D point. Algorithm 2 Hypotheses Weighting. Require: inputT ex → Texture with sampled points Require: xk → Camera pose in frame k Require: y ki → Projection of z i in frame k [s, t] = GetStreamCoordinates() [x, y, z, weight] = ReadFromTexture(inputT ex, [s, t]) testP oint = Rk · [x, y, z, ]T + tk testP oint = testP oint/testP oint.z output.xyz = [x, y, z] output.a = distance(testP oint.xy, y ki )+inputT ex.a
3.3
Choosing the Best Hypothesis
When all the passes involved in section 3.2 are rendered, the output texture will contain all the hypotheses weighted. The best hypothesis can be searched in the CPU or directly in the GPU. Experimentally, we concluded that GPU search is the fastest way if the size of the texture is big enough. This search is performed in a parallel fashion using reduction techniques [17]. The ping-pong technique is used here as well, but in each pass the size of the output is reduced. Each output pixel is calculated in function of four texels of the input texture, until the output texture size is reduced to one. Figure 2 illustrates the reduction process in a 4 × 4 texture. This operation is done in log4 (q) passes. As seen in Figure 2 it is necessary to work with square textures.
Fig. 2. Minimum Reduction
4
Experimental Results
In order to validate the algorithm, we have ran tests on a real video recorded in 320×240 using a standard webcam. Figure 5 shows a virtual building augmenting the scene used in the tests. Results are compared with the implementation of the LM algorithm given by [12]. In our setup, GFT runs with a viewport of 256 × 256, reaching a total of 216 hypotheses per point. The maximum number of iterations allowed to the LM algorithm is 200. The precision is measured as the mean value of the objective function 3 after the triangulation of 25 points using 15 frames. Figure 3 (in logarithmic scale) shows that GFT gets on average 1.4 times better results than LM. There are no outliers in the data, therefore as seen in the figure, the algorithm depends on a good initial approximation of the point. The PC used for performance tests is an Intel C2D E8400 @ 3GHz with 4GB of RAM and a nVidia GeForce GTX 260 with 896MB of RAM. Following tests show the performance comparison between GFT and LM. CPU implementations of GFT are not given since they are very slow, even when using multicore processors. Figure 4(a) shows a test running with 10 frames incrementing the point count in each time step. In Figure 4(b), 15 points are optimized, incrementing the number of frames used in each time step. Figures are in logarithmic scales. These tests show that GFT runs approximately 30 times faster than LM, thus being capable to run at 30fps getting peaks of 100 times better performance when
Fig. 3. Residual error on real images
Fig. 4. Comparison with constant point count (a) and with constant frame count (b)
incrementing the number of tracked frames. However, one advantage of using LM is that it can finish the optimization when the parameters are near the optimal value, as can be seen in some abrupt downwards peaks. GFT could achieve this behaviour running the optimization in various annealing levels, finishing the annealing loop at the desired level of precision.
5
Conclusions
This work presents a method for 3D structure estimation that runs in real time on a GPU. The main contribution is a GPU based parameter estimation framework that can be used to solve many computer vision problems, not only 3D reconstruction. Since the method relies on Monte Carlo simulations, it is very robust to outliers and highly adaptable to different level of errors on the input. For validating it, the algorithm is compared against the popular LM algorithm. Tests on real data demonstrate that the GPU optimizer can achieve better results than LM in much less time. This gain of performance allows to increment the number of data used on the optimization, obtaining better precision without sacrificing the real time operation. Moreover, the GPU implementation leaves the CPU free of computational charge so it can dedicate its time to do other tasks. In addition, the tests have been done in a standard PC configuration using a webcam, making the method suitable for commodity desktop hardware.
Fig. 5. Augmented video sequence. Blue points are the reconstructed 3D points
References 1. Eskudero, I., S´ anchez, J., Buchart, C., Garc´ıa-Alonso, A., Borro, D.: Tracking 3d en gpu basado en el filtro de part´ıculas. In: Congreso Espa˜ nol de Inform´ atica Gr´ afica. (2009) 47–55 2. Welch, G., Bishop, G.: An introduction to the kalman filter. In: SIGGRAPH, Los Angeles, CA, USA (2001) 3. Qian, G., Chellappa, R.: Structure from motion using sequential monte carlo methods. International Journal of Computer Vision 59(1) (2004) 5–31 4. Levenberg, K.: A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics 2(2) (1944) 164–168 5. Triggs, B., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle adjustment a modern synthesis. Vision Algorithms: Theory an Practice (2000) 6. Kahl, F.: Multiple view geometry and the l infinity norm. In: International Conference on Computer Vision. Volume 2. (2005) 1002–1009 7. Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building rome in a day. In: International Conference on Computer Vision, Kyoto, Japan (2009) 8. Klein, G., Murray, D.W.: Parallel tracking and mapping for small ar workspaces. In: International Symposium on Mixed and Augmented Reality. (2007) 9. Metropolis, N., Ulam, S.: The monte carlo method. Journal of the Americal Statistical Association 44(247) (1949) 335–341 10. Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P., Owens, J.D.: Programmable stream processors. IEEE Computer (2003) 54–62 11. Anderson, A.G., III, W.A.G., Schrder, P.: Quantum monte carlo on graphical processing units. Computer Physics Communications 177(3) (2007) 298 – 306 12. Lourakis, M.: levmar: Levenberg-marquardt nonlinear least squares algorithms in C/C++. [web page] http://www.ics.forth.gr/~lourakis/levmar/ (Jul. 2004) 13. S´ anchez, J., Borro, D.: Non invasive 3d tracking for augmented video applications. In: IEEE Virtual Reality 2007 Conference, Workshop ”Trends and Issues in Tracking for Virtual Environments” (IEEE VR’07), Charlotte, USA. (2007) 22–27 14. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2(60) (2004) 91–110 15. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2000) 16. Kessenich, J.: The opengl shading language. Technical report, 3Dlabs (2006) 17. Pharr, M.: GPU Gems 2. Programing Techniques for High-Performance Graphics and General-Purpose Computing. (2005)