©2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Non Invasive 3D Tracking for Augmented Video Applications
Sánchez, J. and Borro, D.
CEIT and TECNUN (University of Navarra)
Manuel de Lardizábal 15, 20018 San Sebastián, Spain
ABSTRACT
In this work we present a tracking method that has proven accurate enough for creating augmented videos from the information obtained by a standard pinhole camera. It is based on the well-known concept of epipolar geometry. The proposed implementation performs the whole tracking and calibration process automatically and does not need any marker in the scene; it imposes only a few restrictions on the motion of the camera.
CR Categories: I.4.1 [Image Processing and Computer Vision]: Digitization and Image Capture --- Imaging geometry; I.4.8 [Image Processing and Computer Vision]: Scene Analysis --- Tracking.
Keywords: Augmented Reality, Feature Tracking, 3D Reconstruction.
1 INTRODUCTION
The aim of Augmented Reality is to add computer generated data to real images. This information ranges from explanatory text to three-dimensional objects that merge realistically with the scene. Depending on the amount of virtual objects added to the real scene, Milgram et al. [8] proposed the taxonomy shown in Figure 1.
Figure 1: Milgram taxonomy.
Mixed reality has proven to be very interesting in areas like industrial processes, environmental studies, surgery and entertainment. In order to insert synthetic data in a real scene, it is necessary to line up a virtual camera with the observer's viewpoint. Different options have been tried, such as magnetic or inertial trackers and other sensors. However, image based systems are becoming the most attractive solutions because of their lower cost and less invasive setup.
In this paper we present a complete method for easily authoring mixed reality videos using only image information. Our implementation can calibrate a pinhole camera, find a 3D reconstruction and estimate the camera's motion using only 2D features in the images. The only constraint imposed is that the camera must have constant intrinsic parameters.
2 STATE OF THE ART
Among image based tracking solutions there are systems with one or with multiple cameras, but single camera solutions have become more popular in recent years. For single camera configurations there are several algorithms to calculate the pose. They can be divided into model based, marker based and feature based techniques.
Model based methods calculate the camera transformation from the 2D projections of a known 3D model. A typical algorithm is POSIT [3]. This approach has the disadvantage that the known object must always be visible in the image to be tracked.
Marker based systems introduce into the scene some type of marker that the system can recognize. These methods are fast and accurate, but also very invasive. An example of marker based tracking is the ArToolkit library developed in HITLab at the University of Washington [6]. Many works have been done using this library. Another marker based example is ArTag [9].
Feature based algorithms have become more important in recent years. They do not need any markers in the scene or the presence of known objects, but they are less accurate than other methods and computationally more expensive. An example of previous work in this area is [2].
There is another group of systems that use hybrid solutions, combining a camera with another type of sensor, such as inertial or magnetic sensors. An example of this technique is ArQuake, which uses inertial sensors, GPS and pattern recognition to perform the tracking [11].
Products like Boujou (2D3D) or PFTrack (The Pixel Farm) can be used to make augmented video sequences, but they have the disadvantage of being expensive solutions.
3 OVERVIEW OF THE PROPOSED ALGORITHM
In this work we present a complete algorithm that can augment a video sequence without any prior knowledge of the recorded scene. The proposed method includes a 2D feature tracker, which finds and tracks features along the video, and a 3D tracker, which calculates the camera pose in every frame. The overall algorithm can be seen in Figure 2.
Figure 2: Algorithm overview.
Email: {jrsanchez, dborro}@ceit.es
Initially, the feature tracker finds corners with high eigenvalues in the first frame of the video sequence and tracks them using optical flow techniques. Using the matched features, a first approximation of the epipolar geometry can be found and outliers can be detected. The remaining inliers allow refining the epipolar geometry, finding the camera's focal length and obtaining a preliminary 3D reconstruction of the scene. Finally, the 3D motion can be recovered from 3D-2D matches.
The whole process is automatic, except for the selection of the coordinate system where the virtual object is going to be registered, which is performed by the user. The next sections describe each part of the process in detail.
4 FEATURE TRACKER
This is the first step and probably the most important one. This module looks for feature points in the first frame and tracks them along the whole video sequence. Its precision is critical for the geometry estimation performed in the next steps. The tracker analyzes the video frame by frame, with a performance suitable for real time on a standard desktop PC. It needs to be initialized with the first frame of the video, which is searched for features suitable for the tracking process, typically points corresponding to corners.
4.1 Feature detector
The algorithm used to find the features is based on the GoodFeaturesToTrack method proposed by Shi and Tomasi [10]. It calculates the minimal eigenvalue of the covariance matrix of the image derivatives for every pixel. Equation 1 shows that matrix, where I(x,y) is the intensity of the pixel (x,y) and the sums run over a neighbourhood of the pixel. Points with high minimal eigenvalues are considered valid features.
\[
M = \begin{bmatrix}
\sum \left( \frac{\partial I}{\partial x} \right)^{2} & \sum \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} \\
\sum \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} & \sum \left( \frac{\partial I}{\partial y} \right)^{2}
\end{bmatrix} \tag{1}
\]
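As a minimal sketch of the corner measure described above, the following Python code (using NumPy) computes the smaller eigenvalue of the 2x2 derivative matrix at every pixel. The function names, the central-difference gradients and the square window of radius r are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np


def box_sum(values, r):
    """Sum of `values` over a (2r+1) x (2r+1) window around each pixel (zero padded)."""
    h, w = values.shape
    padded = np.pad(values, r, mode="constant")
    out = np.zeros((h, w), dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out


def min_eigenvalue_map(image, r=1):
    """Smaller eigenvalue of the windowed derivative matrix at every pixel."""
    img = np.asarray(image, dtype=float)
    iy, ix = np.gradient(img)            # derivatives along rows (y) and columns (x)
    sxx = box_sum(ix * ix, r)
    syy = box_sum(iy * iy, r)
    sxy = box_sum(ix * iy, r)
    # Closed form for the smaller eigenvalue of [[sxx, sxy], [sxy, syy]].
    return 0.5 * (sxx + syy - np.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))


# Assumed toy input: a bright square whose top-left corner lies at pixel (8, 8).
image = np.zeros((20, 20))
image[8:, 8:] = 1.0
score = min_eigenvalue_map(image, r=1)
corner = np.unravel_index(np.argmax(score), score.shape)
```

On this toy image the score vanishes in flat regions and along straight edges and peaks at the corner, which is the behaviour the detector relies on. Pixels whose score exceeds the chosen threshold are the candidate features.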
The threshold used to decide whether a pixel corresponds to a feature is chosen according to the number of features detected in the image: the smaller this number, the lower the threshold is set. The corner detection runs on the first frame, but it should be carried out again whenever the number of locked features decreases because of occlusions or tracking errors.
4.2 Optical flow tracker
Once feature points are detected in a frame, the tracking algorithm creates a history with their positions in the following frames. Later, this information is used by the 3D tracker to estimate the geometry of the scene. The method used is an iterative version of the Lucas-Kanade optical flow method proposed by Jean-Yves Bouguet [1]. This algorithm calculates the displacement of a feature between two frames. The optical flow vector is defined as the one that minimizes the residual function:
\[
\varepsilon(d) = \sum_{x = u_x - \omega_x}^{u_x + \omega_x} \; \sum_{y = u_y - \omega_y}^{u_y + \omega_y} \left( I(x, y) - J(x + d_x, y + d_y) \right)^{2} \tag{2}
\]
where d = (dx, dy) is the displacement vector, (ux, uy) is the position of the feature, I(x, y) and J(x, y) are the intensities of the pixel (x, y) in the two frames and (ωx, ωy) is the chosen search window. In order to obtain accuracy and robustness the algorithm is executed iteratively on pyramidal reductions of the original image, as shown in Figure 3. The low resolution levels of the pyramid (L2) provide robustness when handling large motions, and the high resolution levels (L0) provide local tracking accuracy.
Figure 3: Pyramidal reduction (levels L0, L1 and L2).
However, this method is very sensitive to noise and to the possible outliers detected by the corner detector. In order to detect these outliers, a Kalman filter is attached to each feature [5], [14]. Using the Kalman filter, unexpected displacements of a feature can be detected. This allows the detection of outliers that could degrade the reconstruction of the scene. The Kalman filter provides a set of equations to model a discrete time linear system. It is capable of estimating past, present and future states of a system described by the following equation:
\[
x_{k+1} = A x_k + w_k \tag{3}
\]
In Equation 3, xk is the state vector at instant k, A is the transition matrix of the model and wk is a random variable representing the process noise. The state modeled by this implementation of the Kalman filter includes the position and the linear velocity of the feature. This state is represented by the following vector:
\[
x_k = \begin{pmatrix} u_x & u_y & v_x & v_y \end{pmatrix}^{T} \tag{4}
\]
where (ux, uy) is the position of the feature and (vx, vy) its velocity vector. The position of the feature in the next frame should be the sum of the current position and the velocity, so the transition matrix A of the filter is modeled by
\[
A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{5}
\]
In addition, there is a relationship between the measurements and the process state given by:
\[
z_k = H x_k + v_k \tag{6}
\]
In Equation 6, vk is a random variable that models the measurement noise, H is the measurement extraction matrix and zk is the measurement vector. In our case the measurement is the position of the feature, so:
\[
H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \tag{7}
\]
If the prediction of the Kalman filter is far away from the position calculated by the optical flow algorithm, the feature is considered an outlier and is removed from the tracking process. Otherwise, the state of the filter is corrected using the measured position. The distance threshold is parameterized with the filter's prior error covariance. The complete scheme of the 2D feature tracking process is shown in Figure 4.
The result of combining these techniques is a robust and fast tracker with enough precision to calculate a good 3D reconstruction in the next stage of the whole algorithm (Figure 2).
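The outlier test of this section can be sketched as a constant-velocity Kalman filter with a gate on the prediction residual. This is only an illustration in Python with NumPy: the class and function names, the noise variances and the gate of three standard deviations are assumptions, and the distance is normalized here by the innovation covariance, which is built from the prior error covariance.

```python
import numpy as np


class FeatureKalman:
    """Constant-velocity Kalman filter for one 2D feature (state: ux, uy, vx, vy)."""

    def __init__(self, position, process_var=1e-2, measurement_var=1.0):
        self.x = np.array([position[0], position[1], 0.0, 0.0], dtype=float)
        self.P = 10.0 * np.eye(4)                      # assumed initial uncertainty
        self.A = np.array([[1.0, 0.0, 1.0, 0.0],
                           [0.0, 1.0, 0.0, 1.0],
                           [0.0, 0.0, 1.0, 0.0],
                           [0.0, 0.0, 0.0, 1.0]])      # position += velocity
        self.H = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]])      # only the position is measured
        self.Q = process_var * np.eye(4)               # assumed process noise
        self.R = measurement_var * np.eye(2)           # assumed measurement noise

    def predict(self):
        """Propagate the state one frame ahead and return the predicted position."""
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.H @ self.x

    def is_outlier(self, z, gate=3.0):
        """Compare a measurement with the prediction (call after predict)."""
        residual = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        return float(residual @ np.linalg.solve(S, residual)) > gate ** 2

    def correct(self, z):
        """Update the state with an accepted measurement."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P


def update_feature(kf, measured_position, gate=3.0):
    """One tracking step: returns False if the optical flow result is rejected."""
    kf.predict()
    if kf.is_outlier(measured_position, gate):
        return False
    kf.correct(measured_position)
    return True
```

In this sketch update_feature would be called once per frame with the position returned by the optical flow; a feature for which it returns False would be dropped from the tracking process.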
[Flow diagram with the steps: first frame; feature search; found enough features (yes/no); initialize Kalman filters; new frame; calculate optical flow; predict the new state with Kalman; predicted state looks like the measure (yes/no); delete outliers; correct filters; enough inliers (yes/no).]
Figure 4: 2D Tracker flow diagram.
5 3D TRACKER
This module solves the camera geometry and obtains a 3D reconstruction of the captured scene using the tracked features. A minimum of eight points is needed to solve the geometry, but the result improves when more points are considered. All the processes involved in this module are based on the concept of epipolar geometry [4], so the first step is to calculate the fundamental matrix for every pair of consecutive frames. Using this initial approximation, outliers are detected and removed. The remaining inliers are used to refine the fundamental matrix. After this, the camera's intrinsic parameters can be found and an initial 3D frame can be set. Finally, the camera pose can be calculated. For the fundamental matrix and 3D reconstruction estimation, Philip Torr's Matlab toolkit has been used [12].
5.1 Outlier detection
Although the feature tracker can detect outliers, only very strong inliers should be used for geometry estimation due to numerical stability problems. The best way to detect the remaining outliers is to check the epipolar constraint using an initial guess of the fundamental matrix. This matrix is calculated using the RANSAC algorithm [13].
\[
x'^{T} F x = 0 \tag{8}
\]
In Equation 8, F corresponds to the initial fundamental matrix and x and x' to the coordinates of one feature in two different frames. Ideally, the left-hand side of this equation is zero, but in practice this is not always true. If the result is far from zero, the feature is considered an outlier. Using the remaining features a final fundamental matrix is calculated.
5.2 Camera calibration
For camera calibration a standard pinhole model is assumed. Some constraints are imposed in order to simplify the model, such as a principal point (px, py) centered in the image and no skew or distortion. With these assumptions, the calibration matrix K looks like this, where f is the focal length:
\[
K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix} \tag{9}
\]
The algorithm used for calibration is very simple and works quite well. It is a simplification of the method proposed by Mendonça and Cipolla [7] and is based on the properties of the essential matrix. The essential matrix is the fundamental matrix for a calibrated camera. This means that if the calibration matrix K is known, it can be defined as follows:
\[
E = K^{T} F K \tag{10}
\]
An important property of this matrix is that it has two equal, non-zero singular values. The proposed algorithm therefore searches for a matrix K that satisfies this property using minimization techniques. Because of the simplifications introduced in the camera model, the calibration matrix depends only on the focal length, i.e. there is only one variable in the minimization problem. This method gives a focal length for every pair of consecutive frames. However, the accuracy of each estimate depends on the precision of the estimated fundamental matrix. To reduce this sensitivity, the final focal length is the arithmetic mean of all the coherent estimated focal lengths.
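A possible sketch of this one-variable search, in Python with NumPy, is given next. It assumes a known principal point and replaces the minimization technique with a plain two-stage grid search over f. The cost function (normalized difference of the two largest singular values), the search range and all names are assumptions chosen for illustration, and the synthetic_fundamental helper only builds noise-free test data.

```python
import numpy as np


def calibration_matrix(f, principal_point):
    px, py = principal_point
    return np.array([[f, 0.0, px], [0.0, f, py], [0.0, 0.0, 1.0]])


def singular_value_gap(F, f, principal_point):
    """Normalized difference of the two largest singular values of K^T F K."""
    K = calibration_matrix(f, principal_point)
    s = np.linalg.svd(K.T @ F @ K, compute_uv=False)
    return (s[0] - s[1]) / (s[0] + s[1])


def estimate_focal_length(F, principal_point, f_min=200.0, f_max=3000.0):
    """Grid search (1 pixel steps, then 0.001 pixel steps) for the best focal length."""
    coarse = np.arange(f_min, f_max + 1.0, 1.0)
    costs = [singular_value_gap(F, f, principal_point) for f in coarse]
    f0 = coarse[int(np.argmin(costs))]
    fine = np.linspace(f0 - 1.0, f0 + 1.0, 2001)
    costs = [singular_value_gap(F, f, principal_point) for f in fine]
    return float(fine[int(np.argmin(costs))])


def skew(t):
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])


def rotation(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues formula)."""
    a = np.asarray(axis, dtype=float)
    S = skew(a / np.linalg.norm(a))
    return np.eye(3) + np.sin(angle) * S + (1.0 - np.cos(angle)) * (S @ S)


def synthetic_fundamental(f, principal_point, R, t):
    """Noise-free fundamental matrix of two views sharing the same calibration."""
    K_inv = np.linalg.inv(calibration_matrix(f, principal_point))
    return K_inv.T @ skew(t) @ R @ K_inv


pp = (320.0, 240.0)
F = synthetic_fundamental(800.0, pp, rotation([0.3, 1.0, 0.2], 0.25),
                          np.array([1.0, 0.2, 0.3]))
f_est = estimate_focal_length(F, pp)
```

With real data one such estimate would be obtained for each pair of frames and the coherent ones averaged, as described above. Motions close to a pure translation make the cost nearly flat in f, so they would give unreliable estimates in this sketch.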
5.3 Scene reconstruction
Once the camera is calibrated, the essential matrix can be calculated from the fundamental matrix using Equation 10. From the essential matrix, the pair of camera matrices (P, P') can be calculated as follows:
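As a hedged reconstruction, assuming the standard decomposition of the essential matrix given in [4] and cameras expressed in normalized coordinates, the camera pair can be written as:

```latex
\[
P = [\, I \mid 0 \,] \tag{11}
\]
\[
P' = [\, U W V^{T} \mid u_3 \,] \tag{12}
\]
```

Here I is the 3x3 identity matrix, and the equation numbers are chosen to match the references made in the text.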
The origin of the coordinate system is placed at the camera of the first frame. In Equation 12, U and V are the orthogonal matrices obtained from the SVD of E, u3 is the third column of U, and W is the following matrix:
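Assuming the convention of [4], in which the essential matrix factors as E = U diag(1, 1, 0) V^T, the matrix W would be the following orthogonal matrix:

```latex
\[
W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{13}
\]
```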
There is an ambiguity in the transformation corresponding to the second camera. There are three other valid cameras, corresponding to the reversed translation and to a rotation of 180° about the line joining the two camera centers [4], as can be seen in Figure 5.
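Following the same standard result from [4], and taken here as an assumption about the notation, the four candidates for the second camera can be listed explicitly. They combine the two possible rotations with the two signs of the translation:

```latex
\[
P' \in \left\{
[\, U W V^{T} \mid +u_3 \,],\;
[\, U W V^{T} \mid -u_3 \,],\;
[\, U W^{T} V^{T} \mid +u_3 \,],\;
[\, U W^{T} V^{T} \mid -u_3 \,]
\right\}
\]
```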
Figure 5: The four possible cameras.
The correct solution is the one that reconstructs all the features in front of both cameras, so choosing one feature and reconstructing it for each of the four possible pairs is sufficient to decide which one is correct. This transformation is used only to calculate the 3D reconstruction.
The reconstruction is performed by a linear triangulation method. Knowing the projections of a feature in both cameras and the transformation relating them, the three-dimensional point can be triangulated by solving the following system:
\[
\begin{bmatrix}
u P_3 - P_1 \\
v P_3 - P_2 \\
u' P'_3 - P'_1 \\
v' P'_3 - P'_2
\end{bmatrix} X = 0 \tag{14}
\]
where (u, v) and (u', v') are the coordinates of the 2D feature in the two images, X the coordinates of the reconstructed 3D point and Pi the ith row of the camera matrix.
This system is determined up to a scale factor, and the solution is the unit singular vector corresponding to the smallest singular value of the coefficient matrix. Once the correct pair of cameras is chosen, the whole scene can be reconstructed using the same triangulation algorithm.
For every pair of frames there exists a possible reconstruction, but only one of these reconstructions is needed in order to calculate the camera displacement. Any pair of frames can be chosen for this initial reconstruction, with one caveat: if the two selected frames are very close to each other, the reconstruction obtained is very poor because the problem becomes ill conditioned [2]. Figure 6 shows that the closer the frames, the greater the uncertainty of the reconstruction.
Figure 6: Uncertainty of reconstruction.
This uncertainty depends on the angle between the two rays. The uncertainty region grows as the rays become parallel, with forward or backward motion being the worst possible situation. Figure 7 shows an example of a 3D reconstruction. The left image is the original video with the matched features and the right image shows the 3D points.
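The linear triangulation and the choice among the four candidate cameras can be sketched as follows in Python with NumPy. This is an illustration under stated assumptions rather than the code of the Matlab toolkit: image points are taken to be in normalized coordinates, the candidates follow the standard decomposition in [4], and all function names are hypothetical.

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])


def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one feature seen at x1 = (u, v) and x2 = (u', v')."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # singular vector of the smallest singular value
    return X / X[3]            # assumes the point is not at infinity


def candidate_cameras(E):
    """The four possible second cameras for an essential matrix E."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:   # keep proper rotations (E is defined up to sign)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    u3 = U[:, 2]
    return [np.hstack([R, t.reshape(3, 1)])
            for R in (U @ W @ Vt, U @ W.T @ Vt)
            for t in (u3, -u3)]


def in_front(P, X):
    """True if the point X (with X[3] = 1) has positive depth for P = [R | t]."""
    return (P @ X)[2] > 0


def select_camera(E, x1, x2):
    """Pick the candidate that reconstructs the feature in front of both cameras."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    for P2 in candidate_cameras(E):
        X = triangulate(P1, P2, x1, x2)
        if in_front(P1, X) and in_front(P2, X):
            return P2
    return None
```

Because the translation is recovered only up to scale, the selected camera has a unit-length translation and the reconstructed points are scaled accordingly.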
Figure 7: 3D reconstruction of the scene.
5.4 Camera motion
Once the 3D structure is recovered, the camera motion can be estimated frame by frame. This is achieved by matching the reconstructed 3D points with their corresponding feature points in each frame. For every feature xi in frame n, the following equation describes the relationship between it and its 3D reconstruction Xi:
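One hedged way to write this relation, assuming the pinhole model of Section 5.2 and introducing Rn and tn (names not used in the text) for the rotation and translation of the camera in frame n, is:

```latex
\[
x_i \simeq K \, [\, R_n \mid t_n \,] \, X_i \tag{15}
\]
```

The symbol \simeq denotes equality up to a non-zero scale factor, since both sides are homogeneous coordinates.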
If we normalize the feature points to the canonical camera we can express the equation as follows:
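Assuming that normalizing means multiplying the image points by the inverse of the calibration matrix, and writing P = [Rn | tn] for the resulting canonical camera (Rn and tn are illustrative names), the relation would become:

```latex
\[
\hat{x}_i = K^{-1} x_i , \qquad \hat{x}_i \simeq P X_i \tag{16}
\]
```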
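Note that we are working with homogeneous coordinates, so x̂i and PXi may differ by a non-zero scale factor, i.e. they have the same direction but may differ in magnitude. This can be expressed in terms of the vector cross product as:
\[
\hat{x}_i \times (P X_i) = 0 \tag{17}
\]
with x̂i = (xi, yi, wi). This gives a set of three equations, but only two of them are linearly independent, so the system can be rewritten as follows, where Pj is the jth row of P:
\[
\begin{bmatrix}
0^{T} & -w_i X_i^{T} & y_i X_i^{T} \\
w_i X_i^{T} & 0^{T} & -x_i X_i^{T}
\end{bmatrix}
\begin{pmatrix} P_1^{T} \\ P_2^{T} \\ P_3^{T} \end{pmatrix} = 0 \tag{18}
\]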
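A minimal sketch of how such a homogeneous system can be assembled and solved is given below in Python with NumPy. Each 3D-2D match contributes two rows to a coefficient matrix acting on the twelve entries of P, and the solution is taken from the SVD. The function name, the input layout and the final rescaling (which fixes the free scale by forcing the determinant of the left 3x3 block to one) are assumptions of this illustration.

```python
import numpy as np


def estimate_camera(points_3d, points_2d):
    """Estimate the 3x4 camera matrix from n >= 6 matches.

    points_3d: (n, 4) homogeneous 3D points X_i.
    points_2d: (n, 3) homogeneous normalized image points (x_i, y_i, w_i).
    """
    X = np.asarray(points_3d, dtype=float)
    x = np.asarray(points_2d, dtype=float)
    if len(X) < 6:
        raise ValueError("at least six 3D-2D matches are needed")
    zero = np.zeros(4)
    rows = []
    for Xi, (xi, yi, wi) in zip(X, x):
        # Two independent equations of the cross product per match.
        rows.append(np.concatenate([zero, -wi * Xi, yi * Xi]))
        rows.append(np.concatenate([wi * Xi, zero, -xi * Xi]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    P = Vt[-1].reshape(3, 4)   # singular vector of the smallest singular value
    # Remove the arbitrary scale and sign: make det of the left 3x3 block equal 1.
    return P / np.cbrt(np.linalg.det(P[:, :3]))
```

With noise-free matches the left 3x3 block of the result is the rotation and the last column the translation. With noisy data that block is only approximately orthogonal and would have to be projected onto a rotation matrix.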
A minimum of six points is needed in order to solve for the rotation and the translation. However, it is very common to have hundreds of matched 3D-2D features. One option would be to choose six of them to solve the system, but the result degrades if the chosen pairs are noisy. The best option is to take all the matches into account and take as the solution the unit singular vector corresponding to the smallest singular value of the coefficient matrix.
5.5 Rendering virtual objects
When all camera transformations are known, only a reference coordinate system is needed to render an object. The origin can be set at any of the reconstructed features, and the user can then move the object manually to its initial position. Figure 8 shows the final result of augmenting a scene with two towers using the proposed algorithm.
Figure 8: Augmented scene.
There are some problems that have not been considered yet, like occlusion or lighting. Real objects sometimes cover virtual objects or cast shadows on them. This degrades the quality of the resulting video and will be addressed in the future.
6 EXPERIMENTAL RESULTS
This section evaluates the performance and precision of the algorithms used. First, the feature tracker is evaluated using synthetic images; second, the camera tracker is evaluated by measuring the back-projection error. The PC used in all the benchmarks has a Pentium 4 family 3.2GHz CPU with 1GB of RAM.
6.1 Testing the feature tracker
For testing the precision of the feature tracker we have created an application that generates synthetic images with known borders and additive noise. The first test executes the feature tracker on a frame and compares the obtained coordinates with the real ones of the existing corners in the image. The tests have been made using different resolutions and noise levels. For each situation a total of ten tests have been run, and the results can be seen in Table 1.
Table 1: Error in the feature search.
           320x240                         640x480
σnoise     μerror   σerror   Outliers      μerror   σerror   Outliers
0          0        0        0%            0        0        0%
10         0        0        0%            0        0        0%
20         0        0        0%            0.002    0.09     0%
30         0.005    0.113    0%            0.014    0.201    0%
40         0.014    0.2      0%            0.031    0.297    0%
50         0.032    0.301    0%            0.037    0.322    0.201%
60         0.044    0.363    3.498%        0.05     0.367
70         0.049    0.406                  0.084    0.473
The three outlier percentages left empty above (640x480 at σnoise 60, and both resolutions at σnoise 70) take the values 26 %, 0.9 % and 1 %, but their assignment to those cells cannot be recovered from the source.
In the table above σnoise is the standard deviation of the Gaussian noise added to the image, and μerror and σerror are the mean and the standard deviation of the measured error in pixels.
The next test evaluates the efficiency of the Kalman filter in detecting outliers. For this test very noisy images are generated. The next graphs show the evolution of the outlier detection along the video sequence.
[Four plots of the number of outliers per frame (frames 1 to 25), using and without using the Kalman filter, for noise standard deviations of 70, 90, 110 and 130.]
Figure 9: Evolution of the outlier detection.
As the results show, the Kalman filter can detect practically all the outliers within four or five frames in very noisy situations. The optical flow is capable of detecting outliers as well, but the results are very poor for this application.
The optical flow calculation process also introduces errors in the feature position. This error has been measured in a moving scene with the following results:
[Two plots of the mean error in pixels per frame (frames 1 to 71), using and without using the Kalman filter: one without noise and one with a noise standard deviation of 50.]
Figure 10: Error in the tracking process.
As can be seen in the graphs, although an error is introduced by the optical flow, it is small enough to be neglected. This fact, combined with the efficiency reached in outlier detection, gives a reliable feature tracker for the 3D reconstruction and camera pose estimation process.
The time for the whole feature tracking algorithm is insignificant compared with the camera solving process. For example, a video of 340 frames with a resolution of 704x576 needs approximately 5 seconds to search and track 300 features.
6.2 Testing the 3D tracker
The strategy used to test the accuracy of the camera pose estimation algorithm consists of comparing the position of the features in the image with the corresponding projection of the reconstructed 3D point. In the tests the 3D reconstructions are calculated every five frames. This avoids the uncertainty problem shown in Figure 6. The next graph shows the mean of the error measured along 100 frames.
[Plot of the projection error in pixels per frame (frames 1 to 96); the vertical axis runs from 0 to 0.5 pixels.]
Figure 11: Projection error.
The time needed to perform the 3D tracking process is approximately one second per frame. This is very far from the maximum of 40 ms per frame needed to run the process in real time, mainly because it is implemented in Matlab.
7 CONCLUSIONS
This work covers all the processes involved in an augmented video application. The method does not need any previous knowledge of the augmented scene or user interaction except in the registration step.
The advantage of this type of system is that any video can be augmented while imposing only a few restrictions on it (fixed intrinsic parameters). Additionally, any user without experience can augment videos in an easy way because the whole process is automatic.
In the first part of the work, a 2D feature tracker has been developed. This tracker has proven to be accurate enough for many applications, like 3D reconstruction or camera pose estimation, and it can work in real time on a standard PC. This makes the tracker suitable for surveillance, human computer interaction or any application that needs real time response.
Secondly, the designed 3D tracker can add 3D objects to real video sequences. It depends heavily on the accuracy of the feature tracker, but the tests demonstrated that the result is satisfactory under normal conditions.
On the other hand, the prototype currently works under Matlab, so the time needed to run the calibration and reconstruction processes is very high. Thus, an immediate objective is to translate the code into a more efficient language, like C++. However, the proposed algorithm is not suitable for running in real time because of the outlier search process and the key frame reconstruction based algorithm.
REFERENCES
[1] J.-Y. Bouguet, "Pyramidal Implementation of the Lucas Kanade Feature Tracker", Technical Report, Intel Corporation, 2000.
[2] K. Cornelis, "From uncalibrated video to augmented reality", 2004.
[3] D. DeMenthon and L. Davis, "Model-Based Object Pose in 25 lines of code", International Journal of Computer Vision, pp. 123-141, 1995.
[4] R. Hartley and A. Zisserman, Multiple View Geometry in computer vision, Cambridge University Press, 2000.
[5] R. Kalman, "A New Approach to Linear Filtering and Prediction Problems", Journal of Basic Engineering, pp. 35-45, 1960.
[6] H. Kato and M. Billinghurst, "Marker Tracking and HMD Calibration for a video-based Augmented Reality Conferencing System", International Workshop on Augmented Reality (IWAR), San Francisco, USA, pp. 85-94, 1999.
[7] P. Mendonça and R. Cipolla, "A simple technique for self-calibration", IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, pp. 112-116, 1999.
[8] P. Milgram, H. Takemura, A. Utsumi and F. Kishino, "Augmented Reality: A Class of Displays of the Reality-Virtuality Continuum", SPIE Conference on Telemanipulator and Telepresence Technologies, Boston, USA, pp. 282-292, 1994.
[9] H. Park and J. Park, "Invisible Marker Tracking for AR", Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2004), 2004.
[10] J. Shi and C. Tomasi, "Good Features To Track", IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Washington, pp. 593-600, 1994.
[11] B. Thomas, B. Close, J. Donoghe, J. Squires, P. De Bondi, M. Morris and W. Piekarski, "ARQuake: An Outdoor/Indoor Augmented Reality First Person Application", 4th International Symposium on Wearable Computers, Atlanta, pp. 139-146, 2000.
[12] P. Torr, "A Structure and Motion Toolkit in Matlab", Technical Report, Microsoft Research, Cambridge, UK, 2002.
[13] P. Torr and D. W. Murray, "Outlier detection and motion segmentation", Sensor fusion VI, Boston, pp. 432-443, 1993.
[14] G. Welch and G. Bishop, "An Introduction to the Kalman Filter", SIGGRAPH (Computer Graphics), Los Angeles, CA, USA, 2001.