1. Introduction
Recent years have seen attempts to apply AR to the digital reconstruction of archaeological sites. In these systems, AR techniques are used to overlay computer-generated reconstructions of the original architecture onto the real sites, providing visitors with a more realistic and informative way to experience the ancient ruins. In contrast to indoor AR applications, outdoor AR faces several challenges, including uncontrollable factors such as lighting conditions and moving people, and the lack of references in the environment. Since the use of AR in such systems is promising yet challenging, several breakthroughs have been made in the past few years. An early example is GUIDE [1], which provides location-related


978-1-4244-3701-6/09/$25.00 ©2009 IEEE

information such as text and images to the user. A similar system, Smart Sight [2], is an intelligent tourist assistant; a microphone and headphones serve as its audio input and output. An important project, ARCHEOGUIDE [3], is a personalized AR tour-guide system that uses mobile computing, networking and 3D visualization. Similar applications are found in other projects, such as the ICTs [4], the LIFEPLUS Project [5], the Eternal Egypt Project [6] and the Digital Augmented System [7]. These systems rely on prior knowledge, fiducial markers or hybrid tracking modalities such as GPS and inertial sensors to reduce the computational burden and increase robustness when estimating the pose of a viewer. However, their applicability is limited: the markers require maintenance and are intrusive to place in the scene, and if a marker is partially occluded or outside the field of view, the virtual contents cannot be augmented. Moreover, tracking modalities such as GPS cannot satisfy the accuracy requirement of AR, which only optical technologies can offer. In this paper, we extend the method of [8] from indoor to outdoor environments and propose a marker-free AR tracking method based only on an ordinary camera and multiple keyframes, motivated by the rich texture and wide visiting range of Da Shui Fa, one of the famous scenes of Yuanmingyuan Park. Our algorithm also rests on the reasonable assumption that the scene can be modeled as a plane, because its depth is far less than its distance from the visitors. In our method, feature matching is treated as a classification problem, and the working range is extended by using multiple keyframes. The appropriate keyframe is selected online by comparing the normalized Euclidean distances between the current image and all keyframes. The presented method is validated on several image

sequences of both outdoor and indoor environments. The results show that the method is robust to lighting changes and partial occlusion. In the remainder of this paper, Section 2 presents the proposed tracking algorithm in detail, Section 3 describes the system built for the experiments together with the results, and Section 4 gives conclusions and future work.

2. Markerless tracking

2.1. Overview
The whole algorithm comprises an offline training stage and an online tracking stage. In order to extend the method of [8] to an outdoor environment, we choose more than one (usually 4 or 6) reference image during the offline stage. These images are called keyframes; they are manually calibrated to estimate the projection matrices from them to the scene. Feature points extracted from these frames are then trained using the randomized tree algorithm. We refer to the keyframes, their pose parameters and the feature information gained from training as the keyframe database. At each time step of the online stage, an appropriate keyframe is selected whose viewpoint is as close as possible to that of the current frame. Since the camera position of the current frame is not yet known, we select the keyframe closest to the previous frame. Then, from the feature correspondences established by classification against the feature information in the database, the camera transformation between the current frame and the selected keyframe is computed, and finally the current pose of the camera is estimated.

2.2. Keyframe database construction
The process of constructing the keyframe database is shown in Figure 1. First, some viewpoints around the scene are chosen uniformly so as to cover the range of the touring route in front of the scene. From every viewpoint we capture one image to be used as the keyframe. Since few useful feature points can sometimes be extracted from a single image due to image blur, we actually capture five images per viewpoint to form an image set Si = {Fij | i = 1, …, N; j = 1, …, 5}, where Fij denotes image j at viewpoint i and N is the number of viewpoints. Then, for each Si (i = 1, 2, 3, 4), a best image, denoted Fibest and used as the keyframe, is found by an evaluation mechanism in which the clearest image, i.e. the one with the maximum number of feature points, is the best one. In our system, four such viewpoints are enough to present the whole scene from different positions, so n = 4 (see A-D in Figure 1), and the set of keyframes used in our system is denoted Fbest = {Fibest | i = 1, 2, 3, 4}. During a training phase, we then construct for each keyframe a feature database Di of appearances of its feature points as they would be seen under many different views, which is detailed in the following parts.
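As a concrete illustration of the evaluation mechanism, the sketch below picks the best of the captures at a viewpoint by counting interest points. The Harris-style `feature_count` is only a hypothetical stand-in for the detector of [8]; function names and the threshold are our own illustrative choices.

```python
import numpy as np

def feature_count(img, thresh=1000.0):
    """Count Harris-like corner responses above a threshold.

    A stand-in for the paper's feature detector; any interest-point
    detector with a comparable score would serve the same purpose.
    """
    gy, gx = np.gradient(img.astype(float))  # image gradients

    def box(a):
        # crude 3x3 box filter for smoothing the structure tensor
        p = np.pad(a, 1, mode="edge")
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    sxx, syy, sxy = box(gx * gx), box(gy * gy), box(gx * gy)
    # Harris response R = det(M) - k * trace(M)^2, with k = 0.04
    r = (sxx * syy - sxy ** 2) - 0.04 * (sxx + syy) ** 2
    return int(np.count_nonzero(r > thresh))

def best_keyframe(candidates):
    """Return the candidate image with the most detected features,
    used as a proxy for the clearest (least blurred) shot."""
    return max(candidates, key=feature_count)
```

A blurred capture has weaker gradients, hence fewer corner responses, so it loses to a sharp capture of the same view.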

Figure 1. A flowchart of keyframe database construction

Commonly used feature matching methods, such as cross-correlation of image blocks and K-D trees, perform poorly at wide-baseline feature matching. This kind of matching can instead be naturally formulated as a pattern classification problem, which depends on an offline training process. During the training stage, the following two procedures are carried out for each keyframe of Fbest.

Sample sets for training: We use the detector in [8] to extract feature points from the keyframe. Each feature point is regarded as a feature class. A surrounding patch (32 × 32 pixels) centered on the point is used to generate sample sets that include all the possible views of this training patch. These views are synthesized by applying the following affine transformation to the patch:

$n = m_0 + R_\theta R_\phi^{-1} S R_\phi \,(m - m_0) + t$    (1)

where $m_0$ are the coordinates of the keypoint detected in the training patch, $n$ are the new coordinates of the transformed point $m$, and $R_\theta$, $R_\phi$ are two rotation matrices:

$R_i = \begin{pmatrix} \cos i & -\sin i & 0 \\ \sin i & \cos i & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad i \in \{\theta, \phi\}$    (2)

θ, φ ∈ [−π, π], λ1, λ2 ∈ [0.5, 1.5], S = diag(λ1, λ2, 1) is a scaling matrix, and t = (tu, tv)T is a 2D translation vector [9].

Different views of each class are created by sampling the space of (θ, φ, λ1, λ2, tu, tv) regularly. As shown in Figure 2, the new samples present the same class seen from different viewpoints and at different scales.
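The view-synthesis step of Eq. (1) can be sketched as follows. The parameter ranges match those given above; the nearest-neighbour inverse warp and the function names are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def rot2(a):
    """2x2 in-plane rotation matrix."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

def random_affine(rng):
    """Sample A = R_theta . R_phi^{-1} . S . R_phi as in Eq. (1),
    with theta, phi in [-pi, pi] and lambda1, lambda2 in [0.5, 1.5]."""
    theta, phi = rng.uniform(-np.pi, np.pi, size=2)
    l1, l2 = rng.uniform(0.5, 1.5, size=2)
    S = np.diag([l1, l2])
    return rot2(theta) @ rot2(phi).T @ S @ rot2(phi)

def synthesize_view(patch, A, t):
    """Warp a square patch about its center: n = m0 + A (m - m0) + t.
    Nearest-neighbour inverse mapping; pixels mapped from outside
    the source patch are left as zero."""
    h, w = patch.shape
    m0 = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
    out = np.zeros_like(patch)
    Ainv = np.linalg.inv(A)
    for i in range(h):
        for j in range(w):
            # invert n = m0 + A (m - m0) + t to find the source pixel
            src = m0 + Ainv @ (np.array([i, j]) - m0 - t)
            si, sj = int(np.rint(src[0])), int(np.rint(src[1]))
            if 0 <= si < h and 0 <= sj < w:
                out[i, j] = patch[si, sj]
    return out
```

Repeating this with many sampled (A, t) pairs produces the synthetic views that make up one class's training set.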

Figure 2. First patch: a training patch centered on a feature point (red cross) from the real scene; others: new views synthesized by affine transformations.

Training: The randomized tree, a specific variant of the decision tree, is applied here to perform supervised learning on the samples. As depicted in Figure 3, each non-terminal node of a randomized tree contains a simple test that decides to which of its two sub-nodes a sample should go. In our experiments, we use the binary test: if I(m) − I(n) ≤ A, go to sub-node 1; if I(m) − I(n) > A, go to sub-node 2. Here I(i) is the intensity at pixel location i of a patch, and m and n are two pixel locations in the neighborhood of the feature point on the patch. m and n are chosen randomly when a patch first reaches a non-terminal node and are fixed for that node thereafter; A is a pre-determined threshold. All samples are pushed down the tree, each being tested at the nodes it falls in until it reaches the bottom of the tree, i.e. the leaf nodes. In every leaf node, the posterior probability of each class is calculated as the ratio of samples belonging to that class among all samples reaching the node (see Figure 3); it increases accordingly whenever a sample of that class reaches the node. We grow multiple trees T = {Tl | l = 1, …, L} to increase the recognition rate, where L is the number of trees: even if a single tree has a low recognition rate, its combination with the other trees can yield an effective classifier. At this point, a feature database Di based on Ti = {Til | l = 1, …, L} has been created for each keyframe Fibest (i = 1, 2, 3, 4). Finally, the keyframe database is constructed, which includes the set Fbest, the associated pose parameters and the set of feature databases D = {Di | i = 1, 2, 3, 4}. At running time, for a given image of the scene, an appropriate keyframe Fapp (Fapp ∈ Fbest) is selected first according to the rule in Section 2.3.

Then the camera can be registered by matching the feature points present in the image against the feature database Dapp (Dapp ∈ D) associated with Fapp, as detailed in Section 2.4.

Figure 3. Randomized tree (left) and the posterior distribution in a leaf node (right)

2.3. Online selection of keyframe
At the beginning of each time step t of the online stage, the keyframe with the viewpoint closest to that of the previous image is selected. As in [6], many works perform this selection by computing the image similarity between each keyframe and the previous image; the keyframe with the highest similarity is taken to have the closest viewpoint. However, computing image similarity involves every pixel, which increases the amount of computation and hurts the performance of the system. In this paper, we instead select Fapp from Fbest by the following criterion:

$app = \arg\min_i Ed_i$    (3)

where

$Ed_i = \frac{\sum_{j=1}^{3}\,(Pc_j - Pk_{ij})^2}{\sum_{j=1}^{3}\, Pc_j\,Pk_{ij}} + \frac{\sum_{j=4}^{7}\,(Pc_j - Pk_{ij})^2}{\sum_{j=4}^{7}\, Pc_j\,Pk_{ij}}$    (4)

is the normalized Euclidean distance between keyframe i and the previous image. Pki = (pkij)T (i = 1, …, 4; j = 1, …, 7) and Pct-1 = (pct-1,j)T (j = 1, …, 7) are the pose vectors of keyframe i and the previous image respectively. Pk and Pc both take the form P = (x, y, z, q1, q2, q3, q4)T, the pose vector of the camera, where (x, y, z)T is the translation vector and (q1, q2, q3, q4)T is the normalized quaternion associated with the rotation matrix. With the denominators acting as scale normalization, the difference in scale between the position and rotation parameters is no longer an issue. Moreover, the pose vector P is obtained directly from pose estimation, which makes this method cheaper than computing image similarity.
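The keyframe selection of Eqs. (3)-(4) reduces to a few lines. The 7-D pose-vector layout (x, y, z, q1, …, q4) follows the text; the function names are ours.

```python
import numpy as np

def normalized_distance(pc, pk):
    """Eq. (4): distance between two 7-D pose vectors (x, y, z, q1..q4),
    with the translation and quaternion parts normalized separately so
    that their different scales do not mix."""
    pc, pk = np.asarray(pc, float), np.asarray(pk, float)
    t = np.sum((pc[:3] - pk[:3]) ** 2) / np.sum(pc[:3] * pk[:3])
    q = np.sum((pc[3:] - pk[3:]) ** 2) / np.sum(pc[3:] * pk[3:])
    return t + q

def select_keyframe(prev_pose, keyframe_poses):
    """Eq. (3): index of the keyframe whose pose vector is closest
    to the pose of the previous frame."""
    d = [normalized_distance(prev_pose, pk) for pk in keyframe_poses]
    return int(np.argmin(d))
```

Since only four 7-D vectors are compared per frame, this is far cheaper than any per-pixel image similarity.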

2.4. Online feature matching and pose estimation
After Fapp is found at each time step, each patch centered at a feature point extracted from the input image is dropped down the trees in Dapp. The feature point is classified according to the following criterion:

$Y(r) = \arg\max_c \frac{1}{L} \sum_{l=1}^{L} d_{l,c}(r)$, provided that $\max_c \bar{d}_c(r) > TH$    (5)

Here dl,c(r) denotes the posterior probability of class c in the leaf node of tree l reached by patch r, d̄c(r) is the average of dl,c(r) over the L trees, and TH is a given threshold. The criterion assigns the feature of patch r to the class with the maximal average posterior probability, and only if that average is larger than TH.

After the set of matched feature points S = {(mci, mki) | i = 1, …, n} between the input image and Fapp is established, the relative pose between them can be calculated; under the planar-scene hypothesis it is represented by a homography Hck:

$s\, m_c = H_{ck}\, m_k$    (6)

where mci and mki are matched feature points in the input image and Fapp respectively, and s is a scale factor. In general, four pairs of feature correspondences yield a solution for Hck; a more precise Hck is obtained from more correspondences, with outliers removed by Random Sample Consensus (RANSAC) and the estimate refined by Levenberg-Marquardt (LM) optimization. The projection matrix Hcw between the real scene and the input frame is then obtained as

$H_{cw} = H_{ck} H_{kw}$    (7)

where Hkw denotes the projection matrix between the real scene and Fapp. We adopt the pinhole camera model and, without loss of generality, assume the scene plane coincides with the Zw = 0 plane of the world coordinate system. A feature point (xt, yt, 1)T in an image is then related to a point (Xw, Yw, Zw, 1)T in the scene by

$(x_t, y_t, 1)^T = K\,(r_1 \;\; r_2 \;\; r_3 \mid T)\,(X_w, Y_w, Z_w, 1)^T$    (8)

$\phantom{(x_t, y_t, 1)^T} = K\,(r_1 \;\; r_2 \mid T)\,(X_w, Y_w, 1)^T$    (9)

so that

$H_{cw} = K\,(r_1 \;\; r_2 \mid T)$ and $K^{-1} H_{cw} = (r_1 \;\; r_2 \mid T)$    (10)

K denotes the intrinsic matrix, (r1, r2) are the first two columns of the rotation matrix Rw, and T is the translation vector of the camera pose for the input image. According to (10), (r1, r2) are recovered as the first two columns of K⁻¹Hcw. From the orthonormality of Rw, r3 is given by

$r_3 = \frac{r_1 \times r_2}{\lVert r_1 \times r_2 \rVert}$    (11)

Finally, the camera pose of the input image can be estimated as

$(R \mid T) = \left( r_1 \;\; r_2 \;\; \frac{r_1 \times r_2}{\lVert r_1 \times r_2 \rVert} \;\Big|\; T \right)$
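The pose recovery of Eqs. (8)-(11) can be sketched as below, assuming a known intrinsic matrix K and a scene-to-image homography H_cw. Fixing the scale by forcing ||r1|| = 1 is a common convention the paper does not spell out, and with noisy data r1, r2 would additionally need re-orthogonalization; the function name is ours.

```python
import numpy as np

def pose_from_homography(H_cw, K):
    """Recover R and T from the scene-to-image homography, Eqs. (8)-(11).

    The columns of inv(K) @ H_cw give (r1, r2, T) up to a common scale;
    the scale is fixed by forcing r1 to unit norm, and the normalized
    cross product r3 = r1 x r2 completes the rotation matrix.
    """
    M = np.linalg.inv(K) @ H_cw
    s = np.linalg.norm(M[:, 0])                 # scale from ||r1|| = 1
    r1, r2, T = M[:, 0] / s, M[:, 1] / s, M[:, 2] / s
    r3 = np.cross(r1, r2)
    r3 /= np.linalg.norm(r3)                    # Eq. (11)
    R = np.column_stack([r1, r2, r3])
    return R, T
```

In practice the sign of s would also be chosen so that the recovered T places the scene in front of the camera.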

2.5. Jitter elimination by EKF
Keyframe-based methods generally suffer from jitter of the virtual projection. To reduce its influence in our system, an Extended Kalman Filter (EKF) is used to smooth the estimated pose parameters. Let the state vector of the camera at time step t be xt = (pt, vt, qt, ωt)T, where pt = (px, py, pz)T is the translation vector, vt = (vx, vy, vz)T the velocity, qt = (q1, q2, q3, q4)T a quaternion uniquely associated with the rotation matrix, and ωt = (ω1, ω2, ω3)T the angular velocity. The quaternion is used to avoid value jumps in the pose estimate when angles are close to 0 or 2π. We assume that the linear and angular velocities remain constant within a short time step; their small changes are represented by the noise vector nov = (nω, nv)T, drawn from a zero-mean Gaussian distribution with covariance matrix Σ = diag(Σω, Σv). The update of the state vector is

$x_{t+1} = \begin{pmatrix} p_{t+1} \\ v_{t+1} \\ q_{t+1} \\ \omega_{t+1} \end{pmatrix} = \begin{pmatrix} p_t + v_t \Delta T \\ v_t + n_{v,t+1} \\ q_t \times \begin{pmatrix} \cos(0.5\,\lVert\omega_{t+1}\rVert\,\Delta T) \\ \sin(0.5\,\lVert\omega_{t+1}\rVert\,\Delta T)\,\dfrac{\omega_{t+1}}{\lVert\omega_{t+1}\rVert} \end{pmatrix} \\ \omega_t + n_{\omega,t+1} \end{pmatrix}$    (12)

and the observation equation is specified by

$z_{t+1} = H_{t+1}\, x_{t+1} + w_{t+1}$

where zt+1 = (pt+1, rt+1)T, pt+1 is the translation vector, rt+1 = (αt+1, βt+1, γt+1)T the rotation of the camera in the form of Euler angles, Ht+1 the observation model, and wt+1 the observation noise, also assumed to be drawn from a zero-mean Gaussian distribution with covariance matrix Rt+1.
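The prediction step of Eq. (12), with the noise terms dropped, can be sketched as follows. The quaternion component order (w, x, y, z) and the function names are our assumptions.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def predict(p, v, q, w, dt):
    """Constant-velocity prediction of Eq. (12), noise terms omitted:
    position integrates velocity, and the quaternion is advanced by the
    axis-angle rotation accumulated over dt at angular velocity w."""
    p_new = p + v * dt
    ang = np.linalg.norm(w) * dt
    if ang > 1e-12:
        dq = np.concatenate([[np.cos(0.5 * ang)],
                             np.sin(0.5 * ang) * w / np.linalg.norm(w)])
    else:
        dq = np.array([1.0, 0.0, 0.0, 0.0])  # no rotation this step
    q_new = quat_mul(q, dq)
    return p_new, v, q_new / np.linalg.norm(q_new), w
```

The full EKF would additionally propagate the state covariance through the Jacobian of this function and correct the prediction with the measured pose.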

3. Experiments and results
We refitted the z800 video see-through HMD from eMagin Corporation by installing an ordinary USB camera (a PHILIPS SPC 900NC) on top of the device (Figure 4). The whole algorithm ran on 320 × 240 video and was implemented in C++ on the VS2005 platform, on a 3.0 GHz PC running Windows XP.

Figure 4. Refitted HMD device (a) and experimental process in the real scene (b)

3.1. Online selection of keyframe
Table 1 shows the normalized Euclidean distance between the four keyframes Fibest (i = 1, 2, 3, 4) and two input images im (m = 1, 2). Figure 5 shows the four keyframes, the two input images and the augmented results after keyframe selection according to Table 1.

Table 1. Normalized Euclidean distance between input images and keyframes

      F1best   F2best   F3best   F4best
i1    0.0056   0.0168   0.1980   0.0003
i2    0.1994   0.1225   0.0312   0.0010

Figure 5. Online selection of keyframe: (a) images of Fibest, im and the augmented results of im; (b) spatial distribution of the camera poses for Fibest and im

3.2. Outdoor tracking and augmented results
In the outdoor experiment we used 4 keyframes, which covered the range of 0-15 m from left to right along the touring route in front of Da Shui Fa (see Figure 6 for the augmented results). Tracking under partial occlusion and change of scale in the outdoor environment was also tested (Figure 7). Let Ra be the largest range over which the algorithm tracks well: as the number of keyframes increases, Ra is enlarged, at the cost of a slight loss of real-time performance (Table 2).

Figure 6. Augmented results

Figure 7. Tracking performance under partial occlusion (first two) and change of scale (last two)

Table 2. Frame rate and tracking range for different numbers of keyframes

number of keyframes    1    2    4
Ra (m)                 4    8    15
frame rate (fps)       21   21   20

3.3. Jitter elimination
We recorded the estimates of p and q twice on the same video, once with the EKF and once without (Figure 8-a,b). The initial values of the state-vector parameters were all set to zero. The results indicate that the filter converges within 6 frames and that most of the jitter is eliminated.

Figure 8. (a) Jitter elimination of p by EKF; (b) jitter elimination of q by EKF; (c) estimates of the translation vector p for ARToolKit and the proposed method under two condition changes: sudden illumination changes occur at frames 417 and 478, and a partial occlusion appears at frame 541 and disappears at frame 567

3.4. Indoor experiments
We also tested our system in an indoor environment, comparing the proposed method with the ARToolKit technique under two condition changes: illumination and occlusion (Figure 9 and Figure 8-c). The proposed method is clearly more robust: it tracks the camera stably and keeps augmenting the teapot onto a book cover before and after the condition changes, while ARToolKit loses tracking and cannot augment the teapot onto its marker after the changes.

4. Conclusion and future work
We presented a fairly robust and drift-free AR reconstruction system for the Yuanmingyuan heritage site that combines natural-feature matching with multiple keyframes to handle a wide range of camera displacements at an update rate suitable for real-time application. The experiments illustrate the applicability of the system in general scenes of both outdoor and indoor environments. Although our algorithm still has limitations, which we plan to address in the future, it is promising work and represents a step towards solving the reconstruction of archaeological sites with vision-based, markerless methods.

Acknowledgement
This work is supported by the National Natural Science Foundation of China (60673198, 60827003), the Hi-Tech Research and Development Program of China (2007AA01Z325, 2008AA01Z303) and the Innovation Team Development Program of the Chinese Ministry of Education (IRT0606).

Figure 9. Performance comparison between ARToolKit (top row) and the proposed algorithm (bottom row) under illumination change (left four) and occlusion (right four)

References
[1] K. Cheverst, N. Davies, K. Mitchell, A. Friday and C. Efstratiou, "Developing a context-aware electronic tourist guide: Some issues and experiences", in Proceedings of Computer-Human Interaction 2000, IEEE Computer Society, Netherlands, April 2000, pp. 17-24.
[2] J. Yang, W. Yang and M. Debecke, "Smart Sight: A tourist assistant system", in Proceedings of the 3rd International Symposium on Wearable Computers, IEEE Computer Society, San Francisco, CA, USA, 1999, pp. 73-78.
[3] ARCHEOGUIDE Project: http://archeoguide.intranet.gr/
[4] R. Owen, D. Buhalis and D. Pletinckx, "Visitors' Evaluations of ICTs Used in Cultural Heritage", in Proceedings of the 6th International Symposium on Virtual Reality, Archaeology and Cultural Heritage, M. Mudge, N. Ryan and R. Scopigno, Eds., Eurographics, 2005, pp. 129-136.
[5] LIFEPLUS Project: http://lifeplus.miralab.unige.ch/
[6] Eternal Egypt Project: www.eternalegypt.org
[7] G. Papagiannakis, S. Schertenleib, B. O'Kennedy et al., "Mixing virtual and real scenes in the site of ancient Pompeii", Computer Animation and Virtual Worlds, J. Wiley and Sons, 2005, vol. 16, pp. 11-24.
[8] V. Lepetit and P. Fua, "Towards Recognizing Feature Points using Classification Trees", Technical Report, EPFL, 2004.
[9] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.