MANUSCRIPT TO IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY


3D Head Tracking via Invariant Keypoint Learning

Haibo Wang, Franck Davoine, Vincent Lepetit, Christophe Chaillou, and Chunhong Pan

H. Wang and C. Pan are with the Sino-French LIAMA lab at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. E-mail: {hbwang1427,chunhongp}@gmail.com. F. Davoine is with the LIAMA and the French CNRS. E-mail: [email protected]. V. Lepetit is with the CVLAB at École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. E-mail: [email protected]. C. Chaillou is with the LIFL & INRIA Lille lab at the University of Science and Technology of Lille, France. E-mail: [email protected].

Abstract—Keypoint matching is a standard tool to solve the correspondence problem in vision applications. However, in 3D face tracking, this approach is often deficient because the complexity of the human face, together with the rich viewpoint, non-rigid expression, and lighting variations of typical applications, produces variations that existing keypoint detectors and descriptors cannot handle. In this paper, we propose a new approach that tailors keypoint matching to tracking the 3D pose of a user's head in a video stream. The core idea is to learn keypoints that are explicitly invariant to these challenging transformations. First, we select keypoints that are stable under randomly drawn small viewpoint changes, non-rigid deformations and illumination changes. Then, we learn discriminative descriptors at different large view angles with an incremental scheme. At matching time, to reduce the ratio of outlier correspondences, we use second-order color information to prune keypoints unlikely to lie on the face. Moreover, we integrate optical flow correspondences in an adaptive way to remove motion jitter efficiently. Extensive experiments show that the proposed approach leads to fast, robust, and accurate 3D head tracking, even in very challenging scenarios.

Fig. 1: Sample tracking results from a challenging sequence. The proposed approach reliably tracks the face in spite of the challenges of facial expressions, large scale transformations, face non-planarity, partial occlusions, lighting variations, large rotations and a cluttered background.

I. INTRODUCTION

Keypoint matching is a popular tool to solve the correspondence problem in many computer vision applications, ranging from stereo reconstruction to object recognition and pose tracking. Local descriptors, such as SIFT [24], build on the fine scale and rotation estimates provided by keypoint detectors. They are usually distinctive enough to allow for global matching without additional regularity constraints. In particular, keypoint matching has been used to track the 3D pose of human heads [41], [22], [35], [19]. The basic scheme is to match detected keypoints against a database of reference keypoints to establish 3D-to-2D correspondences, and then solve for the 3D pose that minimizes the reprojection errors in the image plane.

While this scheme is efficient for other problems, so far it has not been completely successful for head tracking. Among many reasons, the following are probably the most significant:

(i) The complex shape of the human face causes complex geometric and photometric distortions across different viewpoints, facial expressions and lighting conditions. Since keypoint detectors are sensitive to these distortions, the number of correct keypoint matches is often not large enough to recover the 3D pose: the best keypoint techniques can only be made invariant to in-plane rotation, scale change, and affine transformations [24], [27].

(ii) Because the human face is often less textured than a cluttered background, most keypoints lie on the background, so many outlier correspondences arise and make the pose estimation fail.

(iii) It is difficult to obtain an accurate estimate of the 3D locations of the keypoints, and this biases the pose estimation.

With these drawbacks, keypoint-based face tracking usually does not work very well. The goal of this paper is to address these three issues. To this end, we propose a new keypoint matching framework with three major contributions.

The first contribution is to learn reference keypoints robust to the specific distortions of human face appearance. We first rely on a keypoint detector to estimate in-plane transformations and normalize them to extract invariant descriptors. Then, with an accurate 3D head model, we simulate out-of-plane transformations and complex lighting changes to learn the most robust keypoints and descriptors. Combining simulation and normalization is not entirely new: Affine-SIFT simulates all possible affine transformations to gain affine robustness [55], and appears to be more affine-invariant than its normalization-based counterparts. It has also been shown in [36] that if all the simulations are shifted to a learning stage, recognizing keypoints under affine distortions becomes more efficient. Our learning approach can be viewed as an extension of [36] that reaches maximum invariance to the typical distortions of a human face, and we therefore compare with it in our experiments. As a result, our head tracking system is as stable and robust as many learning-based systems, yet requires fewer training samples.


The second contribution is to apply local color information to prune keypoints that are unlikely to be on a face before keypoint matching. In particular, we rely on the covariance matrix of color pixels to describe the image patch centered at each keypoint [43], and represent it as a compact descriptor built from a concatenation of sigma points [15]. The distinctiveness of this color descriptor enables us to prune background keypoints before matching them, which greatly reduces the ratio of outlier correspondences and speeds up the final pose estimation.

The third contribution is to simultaneously recover the 3D pose and remove jitter with optical flow. We add optical flow correspondences to the same formulation as keypoint correspondences to achieve simultaneous optimization. We adapt the influence of the optical flow correspondences so that we can remove the jitter that is typically visible for small motions, without attenuating true large displacements. Our approach contrasts with recursive trackers that estimate pose based on motion smoothness prediction: such trackers drift when the prediction is inaccurate, whereas our approach does not, since we do not predict the motion. Our approach also significantly differs from [44], although we are inspired by it. In [44], Vacchetti et al. used keypoints tracked over the sequence to remove jitter; however, this requires approximating the 3D positions of the tracked keypoints and needs keypoints to be richly detected. By contrast, optical flow is simpler to compute, and the 3D positions of the points it reaches are readily available.

In the remainder of the paper, we first discuss related work in Section II. Our invariant keypoint and descriptor learning methods are presented in Section III. Pruning unlikely keypoints is described in Section IV and jitter removal with optical flow in Section V. The tracking algorithm is summarized in Section VI. Extensive experiments are reported in Section VII and conclusions are drawn in Section VIII.

II. RELATED WORK

A. Monocular 3D Head Pose Tracking

Monocular 3D head pose tracking has been an active research topic for more than two decades. Existing approaches can be divided into three categories.

The first category relies on global features. Zhang et al. [57] first interpreted image flow in terms of the motion parameters of a 3D rigid model. Brand and Bhotika [5], DeCarlo and Metaxas [11] and Xiao et al. [54] extended this idea in different directions. In particular, as reported in [11], treating optical flow as a hard constraint produces impressive tracking accuracy. However, measuring image flow often becomes unstable in the presence of noise. La Cascia et al. [6] relied on a textured cylinder model, which appears to be more robust than using the motion flow. Wu and Toyama [52] used the more reliable Gabor wavelet responses of the texture intensities of a 3D model. Morency et al. [29] employed adaptive face texture as the tracking template. Following the trend of deformable models, the powerful Active Appearance Model (AAM) was extended to face tracking with the fast Inverse Compositional (IC) alignment algorithm [26].


To handle 3D pose with AAM, Cootes et al. [10] designed the Multi-view AAM by learning shapes and appearances at different angles. Xiao et al. [53] proposed an alternative that imposes 3D model instances as constraints to encourage physically plausible AAMs. However, no matter how it is rectified, the manipulation of AAMs remains in 2D space. The 3D Morphable Model (3DMM) [3] was developed to overcome this limitation. It stems from world-captured face data and naturally handles 3D poses. The authors showed that 3D head poses can be reliably tracked with a 3DMM, even if the allowed variations are mechanically defined [12]. While these appearance models are robust, they tend to suffer from frequent re-initialization, occlusions and motion drift.

Other pose priors for head tracking have also been used. If a large amount of training data is available, training neural networks can produce satisfying tracking results [38], [48]. With a small number of training samples, the nearest-neighbor classifier becomes an appropriate choice [34]. More recently, researchers have tended to use learning for dimensionality reduction. Morency et al. [28], for example, proposed to learn linear pose subspaces via PCA. More generally, other reduction techniques, such as LDA, KPCA and KDA, have also been widely tested [7], [23], [51]. Being aware of the nonlinear nature of head poses, Raytchev et al. [39] proposed to turn to nonlinear manifold embedding methods, for example ISOMAP. Despite its impressive theoretical results, this learning strategy is rarely used in practice. In general, the learned model generalizes well on testing data similar to the training data, but performs much worse on unobserved data. In addition, it is costly to update the learned model with new training samples.

The second category of face tracking methods relies on local features. Most early works in this category are built upon local semantic features. Wang and Sung [49] localized the eye lines and the mouth line and inferred head poses by detecting the vanishing points of these lines. Ji and Hu [20] detected the two pupil centers to perform ellipse fitting on the face and deduced head orientations from the geometric distortions of the ellipse. Horprasert et al. [16] tracked four eye corners and the nose tip, and estimated pose using the invariance of cross-ratios of the eye corners and anthropometric statistics. Similarly, Hu et al. [17] detected five facial components but relied on face symmetry to deduce pose. The weaknesses of these methods are that using semantic features requires much training data and that detecting semantic features is prone to failure. Instead, recent efforts detect local interest points and pose head tracking as a Perspective-n-Point problem [41], [22], [35], [19]. Among them, impressive tracking results were reported in [22], [50], which combined keyframe selection with local keypoint tracking for pose estimation. As an extension, Wang et al. [56] further fused the semantic features of a 3DMM and local interest features to gain the stability of the former and the flexibility of the latter. Since it relies on statistical models, many training samples and plenty of manual annotations are required.

The approaches in the third category fuse multiple features. Early methods fused visual cues in a heuristic way.


For instance, Sung et al. [42] simply employed the outcome of cylinder flow-based tracking to initialize an AAM. Similarly, Malciu and Préteux [25] combined 3D-2D appearance matching and optical flow to track the 3D head pose. Murphy and Trivedi [32] developed a head tracking system that fuses multiple cues for driver assistance. On the other hand, there are also many fusion systems that follow a clear principle. Nikolaidis and Pitas [33], for example, integrated curve detection, template matching, active contour models and geometric information into a unified Bayesian framework. Sherrah et al. [40] proposed an alternative probabilistic framework to combine multiple cues and multiple tracking modules. Huang and Trivedi [18] coupled head detection and pose tracking in a principled fusion method. In this paper, we combine distinctive keypoints, local color and optical flow in a very different way: keypoints are the main component of our algorithm, and local color and optical flow help to overcome their limitations.

B. Keypoint Detection and Description

Keypoint detection can be designed to be invariant to various in-plane transformations. For example, one can achieve scale invariance by Laplacian scale selection and affine invariance by second-moment normalization [27]. Among the various descriptor methods, the SIFT descriptor [24] outperforms the others in terms of matching rate. However, it is slow for real-time applications [22]. As a simplified version of SIFT, SURF [1] is more efficient to compute but appears less powerful. Alternatively, Ozuysal et al. [36] proposed a learning-based method to skip the expensive steps of descriptor extraction and matching. This approach can quickly classify each patch by training a classifier on a set of random binary features. Since the target is assumed to be planar, this approach is not really suitable for human face tracking; besides, the feature classifier performs worse than descriptor methods. In this paper, we extend this learning strategy to the 3D non-planar domain and combine it with a descriptor method to greatly improve the keypoint recognition rate. By combining the power of a keypoint detector with a learning phase, this work attempts to reach maximum invariance to the specific distortions of human face appearance. As this paper shows, this strategy selects the most reliable keypoints and retains useful descriptors from different points of view.

III. LEARNING INVARIANT KEYPOINTS

This section describes our scheme to learn keypoints invariant to the specific distortions of human face appearance. In short, we synthesize a face under different points of view and deformations using a 3D head model, and detect a set of keypoints in each synthesized image. This allows us to find keypoints whose locations are most robust and whose descriptors are most distinctive.

Standard keypoint methods can be made robust to most in-plane geometric and linear photometric distortions by leveraging normalization and simulation. However, the complexity of the human face often introduces out-of-plane transformations and nonlinear lighting changes, which makes it difficult to extract the transform parameters needed for normalization.


TABLE I: NOTATIONS OF BASIC ITEMS

k : keypoint
k̃ : reference keypoint
f(k) : invariant descriptor of keypoint k
V : 3D location of k
u : image location of k
K : 3 × 3 camera intrinsic matrix
R : 3 × 3 rotation matrix
T : 3 × 1 translation vector
I_i : i-th training image
H : keypoint classifier
ŷ(f(k)) : keypoint matching response, ŷ = H(f(k))
y(k) : label of the closest 3D position, y = Backproject(k)
c(k) : keypoint color descriptor
g(c(k)) : binary color prior of keypoint k, g ∈ {0, 1}
Ṽ : 3D location of an optical flow correspondence
ũ : image location of an optical flow correspondence
∇I : image gradient operator, e.g. ∇I_x, ∇I_y and ∇I_t
x(k) : matching indicator of keypoint k, x ∈ {0, 1}
r(k_1, k_2) : similarity measure between keypoints k_1 and k_2

On the other hand, it has been shown that a learning scheme can improve the keypoint recognition rate [36]. Specifically, learning approaches do not rely on the extraction of local transformations but on the ability to generalize from training data. Therefore, we merge a learning phase with keypoint techniques to benefit both from the invariance keypoints already offer and from the ability to learn invariance to the remaining distortions. In this way, we expect to reach maximum invariance. In our algorithm, the distortions are divided into small and large out-of-plane categories. Our approach focuses on learning the most promising keypoints under the first category and multiple discriminative descriptors under the second. The following sections describe these two aspects in detail.

A. Distortion Synthesis

Suppose we already have the 3D head model of a given user. Its surface is made of a set of triangular facets, on which we have defined a set of semantic vertices depicting the mouth, nose, eyes and eyebrows. Unlike keypoints, these semantic points are only used for simulating non-rigid shape variations. With this model, we can synthesize the nonlinear distortions not supported by standard keypoint detectors. These distortions are mainly caused by head rotations, non-rigid facial expressions and illumination changes. We therefore randomly sample these transformations to synthesize various distortions and select the most robust keypoints. Fig. 2(a) illustrates that using these nonlinear syntheses is necessary to select keypoints, since standard linear affine+illumination changes cannot represent them sufficiently.

Pose. Due to the complexity of the human face, even small rotations can generate local occlusions and non-affine warping of the face appearance. Since keypoint detectors can normalize the roll angle, which corresponds to in-plane rotations, we parameterize the distortions only with respect to the yaw and pitch angles,

M = R(\mathrm{yaw}) \cdot R(\mathrm{pitch})    (1)

where the changes of yaw and pitch are limited to a reasonable range so that it remains possible to find invariant keypoints.
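For illustration, the sketch below shows one way the random pose distortions of Eq. (1) could be drawn before rendering; the sampled angle bound and the axis conventions (yaw about the vertical axis, pitch about the horizontal axis) are assumptions for this sketch, not values taken from the paper.

```python
import numpy as np

def rot_x(a):
    """Rotation about the horizontal axis (pitch), angle in radians."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    """Rotation about the vertical axis (yaw), angle in radians."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def sample_pose_distortion(max_deg=25.0, rng=None):
    """Draw a random out-of-plane distortion M = R(yaw) . R(pitch), Eq. (1).
    max_deg bounds the sampled angles (assumed value)."""
    rng = np.random.default_rng() if rng is None else rng
    yaw, pitch = np.deg2rad(rng.uniform(-max_deg, max_deg, size=2))
    return rot_y(yaw) @ rot_x(pitch)
```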



[Fig. 2(a): plot of the percentage of stable keypoints versus the degree of synthesized angle (5° to 25°), with curves Affine/3DPose, Lillum/Nillum and A+Li/3D+Ni.]

Fig. 2: (a) In this experiment, we detect keypoints under linear variations and evaluate the percentage of keypoints that remain stable under nonlinear variations. In particular, we compared the influence of affine transformations against (nonlinear) 3D transformations (Affine/3DPose), linear illumination against nonlinear illumination (Lillum/Nillum), and linear affine+illumination against nonlinear 3D pose+illumination (A+Li/3D+Ni). The low stability ratios reveal that linear variations are not a good approximation of the nonlinear ones on the human face. (b) Simulation examples. Top: reference image + two synthesized shapes. Bottom: randomly synthesized illuminations, shown in gray level for better visualization.


Fig. 3: Illustration of the influence of partial occlusion on the keypoint descriptor. Black arrow: original gradient; red arrow: occluded edge gradient; green arrow: occluded gradient. Due to the complexity of the human face, even a small rotation brings occlusion to the local neighborhood of a keypoint, which in turn changes its gradient orientation-based descriptor.

Shape. We model non-rigid facial expressions as a piecewise affine transform as in active appearance models [26], and we apply a large number of random piecewise affine transforms to select the most stable keypoints. Let us first denote by s_1, ..., s_h the 2D projections of the pre-defined 3D semantic vertices. For each point, the random affine transformation A is parameterized as

A = R(\theta) R(-\phi) \Sigma(\lambda_1, \lambda_2) R(\phi)    (2)

where R(θ) is a rotation with respect to the axis normal to the image plane, and R(φ) with respect to the optical axis [55]. Σ(λ_1, λ_2) is a diagonal matrix with diagonal components λ_1 and λ_2. By simultaneously applying affine transforms to s_1, ..., s_h, the new location u of a keypoint is

u = \sum_{k=1}^{3} \sum_{i=1}^{h} \alpha_k \, w_i(u_k) \, A_i s_i    (3)

where w_i(u_k) = \|u_k - s_i\|_2^{-1}, u_k, k = 1, 2, 3 are the three vertices of the triangular facet enclosing u, and α_1, α_2, α_3 are the barycentric coordinates of this keypoint. Fig. 2(b) shows two examples.

Light. Nonlinear photometric transformations, for example shadows, often arise when illumination changes. When the face pose varies, the shadows change and influence keypoint locations. Modeling shadows is difficult, so we simulate them by rendering lighting changes instead. We create two spot lights s_1, s_2 to simulate realistic illumination changes

L = L(s_1, s_2)    (4)

where L represents the distortions caused by s_1, s_2. We randomly change the positions and magnitudes of the two light sources so that various lighting distortions can be generated. In the simulations, we assume the reflectance of the face is specular. See Fig. 2(b) for examples.

B. Detecting Most Robust Keypoints

Let k denote a keypoint and u its 2D location in image coordinates. Ideally, if k is invariant to the above specific distortions of the human face, it must satisfy

u = u(k; M, A, L)    (5)

where u is the location of k and u(k; M, A, L) is its new location under a given (M, A, L) transformation. In practice, due to detection errors, the keypoint location is often shifted slightly from its true position. Therefore, we state that a keypoint k is detectable under (M, A, L) if its location satisfies

\| u - u(k; M, A, L) \| < \eta    (6)

where η is the largest tolerated localization error, as used in [22]. By running a sufficient number of samplings over the (M, A, L) spaces, we can count how many times a keypoint is detected, and we keep only the keypoints with a high detection frequency.
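As a minimal sketch of this selection step: keypoints detected in the reference image are kept when they are re-detected, within η pixels of the location predicted by the synthesized distortion, in a sufficient fraction of the trials (Eq. (6)). The use of SIFT as a stand-in detector, the render_fn interface and the 50% frequency threshold are assumptions for illustration.

```python
import numpy as np
import cv2

def select_stable_keypoints(ref_img, render_fn, n_trials=200, eta=2.0, min_freq=0.5):
    """Keep reference keypoints that are re-detected near their predicted
    location in at least min_freq of the synthesized distortions (Eq. (6)).
    render_fn(ref_img) must return (distorted_image, point_mapping), where
    point_mapping maps a reference 2D location to its distorted location;
    both are assumed to come from the 3D face model renderer."""
    detector = cv2.SIFT_create()          # stand-in for the detector used in the paper
    ref_kps = detector.detect(ref_img, None)
    hits = np.zeros(len(ref_kps))

    for _ in range(n_trials):
        dist_img, point_mapping = render_fn(ref_img)   # one random (M, A, L) sample
        dist_pts = np.array([kp.pt for kp in detector.detect(dist_img, None)])
        if len(dist_pts) == 0:
            continue
        for i, kp in enumerate(ref_kps):
            u_pred = np.asarray(point_mapping(kp.pt))  # u(k; M, A, L)
            # Eq. (6): detectable if some detected point lies within eta pixels
            if np.min(np.linalg.norm(dist_pts - u_pred, axis=1)) < eta:
                hits[i] += 1

    return [kp for kp, h in zip(ref_kps, hits) if h / n_trials >= min_freq]
```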


C. Incremental Multi-View Descriptor Learning

It is difficult to deal with the distortions caused by large out-of-plane rotations: not only is it impossible to learn keypoints robust to these distortions, it is also difficult to learn an invariant descriptor. Fig. 3 shows why the oriented gradient-based SIFT descriptor is sensitive to the occlusions created by out-of-plane distortions. Instead of learning invariance, we learn multiple descriptors at different view angles, and propose a multi-view class learning algorithm for this purpose. In our algorithm, a keypoint corresponds to a single class and its descriptors observed at different angles correspond to different instances of this class. As shown in Fig. 4, a fronto-parallel face image I_0 is assumed to be available. We treat each keypoint in this view as a single class (boundary keypoints are suppressed with a distance constraint). Then we build a Nearest-Neighbor classifier H such that H : {k_1, ..., k_n} → {−1, 1, ..., n}, where n is the number of keypoints and −1 denotes points that do not belong to any existing class.

In a newly synthesized image I_1 created from a different viewpoint, each extracted descriptor is either a new view of an existing class or the first view of a new class. To recognize the identity of each descriptor, we take an incremental multi-view class learning strategy. By back-projecting the detected keypoints onto our 3D head model, we can get a label y(k) ∈ {−1, 1, ..., n} for a keypoint k that has the closest 3D location to an existing class. Moreover, when we apply the classifier H to the descriptor f(k), we obtain another class label ŷ(f(k)) ∈ {−1, 1, ..., n}. The different combinations of y(k) and ŷ(f(k)) must therefore be treated separately. In the following, y(k) and ŷ(f(k)) are written as y and ŷ for short.

(a) When y = ŷ ≠ −1, f(k) is a new view of the existing class labeled y. The label y ≠ −1 indicates that k corresponds to an existing class, and the classification response ŷ = y indicates that f(k) is closest to its current view. To improve classification, we merge f(k) with the existing view of class y to compose a mean descriptor.

(b) When y ≠ ŷ, y ≠ −1 and ŷ ≠ −1, k belongs to an existing class (y ≠ −1) but its descriptor f(k) is nearest to the view of another existing class (ŷ ≠ y, ŷ ≠ −1). This suggests that f(k) is very likely to cause mis-classification due to its similarity with another class view. Therefore, we skip this kind of descriptor.

(c) When y = −1 and ŷ ≠ −1, k is a new class (y = −1) but it is likely to be misclassified (ŷ ≠ −1). To reduce misclassification, we also skip this kind of descriptor.

(d) When y ≠ −1 and ŷ = −1, k belongs to an existing class (y ≠ −1) but it is classified as a keypoint on the background. Therefore, f(k) is a previously unseen view of class y. If we averaged f(k) with the others as in case (a), the descriptive ability of this class would become weak. We therefore simply treat k as an instance of a new class so that f(k) is kept independently.

(e) When y = ŷ = −1, k is a new class and f(k) cannot be interpreted by any of the existing class views. However, we often observe that unrealistic face images, caused by texture-mapping artifacts, also yield this response for the detected k and f(k); such artifacts clearly will not occur in practice. To identify this case, we rely on a simple heuristic: we count the number n_r of 'new classes' and the number n of detected keypoints. If n_r/n is too large, the current image is likely to be unrealistic, so we give up the 'new class' k and f(k); otherwise, we accept k as a new class and f(k) as its first view.

Once the current image has been learned, we synthesize another image by drawing random rotations and repeat the above steps. The whole learning approach is summarized in Algorithm 1. To find the 3D position of a keypoint quickly, we use the color-filling method proposed in [44] to accelerate the crucial step of facet indexing on the GPU.

Fig. 4: The keypoint and multi-view descriptor learning. Blue points: keypoints learned from the frontal view. Sky-blue points: the same keypoints in alternative views. Red points: rejected keypoints in alternative views. Green circle: keypoint scale. On the right side, accepted keypoint patches are shown as well as rejected keypoint patches covered by a red cross.

Algorithm 1: Incremental Multi-View Descriptor Learning
Data: {K_0, ..., K_h | K_i = {k_1, ..., k_{n_i}}}
Result: the keypoint classifier H
Initialize: H_0 : K_0 → {−1, 1, ..., n};
for i = 1 to h do
    n_r ← 0; n_s ← 0;
    for j = 1 to n_i do
        y = Backproject(k_j); ŷ = H(f(k_j));
        if y = ŷ ≠ −1 then add the new view f(k_j) to class y;
        else if y ≠ −1 and ŷ = −1 then add a new class k_j to H; n_r ← n_r + 1; n ← n + 1;
        else if y = −1 and ŷ = −1 then temporarily store k_j and f(k_j); n_s ← n_s + 1;
        else give up k_j and f(k_j);
        end
    end
    if n_r/n_i > δ then add the stored {k} as new classes to H; n ← n + n_s;
    else give up the stored {k};
    end
    Update: H ← H+;
end
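The per-keypoint decision logic of Algorithm 1 can be condensed as follows; here the classifier is reduced to a dictionary mapping class labels to stored descriptors, and backproject (label of the closest existing 3D class, or -1) and classify (nearest-neighbor response of H, or -1) are assumed helpers. The acceptance test on n_r/n_i follows Algorithm 1 as printed, and δ is an assumed value.

```python
def update_classes(classes, detections, backproject, classify, delta=0.3):
    """One synthesized view of the incremental multi-view learning, cases (a)-(e).
    classes: dict label -> list of descriptors. detections: list of (keypoint,
    descriptor) pairs. delta stands for the threshold in Algorithm 1 (assumed value)."""
    pending, n_new = [], 0
    for k, f in detections:
        y = backproject(k)                      # closest existing 3D class, or -1
        y_hat = classify(f)                     # nearest stored descriptor class, or -1
        if y == y_hat and y != -1:              # (a) new view of an existing class
            classes[y].append(f)                #     (the paper merges it into a mean)
        elif y != -1 and y_hat == -1:           # (d) unseen view: open a new class
            classes[max(classes, default=0) + 1] = [f]
            n_new += 1
        elif y == -1 and y_hat == -1:           # (e) tentative new class, decide below
            pending.append(f)
        # (b), (c): ambiguous descriptors are discarded
    if detections and n_new / len(detections) > delta:
        for f in pending:                       # accept the tentative classes
            classes[max(classes, default=0) + 1] = [f]
    return classes
```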


IV. PRUNING UNLIKELY KEYPOINTS WITH COLOR PRIOR

Typically, thousands of keypoints are detected in a single video frame. If we kept all of them for descriptor matching, it would be time-consuming and error-prone. Given the observation that skin color is distinctive enough to separate the face from the background, we exploit it to remove non-face keypoints before matching.

A. Keypoint Color Descriptor

Inspired by grayscale descriptors, we use the color cues of a keypoint in a similar way. We have tested many representations and find that the region covariance is rather effective. We compute the color covariance matrix with respect to the RGB values of each pixel (i, j) in the N × N local patch centered on a keypoint:

C = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} (\mathrm{RGB}_{ij} - m)(\mathrm{RGB}_{ij} - m)^T    (7)

where C ∈ R^{3×3} is the color covariance matrix and m ∈ R^{3×1} is the mean RGB vector of the patch. The covariance matrix captures the second-order statistics of the image patch. Since it mainly involves local additions, it can be quickly computed with integral images [43]. The critical problem with the covariance descriptor is that it cannot be compared in a Euclidean space. An extended method, named Sigma Set, was recently proposed to overcome this drawback [15]. Instead of using the covariance matrix itself, this method first performs a Cholesky decomposition C = LL^T, and then concatenates a set of points, called sigma points, derived from the elements of the lower triangular matrix L. In this way, the Sigma Set descriptor still captures second-order statistics like the covariance matrix, yet allows similarity measurements in the Euclidean space. This property motivates us to use the Sigma Set as our color descriptor. Our experiments in Table II show that the Sigma Set clearly exceeds histogram-based descriptions. We have tested several color spaces for the Sigma Set (RGB, Opp, rg, Transformed) without noticeable differences in matching accuracy, and therefore simply build our Sigma Set descriptor on the RGB space.
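A sketch of the patch-level computation: the RGB covariance of Eq. (7) followed by a Cholesky factorization and a sigma-point expansion. The exact sigma-point construction of [15] is not restated in the paper; concatenating the patch mean with the points m ± √d · L_i (d = 3) is an assumed construction that happens to yield the 21-dimensional length reported in Table II.

```python
import numpy as np

def sigma_set_descriptor(patch_rgb):
    """Color descriptor of an N x N RGB patch centered on a keypoint.
    patch_rgb: (N, N, 3) array. Returns a 21-D vector (assumed construction)."""
    n = patch_rgb.shape[0]
    pixels = patch_rgb.reshape(-1, 3).astype(np.float64)   # N*N x 3
    m = pixels.mean(axis=0)                                 # mean RGB vector
    diff = pixels - m
    cov = diff.T @ diff / (n * (n - 1))                     # Eq. (7)
    cov += 1e-9 * np.eye(3)                                 # keep the Cholesky well-posed
    L = np.linalg.cholesky(cov)                             # C = L L^T
    d = 3.0
    sigma_pts = np.hstack([m + np.sqrt(d) * L[:, i] for i in range(3)]
                          + [m - np.sqrt(d) * L[:, i] for i in range(3)])
    return np.concatenate([m, sigma_pts])                   # 3 + 18 = 21 values
```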

Descriptor    Length    Precision
SigmaSet      21        0.9882
RGBHist       45        0.9117
OppHist       45        0.9252
rgHist        30        0.9347
TransHist     45        0.8843

TABLE II: We evaluated different color descriptors on the 'v2', 'v3' and 'ssm1' sequences and report averages here. SigmaSet is the best: its length is the smallest while it achieves the highest keypoint classification precision. RGBHist, OppHist, rgHist and TransHist are described in [45]. The histogram lengths are chosen empirically to best trade off accuracy and speed.

B. Creating the Color Prior

We use the keypoints' color descriptors to identify them as lying on the face or on the background. Since our training data is limited, we rely on a discriminative binary classifier. Among many options, we choose random projection classification trees [13]; one could of course also use alternatives such as Support Vector Machines [46] or AdaBoost [14]. Our training set consists of color descriptors of both face and background keypoints. During training, a projection tree splits the set according to a linear rule b^T c + t ≤ 0, with b a random projection direction and t a given scalar. To learn the set {b} of directions, we first generate a small dictionary of 200 random directions; then, for each node, the b that maximizes the Shannon entropy at this node is selected from the dictionary. The splitting process is repeated until either a leaf node receives too few data or the maximum tree depth d = 5 is reached. In total, our classifier consists of a forest of N_T = 20 trees built in this manner.

We use the label +1 for the face class and −1 for the background class. Classifying c as lying on the face or on the background is equivalent to labeling it with either y = +1 or y = −1. If we denote by P_j^i(c | y = +1) and P_j^i(c | y = −1) the conditional probabilities of face and background at the j-th leaf of the i-th tree, we have

P_j^i(c \mid y = +1) = \frac{N_{ij}^+}{N^+}, \qquad P_j^i(c \mid y = -1) = \frac{N_{ij}^-}{N^-}    (8)

where N_{ij}^+ and N_{ij}^- are the numbers of face and background color descriptors at this leaf, and N^+ and N^- denote the total numbers of face and background color descriptors. To classify an input keypoint k, we traverse all N_T trees. Assuming a uniform prior, we can approximately write

P(y \mid c) \approx P(c \mid y) = \frac{1}{N_T} \sum_{i=1}^{N_T} P_j^i(c \mid y)    (9)

where P_j^i(c | y) is the conditional probability at the j-th leaf of the i-th tree reached by c. The posterior responses corresponding to the two labels are P(y = +1 | c) and P(y = −1 | c). Comparing the two responses, we make a 0-1 decision:

g(c) = \begin{cases} 1 & \text{if } P(y = +1 \mid c) > \alpha P(y = -1 \mid c) \\ 0 & \text{otherwise} \end{cases}    (10)

That is, k is said to lie on the face if P(y = +1 | c) > α P(y = −1 | c), and on the background otherwise. The soft factor α is useful to avoid triggering many false alarms; we experimentally set it to 0.9. Since g(c) is computed before descriptor matching, it is called the color prior of k.
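At test time, the forest reduces to averaging leaf conditionals and thresholding with the soft factor α, as in Eqs. (9)-(10). In this sketch the trained trees are abstracted as callables returning the pair of leaf probabilities of Eq. (8); building the random-projection trees themselves is omitted.

```python
import numpy as np

def color_prior(c, trees, alpha=0.9):
    """Binary color prior g(c) of Eq. (10).
    trees: list of callables, each mapping a color descriptor c to the
    leaf conditional probabilities (P(c|y=+1), P(c|y=-1)) of Eq. (8)."""
    probs = np.array([t(c) for t in trees])        # NT x 2
    p_face, p_bg = probs.mean(axis=0)              # Eq. (9), uniform prior
    return 1 if p_face > alpha * p_bg else 0       # Eq. (10), soft factor alpha
```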


t 1

ut

t

K [ Rt | Tt ]

K [ Rt 1 | Tt 1 ]

t 1

ut 1

t

ut

Fig. 5: Schematics of keypoint flow (a) and optical flow (b). For the 3D-2D pair (V, ut−1 ) at time t−1, keypoint matching results in a flow 4u = ut −ut−1 while measuring optical flow e t − ut−1 . Typically, k4uk > k4e obtains 4e u=u uk.

( Ek , Es )

j

K [ Rt | Tt ]

Ns

K [ Rt 1 | Tt 1 ]

ut 1

V

(b)

Es   K [ R | T ]Vj  u j 22

V

(a)

7

( Ek* , Es* )

Nk

Ek   K [ R | T ]Vi  ui 22 i

V. REMOVING JITTER WITH OPTICAL FLOW

Keypoint discretization can shift a keypoint location from its true position by one or more pixels. As a result, motion jitter often arises when keypoint matching is used without temporal consistency information. Ideally, all previous estimates should be used to filter jitter, but in practice the last frame suffices. In 3D tracking, a regularization term is widely employed to enforce motion smoothness on the pose estimate [37], which often takes a form similar to

E_k + \lambda \left( \| R_t - \hat{R}_{t-1} \|_2^2 + \| T_t - \hat{T}_{t-1} \|_2^2 \right)    (11)

where E_k is the energy of the data fitting, λ is a given weight, and R and T are the pose matrices. However, this smoothness term suffers from two drawbacks. First, a large λ leads to over-smoothed results while a small one has no substantial influence; in practice, it can only be determined experimentally. Second, adding this term changes the original functional form and requires an alternative optimization. To overcome these drawbacks, we propose the following adaptive approach.

Fig. 6: Balancing E_s with E_k. When E_k is given, we choose a reasonable E_s by controlling its number of elements N_s according to Method-1. Then we use Method-2 to reach an optimal point (E_k^*, E_s^*) that ideally has minimum reprojection error.

A. Integrating Optical Flow

We use an additive energy model to remove jitter:

E_g = E_k + E_s    (12)

where E_s is our regularization term and E_g the total energy. Unlike Eq. (11), we define E_s in the same functional form as E_k. For keypoint-based pose tracking, we have

E_k = \sum_{i=1}^{m_k} \| K [R_t | T_t] V_t^i - u_t^i \|_2^2, \qquad E_s = \sum_{j=1}^{m_s} \| K [R_t | T_t] \tilde{V}_t^j - \tilde{u}_t^j \|_2^2    (13)

where K is the camera intrinsic matrix, {V_t^i ↔ u_t^i}_{i=1,...,m_k} is the set of m_k 3D-to-2D correspondences established via keypoint matching, and {Ṽ_t^j ↔ ũ_t^j}_{j=1,...,m_s} is the set of added smoothing correspondences. To filter jitter, we set ũ_t^j = ũ_{t−1}^j + Δũ_t^j, with Δ a first-order motion displacement. We compute optical flow to determine Δũ_t^j:

[\nabla I_x \; \nabla I_y] \, \Delta\tilde{u} = -\nabla I_t    (14)

where Δũ is efficiently computed within a local region. As Fig. 5 shows, integrating optical flow can smooth possibly wrong large motion displacements. To establish 3D-to-2D correspondences, we assign a 3D position to each optical flow point. We select optical flow seeds from two sources with known 3D points: S_1, the keypoints successfully matched in the previous frame, and S_2, the projections of randomly drawn 3D vertices. Measuring the optical flow of S_1 essentially performs interest point tracking; S_2 is useful when |S_1| is too small. In addition, we randomly re-sample S_2 at each frame to ease motion drift. Since the added term has the same form as keypoint matching, the resulting energies can be minimized simultaneously. The following section describes the adaptivity of our method.
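The smoothing correspondences can be gathered with a standard pyramidal Lucas-Kanade tracker, used here as a stand-in for the local flow computation of Eq. (14); the seed points and their 3D positions are assumed to come from S_1 ∪ S_2 as described above, and the window size and pyramid depth are illustrative values.

```python
import numpy as np
import cv2

def flow_correspondences(prev_gray, cur_gray, seed_uv, seed_xyz):
    """Track seed image points from the previous to the current frame and
    return 3D-to-2D correspondences (V_j, u_j) for the smoothing term E_s.
    seed_uv: (m, 2) image points with known 3D positions seed_xyz: (m, 3)."""
    pts0 = seed_uv.reshape(-1, 1, 2).astype(np.float32)
    # Pyramidal Lucas-Kanade as a stand-in for the local flow of Eq. (14)
    pts1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts0, None,
                                                  winSize=(15, 15), maxLevel=2)
    ok = status.ravel() == 1
    return seed_xyz[ok], pts1.reshape(-1, 2)[ok]
```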

B. Balancing E_s with E_k

Let S_k denote the set of 3D-to-2D keypoint correspondences and S_1 ∪ S_2 the optical-flow set. Neglecting outliers, we have m_k = |S_k| and m_s = |S_1 ∪ S_2|. Given Eq. (12), we want to achieve the following goals:
• when a large motion occurs, the keypoint set S_k should dominate the pose estimate;
• when the motion is smooth, the optical-flow set S_1 ∪ S_2 should suffice to remove the jitter caused by S_k.
In general, S_k is reliable enough to handle arbitrarily large motions and S_1 ∪ S_2 captures smooth motions. We therefore need to automatically adapt the importance of keypoint matching to the motion magnitude. We approach these goals with the following two strategies.

Method-1: We keep the size of S_1 ∪ S_2 equal to that of S_k: |S_1 ∪ S_2| = |S_k|. The intuition is that more keypoint correspondences need more optical-flow correspondences to smooth them, and vice versa. Since in general |S_1 ∪ S_2| = |S_1| + |S_2|, we have |S_1| + |S_2| = |S_k|. If ε < |S_1| < |S_k|, we draw |S_2| = |S_k| − |S_1| vertex seeds according to 3D visibility; otherwise, if |S_1| ≥ |S_k| > ε, we select a subset S_1' ⊆ S_1 of the elements with the highest matching scores. We set ε = 7 to check whether S_k is large enough. For the degenerate case |S_k| ≤ ε, we keep S_1 and draw |S_2| = |S_1| vertex seeds.

Method-2: To estimate R and T, we use the outlier removal tool PROSAC to select inliers from S_k and S_1 ∪ S_2 on an equal footing. PROSAC stems from RANSAC and performs importance sampling according to a quality function [9]. For S_k, the quality function is the matching score; for S_1 ∪ S_2, it is the error of the optical flow measurement. Suppose PROSAC selects two inlier sets C_k ⊆ S_k and C_s ⊆ S_1 ∪ S_2, both consistent with a global 3D pose estimate. Intuitively, the contributions of E_k and E_s to pose recovery are proportional to the numbers of their correspondences. When more optical-flow correspondences are selected than keypoint ones, i.e., |C_k| < |C_s|, drift is very likely to happen. To prevent this risk, we should keep more keypoint correspondences. For this purpose, for each point p ∈ C_s that also belongs to S_k, we set its optical flow position ũ_p to the average of ũ_p and its matched keypoint position u_p: ũ_p = (ũ_p + u_p)/2. Then we repeat the PROSAC inlier selection and this unbiasing procedure until |C_k| ≥ |C_s| is satisfied. The above solution adapts Eq. (12) to fit the true motion. Fig. 6 illustrates the balancing process.
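A sketch of the Method-1 budgeting described above; matched_keypoints stands for S_k, tracked_keypoints for S_1 (assumed sorted by matching score), and sample_visible_vertices for drawing S_2 from visible model vertices. All three are assumed interfaces, and the branch conditions follow the rules given in the text.

```python
def build_flow_seeds(matched_keypoints, tracked_keypoints, sample_visible_vertices,
                     eps=7):
    """Method-1: keep |S1 u S2| = |Sk|; eps = 7 guards the degenerate case."""
    n_k, n_1 = len(matched_keypoints), len(tracked_keypoints)
    if n_k <= eps:                          # degenerate case: keep S1, draw |S2| = |S1|
        s1 = list(tracked_keypoints)
        s2 = sample_visible_vertices(n_1)
    elif n_1 < n_k:                         # eps < |S1| < |Sk|: top up with vertex seeds
        s1 = list(tracked_keypoints)
        s2 = sample_visible_vertices(n_k - n_1)
    else:                                   # |S1| >= |Sk|: keep the best-scored subset
        s1 = list(tracked_keypoints)[:n_k]
        s2 = []
    return s1 + list(s2)
```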


Algorithm 2: 3D Head Pose Tracking Algorithm
Data: detected keypoint set {k_j}_{j=1:n}
Result: 3D head pose R, T
/* keypoint matching with color prior */
for j = 1 to n do
    compute the color prior g(k_j);
    if g(k_j) = 1 then find the nearest-neighbor reference keypoint k̃;
end
for i = 1 to m_k do
    if Eq. (17) holds then x(k̃_i) = 1 else x(k̃_i) = 0;
end
/* establishing correspondences via optical flow */
sample seed points according to Method-1 to measure optical flow and establish 3D-to-2D correspondences;
/* simultaneous pose recovery and jitter removal */
while \sum_{i}^{m_k} ρ(r_i) ≥ \sum_{j}^{m_s} ρ(r_j) do
    minimize Eq. (19) to solve for R, T;
    perform Method-2 to prevent potential bias;
end
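Algorithm 2 can also be rendered as a compact per-frame routine. Every callable passed in below (detector, color prior, reference matcher, seed builder, flow tracker, pose solver) is an assumed interface, analogous to but not identical with the sketches given earlier, standing in for the corresponding step of the paper.

```python
def track_frame(frame, prev_frame, detect, color_prior, match_reference,
                build_flow_seeds, flow_correspondences, estimate_pose, state):
    """One iteration of Algorithm 2 (sketch): color prior -> keypoint matching ->
    optical-flow seeding -> joint pose recovery with jitter removal."""
    # 1. Keypoint matching restricted by the color prior (Section IV)
    keypoints = [k for k in detect(frame) if color_prior(k) == 1]
    matches = [m for m in map(match_reference, keypoints) if m is not None]

    # 2. Optical-flow correspondences balanced by Method-1 (Section V-B)
    seed_uv, seed_xyz = build_flow_seeds(matches, state)
    flow = flow_correspondences(prev_frame, frame, seed_uv, seed_xyz)

    # 3. Joint minimization of Eq. (19) with PROSAC-style inlier selection,
    #    re-balanced by Method-2 until keypoint inliers dominate
    return estimate_pose(matches, flow)
```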

VI. TRACKING ALGORITHM SUMMARY

In this section, we summarize how the color prior, keypoint matching and optical flow are combined to recover the 3D head pose. The algorithm establishes a set of keypoint and optical-flow 3D-to-2D correspondences, and then solves for the best 3D pose from them.

The first step is to match keypoints. For efficiency, we take into account the color prior of each detected keypoint to remove non-face keypoints early. Keypoint matching can be regarded as minimizing the following energy with respect to a binary term x(k̃):

E_f = \sum_{i=1}^{m_k} x(\tilde{k}_i) \| f(k_i) - f(\tilde{k}_i) \|_2^2    (15)

where x(k̃) ∈ {0, 1} denotes the state of keypoint descriptor matching: x(k̃) is 1 if there is a matching descriptor k̃ for k_i and its corresponding color prior g(k) is 1; otherwise it is 0. To solve it, it is common to assume independence among keypoints, such that

\min E_f = \sum_{i=1}^{m_k} \min \left( x(\tilde{k}_i) \| f(k_i) - f(\tilde{k}_i) \|_2^2 \right)    (16)

Global optimization of E_f can thus be achieved by a complete candidate search: for each reference keypoint k̃, we search for the best candidate match k, i.e., the one with the minimum Euclidean distance between SURF descriptors. In practice, during the learning stage we organize all reference keypoints in a K-D tree [31], and at runtime we rely on the Best-Bin-First strategy to achieve fast approximate matching [2]. To encourage unique matches, we follow [2] and assume that mismatches can be approximately identified by comparing the distance of the closest neighbor to that of the second-closest neighbor. For a query keypoint k, let k̃_{NN} and k̃_{2NN} be the closest and second-closest reference keypoints, respectively, and r(k, k̃_{NN}) and r(k, k̃_{2NN}) the corresponding Euclidean distances. A correct match is supposed to have its closest neighbor significantly closer than the closest incorrect matches. Similarly, for a reference keypoint k̃, under the same assumption we compare its closest match against the second-closest one. Thus, a correct match must satisfy

\frac{r(k, \tilde{k}_{NN})}{r(k, \tilde{k}_{2NN})} \ge \sigma_1 \quad \text{and} \quad \frac{r(\tilde{k}, k_{NN})}{r(\tilde{k}, k_{2NN})} \ge \sigma_2    (17)

where σ_1 and σ_2 are experimentally set to 0.59. For the matches that satisfy these tests, we set x(k̃) = 1, and otherwise x(k̃) = 0.

Once keypoint matching is done, we sample a set of seed points to measure optical flow and establish 3D-to-2D correspondences. The sampling strategy is detailed in Method-1. As |S_1 ∪ S_2| = m_s and |S_k| = \sum_{i=1}^{m_k} x(k̃_i), the general principle is to have

m_s = \sum_{i=1}^{m_k} x(\tilde{k}_i)    (18)

when \sum_{i=1}^{m_k} x(k̃_i) > ε. This reflects the intuitive idea that more keypoint correspondences are likely to require more optical-flow correspondences to smooth them, and vice versa. By putting all the keypoint and optical-flow correspondences together, we can simultaneously minimize the following reprojection error to solve for R and T:

E_g = \sum_{i=1}^{m_k} x(\tilde{k}_i) \, \rho(r_i) \underbrace{\| K[R|T] V_i - u_i \|_2^2}_{r_i} + \sum_{j=1}^{m_s} \rho(r_j) \underbrace{\| K[R|T] \tilde{V}_j - \tilde{u}_j \|_2^2}_{r_j},
\quad \text{subject to } R R^T = I \text{ and } \sum_{i=1}^{m_k} \rho(r_i) \ge \sum_{j=1}^{m_s} \rho(r_j)    (19)

where ρ(r) ∈ {0, 1} is the binary output of PROSAC, i.e., 1 for inliers and 0 for outliers. The constraint R R^T = I forces R to be an orthogonal matrix, and the constraint \sum_{i}^{m_k} ρ(r_i) ≥ \sum_{j}^{m_s} ρ(r_j) corresponds to |C_k| ≥ |C_s| in Method-2 of Section V-B. In the literature, this problem is commonly referred to as the Perspective-n-Point (PnP) problem. Among the many available solutions, the EPnP algorithm recently proposed in [30] is the most efficient because of its O(m) computational complexity. Its main idea is to express all m 3D points as a weighted sum of four virtual control points, which reduces the problem to estimating the coordinates of these control points in the camera referential. Given Eq. (19), m is easily computed as

m = \sum_{i=1}^{m_k} x_i \rho(r_i) + \sum_{j=1}^{m_s} \rho(r_j),    (20)

that is, m is the sum of the numbers of inlier keypoints and inlier optical flow points. These m globally consistent correspondences form a kernel from which the pose is deduced, so recovering large pose displacements and removing motion jitter are achieved simultaneously. The above procedure is summarized in Algorithm 2. In the optimization, keypoint matching is blind to the PnP problem: the proposed learning scheme and the color prior generally perform well enough that adjusting the matching results is unnecessary.
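For concreteness, the sketch below combines a descriptor ratio test with an EPnP solution inside OpenCV's RANSAC loop. Note the differences from the paper: the test is written in its usual distance-ratio form rather than the similarity form of Eq. (17), plain RANSAC replaces PROSAC, the 3.0-pixel inlier threshold is taken from the experimental section, and the optical-flow correspondences and the Method-2 re-balancing are omitted.

```python
import numpy as np
import cv2

def match_and_solve(ref_desc, ref_xyz, frame_desc, frame_uv, K, ratio=0.59):
    """Match frame descriptors to the reference set, then recover (R, T) with
    EPnP + RANSAC. ref_desc/frame_desc: float32 descriptor arrays; ref_xyz: (n, 3)
    model points; frame_uv: (m, 2) detected image points; K: 3x3 intrinsics."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    obj_pts, img_pts = [], []
    for pair in matcher.knnMatch(frame_desc, ref_desc, k=2):
        if len(pair) < 2:
            continue
        m, m2 = pair
        if m.distance < ratio * m2.distance:      # unique-match (ratio) test
            obj_pts.append(ref_xyz[m.trainIdx])   # 3D point of the reference keypoint
            img_pts.append(frame_uv[m.queryIdx])  # its 2D detection in the frame
    if len(obj_pts) <= 5:                         # mirror the >5-inlier requirement
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(obj_pts, np.float32), np.asarray(img_pts, np.float32),
        K, None, flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                    # rotation matrix from the axis-angle rvec
    return R, tvec, inliers
```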

VII. EXPERIMENTAL EVALUATIONS

We run extensive experiments to evaluate the proposed head tracking approach on 9 live-captured video sequences of 6 subjects and the 45 sequences of the public Boston University dataset [6]. These test videos exhibit many challenging photometric and geometric disturbances, such as sudden illumination changes, facial expressions, large head rotations and occlusions. In the experiments, we compared our approach against related methods, and also assessed the key technical components. The test videos and tracking results are available at the author's homepage¹.

¹ http://liama.ia.ac.cn/wiki/user:hbwang:projects

A. Configurations

We use a generic 3D face model with 206 regular vertices and manually select 23 semantic vertices among them. Before tracking starts, we capture one fronto-parallel and one profile photo of the tracking target, and interactively adjust the generic model to his/her face. Afterwards, we extract the face texture from the fronto-parallel image and map it onto the model. We use this adapted face model to synthesize 6 images by randomly changing the yaw and pitch angles in the range of [−45°, +45°] for learning invariant keypoints. On each image, we simulate various distortions to detect the most stable keypoints and learn reliable descriptors. Typically, 80 to 200 keypoints are detected on an image, and about 20 to 50 stable keypoints are kept after selection.


To train the color classifier, we collect positive Sigma Set descriptors from the illumination-synthesized images used for the keypoint learning. We also collect negative descriptors from the keypoints in the background of the fronto-parallel image. Note that the invariant descriptors and the color classifier are only learned once for a given subject. When tracking starts, the system is automatically initialized by searching for keypoint matches in the whole frame. Since we perform this global matching at each frame, the system can automatically re-initialize whenever tracking fails. All parameters are optimized to produce the best pose tracking results and fixed throughout our experiments. The affine range for shape simulation is θ ∈ [−30°, 30°], φ ∈ [−30°, 30°] and λ_1, λ_2 ∈ [0.8, 1.2]. We set η = 2.0 for the allowed localization error. The patch size for both SURF and the Sigma Set is 9 × 9. The SURF descriptor length is 64 while the Sigma Set color descriptor length is 21.

B. Comparative Evaluations

We compare our approach (SUNN+OP) against closely related works: the ferns-based patch classification approach (FERNS) [36], our approach when only optical flow is used (OP-Only), and our approach when only the SURF descriptor and Nearest-Neighbor matching are used (SUNN). Instead of using a descriptor, FERNS samples a set of binary features to represent a patch and learns a tree-structured classifier following a semi-naive Bayesian framework. FERNS uses a large set of affine and linear illumination simulations to detect stable keypoints and augment the training data. For OP-Only, we use SUNN to initialize the tracking on the first frame, and sample the 229 model vertices to track optical flow on each successive frame. Two different sampling sizes of optical-flow correspondences, 35 and 70, are tested to verify whether more correspondences are better.

We first compare the baseline method SUNN to FERNS in terms of head pose recognition rate and tracking speed as functions of the number of reference keypoints. The 'v1' sequence used here has a rather flat background; its major challenge is large pose displacements. Results are shown in Fig. 7. SUNN always has a better recognition rate, but since FERNS targets high-speed keypoint recognition, SUNN is slower than FERNS. FERNS is sensitive to the number of keypoints and can be as slow as SUNN with more than 120 classes. By contrast, the computational time of SUNN is not much influenced by the number of reference keypoints, and hence this parameter does not require manual tuning in practice. In the following, we use 100 reference keypoints for FERNS since it is a good trade-off between speed and accuracy.

We also report the pose recognition rates of the four approaches on live-captured video sequences in Fig. 8. The sequences, named 'v1', 'v2', 'v3' and 'v4', contain 500, 626, 2010 and 518 frames, respectively. The 'v1' sequence exhibits a uniform background while the others contain cluttered backgrounds. The main challenges of these sequences are large motion displacements and moderate illumination changes. No facial expressions are captured in this first test. In Fig. 8 we show the recognition rates as well as example images for each sequence.


[Fig. 7: two plots, pose recognition rate (%) and tracking speed (frames per second), as functions of the number of keypoint classes, for FERNS and SUNN.]

Video        v1       v2       v3       v4
Length       500      626      2010     518
FERNS        79.5%    68.0%    55.5%    65.3%
OP-Only-35   71.6%    79.6%    80.5%    74.7%
OP-Only-70   69.4%    77.2%    79.2%    73.2%
SUNN         96.0%    87.2%    86.2%    90.0%
SUNN+OP      100.0%   94.4%    89.6%    92.9%
SUNN+OP+CP   100.0%   96.5%    92.3%    94.2%

Fig. 7: Pose recognition rate and tracking speed as functions of the number of reference keypoints. The 500-frame 'v1' sequence is used for this test.

The tracked pose is regarded as correct if the RMS (root mean square) of the reprojection error is below 3.0 pixels and the number of inlier correspondences is larger than 5. We have tested several thresholds ranging from 1.0 to 6.0 pixels and empirically found that 3.0 is the best choice to measure the PnP accuracy; the constraint of more than 5 inliers guarantees the reliability of the PnP computation. In the 'v1' and 'v4' sequences, displacements between successive frames are sometimes very large. Hence, for both sequences, the optical-flow based rates (OP-Only) are about 24% worse than SUNN. We can also see that OP-Only-35 is always slightly better than OP-Only-70, which reflects the fact that optical flow becomes more likely to include outliers when the number of correspondences increases. When the background is simple, FERNS still achieves a 79.5% tracking rate, but against a cluttered background its rates drop to 68%, 55.5% and 65.3%; hence, FERNS is less discriminative than SUNN. SUNN+OP clearly outperforms the other approaches on all sequences. When displacements are large, OP-Only tends to fail, but SUNN still captures the correct motion through its global feature matching. When motions are small, the discretization of SURF features causes jitter for SUNN, while using optical flow fixes the problem. Only when SURF matches are insufficient do optical-flow correspondences take over to ensure a good pose estimate; otherwise, we put more emphasis on SUNN to encourage large displacement estimates, since SUNN is usually more reliable. When the color prior is integrated, our approach (SUNN+OP+CP) performs better than SUNN+OP since the color prior leads to more correct keypoint matches.

C. Keypoint Descriptor Learning

Learning keypoint descriptors from multiple views is a critical step of our approach. We demonstrate the power of our learning approach in Fig. 9. For comparison, we also consider a naive learning approach that simply keeps all the multi-view keypoints and organizes them via the Nearest-Neighbor classifier. Although the naive method learns more keypoint classes and descriptors, it performs much worse in practice. In the first column of Fig. 9, the mouth is partially occluded; as a result, naive learning fails due to insufficient correspondences, because it keeps many keypoint classes around the occluded mouth and mustache regions.

Fig. 8: Quantitative comparisons of different 3D head tracking methods. Upper table: the recognition rates of different methods. Lower images: representative scenes of the ‘v1’,‘v2’,‘v3’ and ‘v4’ sequences, respectively.

Fig. 9: Comparison of the descriptor learning approaches on three live-captured sequences of three subjects. Top row: our learning approach. Bottom row: naive learning. Our learning approach results in more correct keypoint matches, and therefore the pose estimate is more precise.

By contrast, our approach still finds the correct head pose since the matched keypoints are spread over the entire face. In the second column, the estimates of both approaches are correct but ours is more precise. The same happens in the third column, but the results are quite different in the last one. In the last column, most keypoints are localized around the two eyes of this face. With naive learning, more than 5 SURF matches cluster around the left eye and cause mismatches; correspondingly, the global estimation is severely biased as the mismatch errors are magnified. Conversely, at the learning stage of our approach we merge keypoints that are close to each other and remove ambiguous descriptors as described in Section III-C, which clearly removes keypoints around the eye. As a result, SURF matches are more scattered and the global estimation is no longer biased.

D. Pruning Unlikely Keypoints

In order to evaluate the RGB Sigma Set color prior for pruning keypoints that are unlikely to lie on the face, we conducted experiments on two live-captured sequences that contain sudden illumination changes.

[Fig. 10: ROC curves, TP/(TP+FN) versus FP/(FP+TN), on the 'v1'-'v4' sequences, comparing the Bayes model, the predefined skin model and our color prior.]

Fig. 10: Performance of our color prior on the four recorded sequences. We use the color prior to classify each detected keypoint, test the quality of the classifications against manually annotated ground truth, and measure the ROC for each sequence. For comparison, we also run two related color models and measure their ROC curves. The first is a classification method that uses the Bayes rule to identify skin/non-skin pixels [47]; we train it with the same images used for our color prior. The other is a skin model defined in the (r, g, b) space [21]: {r > 95, g > 40, b > 20, max(max(r, g), b) − min(min(r, g), b) > 15, abs(r − g) > 15, r > g, r > b}. The results show that our color prior outperforms both the Bayes and the predefined skin models.

When the color prior is integrated, the rate of correct keypoint matches increases for all sequences. In particular, the improvement is significant at the moment of sudden illumination changes, as shown in Fig. 11. As the correspondence lines indicate, many keypoints are mismatched when our color prior is not used; as a consequence, PROSAC cannot find the correct head poses since the ratio of outlier matches exceeds its tolerance. By contrast, the color prior filters out most keypoints whose colors are inconsistent with the face, which significantly reduces the chance of mismatches while keeping most of the correct keypoints.

The RGB Sigma Set captures second-order color information. As shown in Table II, the Sigma Set with a length of 21 is better than the histogram models with lengths of 45 or 30 bins. For the images in Fig. 11, histogram methods do not work because large lighting changes cause irregular shadows and reflections on the face. We also compare the proposed color prior with two related color methods; as shown in Fig. 10, our approach outperforms them.

E. Removing Jitter

In another set of experiments, we evaluated the efficiency of optical-flow correspondences in removing motion jitter. Only the results for the 'v1' sequence are shown in Fig. 12, together with some representative snapshots and motion trajectories. In Fig. 12(a), the translation estimated using SUNN only (left plot) is compared to the trajectories of SUNN+OP (right plot). Meanwhile, the rapid variations in the original trajectory are maintained after denoising, which can be explained by the fact that optical flow represents the true motion displacement. The same result appears when comparing the pitch and yaw angle estimates in Fig. 12(b). The dashed blue curve denotes the estimates of SUNN; the curves are perturbed by noise that can be larger than the signal itself. However, SUNN+OP removes this noise, as the black solid curve shows. Since the number of OP correspondences varies with that of SURF matches, the two curves have very similar variation patterns. When motion is smooth, optical flows are consistent and contain smaller geometric errors than SURF matches.

Therefore, PROSAC is likely to select more OP correspondences to offset the shifts of the SURF correspondences. Conversely, when the variation is rapid, optical flows contain large errors, so they will not be selected and do not attenuate the rapid variations captured by SURF matches. Consequently, SUNN+OP always preserves the true variations estimated from the SURF matches. Fig. 12(c) shows the jitter when tracking successive frames. In the three examples, while the first estimate is correct, the second deviates from the ground truth by either translational or rotational shifts; since optical flow estimates the correct trajectory, SUNN+OP fixes these deviations.

The jitter problem is analogous to the denoising problem in data reconstruction, whose usual solution is to add a quadratic or total-variation regularization term [4]. However, using optical-flow correspondences is preferable: since optical flow is a first-order approximation of the ground truth motion, it does not attenuate or remove rapid variations in the original signal as a quadratic term does. Besides, its computational cost is much smaller than that of the l1-normed total variation term since it remains l2-normed.

F. More Experimental Results

The performance of our SUNN+OP+CP tracking system was assessed on the Boston University database with public ground truth [6]. The database consists of 45 sequences (nine sequences for each of five subjects, each 200 frames long). The frame resolution is 320 × 240 and the recording rate is 30 frames per second. Each subject moves his head freely with both in-plane and out-of-plane rotations, and the lighting is uniform. Tracking is regarded as successful if the RMS of the reprojection error is below 3.0 pixels and the number of inlier correspondences is larger than 5. Table III shows the average tracking rate over the 45 sequences. Our SUNN+OP+CP approach successfully tracks all the frames; GAVAM [29] has the same performance, but the system of La Cascia et al. [6] tracks only about 75% of the frames. To further illustrate the accuracy, the estimated curves for the 'ssm1' sequence are shown in Fig. 13.

Fig. 11: The performance of the RGB SigmaSet color prior under sudden illumination changes. For each sequence, left side: with RGB SigmaSet; right side: without RGB SigmaSet. When the lighting suddenly changes, the face appearance changes significantly. In this case, using RGB SigmaSet makes the results significantly better. For each result, upper left image: template; gray lines: keypoint correspondences. Note that the absence of a mesh overlay means the pose was not recovered at that frame.
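For concreteness, a minimal sketch of a 21-dimensional RGB Sigma Set descriptor in the spirit of [15] is given below: the RGB mean of a keypoint patch concatenated with the sigma points obtained from the Cholesky factor of the 3 × 3 RGB covariance (3 + 6 × 3 = 21 values). The scaling factor, the small regularization term, and whether the mean is folded into the sigma points are assumptions rather than the paper's exact construction.

```python
import numpy as np

def rgb_sigma_set(patch, alpha=np.sqrt(3.0)):
    """21-D second-order color descriptor for an H x W x 3 RGB keypoint patch.
    Sketch in the spirit of the Sigma Set of [15]; alpha and the covariance
    regularization are assumptions."""
    pixels = patch.reshape(-1, 3).astype(np.float64)
    mu = pixels.mean(axis=0)                          # first-order statistics
    cov = np.cov(pixels, rowvar=False)                # 3x3 second-order statistics
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(3))    # lower-triangular square root
    sigma = np.hstack([mu[:, None] + alpha * L,       # 3 x 6 sigma points
                       mu[:, None] - alpha * L])
    return np.concatenate([mu, sigma.T.ravel()])      # length-21 descriptor

def color_consistent(desc, prior, tau):
    """Keep a keypoint only if its color descriptor is close to the face prior;
    tau is a hypothetical threshold."""
    return np.linalg.norm(desc - prior) < tau
```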

                      La Cascia et al. [6]   GAVAM [29]   SUNN+OP+CP
Pose tracking rate    ≈75%                   100%         100%

TABLE III: Comparing head tracking rates on the 45 sequences of the Boston University database [6]. Our approach successfully tracked all the frames; GAVAM [29] also reported a 100% rate, while La Cascia et al. [6] reported only about a 75% rate.

This sequence contains moderate pose variations. The average translational and angular errors are 0.48 inches and 0.5°, respectively. Some tracking snapshots taken from all the ‘ssm’ sequences are also given in the figure. On these 320 × 240 images, SURF keypoints are often not sufficient; hence, yaw and pitch can only be tracked when they do not exceed 40°. Table IV shows the translational and angular errors on the 45 sequences with respect to all six degrees of freedom for each subject. Both means and standard deviations are given. In comparison with GAVAM [29], which also tracks all the frames, our approach is approximately 0.7 inches more accurate in translation and 1° more accurate in rotation.

Fig. 14 shows four shots from tracking a 1029-frame sequence of a woman in a laboratory setting. The background is rather cluttered and many keypoints are detected on it. The sequence contains challenging smiling expressions, and the illumination is uniform. FERNS misses frame #400 and performs poorly on the others. OP-Only yields precise estimates for frames #306 and #400, but becomes incorrect for frames #473 and #988 due to drift. Comparatively, SUNN estimates correct results, but the results of SUNN+OP are systematically better, which is achieved by combining the strengths of SUNN and OP-Only. The results of SUNN+OP+CP are almost identical to those of SUNN+OP, since a sufficient number of keypoint correspondences is already correctly established without the color prior.

To conclude the experiments, we show the performance of SUNN+OP on a number of challenging snapshots in Fig. 15. Results with SUNN+OP+CP are almost the same as those of SUNN+OP in these snapshots and are therefore not shown in the figure. The backgrounds of these shots are similar, but the tracking challenges are different. As stated in the caption of Fig. 15, they include large scale and rotation variations, zoom effects, extreme mouth opening and partial occlusions. In spite of these challenges, SUNN+OP still works, and the estimation precision remains high for the first three rows. In particular, our approach is robust to extreme mouth opening, although the keypoint descriptors on the beard are greatly distorted. This reflects the joint power of using local SURF features and non-rigid shape simulations to select reliable keypoints.

Fig. 12: Removing jitter on the 500-frame ‘v1’ sequence. The results of SUNN tend to be very jittery, while our approach (SUNN+OP) is free of jitter thanks to the optical-flow correspondences. (a) Translational trajectory (x, y, z). Red and green curves represent the results of SUNN and SUNN+OP, respectively. (b) Pitch and yaw angles. Solid and dashed curves correspond to SUNN and SUNN+OP, respectively. (c) Examples of removed motion jitter: results for frames #47, #48, #70, #72, #119 and #120. Upper row: SUNN. Lower row: SUNN+OP.
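The SUNN+OP combination evaluated in Fig. 12 amounts to pooling keypoint and optical-flow 3D-to-2D correspondences and letting the robust estimator keep whichever subset best explains the pose. The sketch below illustrates this idea only: the paper uses PROSAC [9] together with the PnP solution of [30], whereas OpenCV's generic solvePnPRansac is used here as a stand-in, and the array layouts are assumptions.

```python
import numpy as np
import cv2

def estimate_head_pose(pts3d_surf, pts2d_surf, pts3d_flow, pts2d_flow, K):
    """Pool SURF-keypoint and optical-flow 3D-to-2D correspondences and solve
    for the pose robustly.  The 3.0-pixel tolerance mirrors the tracking
    success criterion of Section VII.F; this is an illustrative stand-in,
    not the paper's PROSAC-based solver."""
    obj = np.vstack([pts3d_surf, pts3d_flow]).astype(np.float64)   # N x 3
    img = np.vstack([pts2d_surf, pts2d_flow]).astype(np.float64)   # N x 2
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj, img, K, None,
        iterationsCount=200, reprojectionError=3.0)
    return ok, rvec, tvec, inliers
```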

G. Discussions

Since we perform keypoint matching globally and at each frame, the approach naturally initializes or re-initializes itself as soon as keypoint correspondences become sufficient. Since we always favor global keypoint correspondences and limit the use of optical flow, our approach generally does not suffer from drifting. Drifting occasionally occurs only when tracking long sequences, because keypoint correspondences may remain scarce for some time.

The main limitations of our system are the following. First, it requires relatively good image quality to detect enough keypoints. This implies that the tracked target should not sit too far away from the camera, as depicted in the figures of this paper. Second, since the color prior requires negative descriptors collected from the background, our system is suited to scenarios where the camera stays in the same environment. Lastly, since face keypoints are usually few when the out-of-plane pose goes beyond 45° (we do not learn these cases during training), our approach usually does not handle such large variations well. However, the use of profile reference images [44] could naturally be integrated into our system to overcome this drawback.

The running speed of our system depends on the scene complexity and the cues integrated. Let us first consider the case without the color prior. When the scene is simple, e.g. the ‘v1’ sequence, the tracking speed is about 15 frames per second. In a cluttered scene, more than one thousand keypoints are detected and the speed degrades to about 8 frames per second, since a large percentage of the time is spent on SURF descriptor matching. When the color prior is used, the two speeds rise to about 18 and 12 frames per second, respectively. The color descriptor is a 21-vector; matching a pair of these color descriptors is therefore much cheaper than matching SURF descriptors, which are 64-vectors. For simple scenes, most keypoints belong to the face, so the speedup is not significant. For complex scenes, color descriptors remove most background keypoints, saving significant computational time.
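The speed gain described above comes from testing the cheap 21-dimensional color descriptor before the expensive 64-dimensional SURF descriptor. The sketch below shows this two-stage idea under assumed array layouts; the threshold tau and the brute-force nearest-neighbour search are illustrative stand-ins (a real implementation would typically use an approximate nearest-neighbour search).

```python
import numpy as np

def match_with_color_pruning(kp_color, kp_surf, ref_surf, color_prior, tau):
    """Two-stage matching implied by the timing discussion: stage 1 prunes
    keypoints with the cheap 21-D color descriptor, stage 2 runs the
    expensive 64-D SURF matching on the survivors only."""
    # Stage 1: one 21-D distance per keypoint against the face color prior.
    keep = np.linalg.norm(kp_color - color_prior[None, :], axis=1) < tau

    # Stage 2: 64-D SURF matching for the kept keypoints only.
    surf = kp_surf[keep]
    d2 = ((surf[:, None, :] - ref_surf[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)              # index of the best reference match
    return np.flatnonzero(keep), nearest
```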


Fig. 13: Evaluation on the 9 sequences of subject ‘ssm’ of the Boston University database [6]. (a) The top plots illustrate the tracking accuracy of our approach on the ‘ssm1’ sequence. All six parameters (x, y and z translations, roll, yaw and pitch) are measured. In each plot, the dashed line is the estimated result while the solid line represents the ground truth. (b) The bottom images show some examples from tracking all the sequences of subject ‘ssm’. Note that a large percentage of the orientation errors comes from extracting Euler angles from the 3 × 3 rotation matrix of our PnP solution.
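Since part of the reported orientation error stems from converting the 3 × 3 PnP rotation matrix into Euler angles, one common extraction is sketched below. The paper does not state its angle convention, so the ZYX (roll-yaw-pitch) decomposition and the axis naming here are assumptions.

```python
import numpy as np

def euler_from_rotation(R):
    """Extract (pitch, yaw, roll) in degrees from a 3x3 rotation matrix,
    assuming R = Rz(roll) * Ry(yaw) * Rx(pitch); this convention is an
    assumption, not necessarily the one used in the paper."""
    yaw = np.arcsin(-R[2, 0])
    if np.cos(yaw) > 1e-6:
        pitch = np.arctan2(R[2, 1], R[2, 2])
        roll = np.arctan2(R[1, 0], R[0, 0])
    else:                                  # gimbal-lock fallback
        pitch = np.arctan2(-R[1, 2], R[1, 1])
        roll = 0.0
    return np.degrees([pitch, yaw, roll])
```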

Of course, using less expensive components would further reduce the computation time.

VIII. CONCLUSIONS

We have presented a new keypoint-based solution to the 3D head pose tracking problem. We select reference keypoints that are robust to various face distortions to improve the keypoint matching quality. This is achieved by combining normalization and simulation techniques within a learning scheme. At runtime, we propose a new color prior to prune unlikely keypoints at an early stage; it saves computational time and significantly reduces the ratio of outlier correspondences. Finally, we integrate adaptive optical-flow correspondences to solve the motion jitter problem.

Similar to other feature-based systems, the performance of our approach degrades when there are not enough keypoints detected on the face. One such situation is when the face is blurred by rapid motion or zoom changes. Another is when facial features are not distinctive and no extra inherent features, such as a moustache, are present. To overcome this difficulty, we plan to integrate global cues, such as facial shape, into our procedure in addition to local keypoints. Moreover, since keypoints are matched individually, the spatial relationships among them are lost. Graph matching can capture this spatial structure; however, even the most efficient solutions are currently too slow for our application [8]. Therefore, we need to explore a fast graph matching algorithm in the future. The proposed color prior also becomes problematic with dark skin in our experiments; how to learn a more general color prior needs to be investigated as well.

                 ssm          jam          jim          vam          llm          Average   GAVAM [29]
Tx (inches)      0.34 ± 0.10  1.24 ± 0.38  0.66 ± 0.19  0.93 ± 0.34  0.84 ± 0.36  0.80      0.90
Ty (inches)      0.21 ± 0.11  0.79 ± 0.33  0.40 ± 0.08  0.77 ± 0.23  0.79 ± 0.37  0.59      0.89
Tz (inches)      0.10 ± 0.03  0.32 ± 0.08  0.16 ± 0.04  0.36 ± 0.16  0.28 ± 0.15  0.24      1.91
Roll (degrees)   0.95 ± 0.26  2.24 ± 1.11  1.36 ± 0.45  2.95 ± 1.54  1.78 ± 1.42  1.86      2.91
Yaw (degrees)    1.76 ± 0.93  4.66 ± 3.44  3.67 ± 2.10  4.43 ± 3.27  4.23 ± 3.22  3.75      4.97
Pitch (degrees)  1.80 ± 1.25  2.74 ± 1.78  2.55 ± 1.58  3.34 ± 1.77  3.34 ± 2.47  2.69      3.67

TABLE IV: Average translational and angular errors on the 45 sequences of the Boston University database [6]. We first report the mean errors and standard deviations over the nine sequences of each subject, namely ssm, jam, jim, vam and llm. We then show their averages to compare with the reported values of GAVAM [29]. Our approach is approximately 0.7 inches more accurate in translation and 1° more accurate in rotation.
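The table entries are obtained by averaging the per-sequence errors within each subject and then averaging the subject means; a small sketch of this bookkeeping is given below, where the dictionary layout of the per-sequence errors is an assumption.

```python
import numpy as np

def summarize_errors(per_sequence_errors):
    """per_sequence_errors: dict mapping subject id -> list of per-sequence
    mean absolute errors for one parameter (e.g. Tx in inches).
    Returns per-subject (mean, std) and the average of the subject means,
    mirroring how Table IV is assembled (the data layout is an assumption)."""
    summary = {s: (float(np.mean(e)), float(np.std(e)))
               for s, e in per_sequence_errors.items()}
    overall = float(np.mean([m for m, _ in summary.values()]))
    return summary, overall
```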


REFERENCES

[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In European Conference on Computer Vision, pages 404–417, Graz, Austria, 2006.
[2] J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1000–1006, 1997.
[3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In ACM SIGGRAPH, pages 187–194, New York, NY, USA, 1999.
[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] M. Brand and R. Bhotika. Flexible flow for 3D nonrigid tracking and shape recovery. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 315–322, 2001.
[6] M. L. Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:322–336, 1999.
[7] L. Chen, L. Zhang, Y. Hu, M. Li, and H. Zhang. Head pose estimation using Fisher manifold learning. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2003.
[8] M. Cho, J. Lee, and K. M. Lee. Reweighted random walks for graph matching. In European Conference on Computer Vision, pages 492–505, 2010.
[9] O. Chum and J. Matas. Matching with PROSAC - progressive sample consensus. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 220–226, 2005.
[10] T. F. Cootes, K. Walker, and C. Taylor. View-based active appearance models. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 484–498. Springer, 2000.
[11] D. DeCarlo and D. Metaxas. Optical flow constraints on deformable models with applications to face tracking. International Journal of Computer Vision, 38(2):99–127, 2000.
[12] F. Dornaika and F. Davoine. Simultaneous facial action tracking and expression recognition in the presence of head motion. International Journal of Computer Vision, 76(3):257–281, 2008.
[13] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the structure of manifolds using random projections. In Neural Information Processing Systems, 2007.
[14] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37, 1995.
[15] X. Hong, H. Chang, S. Shan, X. Chen, and W. Gao. Sigma set: A small second order statistical region descriptor. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1802–1809, 2009.
[16] T. Horprasert, Y. Yacoob, and L. Davis. Computing 3-D head orientation from a monocular image sequence. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 242–247, 1996.
[17] Y. Hu, L. Chen, Y. Zhou, and H. Zhang. Estimating face pose by facial asymmetry and geometry. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 651–656, 2004.
[18] K. S. Huang and M. M. Trivedi. Robust real-time detection, tracking, and pose estimation of faces in video streams. In IEEE International Conference on Pattern Recognition, pages 965–968, 2004.

[19] J.-S. Jang and T. Kanade. Robust 3D head tracking by online feature registration. In IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, September 2008.
[20] Q. Ji and R. Hu. 3D face pose estimation and tracking from a monocular camera. Image and Vision Computing, 20:499–511, 2002.
[21] J. Kovac, P. Peer, and F. Solina. Human skin color clustering for face detection. In International Conference on Computer as a Tool EUROCON, volume 2, pages 144–148, 2003.
[22] V. Lepetit, J. Pilet, and P. Fua. Point matching as a classification problem for fast and robust object pose estimation. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 244–250, Washington, DC, 2004.
[23] W. Liao and G. Medioni. 3D face tracking and expression inference from a 2D sequence using manifold learning. In IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[25] M. Malciu and F. Prêteux. A robust model-based approach for 3D head tracking in video sequences. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 169–174, Washington, DC, USA, 2000.
[26] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, 2004.
[27] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63–86, October 2004.
[28] L.-P. Morency, P. Sundberg, and T. Darrell. Pose estimation using 3D view-based eigenspaces. In IEEE Workshop on Analysis and Modeling of Faces and Gestures in Conjunction with ICCV, pages 45–52, 2003.
[29] L.-P. Morency, J. Whitehill, and J. Movellan. Monocular head pose estimation using generalized adaptive view-based appearance model. Image and Vision Computing, 28(5):754–761, 2010.
[30] F. Moreno-Noguer, V. Lepetit, and P. Fua. Accurate non-iterative O(n) solution to the PnP problem. In IEEE International Conference on Computer Vision, pages 1–8, Rio de Janeiro, Brazil, 2007.
[31] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP, pages 331–340, Angers, France, 2009.
[32] E. Murphy-Chutorian and M. M. Trivedi. HyHOPE: Hybrid head orientation and position estimation for vision-based driver head tracking. In IEEE Intelligent Vehicles Symposium, pages 512–517, 2008.
[33] A. Nikolaidis and I. Pitas. Facial feature extraction and pose determination. Pattern Recognition, 33:1783–1791, July 2000.
[34] S. Niyogi and W. T. Freeman. Example-based head tracking. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 374–379, Washington, DC, USA, 1996.
[35] S. Ohayon and E. Rivlin. Robust 3D head tracking using camera pose estimation. In IEEE International Conference on Pattern Recognition, pages 1063–1066, 2006.
[36] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua. Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):448–461, 2010.
[37] V. Rabaud and S. Belongie. Re-thinking non-rigid structure from motion. In IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[38] R. Rae and H. Ritter. Recognition of human head orientation based on artificial neural networks. IEEE Transactions on Neural Networks, 9(2):257–265, March 1998.
[39] B. Raytchev, I. Yoda, and K. Sakaue. Head pose estimation by nonlinear manifold learning. In IEEE International Conference on Pattern Recognition, volume 4, pages 462–466, 2004.
[40] J. Sherrah and S. Gong. Fusion of perceptual cues for robust tracking of head pose and position. Pattern Recognition, 34(8):1565–1572, 2001.


Fig. 14: Frames #306, #400, #473, #988 of a self-captured test sequence against a laboratory background. Results for First Row: FERNS, Second Row: OP-Only, Third Row: SUNN, Fourth Row: SUNN+OP, Fifth Row: SUNN+OP+CP.

[41] J. Strom. Model-based real-time head tracking. EURASIP Journal on Applied Signal Processing, 2002(1):1039–1052, 2002.
[42] J. Sung, T. Kanade, and D. Kim. Pose robust face tracking by combining active appearance models and cylinder head models. International Journal of Computer Vision, 80(2):260–274, November 2008.
[43] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In European Conference on Computer Vision, pages 589–600, 2006.
[44] L. Vacchetti, V. Lepetit, and P. Fua. Stable real-time 3D tracking using online and offline information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1385–1391, October 2004.
[45] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1582–1596, 2010.
[46] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[47] V. Vezhnevets, V. Sazonov, and A. Andreeva. A survey on pixel-based skin color detection techniques. In Proceedings of the GRAPHICON, pages 85–92, 2003.
[48] M. Voit, K. Nickel, and R. Stiefelhagen. Neural network-based head pose estimation and multi-view fusion. In Proc. CLEAR Workshop, LNCS, pages 299–304, 2006.
[49] J. Wang and E. Sung. Pose determination of human faces by using vanishing points. Pattern Recognition, 34(12):2427–2445, December 2001.
[50] Q. Wang, W. Zhang, X. Tang, and H.-Y. Shum. Real-time Bayesian 3-D pose tracking. IEEE Transactions on Circuits and Systems for Video Technology, 16(12):1533–1541, 2006.
[51] J. Wu and M. M. Trivedi. A two-stage head pose estimation framework and evaluation. Pattern Recognition, 41(3):1138–1158, 2008.
[52] Y. Wu and K. Toyama. Wide-range, person- and illumination-insensitive head orientation estimation. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 183–188, 2000.
[53] J. Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D active appearance models. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages 535–542, Washington, DC, June 2004.
[54] J. Xiao, T. Moriyama, T. Kanade, and J. Cohn. Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology, 13:85–94, September 2003.
[55] G. Yu and J.-M. Morel. A fully affine invariant image comparison method. In ICASSP, pages 1597–1600, 2009.


Fig. 15: More tracking examples of SUNN+OP. First row: successful tracking under large scale and rotation variations. Second row: successful tracking under very large scale motion. Third row: successful tracking under partial occlusions and facial expressions such as mouth opening and goggling. The subimages in the last row are shown for better visualization.

[56] W. Zhang, Q. Wang, and X. Tang. Real time feature based 3-D deformable face tracking. In European Conference on Computer Vision, pages 720–732, Berlin, Heidelberg, 2008.
[57] Y. Zhang, C. Kambhamettu, and R. Kambhamettu. 3D head tracking under partial occlusion. Pattern Recognition, 35:176–182, 2002.

Dear Professors Gharavi and Salembier,

On behalf of my co-authors, I would like to thank you for giving us the opportunity to revise our manuscript. The constructive comments from you and all the reviewers are very valuable for improving our manuscript and an important guide for our research. Following all the comments, we have carefully revised the manuscript and provided detailed answers to the comments at its end. We would now like to submit both for your kind consideration.

Best Regards.

Yours sincerely,
Haibo Wang
Email: [email protected]


Paper ID: TCSVT 5329
Paper Title: 3D Head Tracking via Invariant Keypoint Learning

Responses to Reviewer 1:

Comment: All comments have been properly addressed.

Response: Thank you for your comments and appreciation.


Paper ID: TCSVT 5329
Paper Title: 3D Head Tracking via Invariant Keypoint Learning

Responses to Reviewer 2:

We are very grateful for your detailed comments. Please find our answers below.

Comment 1: The authors have answered most of my questions and remarks and have improved the paper. However through their answers it has become clear that the proposed approach has certain drawbacks and limitations that should be mentioned in the revised version (in the corresponding sections and perhaps in the conclusions) so that the readers are aware of them.

Response: Thank you for these valuable comments. We have mentioned the drawbacks of our approach in the revised manuscript. Please see the details below.

Comment 2: Answer 2: The authors reply regarding limitations of the method with respect to the distance of the subject and the quality of the face (see also answer 8) should be mentioned in the revised manuscript. The authors should also clearly state the distance conditions required for the method to operate correctly.

Response: We have stated the requirements of good face image quality and a limited distance to the camera in Lines 2-4 of the second paragraph of Section VII.G (Page 13).

Comment 3: Answer 5: Since the training of the color classifier requires negative descriptors collected from the background, it seems to me that the method should be trained not only for every subject but also for every background instance. This is, in my opinion, a rather serious limitation that should be mentioned in the revised manuscript.

Response: We agree that this limits the applicability of our system. In practice, our system is suited to scenarios where the camera stays in the same environment. We have mentioned this in Lines 5-7 of the second paragraph of Section VII.G (Page 13).

Comment 4: Answer 6: Problems arising from dark skin should be mentioned in the manuscript.

Response: This problem has been addressed in Lines 13-15 of the second paragraph of Section VIII (Page 14).


Comment 5: Answer 9: Occasional drifting issues in long sequences should be mentioned in the manuscript.

Response: Drifting occurs in long sequences because keypoint matches might remain few over a period of time. However, it disappears quickly whenever keypoint matches become sufficient again. We have clarified this point in Lines 3-8 of the first paragraph of Section VII.G (Page 13).

Comment 6: In addition, in Section VII.F the authors should mention the criterion they use to characterise tracking in a certain frame as successful.

Response: The tracking criterion is now mentioned in Lines 8-10 of the first paragraph of Section VII.F (Page 11).

Comment 7: Finally, the authors should proofread their manuscript and correct various language errors.

Response: Thank you for pointing out the language problems. We have carefully read the manuscript and corrected many errors.


Paper ID: TCSVT 5329
Paper Title: 3D Head Tracking via Invariant Keypoint Learning

Responses to Reviewer 3:

We are very grateful for your detailed comments. Please find our answers below.

Comment 1: However, given the additional experiments and modifications, I propose to accept the paper provided the following modifications have been made.

Response: We have modified the revised manuscript following all your advice. Please see the details below.

Comment 2: 1. Color descriptor. The current model is learned from an initial training image, not only from the current person, but also from the sequence background. Even if is done offline as the author suggest, it is nevertheless highly specialized to a current setting. The author should compare their approach to a simpler scheme that perform skin/non skin pixel classification (e.g. using the training face + background images used to learn their model), and then do color pruning by requesting the presence of a minimum percentage (e.g. 40%) of skin pixel in the keypoint patch (a rectangular region can be used to exploit the integral image of skin binary image).

Response: Thank you for this advice. We have compared our color prior with two simpler skin/non-skin models. The first is a Bayesian classification method that uses the Bayes rule to classify skin/non-skin pixels; we train it with the same images used for our color prior, with manually annotated ground truth. The other is a widely used skin model defined in the (r, g, b) space (see Fig. 10 (Page 11) for the definition). We apply the two models to classify detected keypoints and measure the quality of the classification with ROC curves. The test data are the four self-captured sequences. Following your advice, we test skin keypoints by requiring the presence of a minimum percentage of skin pixels in the keypoint patch, using the integral image for speedup. We vary the percentage from 20% to 80% to construct the ROC curves of the Bayesian and the defined skin models, and vary the skin/non-skin classification ratio to record the ROC curves of our color prior and the Bayesian method. For each sequence, we show the ROC curves averaged over all frames. The results show that our color prior outperforms these two models. For more details, please see Fig. 10 (Page 11) in the manuscript. Here are some example images:

Figure 1: First Row: the Bayesian skin mask and its keypoint mask. Second Row: the defined skin model mask and its keypoint mask.
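For reference, a hedged sketch of the reviewer-suggested baseline is given below: a keypoint is accepted when at least a minimum fraction of the pixels in a square patch around it are labelled skin, evaluated in constant time via the integral image of the binary skin mask. The patch size and the 40% threshold are the ballpark values mentioned in the comment, not necessarily the exact settings used in our tests.

```python
import numpy as np

def skin_fraction_test(skin_mask, keypoints, half_size=8, min_fraction=0.4):
    """Accept a keypoint if at least min_fraction of the pixels in a square
    patch around it are skin.  skin_mask is a binary H x W array from some
    skin classifier; keypoints are (x, y) pixel coordinates."""
    ii = np.pad(np.cumsum(np.cumsum(skin_mask.astype(np.int64), 0), 1),
                ((1, 0), (1, 0)))                      # integral image, zero border
    h, w = skin_mask.shape
    keep = []
    for x, y in keypoints:
        x, y = int(round(x)), int(round(y))
        x0, x1 = max(0, x - half_size), min(w, x + half_size)
        y0, y1 = max(0, y - half_size), min(h, y + half_size)
        skin = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
        area = (y1 - y0) * (x1 - x0)
        keep.append(area > 0 and skin / area >= min_fraction)
    return np.array(keep)
```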


Comment 3: 2. Explain how are tracking failures detected (the authors mention that there is such a mechanism).

Response: We detect tracking failures based on the number of inlier correspondences and the RMS reprojection error of each frame. The minimum number of correspondences required for the PnP computation is 4; requiring more than 5 inliers guarantees the reliability of the PnP computation. Meanwhile, we consider the RMS of the reprojection error to judge the accuracy of the PnP computation. We have tested several thresholds ranging from 1.0 to 6.0 pixels and empirically found that 3.0 is the best choice. In the revised manuscript, we have explained this in Lines 12-15 of the third paragraph of Section VII.B (Page 10).

Comment 4: 3. Fig 2a should be complemented with the percentage of selected keypoints which are stable under the same simulated distortion.

Response: Thank you for this advice. We have added the percentages of the Affine/3D-pose and Linear-illumination/Nonlinear-illumination tests in Fig. 2a (Page 4) and modified the captions accordingly.

Comment 5: 4. provide an idea of the number of keypoint that are detected initially in the face region, and the number of stable keypoints after the selection process.

Response: We have given typical values of the two numbers in Lines 5-7 of the second paragraph of Section VII.A (Page 9) in the revised manuscript.

Comment 6: 5. Fig 2b: the proposed illumination synthesized images don't seem to take into account the referential RGB images. Are these illumination images that need to be used along with the appearance images to produce the simulated images used for training? Please clarify.

Response: Showing the illumination synthesis on a grayscale image is only for better visualization. In practice, these illumination variations are applied directly to the RGB appearance images along with the proposed random pose and shape simulations. We have clarified this point in the caption of Fig. 2 (Page 4).

Comment 7: 6. There are many uncorrect english sentences. Please have the paper corrected.

Response: Thank you for pointing out the language problems. We have carefully checked the manuscript and corrected many errors.

Comment 8: 7. If the initialization is done manually by selecting the semantic vertices, please mention it in Section VII.A.

Response: This manual initialization has been mentioned in Lines 1-2 of the first paragraph of Section VII.A (Page 9).

Comment 9: 8. Table 1: y and hat(y) => check their definition (there is y and hat(y) on the same line; and similarly for hat(y)); it does not match that of algo 1.

Response: Thank you very much. We have corrected this problem in Table 1 (Page 3).

Comment 10: The motivation of method 2 and figure 6 are still unclear. The authors' answer mention ".. this simple step is successful in preventing potential drifting", although is claimed everywhere that the method does full matching at each time step and is therefore not subject to drift issue. Clarify.

Response: Method-2 describes an averaging step towards an ideal balance between keypoint matching and optical flow, as shown in Figure 6. Keypoint matching is free from drifting thanks to its global search, but optical flow is known to drift. As shown in Eq. (20) (Page 9), keypoint and optical-flow correspondences are combined to determine the 3D pose.

Intuitively, their contributions are related to the respective numbers of their correspondences, i.e. a larger number means a larger contribution to pose recovery. In this sense, when the number of optical-flow correspondences is larger than that of the keypoint ones, drifting is likely to happen at that moment. To prevent this drifting risk, we propose the averaging step to always keep more keypoint correspondences. In the revised manuscript, we have clarified this point in Lines 8-13 of the third paragraph of Section V.B (Page 7).

Comment 11: 10. III.B define u in Eq (5) (closest match).

Response: u has been defined in the text following Eq. (5) on Page 4.
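As an illustration of the averaging step discussed above, the sketch below simply caps the number of optical-flow correspondences so that they never outnumber the keypoint correspondences; the actual rule implemented around Eq. (20) may differ, so this is only an illustrative stand-in.

```python
import random

def balance_correspondences(kp_corr, of_corr, seed=0):
    """Keep optical-flow correspondences from outnumbering the keypoint
    correspondences, so pose recovery stays anchored by drift-free global
    matches.  This cap is an assumption, not the paper's exact rule."""
    limit = max(len(kp_corr) - 1, 0)
    if len(of_corr) > limit:
        of_corr = random.Random(seed).sample(of_corr, limit)
    return list(kp_corr) + list(of_corr)
```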

