Exploring Regularized Feature Selection for Person Specific Face Verification

Yixiong Liang, Shenghui Liao, Lei Wang, Beiji Zou
School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
{yxliang,lsh,wanglei,bjzou}@mail.csu.edu.cn

Abstract

In this paper, we explore regularized feature selection for person specific face verification in unconstrained environments. We reformulate the generalization of single-task sparsity-enforced feature selection to the multi-task case as a simultaneous sparse approximation problem, and investigate two feature selection strategies within this multi-task generalization, based on positive and negative feature correlation assumptions across different persons. Simultaneous orthogonal matching pursuit (SOMP) is adopted and modified to solve the corresponding optimization problems, and we further propose a method named simultaneous subspace pursuit (SSP), which generalizes subspace pursuit to the same problems. The performance of the different feature selection strategies and solvers for face verification is compared on the challenging LFW face database. Our experimental results show that 1) the subsets selected under the positive correlation assumption are more effective than those selected under the negative correlation assumption; 2) the OMP-based solvers outperform the SP-based solvers in terms of feature selection; and 3) the regularized methods with OMP-based solvers can outperform state-of-the-art feature selection methods.

1. Introduction

As one of the most successful applications of computer vision and pattern recognition, face recognition has received extensive attention during the past decades. Although significant progress has been achieved under controlled conditions, face recognition remains a very challenging problem in uncontrolled environments, where variations in pose, lighting, expression, age, occlusion and makeup are more complicated. As local areas are often more descriptive and more appropriate for dealing with these facial variations, local features are increasingly used for face recognition.

Many local feature descriptors, such as Haar-like features [11], SIFT features [3, 13], histograms of oriented gradient (HOG) [2, 27], edge orientation histograms (EOH) [26], Gabor features [23, 12, 28], local binary pattern (LBP) features [1, 15], bio-inspired features [16] and learned descriptors [5], have been successfully applied to face recognition. These local descriptors are generally extracted by performing some transformation (linear or nonlinear) on a local region, possibly followed by an explicit spatial pooling step such as the spatial histogramming scheme [4]. The methods can be divided into two categories: sparse and dense. Gabor jets in elastic bunch graph matching (EBGM) [23] and SIFT features [3] are typical sparse local features, which first detect facial landmarks or interest points and then sample a local patch around each to extract local features. Dense local features, which are extracted pixel by pixel or patch by patch over the input image, are more popular in face recognition [1, 12, 11, 26, 15, 13, 5]. These local features are often redundant or over-complete, whereas only a relatively small fraction of them is relevant to the recognition task. Feature selection is therefore a crucial step for picking out the most discriminative ones, which can not only improve generalization but also decrease the computational burden. Adaboost-based methods are the state-of-the-art feature selection methods in the face recognition scenario [11, 29, 28, 25, 22]. One drawback of these methods is that training is very time-consuming, since a classifier must be trained and evaluated for each feature component. An alternative is the regularized approach, which sparsifies with respect to a dictionary of features via sparsity-enforcing regularization techniques [9] and has recently been applied successfully to face analysis [7]. The main merits of such a regularized approach are its effectiveness even with a very small number of training samples and the fact that it is supported by well-grounded theory [9, 7]. In addition to feature selection, regularized methods have been applied directly to face recognition by casting the recognition problem as finding a sparse representation of the test image in terms of the training set as a whole [24].

Although very impressive results are achieved on public databases, this algorithm fails to handle practical face variations such as alignment and pose [27].

The concern of this work is mainly how to select robust and discriminative local features for person specific face verification in unconstrained environments. Although each verification decision can be seen as a binary classification (accept or reject), the task is in fact a collection of binary classification problems, one for each client model, and is thus by nature a multiple binary classification problem. Most existing approaches cast it into an intra-personal versus extra-personal difference classification problem and train a generic model for all individuals [11, 29, 28]. However, these methods may fail to capture the variations among different individuals and are therefore suboptimal. Furthermore, after converting the original problem into an intra-personal and extra-personal binary classification problem, the scale of the training set increases dramatically, resulting in a significant increase in both computational time and memory cost. Other approaches build person specific models for different individuals separately [8, 7], but these often overfit due to the small sample size of each individual. Recently, multi-task learning, which learns a problem together with other related problems at the same time for improved performance, has been used to combat the overfitting problem [25, 22]. Based on the feature sharing idea [19], Xiao et al. [25] propose a joint Boosting algorithm which explicitly exploits multi-class information. Wang et al. [22] also adopted joint learning to train classifiers for multiple people, sharing a few boosting classifiers rather than sharing features. However, the training still uses the intra-personal and extra-personal binary classification strategy and is thus suboptimal.

In this work, we explore a multi-task generalization framework of regularized methods to select the most informative features for person specific face verification. We reformulate multi-task feature selection as a simultaneous sparse approximation problem and investigate two feature selection strategies based on two assumptions: one assumes a positive correlation among different persons, while the other assumes the correlation is negative or competitive. The two strategies result in markedly different selected subsets. The former tends to select general or common features which are important for all or most persons, while the latter inclines to choose specific, though not necessarily person specific, features which contribute to only a few persons. We adopt simultaneous orthogonal matching pursuit (SOMP) [20] and modify it to solve the related optimization problem, and we also extend subspace pursuit (SP) [6] for the same purpose. The performance of the different feature selection strategies and solvers for face verification is compared on the challenging LFW face database [10].

2. Notation and Setup

Suppose that there are L individuals to be verified. Given a training image set of size N, N_l of the images correspond to subject l, while the remaining images are of other subjects outside the known L subjects. All images are used in building each person specific model. From each image we obtain a d-dimensional feature vector f. Let X = [x_1, ..., x_d] ∈ R^{N×d} be the data matrix whose rows are the input feature vectors, and let Y = [y_1, ..., y_L] ∈ R^{N×L} be the corresponding label matrix, where y_l is an N-dimensional vector whose i-th entry equals 1 if the i-th sample comes from subject l and −1 otherwise. Therefore Y is a matrix of −1's and 1's in which each row has at most a single 1. The matrix X_Ω consists of the columns of X with indices i ∈ Ω, where Ω ⊂ {1, ..., d}. We write c_l for the l-th column of a matrix C and c^l for its l-th row.
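For concreteness, the label matrix Y can be built as in the following sketch (Python/NumPy; function and variable names are illustrative, not part of the original paper). Images of background people, which belong to none of the L known subjects, receive −1 in every column.

    import numpy as np

    def build_label_matrix(subject_ids, L):
        """Build the N x L label matrix Y described above.

        subject_ids : length-N list; entry i is the known-subject index in
                      {0, ..., L-1}, or -1 for a background (unknown) person.
        """
        N = len(subject_ids)
        Y = -np.ones((N, L))
        for i, s in enumerate(subject_ids):
            if 0 <= s < L:
                Y[i, s] = 1.0          # at most a single +1 per row
        return Y

    # Example: 3 known subjects, 5 training images (the last one is background).
    Y = build_label_matrix([0, 0, 1, 2, -1], L=3)
    # X would then be the N x d matrix of local feature vectors extracted from
    # the corresponding images, e.g. X = np.vstack([f_1, ..., f_N]).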


3. Person Specific Feature Selection via Sparse Approximation

Assume that there is a linear relationship between the class label {−1, 1} and the feature components. For class l, this relationship can be written in matrix notation as y_l = X c_l + b_l 1 (l = 1, ..., L), where c_l is a d-dimensional coefficient vector, b_l is the bias of the model of class l, and 1 is a vector with all entries equal to 1. In this work we mainly consider the square loss function

    J(c_l, b_l) = \|y_l - X c_l - b_l \mathbf{1}\|_2^2.

Learning a model independently for each class via empirical risk minimization with an ℓ_0 quasi-norm penalty yields the optimization problem

    \min_{c_l, b_l} J(c_l, b_l) + \lambda T(c_l), \quad \text{with } T(c_l) = \|c_l\|_0,        (1)

where ||·||_0 is the ℓ_0 quasi-norm, which counts the nonzero entries of a vector. Provided the regularization coefficient λ is the same across different individuals, solving each of these problems independently is equivalent to solving the global problem obtained by summing the objectives:

    \min_{C, b} \sum_{l=1}^{L} J(c_l, b_l) + \lambda \sum_{l=1}^{L} T(c_l), \quad \text{with } T(c_l) = \|c_l\|_0,        (2)

where C is the coefficient matrix with the c_l as columns and b = [b_1, ..., b_L]^T is the bias vector. This is an NP-hard combinatorial optimization problem, which is often solved by computationally tractable approximations such as greedy methods and convex relaxation methods. Obviously, solving the optimization problem (2) leads to person specific feature selection.
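As a concrete illustration, the sketch below (Python/NumPy; a simplified illustration under our own assumptions, not the authors' implementation) approximates each per-person subproblem of (2), i.e. problem (1), greedily in OMP style: at each step it picks the feature most correlated with the current residual and refits a least-squares model on the selected columns. The bias b_l is absorbed by centering y_l and the columns of X.

    import numpy as np

    def omp_single_task(X, y, n_features):
        """Greedy (OMP-style) approximation of problem (1) for one class."""
        Xc = X - X.mean(axis=0)            # centering absorbs the bias b_l
        yc = y - y.mean()
        r = yc.copy()                      # initial residual
        support, coef = [], None
        for _ in range(n_features):
            corr = np.abs(Xc.T @ r)        # correlation with the residual
            corr[support] = -np.inf        # do not reselect chosen features
            support.append(int(np.argmax(corr)))
            coef, *_ = np.linalg.lstsq(Xc[:, support], yc, rcond=None)
            r = yc - Xc[:, support] @ coef
        return support, coef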

4. Common or Specific Feature Selection via Simultaneous Sparse Approximation

4.1. Multi-Task Feature Selection

As mentioned before, the single-task feature selection method may suffer from overfitting due to the small training sample size of each person. Multi-task feature selection alleviates this problem by imposing a joint sparsity penalty across different individuals, which yields the following joint optimization problem

    \min_{C, b} \sum_{l=1}^{L} J(c_l, b_l) + \lambda T_0(C), \quad \text{with } T_0(C) = \|C\|_{\mathrm{row-}\ell_0},        (3)

where ||·||_{row-ℓ_0} is the row-ℓ_0 quasi-norm [20]

    \|C\|_{\mathrm{row-}\ell_0} = \Big| \bigcup_{l=1}^{L} \mathrm{supp}(c_l) \Big|,        (4)

and supp(·) denotes the support of a vector. This is a simultaneous sparse approximation problem and thus a more complicated NP-hard problem in general [20, 21]. Convex relaxation methods replace the row-ℓ_0 quasi-norm by a closely related convex function

    T_{p,q}(C) = \sum_{i=1}^{d} \big( \|c^i\|_q \big)^{p} = \sum_{i=1}^{d} \Big[ \sum_{j=1}^{L} |c_{ij}|^q \Big]^{p/q}.        (5)

The relaxation is obtained by first applying the ℓ_q norm to the rows of C and then applying the ℓ_p norm or quasi-norm to the resulting vector of norms.

4.2. Positive Correlation vs. Negative Correlation

Since we require most rows of C to be zero, we take 0 ≤ p ≤ 1 in Eq. (5), and only the case p = 1 is convex. The choice of q depends on the kind of correlation assumed among the tasks. The most usual assumption is that there is a positive correlation between different classes, namely that different persons share common features which should contribute to as many classes as possible. This assumption leads to the choice 1 < q ≤ ∞; increasing q allows more classes to share the same feature. The well-known multi-task feature selection of [17, 14] corresponds to the case p = 1, q = 2. The rationale is that minimizing the ℓ_1 norm promotes row sparsity whereas minimizing the ℓ_2 norm promotes non-sparsity within the rows. The other assumption is that the correlation among persons is negative [32, 31], namely that if a feature is deemed important for one or a few persons, it becomes less important for the other persons. This assumption can be implemented by imposing a sparsity penalty on the rows of C, i.e. by choosing 0 ≤ q ≤ 1, and therefore leads to the selection of specific, but not necessarily person specific, features. The exclusive lasso [32] adopts this negative correlation assumption without row-sparse constraints and corresponds to the case p = 2, q = 1. The choice 0 ≤ p, q ≤ 1 would yield a very sparse coefficient matrix C which is not merely row-sparse; at present there is little empirical or theoretical analysis of this case.
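To illustrate how the choice of (p, q) in Eq. (5) encodes the two correlation assumptions, the small sketch below (Python/NumPy, purely illustrative and not from the paper) evaluates T_{p,q} on two toy coefficient matrices with equal total energy: one in which two persons share the same two features (row-sparse with dense rows) and one in which each nonzero feature serves a single person. Under (p, q) = (1, 2) the shared matrix is cheaper, while under (p, q) = (2, 1), as in the exclusive lasso, the disjoint matrix is cheaper.

    import numpy as np

    def T_pq(C, p, q):
        """Relaxed penalty of Eq. (5): sum over rows of the l_q row-norm raised to p."""
        row_norms = np.sum(np.abs(C) ** q, axis=1) ** (1.0 / q)
        return np.sum(row_norms ** p)

    # Two persons (columns), four candidate features (rows).
    C_shared   = np.array([[1., 1.], [1., 1.], [0., 0.], [0., 0.]])  # common features
    C_disjoint = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])  # per-person features

    print(T_pq(C_shared, p=1, q=2), T_pq(C_disjoint, p=1, q=2))  # ~2.83 vs 4.0
    print(T_pq(C_shared, p=2, q=1), T_pq(C_disjoint, p=2, q=1))  # 8.0  vs 4.0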

4.3. Greedy Methods for Multi-Task Feature Selection

Although the convex relaxation methods often provide robustness and uniform guarantees for sparse approximation, the resulting solution is not always sparse: the sparsity is not explicitly controlled, and good feature selection properties require strong assumptions [30]. We therefore consider an alternative, greedy methods such as OMP and SP [6], owing to their low complexity and simple geometric interpretation. Tropp et al. [20] proposed SOMP, which extends OMP to the simultaneous sparse approximation problem and can thus be used directly for feature selection. The main steps of SOMP are listed in Algorithm 1. Step 2 is the greedy selection step, which picks the feature maximizing the ℓ_p norm of its correlations with the current residuals. Small values of p promote the selection of features that contribute to many persons, whereas larger values of p favor features that contribute a lot to only one or a few persons. The former case corresponds to the positive correlation assumption and the latter to the negative correlation assumption. In our implementation, we set p = 1 for the positive correlation assumption and p = ∞ for the negative correlation assumption.

Algorithm 1: Simultaneous OMP (SOMP) [20]
  Input: X, Y, a stopping criterion.
  Initialization: R_0 = Y, Ω_0 = ∅, i = 1.
  1: while the stopping criterion is not met do
  2:   Choose the index k = arg max_k ||R_{i-1}^T x_k||_p;
  3:   Set Ω_i = Ω_{i-1} ∪ {k} and Ẑ = arg min_Z ||X_{Ω_i} Z − Y||_F;
  4:   Calculate the new approximation Y_i = X_{Ω_i} Ẑ and the residual R_i = Y − Y_i;
  5:   i = i + 1;
  6: end while
  Output: Sparse coefficient matrix Ẑ and the selected indices Ω.

Recently, a novel greedy algorithm termed subspace pursuit (SP) [6] has been proposed to bridge the gap between the convex relaxation methods and OMP. SP provides guarantees similar to those of convex relaxation methods while retaining the speed of OMP. As SP selects a group of features rather than a single feature at each iteration, it is actually faster than OMP when selecting only hundreds of features from tens of thousands. Borrowing the ideas of SOMP, we extend the SP algorithm to our multi-task feature selection problem. The extended method, named simultaneous SP (SSP), is summarized in Algorithm 2; it reduces to the usual SP algorithm [6] when L = 1 and p = 1. As in SOMP, when selecting features at each iteration we set p = 1 and p = ∞ for the positive and negative correlation assumptions, respectively.

Algorithm 2: Simultaneous SP (SSP)
  Input: X, Y, s, a stopping criterion.
  Initialization: Ω_0 = {s indices corresponding to the largest ℓ_p norms of {Y^T x_k}, k = 1, ..., d},
  Ẑ = arg min_Z ||X_{Ω_0} Z − Y||_F, R_0 = Y − X_{Ω_0} Ẑ, i = 1.
  1: while the stopping criterion is not met do
  2:   Ω_i = Ω_{i-1} ∪ {s indices corresponding to the largest ℓ_p norms of {R_{i-1}^T x_k}, k = 1, ..., d};
  3:   Set Ẑ = arg min_Z ||X_{Ω_i} Z − Y||_F;
  4:   Ω_i = {s indices corresponding to the largest ℓ_p norms of the rows of Ẑ};
  5:   Set Ẑ = arg min_Z ||X_{Ω_i} Z − Y||_F;
  6:   Calculate the new approximation Y_i = X_{Ω_i} Ẑ and the residual R_i = Y − Y_i;
  7:   i = i + 1;
  8: end while
  Output: Sparse coefficient matrix Ẑ and the selected indices Ω.
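The following Python/NumPy sketch is one possible implementation of Algorithm 1 (illustrative only: it uses a fixed number of iterations as the stopping criterion and re-solves the least-squares subproblem from scratch at each step rather than incrementally).

    import numpy as np

    def somp(X, Y, n_features, p=1):
        """Simultaneous OMP (Algorithm 1). p=1: positive, p=np.inf: negative correlation."""
        R = Y.copy()                                        # R_0 = Y
        support, Z = [], None
        for _ in range(n_features):
            scores = np.linalg.norm(R.T @ X, ord=p, axis=0)  # ||R^T x_k||_p for every k
            scores[support] = -np.inf                       # never reselect an index
            support.append(int(np.argmax(scores)))
            Z, *_ = np.linalg.lstsq(X[:, support], Y, rcond=None)  # arg min ||X_Ω Z - Y||_F
            R = Y - X[:, support] @ Z                       # residual update
        return support, Z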

Both SOMP and SSP are iterative algorithms, so a stopping criterion must be supplied. The most common choices for SOMP are 1) stopping after a fixed number of iterations, or 2) waiting until the Frobenius norm of the residual, ||R_i||_F, falls below a specified threshold. In our implementation, the iteration is stopped whenever either of these two conditions is satisfied. In the original SP algorithm the iteration is halted when the ℓ_2 norm of the current residual is larger than that of the previous iteration; in our experiments, however, this condition is often met in the very first iterations. We therefore halt SSP when a fixed number of iterations has been performed or when the absolute difference between the Frobenius norms of R_i and R_{i-1} falls below a predefined threshold.

The computational complexity of SOMP and the proposed SSP is easy to estimate. For SOMP, Step 2 can always be implemented in O(LNd) time by computing all the inner products between the residual R and the data matrix X. Solving the multiple least squares problem in Step 3 takes only O(LNs) time, where s is the number of selected features, because we can build on the solution of past iterations. Thus the overall cost of SOMP is O(LNs(s + d)); since s is much smaller than d, the total cost is O(LNsd). The analysis of the proposed SSP is similar: Step 2 is the same and costs O(LNd), while Steps 3 and 5 cost on the order of O(LNs^2) using the modified Gram-Schmidt algorithm. As a result, the total complexity is upper-bounded by O(LNs(d + s^2)); in our experiments s^2 is less than d, so the complexity of SSP is also upper-bounded by O(LNsd).
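A corresponding sketch of Algorithm 2 is given below (again illustrative Python/NumPy, not the authors' code), using the stopping rule described above: a maximum number of iterations combined with a threshold on the change of the residual's Frobenius norm.

    import numpy as np

    def _lstsq_fit(X, Y, support):
        Z, *_ = np.linalg.lstsq(X[:, support], Y, rcond=None)
        return Z, Y - X[:, support] @ Z

    def ssp(X, Y, s, p=1, max_iter=20, tol=1e-6):
        """Simultaneous Subspace Pursuit (Algorithm 2)."""
        scores = np.linalg.norm(Y.T @ X, ord=p, axis=0)
        support = list(np.argsort(scores)[-s:])               # Ω_0: s best indices
        Z, R = _lstsq_fit(X, Y, support)
        prev_norm = np.linalg.norm(R)
        for _ in range(max_iter):
            scores = np.linalg.norm(R.T @ X, ord=p, axis=0)
            candidates = list(set(support) | set(np.argsort(scores)[-s:]))  # expand Ω
            Z, _ = _lstsq_fit(X, Y, candidates)
            row_scores = np.linalg.norm(Z, ord=p, axis=1)      # keep s strongest rows of Ẑ
            support = [candidates[i] for i in np.argsort(row_scores)[-s:]]
            Z, R = _lstsq_fit(X, Y, support)                   # final refit and residual
            cur_norm = np.linalg.norm(R)
            if abs(prev_norm - cur_norm) < tol:                # stopping rule from the text
                break
            prev_norm = cur_norm
        return support, Z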

5. Classifiers for Face Verification

Both the single-task and multi-task feature selection frameworks fit linear regression models to the class labels, so the corresponding coefficients can be used directly for face verification. However, as shown in Step 3 of Algorithm 1 and Step 5 of Algorithm 2, the coefficients are obtained by least squares fitting on the selected features X_Ω and consequently often exhibit high variance, which may induce large classification errors. For this reason, one could treat the procedure as a pure feature selection tool and adopt other common classifiers such as SVMs for verification. Instead, we adopt ridge regression [9], which shrinks the regression coefficients by imposing a penalty on their size, to perform the classification.
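As a concrete example, ridge regression on the selected columns has a closed-form solution; the sketch below (illustrative Python/NumPy, with a hypothetical regularization weight alpha and the bias assumed to be absorbed by centering) refits the coefficients that are then used to score a test feature vector.

    import numpy as np

    def ridge_on_selected(X, Y, support, alpha=1.0):
        """Refit coefficients on the selected features X_Omega by ridge regression."""
        Xs = X[:, support]
        d_s = Xs.shape[1]
        # Closed-form ridge solution: (Xs^T Xs + alpha I)^{-1} Xs^T Y
        return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(d_s), Xs.T @ Y)

    def verify(f, support, Z, l, threshold=0.0):
        """Accept or reject the claim that feature vector f belongs to subject l."""
        score = f[support] @ Z[:, l]
        return score > threshold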

6. Experimental Results and Analysis

To evaluate the effectiveness of the proposed methods for face verification in unconstrained environments, we carry out experiments on the LFW face database [10]. This database contains 13,233 labeled face images collected from news sites on the Internet. The images belong to 5,749 different individuals and exhibit large variations in position, pose, lighting, background, camera and quality, so the LFW database is appropriate for evaluating face verification methods in realistic and unconstrained environments. We select 158 people who have at least 10 images in the database as the known people, i.e. L = 158. For each known person, we use the first 5 images for training and the remaining images for testing. We also select 210 people with only one image in the database as background (unknown) persons for training. Hence we have a training set of size 1,000 corresponding to 368 people and a testing set of size 3,534 from the 158 known people. Note that the training set for each class is highly unbalanced (5 positive samples and 995 negative samples). In our experiments, each image is first converted to grayscale, then rotated and scaled so that the centers of the eyes lie on specific pixels, and finally cropped to 64 × 64 pixels. Fig. 1 shows some normalized face examples of celebrities.

Figure 1. Preprocessed face images of some celebrities.


We select Gabor features as the initial representation due to their ability to model the spatial summation properties of the receptive fields of the so-called "bar cells" in the primary visual cortex. To obtain the Gabor features, we use 40 Gabor filters with five scales {0, ..., 4} and eight orientations {0, ..., 7}, which are common in the face recognition area. The dimension d of the resulting feature vector is then 64 × 64 × 40 = 163,840. We compare the positive and negative correlation assumptions in multi-task feature selection for face verification, as well as the performance of the OMP-based and SP-based solvers. We use "P" and "N" to denote the methods adopting the positive and negative correlation assumptions respectively, so the resulting methods are denoted "MTL-P-SOMP", "MTL-P-SSP", "MTL-N-SOMP" and "MTL-N-SSP". We also compare multi-task feature selection with single-task feature selection, which is expected to select person specific features; the single-task methods with the two greedy solvers are denoted "STL-OMP" and "STL-SSP". As described in Section 5, we apply ridge regression to improve the classification performance, and the resulting methods are denoted "STL-OMP-R", "STL-SSP-R", "MTL-P-SOMP-R", "MTL-P-SSP-R", "MTL-N-SOMP-R" and "MTL-N-SSP-R". Meanwhile, we adopt an Adaboost-based method as the baseline, and include the recent FGM feature selection method [18] as well as SVM classification using all features, denoted SVM(ALL), for comparison.
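The dense Gabor representation described above can be sketched as follows (illustrative Python, assuming OpenCV's cv2.getGaborKernel is available; the specific kernel size and filter parameters are placeholders, not the authors' exact settings). Each of the 40 filter responses is kept at every pixel of the 64 × 64 image and the responses are stacked into a single 163,840-dimensional vector.

    import cv2
    import numpy as np

    def gabor_feature(img64):
        """Dense Gabor features of a 64x64 grayscale face image (d = 64*64*40)."""
        responses = []
        for scale in range(5):                      # five scales
            for orient in range(8):                 # eight orientations
                kern = cv2.getGaborKernel(
                    ksize=(15, 15),
                    sigma=2.0 * (scale + 1),        # placeholder parameters
                    theta=orient * np.pi / 8,
                    lambd=4.0 * (scale + 1),
                    gamma=1.0)
                responses.append(cv2.filter2D(img64.astype(np.float32), -1, kern))
        return np.stack(responses, axis=-1).ravel() # 64*64*40 = 163,840 values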

6.1. OMP-Based Solvers vs. SP-Based Solvers

Table 1 and Fig. 2 illustrate the performance of the different methods when verifying four celebrities using 150 selected Gabor features. The performance is measured by both ROC curves and the area under the ROC curve (AUC). As explained in Section 5, the adoption of ridge regression does significantly improve the performance. Moreover, the OMP-based methods uniformly dominate the SP-based methods. We find that the SP-based methods often select groups of features which are highly correlated: if the feature at one location is selected, the features at its neighboring locations may also be selected. As shown in Algorithm 2, the SP-based methods first select a set of indices by correlation maximization (Step 2) and then refine them (Step 4). If a feature is highly correlated with the residual and has therefore been selected by correlation maximization, its neighboring features also have high correlation with the residual and tend to be selected as well; the refinement step can remove some, but not all, of them. The result is that some highly correlated features may be included in the selected sets. The OMP-based methods, on the other hand, select just one feature at each iteration by correlation maximization. Due to the least squares estimation (Step 3 in Algorithm 1), once a feature is selected the residual becomes less correlated with its neighboring features, which are therefore unlikely to be selected in the following iterations. In effect, the SP-based methods perform forward-backward feature selection while the OMP-based methods perform forward stepwise selection. Compared with the OMP-based methods, the SP-based methods are inclined to choose highly correlated features and thus achieve worse performance for face verification.

6.2. Single-Task Feature Selection vs. Multi-Task Feature Selection with Positive or Negative Correlation Assumption

Regarding the comparison between single-task feature selection and multi-task feature selection with the positive or negative correlation assumption, Table 1 and Fig. 2 show that no method uniformly dominates the others. We therefore use average ROC curves to evaluate performance across different persons. Since our objective is person specific face verification, we adopt the vertical averaging scheme, which fixes the false positive rates and averages the corresponding true positive rates, rather than the threshold averaging scheme; a short sketch of this averaging is given below. Fig. 3 shows the average ROC curves for the 158 known persons with 150 Gabor features. Multi-task feature selection with the positive correlation assumption performs best, and the single-task feature selection method generally outperforms multi-task feature selection with the negative correlation assumption.
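The vertical averaging used here can be sketched as follows (illustrative Python/NumPy): each per-person ROC curve is interpolated onto a common grid of false positive rates and the true positive rates are averaged at each grid point.

    import numpy as np

    def vertical_average_roc(roc_curves, grid=None):
        """Vertically average per-person ROC curves.

        roc_curves : list of (fpr, tpr) pairs; each fpr array must be increasing.
        Returns (grid of FPRs, averaged TPRs).
        """
        if grid is None:
            grid = np.linspace(0.0, 1.0, 101)
        tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in roc_curves]
        return grid, np.mean(tprs, axis=0)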

6.3. Results with Different Number of Features

Fig. 4 shows the performance of the different methods as the number of features allowed to be selected varies. The average AUC and the average true positive rate at a fixed false positive rate of 0.1 are used to evaluate performance. As the number of features increases, the behavior of the different methods varies considerably. The ridge regression based methods perform better than their counterparts without ridge regression, the OMP-based methods uniformly outperform the SP-based methods, and the methods based on multi-task feature selection with the positive correlation assumption, in general, outperform those based on multi-task feature selection with the negative correlation assumption and single-task feature selection.




Table 1. AUC of verifying four celebrities using 150 features.

Method               Ariel Sharon   Colin Powell   George W Bush   Bill Gates
STL-OMP              0.9497         0.9359         0.8005          0.9633
STL-OMP-R            0.9802         0.9724         0.8820          0.9633
STL-SSP              0.8565         0.8492         0.6434          0.6345
STL-SSP-R            0.9174         0.9488         0.7881          0.8413
MTL-P-SOMP           0.9607         0.9019         0.8654          0.9012
MTL-P-SOMP-R         0.9810         0.9650         0.9081          0.9612
MTL-P-SSP            0.8704         0.8522         0.7451          0.7708
MTL-P-SSP-R          0.9416         0.9452         0.8393          0.8907
MTL-N-SOMP           0.9736         0.9167         0.7032          0.9104
MTL-N-SOMP-R         0.9838         0.9599         0.8362          0.9704
MTL-N-SSP            0.5913         0.7228         0.6453          0.8053
MTL-N-SSP-R          0.8658         0.8922         0.7861          0.9252
Adaboost (baseline)  0.6594         0.7015         0.6307          0.8169

Figure 2. ROC curves (true positive rate vs. false positive rate) of verifying four celebrities using 150 features: (a) Ariel Sharon, (b) Colin Powell, (c) George W Bush, (d) Bill Gates.

Figure 3. Average ROC curves of verifying images of 158 known people using 150 features.

6.4. Comparison with Competing Algorithms

To further demonstrate the effectiveness of the proposed algorithms, we compare them with the recent FGM method [18], which achieves state-of-the-art performance on 10 real-world datasets. We also report the performance of a linear SVM classifier without feature selection. Again, the average ROC curves across different persons are used for evaluation. The results, reported in Fig. 5, show that multi-task feature selection with the positive correlation assumption and OMP-based solvers obtains better or competitive performance compared with using all features, while FGM and Adaboost perform much worse.



Figure 4. Comparison of the performance of the different methods as a function of the number of features. (a) Performance measured as the average AUC across all classes. (b) Performance measured as the average true positive rate (TPR) when the false positive rate (FPR) is fixed at 0.1.

7. Conclusions


We have explored sparsity-enforced feature selection for person specific face verification. The person specific models are learned jointly by sharing the training data, so the multi-task feature selection problem can be reformulated as a simultaneous sparse approximation problem. Positive and negative feature correlation assumptions across different persons are adopted to select common and specific features, respectively. We adopt SOMP and modify it to solve the corresponding optimization problems, and we have also proposed a method named simultaneous subspace pursuit (SSP) for the same purpose. We have compared the different feature selection strategies with different solvers. Our experimental results show that using common features outperforms using specific or person specific features, and that OMP-based solvers perform better than SP-based solvers in terms of feature selection. Moreover, the regularized methods with OMP-based solvers outperform the Adaboost method and the state-of-the-art FGM method.

Figure 5. Comparison with competing algorithms: average ROC curves of MTL-P-SOMP, MTL-P-SOMP-R, Adaboost, SVM(ALL) and FGM(SVM).


Acknowledgement

This research is partially supported by the National Natural Science Funds of China (No. 60803024 and No. 60970098), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 200805331107 and No. 20090162110055), the Fundamental Research Funds for the Central Universities (No. 201021200062) and the Open Project Program of the State Key Lab of CAD&CG (No. A0911 and No. A1011), Zhejiang University.

References [1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Trans. PAMI, 28:2037–2041, 2006. 1 [2] A. Albiol, D. Monzo, A. Martin, J. Sastre, and A. Albiol. Face recognition using HOG-EBGM. Pattern Recogn. Lett., 29:1537–1543, 2008. 1 [3] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli. On the use of SIFT features for face authentication. In CVPRW, pages 35–41, 2006. 1 [4] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE Trans. PAMI, 33:43–57, 2011. 1 [5] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In CVPR, pages 2707–2714, 2010. 1 [6] W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory, 55:2230–2249, 2009. 2, 3, 4 [7] A. Destrero, C. Mol, F. Odone, and A. Verri. A regularized framework for feature selection in face detection and authentication. Int. J. Comput. Vis., 83:164–177, 2009. 1, 2 [8] G. Guo, H. Zhang, and S. Li. Pairwise face recognition. In ICCV, pages 282–287, 2001. 2 [9] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: Data mining, inference, and prediction (2nd edition). Springer, 2009. 1, 4 [10] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. University of Massachusetts, Amherst, Technical Report 07-49, 2007. 2, 4 [11] M. Jones and P. Viola. Face recognition using boosted local features. In ICCV, 2003. 1, 2 [12] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Process., 11:467–476, 2002. 1 [13] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. IEEE Trans. PAMI, 33:978–994, 2011. 1

[14] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l2,1 -norm minimization. In UAI, pages 339–348, 2009. 3 [15] S. Marcel, Y. Rodriguez, and G. Heusch. On the recent use of local binary patterns for face authentication. Int. J. Image Video Process, Special Issue on Facial Image Processing:1– 9, 2008. 1 [16] E. Meyers and L. Wolf. Using biologically inspired features for face processing. Int. J. Comput. Vis., 76:93–104, 2008. 1 [17] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput., 20:231–252, 2009. 3 [18] M. Tan, C. Wang, and E. Tsang. Learning sparse SVM for feature selection on very high dimensional datasets. In ICML, pages 1047–1054, 2010. 5, 7 [19] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Trans. PAMI, 29:854–867, 2007. 2 [20] J. Tropp, A. Gilbert, and M. Strauss. Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86:572–588, 2006. 2, 3 [21] J. Tropp, A. Gilbert, and M. Strauss. Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing, 86:589–602, 2006. 3 [22] X. Wang, C. Zhang, and Z. Zhang. Boosted multi-task learning for face verification with applications to web images and video search. In CVPR, pages 142–149. IEEE, 2009. 1, 2 [23] L. Wiskott, J. Fellous, N. Kruger, and C. der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. PAMI, 19:775–779, 1997. 1 [24] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. PAMI, 31(2):210–227, 2009. 1 [25] R. Xiao, W. Li, Y. Tian, and X. Tang. Joint boosting feature selection for robust face recognition. In CVPR, pages 1415– 1422, 2006. 1, 2 [26] S. Yan, H. Wang, X. Tang, and T. Huang. Exploring feature descriptors for face recognition. In ICASSP, pages 629–632, 2007. 1 [27] J. Yang, K. Yu, and T. Huang. Supervised translationinvariant sparse coding. In CVPR, pages 3517–3524, 2010. 1, 2 [28] P. Yang, S. Shan, W. Gao, S. Li, and D. Zhang. Face recognition using Ada-Boosted Gabor features. In AFGR, pages 356–361. IEEE, 2004. 1, 2 [29] G. Zhang, X. Huang, S. Li, Y.S.Wang, and X. Wu. Boosting local binary pattern (LBP)-based face recognition. In Proc. Advances in Biometric Person Authentication, pages 179– 186, 2004. 1, 2 [30] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS, pages 1921– 1928, 2008. 3 [31] Y. Zhang and D. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, pages 733– 742, 2010. 3 [32] Y. Zhou, R. Jin, and S. Hoi. Exclusive LASSO for multi-task feature selection. In AISTATS, pages 988–995, 2010. 3
