Proceedings of 2010 IEEE 17th International Conference on Image Processing
September 26-29, 2010, Hong Kong
NPDA/CS: IMPROVED NON-PARAMETRIC DISCRIMINANT ANALYSIS WITH CS DECOMPOSITION AND ITS APPLICATION TO FACE RECOGNITION

Qingsong Zeng, Changdong Wang
School of Information Science and Technology
Sun Yat-sen University, Guangzhou, 510006, China
[email protected], [email protected]

(This project was partially supported by the NSF-Guangdong (U083500).)

ABSTRACT

Fisher's Linear Discriminant Analysis (FLDA) uses a parametric form of the scatter matrices based on the Gaussian distribution assumption, and it requires the scatter matrices to be nonsingular, a condition that cannot always be satisfied. To overcome this problem, many scholars have recently proposed Non-parametric Discriminant Analysis (NPDA), which addresses the non-Gaussian aspects of sample distributions. In this paper, a new formulation of the scatter matrices, derived from a nearest-neighborhood perspective, is presented to improve NPDA by simultaneously emphasizing the boundary information and the local structure contained in the training set. CS decomposition is then incorporated to further improve its performance. Experimental results on 4 databases demonstrate the effectiveness of the improved method.

Index Terms— LDA, boundary information, local structure, non-parametric discriminant analysis, CS decomposition

1. INTRODUCTION

Recently, discriminant feature extraction has received great attention in many pattern recognition applications, such as face recognition and action recognition. Many linear feature extraction methods have been proposed; among them, Fisher's Linear Discriminant Analysis (FLDA) is the most popular. It uses a parametric form of the scatter matrices based on the Gaussian distribution assumption, so it can also be regarded as a Parametric Discriminant Analysis (PDA). Moreover, FLDA requires the within-class scatter matrix S_w to be nonsingular. Unfortunately, this condition cannot always be satisfied: in many small sample size problem (SSSP) applications, all scatter matrices may be singular. So far, at least three main approaches have been proposed to overcome this problem. The first applies PCA to reduce the dimension of the original data before classical LDA is performed [1]. The second, Regularized LDA [2], incorporates a regularization mechanism to
deal with the singularity of S_w. The last is Uncorrelated LDA (ULDA) [3], which extracts feature vectors with uncorrelated attributes; uncorrelated features are desirable in many applications, since they contain minimum redundancy.

In deriving the FLDA formulation, there is an assumption that the empirical class mean equals its expectation. However, this assumption may not hold in practice. FLDA draws data of the same class close to their corresponding class means, but since the number of samples per class is always limited, the estimates of the class means are not accurate, which degrades the effectiveness of the Fisher criterion. Hence, Fukunaga [4] presented Non-parametric Discriminant Analysis (NPDA) to overcome this problem by introducing a new definition of the between-class scatter matrix that explicitly emphasizes the samples near the boundary. Bressan et al. [5] and Zhifeng Li et al. [6] then improved NPDA, proposing a new formulation of the scatter matrices that extends the two-class NPDA to multi-class cases. Building on their research and observing NPDA from a nearest-neighborhood perspective, we introduce a modification of the original algorithm called Non-parametric Discriminant Analysis with Cosine-sine Decomposition (NPDA/CS). In this paper, we first propose a new formulation of the between-class, within-class, and total scatter matrices that emphasizes the boundary information and local structure contained in the training set, in which the three half scatter matrices take non-parametric form. We then investigate the idea of simultaneously diagonalizing the matrices S_b and S_w by the Cosine-sine decomposition (CSD) [7, 8], so as to handle the problems from which LDA suffers.

The rest of this paper is organized as follows. Section 2 introduces the related work. The improved NPDA/CS algorithm for dimensionality reduction is described in Section 3. Experimental results are presented in Section 4, and Section 5 draws the conclusions of the work.

A few words about our notation: lower-case letters such as i, j, k, l, and c represent numbers or indices; capital letters such as A, G, and X represent matrices; lower-case bold letters such as x, y represent vectors (samples); script letters such as C, X represent sets.
2. RELATED WORK

2.1. Fisher's LDA

Consider the problem of training a classifier with c classes. Suppose the data space is a compact vector space of dimension d, and a training set X = {x_i : i = 1, ..., n} consists of n samples, with each point x_i already assigned to some class, say x_i ∈ C_k. It can then be further denoted x^k_j, meaning that it is the j-th sample of class C_k. Let n_k denote the number of samples in class C_k, that is, n_k = |C_k|. In FLDA, the between-class, within-class, and total scatter matrices are defined respectively as

S_b = \frac{1}{n} \sum_{k=1}^{c} n_k (\mu_k - \mu)(\mu_k - \mu)^T, \quad
S_w = \frac{1}{n} \sum_{k=1}^{c} \sum_{i=1}^{n_k} (x^k_i - \mu_k)(x^k_i - \mu_k)^T, \quad
S_t = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T,

where \mu_k and \mu denote the mean of class C_k and of all samples, respectively. The goal of FLDA is to compute the optimal transformation matrix G that finds the most discriminative features by maximizing the ratio of the between-class scatter to the within-class scatter:

G = \arg\max_G \operatorname{tr}\big( (G^T S_w G)^{-1} (G^T S_b G) \big).   (1)

The optimal transformation can be readily computed by finding all the eigenvectors w that satisfy

S_b w = \lambda S_w w, \quad \lambda \neq 0.   (2)
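To make the criterion concrete, here is a minimal NumPy/SciPy sketch of FLDA via the generalized eigenproblem of Eq. (2). The function name flda, the row-major sample layout, and the small ridge added to keep S_w positive definite are our own illustrative choices, not part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

def flda(X, y, r):
    # X: (n, d) array, rows are samples; y: (n,) integer labels; r: target dim.
    n, d = X.shape
    mu = X.mean(axis=0)
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sb += len(Xk) * np.outer(mk - mu, mk - mu)   # n_k (mu_k-mu)(mu_k-mu)^T
        Sw += (Xk - mk).T @ (Xk - mk)                # sum (x-mu_k)(x-mu_k)^T
    Sb, Sw = Sb / n, Sw / n
    # Generalized eigenproblem S_b w = lambda S_w w of Eq. (2); the small
    # ridge keeps S_w positive definite, which eigh requires (our choice).
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1][:r]]       # top-r discriminant axes
```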
2.2. Alternative Expression of Scatter Matrices

It should be pointed out that the FLDA scatter matrices described above admit an alternative expression. First stack the samples of the data set X into a partitioned matrix according to their class labels, that is, X = [X_1, ..., X_c] with X_k ∈ R^{d×n_k} being the data matrix consisting of all samples from C_k. The half between-class scatter matrix H_b, half within-class scatter matrix H_w, and half total scatter matrix H_t are then defined respectively as

H_w = \frac{1}{\sqrt{n}} [\tilde{X}_1, \dots, \tilde{X}_c], \quad
H_b = \frac{1}{\sqrt{n}} [\sqrt{n_1}(\mu_1 - \mu), \dots, \sqrt{n_c}(\mu_c - \mu)], \quad
H_t = \frac{1}{\sqrt{n}} (X - \mu e^T),

where \tilde{X}_k = X_k - \mu_k e_k^T, e_k = [1, \dots, 1]^T ∈ R^{n_k}, and e = [1, \dots, 1]^T ∈ R^n. Thus the three scatter matrices can be expressed as S_b = H_b H_b^T, S_w = H_w H_w^T, and S_t = H_t H_t^T.
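As a quick illustration, the following NumPy sketch (our own construction; the helper name half_scatter and the column-major data layout are assumptions) builds H_b, H_w, and H_t from a labeled data matrix, so that S_b = H_b H_b^T and its siblings can be checked numerically.

```python
import numpy as np

def half_scatter(X, y):
    # X: (d, n) partitioned data matrix, columns are samples; y: (n,) labels.
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)
    Hb_cols, Hw_blocks = [], []
    for k in np.unique(y):
        Xk = X[:, y == k]
        mu_k = Xk.mean(axis=1, keepdims=True)
        Hb_cols.append(np.sqrt(Xk.shape[1]) * (mu_k - mu))  # sqrt(n_k)(mu_k - mu)
        Hw_blocks.append(Xk - mu_k)                         # X_k - mu_k e_k^T
    Hb = np.hstack(Hb_cols) / np.sqrt(n)
    Hw = np.hstack(Hw_blocks) / np.sqrt(n)
    Ht = (X - mu) / np.sqrt(n)                              # (X - mu e^T)/sqrt(n)
    return Hb, Hw, Ht  # S_b = Hb @ Hb.T, S_w = Hw @ Hw.T, S_t = Ht @ Ht.T
```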
2.3. Non-parametric Discriminant Analysis

Non-parametric Discriminant Analysis (NPDA) addresses the non-Gaussian aspects of sample distributions by introducing a non-parametric between-class scatter matrix that measures between-class scatter on a local basis, in the neighborhood of the decision boundary. Fukunaga [4] presented the two-class NPDA, whose new definition of the between-class scatter matrix explicitly emphasizes the samples near the boundary, and Zhifeng Li et al. [6] extended the two-class NPDA to multi-class cases by proposing a new formulation of the scatter matrices.

Let N_p(x^k_i, l) denote the subset consisting of the p nearest neighbors of x^k_i from class C_l. We define the local nearest neighbor mean of x^k_i with respect to class C_l as

\mu(x^k_i, l, p) = \sum_{x^l_j \in N_p(x^k_i, l)} \beta(x^k_i, x^l_j) \, x^l_j,

where \beta(x^k_i, x^l_j) is a weight function between x^k_i and x^l_j satisfying \sum_{x^l_j \in N_p(x^k_i, l)} \beta(x^k_i, x^l_j) = 1.

The non-parametric between-class scatter matrix for the multi-class problem is defined as follows [6]:

S_b^N = \sum_{k=1}^{c-1} \sum_{l=k+1}^{c} \sum_{i=1}^{n_k} w(x^k_i, l, p) \, (\mu(x^k_i, k, p) - \mu(x^k_i, l, p)) (\mu(x^k_i, k, p) - \mu(x^k_i, l, p))^T,   (3)

where w(x^k_i, l, p) is a weighting function defined as

w(x^k_i, l, p) = \frac{\min\{ d^\alpha(x^k_i, \mu(x^k_i, k, p)), \, d^\alpha(x^k_i, \mu(x^k_i, l, p)) \}}{ d^\alpha(x^k_i, \mu(x^k_i, k, p)) + d^\alpha(x^k_i, \mu(x^k_i, l, p)) },   (4)

with \alpha \in (0, \infty) controlling the changing speed of the weight with respect to the distance ratio, and d(u, v) being the Euclidean distance between u and v. For samples near the classification boundary the weight approaches 0.5; for samples far away from the classification boundary it drops off to zero.
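The two non-parametric ingredients above are straightforward to compute. The sketch below (hypothetical helper names; it uses the uniform choice β = 1/p, one valid weighting that sums to 1) evaluates the local nearest neighbor mean and the weight of Eq. (4).

```python
import numpy as np

def local_nn_mean(x, Xl, p):
    # mu(x, l, p): mean of the p nearest neighbors of x among the rows of Xl,
    # using the uniform choice beta = 1/p (one valid weighting summing to 1).
    dist = np.linalg.norm(Xl - x, axis=1)
    return Xl[np.argsort(dist)[:p]].mean(axis=0)

def weight(x, mu_own, mu_other, alpha=1.0):
    # w(x, l, p) of Eq. (4): about 0.5 near the boundary, near 0 far from it.
    da = np.linalg.norm(x - mu_own) ** alpha
    db = np.linalg.norm(x - mu_other) ** alpha
    return min(da, db) / (da + db)
```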
3. NON-PARAMETRIC DISCRIMINANT ANALYSIS WITH COSINE-SINE DECOMPOSITION

In this section, a new formulation of the scatter matrices, derived from the nearest-neighborhood perspective, is presented to improve NPDA by simultaneously emphasizing the boundary information and local structure contained in the training set. CS decomposition is then incorporated to simultaneously diagonalize the matrices S_b and S_w, so as to handle the problems from which LDA suffers.

3.1. Construction of the scatter matrices

The main idea of the proposed scatter matrices is that if only vectors near the classification boundary are selected, the resulting scatter matrix specifies the subspace in which the boundary information is embedded; samples far away from the boundary may otherwise exert a considerable influence on the scatter matrix and distort the boundary structure information. Let N_p(x^k_i) denote the subset consisting of the p nearest neighbors of x^k_i from any class. We define the local nearest neighbor mean of x^k_i as

\mu(x^k_i, p) = \sum_{x \in N_p(x^k_i)} \beta(x^k_i, x) \, x.   (5)
Fig. 1. Non-parametric between-class scatter matrix. v: the local nearest neighbor mean connection vector between \mu(x^1_i, 1, p) and \mu(x^1_i, 2, p).

If we define two partitioned matrices A = [A_1, ..., A_c] and B = [B_1, ..., B_c] with A_k(:, i) = x^k_i - \mu(x^k_i, k, p) and B_k(:, i) = x^k_i - \mu(x^k_i, p), then the generalized non-parametric half between-class, half within-class, and half total scatter matrices are defined as

H_b(:, k) = \frac{n_k}{n} \sum_{i=1}^{n_k} \sum_{l=1}^{c} w(x^k_i, l, p) \, (\mu(x^k_i, k, p) - \mu(x^k_i, l, p)),   (6)

H_w = \frac{1}{\sqrt{n}} [A_1, \dots, A_c],   (7)

H_t = \frac{1}{\sqrt{n}} [B_1, \dots, B_c],   (8)

where w(x^k_i, l, p) is defined in Eq. (4).
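Putting Eqs. (5)-(8) together, a rough sketch of the construction might look as follows. It reuses the local_nn_mean and weight helpers above; the n_k/n scaling follows our reading of Eq. (6), and the neighbor sets here include the query point itself, which the paper leaves unspecified.

```python
import numpy as np  # reuses local_nn_mean and weight from the sketch above

def nonparametric_half_matrices(X, y, p, alpha=1.0):
    # X: (n, d) array, rows are samples; y: (n,) labels.
    classes = np.unique(y)
    n, d = X.shape
    Hb = np.zeros((d, len(classes)))
    A_cols, B_cols = [], []
    for j, k in enumerate(classes):
        Xk = X[y == k]
        col = np.zeros(d)
        for xi in Xk:
            mu_own = local_nn_mean(xi, Xk, p)
            A_cols.append(xi - mu_own)                   # A_k(:, i)
            B_cols.append(xi - local_nn_mean(xi, X, p))  # B_k(:, i), any class
            for l in classes:
                if l != k:
                    mu_other = local_nn_mean(xi, X[y == l], p)
                    col += weight(xi, mu_own, mu_other, alpha) * (mu_own - mu_other)
        Hb[:, j] = (len(Xk) / n) * col                   # Eq. (6), n_k/n scaling
    Hw = np.column_stack(A_cols) / np.sqrt(n)            # Eq. (7)
    Ht = np.column_stack(B_cols) / np.sqrt(n)            # Eq. (8)
    return Hb, Hw, Ht
```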
As illustrated in Fig. 1, the new design has two advantages. First, the non-parametric between-class scatter matrix spans a subspace in which the local structure is embedded. Second, the new weighting function helps emphasize the samples near the boundary of two classes and can thus capture the boundary structure information more effectively.

3.2. Non-parametric Discriminant Analysis with Cosine-sine Decomposition

With the help of the generalized non-parametric half scatter matrices computed above, we can derive the non-parametric discriminant analysis algorithm summarized in Algorithm 1. The improved NPDA/CS algorithm is essentially an application of the CS decomposition to the half scatter matrices.
Algorithm 1 Non-parametric discriminant analysis with CSD
Input: partitioned data matrix X = [X_1, ..., X_c].
Output: transformation matrix G.
1. Construct H_b, H_w, and H_t.
2. Compute the SVD of H_t, that is, H_t = U_1 D_1 V_1^T; let U_pca = U_1(:, 1 : rank(H_t)) and save U_pca.
3. Project H_b and H_w onto the PCA subspace: H_b := U_pca^T H_b, H_w := U_pca^T H_w.
4. Let F = [H_b^T; H_w^T] and apply the QL decomposition F = QL; save Q and L.
5. Apply the CS decomposition to Q to obtain the matrix W.
6. Let Y = L^T W, and compute the orthogonal matrix Φ by the QL decomposition of Y, that is, Y = ΦL_1; we then get Z = Φ(L_1^T)^{-1}.
7. Let q = rank(H_b).
8. Let G* = [Z_1, Z_2, ..., Z_q], where Z_i is the i-th column of Z.
9. Output the transformation matrix G = U_pca G*.
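Below is a minimal sketch of the numerical core of Algorithm 1 under NumPy conventions: the ql helper obtains a QL factorization from a QR factorization of the row- and column-reversed matrix, and the function covers steps 2-4 of the listing. The rank tolerance is our choice; for step 5, SciPy's scipy.linalg.cossin can be applied when the orthogonal factor is square.

```python
import numpy as np

def ql(F):
    # QL factorization F = Q @ L (L lower triangular), obtained from the QR
    # factorization of the row- and column-reversed matrix.
    Qr, Rr = np.linalg.qr(F[::-1, ::-1])
    return Qr[::-1, ::-1], Rr[::-1, ::-1]

def npda_cs_projection(Hb, Hw, Ht, tol=1e-10):
    # Step 2: SVD of Ht; Upca spans the range of Ht (tol is our rank cutoff).
    U1, s, _ = np.linalg.svd(Ht, full_matrices=False)
    Upca = U1[:, : int(np.sum(s > tol))]
    # Step 3: project the half scatter matrices onto the PCA subspace.
    Hb_p, Hw_p = Upca.T @ Hb, Upca.T @ Hw
    # Step 4: stack F = [Hb^T; Hw^T] and apply the QL decomposition.
    Q, L = ql(np.vstack([Hb_p.T, Hw_p.T]))
    # Step 5 would apply the CS decomposition to the partitioned Q;
    # scipy.linalg.cossin covers the square orthogonal case.
    return Upca, Q, L
```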
4. EXPERIMENTAL RESULTS

In this section, experimental results are presented that compare the improved NPDA method with 3 algorithms over 4 databases. The three compared methods are PCA+LDA [1], OLDA [9], and ULDA [10]. The 4 databases used are ORL, Yale, YaleB, and CMU PIE. The experimental results demonstrate the effectiveness of our improved method.

4.1. Methodology

In all experiments, the original images were normalized (in scale and orientation) by fixing the locations of the two eyes, and the faces were then cropped to produce the final images for matching. Each cropped image was 32 × 32 pixels with 256 gray levels per pixel, so each image can be represented by a 1024-dimensional vector in image space. A simple k-nearest neighbor (k-NN) classifier with k = 1 was used for all algorithms, and the number of nearest neighbors used to construct the half scatter matrices in NPDA was 5. On the ORL and Yale databases, 7 face images per person were used for training and the remaining 3 for testing. On YaleB, 10 face images per person were used for training and the remainder for testing. Since the PIE database contains more face images than the others, we conducted two experiments on it, namely PIE1 and PIE2: in PIE1, 20 face images per person were used for training, while in PIE2 the number of training images was 60.
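For reference, the recognition-rate evaluation in this protocol reduces to a 1-NN classifier on the projected features. A minimal sketch using scikit-learn follows (an assumption on our part; any 1-NN implementation would do, and the function name evaluate is hypothetical):

```python
from sklearn.neighbors import KNeighborsClassifier

def evaluate(G, X_train, y_train, X_test, y_test):
    # Project the 1024-dimensional image vectors with the learned
    # transformation G (d x q), then classify with 1-NN as in the protocol.
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train @ G, y_train)
    return clf.score(X_test @ G, y_test)  # recognition rate on the test set
```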
4.2. Results

All experiments were repeated 20 times. The means and standard deviations of the 4 algorithms on all databases are shown in Table 1. The improved NPDA/CS algorithm clearly achieves the best performance among the compared algorithms on all databases. This may be due to the advantages of NPDA/CS in emphasizing the boundary information and local structure contained in the training set.
Table 1. The means and standard deviations of the 4 algorithms over 20 runs on all databases.

Methods      | Orl          | Yale         | YaleB        | PIE1         | PIE2
PCA+LDA [1]  | 95.67 ± 1.79 | 80.25 ± 4.72 | 73.34 ± 6.73 | 78.47 ± 0.66 | 94.53 ± 0.23
OLDA [9]     | 97.50 ± 1.60 | 82.50 ± 4.17 | 79.02 ± 1.51 | 81.17 ± 0.64 | 94.80 ± 0.21
ULDA [10]    | 95.88 ± 1.94 | 80.75 ± 5.45 | 69.14 ± 9.53 | 77.38 ± 0.70 | 93.57 ± 0.26
NPDA/CS      | 97.63 ± 1.51 | 82.67 ± 3.99 | 79.31 ± 1.45 | 83.94 ± 0.64 | 96.17 ± 0.22
[Fig. 2: recognition rate vs. number of discriminant vectors for OLDA, ULDA, LDA, and NPDA/CS on (a) the ORL database, (b) the Yale database, (c) the YaleB database, and (d) the PIE database.]

Fig. 2. Performance of the 4 algorithms on all databases with different numbers of discriminant features used.
We further investigated the performance of the compared algorithms when different numbers of discriminant features were used. Fig. 2 plots the comparison results on all databases. The improved NPDA/CS evidently achieves the best results for every number of discriminant features used.
5. CONCLUSIONS

In this paper, we have presented a new algorithm that uses the Cosine-sine decomposition to improve the Non-parametric Discriminant Analysis algorithm for discriminant feature extraction, and applied it to face recognition. The improved method explicitly emphasizes the boundary information and local structure contained in the training set. Moreover, the CS decomposition of the half scatter matrices is adopted to improve the discriminant effectiveness. Compared with 3 state-of-the-art algorithms on 4 databases, the experimental results of NPDA/CS are encouraging.
6. REFERENCES

[1] J. Yang and J.-Y. Yang, "Why can LDA be performed in PCA transformed space?," Pattern Recognition, vol. 36, pp. 563-566, 2003.

[2] D.-Q. Dai and P. C. Yuen, "Regularized discriminant analysis and its application to face recognition," Pattern Recognition, vol. 36, pp. 845-847, 2003.

[3] J. Ye, R. Janardan, C. H. Park, and H. Park, "An optimization criterion for generalized discriminant analysis on undersampled problems," IEEE Trans. PAMI, vol. 26, pp. 982-994, 2004.

[4] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Boston, 1990.

[5] M. Bressan and J. Vitrià, "Nonparametric discriminant analysis and nearest neighbor classification," Pattern Recognition Letters, vol. 24, pp. 2743-2749, 2003.

[6] Z. Li, D. Lin, and X. Tang, "Nonparametric discriminant analysis for face recognition," IEEE Trans. PAMI, vol. 31, pp. 755-761, 2009.

[7] C. C. Paige and M. A. Saunders, "Towards a generalized singular value decomposition," SIAM Journal on Numerical Analysis, vol. 18, pp. 398-405, 1981.

[8] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.

[9] J. Ye, "Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems," Journal of Machine Learning Research, vol. 6, pp. 483-502, 2005.

[10] J. Ye, T. Li, and T. Xiong, "Using uncorrelated discriminant analysis for tissue classification with gene expression data," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, pp. 181-190, 2004.