Spectral Embedded Clustering

Feiping Nie$^{1,2}$, Dong Xu$^{2}$, Ivor Wai-Hung Tsang$^{2}$ and Changshui Zhang$^{1}$

$^{1}$State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing 100080, China
$^{2}$School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]; [email protected]; [email protected]; [email protected]

Theorem 1. If $\mathrm{rank}(S_b) = c - 1$ and $\mathrm{rank}(S_t) = \mathrm{rank}(S_w) + \mathrm{rank}(S_b)$, then the true cluster assignment matrix can be represented by a low-dimensional linear mapping of the data; that is, there exist $W \in \mathbb{R}^{d \times c}$ and $b \in \mathbb{R}^{c \times 1}$ such that $Y = X^T W + \mathbf{1}_n b^T$.
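Throughout the proof, the scatter matrices are those defined in the main text; assuming those standard definitions, with the data matrix $X \in \mathbb{R}^{d \times n}$ centered ($X \mathbf{1}_n = 0$), $Y \in \{0,1\}^{n \times c}$ the cluster assignment matrix, and $G = Y (Y^T Y)^{-1/2}$ the scaled cluster assignment matrix, they read
$$ S_t = X X^T, \qquad S_b = X G G^T X^T, \qquad S_w = S_t - S_b . $$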
A Proof of Theorem 1

Proof. Suppose the eigenvalue decomposition of $S_t$ is
$$ S_t = [U_1, U_0] \begin{bmatrix} \Lambda_t^2 & 0 \\ 0 & 0 \end{bmatrix} [U_1, U_0]^T . $$
Let $B = \Lambda_t^{-1} U_1^T S_b U_1 \Lambda_t^{-1}$, suppose the eigenvalue decomposition of $B$ is $B = V_b \Lambda_b V_b^T$, and let $P = [U_1 \Lambda_t^{-1} V_b, \; U_0]$. According to [Ye, 2007], if $\mathrm{rank}(S_t) = \mathrm{rank}(S_w) + \mathrm{rank}(S_b)$ holds, then
$$ P^T S_t P = \begin{bmatrix} I_t & 0 \\ 0 & 0 \end{bmatrix} = D_t \quad \text{and} \quad P^T S_b P = \begin{bmatrix} I_b & 0 \\ 0 & 0 \end{bmatrix} = D_b , $$
where $I_t \in \mathbb{R}^{r_t \times r_t}$ and $I_b \in \mathbb{R}^{r_b \times r_b}$ are identity matrices, $r_t$ is the rank of $S_t$, $r_b$ is the rank of $S_b$, and $r_b \le r_t$. According to [Ye, 2005], $S_t^+ = P D_t P^T$, and hence $P^{-1} S_t^+ P^{-T} = D_t$. Then we have
$$ S_w S_t^+ S_b = S_t S_t^+ S_b - S_b S_t^+ S_b = P^{-T} D_t D_t D_b P^{-1} - P^{-T} D_b D_t D_b P^{-1} = 0 , $$
where the last equality holds because $D_t D_t D_b = D_b$ and $D_b D_t D_b = D_b$. Note that $S_b = X G G^T X^T$. Therefore, we have $S_w S_t^+ X G G^T X^T = 0$; multiplying on the right by $S_t^+ S_w$ gives $S_w S_t^+ X G (S_w S_t^+ X G)^T = 0 \Rightarrow S_w S_t^+ X G = 0 \Rightarrow S_w S_t^+ X Y = 0$. Define $W_0 = S_t^+ X Y$; then $S_w W_0 = 0$. Therefore, the columns of $W_0$ lie in the null space of $S_w$, and all data points belonging to the same cluster are projected onto the same point under the projection $W_0$ [Cevikalp et al., 2005]. Thus we have
$$ \forall i, \quad y_i = [\underbrace{0, \ldots, 0}_{j-1}, 1, \underbrace{0, \ldots, 0}_{c-j}]^T \;\Rightarrow\; x_i^T W_0 = \bar{x}_j^T W_0 , \qquad (1) $$
where $y_i^T$ is the $i$-th row of the true cluster assignment matrix $Y$ and $\bar{x}_j$ is the mean of the data points belonging to cluster $j$.
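The step invoked from [Cevikalp et al., 2005] can be spelled out briefly; assuming the standard form of the within-cluster scatter, $S_w = \sum_{j=1}^{c} \sum_{i:\, y_{ij}=1} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$, we have
$$ S_w W_0 = 0 \;\Rightarrow\; W_0^T S_w W_0 = \sum_{j=1}^{c} \sum_{i:\, y_{ij}=1} \big\| W_0^T (x_i - \bar{x}_j) \big\|_2^2 = 0 \;\Rightarrow\; x_i^T W_0 = \bar{x}_j^T W_0 \ \text{ whenever } y_{ij} = 1 . $$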
Denote $\bar{X}_c = [\bar{x}_1, \ldots, \bar{x}_c]$. Note that $\bar{X}_c = X Y \Sigma$, where $\Sigma \in \mathbb{R}^{c \times c}$ is a diagonal matrix whose $i$-th diagonal element is $1/n_i$, and $n_i$ is the number of data points belonging to cluster $i$. Then $\mathrm{rank}(\bar{X}_c^T W_0) = \mathrm{rank}(\Sigma Y^T X^T (X X^T)^+ X Y) = \mathrm{rank}((X X^T)^+ X Y) = \mathrm{rank}(S_b) = c - 1$. Denote $Q = \bar{X}_c^T W_0 + \mathbf{1}_c \mathbf{1}_c^T$. Note that $Y \mathbf{1}_c = \mathbf{1}_n$ and $X \mathbf{1}_n = 0$, so $\bar{X}_c^T W_0 \mathbf{1}_c = 0$ and $\mathbf{1}_c^T \Sigma^{-1} \bar{X}_c^T W_0 = 0$. Thus the vector $\mathbf{1}_c$ lies in neither the column space nor the row space of $\bar{X}_c^T W_0$. Suppose the full-rank decomposition of $\bar{X}_c^T W_0$ is $\bar{X}_c^T W_0 = Q_1 Q_2^T$, where $Q_1, Q_2 \in \mathbb{R}^{c \times (c-1)}$ have full column rank. Then $Q = Q_1 Q_2^T + \mathbf{1}_c \mathbf{1}_c^T = [Q_1, \mathbf{1}_c][Q_2, \mathbf{1}_c]^T$. As $[Q_1, \mathbf{1}_c]$ and $[Q_2, \mathbf{1}_c]$ are both full-rank square matrices, $Q$ is invertible. Hence we have $Q Q^{-1} = I \Rightarrow (\bar{X}_c^T W_0 + \mathbf{1}_c \mathbf{1}_c^T) Q^{-1} = I \Rightarrow \bar{X}_c^T W_0 Q^{-1} + \mathbf{1}_c (Q^{-T} \mathbf{1}_c)^T = I$. Let $W = W_0 Q^{-1}$ and $b = Q^{-T} \mathbf{1}_c$.
Then $\bar{X}_c^T W + \mathbf{1}_c b^T = I$. According to (1), $X^T W_0 = Y \bar{X}_c^T W_0$; combining this with $\mathbf{1}_n = Y \mathbf{1}_c$ gives $X^T W + \mathbf{1}_n b^T = Y (\bar{X}_c^T W + \mathbf{1}_c b^T) = Y$. Therefore, if $\mathrm{rank}(S_b) = c - 1$ and $\mathrm{rank}(S_t) = \mathrm{rank}(S_w) + \mathrm{rank}(S_b)$, there exist $W \in \mathbb{R}^{d \times c}$ and $b \in \mathbb{R}^{c \times 1}$ such that $Y = X^T W + \mathbf{1}_n b^T$. $\Box$
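Because the construction of $W$ and $b$ above is fully explicit, it can also be checked numerically. The following is a minimal sketch (not part of the original paper; the variable names and data-generation setup are illustrative) that builds $W_0$, $Q$, $W$ and $b$ exactly as in the proof on synthetic high-dimensional data ($d > n$), for which the two rank conditions hold generically:

import numpy as np

# Numerical sanity check of Theorem 1 (illustrative sketch only).
# Assumptions: X is the centered d x n data matrix, Y the n x c cluster
# indicator, and the scatter matrices follow the standard definitions
# S_t = X X^T, S_b = X G G^T X^T with G = Y (Y^T Y)^{-1/2}, S_w = S_t - S_b.
rng = np.random.default_rng(0)
d, c, n_per = 200, 3, 20                  # d > n, so both rank conditions hold generically
n = c * n_per

means = rng.normal(scale=5.0, size=(c, d))
X = np.hstack([means[j][:, None] + rng.normal(size=(d, n_per)) for j in range(c)])
Y = np.zeros((n, c))
for j in range(c):
    Y[j * n_per:(j + 1) * n_per, j] = 1.0

X = X - X.mean(axis=1, keepdims=True)     # center the data so that X 1_n = 0
St = X @ X.T                              # total scatter
G = Y / np.sqrt(Y.sum(axis=0))            # G = Y (Y^T Y)^{-1/2}
Sb = X @ G @ G.T @ X.T                    # between-cluster scatter
Sw = St - Sb                              # within-cluster scatter

# Construction from the proof: W0 = St^+ X Y, Q = Xc^T W0 + 1_c 1_c^T,
# W = W0 Q^{-1}, b = Q^{-T} 1_c.
W0 = np.linalg.pinv(St) @ X @ Y
Xc = X @ Y / Y.sum(axis=0)                # cluster means as columns (= X Y Sigma)
Q = Xc.T @ W0 + np.ones((c, c))
W = W0 @ np.linalg.inv(Q)
b = np.linalg.solve(Q.T, np.ones(c))

print(np.abs(Sw @ W0).max())                                 # ~0, i.e. S_w W_0 = 0
print(np.abs(X.T @ W + np.outer(np.ones(n), b) - Y).max())   # ~0, i.e. Y = X^T W + 1_n b^T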
References

[Cevikalp et al., 2005] Hakan Cevikalp, Marian Neamtu, Mitch Wilkes, and Atalay Barkana. Discriminative common vectors for face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 27(1):4–13, 2005.

[Ye, 2005] Jieping Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005.

[Ye, 2007] Jieping Ye. Least squares linear discriminant analysis. In ICML, pages 1087–1093, 2007.