Nikon Corporation
On Bayesian PCA: Automatic Dimensionality Selection and Analytic Solution
Shinichi Nakajima (Nikon Corporation), Masashi Sugiyama (Tokyo Tech.), S. Derin Babacan (University of Illinois)
NIKON CORPORATION Core Technology Center July 1, 2011
Contents
‣ Probabilistic PCA
‣ Approximate Bayesian Inference
‣ Theoretical Result (PB vs SimpleVB)
‣ Experimental Result (SimpleVB vs VB)
‣ Conclusion
Principal component analysis (PCA)
Y = (y_1 ⋯ y_M) ∈ R^{L×M}: observation (M samples of y ∈ R^L)
B ∈ R^{L×H}: loading matrix (H ≪ L)
PCA finds the loading matrix maximizing the variance of the projected data: max_B VAR(B^T Y)
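As a concrete illustration (not from the slides), the variance-maximizing loading matrix can be computed from the top-H left singular vectors of the centered data matrix. A minimal numpy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered data: M = 50 samples of y in R^L, stacked as columns of Y.
L, M, H = 20, 50, 3
Y = rng.standard_normal((L, M))
Y = Y - Y.mean(axis=1, keepdims=True)

# PCA: the loading matrix B (L x H) maximizing VAR(B^T Y) under
# orthonormality is given by the top-H left singular vectors of Y.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
B = U[:, :H]

# Projected variances are the (decreasing) squared singular values.
proj_var = s[:H] ** 2
```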
Probabilistic PCA [Tipping & Bishop 99]
Y = BX + E
Y = (y_1 ⋯ y_M) ∈ R^{L×M}: observation
B ∈ R^{L×H}: loading matrix
X = (x_1 ⋯ x_M) ∈ R^{H×M}: hidden variable
E ∈ R^{L×M}: noise
X_{h,m} ∼ N(0, 1), E_{l,m} ∼ N(0, σ²)
Probabilistic PCA [Tipping & Bishop 99]
likelihood: p(Y|X, B) ∝ exp( −‖Y − BX‖²_Fro / (2σ²) )
prior: φ_X(X) ∝ exp( −‖X‖²_Fro / 2 )
(In the following, X is renamed to A: X → A.)
Bayesian Matrix Factorization [Salakhutdinov & Mnih 08]
likelihood: p(Y|A, B) ∝ exp( −‖Y − BA^T‖²_Fro / (2σ²) )
priors: φ_A(A) ∝ exp( −‖A‖²_Fro / 2 ),  φ_B(B) ∝ exp( −tr(B C_B^{−1} B^T) / 2 )
A ∈ R^{M×H}, B ∈ R^{L×H}, C_B = diag(c²_{b_1}, …, c²_{b_H})
L: original dimensionality, H: dimensionality of latent feature, M: number of samples
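To make the generative model concrete, here is a small sampling sketch (the values of L, H, M, σ, and c_{b_h} are my own illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Model sizes and noise level (hypothetical values for illustration).
L, H, M = 20, 3, 50
sigma = 0.1
c_b = np.array([2.0, 1.0, 0.5])  # prior scales c_{b_h}, one per column of B

# Draw the factors according to the priors:
#   A_{m,h} ~ N(0, 1)
#   B_{l,h} ~ N(0, c_{b_h}^2)   (i.e., C_B = diag(c_b^2))
#   E_{l,m} ~ N(0, sigma^2)
A = rng.standard_normal((M, H))
B = rng.standard_normal((L, H)) * c_b  # broadcast scales each column
E = sigma * rng.standard_normal((L, M))

# Observation: Y = B A^T + E, an L x M matrix.
Y = B @ A.T + E
```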
Bayesian Inference
Bayes posterior: p(A, B|Y) ∝ p(Y|A, B) φ_A(A) φ_B(B)
Bayes estimator: Û^Bayes = ⟨BA^T⟩_{p(A,B|Y)}
Computationally intractable!
Free Energy
Trial posterior: r(A, B)
Free energy: F(r) = ⟨ log( r(A, B) / (p(Y|A, B) φ_A(A) φ_B(B)) ) ⟩_{r(A,B)}
            = KL( r(A, B) ‖ p(A, B|Y) ) + const.
p(A, B|Y) = argmin_r F(r)
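The identity F(r) = KL(r ‖ p(A, B|Y)) + const. is what makes free-energy minimization meaningful: the unconstrained minimizer is the Bayes posterior itself. A toy univariate Gaussian check (the posterior parameters here are made up for illustration):

```python
import numpy as np

def kl_gauss(mu, s2, mu0, s20):
    # KL( N(mu, s2) || N(mu0, s20) ) for univariate Gaussians.
    return 0.5 * np.log(s20 / s2) + (s2 + (mu - mu0) ** 2) / (2 * s20) - 0.5

# Stand-in "Bayes posterior" N(mu0, s20); trial posteriors N(mu, s20).
mu0, s20 = 1.3, 0.4
mus = np.linspace(-2.0, 4.0, 601)
kls = np.array([kl_gauss(mu, s20, mu0, s20) for mu in mus])

# Up to a constant, the free energy equals this KL divergence, so it is
# minimized exactly where the trial posterior matches the Bayes posterior.
best = mus[np.argmin(kls)]
```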
Approximate Bayesian Inference
Free energy: F(r) = ⟨ log( r(A, B) / (p(Y|A, B) φ_A(A) φ_B(B)) ) ⟩_{r(A,B)}
Free-energy minimization under some constraint on the posterior.
Approximate estimator: Û^approx = ⟨BA^T⟩_{r(A,B|Y)},
where r(A, B|Y) = argmin_r F(r)  s.t. r(A, B|Y) ∈ G.
G specifies the approximation method.
Bayes Posterior
[Figure: contour plots of the Bayes posterior over (A, B) for L = M = H = 1 with noise variance σ² = 1 (cf. Figures 4 and 5 of the paper: Bayes posteriors for V = 0, with finite c_a = c_b and with c_a = c_b = c → ∞; solid lines indicate the solutions).]
Variational Bayesian Methods
VB — matrix-wise independence assumption [Bishop 2001, Lim & Teh 2007]:
r^VB(A, B|Y) = r(A) r(B)
SimpleVB — column-wise independence assumption [Raiko et al. 2007]:
r^SimpleVB(A, B|Y) = ∏_{h=1}^H r(a_h) ∏_{h=1}^H r(b_h),
where A = (a_1, …, a_H), B = (b_1, …, b_H).
An analytic-form solution has been obtained [Nakajima et al. NIPS 2010].
Partially Bayesian Methods
PB-A — A is integrated out, and B is point-estimated [Tipping & Bishop 99]; the posterior on B is restricted to a δ-function:
r^PB-A(A, B|Y) = r(A) δ(B; B*)
PB-B — B is integrated out, and A is point-estimated; the posterior on A is restricted to a δ-function:
r^PB-B(A, B|Y) = δ(A; A*) r(B)
PB — choose the one giving the lower free energy:
r^PB(A, B|Y) = r^PB-A(A, B|Y) if F(r^PB-A) ≤ F(r^PB-B), and r^PB-B(A, B|Y) otherwise.
MAP Estimation
MAP — A and B are both point-estimated; the posteriors on A and B are restricted to δ-functions:
r^MAP(A, B|Y) = δ(A; A*) δ(B; B*)
We discuss two issues
φ_B(B) ∝ exp( −tr(B C_B^{−1} B^T) / 2 )
There are two major differences between VB and PB-A. In (empirical) VB,
- not only A but also B is integrated out (PB-A → VB);
- automatic relevance determination (ARD) is realized through the empirical Bayesian procedure (C_B is also estimated from the observation).
1. Are these properties essential for automatic dimensionality selection (ADS)?
→ No! (shown theoretically)
[PB-A: Tipping & Bishop 2001; (Empirical) VB: Bishop 2001]
We discuss two issues
VB needs iteration; SimpleVB has an analytic solution [Nakajima et al. NIPS 2010]:
r^VB(A, B|Y) = r(A) r(B)   [Bishop 2001, Lim & Teh 2007]
r^SimpleVB(A, B|Y) = ∏_{h=1}^H r(a_h) ∏_{h=1}^H r(b_h)   [Raiko et al. 2007]
2. Does the column-wise independence degrade the approximation quality?
→ No! (shown empirically)
Table 2: Properties of approximate PPCA methods. Upper methods have weaker constraints on the posterior distribution. ‘✓’ means the method positively possesses the corresponding property, ‘△’ means weakly, and ‘×’ means not.

Method      | Automatic dimensionality selection | Empirical Bayes | Analytic-form solution
VB          | ✓                                  | ✓               | ×
simple-VB   | ✓                                  | ✓               | ✓
PB          | ✓                                  | ×               | ✓
PB-A (PB-B) | △                                  | ×               | ✓
MAP         | ×                                  | ×               | ✓
PB Solution
likelihood: p(Y|A, B) ∝ exp( −‖Y − BA^T‖²_Fro / (2σ²) )
priors: φ_A(A) ∝ exp( −‖A‖²_Fro / 2 ),  φ_B(B) ∝ exp( −tr(B C_B^{−1} B^T) / 2 )
The PB-A solution was derived for c_{b_h} = ∞ [Tipping & Bishop 99]. Here, we derive the PB solution for the general case, and show its similarity to the SimpleVB solution when σ² and C_B are given.
PB Solution
[Theorem 2] Let Y = Σ_{h=1}^H γ_h ω_{b_h} ω_{a_h}^T be the SVD of Y. The PB solution (the space spanned by the loading vectors) is given by
S^PB = span( Σ_{h=1}^H θ(γ_h > γ_h^PB) ω_{b_h} ω_{a_h}^T ),
where γ_h^PB = σ √( max(L, M) + σ²/c²_{b_h} ),
θ(·) = 1 if the argument is true and 0 otherwise, and span(W) denotes the space spanned by the column vectors of W.
The space spanned by the singular vectors with small singular values is neglected: ADS occurs with PB, without empirical Bayes!
SimpleVB Solution [Nakajima et al. NIPS 2010]
[Theorem 3] The SimpleVB solution is given by
S^SimpleVB = span( Σ_{h=1}^H θ(γ_h > γ_h^SimpleVB) ω_{b_h} ω_{a_h}^T ),
where γ_h^SimpleVB = σ √( κ_h + √( κ_h² − LM ) ),  κ_h = (L + M)/2 + σ²/(2 c²_{b_h}).
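The two thresholds in Theorems 2 and 3 can be compared numerically; a small sketch (the parameter values are arbitrary illustrations):

```python
import numpy as np

def gamma_pb(L, M, sigma2, cb2):
    # Theorem 2: gamma_h^PB = sigma * sqrt( max(L, M) + sigma^2 / c_bh^2 )
    return np.sqrt(sigma2 * (max(L, M) + sigma2 / cb2))

def gamma_simplevb(L, M, sigma2, cb2):
    # Theorem 3: gamma_h^SimpleVB = sigma * sqrt( kappa + sqrt(kappa^2 - L*M) ),
    # with kappa = (L + M)/2 + sigma^2 / (2 c_bh^2).
    kappa = (L + M) / 2 + sigma2 / (2 * cb2)
    return np.sqrt(sigma2 * (kappa + np.sqrt(kappa ** 2 - L * M)))

L, M, sigma2 = 20, 50, 1.0
for cb2 in (0.25, 1.0, 4.0):
    print(cb2, gamma_pb(L, M, sigma2, cb2), gamma_simplevb(L, M, sigma2, cb2))

# Flat-prior limit: kappa -> (L+M)/2 and kappa^2 - L*M = ((L-M)/2)^2,
# so both thresholds reduce to sigma * sqrt(max(L, M)).
flat_pb = gamma_pb(L, M, sigma2, 1e12)
flat_svb = gamma_simplevb(L, M, sigma2, 1e12)
```

Numerically the two thresholds stay close for moderate c_{b_h} and coincide as c_{b_h} → ∞, matching the comparison on the following slides.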
PB vs SimpleVB
(L: original dimensionality, H: dimensionality of latent feature, M: number of samples)
[Figure 2: behavior of the thresholds γ_h^simple-VB, γ_h^PB, γ_h^PB-A, and γ_h^MAP as functions of c_{b_h}, for (L, M) = (20, 50) and (20, 10); the noise variance is set to σ² = 1.]
PB and SimpleVB behave similarly.
PB vs SimpleVB
[Figure 3: behavior of the thresholds when the number of samples M is changed, for L = 20 with c_{b_h} = 1.0 and with c_{b_h} = ∞.]
PB and SimpleVB coincide when c_{b_h} → ∞.
PB-A behaves differently when M < L.
Property of PB Solution
‣ ADS occurs:
  ‣ integrating out a part of the parameters can lead to ADS;
  ‣ empirical Bayes is not necessary for inducing ADS.
However,
‣ Empirical PB (where the hyperparameters are estimated from the observation) cannot be applied: the global solution is the trivial one (b̂*_h = 0, c_{b_h} → 0), since the c_{b_h}-dependent part of the PB-A free energy F^PB-A, which contains the terms L log c²_{b_h} + b̂*²_h / c²_{b_h}, is unbounded below.
‣ σ² tends to be underestimated.
VB (or SimpleVB) is more convenient.
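The collapse of empirical PB can be seen numerically. Under the decomposition sketched above, the c_{b_h}-dependent terms behave like L log c² + b²/c²; this snippet evaluates only those schematic terms (not the full free energy):

```python
import numpy as np

def cb_terms(b, c, L=20):
    # Schematic c_bh-dependent terms of the PB-A free energy:
    #   L * log(c^2) + b^2 / c^2
    return L * np.log(c ** 2) + b ** 2 / c ** 2

# At the trivial point b = 0, the terms diverge to -inf as c -> 0,
# so empirical minimization over c_bh collapses to b* = 0, c_bh -> 0.
vals = [cb_terms(0.0, c) for c in (1.0, 1e-2, 1e-4, 1e-8)]
```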
VB Solution [Bishop 99, Lim & Teh 07]
r^VB(A, B) = r(A) r(B)
The VB posterior is Gaussian:
r^VB(A, B) = ∏_{m=1}^M N_H(ã_m; μ_{ã_m}, Σ_Ã) ∏_{l=1}^L N_H(b̃_l; μ_{b̃_l}, Σ_B̃),
where the parameters necessarily satisfy
μ_{ã_m} = (1/σ²) Σ_Ã (μ_{b̃_1}, …, μ_{b̃_L}) y_m,   (14)
μ_{b̃_l} = (1/σ²) Σ_B̃ (μ_{ã_1}, …, μ_{ã_M}) ỹ_l,   (15)
Σ_Ã = σ² ( Σ_{l=1}^L ( μ_{b̃_l} μ_{b̃_l}^T + Σ_B̃ ) + σ² I_H )^{−1},   (16)
Σ_B̃ = σ² ( Σ_{m=1}^M ( μ_{ã_m} μ_{ã_m}^T + Σ_Ã ) + σ² C_B^{−1} )^{−1}.   (17)
A local minimizer is obtained by iterating (14)–(17) until convergence; the VB-PCA solution is then S^VB = span( (μ_{b̃_1}, …, μ_{b̃_L})^T ).
Iterative algorithm needed.
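A minimal numpy sketch of the iterative updates (14)–(17) as written above (the matrix shapes, random initialization, and fixed C_B = I are my own choices for illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from the model: Y = B A^T + noise, true rank 3, H = 5.
L, M, H, sigma2 = 20, 100, 5, 0.01
Y = rng.standard_normal((L, 3)) @ rng.standard_normal((3, M))
Y = Y + np.sqrt(sigma2) * rng.standard_normal((L, M))

CB_inv = np.eye(H)                  # C_B^{-1} with all c_bh = 1 (fixed here)
MA = rng.standard_normal((M, H))    # rows: posterior means mu_a_m
MB = rng.standard_normal((L, H))    # rows: posterior means mu_b_l
SA = np.eye(H)                      # Sigma_A~
SB = np.eye(H)                      # Sigma_B~

for _ in range(200):
    # (16): Sigma_A~ = sigma^2 ( sum_l mu_bl mu_bl^T + L Sigma_B~ + sigma^2 I )^-1
    SA = sigma2 * np.linalg.inv(MB.T @ MB + L * SB + sigma2 * np.eye(H))
    # (14): mu_a_m = Sigma_A~ (mu_b1, ..., mu_bL) y_m / sigma^2, all m at once
    MA = Y.T @ MB @ SA / sigma2
    # (17): Sigma_B~ = sigma^2 ( sum_m mu_am mu_am^T + M Sigma_A~ + sigma^2 C_B^-1 )^-1
    SB = sigma2 * np.linalg.inv(MA.T @ MA + M * SA + sigma2 * CB_inv)
    # (15): mu_b_l = Sigma_B~ (mu_a1, ..., mu_aM) y~_l / sigma^2, all l at once
    MB = Y @ MA @ SB / sigma2

# After convergence the posterior means reconstruct the low-rank signal.
rel_err = np.linalg.norm(Y - MB @ MA.T) / np.linalg.norm(Y)
```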
Empirical VB [Bishop 99, Lim & Teh 07]
The hyperparameter C_B is also estimated from the observation: a local minimizer of the free energy with respect to {c_{b_h}} is obtained by applying, in each iteration, the update rule
c²_{b_h} = ‖μ_{b_h}‖²/L + (Σ_B̃)_{hh},   (31)
where μ_{b_h} is the posterior mean of the h-th column of B.
Iterative algorithm needed.
Empirical VB vs Empirical SimpleVB
Empirical VB: r^VB(A, B) = r(A) r(B) — iterative algorithm needed.
Empirical SimpleVB: r^SimpleVB(A, B|Y) = ∏_{h=1}^H r(a_h) ∏_{h=1}^H r(b_h) — an analytic solution is available [Nakajima et al. NIPS 2010]:
[Theorem 4] The Empirical SimpleVB (simple-EVB) solution is given by the SimpleVB form with the threshold
γ_h^simple-EVB = σ(√L + √M) if Δ_h < 0, and ∞ otherwise,
where
Δ_h = M log( γ_h γ̆_h/(M σ²) + 1 ) + L log( γ_h γ̆_h/(L σ²) + 1 ) + ( LM c̆²_{b_h} − 2 γ_h γ̆_h )/σ²,
c̆²_{b_h} = ( τ_h + √( τ_h² − 4LM σ⁴ ) )/(2LM),  τ_h = γ_h² − (L + M) σ²,
and γ̆_h is the SimpleVB solution for c_{b_h} = c̆_{b_h}.
Thanks to Theorems 3 and 4, the (empirical) SimpleVB solution can be computed analytically in a very efficient way, while the ordinary VB-PCA requires a number of iterations to find a local optimum.
VB vs SimpleVB 1
Estimated dimension and computation time on an artificial dataset (M = 300, L = 100).
[Figure 1: EVB-PCA vs. simple-EVB-PCA. Left: estimated dimensions Ĥ against the true dimensions H*. Right: running times (s) against H*.]
- The solutions coincide.
- SimpleVB is faster: simple-EVB-PCA is computationally more efficient than EVB-PCA.
VB vs SimpleVB 2
Estimated dimension on benchmark datasets.
Table 1: Estimated PCA dimensions of real datasets. Ĥ and Ĥ^simple denote the PCA dimensions estimated by EVB-PCA (EmpiricalVB) and simple-EVB-PCA (Empirical SimpleVB), respectively.

Data set       | M     | L  | # Classes | Ĥ  | Ĥ^simple
Chart          | 600   | 60 | 6         | 11 | 10
Glass          | 214   | 9  | 6         | 7  | 7
Wine           | 178   | 13 | 3         | 7  | 8
Optical Digits | 5620  | 64 | 10        | 56 | 56
Satellite      | 6435  | 36 | 6         | 32 | 31
Segmentation   | 2310  | 19 | 7         | 12 | 13
Letter         | 20000 | 16 | 26        | 15 | 15

The datasets are taken from the UCI repository (Asuncion & Newman, 2007).
The solutions almost coincide. SimpleVB is more advantageous because the analytic solution is available!
Conclusion
‣ We derived the analytic PB solution for matrix factorization, and theoretically showed its ADS effect.
‣ We empirically showed that SimpleVB gives performance similar to VB.
‣ We recommend using SimpleVB!
(See Table 2 for a summary of the properties of the approximate PPCA methods.)