Nikon Corporation
On Bayesian PCA: Automatic Dimensionality Selection and Analytic Solution
Shinichi Nakajima (Nikon Corporation), Masashi Sugiyama (Tokyo Tech.), S. Derin Babacan (University of Illinois)
NIKON CORPORATION Core Technology Center July 1, 2011
Contents
‣ Probabilistic PCA
‣ Approximate Bayesian Inference
‣ Theoretical Result (PB vs SimpleVB)
‣ Experimental Result (SimpleVB vs VB)
‣ Conclusion
Principal component analysis (PCA)
Y = (y_1 ⋯ y_M) ∈ R^{L×M}: observation (M samples of y ∈ R^L)
B ∈ R^{L×H}: loading matrix (H ≪ L)
PCA finds the loading matrix maximizing the variance of the projected data: max_B VAR(B^T Y)
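As a concrete illustration (not from the slides), the variance-maximizing loading matrix can be computed from the top-H left singular vectors of the centered data matrix. A minimal numpy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered data: M = 50 samples of y in R^L, stacked as columns of Y.
L, M, H = 20, 50, 3
Y = rng.standard_normal((L, M))
Y = Y - Y.mean(axis=1, keepdims=True)

# PCA: the loading matrix B (L x H) maximizing VAR(B^T Y) under
# orthonormality is given by the top-H left singular vectors of Y.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
B = U[:, :H]

# Projected variances are the (decreasing) squared singular values.
proj_var = s[:H] ** 2
```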
Probabilistic PCA [Tipping & Bishop 99]
Y = BX + E
Y = (y_1 ⋯ y_M) ∈ R^{L×M}: observation
B ∈ R^{L×H}: loading matrix
X = (x_1 ⋯ x_M) ∈ R^{H×M}: hidden variable
E ∈ R^{L×M}: noise
X_{h,m} ∼ N(0, 1), E_{l,m} ∼ N(0, σ²)
Probabilistic PCA [Tipping & Bishop 99]
likelihood: p(Y|X, B) ∝ exp( −‖Y − BX‖²_Fro / (2σ²) )
prior: φ_X(X) ∝ exp( −‖X‖²_Fro / 2 )
(In the following, X is renamed to A: X → A.)
Bayesian Matrix Factorization [Salakhutdinov & Mnih 08]
likelihood: p(Y|A, B) ∝ exp( −‖Y − BA^T‖²_Fro / (2σ²) )
priors: φ_A(A) ∝ exp( −‖A‖²_Fro / 2 ),  φ_B(B) ∝ exp( −tr(B C_B^{−1} B^T) / 2 )
A ∈ R^{M×H}, B ∈ R^{L×H}, C_B = diag(c²_{b_1}, …, c²_{b_H})
L: original dimensionality, H: dimensionality of latent feature, M: number of samples
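To make the generative model concrete, here is a small sampling sketch (the values of L, H, M, σ, and c_{b_h} are my own illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Model sizes and noise level (hypothetical values for illustration).
L, H, M = 20, 3, 50
sigma = 0.1
c_b = np.array([2.0, 1.0, 0.5])  # prior scales c_{b_h}, one per column of B

# Draw the factors according to the priors:
#   A_{m,h} ~ N(0, 1)
#   B_{l,h} ~ N(0, c_{b_h}^2)   (i.e., C_B = diag(c_b^2))
#   E_{l,m} ~ N(0, sigma^2)
A = rng.standard_normal((M, H))
B = rng.standard_normal((L, H)) * c_b  # broadcast scales each column
E = sigma * rng.standard_normal((L, M))

# Observation: Y = B A^T + E, an L x M matrix.
Y = B @ A.T + E
```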
Bayesian Inference
Bayes posterior: p(A, B|Y) ∝ p(Y|A, B) φ_A(A) φ_B(B)
Bayes estimator: Û^Bayes = ⟨BA^T⟩_{p(A,B|Y)}
Computationally intractable!
Free Energy
Trial posterior: r(A, B)
Free energy: F(r) = ⟨ log( r(A, B) / (p(Y|A, B) φ_A(A) φ_B(B)) ) ⟩_{r(A,B)}
            = KL( r(A, B) ‖ p(A, B|Y) ) + const.
p(A, B|Y) = argmin_r F(r)
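The identity F(r) = KL(r ‖ p(A, B|Y)) + const. is what makes free-energy minimization meaningful: the unconstrained minimizer is the Bayes posterior itself. A toy univariate Gaussian check (the posterior parameters here are made up for illustration):

```python
import numpy as np

def kl_gauss(mu, s2, mu0, s20):
    # KL( N(mu, s2) || N(mu0, s20) ) for univariate Gaussians.
    return 0.5 * np.log(s20 / s2) + (s2 + (mu - mu0) ** 2) / (2 * s20) - 0.5

# Stand-in "Bayes posterior" N(mu0, s20); trial posteriors N(mu, s20).
mu0, s20 = 1.3, 0.4
mus = np.linspace(-2.0, 4.0, 601)
kls = np.array([kl_gauss(mu, s20, mu0, s20) for mu in mus])

# Up to a constant, the free energy equals this KL divergence, so it is
# minimized exactly where the trial posterior matches the Bayes posterior.
best = mus[np.argmin(kls)]
```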
Approximate Bayesian Inference
Free energy: F(r) = ⟨ log( r(A, B) / (p(Y|A, B) φ_A(A) φ_B(B)) ) ⟩_{r(A,B)}
Free-energy minimization under some constraint on the posterior.
Approximate estimator: Û^approx = ⟨BA^T⟩_{r(A,B|Y)},
where r(A, B|Y) = argmin_r F(r)  s.t. r(A, B|Y) ∈ G.
G specifies the approximation method.
Bayes Posterior
[Figure: contour plots of the Bayes posterior over (A, B) for L = M = H = 1 with noise variance σ² = 1 (cf. Figures 4 and 5 of the paper: Bayes posteriors for V = 0, with finite c_a = c_b and with c_a = c_b = c → ∞; solid lines indicate the solutions).]
Variational Bayesian Methods
VB — matrix-wise independence assumption [Bishop 2001, Lim & Teh 2007]:
r^VB(A, B|Y) = r(A) r(B)
SimpleVB — column-wise independence assumption [Raiko et al. 2007]:
r^SimpleVB(A, B|Y) = ∏_{h=1}^H r(a_h) ∏_{h=1}^H r(b_h),
where A = (a_1, …, a_H), B = (b_1, …, b_H).
An analytic-form solution has been obtained [Nakajima et al. NIPS 2010].
Partially Bayesian Methods
PB-A — A is integrated out, and B is point-estimated [Tipping & Bishop 99]; the posterior on B is restricted to a δ-function:
r^PB-A(A, B|Y) = r(A) δ(B; B*)
PB-B — B is integrated out, and A is point-estimated; the posterior on A is restricted to a δ-function:
r^PB-B(A, B|Y) = δ(A; A*) r(B)
PB — choose the one giving the lower free energy:
r^PB(A, B|Y) = r^PB-A(A, B|Y) if F(r^PB-A) ≤ F(r^PB-B), and r^PB-B(A, B|Y) otherwise.
MAP Estimation
MAP — A and B are both point-estimated; the posteriors on A and B are restricted to δ-functions:
r^MAP(A, B|Y) = δ(A; A*) δ(B; B*)
We discuss two issues
φ_B(B) ∝ exp( −tr(B C_B^{−1} B^T) / 2 )
There are two major differences between VB and PB-A. In (empirical) VB,
- not only A but also B is integrated out (PB-A → VB);
- automatic relevance determination (ARD) is realized through the empirical Bayesian procedure (C_B is also estimated from the observation).
1. Are these properties essential for automatic dimensionality selection (ADS)?
→ No! (shown theoretically)
[PB-A: Tipping & Bishop 2001; (Empirical) VB: Bishop 2001]
We discuss two issues
VB needs iteration; SimpleVB has an analytic solution [Nakajima et al. NIPS 2010]:
r^VB(A, B|Y) = r(A) r(B)   [Bishop 2001, Lim & Teh 2007]
r^SimpleVB(A, B|Y) = ∏_{h=1}^H r(a_h) ∏_{h=1}^H r(b_h)   [Raiko et al. 2007]
2. Does the column-wise independence degrade the approximation quality?
→ No! (shown empirically)
Table 2: Properties of approximate PPCA methods. Upper methods have weaker constraints on the posterior distribution. ‘✓’ means the method positively possesses the corresponding property, ‘△’ means weakly, and ‘×’ means not.

Method      | Automatic dimensionality selection | Empirical Bayes | Analytic-form solution
VB          | ✓                                  | ✓               | ×
simple-VB   | ✓                                  | ✓               | ✓
PB          | ✓                                  | ×               | ✓
PB-A (PB-B) | △                                  | ×               | ✓
MAP         | ×                                  | ×               | ✓
PB Solution
likelihood: p(Y|A, B) ∝ exp( −‖Y − BA^T‖²_Fro / (2σ²) )
priors: φ_A(A) ∝ exp( −‖A‖²_Fro / 2 ),  φ_B(B) ∝ exp( −tr(B C_B^{−1} B^T) / 2 )
The PB-A solution was derived for c_{b_h} = ∞ [Tipping & Bishop 99]. Here, we derive the PB solution for the general case, and show its similarity to the SimpleVB solution when σ² and C_B are given.
PB Solution
[Theorem 2] Let Y = Σ_{h=1}^H γ_h ω_{b_h} ω_{a_h}^T be the SVD of Y. The PB solution (the space spanned by the loading vectors) is given by
S^PB = span( Σ_{h=1}^H θ(γ_h > γ_h^PB) ω_{b_h} ω_{a_h}^T ),
where γ_h^PB = σ √( max(L, M) + σ²/c²_{b_h} ),
θ(·) = 1 if the argument is true and 0 otherwise, and span(W) denotes the space spanned by the column vectors of W.
The space spanned by the singular vectors with small singular values is neglected: ADS occurs with PB, without empirical Bayes!
SimpleVB Solution [Nakajima et al. NIPS 2010]
[Theorem 3] The SimpleVB solution is given by
S^SimpleVB = span( Σ_{h=1}^H θ(γ_h > γ_h^SimpleVB) ω_{b_h} ω_{a_h}^T ),
where γ_h^SimpleVB = σ √( κ_h + √( κ_h² − LM ) ),  κ_h = (L + M)/2 + σ²/(2 c²_{b_h}).
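The two thresholds in Theorems 2 and 3 can be compared numerically; a small sketch (the parameter values are arbitrary illustrations):

```python
import numpy as np

def gamma_pb(L, M, sigma2, cb2):
    # Theorem 2: gamma_h^PB = sigma * sqrt( max(L, M) + sigma^2 / c_bh^2 )
    return np.sqrt(sigma2 * (max(L, M) + sigma2 / cb2))

def gamma_simplevb(L, M, sigma2, cb2):
    # Theorem 3: gamma_h^SimpleVB = sigma * sqrt( kappa + sqrt(kappa^2 - L*M) ),
    # with kappa = (L + M)/2 + sigma^2 / (2 c_bh^2).
    kappa = (L + M) / 2 + sigma2 / (2 * cb2)
    return np.sqrt(sigma2 * (kappa + np.sqrt(kappa ** 2 - L * M)))

L, M, sigma2 = 20, 50, 1.0
for cb2 in (0.25, 1.0, 4.0):
    print(cb2, gamma_pb(L, M, sigma2, cb2), gamma_simplevb(L, M, sigma2, cb2))

# Flat-prior limit: kappa -> (L+M)/2 and kappa^2 - L*M = ((L-M)/2)^2,
# so both thresholds reduce to sigma * sqrt(max(L, M)).
flat_pb = gamma_pb(L, M, sigma2, 1e12)
flat_svb = gamma_simplevb(L, M, sigma2, 1e12)
```

Numerically the two thresholds stay close for moderate c_{b_h} and coincide as c_{b_h} → ∞, matching the comparison on the following slides.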
PB vs SimpleVB
(L: original dimensionality, H: dimensionality of latent feature, M: number of samples)
[Figure 2: behavior of the thresholds γ_h^simple-VB, γ_h^PB, γ_h^PB-A, and γ_h^MAP as functions of c_{b_h}, for (L, M) = (20, 50) and (20, 10); the noise variance is set to σ² = 1.]
PB and SimpleVB behave similarly.
PB vs SimpleVB
[Figure 3: behavior of the thresholds when the number of samples M is changed, for L = 20 with c_{b_h} = 1.0 and with c_{b_h} = ∞.]
PB and SimpleVB coincide when c_{b_h} → ∞.
PB-A behaves differently when M < L.
Property of PB Solution
‣ ADS occurs:
  ‣ integrating out a part of the parameters can lead to ADS;
  ‣ empirical Bayes is not necessary for inducing ADS.
However,
‣ Empirical PB (where the hyperparameters are estimated from the observation) cannot be applied: the global solution is the trivial one (b̂*_h = 0, c_{b_h} → 0), since the c_{b_h}-dependent part of the PB-A free energy F^PB-A, which contains the terms L log c²_{b_h} + b̂*²_h / c²_{b_h}, is unbounded below.
‣ σ² tends to be underestimated.
VB (or SimpleVB) is more convenient.
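The collapse of empirical PB can be seen numerically. Under the decomposition sketched above, the c_{b_h}-dependent terms behave like L log c² + b²/c²; this snippet evaluates only those schematic terms (not the full free energy):

```python
import numpy as np

def cb_terms(b, c, L=20):
    # Schematic c_bh-dependent terms of the PB-A free energy:
    #   L * log(c^2) + b^2 / c^2
    return L * np.log(c ** 2) + b ** 2 / c ** 2

# At the trivial point b = 0, the terms diverge to -inf as c -> 0,
# so empirical minimization over c_bh collapses to b* = 0, c_bh -> 0.
vals = [cb_terms(0.0, c) for c in (1.0, 1e-2, 1e-4, 1e-8)]
```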
VB Solution [Bishop 99, Lim & Teh 07]
r^VB(A, B) = r(A) r(B)
The VB posterior is Gaussian:
r^VB(A, B) = ∏_{m=1}^M N_H(ã_m; μ_{ã_m}, Σ_Ã) ∏_{l=1}^L N_H(b̃_l; μ_{b̃_l}, Σ_B̃),
where the parameters necessarily satisfy
μ_{ã_m} = (1/σ²) Σ_Ã (μ_{b̃_1}, …, μ_{b̃_L}) y_m,   (14)
μ_{b̃_l} = (1/σ²) Σ_B̃ (μ_{ã_1}, …, μ_{ã_M}) ỹ_l,   (15)
Σ_Ã = σ² ( Σ_{l=1}^L ( μ_{b̃_l} μ_{b̃_l}^T + Σ_B̃ ) + σ² I_H )^{−1},   (16)
Σ_B̃ = σ² ( Σ_{m=1}^M ( μ_{ã_m} μ_{ã_m}^T + Σ_Ã ) + σ² C_B^{−1} )^{−1}.   (17)
A local minimizer is obtained by iterating (14)–(17) until convergence; the VB-PCA solution is then S^VB = span( (μ_{b̃_1}, …, μ_{b̃_L})^T ).
Iterative algorithm needed.
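A minimal numpy sketch of the iterative updates (14)–(17) as written above (the matrix shapes, random initialization, and fixed C_B = I are my own choices for illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from the model: Y = B A^T + noise, true rank 3, H = 5.
L, M, H, sigma2 = 20, 100, 5, 0.01
Y = rng.standard_normal((L, 3)) @ rng.standard_normal((3, M))
Y = Y + np.sqrt(sigma2) * rng.standard_normal((L, M))

CB_inv = np.eye(H)                  # C_B^{-1} with all c_bh = 1 (fixed here)
MA = rng.standard_normal((M, H))    # rows: posterior means mu_a_m
MB = rng.standard_normal((L, H))    # rows: posterior means mu_b_l
SA = np.eye(H)                      # Sigma_A~
SB = np.eye(H)                      # Sigma_B~

for _ in range(200):
    # (16): Sigma_A~ = sigma^2 ( sum_l mu_bl mu_bl^T + L Sigma_B~ + sigma^2 I )^-1
    SA = sigma2 * np.linalg.inv(MB.T @ MB + L * SB + sigma2 * np.eye(H))
    # (14): mu_a_m = Sigma_A~ (mu_b1, ..., mu_bL) y_m / sigma^2, all m at once
    MA = Y.T @ MB @ SA / sigma2
    # (17): Sigma_B~ = sigma^2 ( sum_m mu_am mu_am^T + M Sigma_A~ + sigma^2 C_B^-1 )^-1
    SB = sigma2 * np.linalg.inv(MA.T @ MA + M * SA + sigma2 * CB_inv)
    # (15): mu_b_l = Sigma_B~ (mu_a1, ..., mu_aM) y~_l / sigma^2, all l at once
    MB = Y @ MA @ SB / sigma2

# After convergence the posterior means reconstruct the low-rank signal.
rel_err = np.linalg.norm(Y - MB @ MA.T) / np.linalg.norm(Y)
```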
Empirical VB [Bishop 99, Lim & Teh 07]
The hyperparameter C_B is also estimated from the observation: a local minimizer of the free energy with respect to {c_{b_h}} is obtained by applying, in each iteration, the update rule
c²_{b_h} = ‖μ_{b_h}‖²/L + (Σ_B̃)_{hh},   (31)
where μ_{b_h} is the posterior mean of the h-th column of B.
Iterative algorithm needed.
Empirical VB vs Empirical SimpleVB
Empirical VB: r^VB(A, B) = r(A) r(B) — iterative algorithm needed.
Empirical SimpleVB: r^SimpleVB(A, B|Y) = ∏_{h=1}^H r(a_h) ∏_{h=1}^H r(b_h) — an analytic solution is available [Nakajima et al. NIPS 2010]:
[Theorem 4] The Empirical SimpleVB (simple-EVB) solution is given by the SimpleVB form with the threshold
γ_h^simple-EVB = σ(√L + √M) if Δ_h < 0, and ∞ otherwise,
where
Δ_h = M log( γ_h γ̆_h/(M σ²) + 1 ) + L log( γ_h γ̆_h/(L σ²) + 1 ) + ( LM c̆²_{b_h} − 2 γ_h γ̆_h )/σ²,
c̆²_{b_h} = ( τ_h + √( τ_h² − 4LM σ⁴ ) )/(2LM),  τ_h = γ_h² − (L + M) σ²,
and γ̆_h is the SimpleVB solution for c_{b_h} = c̆_{b_h}.
Thanks to Theorems 3 and 4, the (empirical) SimpleVB solution can be computed analytically in a very efficient way, while the ordinary VB-PCA requires a number of iterations to find a local optimum.
VB vs SimpleVB 1
Estimated dimension and computation time on an artificial dataset (M = 300, L = 100).
[Figure 1: EVB-PCA vs. simple-EVB-PCA. Left: estimated dimensions Ĥ against the true dimensions H*. Right: running times (s) against H*.]
- The solutions coincide.
- SimpleVB is faster: simple-EVB-PCA is computationally more efficient than EVB-PCA.
VB vs SimpleVB 2
Estimated dimension on benchmark datasets.
Table 1: Estimated PCA dimensions of real datasets. Ĥ and Ĥ^simple denote the PCA dimensions estimated by EVB-PCA (EmpiricalVB) and simple-EVB-PCA (Empirical SimpleVB), respectively.

Data set       | M     | L  | # Classes | Ĥ  | Ĥ^simple
Chart          | 600   | 60 | 6         | 11 | 10
Glass          | 214   | 9  | 6         | 7  | 7
Wine           | 178   | 13 | 3         | 7  | 8
Optical Digits | 5620  | 64 | 10        | 56 | 56
Satellite      | 6435  | 36 | 6         | 32 | 31
Segmentation   | 2310  | 19 | 7         | 12 | 13
Letter         | 20000 | 16 | 26        | 15 | 15

The datasets are taken from the UCI repository (Asuncion & Newman, 2007).
The solutions almost coincide. SimpleVB is more advantageous because the analytic solution is available!
Conclusion
‣ We derived the analytic PB solution for matrix factorization, and theoretically showed its ADS effect.
‣ We empirically showed that SimpleVB gives performance similar to VB.
‣ We recommend using SimpleVB!
(See Table 2 for a summary of the properties of the approximate PPCA methods.)