EM for Probabilistic LDA

Niko Brümmer

February 2010

1  Model

Let observation j of speaker i be m_ij and let it be modeled as:

    m_ij = V y_i + U x_ij + z_ij                                        (1)

where

    y_i  ~ N(0, I)                                                      (2)
    x_ij ~ N(0, I)                                                      (3)
    z_ij ~ N(0, D^{-1})                                                 (4)

where the dimensions of x and y may be smaller than that of m and where D is a diagonal precision matrix. The model parameter that we want to estimate via the EM algorithm is λ = (V, U, D); the hidden variables are represented by all the y_i and x_ij. Note that z_ij is not also hidden, because if m_ij, y_i and x_ij are given, then z_ij is determined.
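As a concrete illustration (not part of the original note), the following Python/numpy sketch samples the observations of one speaker from the generative model (1)-(4). The function name and the convention of storing the diagonal precision D as a vector d are illustrative assumptions.

    import numpy as np

    def sample_speaker(V, U, d, n_i, rng):
        # Draw n_i observations m_ij for one speaker from (1)-(4).
        # V: dim_m x dim_y, U: dim_m x dim_x, d: diagonal of the precision D.
        dim_m, dim_y = V.shape
        dim_x = U.shape[1]
        y = rng.standard_normal(dim_y)                # y_i ~ N(0, I), one per speaker
        X = rng.standard_normal((dim_x, n_i))         # x_ij ~ N(0, I), one per observation
        Z = rng.standard_normal((dim_m, n_i)) / np.sqrt(d)[:, None]   # z_ij ~ N(0, D^{-1})
        return V @ y[:, None] + U @ X + Z             # columns are m_ij, eq. (1)

For example, rng = np.random.default_rng(0) gives a reproducible generator to pass in.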

1.1  Data

We are given N observations of the form m_ij, for K speakers, so that i = 1, ..., K. There are n_i observations per speaker, so that j = 1, ..., n_i. We denote the matrix of all the observations for speaker i as M_i = [m_i1 ··· m_in_i]. The zero-order statistic for speaker i is n_i and the global zero-order statistic is N = Σ_{i=1}^K n_i. The first-order statistic for speaker i is:

    f_i = Σ_{j=1}^{n_i} m_ij                                            (5)

and the global second-order statistic is:

    S = Σ_ij m_ij m_ij'.                                                (6)
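A small sketch of how these statistics might be accumulated in Python/numpy (an assumption of this edit, not code from the note); the input is a list of per-speaker data matrices M_i with observations as columns.

    import numpy as np

    def sufficient_stats(speaker_data):
        # speaker_data: list of K arrays, each dim_m x n_i, columns are m_ij.
        n = np.array([M.shape[1] for M in speaker_data])               # zero-order stats n_i
        F = np.stack([M.sum(axis=1) for M in speaker_data], axis=1)    # columns f_i, eq. (5)
        N = int(n.sum())                                               # global zero-order stat
        S = sum(M @ M.T for M in speaker_data)                         # eq. (6)
        return n, F, N, S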

1.2  Prior

The joint prior for the hidden variables for speaker i is:

    p(y_i, X_i) = p(y_i) p(X_i) ∝ exp( -½ y_i' y_i - ½ tr(X_i' X_i) ),  (7)

where X_i = [x_i1 ··· x_in_i].

1.3  Likelihood

The complete-data likelihood for speaker i is:

    p(M_i | y_i, X_i, λ) = Π_{j=1}^{n_i} N(m_ij | V y_i + U x_ij, D^{-1})                    (8)

    ∝ exp( -½ Σ_{j=1}^{n_i} (m_ij - V y_i - U x_ij)' D (m_ij - V y_i - U x_ij) )             (9)

    ∝ exp( Σ_{j=1}^{n_i} [ -½ m_ij' D m_ij + m_ij' D V y_i + m_ij' D U x_ij                  (10)
           - ½ y_i' V' D V y_i - y_i' V' D U x_ij - ½ x_ij' U' D U x_ij ] )                  (11)

1.4  Joint p(M_i, y_i, X_i | λ)

    p(M_i, y_i, X_i | λ) ∝ exp( -½ y_i' L_i y_i + Σ_{j=1}^{n_i} [ -½ m_ij' D m_ij
           + m_ij' D V y_i + m_ij' D U x_ij - x_ij' J y_i - ½ x_ij' K x_ij ] )               (12)

where

    J   = U' D V                                                        (13)
    K   = U' D U + I                                                    (14)
    L_i = n_i V' D V + I                                                (15)

1.5  Posterior

We assemble the joint posterior from two factors:

    p(y_i, X_i | M_i, λ) = p(X_i | y_i, M_i, λ) p(y_i | M_i, λ),        (16)

which we find below.

1.5.1  Outer posterior

The conditional posterior for X_i is:

    p(X_i | M_i, y_i, λ) ∝ p(X_i, M_i, y_i | λ)                                              (17)
    ∝ exp( Σ_{j=1}^{n_i} [ x_ij' (U' D m_ij - J y_i) - ½ x_ij' K x_ij ] )                    (18)
    ∝ exp( Σ_{j=1}^{n_i} [ x_ij' (K x̃_ij - J y_i) - ½ x_ij' K x_ij ] )                       (19)
    ∝ exp( Σ_{j=1}^{n_i} [ x_ij' K x̂_ij - ½ x_ij' K x_ij ] )                                 (20)
    ∝ Π_j N(x_ij | x̂_ij, K^{-1}),                                                           (21)

where

    K x̃_ij = U' D m_ij,        i.e.  x̃_ij = K^{-1} U' D m_ij,                               (22)
    K x̂_ij = K x̃_ij - J y_i,   i.e.  x̂_ij = x̃_ij - K^{-1} J y_i.                            (23)
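In code, (22) and (23) amount to two linear solves against K. The following Python/numpy sketch (illustrative, with D again stored as its diagonal d) computes the posterior means for all observations of one speaker at a given y_i.

    import numpy as np

    def outer_posterior_means(U, d, J, K, M, y):
        # M: dim_m x n_i data matrix for one speaker; y: the conditioning value of y_i.
        # The posterior covariance is K^{-1} for every j, per eq. (21).
        UtD = U.T * d                                        # U'D, with D = diag(d)
        Xtilde = np.linalg.solve(K, UtD @ M)                 # x~_ij = K^{-1} U'D m_ij, eq. (22)
        Xhat = Xtilde - np.linalg.solve(K, J @ y)[:, None]   # x^_ij = x~_ij - K^{-1} J y_i, eq. (23)
        return Xtilde, Xhat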

1.5.2  Inner posterior

    p(y_i | M_i, λ) ∝ p(y_i, M_i | λ) = p(y_i, X_i, M_i | λ) / p(X_i | y_i, M_i, λ) |_{X_i = 0}     (24)

    ∝ exp( -½ y_i' L_i y_i + Σ_j m_ij' D V y_i ) / exp( -½ Σ_j x̂_ij' K x̂_ij )                       (25)

Now expand:

    ½ Σ_j x̂_ij' K x̂_ij = ½ Σ_j (x̃_ij' - y_i' J' K^{-1}) (K x̃_ij - J y_i)                            (26)
    = (n_i/2) y_i' J' K^{-1} J y_i - y_i' J' x̃_i + const,                                            (27)

where

    x̃_i = Σ_j x̃_ij.                                                     (28)

Now use this in (25):

    p(y_i | M_i, λ) ∝ exp( -½ y_i' P_i y_i + y_i' P_i ŷ_i )             (29)
    ∝ N(y_i | ŷ_i, P_i^{-1}),                                           (30)

where

    P_i = n_i (V' D V - J' K^{-1} J) + I,                               (31)
    P_i ŷ_i = V' D f_i - J' x̃_i.                                        (32)
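Note that x̃_i = Σ_j K^{-1} U' D m_ij = K^{-1} U' D f_i, so the inner posterior depends on the data only through n_i and f_i. A Python/numpy sketch under the same illustrative conventions as above:

    import numpy as np

    def inner_posterior(V, U, d, J, K, n_i, f_i):
        # Returns the posterior precision P_i and mean y^_i of eqs. (31)-(32).
        VtD = V.T * d                                            # V'D, with D = diag(d)
        P = n_i * (VtD @ V - J.T @ np.linalg.solve(K, J)) + np.eye(V.shape[1])   # eq. (31)
        xtilde_i = np.linalg.solve(K, (U.T * d) @ f_i)           # x~_i = K^{-1} U'D f_i
        yhat = np.linalg.solve(P, VtD @ f_i - J.T @ xtilde_i)    # eq. (32)
        return P, yhat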

1.6  Marginal (EM Objective)

    p(M_i | λ) = [ p(M_i | X_i, y_i, λ) p(X_i) p(y_i) ] / [ p(X_i | y_i, M_i, λ) p(y_i | M_i, λ) ],
                 evaluated at y_i = 0, X_i = 0.                          (33)

2  EM algorithm

In this section we derive formulas for an EM algorithm (with minimum divergence) for the model described in the previous section. The EM algorithm finds a maximum-likelihood (ML) estimate for the parameter λ of the model. We devote subsections to the E-step, the M-step and the minimum-divergence (MD) step.

2.1  EM auxiliary

    Q̃ = ⟨ Σ_i log p(M_i | y_i, X_i, λ) ⟩ + const                                            (34)

    = ⟨ Σ_ij ½ log|D| - ½ (m_ij - W z_ij)' D (m_ij - W z_ij) ⟩                               (35)

    = ⟨ Σ_ij ½ log|D| - ½ m_ij' D m_ij - ½ z_ij' W' D W z_ij + m_ij' D W z_ij ⟩              (36)

    = (N/2) log|D| - ½ tr(S D) - ½ tr(R W' D W) + tr(T D W)                                  (37)

where

    z_ij = [x_ij; y_i],         W = [U V],                              (38)

    R = Σ_ij ⟨z_ij z_ij'⟩,      S = Σ_ij m_ij m_ij',                    (39)

    T = Σ_ij ⟨z_ij⟩ m_ij',      N = Σ_i n_i.                            (40)
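Since Q̃ depends on the data only through S, T, R and N, eq. (37) gives a cheap way to monitor convergence, up to an additive constant. A sketch (again Python/numpy, with D stored as its diagonal d; this helper is an assumption of this edit, not in the original note):

    import numpy as np

    def em_auxiliary(W, d, S, R, T, N):
        # Evaluate Q~ of eq. (37); W = [U V], D = diag(d).
        D = np.diag(d)
        return (0.5 * N * np.sum(np.log(d))            # (N/2) log|D| for diagonal D
                - 0.5 * np.trace(S @ D)
                - 0.5 * np.trace(R @ W.T @ D @ W)
                + np.trace(T @ D @ W))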

2.2  M-step

Differentiating w.r.t. W and setting to zero gives (independently of D):

    W' = R^{-1} T.                                                      (41)

Differentiating w.r.t. D, setting to zero and solving gives:

    D^{-1} = (1/N) (S + W R W' - 2 W T)                                 (42)
           = (1/N) (S - W T),                                           (43)

where we used (41) for simplification. We can zero the off-diagonals to make D diagonal.¹ If we want to further constrain D to be isotropic, so that D = dI, then we find:

    1/d = 1/(N 𝒟) tr(S + W R W' - 2 W T)                                (44)
        = 1/(N 𝒟) tr(S - W T),                                          (45)

where 𝒟 is the dimensionality of the observations.

¹ See Tom Minka's Matrix Calculus tutorial.
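A minimal sketch of this M-step in Python/numpy, using the simplified form (43) and keeping D diagonal as suggested above (the zeroing of the off-diagonals happens by taking only the diagonal of the covariance estimate):

    import numpy as np

    def m_step(S, R, T, N):
        W = np.linalg.solve(R, T).T        # eq. (41): W' = R^{-1} T
        cov = (S - W @ T) / N              # eq. (43): estimate of D^{-1}
        d = 1.0 / np.diag(cov)             # zero off-diagonals, so D = diag(d)
        return W, d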

2.3  Expectations

To complete the M-step, we need to express T and R in terms of the posteriors that we found in section 1.5:

    T = Σ_ij ⟨z_ij⟩ m_ij' = Σ_ij ⟨[x_ij; y_i]⟩ m_ij' = [T_x; T_y],                           (46)

where

    T_y = Σ_ij ⟨y_i⟩ m_ij' = Σ_i ŷ_i f_i',                                                   (47)

    T_x = Σ_ij ⟨x̂_ij(y)⟩ m_ij' = Σ_ij K^{-1} (U' D m_ij - J ŷ_i) m_ij'                        (48)
        = K^{-1} U' D Σ_ij m_ij m_ij' - K^{-1} J Σ_i ŷ_i f_i'                                 (49)
        = K^{-1} (U' D S - J T_y).                                                           (50)

Finally:

    R = Σ_ij ⟨ [x_ij; y_i] [x_ij' y_i'] ⟩ = [ R_xx  R_xy ; R_xy'  R_yy ],                    (51)

where

    R_yy = Σ_ij ⟨y_i y_i'⟩ = Σ_i n_i (P_i^{-1} + ŷ_i ŷ_i'),                                  (52)

    R_xy = Σ_ij ⟨x_ij y_i'⟩ = Σ_ij ⟨ K^{-1} (U' D m_ij - J y_i) y_i' ⟩                        (53)
         = K^{-1} (U' D T_y' - J R_yy),                                                      (54)

    R_xx = Σ_ij ⟨x_ij x_ij'⟩ = N K^{-1} + Σ_ij ⟨x̂_ij(y) x̂_ij(y)'⟩,                           (55)

where

    Σ_ij ⟨x̂_ij(y) x̂_ij(y)'⟩                                                                  (56)
    = Σ_ij ⟨ K^{-1} (U' D m_ij - J y_i) (m_ij' D U - y_i' J') K^{-1} ⟩                        (57)
    = K^{-1} [ Σ_ij U' D m_ij m_ij' D U - U' D m_ij ⟨y_i'⟩ J' - J ⟨y_i⟩ m_ij' D U
               + J ⟨y_i y_i'⟩ J' ] K^{-1}                                                    (58)
    = K^{-1} (U' D S D U - U' D T_y' J' - J T_y D U + J R_yy J') K^{-1}.                     (59)
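Pulling (46)-(59) together, here is an E-step sketch in Python/numpy; it reuses the illustrative inner_posterior helper from section 1.5.2 and the statistics n, F, N, S from section 1.1, all assumptions of this edit rather than code from the note:

    import numpy as np

    def e_step(V, U, d, n, F, N, S):
        # n: counts n_i; F: matrix with columns f_i; S, N: global statistics.
        J = (U.T * d) @ V                              # eq. (13)
        K = (U.T * d) @ U + np.eye(U.shape[1])         # eq. (14)
        dim_y = V.shape[1]

        Ty = np.zeros((dim_y, V.shape[0]))
        Ryy = np.zeros((dim_y, dim_y))
        for i in range(len(n)):
            P, yhat = inner_posterior(V, U, d, J, K, n[i], F[:, i])
            Ty += np.outer(yhat, F[:, i])                            # eq. (47)
            Ryy += n[i] * (np.linalg.inv(P) + np.outer(yhat, yhat))  # eq. (52)

        UtD = U.T * d
        Kinv = np.linalg.inv(K)
        Tx = Kinv @ (UtD @ S - J @ Ty)                               # eq. (50)
        Rxy = Kinv @ (UtD @ Ty.T - J @ Ryy)                          # eq. (54)
        Rxx = N * Kinv + Kinv @ (UtD @ S @ UtD.T - UtD @ Ty.T @ J.T
                                 - J @ Ty @ UtD.T + J @ Ryy @ J.T) @ Kinv   # eqs. (55)-(59)

        T = np.vstack([Tx, Ty])                        # eq. (46)
        R = np.block([[Rxx, Rxy], [Rxy.T, Ryy]])       # eq. (51)
        return T, R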

2.4  MD-step

Here we temporarily allow a more general prior for the hidden variables:

    p(y_i) = N(y_i | 0, 𝒴),                                             (60)
    p(x_ij | y_i) = N(x_ij | G y_i, 𝒳),                                 (61)

and then maximize the following complementary auxiliary w.r.t. the new prior parameters:

    Q̆ = ⟨ Σ_i [ log N(y_i | 0, 𝒴) + Σ_j log N(x_ij | G y_i, 𝒳) ] ⟩      (62)
      = Σ_i ⟨log N(y_i | 0, 𝒴)⟩ + Σ_i Σ_j ⟨log N(x_ij | G y_i, 𝒳)⟩.      (63)

This maximization gives:

    𝒴 = (1/K) Σ_{i=1}^{K} (P_i^{-1} + ŷ_i ŷ_i'),                        (64)
    G' = R_yy^{-1} R_xy',                                               (65)
    𝒳 = (1/N) (R_xx - G R_xy').                                         (66)

These non-standard priors can now be transformed back to standard form, by absorbing their effects into U and V:

    U → U chol(𝒳)',                                                     (67)
    V → V chol(𝒴)' + U G,                                               (68)

where U on the RHS of (68) is the new value and where chol(𝒳) chol(𝒳)' = 𝒳 denotes the Cholesky decomposition.
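A sketch of the MD-step in Python/numpy (illustrative; 𝒴 is assumed to have been accumulated during the E-step as in (64), and numpy's cholesky returns the lower-triangular factor, matching the chol(·) chol(·)' convention):

    import numpy as np

    def md_step(U, V, Y, Rxx, Rxy, Ryy, N):
        # Y: the matrix of eq. (64), (1/K) sum_i (P_i^{-1} + y^_i y^_i').
        G = np.linalg.solve(Ryy, Rxy.T).T                 # eq. (65): G' = Ryy^{-1} Rxy'
        X = (Rxx - G @ Rxy.T) / N                         # eq. (66)
        U_new = U @ np.linalg.cholesky(X).T               # eq. (67)
        V_new = V @ np.linalg.cholesky(Y).T + U_new @ G   # eq. (68), with the new U
        return U_new, V_new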
