Bayesian Two-Covariance Model Integrating Out the Speaker Space Distribution

Jesús Villalba, Niko Brümmer

October 20, 2010

1 Introduction

The two-covariance model presented in [1] for speaker verification is a two-step process:

• Given a development database, make an ML point estimate of the two-covariance model by EM iterations.

• Given a train and a test segment and the estimated model, calculate the target vs. non-target posterior probability.

In this work we give a Bayesian treatment of the two-covariance model, as described in [2]. Using the posterior of the model given the data, we integrate out the parameters describing the speaker distribution.

2 The Model

2.1 Notation

We introduce some notation:

• Let Φd be the development i-vector dataset.
• Let Φt = {l, r} be the test i-vectors.
• Let Φ = Φd ∪ Φt.
• Let Φ denote any of the previous datasets.
• Let θd be the labelling of the development dataset. It partitions the Nd i-vectors into Md speakers.
• Let θt be the labelling of the test set, so that θt ∈ {T, N}, where T is the hypothesis that l and r belong to the same speaker and N is the hypothesis that they belong to different speakers.
• Let θ = θd ∧ θt be the labelling of Φ.
• Let θ denote any of the previous labellings.
• Let Si be the set of i-vectors belonging to speaker i.
• Let Yd be the speaker identity variables of the development set. We have as many identity variables as speakers.
• Let Yt be the speaker identity variables of the test set.
• Let Y = Yd ∪ Yt.
• Let Y denote any of the previous sets of speaker identity variables.
• Let π = (PT, PN) be the hypothesis prior, where PT is the prior probability of a target trial and PN = 1 − PT is the prior probability of a non-target trial.
• Let M = (µ, B, W) be the set of all parameters of the model and My = (µ, B).

2.2 The two-covariance model

We take a linear-Gaussian generative model M. We assume that an i-vector φ of speaker s can be written as

φ = y_s + z    (1)

where y_s is the speaker identity variable and z is a channel offset. We assume the following probability distributions:

P(y|M) = N(y|µ, B^{-1})    (2)
P(φ|y, M) = N(φ|y, W^{-1})    (3)

where N denotes a Gaussian distribution; µ is the speaker mean; B^{-1} is the between-class covariance matrix; W^{-1} is the within-class covariance matrix; and B and W are the corresponding precision matrices.

3 Bayesian two-covariance model

Following a Bayesian treatment, instead of assuming fixed values for µ and B we work with a probability distribution over the model parameters. For this, we need a prior for the model, P(My|Π), and we calculate the posterior distribution of the model given the data, the labelling, W and the prior:

P(My|Φ, θ, W, Π)    (4)

Figure 1 shows the graphical representation of this model. Φd, Φt and θd are observed variables; µ, B and θt are hidden variables; W is assumed to be known; and π and Π are the priors over the hidden variables. We can use the rules described in [3] on this graphical model to determine the dependencies between the variables.

Figure 1: Graphical model of the Bayesian two-covariance model.

We use a non-informative prior (Jeffreys prior) for the parameters µ and B of the speaker Gaussian distribution, as in [4]. For simplicity, we assume that W is known; in the experiments we use its ML point estimate.

4 Likelihood ratio given a known model

Given a model M we can calculate the ratio of the posterior probabilities of target and non-target as shown in [1]:

P(T|Φt, M, π) / P(N|Φt, M, π) = (PT/PN) · P(Φt|T, M) / P(Φt|N, M) = (PT/PN) · R(Φt, M)    (5)

where we have defined the plug-in likelihood ratio R(Φt, M). To obtain this ratio we need to calculate P(Φ|θ, M). Given a model M, the identity variables y1, y2, ..., yM ∈ Y are sampled independently from P(y|M).


Moreover, given M and a speaker i, the set Si of i-vectors produced by that speaker is drawn independently from P(φ|yi, M). Using these independence assumptions we can write:

P(Φ|θ, M) = ∏_{i=1}^{M} P(Si|M)    (6)

P(Si|y, M) = ∏_{φ∈Si} P(φ|y, M)    (7)

Using Bayes' rule we can write:

P(Si, y0|M) = P(Si|y0, M) P(y0|M) = P(y0|Si, M) P(Si|M)    (8)

and then

P(Si|M) = P(Si|y0, M) P(y0|M) / P(y0|Si, M)    (9)

where y0 is any value we want as long as the denominator is not zero. Note that y0 does not appear on the LHS of the equation, so the result does not depend on it. Then, for the complete dataset:

P(Φ|θ, M) = ∏_{i=1}^{M} P(Si|y0, M) P(y0|M) / P(y0|Si, M) = K(Φ) L(θ|Φ)    (10)

where K(Φ) = ∏_{j=1}^{N} P(φj|y0, M) is a term that depends only on the dataset, not on θ, so it cancels in the ratio and we do not need to calculate it. What we need to calculate is:

L(θ|Φ) = ∏_{i=1}^{M} Q(Si)    (11)

Q(Si) = P(y0|M) / P(y0|Si, M)    (12)

and the likelihood ratio is:

R(Φt, M) = Q({l, r}) / ( Q({l}) Q({r}) )    (13)

Distribution (2) is a conjugate prior for distribution (3), so the posterior is again Gaussian:

P(y|S, M) = N(y|L^{-1}γ, L^{-1})    (14)
L = B + nW    (15)
γ = Bµ + W ∑_{φ∈S} φ    (16)

where n is the number of i-vectors in S. Setting y0 = 0 we can use this posterior to calculate Q(S):

ln Q(S) = (1/2) [ ln|B| − µ^T B µ − ln|L| + γ^T L^{-1} γ ]    (17)
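As an illustration (not part of the original derivation), the plug-in log-likelihood ratio of equations (13)-(17) can be computed directly. The following is a minimal Python sketch, assuming numpy and that mu, B and W are given as arrays; the function names are ours.

import numpy as np

def ln_Q(S, mu, B, W):
    """ln Q(S) from Eq. (17). S is an (n, d) array holding one speaker's i-vectors."""
    n, d = S.shape
    L = B + n * W                       # Eq. (15)
    gamma = B @ mu + W @ S.sum(axis=0)  # Eq. (16)
    _, lndetB = np.linalg.slogdet(B)
    _, lndetL = np.linalg.slogdet(L)
    return 0.5 * (lndetB - mu @ B @ mu - lndetL + gamma @ np.linalg.solve(L, gamma))

def plugin_log_lr(l, r, mu, B, W):
    """Plug-in log-likelihood ratio of Eq. (13) for a single trial (l, r)."""
    both = np.vstack([l, r])
    return ln_Q(both, mu, B, W) - ln_Q(l[None, :], mu, B, W) - ln_Q(r[None, :], mu, B, W)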

5 Bayesian likelihood ratio

In the Bayesian case, given the development and test data and W, we can calculate the ratio of the posterior probabilities of target and non-target as shown in [1]:

P(T|Φt, Φd, θd, W, Π, π) / P(N|Φt, Φd, θd, W, Π, π) = (PT/PN) · P(Φt|T, Φd, θd, W, Π) / P(Φt|N, Φd, θd, W, Π) = (PT/PN) · R(Φ, θd, W, Π)    (18)

where R(Φ, θd, W, Π) is the Bayesian likelihood ratio. Using Bayes' rule we can write:

P(Φt, My|θt, Φd, θd, W, Π) = P(Φt|My, θt, Φd, θd, W, Π) P(My|θt, Φd, θd, W, Π)
                           = P(My|Φ, θ, W, Π) P(Φt|θt, Φd, θd, W, Π)    (19)

Using the conditional independence assumptions given by the graphical model we have:

P(Φt|My, θt, W) P(My|Φd, θd, W, Π) = P(My|Φ, θ, W, Π) P(Φt|θt, Φd, θd, W, Π)    (20)

Then, we can write the ratio in (18) as:

R(Φ, θd, W, Π) = [P(Φt|My, T, W) / P(Φt|My, N, W)] · [P(My|Φ, θd, N, W, Π) / P(My|Φ, θd, T, W, Π)]    (21)
               = R(Φt, My, W) · P(My|Φ, θd, N, W, Π) / P(My|Φ, θd, T, W, Π)    (22)

So the Bayesian likelihood ratio is equal to the plug-in likelihood ratio multiplied by a correction factor. Note that the LHS of the equation does not depend on the model My used in the computation, so we can use whatever My we find convenient. Therefore, we now need to calculate P(My|Φ, θ, W, Π), that is,

P(My|Φ, θ, W, Π) = ∫ P(My|Y, Π) P(Y|Φ, θ, W, Π) dY    (23)

Solving this integral is complicated so we are going to use the Variational Bayes method to estimate the posterior distribution of My .

6 VB likelihood ratio

6.1 Likelihood ratio approximation

The Variational Bayes approach [3] is based on the approximation:

P(My, Y|Φ, θ, W, Π) ≈ q(My, Y) = q(My) q(Y)    (24)

We need to write R(Φ, θd, W, Π) as a function of P(My, Y|Φ, θ, W, Π). Again we use Bayes' rule:

P(My, Y|Φ, θ, W, Π) = P(Y|My, Φ, θ, W, Π) P(My|Φ, θ, W, Π)    (25)

Simplifying with the conditional independence rules:

P(My, Y|Φ, θ, W, Π) = P(Y|My, Φ, θ, W) P(My|Φ, θ, W, Π)    (26)

If we use (26) in (21):

R(Φ, θd, W, Π) = R(Φt, My, W) · [P(My, Y|Φ, θd, N, W, Π) / P(My, Y|Φ, θd, T, W, Π)] · [P(Y|My, Φ, θd, T, W) / P(Y|My, Φ, θd, N, W)]    (27)

Now we can use the fact that, given the model My, the identity variable of each speaker is drawn independently of the others:

P(Y|My, Φ, θ, W) = ∏_{i=1}^{M} P(yi|My, Si, W)    (28)

and

P(Y|My, Φ, θ, W) = P(Yt|My, Φt, θt, W) P(Yd|My, Φd, θd, W)    (29)

If we plug (29) into (27) we have

R(Φ, θd, W, Π) = R(Φt, My, W) · [P(My, Y|Φ, θd, N, W, Π) / P(My, Y|Φ, θd, T, W, Π)] · [P(Yt|My, Φt, T, W) / P(Yt|My, Φt, N, W)]    (30)

Note again that the LHS does not depend on My or Y, so we can use whatever values we want. Now we set y = y0 for all speakers in Φ, which we denote as Y = Y0. We take equation (28) and the plug-in likelihood in equation (11) and substitute them into (30). Finally, simplifying, we get:

R(Φ, θd, W, Π) = [1 / P(y0|My)] · P(My, Y0|Φ, θd, N, W, Π) / P(My, Y0|Φ, θd, T, W, Π)    (31)

and the Variational Bayes version:

R(Φ, θd, W, Π) ≈ R_VB(Φ, θd, W, Π) = [1 / P(y0|My)] · qN(My, Y0) / qT(My, Y0)    (32)

where qT(My, Y0) and qN(My, Y0) are the variational posteriors assuming θt = {T} and θt = {N}, respectively.

6.2 Variational distributions

In order to formulate the variational version of this model we need to write down the joint distribution of the random variables:

P(Φ, My, Y|θ, W, Π) = P(Φ|My, Y, θ, W, Π) P(Y|My, θ, W, Π) P(My|θ, W, Π)    (33)

and, simplifying:

P(Φ, My, Y|θ, W, Π) = P(Φ|Y, θ, W) P(Y|My) P(My|Π)    (34)

Now we consider the variational distribution:

q(My, Y) = q(My) q(Y)    (35)

The optimum for the factor q(Y) is given by

ln q*(Y) = E_{My}[ln P(Φ, My, Y|θ, W, Π)] + const    (36)

Using decomposition (34) we have:

ln q*(Y) = E_{My}[ln P(Φ|Y, θ, W)] + E_{My}[ln P(Y|My)] + const    (37)

Substituting equations (2), (3) and (7) into (37) and absorbing any term that does not depend on Y into the additive constant we have


ln q*(Y) = ∑_{i=1}^{M} ∑_{φ∈Si} E_{My}[ln P(φ|yi, W)] + ∑_{i=1}^{M} E_{My}[ln P(yi|My)] + const    (38)

         = − (1/2) ∑_{i=1}^{M} ∑_{φ∈Si} E_{My}[(φ − yi)^T W (φ − yi)] − (1/2) ∑_{i=1}^{M} E_{My}[(yi − µ)^T B (yi − µ)] + const    (39)

         = ∑_{i=1}^{M} yi^T W ∑_{φ∈Si} φ − (1/2) ∑_{i=1}^{M} ni yi^T W yi − (1/2) ∑_{i=1}^{M} yi^T E_{My}[B] yi + ∑_{i=1}^{M} yi^T E_{My}[Bµ] + const    (40)

         = ∑_{i=1}^{M} [ − (1/2) yi^T ( E_{My}[B] + ni W ) yi + yi^T ( E_{My}[Bµ] + W ∑_{φ∈Si} φ ) ] + const    (41)

Equation (41) has the form of a sum of logarithms of Gaussian distributions, so finally:

q*(Y) = ∏_{i=1}^{M} q*(yi)    (42)

q*(yi) = N(yi|Li^{-1} γi, Li^{-1})    (43)

Li = E_{My}[B] + ni W    (44)

γi = E_{My}[Bµ] + W ∑_{φ∈Si} φ    (45)
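As a minimal sketch (our own naming, assuming numpy and that the expectations E[B] and E[Bµ] from the current q(My) factor are available), the q*(yi) update of equations (43)-(45) could be implemented as follows.

import numpy as np

def update_q_y(Phi_by_speaker, E_B, E_Bmu, W):
    """Sketch of the q*(y_i) update of Eqs. (43)-(45).
    Phi_by_speaker: list of (n_i, d) arrays, one per speaker.
    E_B, E_Bmu: current expectations E[B] (d, d) and E[Bmu] (d,).
    Returns the posterior means and covariances of the speaker variables."""
    means, covs = [], []
    for S in Phi_by_speaker:
        n_i = S.shape[0]
        L_i = E_B + n_i * W                    # Eq. (44)
        gamma_i = E_Bmu + W @ S.sum(axis=0)    # Eq. (45)
        cov_i = np.linalg.inv(L_i)
        means.append(cov_i @ gamma_i)          # E[y_i] = L_i^{-1} gamma_i
        covs.append(cov_i)
    return means, covs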

The optimum for the factor q(My) is given by

ln q*(My) = E_Y[ln P(Φ, My, Y|θ, W, Π)] + const    (46)

Using decomposition (34) we have:

ln q*(My) = E_Y[ln P(Y|My)] + E_Y[ln P(My|Π)] + const    (47)

Substituting equations (2) and (187) into (47) and absorbing any term that does not depend on µ or B into the additive constant we have

ln q*(My) = ∑_{i=1}^{M} E_Y[ln P(yi|My)] + E_Y[ln P(µ, B|Π)] + const    (48)

          = (M/2) ln|B| − (1/2) ∑_{i=1}^{M} E_Y[(yi − µ)^T B (yi − µ)] + (1/2) ln|B| − ((d+1)/2) ln|B| + const    (49)

Now we define:

ȳ = (1/M) ∑_{i=1}^{M} E_Y[yi]    (50)

Sy = ∑_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)^T] = ∑_{i=1}^{M} E_Y[yi yi^T] − M ȳȳ^T    (51)

Now we can write (49) as

ln q*(My) = (M/2) ln|B| − (1/2) tr( B ∑_{i=1}^{M} E_Y[(yi − µ)(yi − µ)^T] ) + (1/2) ln|B| − ((d+1)/2) ln|B| + const    (52)

          = (M/2) ln|B| − (1/2) tr( B ∑_{i=1}^{M} E_Y[((yi − ȳ) − (µ − ȳ))((yi − ȳ) − (µ − ȳ))^T] ) + (1/2) ln|B| − ((d+1)/2) ln|B| + const    (53)

          = (M/2) ln|B| − (1/2) tr( B ( ∑_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)^T] + M (µ − ȳ)(µ − ȳ)^T ) ) + (1/2) ln|B| − ((d+1)/2) ln|B| + const    (54)-(55)

          = [ (1/2) ln|B| − (M/2) (µ − ȳ)^T B (µ − ȳ) ] + [ ((M − d − 1)/2) ln|B| − (1/2) tr(B Sy) ] + const    (56)

If we compare equation (56) with equation (204) we can see that q*(My) is Gaussian-Wishart distributed:

q*(My) = N(µ|ȳ, (M B)^{-1}) W(B|Sy^{-1}, M)    if M > d    (57)

In order to estimate the parameters of the distributions, we need to evaluate the expectations E_{My}[B], E_{My}[Bµ], E_Y[yi] and E_Y[yi yi^T]. Using the properties of the Gaussian and Wishart distributions [3] we have

E_{My}[B] = M Sy^{-1}    (58)

E_{My}[Bµ] = ∫∫ Bµ N(µ|ȳ, (MB)^{-1}) W(B|Sy^{-1}, M) dµ dB    (59)
           = ∫ B ( ∫ µ N(µ|ȳ, (MB)^{-1}) dµ ) W(B|Sy^{-1}, M) dB    (60)
           = ( ∫ B W(B|Sy^{-1}, M) dB ) ȳ    (61)
           = M Sy^{-1} ȳ    (62)

E_Y[yi] = Li^{-1} γi    (63)

E_Y[yi yi^T] = Li^{-1} + E_Y[yi] E_Y[yi]^T    (64)
             = Li^{-1} + Li^{-1} γi γi^T Li^{-1}    (65)
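As a rough sketch of the q(My) update (our own helper names, numpy assumed, and valid only when M > d so that Sy is invertible), the statistics (50)-(51) and expectations (58), (62) and (64) could be computed as follows.

import numpy as np

def update_q_My(y_means, y_covs):
    """Sketch of the q*(My) update of Eqs. (50)-(51), (57)-(58) and (62).
    y_means, y_covs: E[y_i] and Cov[y_i] from the current q(y_i) factors.
    Returns (y_bar, S_y) and the expectations needed for the next q(Y) update."""
    M = len(y_means)
    Y = np.stack(y_means)                        # (M, d)
    y_bar = Y.mean(axis=0)                       # Eq. (50)
    # Sum of E[y_i y_i^T] = Cov[y_i] + E[y_i] E[y_i]^T, Eq. (64)
    E_yyT = sum(C + np.outer(m, m) for C, m in zip(y_covs, y_means))
    S_y = E_yyT - M * np.outer(y_bar, y_bar)     # Eq. (51)
    E_B = M * np.linalg.inv(S_y)                 # Eq. (58)
    E_Bmu = E_B @ y_bar                          # Eq. (62)
    return y_bar, S_y, E_B, E_Bmu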

6.3 Variational lower bound

In this section we evaluate the variational lower bound, which we use to check the convergence of our algorithm.

L = ∫∫∫ q(µ, B, Y) ln [ P(Φ, µ, B, Y|θ, W, Π) / q(µ, B, Y) ] dµ dB dY    (66)
  = E_{My,Y}[ln P(Φ, µ, B, Y|θ, W, Π)] − E_{My,Y}[ln q(µ, B, Y)]    (67)
  = E_Y[ln P(Φ|Y, θ, W)] + E_{My,Y}[ln P(Y|My)] + E_{My}[ln P(My|Π)] − E_{My}[ln q(µ, B)] − E_Y[ln q(Y)]    (68)

The simplifications done in the following equations assume that the lower bound is calculated after the q(My) re-estimation. We define:

φ̄i = (1/ni) ∑_{φ∈Si} φ    (69)

Rφ = ∑_{j=1}^{N} φj φj^T    (70)

Sφ = ∑_{i=1}^{M} ∑_{φ∈Si} E_Y[(φ − yi)(φ − yi)^T]    (71)
   = Rφ + ∑_{i=1}^{M} ni ( E_Y[yi yi^T] − E_Y[yi] φ̄i^T − φ̄i E_Y[yi]^T )    (72)

We evaluate E_Y[ln P(Φ|Y, θ, W)]:

E_Y[ln P(Φ|Y, θ, W)] = ∑_{i=1}^{M} ∑_{φ∈Si} E_Y[ln P(φ|yi, W)]    (73)
  = (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) ∑_{i=1}^{M} ∑_{φ∈Si} E_Y[(φ − yi)^T W (φ − yi)]    (74)
  = (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) tr( W ∑_{i=1}^{M} ∑_{φ∈Si} E_Y[(φ − yi)(φ − yi)^T] )    (75)
  = (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) tr(W Sφ)    (76)

Now we evaluate E_{My,Y}[ln P(Y|My)]:

E_{My,Y}[ln P(Y|My)] = ∑_{i=1}^{M} E_{My,Y}[ln P(yi|µ, B)]    (77)
  = (M/2) E_B[ln|B|] − (Md/2) ln(2π) − (1/2) ∑_{i=1}^{M} E_{My,Y}[(yi − µ)^T B (yi − µ)]    (78)
  = (M/2) E_B[ln|B|] − (Md/2) ln(2π) − (1/2) ∑_{i=1}^{M} ( E_{My,Y}[yi^T B yi] − 2 E_{My,Y}[yi^T Bµ] + E_{My}[µ^T Bµ] )    (79)
  = (M/2) E_B[ln|B|] − (Md/2) ln(2π) + M ȳ^T E_{My}[Bµ] − (1/2) tr( E_{My}[B] ∑_{i=1}^{M} E_Y[yi yi^T] + M E_{My}[Bµµ^T] )    (80)

To evaluate (80) we need to calculate some new expectations. Using the properties of the Gaussian and Wishart distributions [3] we have

ln B̃ ≡ E_B[ln|B|] = ∑_{i=1}^{d} ψ( (M + 1 − i)/2 ) + d ln 2 + ln|Sy^{-1}|    (81)
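The expectation (81) involves the digamma function ψ. As a small sketch (our own naming, assuming numpy and scipy are available), it could be evaluated as follows.

import numpy as np
from scipy.special import digamma

def expected_logdet_B(S_y, M):
    """E[ln|B|] under W(B|S_y^{-1}, M), following Eq. (81)."""
    d = S_y.shape[0]
    _, lndet_Sy = np.linalg.slogdet(S_y)
    # sum_i psi((M + 1 - i)/2) + d ln 2 + ln|S_y^{-1}|
    return digamma((M + 1 - np.arange(1, d + 1)) / 2).sum() + d * np.log(2) - lndet_Sy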

E_{My}[Bµµ^T] = ∫∫ Bµµ^T N(µ|ȳ, (MB)^{-1}) W(B|Sy^{-1}, M) dµ dB    (82)
             = ∫ B ( ∫ µµ^T N(µ|ȳ, (MB)^{-1}) dµ ) W(B|Sy^{-1}, M) dB    (83)
             = ∫ B ( (MB)^{-1} + ȳȳ^T ) W(B|Sy^{-1}, M) dB    (84)
             = M^{-1} I + ( ∫ B W(B|Sy^{-1}, M) dB ) ȳȳ^T    (85)
             = M^{-1} I + M Sy^{-1} ȳȳ^T    (86)

Now we can plug equations (51), (58), (62), (81) and (86) into (80):

E_{My,Y}[ln P(Y|My)]
  = (M/2) ln B̃ − (Md/2) ln(2π) + M ȳ^T M Sy^{-1} ȳ − (1/2) tr( M Sy^{-1} ∑_{i=1}^{M} E_Y[yi yi^T] + M ( M^{-1} I + M Sy^{-1} ȳȳ^T ) )    (87)
  = (M/2) ln B̃ − (Md/2) ln(2π) − (1/2) tr( I + M Sy^{-1} ( ∑_{i=1}^{M} E_Y[yi yi^T] − M ȳȳ^T ) )    (88)
  = (M/2) ln B̃ − (Md/2) ln(2π) − (1/2) tr( I + M Sy^{-1} Sy )    (89)
  = (M/2) ln B̃ − (Md/2) ln(2π) − (1/2) tr( (M + 1) I )    (90)
  = (M/2) ln B̃ − (Md/2) ln(2π) − (M + 1)d/2    (91)

Now we evaluate E_{My}[ln P(My|Π)]:

E_{My}[ln P(My|Π)] = ln α − (d/2) ln(2π) − (d/2) ln B̃    (92)

Now we evaluate E_{My}[ln q(µ, B)]:

E_{My}[ln q(µ, B)] = E_{My}[ln N(µ|ȳ, (MB)^{-1})] + E_B[ln W(B|Sy^{-1}, M)]    (93)
  = (d/2) ln(M/2π) + (1/2) ln B̃ − (M/2) E_{My}[(µ − ȳ)^T B (µ − ȳ)] − H[q(B)]    (94)

where H[q(B)] is the entropy of the Wishart distribution [3]:

H[q(B)] = H[ W(B|Sy^{-1}, M) ]    (95)
        = − ln B(Sy^{-1}, M) − ((M − d − 1)/2) ln B̃ + Md/2    (96)

B(W, N) = |W|^{-N/2} / ( 2^{Nd/2} Z_{Nd} )    (97)

and

E_{My}[(µ − ȳ)^T B (µ − ȳ)] = E_{My}[µ^T Bµ] − 2 ȳ^T E_{My}[Bµ] + ȳ^T E_B[B] ȳ    (98)
  = tr( E_{My}[Bµµ^T] ) − ȳ^T E_B[B] ȳ    (99)
  = tr( M^{-1} I + M Sy^{-1} ȳȳ^T ) − ȳ^T M Sy^{-1} ȳ    (100)
  = d M^{-1}    (101)

If we plug (101) into (94):

E_{My}[ln q(µ, B)] = (d/2) ln(M/2π) + (1/2) ln B̃ − d/2 − H[q(B)]    (102)

Now we evaluate E_Y[ln q(Y)]:

E_Y[ln q(Y)] = ∑_{i=1}^{M} E_Y[ln N(yi|Li^{-1} γi, Li^{-1})]    (103)
  = − (Md/2) ln(2π) + (1/2) ∑_{i=1}^{M} ln|Li| − (1/2) ∑_{i=1}^{M} E_Y[(yi − Li^{-1} γi)^T Li (yi − Li^{-1} γi)]    (104)-(105)
  = − (Md/2) ln(2π) + (1/2) ∑_{i=1}^{M} ln|Li| − (1/2) ∑_{i=1}^{M} tr( Li Li^{-1} )    (106)
  = − (Md/2) ln(2π) − Md/2 + (1/2) ∑_{i=1}^{M} ln|Li|    (107)

Finally, simplifying:

L = (N/2) ln|W| − (1/2) tr(W Sφ) − (1/2) ∑_{i=1}^{M} ln|Li| + (M/2) ln|Sy^{-1}|
    − (Nd/2) ln(2π) − (d/2) ln M + ln Z_{Md} + (Md/2)(1 + ln 2)    (108)
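To summarise the procedure of this section, the following self-contained Python sketch (our own simplification, numpy assumed) alternates the two variational updates; for brevity it runs a fixed number of iterations instead of monitoring the bound of equation (108).

import numpy as np

def vb_two_cov(Phi_by_speaker, W, n_iter=20):
    """Minimal sketch of the VB iteration of Section 6.2."""
    d = Phi_by_speaker[0].shape[1]
    M = len(Phi_by_speaker)
    E_B, E_Bmu = np.eye(d), np.zeros(d)                 # arbitrary initialisation
    for _ in range(n_iter):
        # q(Y) update, Eqs. (43)-(45)
        covs = [np.linalg.inv(E_B + S.shape[0] * W) for S in Phi_by_speaker]
        means = [C @ (E_Bmu + W @ S.sum(axis=0)) for C, S in zip(covs, Phi_by_speaker)]
        # q(My) update, Eqs. (50)-(51), (57)-(58), (62)
        y_bar = np.mean(means, axis=0)
        S_y = sum(C + np.outer(m, m) for C, m in zip(covs, means)) - M * np.outer(y_bar, y_bar)
        E_B = M * np.linalg.inv(S_y)
        E_Bmu = E_B @ y_bar
    return y_bar, S_y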

7 VB likelihood ratio with conjugate priors

7.1 Likelihood ratio approximation

The approach taken in Section 6 has a high computational cost, because in each iteration we need to re-estimate the q(yi) distributions for all the speakers of the development database. In this section we follow a different approximation, so that the development i-vectors are not needed in the calculation of the likelihood ratio. We estimate the posterior distribution of the model My given Φd, θd, W and the non-informative prior Π, and then use this posterior as a prior Πd in the calculation of the likelihood ratio. To estimate this posterior we use the following approximation:

P(My|Πd) = P(My|Φd, θd, W, Π) ≈ qd(My)    (109)

where qd(My) is the variational factor of My given only the development data. This factor is Gaussian-Wishart distributed, as we obtained in equation (57):

qd(My) = N(µ|ȳd, (βd B)^{-1}) W(B|Sdy^{-1}, νd)    if νd > d    (110)

where βd = νd = Md. With this approximation, the likelihood ratio is

R(Φt, W, Πd) = P(Φt|T, W, Πd) / P(Φt|N, W, Πd)    (111)
  = [P(Φt|My, T, W) / P(Φt|My, N, W)] · [P(My|Φt, N, W, Πd) / P(My|Φt, T, W, Πd)]    (112)
  = R(Φt, My, W) · P(My|Φt, N, W, Πd) / P(My|Φt, T, W, Πd)    (113)
  = R(Φt, My, W) · [P(My, Yt|Φt, N, W, Πd) / P(My, Yt|Φt, T, W, Πd)] · [P(Yt|My, Φt, T, W) / P(Yt|My, Φt, N, W)]    (114)

Now we set y = y0 for all speakers in Φt, which we denote as Yt = Yt0. We take equation (28) and the plug-in likelihood in equation (11) and substitute them into (114). Finally, simplifying, we get:

R(Φt, W, Πd) = [1 / P(y0|My)] · P(My, Yt0|Φt, N, W, Πd) / P(My, Yt0|Φt, T, W, Πd)    (115)

and the Variational Bayes version:

R(Φt, W, Πd) ≈ R_VB(Φt, W, Πd) = [1 / P(y0|My)] · qN(My, Yt0) / qT(My, Yt0)    (116)

where qT (My , Yt0 ) and qN (My , Yt0 ) are the variational posteriors assuming that θt = {T } and θt = {N } respectively.

7.2 Variational distributions

We now obtain the variational factors using the conjugate prior. We write again the joint distribution of the random variables:

P(Φ, My, Y|θ, W, Πd) = P(Φ|Y, θ, W) P(Y|My) P(My|Πd)    (117)

Now we consider the variational distribution:

q(My, Y) = q(My) q(Y)    (118)

The optimum for the factor q(Y) is the same as in the previous section:

q*(Y) = ∏_{i=1}^{M} q*(yi)    (119)

q*(yi) = N(yi|Li^{-1} γi, Li^{-1})    (120)

Li = E_{My}[B] + ni W    (121)

γi = E_{My}[Bµ] + W ∑_{φ∈Si} φ    (122)

The optimum for the factor q(My) is given by

ln q*(My) = E_Y[ln P(Φ, My, Y|θ, W, Πd)] + const    (123)

Using decomposition (117) we have:

ln q*(My) = E_Y[ln P(Y|My)] + E_Y[ln P(My|Πd)] + const    (124)

Substituting equations (2) and (110) into (124) and absorbing any term that does not depend on µ or B into the additive constant we have

ln q*(My) = ∑_{i=1}^{M} E_Y[ln P(yi|My)] + E_Y[ln P(µ, B|Πd)] + const    (125)

  = (M/2) ln|B| − (1/2) ∑_{i=1}^{M} E_Y[(yi − µ)^T B (yi − µ)]
    + (1/2) ln|B| − (βd/2) (µ − ȳd)^T B (µ − ȳd) + ((νd − d − 1)/2) ln|B| − (1/2) tr(B Sdy) + const    (126)-(127)

Now we define:

ȳ = (1/M) ∑_{i=1}^{M} E_Y[yi]    (128)

Sy = ∑_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)^T] = ∑_{i=1}^{M} E_Y[yi yi^T] − M ȳȳ^T    (129)

β′ = βd + M    (130)

ν′ = νd + M    (131)

y′ = (1/β′) (βd ȳd + M ȳ)    (132)

S′y = Sdy + Sy + (βd M / β′) (ȳ − ȳd)(ȳ − ȳd)^T    (133)
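As a minimal sketch (our own naming, numpy assumed), the conjugate Gaussian-Wishart update of equations (130)-(133) could be written as follows.

import numpy as np

def update_prior_to_posterior(beta_d, nu_d, y_d, S_dy, M, y_bar, S_y):
    """Sketch of the conjugate Gaussian-Wishart update of Eqs. (130)-(133).
    (beta_d, nu_d, y_d, S_dy) are the prior parameters obtained from the
    development data; (M, y_bar, S_y) are the statistics of the new speakers."""
    beta_p = beta_d + M                                                  # Eq. (130)
    nu_p = nu_d + M                                                      # Eq. (131)
    y_p = (beta_d * y_d + M * y_bar) / beta_p                            # Eq. (132)
    diff = y_bar - y_d
    S_p = S_dy + S_y + (beta_d * M / beta_p) * np.outer(diff, diff)      # Eq. (133)
    return beta_p, nu_p, y_p, S_p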

Now, proceeding as in equations (52)-(55), we can write (127) as

ln q*(My) = (M/2) ln|B| − (1/2) tr( B ( ∑_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)^T] + M (µ − ȳ)(µ − ȳ)^T ) )
    + (1/2) ln|B| − (βd/2) (µ − ȳd)^T B (µ − ȳd) + ((νd − d − 1)/2) ln|B| − (1/2) tr(B Sdy) + const    (134)

  = − (1/2) tr( B ( M (µ − ȳ)(µ − ȳ)^T + βd (µ − ȳd)(µ − ȳd)^T ) )
    + (1/2) ln|B| + ((νd + M − d − 1)/2) ln|B| − (1/2) tr( B (Sdy + Sy) ) + const    (135)

Expanding the quadratic forms in µ and completing the square (equations (136)-(137)) gives

  = (1/2) ln|B| − (β′/2) (µ − y′)^T B (µ − y′)
    − (1/2) tr( B ( M ȳȳ^T + βd ȳd ȳd^T − (1/β′) (M ȳ + βd ȳd)(M ȳ + βd ȳd)^T ) )
    + ((ν′ − d − 1)/2) ln|B| − (1/2) tr( B (Sdy + Sy) ) + const    (138)

  = (1/2) ln|B| − (β′/2) (µ − y′)^T B (µ − y′)
    + ((ν′ − d − 1)/2) ln|B| − (1/2) tr( B ( Sdy + Sy + (βd M / β′) (ȳ − ȳd)(ȳ − ȳd)^T ) ) + const    (139)

  = [ (1/2) ln|B| − (β′/2) (µ − y′)^T B (µ − y′) ] + [ ((ν′ − d − 1)/2) ln|B| − (1/2) tr(B S′y) ] + const    (140)

We see that q*(My) is Gaussian-Wishart distributed:

q*(My) = N(µ|y′, (β′B)^{-1}) W(B|S′y^{-1}, ν′)    if ν′ > d    (141)

In order to estimate the parameters of the distributions, we need to evaluate the expectations E_{My}[B], E_{My}[Bµ], E_Y[yi] and E_Y[yi yi^T]. Using the properties of the Gaussian and Wishart distributions [3] we have

E_{My}[B] = ν′ S′y^{-1}    (142)

E_{My}[Bµ] = ν′ S′y^{-1} y′    (143)

E_Y[yi] = Li^{-1} γi    (144)

E_Y[yi yi^T] = Li^{-1} + E_Y[yi] E_Y[yi]^T    (145)
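For completeness, a small sketch (our own naming, numpy assumed) of the expectations (142)-(143) under the posterior of equation (141):

import numpy as np

def posterior_expectations(nu_p, y_p, S_p):
    """E[B] and E[Bmu] under q*(My) of Eq. (141), following Eqs. (142)-(143)."""
    E_B = nu_p * np.linalg.inv(S_p)   # Eq. (142)
    E_Bmu = E_B @ y_p                 # Eq. (143)
    return E_B, E_Bmu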

7.3 Variational lower bound

In this section we evaluate the variational lower bound, which we use to check the convergence of our algorithm.

L = ∫∫∫ q(µ, B, Y) ln [ P(Φ, µ, B, Y|θ, W, Πd) / q(µ, B, Y) ] dµ dB dY    (146)
  = E_{My,Y}[ln P(Φ, µ, B, Y|θ, W, Πd)] − E_{My,Y}[ln q(µ, B, Y)]    (147)
  = E_Y[ln P(Φ|Y, θ, W)] + E_{My,Y}[ln P(Y|My)] + E_{My}[ln P(My|Πd)] − E_{My}[ln q(µ, B)] − E_Y[ln q(Y)]    (148)

We define:

φ̄i = (1/ni) ∑_{φ∈Si} φ    (149)

Rφ = ∑_{j=1}^{N} φj φj^T    (150)

Sφ = ∑_{i=1}^{M} ∑_{φ∈Si} E_Y[(φ − yi)(φ − yi)^T]    (151)
   = Rφ + ∑_{i=1}^{M} ni ( E_Y[yi yi^T] − E_Y[yi] φ̄i^T − φ̄i E_Y[yi]^T )    (152)

We evaluate E_Y[ln P(Φ|Y, θ, W)]:

E_Y[ln P(Φ|Y, θ, W)] = (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) tr(W Sφ)    (153)

Now we evaluate E_{My,Y}[ln P(Y|My)]:

E_{My,Y}[ln P(Y|My)] = (M/2) E_B[ln|B|] − (Md/2) ln(2π) + M ȳ^T E_{My}[Bµ] − (1/2) tr( E_{My}[B] ∑_{i=1}^{M} E_Y[yi yi^T] + M E_{My}[Bµµ^T] )    (154)

To evaluate (154) we need to calculate some new expectations. Using the properties of the Gaussian and Wishart distributions [3] we have

ln B̃ ≡ E_B[ln|B|] = ∑_{i=1}^{d} ψ( (ν′ + 1 − i)/2 ) + d ln 2 + ln|S′y^{-1}|    (155)

E_{My}[Bµµ^T] = ∫∫ Bµµ^T N(µ|y′, (β′B)^{-1}) W(B|S′y^{-1}, ν′) dµ dB    (156)
             = ∫ B ( ∫ µµ^T N(µ|y′, (β′B)^{-1}) dµ ) W(B|S′y^{-1}, ν′) dB    (157)
             = ∫ B ( (β′B)^{-1} + y′y′^T ) W(B|S′y^{-1}, ν′) dB    (158)
             = β′^{-1} I + ( ∫ B W(B|S′y^{-1}, ν′) dB ) y′y′^T    (159)
             = β′^{-1} I + ν′ S′y^{-1} y′y′^T    (160)

Now we can plug equations (129), (142), (143), (155) and (160) into (154):

E_{My,Y}[ln P(Y|My)]
  = (M/2) ln B̃ − (Md/2) ln(2π) + M ȳ^T ν′ S′y^{-1} y′ − (1/2) tr( ν′ S′y^{-1} ∑_{i=1}^{M} E_Y[yi yi^T] + M ( β′^{-1} I + ν′ S′y^{-1} y′y′^T ) )    (161)
  = (M/2) ln B̃ − (Md/2) ln(2π) + M ȳ^T ν′ S′y^{-1} y′ − (1/2) tr( (M/β′) I + ν′ S′y^{-1} ( ∑_{i=1}^{M} E_Y[yi yi^T] + M y′y′^T ) )    (162)
  = (M/2) ln B̃ − (Md/2) ln(2π) − Md/(2β′) − (Mν′/2) (y′ − ȳ)^T S′y^{-1} (y′ − ȳ) − (ν′/2) tr( S′y^{-1} Sy )    (163)

Now we evaluate E_{My}[ln P(My|Πd)]:

E_{My}[ln P(My|Πd)] = E_{My}[ln N(µ|ȳd, (βd B)^{-1})] + E_B[ln W(B|Sdy^{-1}, νd)]    (164)
  = (d/2) ln(βd/2π) + (1/2) ln B̃ − (βd/2) E_{My}[(µ − ȳd)^T B (µ − ȳd)]
    + ln B(Sdy^{-1}, νd) + ((νd − d − 1)/2) ln B̃ − (1/2) tr( Sdy E_{My}[B] )    (165)

where

E_{My}[(µ − ȳd)^T B (µ − ȳd)] = E_{My}[µ^T Bµ] − 2 ȳd^T E_{My}[Bµ] + ȳd^T E_B[B] ȳd    (166)
  = tr( E_{My}[Bµµ^T] ) − 2 ȳd^T ν′ S′y^{-1} y′ + ȳd^T ν′ S′y^{-1} ȳd    (167)
  = tr( β′^{-1} I + ν′ S′y^{-1} y′y′^T ) − 2 ȳd^T ν′ S′y^{-1} y′ + ȳd^T ν′ S′y^{-1} ȳd    (168)
  = d/β′ + ν′ (y′ − ȳd)^T S′y^{-1} (y′ − ȳd)    (169)

If we plug (169) into (165):

E_{My}[ln P(My|Πd)] = (d/2) ln(βd/2π) + (1/2) ln B̃ − dβd/(2β′) − (βd ν′/2) (y′ − ȳd)^T S′y^{-1} (y′ − ȳd)
  + ln B(Sdy^{-1}, νd) + ((νd − d − 1)/2) ln B̃ − (ν′/2) tr( Sdy S′y^{-1} )    (170)

Now we evaluate E_{My}[ln q(µ, B)]:

E_{My}[ln q(µ, B)] = E_{My}[ln N(µ|y′, (β′B)^{-1})] + E_B[ln W(B|S′y^{-1}, ν′)]    (171)
  = (d/2) ln(β′/2π) + (1/2) ln B̃ − (β′/2) E_{My}[(µ − y′)^T B (µ − y′)] − H[q(B)]    (172)

where H[q(B)] is the entropy of the Wishart distribution [3]:

H[q(B)] = H[ W(B|S′y^{-1}, ν′) ]    (173)
        = − ln B(S′y^{-1}, ν′) − ((ν′ − d − 1)/2) ln B̃ + ν′d/2    (174)

B(W, N) = |W|^{-N/2} / ( 2^{Nd/2} Z_{Nd} )    (175)

and

E_{My}[(µ − y′)^T B (µ − y′)] = E_{My}[µ^T Bµ] − 2 y′^T E_{My}[Bµ] + y′^T E_B[B] y′    (176)
  = tr( E_{My}[Bµµ^T] ) − y′^T E_B[B] y′    (177)
  = tr( β′^{-1} I + ν′ S′y^{-1} y′y′^T ) − y′^T ν′ S′y^{-1} y′    (178)
  = d β′^{-1}    (179)

If we plug (179) into (172):

E_{My}[ln q(µ, B)] = (d/2) ln(β′/2π) + (1/2) ln B̃ − d/2 − H[q(B)]    (180)

Now we evaluate E_Y[ln q(Y)]:

E_Y[ln q(Y)] = ∑_{i=1}^{M} E_Y[ln N(yi|Li^{-1} γi, Li^{-1})]    (181)
             = − (Md/2) ln(2π) − Md/2 + (1/2) ∑_{i=1}^{M} ln|Li|    (182)

Finally, simplifying:

L = (N/2) ln|W| − (1/2) tr(W Sφ) − (1/2) ∑_{i=1}^{M} ln|Li| − (ν′/2) tr( S′y^{-1} (Sy + Sdy) )
    − (Mν′/2) (y′ − ȳ)^T S′y^{-1} (y′ − ȳ) − (βd ν′/2) (y′ − ȳd)^T S′y^{-1} (y′ − ȳd)
    − (νd/2) ln|Sdy^{-1}| + (ν′/2) ln|S′y^{-1}|
    − (Nd/2) ln(2π) − ln Z_{νd d} + ln Z_{ν′ d} + (Md/2)(1 + ln 2) + ν′d/2 + (d/2) ln(βd/β′)    (183)

A Inferring a Gaussian distribution

In [4] we can find a detailed explanation of how to estimate a Gaussian distribution using non-informative priors. Here we rewrite some of those equations using precision matrices instead of covariance matrices. A Gaussian distribution is defined by:

P(x|m, Λ) = N(x|m, Λ^{-1}) = |Λ/2π|^{1/2} exp( −(1/2) (x − m)^T Λ (x − m) )    (184)

We assume a non-informative prior Π for m and Λ (Jeffreys prior):

P(m, Λ|Π) = P(m|Λ, Π) P(Λ|Π)    (185)
          = lim_{k→0} N(m|m0, (kΛ)^{-1}) W(Λ|W0/k, k)    (186)
          = α |Λ/2π|^{1/2} |Λ|^{-(d+1)/2}    (187)

Now we obtain the posteriors for m and Λ given a set of observations X = {x1, x2, ..., xN}:

P(m, Λ|X, Π) = [P(m, Λ|Π) / P(X|Π)] |Λ/2π|^{N/2} exp( −(1/2) ∑_{i=1}^{N} (xi − m)^T Λ (xi − m) )    (188)
             = [P(m, Λ|Π) / P(X|Π)] |Λ/2π|^{N/2} exp( −(N/2) (m − x̄)^T Λ (m − x̄) ) exp( −(1/2) tr(SΛ) )    (189)

where

x̄ = (1/N) ∑_{i=1}^{N} xi    (190)

S = ∑_{i=1}^{N} (xi − x̄)(xi − x̄)^T    (191)

The marginal posterior for the precision Λ is:

P(Λ|X, Π) = ∫ P(m, Λ|X, Π) dm    (192)
  = [α / P(X|Π)] (1/|Λ|^{(d+1)/2}) (1/N^{d/2}) |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) ) ∫ |NΛ/2π|^{1/2} exp( −(N/2) (m − x̄)^T Λ (m − x̄) ) dm    (193)
  = [α / P(X|Π)] (1/|Λ|^{(d+1)/2}) (1/N^{d/2}) |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) )    (194)
  = [1 / P(X|Π)] P(X|Λ) P(Λ|Π)    (195)

Now we use

∫_{V≥0} |V|^{−k−(d+1)/2} exp( −tr(AV^{-1}) ) dV = ∫_{W≥0} |W|^{k−(d+1)/2} exp( −tr(AW) ) dW    (196)
  = |A|^{−k} π^{d(d−1)/4} ∏_{i=1}^{d} Γ( k − (i − 1)/2 )    (197)

if k > (d − 1)/2 and |A| > 0, to get

P(X|Π) = ∫ α (1/|Λ|^{(d+1)/2}) (1/N^{d/2}) |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) ) dΛ    (198)
       = α π^{d(d−1)/4} ∏_{i=1}^{d} Γ( (N + 1 − i)/2 ) / ( N^{d/2} |πS|^{N/2} )    if N > d    (199)

For N ≤ d the expression involves |S0|^{-N/2}, where S0 = (x̄ − m0)(x̄ − m0)^T + V0 (see [4]).

We can plug (199) into (194) to get the marginal posterior of Λ:

P(Λ|X, Π) = (1/Z_{Nd}) (1/|Λ|^{(d+1)/2}) |SΛ/2|^{N/2} exp( −(1/2) tr(SΛ) )    (200)
          = W(Λ|S^{-1}, N)    if N > d    (201)

where Z_{Nd} = π^{d(d−1)/4} ∏_{i=1}^{d} Γ( (N + 1 − i)/2 ). Plugging (199) into (189) we have the joint posterior for m and Λ:

P(m, Λ|X, Π) = [ N^{d/2} |πS|^{N/2} / Z_{Nd} ] |Λ|^{-(d+1)/2} |Λ/2π|^{1/2} |Λ/2π|^{N/2} exp( −(N/2) (m − x̄)^T Λ (m − x̄) ) exp( −(1/2) tr(SΛ) )    (202)
  = [ 1 / (Z_{Nd} |Λ|^{(d+1)/2}) ] |SΛ/2|^{(N+1)/2} |πS/N|^{-1/2} exp( −(N/2) (m − x̄)^T Λ (m − x̄) ) exp( −(1/2) tr(SΛ) )    (203)
  = [ |NΛ/2π|^{1/2} exp( −(N/2) (m − x̄)^T Λ (m − x̄) ) ] [ (1/(Z_{Nd} |Λ|^{(d+1)/2})) |SΛ/2|^{N/2} exp( −(1/2) tr(SΛ) ) ]    (204)
  = N(m|x̄, (NΛ)^{-1}) W(Λ|S^{-1}, N)    if N > d    (205)-(206)

So finally, summing up:

P(m, Λ|X, Π) = P(m|Λ, X, Π) P(Λ|X, Π)    (207)
P(m|Λ, X, Π) = N(m|x̄, (NΛ)^{-1})    (208)
P(Λ|X, Π) = W(Λ|S^{-1}, N)    if N > d    (209)
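As a small illustration (our own naming, numpy assumed), the statistics that parameterise the posterior of equations (207)-(209) are just the sample mean and scatter of the data.

import numpy as np

def jeffreys_posterior_params(X):
    """Parameters of the posterior of Eqs. (207)-(209) given data X of shape (N, d):
    m|Lambda, X ~ N(x_bar, (N*Lambda)^{-1}), Lambda|X ~ W(S^{-1}, N), requiring N > d."""
    N = X.shape[0]
    x_bar = X.mean(axis=0)       # Eq. (190)
    centered = X - x_bar
    S = centered.T @ centered    # Eq. (191)
    return x_bar, S, N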

Now we can calculate the marginal posterior for the mean m:

P(m|X, Π) = ∫ P(m, Λ|X, Π) dΛ    (210)
  = [α / P(X|Π)] ∫ (1/|Λ|^{(d+1)/2}) (1/N^{d/2}) |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) ) |NΛ/2π|^{1/2} exp( −(N/2) (m − x̄)^T Λ (m − x̄) ) dΛ    (211)
  = [α / P(X|Π)] ∫ (1/|Λ|^{(d+1)/2}) |Λ/2π|^{(N+1)/2} exp( −(1/2) tr( [S + N (m − x̄)(m − x̄)^T] Λ ) ) dΛ    (212)
  = [α Z_{(N+1)d} / (P(X|Π) π^{(N+1)d/2})] |S + N (m − x̄)(m − x̄)^T|^{-(N+1)/2}    (213)
  = [Z_{(N+1)d} N^{d/2} |πS|^{N/2} / (Z_{Nd} π^{(N+1)d/2})] |S + N (m − x̄)(m − x̄)^T|^{-(N+1)/2}    (214)
  = [Γ((N+1)/2) / Γ((N+1−d)/2)] (N^{d/2} / π^{d/2}) |S|^{-1/2} |I + N S^{-1} (m − x̄)(m − x̄)^T|^{-(N+1)/2}    (215)-(216)
  = [Γ((N+1)/2) / Γ((N+1−d)/2)] |N S^{-1}/π|^{1/2} [ 1 + N (m − x̄)^T S^{-1} (m − x̄) ]^{-(N+1)/2}    (217)
  = T_M(m|x̄, S/N, N + 1)    (218)
  = T(m|x̄, S/(N(N+1−d)), N + 1 − d)    if N > d    (219)

where we have used the matrix relation in [8]:

|I + BC| = |I + CB|    (220)

So the marginal posterior of the mean is Student's t distributed. There are several definitions of Student's t distribution in the literature. This is the definition in [4]:

T_M(x|m, Λ^{-1}, N) = [Γ(N/2) / Γ((N−d)/2)] |Λ/π|^{1/2} [ 1 + (x − m)^T Λ (x − m) ]^{-N/2}    (221)

This is the definition in [3], which seems more widespread:

T(x|m, Λ^{-1}, N) = [Γ((N+d)/2) / Γ(N/2)] |Λ/(Nπ)|^{1/2} [ 1 + (1/N) (x − m)^T Λ (x − m) ]^{-(N+d)/2}    (222)
                  = T_M(x|m, N Λ^{-1}, N + d)    (223)
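As a final sketch (our own naming, assuming numpy and scipy), the log-density of the Student's t definition in equation (222) could be evaluated as follows; the degrees-of-freedom argument is called nu here instead of N.

import numpy as np
from scipy.special import gammaln

def student_t_logpdf(x, m, Lambda, nu):
    """Log-density of the multivariate Student's t of Eq. (222),
    with precision matrix Lambda and nu degrees of freedom."""
    d = x.shape[0]
    _, lndet = np.linalg.slogdet(Lambda)
    quad = (x - m) @ Lambda @ (x - m)
    return (gammaln((nu + d) / 2) - gammaln(nu / 2)
            + 0.5 * lndet - 0.5 * d * np.log(nu * np.pi)
            - 0.5 * (nu + d) * np.log1p(quad / nu))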

References

[1] Niko Brümmer and Edward de Villiers, "The Speaker Partitioning Problem," in Odyssey Speaker and Language Recognition Workshop, 2010.
[2] Niko Brümmer, "Bayesian PLDA," 2010.
[3] Christopher Bishop, Pattern Recognition and Machine Learning, 2006.
[4] Thomas Minka, "Inferring a Gaussian distribution," 1998.
[5] Kaare Petersen and Michael Pedersen, The Matrix Cookbook, Technical University of Denmark, 2008.
[6] H.V. Henderson and S.R. Searle, "On Deriving the Inverse of a Sum of Matrices," SIAM Review, vol. 23, no. 1, pp. 53-60, 1981.
[7] M. Brookes, "The Matrix Reference Manual," 2005.
[8] Thomas Minka, "Old and New Matrix Algebra Useful for Statistics," 2000.
