Bayesian Two-Covariance Model Integrating Out the Speaker Space Distribution

Jesús Villalba, Niko Brümmer

October 20, 2010
1 Introduction

The two-covariance model presented in [1] for speaker verification is a two-step process:

• Given a development database, make an ML point estimate of the two-covariance model by EM iterations.
• Given a train and a test segment and the estimated model, calculate the target vs. non-target posterior probability.

In this work, we intend to give a Bayesian treatment of the two-covariance model as described in [2]. Using the posterior of the model given the data, we are going to integrate out the parameters describing the speaker distribution.
2 The Model

2.1 Notation

We are going to introduce some notation:

• Let Φd be the development i-vector dataset.
• Let Φt = {l, r} be the test i-vectors.
• Let Φ = Φd ∪ Φt.
• Let Φ be any of the previous datasets.
• Let θd be the labelling of the development dataset. It partitions the Nd i-vectors into Md speakers.
• Let θt be the labelling of the test set, so that θt ∈ {T, N}, where T is the hypothesis that l and r belong to the same speaker and N is the hypothesis that they belong to different speakers.
• Let θ = θd ∧ θt be the labelling of Φ.
• Let θ be any of the previous labellings.
• Let Si be the set of i-vectors belonging to speaker i.
• Let Yd be the speaker identity variables of the development set. We have as many identity variables as speakers.
• Let Yt be the speaker identity variables of the test set.
• Let Y = Yd ∪ Yt.
• Let Y be any of the previous sets of speaker identity variables.
• Let π = (PT, PN) be the hypothesis prior, where PT is the prior probability of a target trial and PN = 1 − PT is the prior probability of a non-target trial.
• Let M = (µ, B, W) be the set of all the parameters of the model, and My = (µ, B).
2.2 The two-covariance model

We take a linear-Gaussian generative model M. We suppose that an i-vector φ of speaker s can be written as

φ = ys + z    (1)

where ys is the speaker identity variable and z is a channel offset. We assume the following probability distributions:

P(y|M) = N(y|µ, B⁻¹)    (2)
P(φ|y, M) = N(φ|y, W⁻¹)    (3)

where N denotes a Gaussian distribution; µ is the speaker mean; B⁻¹ is the between-class covariance matrix; W⁻¹ is the within-class covariance matrix; and B and W are the corresponding precision matrices.
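As a concrete illustration (not part of the original paper), the generative process of equations 1-3 can be sketched in Python. The function name and array shapes are our own choices:

```python
import numpy as np

def sample_ivectors(mu, B, W, n_speakers, n_per_spk, rng):
    """Draw i-vectors from the two-covariance model, Eqs. (1)-(3).

    B and W are precision matrices, so their inverses are the
    between-class and within-class covariances."""
    B_cov = np.linalg.inv(B)   # between-class covariance B^-1
    W_cov = np.linalg.inv(W)   # within-class covariance W^-1
    Phi, labels = [], []
    for s in range(n_speakers):
        y_s = rng.multivariate_normal(mu, B_cov)   # speaker identity, Eq. (2)
        for _ in range(n_per_spk):
            # observation phi = y_s + channel offset z ~ N(0, W^-1), Eqs. (1), (3)
            Phi.append(rng.multivariate_normal(y_s, W_cov))
            labels.append(s)
    return np.asarray(Phi), np.asarray(labels)
```

Under this model the total covariance of the i-vectors is B⁻¹ + W⁻¹, which gives a simple sanity check on the sampler.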
3 Bayesian two-covariance model

Following a Bayesian treatment, instead of assuming fixed values for µ and B, we work with a probability distribution over the model parameters. For this, we need a prior for the model, P(My|Π), and we calculate the posterior distribution of the model given the data, the labelling, W and the prior:

P(My|Φ, θ, W, Π)    (4)

In Figure 1 we show the graphical representation of this model. Φd, Φt and θd are observed variables; µ, B and θt are hidden variables; W is assumed to be known; and π and Π are the priors over the hidden variables. We can use the rules described in [3] on this graphical model to determine the dependencies between variables.
Figure 1: Graphical model of the Bayesian two-covariance model.

We are going to use a non-informative prior (Jeffreys prior) for the parameters µ and B of the speaker Gaussian distribution, as in [4]. We assume that W is known for simplicity; in the experiments we will use its ML point estimate.
4 Likelihood ratio given a known model

Given a model M, we can calculate the ratio of the posterior probabilities of target and non-target as shown in [1]:

P(T|Φt, M, π) / P(N|Φt, M, π) = (PT/PN) · P(Φt|T, M) / P(Φt|N, M) = (PT/PN) R(Φt, M)    (5)

where we have defined the plug-in likelihood ratio R(Φt, M). To get this ratio we need to calculate P(Φ|θ, M). Given a model M, the identity variables y1, y2, ..., yM ∈ Y are sampled independently from P(y|M). Besides, given the model M and a speaker i, the set Si of i-vectors produced by that speaker is drawn independently from P(φ|yi, M). Using these independence assumptions we can write:

P(Φ|θ, M) = ∏_{i=1}^{M} P(Si|M)    (6)

P(Si|y, M) = ∏_{φ∈Si} P(φ|y, M)    (7)
Using Bayes' rule we can write

P(Si, y0|M) = P(Si|y0, M) P(y0|M) = P(y0|Si, M) P(Si|M)    (8)

and then

P(Si|M) = P(Si|y0, M) P(y0|M) / P(y0|Si, M)    (9)

where y0 is whatever value we want, as long as the denominator is not zero. Note that y0 does not appear in the LHS of the equation, so the result does not depend on it. Then, for the complete dataset:

P(Φ|θ, M) = ∏_{i=1}^{M} P(Si|y0, M) P(y0|M) / P(y0|Si, M) = K(Φ) L(θ|Φ)    (10)

where K(Φ) = ∏_{j=1}^{N} P(φj|y0, M) is a term that depends only on the dataset, not on θ, so it vanishes when taking the ratio and we do not need to calculate it. What we need to calculate is:

L(θ|Φ) = ∏_{i=1}^{M} Q(Si)    (11)

Q(Si) = P(y0|M) / P(y0|Si, M)    (12)

and the likelihood ratio is:

R(Φt, M) = Q({l, r}) / ( Q({l}) Q({r}) )    (13)

Distribution 2 is a conjugate prior for distribution 3, so the posterior is again Gaussian:

P(y|S, M) = N(y|L⁻¹γ, L⁻¹)    (14)
L = B + nW    (15)
γ = Bµ + W Σ_{φ∈S} φ    (16)

where n is the number of i-vectors in S. Making y0 = 0, we can use this posterior to calculate Q(S):

ln Q(S) = (1/2) ( ln|B| − µᵀBµ − ln|L| + γᵀL⁻¹γ )    (17)
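Equations 13-17 translate directly into code. The following is our own sketch (helper names are hypothetical, not from the paper), computing ln Q(S) with y0 = 0 and the plug-in log-likelihood ratio:

```python
import numpy as np

def ln_Q(S, mu, B, W):
    """ln Q(S) of Eq. (17), taking y0 = 0.  S: (n, d) i-vectors of one speaker."""
    n = S.shape[0]
    L = B + n * W                        # Eq. (15)
    gamma = B @ mu + W @ S.sum(axis=0)   # Eq. (16)
    _, logdet_B = np.linalg.slogdet(B)
    _, logdet_L = np.linalg.slogdet(L)
    return 0.5 * (logdet_B - mu @ B @ mu
                  - logdet_L + gamma @ np.linalg.solve(L, gamma))

def plugin_log_lr(l, r, mu, B, W):
    """log R(Phi_t, M) of Eq. (13): ln Q({l, r}) - ln Q({l}) - ln Q({r})."""
    pair = np.stack([l, r])
    return (ln_Q(pair, mu, B, W)
            - ln_Q(l[None, :], mu, B, W)
            - ln_Q(r[None, :], mu, B, W))
```

With identity precisions and zero mean, a pair of identical i-vectors scores a positive log ratio and a pair of opposite i-vectors a negative one, as expected.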
5 Bayesian likelihood ratio
In the Bayesian case, given the development and test data and W, we can calculate the ratio of the posterior probabilities of target and non-target as shown in [1]:

P(T|Φt, Φd, θd, W, Π, π) / P(N|Φt, Φd, θd, W, Π, π) = (PT/PN) · P(Φt|T, Φd, θd, W, Π) / P(Φt|N, Φd, θd, W, Π) = (PT/PN) R(Φ, θd, W, Π)    (18)

where R(Φ, θd, W, Π) is the Bayesian likelihood ratio. Using Bayes' rule we can write

P(Φt, My|θt, Φd, θd, W, Π) = P(Φt|My, θt, Φd, θd, W, Π) P(My|θt, Φd, θd, W, Π) = P(My|Φ, θ, W, Π) P(Φt|θt, Φd, θd, W, Π)    (19)

Using the conditional independence assumptions given by the graphical model we have:

P(Φt|My, θt, W) P(My|Φd, θd, W, Π) = P(My|Φ, θ, W, Π) P(Φt|θt, Φd, θd, W, Π)    (20)

Then, we can write the ratio in 18 as:

R(Φ, θd, W, Π) = [ P(Φt|My, T, W) / P(Φt|My, N, W) ] · [ P(My|Φ, θd, N, W, Π) / P(My|Φ, θd, T, W, Π) ]    (21)
= R(Φt, My, W) · P(My|Φ, θd, N, W, Π) / P(My|Φ, θd, T, W, Π)    (22)

So the Bayesian likelihood ratio is equal to the plug-in likelihood ratio multiplied by a correction factor. Note that the LHS of the equation does not depend on the model My used in the computation, so we can use whatever My we find convenient. Therefore, we now need to calculate P(My|Φ, θ, W, Π), that is

P(My|Φ, θ, W, Π) = ∫_Y P(My|Y, Π) P(Y|Φ, θ, W, Π) dY    (23)

Solving this integral is complicated, so we are going to use the Variational Bayes method to estimate the posterior distribution of My.
6 VB likelihood ratio

6.1 Likelihood ratio approximation
The Variational Bayes approach [3] is based on the approximation: P (My , Y|Φ, θ, W, Π) ≈ q (My , Y) = q (My ) q (Y) (24) We need to write R Φ, θd , W, Π as a function of P (My , Y|Φ, θ, W, Π). Again we use Bayes rule: P (My , Y|Φ, θ, W, Π) = P (Y|My , Φ, θ, W, Π) P (My |Φ, θ, W, Π)
(25)
Simplifying with the conditional independence rules: P (My , Y|Φ, θ, W, Π) = P (Y|My , Φ, θ, W) P (My |Φ, θ, W, Π)
(26)
If we use 26 in 21: P My , Y|Φ, θd , N , W, Π P Y|My , Φ, θd , T , W R Φ, θd , W, Π = R (Φt , My , W) P My , Y|Φ, θd , T , W, Π P Y|My , Φ, θd , N , W
(27)
Now we can use the fact that, given the model My , the y of each speaker are drown independently among then: P (Y|My , Φ, θ, W) =
N Y
P (yi |My , Si , W)
(28)
i=1
and
P Y|My , Φ, θ, W = P (Yt |My , Φt , θt , W) P (Yd |My , Φd , θd , W) 4
(29)
If we plug 29 into 27 we have P My , Y|Φ, θd , N , W, Π P (Yt |My , Φt , T , W) R Φ, θd , W, Π = R (Φt , My , W) P My , Y|Φ, θd , T , W, Π P (Yt |My , Φt , N , W)
(30)
Note again that the LHS does not depend on My or Y so we can use whatever we want. Now we make y = y0 for all speakers in Φ, what we denote as Y = Y0 . We take equation 28 and the plug-in likelihood in equation 11 and substitute them in 30. Finally simplifying we get:
R Φ, θd , W, Π = and the Variational Bayes version:
P My , Y0 |Φ, θd , N , W, Π 1 P (y0 |My ) P My , Y 0 |Φ, θd , T , W, Π
R Φ, θd , W, Π ≈ RV B Φ, θd , W, Π =
qN M y , Y 0 1 P (y0 |My ) qT My , Y 0
(31)
(32)
where qT My , Y0 and qN My , Y0 are the variational posteriors assuming that θt = {T } and θt = {N } respectively.
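The simplification leading to equation 31 can be spelled out; this is our own sketch of the cancellation, using only equations 12, 13 and 28:

```latex
% Under T the test set has one speaker; under N it has two. Evaluating the
% Y_t posteriors of Eq. 30 at Y = Y_0 (every identity variable set to y_0):
\begin{align}
P(Y_t^0 \mid M_y, \Phi_t, \mathcal{T}, W) &= P(y_0 \mid M_y, \{l,r\}, W) \\
P(Y_t^0 \mid M_y, \Phi_t, \mathcal{N}, W) &= P(y_0 \mid M_y, l, W)\, P(y_0 \mid M_y, r, W)
\end{align}
% while Eqs. 12-13 give the plug-in ratio in terms of the same quantities:
\begin{align}
R(\Phi_t, M_y, W) = \frac{Q(\{l,r\})}{Q(\{l\})\,Q(\{r\})}
= \frac{P(y_0 \mid M_y, l, W)\, P(y_0 \mid M_y, r, W)}
       {P(y_0 \mid M_y)\, P(y_0 \mid M_y, \{l,r\}, W)}
\end{align}
% Multiplying the two ratios, every test-segment posterior cancels and only
% the factor 1/P(y_0 | M_y) survives, which yields Eq. 31.
```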
6.2 Variational distributions
In order to formulate the variational version of this model, we need to write down the joint distribution of the random variables:

P(Φ, My, Y|θ, W, Π) = P(Φ|My, Y, θ, W, Π) P(Y|My, θ, W, Π) P(My|θ, W, Π)    (33)

and simplifying:

P(Φ, My, Y|θ, W, Π) = P(Φ|Y, θ, W) P(Y|My) P(My|Π)    (34)

Now we consider the variational distribution:

q(My, Y) = q(My) q(Y)    (35)

The optimum for the factor q(Y) is given by

ln q*(Y) = E_{My}[ln P(Φ, My, Y|θ, W, Π)] + const    (36)

Using decomposition 34 we have:

ln q*(Y) = E_{My}[ln P(Φ|Y, θ, W)] + E_{My}[ln P(Y|My)] + const    (37)
Substituting equations 2, 3 and 7 into 37 and absorbing any term that does not depend on Y into the additive constant, we have

ln q*(Y) = Σ_{i=1}^{M} Σ_{φ∈Si} E_{My}[ln P(φ|yi, W)] + Σ_{i=1}^{M} E_{My}[ln P(yi|My)] + const    (38)

= −(1/2) Σ_{i=1}^{M} Σ_{φ∈Si} E_{My}[(φ − yi)ᵀW(φ − yi)] − (1/2) Σ_{i=1}^{M} E_{My}[(yi − µ)ᵀB(yi − µ)] + const    (39)

= Σ_{i=1}^{M} yiᵀW Σ_{φ∈Si} φ − (1/2) Σ_{i=1}^{M} ni yiᵀWyi − (1/2) Σ_{i=1}^{M} yiᵀ E_{My}[B] yi + Σ_{i=1}^{M} yiᵀ E_{My}[Bµ] + const    (40)

= Σ_{i=1}^{M} ( −(1/2) yiᵀ ( E_{My}[B] + ni W ) yi + yiᵀ ( E_{My}[Bµ] + W Σ_{φ∈Si} φ ) ) + const    (41)

Equation 41 has the form of a sum of logarithms of Gaussian distributions, so finally:

q*(Y) = ∏_{i=1}^{M} q*(yi)    (42)

q*(yi) = N(yi|Li⁻¹γi, Li⁻¹)    (43)
Li = E_{My}[B] + ni W    (44)
γi = E_{My}[Bµ] + W Σ_{φ∈Si} φ    (45)
The optimum for the factor q(My) is given by

ln q*(My) = E_Y[ln P(Φ, My, Y|θ, W, Π)] + const    (46)

Using decomposition 34 we have:

ln q*(My) = E_Y[ln P(Y|My)] + E_Y[ln P(My|Π)] + const    (47)
Substituting equations 2 and 187 into 47 and absorbing any term that does not depend on µ or B into the additive constant, we have

ln q*(My) = Σ_{i=1}^{M} E_Y[ln P(yi|My)] + E_Y[ln P(µ, B|Π)] + const    (48)

= (M/2) ln|B| − (1/2) Σ_{i=1}^{M} E_Y[(yi − µ)ᵀB(yi − µ)] + (1/2) ln|B| − ((d + 1)/2) ln|B| + const    (49)

Now we define:

ȳ = (1/M) Σ_{i=1}^{M} E_Y[yi]    (50)

Sy = Σ_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)ᵀ] = Σ_{i=1}^{M} E_Y[yi yiᵀ] − M ȳȳᵀ    (51)

Now we can write 49 as

ln q*(My) = (M/2) ln|B| − (1/2) tr( B Σ_{i=1}^{M} E_Y[(yi − µ)(yi − µ)ᵀ] ) + (1/2) ln|B| − ((d + 1)/2) ln|B| + const    (52)

= (M/2) ln|B| − (1/2) tr( B Σ_{i=1}^{M} E_Y[((yi − ȳ) − (µ − ȳ))((yi − ȳ) − (µ − ȳ))ᵀ] ) + (1/2) ln|B| − ((d + 1)/2) ln|B| + const    (53)

= (M/2) ln|B| − (1/2) tr( B ( Σ_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)ᵀ] + M(µ − ȳ)(µ − ȳ)ᵀ ) ) + (1/2) ln|B| − ((d + 1)/2) ln|B| + const    (54)–(55)

= (M/2) ln|B| − (M/2)(µ − ȳ)ᵀB(µ − ȳ) + (1/2) ln|B| − ((d + 1)/2) ln|B| − (1/2) tr(B Sy) + const    (56)

If we compare equation 56 with equation 204, we can see that q*(My) is Gaussian-Wishart distributed:

q*(My) = N(µ|ȳ, (MB)⁻¹) W(B|Sy⁻¹, M)    if M > d    (57)
In order to estimate the parameters of the distributions, we need to evaluate the expectations E_{My}[B], E_{My}[Bµ], E_Y[yi] and E_Y[yi yiᵀ]. Using the properties of the Gaussian and Wishart distributions [3], we have

E_{My}[B] = M Sy⁻¹    (58)

E_{My}[Bµ] = ∫_B ∫_µ Bµ N(µ|ȳ, (MB)⁻¹) W(B|Sy⁻¹, M) dµ dB    (59)
= ∫_B B ( ∫_µ µ N(µ|ȳ, (MB)⁻¹) dµ ) W(B|Sy⁻¹, M) dB    (60)
= ( ∫_B B W(B|Sy⁻¹, M) dB ) ȳ    (61)
= M Sy⁻¹ ȳ    (62)

E_Y[yi] = Li⁻¹ γi    (63)

E_Y[yi yiᵀ] = Li⁻¹ + E_Y[yi] E_Y[yi]ᵀ    (64)
= Li⁻¹ + Li⁻¹ γi γiᵀ Li⁻¹    (65)
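The two updates above can be combined into a simple coordinate-ascent loop. This is our own sketch (the initialisation and iteration count are arbitrary choices, not from the paper):

```python
import numpy as np

def vb_two_cov(Phi, labels, W, n_iter=20):
    """Variational posterior of My = (mu, B) under the Jeffreys prior.

    Alternates the q(y_i) update, Eqs. (43)-(45), with the q(My) update,
    Eqs. (50), (51), (57).  Phi: (N, d) i-vectors; labels: speaker per row."""
    d = Phi.shape[1]
    speakers = np.unique(labels)
    M = len(speakers)
    # zero- and first-order statistics (n_i, sum of i-vectors) per speaker
    stats = [(np.sum(labels == s), Phi[labels == s].sum(axis=0)) for s in speakers]
    E_B = np.eye(d)        # initial guess for E[B]
    E_Bmu = np.zeros(d)    # initial guess for E[B mu]
    for _ in range(n_iter):
        Ey = np.empty((M, d))
        Eyy = np.zeros((d, d))
        for i, (n_i, f_i) in enumerate(stats):
            L_i = E_B + n_i * W                      # Eq. (44)
            gamma_i = E_Bmu + W @ f_i                # Eq. (45)
            Ey[i] = np.linalg.solve(L_i, gamma_i)    # Eq. (63)
            Eyy += np.linalg.inv(L_i) + np.outer(Ey[i], Ey[i])   # Eqs. (64)-(65)
        ybar = Ey.mean(axis=0)                       # Eq. (50)
        Sy = Eyy - M * np.outer(ybar, ybar)          # Eq. (51)
        E_B = M * np.linalg.inv(Sy)                  # Eq. (58)
        E_Bmu = E_B @ ybar                           # Eq. (62)
    # q*(My) = N(mu | ybar, (M B)^-1) W(B | Sy^-1, M), Eq. (57)
    return ybar, Sy, M
```

On data generated from the model, ybar should approach the true speaker mean and Sy/M the between-class covariance, provided M > d.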
6.3 Variational lower bound
In this section, we evaluate the variational lower bound, which we use to check the convergence of our algorithm:

L = ∫_Y ∫_B ∫_µ q(µ, B, Y) ln( P(Φ, µ, B, Y|θ, W, Π) / q(µ, B, Y) ) dµ dB dY    (66)
= E_{My,Y}[ln P(Φ, µ, B, Y|θ, W, Π)] − E_{My,Y}[ln q(µ, B, Y)]    (67)
= E_Y[ln P(Φ|Y, θ, W)] + E_{My,Y}[ln P(Y|My)] + E_{My}[ln P(My|Π)] − E_{My}[ln q(µ, B)] − E_Y[ln q(Y)]    (68)
The simplifications done in the following equations assume that the lower bound is calculated after the q(My) re-estimation. We define:

φ̄i = (1/ni) Σ_{φ∈Si} φ    (69)

Rφ = Σ_{j=1}^{N} φj φjᵀ    (70)

Sφ = Σ_{i=1}^{M} Σ_{φ∈Si} E_Y[(φ − yi)(φ − yi)ᵀ]    (71)
= Rφ + Σ_{i=1}^{M} ni ( E_Y[yi yiᵀ] − E_Y[yi] φ̄iᵀ − φ̄i E_Y[yi]ᵀ )    (72)
We evaluate E_Y[ln P(Φ|Y, θ, W)]:

E_Y[ln P(Φ|Y, θ, W)] = Σ_{i=1}^{M} Σ_{φ∈Si} E_Y[ln P(φ|yi, W)]    (73)

= (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) Σ_{i=1}^{M} Σ_{φ∈Si} E_Y[(φ − yi)ᵀW(φ − yi)]    (74)

= (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) tr( W Σ_{i=1}^{M} Σ_{φ∈Si} E_Y[(φ − yi)(φ − yi)ᵀ] )    (75)

= (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) tr(W Sφ)    (76)
Now, we evaluate E_{My,Y}[ln P(Y|My)]:

E_{My,Y}[ln P(Y|My)] = Σ_{i=1}^{M} E_{My,Y}[ln P(yi|µ, B)]    (77)

= (M/2) E_B[ln|B|] − (Md/2) ln(2π) − (1/2) Σ_{i=1}^{M} E_{My,Y}[(yi − µ)ᵀB(yi − µ)]    (78)

= (M/2) E_B[ln|B|] − (Md/2) ln(2π) − (1/2) Σ_{i=1}^{M} ( E_{My,Y}[yiᵀByi] − 2 E_{My,Y}[yiᵀBµ] + E_{My}[µᵀBµ] )    (79)

= (M/2) E_B[ln|B|] − (Md/2) ln(2π) + M ȳᵀ E_{My}[Bµ] − (1/2) tr( E_{My}[B] Σ_{i=1}^{M} E_Y[yi yiᵀ] + M E_{My}[Bµµᵀ] )    (80)
To evaluate 80 we need to calculate some new expectations. Using the properties of the Gaussian and Wishart distributions [3], we have

ln B̃ ≡ E_B[ln|B|] = Σ_{i=1}^{d} ψ( (M + 1 − i)/2 ) + d ln 2 + ln|Sy⁻¹|    (81)

E_{My}[Bµµᵀ] = ∫_B ∫_µ Bµµᵀ N(µ|ȳ, (MB)⁻¹) W(B|Sy⁻¹, M) dµ dB    (82)
= ∫_B B ( ∫_µ µµᵀ N(µ|ȳ, (MB)⁻¹) dµ ) W(B|Sy⁻¹, M) dB    (83)
= ∫_B B ( (MB)⁻¹ + ȳȳᵀ ) W(B|Sy⁻¹, M) dB    (84)
= M⁻¹ I + ( ∫_B B W(B|Sy⁻¹, M) dB ) ȳȳᵀ    (85)
= M⁻¹ I + M Sy⁻¹ ȳȳᵀ    (86)
Now we can plug equations 51, 58, 62, 81 and 86 into 80:

E_{My,Y}[ln P(Y|My)] = (M/2) ln B̃ − (Md/2) ln(2π) + M ȳᵀ M Sy⁻¹ ȳ
  − (1/2) tr( M Sy⁻¹ Σ_{i=1}^{M} E_Y[yi yiᵀ] + M ( M⁻¹ I + M Sy⁻¹ ȳȳᵀ ) )    (87)

= (M/2) ln B̃ − (Md/2) ln(2π) − (1/2) tr( I + M Sy⁻¹ ( Σ_{i=1}^{M} E_Y[yi yiᵀ] − M ȳȳᵀ ) )    (88)

= (M/2) ln B̃ − (Md/2) ln(2π) − (1/2) tr( I + M Sy⁻¹ Sy )    (89)

= (M/2) ln B̃ − (Md/2) ln(2π) − (1/2) tr( (M + 1) I )    (90)

= (M/2) ln B̃ − (Md/2) ln(2π) − (M + 1)d/2    (91)

Now, we evaluate E_{My}[ln P(My|Π)]:

E_{My}[ln P(My|Π)] = ln α − (d/2) ln(2π) − (d/2) ln B̃    (92)
Now, we evaluate E_{My}[ln q(µ, B)]:

E_{My}[ln q(µ, B)] = E_{My}[ln N(µ|ȳ, (MB)⁻¹)] + E_B[ln W(B|Sy⁻¹, M)]    (93)
= (d/2) ln(M/2π) + (1/2) ln B̃ − (M/2) E_{My}[(µ − ȳ)ᵀB(µ − ȳ)] − H[q(B)]    (94)

where H[q(B)] is the entropy of the Wishart distribution [3]:

H[q(B)] = H[ W(B|Sy⁻¹, M) ]    (95)
= − ln B(Sy⁻¹, M) − ((M − d − 1)/2) ln B̃ + Md/2    (96)

B(W, N) = |W|^{−N/2} ( 2^{Nd/2} Z_{Nd} )⁻¹    (97)
and

E_{My}[(µ − ȳ)ᵀB(µ − ȳ)] = E_{My}[µᵀBµ] − 2ȳᵀ E_{My}[Bµ] + ȳᵀ E_B[B] ȳ    (98)
= tr( E_{My}[Bµµᵀ] ) − ȳᵀ E_B[B] ȳ    (99)
= tr( M⁻¹ I + M Sy⁻¹ ȳȳᵀ ) − ȳᵀ M Sy⁻¹ ȳ    (100)
= d M⁻¹    (101)

If we plug 101 into 94:

E_{My}[ln q(µ, B)] = (d/2) ln(M/2π) + (1/2) ln B̃ − d/2 − H[q(B)]    (102)
Now, we evaluate E_Y[ln q(Y)]:

E_Y[ln q(Y)] = Σ_{i=1}^{M} E_Y[ln N(yi|Li⁻¹γi, Li⁻¹)]    (103)

= −(Md/2) ln(2π) + (1/2) Σ_{i=1}^{M} ln|Li| − (1/2) Σ_{i=1}^{M} E_Y[(yi − Li⁻¹γi)ᵀLi(yi − Li⁻¹γi)]    (104)

= −(Md/2) ln(2π) + (1/2) Σ_{i=1}^{M} ln|Li| − (1/2) Σ_{i=1}^{M} tr( Li E_Y[(yi − Li⁻¹γi)(yi − Li⁻¹γi)ᵀ] )    (105)

= −(Md/2) ln(2π) + (1/2) Σ_{i=1}^{M} ln|Li| − (1/2) Σ_{i=1}^{M} tr( Li Li⁻¹ )    (106)

= −(Md/2) ln(2π) − Md/2 + (1/2) Σ_{i=1}^{M} ln|Li|    (107)
Finally, simplifying:

L = (N/2) ln|W| − (1/2) tr(W Sφ) − (1/2) Σ_{i=1}^{M} ln|Li| + (M/2) ln|Sy⁻¹|
  − (Nd/2) ln(2π) − (d/2) ln M + ln Z_{Md} + (Md/2)(1 + ln 2)    (108)
7 VB likelihood ratio with conjugate priors

7.1 Likelihood ratio approximation
The approach taken in section 6 has a high computational cost, because in each iteration we need to re-estimate the q(yi) distributions for all the speakers of the development database. In this section, we follow a different approximation, so that we do not have to use the development i-vectors in the calculation of the likelihood ratio. We estimate the posterior distribution of the model My given Φd, θd, W and the non-informative prior Π, and then use this posterior as a prior Πd for the calculation of the likelihood ratio. To estimate this posterior we use the following approximation:

P(My|Πd) = P(My|Φd, θd, W, Π) ≈ qd(My)    (109)

where qd(My) is the variational factor of My given only the development data. This factor is Gaussian-Wishart distributed, as we obtained in equation 57:

qd(My) = N(µ|ȳd, (βd B)⁻¹) W(B|Sdy⁻¹, νd)    if νd > d    (110)

where βd = νd = Md. With this approximation, the likelihood ratio is

R(Φt, W, Πd) = P(Φt|T, W, Πd) / P(Φt|N, W, Πd)    (111)
= [ P(Φt|My, T, W) / P(Φt|My, N, W) ] · [ P(My|Φt, N, W, Πd) / P(My|Φt, T, W, Πd) ]    (112)
= R(Φt, My, W) · P(My|Φt, N, W, Πd) / P(My|Φt, T, W, Πd)    (113)
= R(Φt, My, W) · [ P(My, Yt|Φt, N, W, Πd) / P(My, Yt|Φt, T, W, Πd) ] · [ P(Yt|My, Φt, T, W) / P(Yt|My, Φt, N, W) ]    (114)

Now we make y = y0 for all speakers in Φt, which we denote as Yt = Yt0. We take equation 28 and the plug-in likelihood in equation 11 and substitute them into 114. Finally, simplifying, we get:

R(Φt, W, Πd) = [ 1 / P(y0|My) ] · P(My, Yt0|Φt, N, W, Πd) / P(My, Yt0|Φt, T, W, Πd)    (115)

and the Variational Bayes version:

R(Φt, W, Πd) ≈ RVB(Φt, W, Πd) = [ 1 / P(y0|My) ] · qN(My, Yt0) / qT(My, Yt0)    (116)

where qT(My, Yt0) and qN(My, Yt0) are the variational posteriors assuming θt = {T} and θt = {N} respectively.
7.2 Variational distributions
We are going to obtain the variational factors using the conjugate prior. We write again the joint distribution of the random variables:

P(Φ, My, Y|θ, W, Πd) = P(Φ|Y, θ, W) P(Y|My) P(My|Πd)    (117)

Now, we consider the variational distribution:

q(My, Y) = q(My) q(Y)    (118)

The optimum for the factor q(Y) is the same as in the previous section:

q*(Y) = ∏_{i=1}^{M} q*(yi)    (119)

q*(yi) = N(yi|Li⁻¹γi, Li⁻¹)    (120)
Li = E_{My}[B] + ni W    (121)
γi = E_{My}[Bµ] + W Σ_{φ∈Si} φ    (122)
The optimum for the factor q(My) is given by

ln q*(My) = E_Y[ln P(Φ, My, Y|θ, W, Πd)] + const    (123)

Using decomposition 117 we have:

ln q*(My) = E_Y[ln P(Y|My)] + E_Y[ln P(My|Πd)] + const    (124)
Substituting equations 2 and 110 into 124 and absorbing any term that does not depend on µ or B into the additive constant, we have

ln q*(My) = Σ_{i=1}^{M} E_Y[ln P(yi|My)] + E_Y[ln P(µ, B|Πd)] + const    (125)

= (M/2) ln|B| − (1/2) Σ_{i=1}^{M} E_Y[(yi − µ)ᵀB(yi − µ)]    (126)
  + (1/2) ln|B| − (βd/2)(µ − ȳd)ᵀB(µ − ȳd) + ((νd − d − 1)/2) ln|B| − (1/2) tr(B Sdy) + const    (127)

Now we define:

ȳ = (1/M) Σ_{i=1}^{M} E_Y[yi]    (128)

Sy = Σ_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)ᵀ] = Σ_{i=1}^{M} E_Y[yi yiᵀ] − M ȳȳᵀ    (129)

β′ = βd + M    (130)
ν′ = νd + M    (131)
ȳ′ = (1/β′)(βd ȳd + M ȳ)    (132)
S′y = Sdy + Sy + (βd M/β′)(ȳ − ȳd)(ȳ − ȳd)ᵀ    (133)

Now we can write 127 as

ln q*(My) = (M/2) ln|B| − (1/2) tr( B ( Σ_{i=1}^{M} E_Y[(yi − ȳ)(yi − ȳ)ᵀ] + M(µ − ȳ)(µ − ȳ)ᵀ ) )
  + (1/2) ln|B| − (βd/2)(µ − ȳd)ᵀB(µ − ȳd) + ((νd − d − 1)/2) ln|B| − (1/2) tr(B Sdy) + const    (134)

= −(1/2) tr( B ( M(µ − ȳ)(µ − ȳ)ᵀ + βd(µ − ȳd)(µ − ȳd)ᵀ ) )
  + (1/2) ln|B| + ((νd + M − d − 1)/2) ln|B| − (1/2) tr( B (Sdy + Sy) ) + const    (135)

= −(1/2) tr( B ( M(µµᵀ − µȳᵀ − ȳµᵀ + ȳȳᵀ) + βd(µµᵀ − µȳdᵀ − ȳdµᵀ + ȳdȳdᵀ) ) )
  + (1/2) ln|B| + ((ν′ − d − 1)/2) ln|B| − (1/2) tr( B (Sdy + Sy) ) + const    (136)

= −(1/2) tr( B (M + βd) ( µµᵀ − (1/(M + βd)) µ(Mȳ + βdȳd)ᵀ − (1/(M + βd)) (Mȳ + βdȳd)µᵀ + (1/(M + βd)) (Mȳȳᵀ + βdȳdȳdᵀ) ) )
  + (1/2) ln|B| + ((ν′ − d − 1)/2) ln|B| − (1/2) tr( B (Sdy + Sy) ) + const    (137)

= (1/2) ln|B| − (β′/2)(µ − ȳ′)ᵀB(µ − ȳ′)
  − (1/2) tr( B ( Mȳȳᵀ + βdȳdȳdᵀ − (1/(M + βd)) (Mȳ + βdȳd)(Mȳ + βdȳd)ᵀ ) )
  + ((ν′ − d − 1)/2) ln|B| − (1/2) tr( B (Sdy + Sy) ) + const    (138)

= (1/2) ln|B| − (β′/2)(µ − ȳ′)ᵀB(µ − ȳ′)
  + ((ν′ − d − 1)/2) ln|B| − (1/2) tr( B ( Sdy + Sy + (βd M/(M + βd)) (ȳ − ȳd)(ȳ − ȳd)ᵀ ) ) + const    (139)

= (1/2) ln|B| − (β′/2)(µ − ȳ′)ᵀB(µ − ȳ′) + ((ν′ − d − 1)/2) ln|B| − (1/2) tr(B S′y) + const    (140)

So we see that q*(My) is Gaussian-Wishart distributed:

q*(My) = N(µ|ȳ′, (β′B)⁻¹) W(B|S′y⁻¹, ν′)    if ν′ > d    (141)
In order to estimate the parameters of the distributions, we need to evaluate the expectations E_{My}[B], E_{My}[Bµ], E_Y[yi] and E_Y[yi yiᵀ]. Using the properties of the Gaussian and Wishart distributions [3], we have

E_{My}[B] = ν′ S′y⁻¹    (142)

E_{My}[Bµ] = ν′ S′y⁻¹ ȳ′    (143)

E_Y[yi] = Li⁻¹ γi    (144)

E_Y[yi yiᵀ] = Li⁻¹ + E_Y[yi] E_Y[yi]ᵀ    (145)
7.3 Variational lower bound

In this section, we evaluate the variational lower bound, which we use to check the convergence of our algorithm.
L = ∫_Y ∫_B ∫_µ q(µ, B, Y) ln( P(Φ, µ, B, Y|θ, W, Πd) / q(µ, B, Y) ) dµ dB dY    (146)
= E_{My,Y}[ln P(Φ, µ, B, Y|θ, W, Πd)] − E_{My,Y}[ln q(µ, B, Y)]    (147)
= E_Y[ln P(Φ|Y, θ, W)] + E_{My,Y}[ln P(Y|My)] + E_{My}[ln P(My|Πd)] − E_{My}[ln q(µ, B)] − E_Y[ln q(Y)]    (148)
We are going to define:

φ̄i = (1/ni) Σ_{φ∈Si} φ    (149)

Rφ = Σ_{j=1}^{N} φj φjᵀ    (150)

Sφ = Σ_{i=1}^{M} Σ_{φ∈Si} E_Y[(φ − yi)(φ − yi)ᵀ]    (151)
= Rφ + Σ_{i=1}^{M} ni ( E_Y[yi yiᵀ] − E_Y[yi] φ̄iᵀ − φ̄i E_Y[yi]ᵀ )    (152)
We evaluate E_Y[ln P(Φ|Y, θ, W)]:

E_Y[ln P(Φ|Y, θ, W)] = (N/2) ln|W| − (Nd/2) ln(2π) − (1/2) tr(W Sφ)    (153)

Now, we evaluate E_{My,Y}[ln P(Y|My)]:

E_{My,Y}[ln P(Y|My)] = (M/2) E_B[ln|B|] − (Md/2) ln(2π) + M ȳᵀ E_{My}[Bµ] − (1/2) tr( E_{My}[B] Σ_{i=1}^{M} E_Y[yi yiᵀ] + M E_{My}[Bµµᵀ] )    (154)
To evaluate 154 we need to calculate some new expectations. Using the properties of the Gaussian and Wishart distributions [3], we have

ln B̃ ≡ E_B[ln|B|] = Σ_{i=1}^{d} ψ( (ν′ + 1 − i)/2 ) + d ln 2 + ln|S′y⁻¹|    (155)

E_{My}[Bµµᵀ] = ∫_B ∫_µ Bµµᵀ N(µ|ȳ′, (β′B)⁻¹) W(B|S′y⁻¹, ν′) dµ dB    (156)
= ∫_B B ( ∫_µ µµᵀ N(µ|ȳ′, (β′B)⁻¹) dµ ) W(B|S′y⁻¹, ν′) dB    (157)
= ∫_B B ( (β′B)⁻¹ + ȳ′ȳ′ᵀ ) W(B|S′y⁻¹, ν′) dB    (158)
= β′⁻¹ I + ( ∫_B B W(B|S′y⁻¹, ν′) dB ) ȳ′ȳ′ᵀ    (159)
= β′⁻¹ I + ν′ S′y⁻¹ ȳ′ȳ′ᵀ    (160)
Now we can plug equations 129, 142, 143, 155 and 160 into 154:

E_{My,Y}[ln P(Y|My)] = (M/2) ln B̃ − (Md/2) ln(2π) + M ȳᵀ ν′ S′y⁻¹ ȳ′
  − (1/2) tr( ν′ S′y⁻¹ Σ_{i=1}^{M} E_Y[yi yiᵀ] + M ( β′⁻¹ I + ν′ S′y⁻¹ ȳ′ȳ′ᵀ ) )    (161)

= (M/2) ln B̃ − (Md/2) ln(2π) + M ȳᵀ ν′ S′y⁻¹ ȳ′
  − (1/2) tr( (M/β′) I + ν′ S′y⁻¹ ( Σ_{i=1}^{M} E_Y[yi yiᵀ] + M ȳ′ȳ′ᵀ ) )    (162)

= (M/2) ln B̃ − (Md/2) ln(2π) − Md/(2β′) − (Mν′/2)(ȳ′ − ȳ)ᵀ S′y⁻¹ (ȳ′ − ȳ) − (ν′/2) tr( S′y⁻¹ Sy )    (163)

Now, we evaluate E_{My}[ln P(My|Πd)]:

E_{My}[ln P(My|Πd)] = E_{My}[ln N(µ|ȳd, (βd B)⁻¹)] + E_B[ln W(B|Sdy⁻¹, νd)]    (164)

= (d/2) ln(βd/2π) + (1/2) ln B̃ − (βd/2) E_{My}[(µ − ȳd)ᵀB(µ − ȳd)]
  + ln B(Sdy⁻¹, νd) + ((νd − d − 1)/2) ln B̃ − (1/2) tr( Sdy E_{My}[B] )    (165)

where

E_{My}[(µ − ȳd)ᵀB(µ − ȳd)] = E_{My}[µᵀBµ] − 2ȳdᵀ E_{My}[Bµ] + ȳdᵀ E_B[B] ȳd    (166)
= tr( E_{My}[Bµµᵀ] ) − 2ȳdᵀ ν′ S′y⁻¹ ȳ′ + ȳdᵀ ν′ S′y⁻¹ ȳd    (167)
= tr( β′⁻¹ I + ν′ S′y⁻¹ ȳ′ȳ′ᵀ ) − 2ȳdᵀ ν′ S′y⁻¹ ȳ′ + ȳdᵀ ν′ S′y⁻¹ ȳd    (168)
= d/β′ + ν′ (ȳ′ − ȳd)ᵀ S′y⁻¹ (ȳ′ − ȳd)    (169)

If we plug 169 into 165:

E_{My}[ln P(My|Πd)] = (d/2) ln(βd/2π) + (1/2) ln B̃ − dβd/(2β′) − (βd ν′/2)(ȳ′ − ȳd)ᵀ S′y⁻¹ (ȳ′ − ȳd)
  + ((νd − d − 1)/2) ln B̃ − (ν′/2) tr( Sdy S′y⁻¹ ) + ln B(Sdy⁻¹, νd)    (170)
Now, we evaluate E_{My}[ln q(µ, B)]:

E_{My}[ln q(µ, B)] = E_{My}[ln N(µ|ȳ′, (β′B)⁻¹)] + E_B[ln W(B|S′y⁻¹, ν′)]    (171)
= (d/2) ln(β′/2π) + (1/2) ln B̃ − (β′/2) E_{My}[(µ − ȳ′)ᵀB(µ − ȳ′)] − H[q(B)]    (172)

where H[q(B)] is the entropy of the Wishart distribution [3]:

H[q(B)] = H[ W(B|S′y⁻¹, ν′) ]    (173)
= − ln B(S′y⁻¹, ν′) − ((ν′ − d − 1)/2) ln B̃ + ν′d/2    (174)

B(W, N) = |W|^{−N/2} ( 2^{Nd/2} Z_{Nd} )⁻¹    (175)

and

E_{My}[(µ − ȳ′)ᵀB(µ − ȳ′)] = E_{My}[µᵀBµ] − 2ȳ′ᵀ E_{My}[Bµ] + ȳ′ᵀ E_B[B] ȳ′    (176)
= tr( E_{My}[Bµµᵀ] ) − ȳ′ᵀ E_B[B] ȳ′    (177)
= tr( β′⁻¹ I + ν′ S′y⁻¹ ȳ′ȳ′ᵀ ) − ȳ′ᵀ ν′ S′y⁻¹ ȳ′    (178)
= d β′⁻¹    (179)

If we plug 179 into 172:

E_{My}[ln q(µ, B)] = (d/2) ln(β′/2π) + (1/2) ln B̃ − d/2 − H[q(B)]    (180)
Now, we evaluate E_Y[ln q(Y)]:

E_Y[ln q(Y)] = Σ_{i=1}^{M} E_Y[ln N(yi|Li⁻¹γi, Li⁻¹)]    (181)

= −(Md/2) ln(2π) − Md/2 + (1/2) Σ_{i=1}^{M} ln|Li|    (182)
Finally, simplifying:

L = (N/2) ln|W| − (1/2) tr(W Sφ) − (1/2) Σ_{i=1}^{M} ln|Li| − (ν′/2) tr( S′y⁻¹ (Sy + Sdy) )
  − (Mν′/2)(ȳ′ − ȳ)ᵀ S′y⁻¹ (ȳ′ − ȳ) − (βd ν′/2)(ȳ′ − ȳd)ᵀ S′y⁻¹ (ȳ′ − ȳd)
  − (νd/2) ln|Sdy⁻¹| + (ν′/2) ln|S′y⁻¹|
  − (Nd/2) ln(2π) + (d/2) ln(βd/β′) − ln Z_{νd d} + ln Z_{ν′ d} + ν′d/2 + (Md/2)(1 + ln 2)    (183)
A Inferring a Gaussian distribution
In [4], we can find a detailed explanation of how to estimate a Gaussian distribution using non-informative priors. Here, we write some of those equations using precision matrices instead of covariance matrices. A Gaussian distribution is defined by:

P(x|m, Λ) = N(x|m, Λ⁻¹) = |Λ/2π|^{1/2} exp( −(1/2)(x − m)ᵀΛ(x − m) )    (184)

We assume a non-informative prior Π for m and Λ (Jeffreys prior):

P(m, Λ|Π) = P(m|Λ, Π) P(Λ|Π)    (185)
= lim_{k→0} N(m|m0, (kΛ)⁻¹) W(Λ|W0/k, k)    (186)
= α |Λ/2π|^{1/2} |Λ|^{−(d+1)/2}    (187)

Now, we get the posteriors for m and Λ given a set of observations X = {x1, x2, ..., xN}:

P(m, Λ|X, Π) = [ P(m, Λ|Π) / P(X|Π) ] |Λ/2π|^{N/2} exp( −(1/2) Σ_{i=1}^{N} (xi − m)ᵀΛ(xi − m) )    (188)
= [ P(m, Λ|Π) / P(X|Π) ] |Λ/2π|^{N/2} exp( −(N/2)(m − x̄)ᵀΛ(m − x̄) ) exp( −(1/2) tr(SΛ) )    (189)
where

x̄ = (1/N) Σ_{i=1}^{N} xi    (190)

S = Σ_{i=1}^{N} (xi − x̄)(xi − x̄)ᵀ    (191)
The marginal posterior for the precision Λ is:

P(Λ|X, Π) = ∫_m P(m, Λ|X, Π) dm    (192)

= [ α / P(X|Π) ] |Λ|^{−(d+1)/2} N^{−d/2} |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) ) ∫_m |NΛ/2π|^{1/2} exp( −(N/2)(m − x̄)ᵀΛ(m − x̄) ) dm    (193)

= [ α / P(X|Π) ] |Λ|^{−(d+1)/2} N^{−d/2} |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) )    (194)

= [ 1 / P(X|Π) ] P(X|Λ) P(Λ|Π)    (195)

Now, we use

∫_{V≥0} |V|^{−k−(d+1)/2} exp( −tr(AV⁻¹) ) dV = ∫_{W≥0} |W|^{k−(d+1)/2} exp( −tr(AW) ) dW    (196)
= |A|^{−k} π^{d(d−1)/4} ∏_{i=1}^{d} Γ( k − (i − 1)/2 )    (197)

if k > (d − 1)/2 and |A| > 0, to get

P(X|Π) = ∫_Λ α |Λ|^{−(d+1)/2} N^{−d/2} |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) ) dΛ    (198)

= α π^{d(d−1)/4} ∏_{i=1}^{d} Γ((N + 1 − i)/2) N^{−d/2} |πS|^{−N/2}    if N > d    (199)

and, if N ≤ d, a corresponding limiting expression involving |S0 + S|^{−N/2}, where S0 = (x̄ − m0)(x̄ − m0)ᵀ + V0 (see [4]).
We can plug 199 into 194 to get the marginal posterior of Λ:

P(Λ|X, Π) = (1/Z_{Nd}) |Λ|^{−(d+1)/2} |SΛ/2|^{N/2} exp( −(1/2) tr(SΛ) )    (200)
= W(Λ|S⁻¹, N)    if N > d    (201)

where Z_{Nd} = π^{d(d−1)/4} ∏_{i=1}^{d} Γ((N + 1 − i)/2). Plugging 199 into 189, we have the joint posterior for m and Λ:

P(m, Λ|X, Π) = ( N^{d/2} |πS|^{N/2} / Z_{Nd} ) |Λ|^{−(d+1)/2} |Λ/2π|^{1/2} |Λ/2π|^{N/2} exp( −(N/2)(m − x̄)ᵀΛ(m − x̄) ) exp( −(1/2) tr(SΛ) )    (202)

= (1/Z_{Nd}) |Λ|^{−(d+1)/2} |SΛ/2|^{(N+1)/2} |N(πS)⁻¹|^{1/2} exp( −(N/2)(m − x̄)ᵀΛ(m − x̄) ) exp( −(1/2) tr(SΛ) )    (203)

= [ (1/Z_{Nd}) |Λ|^{−(d+1)/2} |SΛ/2|^{N/2} exp( −(1/2) tr(SΛ) ) ] · [ |NΛ/2π|^{1/2} exp( −(N/2)(m − x̄)ᵀΛ(m − x̄) ) ]    (204)

= N(m|x̄, (NΛ)⁻¹) W(Λ|S⁻¹, N)    if N > d    (205)–(206)
So finally, summing up:

P(m, Λ|X, Π) = P(m|Λ, X, Π) P(Λ|X, Π)    (207)
P(m|Λ, X, Π) = N(m|x̄, (NΛ)⁻¹)    (208)
P(Λ|X, Π) = W(Λ|S⁻¹, N)    if N > d    (209)
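A quick numerical check (ours, not part of [4]) of the posterior summary 207-209 on synthetic data; the data sizes and tolerances are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m_true = np.array([1.0, -2.0])
C_true = np.array([[2.0, 0.3], [0.3, 1.0]])   # true covariance, Lambda^-1
X = rng.multivariate_normal(m_true, C_true, size=2000)

N, d = X.shape
xbar = X.mean(axis=0)          # Eq. (190)
D = X - xbar
S = D.T @ D                    # Eq. (191)

# Posterior (207)-(209): Lambda ~ W(S^-1, N) and m | Lambda ~ N(xbar, (N Lambda)^-1).
# The Wishart mean N * S^-1 gives a point estimate of the true precision:
E_Lambda = N * np.linalg.inv(S)
```

With enough observations, xbar approaches m_true and E_Lambda approaches the inverse of the true covariance.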
Now, we can calculate the marginal posterior for the mean m:

P(m|X, Π) = ∫_Λ P(m, Λ|X, Π) dΛ    (210)

= [ α / P(X|Π) ] ∫_Λ |Λ|^{−(d+1)/2} N^{−d/2} |Λ/2π|^{N/2} exp( −(1/2) tr(SΛ) ) |NΛ/2π|^{1/2} exp( −(N/2)(m − x̄)ᵀΛ(m − x̄) ) dΛ    (211)

= [ α / P(X|Π) ] ∫_Λ |Λ|^{−(d+1)/2} |Λ/2π|^{(N+1)/2} exp( −(1/2) tr( (S + N(m − x̄)(m − x̄)ᵀ) Λ ) ) dΛ    (212)

= [ α Z_{(N+1)d} / P(X|Π) ] π^{−(N+1)d/2} |S + N(m − x̄)(m − x̄)ᵀ|^{−(N+1)/2}    (213)

= ( Z_{(N+1)d} / Z_{Nd} ) ( N^{d/2} / π^{(N+1)d/2} ) |πS|^{N/2} |S + N(m − x̄)(m − x̄)ᵀ|^{−(N+1)/2}    (214)

= [ Γ((N + 1)/2) / Γ((N + 1 − d)/2) ] ( N^{d/2} / π^{d/2} ) |S|^{−1/2} |I + N S⁻¹(m − x̄)(m − x̄)ᵀ|^{−(N+1)/2}    (215)–(216)

= [ Γ((N + 1)/2) / Γ((N + 1 − d)/2) ] |N S⁻¹/π|^{1/2} ( 1 + N(m − x̄)ᵀS⁻¹(m − x̄) )^{−(N+1)/2}    (217)

= TM(m|x̄, S/N, N + 1)    (218)

= T(m|x̄, S/(N(N + 1 − d)), N + 1 − d)    if N > d    (219)
where we have used the matrix relation in [8]:

|I + BC| = |I + CB|    (220)

So the marginal posterior of the mean is Student's t distributed. We have found several definitions of Student's t distribution in the literature. This is the definition in [4]:

TM(x|m, Λ⁻¹, N) = [ Γ(N/2) / Γ((N − d)/2) ] |Λ/π|^{1/2} ( 1 + (x − m)ᵀΛ(x − m) )^{−N/2}    (221)

This is the definition in [3], which seems more widespread:

T(x|m, Λ⁻¹, N) = [ Γ((N + d)/2) / Γ(N/2) ] |Λ/Nπ|^{1/2} ( 1 + (1/N)(x − m)ᵀΛ(x − m) )^{−(N+d)/2}    (222)
= TM(x|m, NΛ⁻¹, N + d)    (223)
References

[1] Niko Brummer and Edward de Villiers, "The Speaker Partitioning Problem," in Odyssey Speaker and Language Recognition Workshop, 2010.

[2] Niko Brummer, "Bayesian PLDA," 2010.

[3] Christopher Bishop, Pattern Recognition and Machine Learning, 2006.

[4] Thomas Minka, "Inferring a Gaussian distribution," 1998.

[5] Kaare Petersen and Michael Pedersen, The Matrix Cookbook, Technical University of Denmark, 2008.

[6] H.V. Henderson and S.R. Searle, "On Deriving the Inverse of a Sum of Matrices," SIAM Review, vol. 23, no. 1, pp. 53-60, 1981.

[7] M. Brookes, "The Matrix Reference Manual," 2005.

[8] Thomas Minka, "Old and New Matrix Algebra Useful for Statistics," 2000.