AN UTTERANCE COMPARISON MODEL FOR SPEAKER CLUSTERING USING FACTOR ANALYSIS

Woojay Jeon, Changxue Ma, and Dusan Macho
Motorola, Inc., Schaumburg, Illinois, U.S.A.

ABSTRACT

We propose a novel utterance comparison model based on probability theory and factor analysis that computes the likelihood of two speech utterances originating from the same speaker. The model depends only on a set of statistics extracted from each utterance and can efficiently compare utterances using these statistics without requiring the indefinite storage of speech features. We apply the model as a distance metric for speaker clustering in the CALLHOME telephone conversation corpus and achieve competitive results compared to three other known similarity measures: the Generalized Likelihood Ratio, the Cross-Likelihood Ratio, and the eigenvoice distance.

Index Terms— speaker clustering, speaker diarization, factor analysis

1. INTRODUCTION

Speaker diarization [1], also referred to as speaker segmentation, is the task of identifying "who spoke when" in an unlabelled, continuous conversation between two or more unknown speakers. State-of-the-art speaker diarization systems for single-channel audio often consist of two stages [1]:

1. An utterance segmentation stage, where a continuous recording is broken down into small "utterances," each supposedly containing speech from only one speaker, typically using some form of speaker-change detection, and
2. A speaker clustering stage, where a distance measure and clustering scheme are used to group similar utterances together, attributing them to the same speaker.

Assuming that speakers never talk over each other in the conversation, we have perfect diarization when each utterance is "speaker-pure" in the sense that it truly contains speech from only one speaker, and when there is a one-to-one correspondence between the utterances of the clusters and the utterances of the actual speakers. Hence, in such a system, the diarization accuracy depends on how well the speaker-change detection in the first stage produces pure utterances, and how reliable the distance measure and clustering method in the second stage are in producing the final result.

In this paper, we focus specifically on the speaker clustering stage and propose a new "utterance comparison model" for comparing two arbitrary utterances based on speaker similarity. Many comparison methods have been used in the past, including the generalized likelihood ratio [1], the cross-likelihood ratio [2], the eigenvoice distance [3], and more. However, many of these measures are somewhat ad hoc in their development and use. Recently, much work has been done on eigenvoices [4] and factor analysis [5] to accurately account for speaker and session variability in speaker models within a statistical framework. Eigenvoices are particularly efficient and effective in modeling speech utterances when the amount of data is too scarce for reliable MAP adaptation.

Because speaker diarization often has to be done using very short utterances (a few seconds or shorter), eigenvoices have proved to be quite effective for the task [6]. Our proposed model is rigorously derived from a very basic probability equation, to which we apply calculus and factor analysis to mold it into a practical and efficient form that can be computed fairly easily. Note that such an utterance comparison model could be used in any task requiring the comparison of arbitrary speech utterances, including speaker identification and speaker verification, but in this study we focus specifically on the speaker clustering aspect.

2. PROPOSED UTTERANCE COMPARISON MODEL

2.1. Basic formulation

Assume two arbitrary speech utterances, X_a of length A and X_b of length B, each utterance defined as a sequence of acoustic feature vectors originating from exactly one speaker:

X_a = \{x_{a,1}, x_{a,2}, \cdots, x_{a,A}\}, \quad X_b = \{x_{b,1}, x_{b,2}, \cdots, x_{b,B}\}   (1)

We begin by defining the hypothesis H_1 that X_a and X_b were uttered by the same speaker, and define the utterance comparison function as the posterior probability of H_1. Assuming we can obtain the posterior P(w_i | X) for every speaker w_i in the world for any given utterance X, an exact formula for this probability can be given:

P(H_1 | X_a, X_b) = \sum_{i=1}^{W} P(w_i | X_a) P(w_i | X_b)   (2)

where W is the population of the world. Of course, it is completely impractical to try to solve this equation directly, so we turn to eigenvoice theory and factor analysis, which allow us to approximate the GMM mean "supervector" (the mean vectors of all mixtures stacked into a single vector) s of a speaker model as [5]

s = m + V y, \quad y \sim N[0, I]   (3)

where m contains the mean parameters of a universal background model (UBM), V is an eigenvoice matrix, and y is a v-dimensional speaker factor vector with a unit Gaussian distribution. Now, if we assume each speaker w_i is mapped to a unique v-dimensional speaker factor vector y_i, the summation in (2) can be rewritten as

P(H_1 | X_a, X_b) = \sum_{i=1}^{W} P(y_i | X_a) P(y_i | X_b)   (4)
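Before continuing, the following toy sketch makes the eigenvoice model of (3) concrete. The sizes match the experiment section (256 Gaussians, 26-dimensional features, 20 speaker factors), but the UBM means and eigenvoice matrix are random stand-ins for illustration only, not trained parameters.

import numpy as np

# Toy illustration of Eq. (3) with illustrative sizes: M Gaussians of dimension d
# give a supervector of length M*d, and v speaker factors.
M, d, v = 256, 26, 20
rng = np.random.default_rng(0)

m = rng.normal(size=M * d)            # UBM mean supervector (stacked mixture means)
V = rng.normal(size=(M * d, v))       # eigenvoice matrix
y = rng.standard_normal(v)            # speaker factor vector, y ~ N(0, I)

s = m + V @ y                         # adapted mean supervector, Eq. (3)
mean_0 = s.reshape(M, d)[0]           # adapted mean of Gaussian k = 0, i.e. m_0 + V_0 y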

Equation (4) is still impractical, and we now mold it into a more analytical form. Assuming each d'th dimension of y, denoted by y(d), is within the range [-R, R] for some very large real number R, we can partition the summation space into small volumes of length T on each side. Consider a small volume V(t_1, \cdots, t_v) bounded by the two extreme points (t_1 T, \cdots, t_v T) and (t_1 T + T, \cdots, t_v T + T). If the volume is sufficiently small, all vectors y_i that happen to lie in this volume will be similar to each other, and therefore we can approximate the value of P(y_i | X_a) for each y_i in the volume by the average value over such points:

P(y_i | X_a) \approx \frac{1}{|\{j : y_j \in V(t_1, \cdots, t_v)\}|} \sum_{y_j \in V(t_1, \cdots, t_v)} P(y_j | X_a)
             = \frac{P(y \in V(t_1, \cdots, t_v) | X_a)}{n(y \in V(t_1, \cdots, t_v))}   for each y_i \in V(t_1, \cdots, t_v)   (5)

where y \in V(t_1, \cdots, t_v) means y(1) \in [t_1 T, t_1 T + T), \cdots, y(v) \in [t_v T, t_v T + T), |\cdot| indicates set cardinality, and n(\cdot) indicates the number of speakers in the world whose speaker factors satisfy the ranges specified in (\cdot). Hence, the summation in (4) can be rewritten as summations over all such volumes, using the average P(y | X_a) P(y | X_b) for each volume and multiplying it by the number of speakers in that volume:

P(H_1 | X_a, X_b) \approx \sum_{t_1 = -R/T}^{+R/T} \cdots \sum_{t_v = -R/T}^{+R/T} \frac{P(y \in V(t_1, \cdots, t_v) | X_a)}{n(y \in V(t_1, \cdots, t_v))} \cdot \frac{P(y \in V(t_1, \cdots, t_v) | X_b)}{n(y \in V(t_1, \cdots, t_v))} \cdot n(y \in V(t_1, \cdots, t_v))   (6)

Note that

n(y \in V(t_1, \cdots, t_v)) \approx W \cdot P(y \in V(t_1, \cdots, t_v))   (7)

Also, for sufficiently small T, probability theory tells us that

P(y \in V(t_1, \cdots, t_v)) \approx p_Y(t_1 T, \cdots, t_v T) \, T^v   (8)

where the speaker factor vector y is now continuous instead of discrete, and the probability P(\cdot) is now a probability density function p(\cdot). We have

P(H_1 | X_a, X_b) \approx \sum_{t_1 = -R/T}^{+R/T} \cdots \sum_{t_v = -R/T}^{+R/T} \frac{p_Y(t_1 T, \cdots, t_v T | X_a) \, T^v}{W \cdot p_Y(t_1 T, \cdots, t_v T) \, T^v} \cdot \frac{p_Y(t_1 T, \cdots, t_v T | X_b) \, T^v}{W \cdot p_Y(t_1 T, \cdots, t_v T) \, T^v} \cdot W \cdot p_Y(t_1 T, \cdots, t_v T) \, T^v   (9)

By the definition of the Riemann integral, we see that, for sufficiently small T, we have

P(H_1 | X_a, X_b) \approx \frac{1}{W} \int_{-R}^{R} \cdots \int_{-R}^{R} \frac{p_{Y|X}(y_1, \cdots, y_v | X_a)}{p_Y(y_1, \cdots, y_v)} \cdot p_{Y|X}(y_1, \cdots, y_v | X_b) \, dy_1 \cdots dy_v   (10)

Assuming R is extremely large, we can set the integration limits to -\infty and \infty. The above equation can now be written as

P(H_1 | X_a, X_b) \approx \frac{1}{W} \int_{-\infty}^{\infty} \frac{p(y | X_a) \, p(y | X_b)}{p(y)} \, dy
                  = \frac{1}{W} \frac{1}{p(X_a)} \frac{1}{p(X_b)} \int_{-\infty}^{\infty} p(X_a | y) \, p(X_b | y) \, p(y) \, dy   (11)

2.2. Derivation of Closed Form Using Factor Analysis

Let N(x; m, C) denote the d-dimensional Gaussian density function for observation x with mean m and covariance matrix C. It is easy to see that

N(x; m_1 + m_2, C) = N(m_1; x - m_2, C)   (12)

We also state without proof the following identity:

N(A_{a \times d} x_{d \times 1}; m_1, C_1) \, N(B_{b \times d} x; m_2, C_2) = (2\pi)^{-(a+b-d)/2} \left( \frac{|C_1| |C_2|}{|D|} \right)^{-1/2} N(x; Dd, D) \cdot \exp\left\{ -\frac{1}{2} \left[ -d^T D d + m_1^T C_1^{-1} m_1 + m_2^T C_2^{-1} m_2 \right] \right\}   (13)

where

D_{d \times d}^{-1} = A^T C_1^{-1} A + B^T C_2^{-1} B, \quad D = D^T, \quad d_{d \times 1} = A^T C_1^{-1} m_1 + B^T C_2^{-1} m_2   (14)
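Identity (13) is central to the derivation that follows, so a quick numerical spot-check may be useful. In the sketch below, all matrices are small random test inputs of our own choosing (not quantities from the paper), and the right-hand side is assembled from the definitions in (14).

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
a, b, d = 3, 4, 2
A = rng.normal(size=(a, d))
B = rng.normal(size=(b, d))
m1, m2 = rng.normal(size=a), rng.normal(size=b)
C1 = np.diag(rng.uniform(0.5, 2.0, size=a))
C2 = np.diag(rng.uniform(0.5, 2.0, size=b))
x = rng.normal(size=d)

# Left-hand side of (13)
lhs = mvn.pdf(A @ x, m1, C1) * mvn.pdf(B @ x, m2, C2)

# Right-hand side of (13), with D and d built from the definitions in (14)
C1inv, C2inv = np.linalg.inv(C1), np.linalg.inv(C2)
Dmat = np.linalg.inv(A.T @ C1inv @ A + B.T @ C2inv @ B)
dvec = A.T @ C1inv @ m1 + B.T @ C2inv @ m2
const = (2 * np.pi) ** (-(a + b - d) / 2) \
        * (np.linalg.det(C1) * np.linalg.det(C2) / np.linalg.det(Dmat)) ** -0.5
expo = np.exp(-0.5 * (-dvec @ Dmat @ dvec + m1 @ C1inv @ m1 + m2 @ C2inv @ m2))
rhs = const * mvn.pdf(x, Dmat @ dvec, Dmat) * expo

assert np.isclose(lhs, rhs)   # both sides agree to numerical precision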

For an utterance X with A feature vectors, we apply (3) to write

p(X | y) = \prod_{t=1}^{A} p(x_t | y) = \prod_{t=1}^{A} \sum_{k=1}^{M} c_k N(x_t; m_k + V_k y, C_k)   (15)

where c_k, m_k, and C_k are the weight, mean vector, and covariance matrix, respectively, of the k'th Gaussian in the GMM admitted by y, M is the total number of Gaussians, and V_k is the sub-matrix of the eigenvoice matrix V in (3) that corresponds to the k'th Gaussian. Note that V = [V_1^T | V_2^T | \cdots | V_M^T]^T. (15) is too difficult to manage analytically. To simplify, we assume each observation is "generated" by only one Gaussian, i.e.,

p(X | y) = \prod_{t=1}^{A} N(x_t; m_t + V_t y, C_t) = \prod_{t=1}^{A} N(V_t y; x_t - m_t, C_t)   (16)

where m_t, V_t, and C_t are the d \times 1 mean vector, d \times v eigenvoice matrix, and d \times d covariance matrix pertaining to the Gaussian that "generated" x_t, respectively, and we have applied (12). There can be a number of ways to decide which Gaussian to use for each x_t. One way is to obtain the speaker factors for utterance X via maximum likelihood estimation using the method described in [7], and then to find, for each x_t, the Gaussian with the maximum "occupation" probability, \arg\max_m \gamma_m.

By continuously applying (13) to pairs of Gaussian terms in (16), it is not difficult to show that

p(X | y) = \alpha(X) \, N(y; D_A d_A, D_A)   (17)

where

\alpha(X) = (2\pi)^{-(Ad-v)/2} \left( \frac{1}{|D_A|} \prod_{t=1}^{A} |C_t| \right)^{-1/2} \cdot \exp\left\{ -\frac{1}{2} \left[ -d_A^T D_A d_A + f_A \right] \right\}   (18)

where

D_A^{-1} = \sum_{t=1}^{A} V_t^T C_t^{-1} V_t, \quad d_A = \sum_{t=1}^{A} V_t^T C_t^{-1} (x_t - m_t), \quad f_A = \sum_{t=1}^{A} (x_t - m_t)^T C_t^{-1} (x_t - m_t)   (19)

This lets us write

p(X) = \int_{-\infty}^{+\infty} p(X | y) \, p(y) \, dy = \int_{-\infty}^{+\infty} \alpha(X) \, N(y; D_A d_A, D_A) \, N(y; 0, I) \, dy = \alpha(X) \beta(X) \int_{-\infty}^{+\infty} N(y; J_A d_A, J_A) \, dy = \alpha(X) \beta(X)   (20)

where

\beta(X) = (2\pi)^{-v/2} \left( \frac{|D_A|}{|J_A|} \right)^{-1/2} \cdot \exp\left\{ -\frac{1}{2} \left[ -d_A^T J_A d_A + d_A^T D_A d_A \right] \right\}, \quad J_A^{-1} = D_A^{-1} + I   (21)

Hence, (11) becomes

P(H_1 | X_a, X_b) \approx \frac{1}{W} \frac{1}{p(X_a)} \frac{1}{p(X_b)} \int_{-\infty}^{\infty} p(X_a | y) \, p(X_b | y) \, p(y) \, dy
= \frac{1}{W} \frac{1}{\alpha(X_a) \beta(X_a)} \frac{1}{\alpha(X_b) \beta(X_b)} \, \alpha(X_a) \, \alpha(X_b) \int_{-\infty}^{\infty} N(y; D_A d_A, D_A) \, N(y; D_B d_B, D_B) \, N(y; 0, I) \, dy
= \frac{(2\pi)^{-v/2}}{W \beta(X_a) \beta(X_b)} \left( \frac{|D_A| |D_B|}{|D_C|} \right)^{-1/2} \exp\left\{ -\frac{1}{2} \left[ -d_C^T D_C d_C + d_A^T D_A d_A + d_B^T D_B d_B \right] \right\} \int_{-\infty}^{\infty} N(y; D_C d_C, D_C) \, N(y; 0, I) \, dy   (22)

where

D_C^{-1} = D_A^{-1} + D_B^{-1}, \quad d_C = d_A + d_B   (23)

Combining the last two Gaussians again via (13), we get

P(H_1 | X_a, X_b) = \frac{(2\pi)^{-v}}{W \beta(X_a) \beta(X_b)} \left( \frac{|D_A| |D_B|}{|D|} \right)^{-1/2} \exp\left\{ -\frac{1}{2} \left[ -d^T D d + d_A^T D_A d_A + d_B^T D_B d_B \right] \right\}   (24)

where we have used the fact that \int_{-\infty}^{+\infty} N(y; Dd, D) \, dy = 1 and we have defined

D^{-1} = I + D_A^{-1} + D_B^{-1}, \quad d = d_C   (25)

The final form is

P(H_1 | X_a, X_b) = \frac{1}{W} \left( \frac{|J_A| |J_B|}{|D|} \right)^{-1/2} \exp\left\{ -\frac{1}{2} \left[ -d^T D d + d_A^T J_A d_A + d_B^T J_B d_B \right] \right\}   (26)

where (omitting D_B, d_B, and J_B, for which equivalent expressions can easily be obtained)

D_A^{-1} = \sum_{t=1}^{A} V_{a,t}^T C_{a,t}^{-1} V_{a,t}, \quad J_A = \left( I + D_A^{-1} \right)^{-1}, \quad d_A = \sum_{t=1}^{A} V_{a,t}^T C_{a,t}^{-1} (x_{a,t} - m_{a,t}), \quad D = \left( J_A^{-1} + J_B^{-1} - I \right)^{-1}, \quad d = d_A + d_B   (27)

The computation of (26) is fairly straightforward. The matrices V_k^T C_k^{-1} and V_k^T C_k^{-1} V_k pertaining to the k'th Gaussian can be precomputed offline for k = 1, 2, \cdots, M. For each input utterance, we can use ML estimation of the speaker factors to obtain a "mixture sequence," i.e., a sequence of indices indicating which Gaussian "generated" each feature vector, and then use the precomputed matrices to quickly compute J_A^{-1} and d_A in (27). These two items serve as representative statistics of the speech utterance, and they are all that is needed to compute the probability in (26). In a memory-constrained environment, one may discard the feature vectors and keep only the statistics for all future comparisons with other utterances.
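As an illustration of this procedure, the following sketch accumulates the statistics of (27) and evaluates (26). It is only a sketch under stated assumptions: the per-Gaussian quantities V_k, C_k^{-1}, m_k and the per-frame mixture sequence are assumed to be available already, all function and variable names are ours rather than code from the paper, and evaluating the score in the log domain (treating the unknown 1/W as a constant offset) is our own choice for numerical stability.

import numpy as np

def utterance_stats(X, mix_seq, V, Cinv, m):
    """Accumulate the per-utterance statistics of Eq. (27).

    X       : (A, d) array of feature vectors for one utterance
    mix_seq : length-A "mixture sequence" of Gaussian indices (e.g. from the
              maximum-occupation rule described in the text)
    V, Cinv, m : per-Gaussian eigenvoice sub-matrices (d x v), inverse covariances
                 (d x d), and UBM mean vectors (d,), indexed by Gaussian index k
    Returns (Jinv, d_vec) = (J_A^{-1}, d_A). The products V_k^T C_k^{-1} and
    V_k^T C_k^{-1} V_k could equally be precomputed offline, as the text suggests.
    """
    v_dim = V[0].shape[1]
    Dinv = np.zeros((v_dim, v_dim))
    d_vec = np.zeros(v_dim)
    for x_t, k in zip(X, mix_seq):
        VtC = V[k].T @ Cinv[k]                 # V_t^T C_t^{-1}
        Dinv += VtC @ V[k]                     # accumulates D_A^{-1}
        d_vec += VtC @ (x_t - m[k])            # accumulates d_A
    Jinv = np.eye(v_dim) + Dinv                # J_A^{-1} = I + D_A^{-1}
    return Jinv, d_vec

def log_same_speaker_score(stats_a, stats_b):
    """log of Eq. (26), up to the constant -log W (an unknown offset shared by all pairs)."""
    Jinv_a, d_a = stats_a
    Jinv_b, d_b = stats_b
    J_a = np.linalg.inv(Jinv_a)
    J_b = np.linalg.inv(Jinv_b)
    D = np.linalg.inv(Jinv_a + Jinv_b - np.eye(len(d_a)))   # D = (J_A^{-1} + J_B^{-1} - I)^{-1}
    d = d_a + d_b
    log_det_term = -0.5 * (np.linalg.slogdet(J_a)[1]
                           + np.linalg.slogdet(J_b)[1]
                           - np.linalg.slogdet(D)[1])
    quad_term = 0.5 * (d @ D @ d - d_a @ J_a @ d_a - d_b @ J_b @ d_b)
    return log_det_term + quad_term

Because the dropped 1/W term is identical for every pair of utterances, it only shifts all pairwise scores by the same constant; it therefore does not affect which pairs are judged most similar, and any tuned stopping threshold simply absorbs it.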

3. EXPERIMENT

We used the proposed utterance comparison model in (26) and (27) to perform speaker clustering on the CALLHOME database, a set of 680 recorded telephone conversations involving two or more speakers. For each conversation, we used the transcription provided with the data to divide it into a set of "speaker-pure" utterances. On occasions when speakers from both channels spoke simultaneously, we used the speech from only one of the channels to ensure that each utterance contains speech from exactly one speaker. Note that the purpose of this experiment is only to evaluate the speaker-based comparison and clustering capability of the proposed utterance comparison model. To use the model in a complete automatic speaker diarization system, a state-of-the-art speaker change detector and utterance segmenter would be needed to obtain such utterances automatically.

The acoustic features used for this experiment consisted of 12 MFCC coefficients plus energy and their delta coefficients, resulting in 26 coefficients. A harmonicity-based voice activity detector was used to drop non-speech frames. The universal background model (UBM) was trained using conventional Expectation-Maximization, and the set of eigenvoices was trained using Principal Component Analysis (PCA) on MAP (maximum a posteriori)-adapted speaker-dependent GMMs, all from the CALLFRIEND and Switchboard I databases. 256 Gaussians were used for all GMMs, and the number of eigenvoices was limited to 20 following [6]. Agglomerative hierarchical clustering was used, and the similarity between pairs of clusters was computed using the union of the feature vectors they contained.

We jointly use cluster purity and speaker number accuracy to evaluate performance. Cluster purity is defined as [7]

I_{PUR} = \frac{1}{n} \sum_{i=1}^{N_C} \sum_{j=1}^{N_S} \frac{n_{ij}^2}{n_i}   (28)

Table 2. Number of conversations and clustering accuracy of the proposed model for varying numbers of speakers in the CALLHOME corpus

Speakers   2      3      4      5      6      7
N          490    129    44     11     4      2
I_PUR      89.10  84.07  75.72  73.71  76.30  76.12
I_SPK      90.31  85.01  80.68  76.36  95.83  64.29

Table 1. Clustering accuracy for CALLHOME utterances

         Proposed  CLR    GLR    EV
I_PUR    86.82     85.50  83.70  48.60
I_SPK    81.97     74.21  64.06  64.24

where N_C is the number of resulting clusters, N_S is the actual number of speakers, n_i is the number of utterances in the i'th cluster, n_{ij} is the number of utterances by speaker j in cluster i, and n is the total number of utterances. When the clustering is done perfectly, we have N_C = N_S and n_{ij} = n_i when i = j and n_{ij} = 0 when i \neq j, resulting in a purity of 1. Note, however, that the purity will also be 1 if each cluster contains only one (speaker-pure) utterance. Therefore, the purity measure only makes sense when used in conjunction with a measure of how accurately the number of speakers was estimated, as was done in other experiments [2, 7]. We use the following measure for the overall speaker number accuracy [2]:

I_{SPK} = 1 - \frac{\sum_{k=1}^{N} |N_{S,k} - N_{C,k}|}{\sum_{k=1}^{N} N_{S,k}}   (29)

where N_{S,k} is the number of speakers in the k'th conversation, N_{C,k} is the corresponding number of clusters, and N is the total number of conversations. Since the cluster purity and speaker number accuracy vary according to the threshold value used as the stopping criterion, we searched for the case where I_{PUR}^2 + I_{SPK}^2 was maximized. Tab. 1 shows the results.
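For concreteness, here is a minimal sketch of both evaluation measures, computed from per-utterance reference speaker labels and hypothesized cluster labels. The input format and variable names are our own assumptions, and how the per-conversation purities are aggregated over the corpus is not spelled out here, so the mean in the example is only an illustration.

import numpy as np

def cluster_purity(ref_speakers, hyp_clusters):
    """Cluster purity, Eq. (28): (1/n) * sum_i sum_j n_ij^2 / n_i."""
    ref = np.asarray(ref_speakers)
    hyp = np.asarray(hyp_clusters)
    n = len(ref)
    total = 0.0
    for c in np.unique(hyp):
        members = ref[hyp == c]                        # reference speakers of utterances in cluster i
        n_i = len(members)                             # n_i
        _, n_ij = np.unique(members, return_counts=True)
        total += np.sum(n_ij.astype(float) ** 2) / n_i
    return total / n

def speaker_number_accuracy(conversations):
    """Speaker number accuracy, Eq. (29), over a list of (ref_speakers, hyp_clusters) pairs."""
    n_s = np.array([len(set(ref)) for ref, _ in conversations])   # N_{S,k}
    n_c = np.array([len(set(hyp)) for _, hyp in conversations])   # N_{C,k}
    return 1.0 - np.abs(n_s - n_c).sum() / n_s.sum()

# Toy example with two conversations (labels are illustrative only):
convs = [(["s1", "s1", "s2", "s2"], [0, 0, 1, 1]),
         (["s1", "s2", "s2"], [0, 1, 1])]
i_pur = np.mean([cluster_purity(r, h) for r, h in convs])   # per-conversation mean (our choice)
i_spk = speaker_number_accuracy(convs)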

For comparison, we performed the same clustering experiment on the same utterances using three other cluster similarity measures. The "cross-likelihood ratio" (CLR) between some cluster a and some cluster b is [2]

CLR_{a,b} = \log \frac{P(X_a | \lambda_b)}{P(X_a | \lambda_a)} + \log \frac{P(X_b | \lambda_a)}{P(X_b | \lambda_b)}   (30)

where \lambda_a and \lambda_b are the model parameters of clusters a and b. A previous study used either GMMs or a VQ model [2]; here, we simply used the speaker factor vector y obtained via ML estimation [4] to build an adapted GMM for each utterance via (3). The generalized likelihood ratio (GLR) is defined as

GLR_{a,b} = \log \frac{P(X_a, X_b | \lambda_{a,b})}{P(X_a | \lambda_a) \, P(X_b | \lambda_b)}   (31)

where \lambda_{a,b} is a single set of model parameters (a full-covariance single Gaussian, following [7]) trained on both X_a and X_b. We also tried a rudimentary Euclidean distance in eigenvoice space between speaker factor vectors [3]:

EV_{a,b} = |y_a - y_b|   (32)
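The three baseline measures can be sketched as follows. The sketch assumes a hypothetical helper utt_loglik(X, model) that returns log P(X | model) for a feature matrix X under an adapted GMM (the paper builds these GMMs from ML-estimated speaker factors via (3)); the full-covariance single-Gaussian GLR and the variable names are our own reading of (30)-(32), not code from the paper.

import numpy as np
from scipy.stats import multivariate_normal

def clr(Xa, Xb, model_a, model_b, utt_loglik):
    """Cross-likelihood ratio, Eq. (30). `utt_loglik(X, model)` is a hypothetical helper."""
    return (utt_loglik(Xa, model_b) - utt_loglik(Xa, model_a)
            + utt_loglik(Xb, model_a) - utt_loglik(Xb, model_b))

def _gauss_loglik(X, mean, cov):
    # log-likelihood of all rows of X under one full-covariance Gaussian
    return multivariate_normal.logpdf(X, mean, cov, allow_singular=True).sum()

def glr(Xa, Xb):
    """Generalized likelihood ratio, Eq. (31), with ML-fitted full-covariance Gaussians."""
    Xab = np.vstack([Xa, Xb])
    num = _gauss_loglik(Xab, Xab.mean(axis=0), np.cov(Xab.T, bias=True))
    den = (_gauss_loglik(Xa, Xa.mean(axis=0), np.cov(Xa.T, bias=True))
           + _gauss_loglik(Xb, Xb.mean(axis=0), np.cov(Xb.T, bias=True)))
    return num - den

def ev_distance(ya, yb):
    """Euclidean distance between speaker factor vectors, Eq. (32)."""
    return np.linalg.norm(np.asarray(ya) - np.asarray(yb))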

As seen in Tab. 1, the proposed utterance comparison model gave the best overall performance for the task. All methods required a stopping criterion for the clustering, and the most widely used is the Bayesian Information Criterion (BIC). Here, however, we simply applied a single threshold to the cluster similarities at each iteration to decide when to stop, varying the threshold over a range of values to find the one giving the best overall performance for each of the four cases. The fact that the proposed model gives the best overall performance in Tab. 1 implies that not only is it a reliable distance metric, it is also statistically stable enough for such simplistic thresholding to work. The celebrated GLR had a low speaker number accuracy because it only worked well for relative comparisons between clusters and not as an absolute measure to which a global threshold can be applied. The low performance of the EV distance shows that extracting the speaker factors alone is not enough to perform reliable speaker clustering. CLR was the most computationally complex, as it required GMM decoding over whole utterances every time it was computed, in contrast to the proposed model, which requires only the auxiliary statistics of each utterance in a single nonlinear equation whose cost is independent of utterance length. Because the CALLHOME database is somewhat imbalanced in that the majority of conversations are between two people, Tab. 2 shows the number of conversations and the best clustering accuracy of the proposed model for each conversation group.
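The threshold-stopped clustering loop just described can be sketched as a greedy merge procedure. The helper names are ours; log_same_speaker_score refers to the earlier sketch of (26) rather than to any released code, and pooling cluster statistics by simple addition is a consequence of the frame-wise sums in (27), not a step spelled out in the paper.

import numpy as np

def merge_stats(stats_a, stats_b):
    """Pool the sufficient statistics (J^{-1}, d) of two clusters.

    Since D^{-1} and d in Eq. (27) are sums over frames, the statistics of the union
    of two utterance sets are J^{-1} = J_a^{-1} + J_b^{-1} - I and d = d_a + d_b.
    """
    (Jinv_a, d_a), (Jinv_b, d_b) = stats_a, stats_b
    return Jinv_a + Jinv_b - np.eye(len(d_a)), d_a + d_b

def agglomerative_cluster(stats, similarity, threshold):
    """Greedy agglomerative clustering with a fixed stopping threshold.

    stats      : list of per-utterance statistics (J^{-1}, d)
    similarity : pairwise score function, e.g. log_same_speaker_score above
    threshold  : stop merging once the best pairwise similarity drops below this value
    Returns a list of clusters, each a list of original utterance indices.
    """
    clusters = [[i] for i in range(len(stats))]
    cstats = list(stats)
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(cstats[i], cstats[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)                     # merge cluster j into cluster i
        cstats[i] = merge_stats(cstats[i], cstats.pop(j))  # pool their statistics
    return clusters

The same greedy loop can be driven by the baseline measures instead, with the per-cluster statistics replaced by pooled feature matrices and the similarity function swapped accordingly.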

4. CONCLUSION AND FUTURE WORK

We proposed a new speech utterance comparison model for speaker clustering, rigorously derived from probability theory using factor analysis and eigenvoices. The model is easy to implement and efficient to compute because it relies only on a set of representative statistics extracted from each speech utterance rather than the utterance itself. The model was shown to give robust performance in a speaker clustering task compared to three other known methods. More extensive comparisons with other state-of-the-art distance metrics will be made in the future, as well as an investigation of how the Bayesian Information Criterion and other stopping criteria relate to the proposed similarity measure.

5. REFERENCES

[1] S. E. Tranter and D. A. Reynolds, "An overview of automatic speaker diarization systems," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 5, Sept. 2006.
[2] M. Nishida and T. Kawahara, "Speaker model selection based on the Bayesian information criterion applied to unsupervised speaker indexing," IEEE Trans. ASLP, vol. 13, no. 4, July 2005.
[3] R. Faltlhauser and G. Ruske, "Robust speaker clustering in eigenspace," in Proc. IEEE Workshop on ASRU, 2001.
[4] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Audio, Speech, and Language Processing, vol. 8, no. 6, Nov. 2000.
[5] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of interspeaker variability in speaker verification," IEEE Trans. ASLP, vol. 16, no. 5, July 2008.
[6] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Stream-based speaker segmentation using speaker factors and eigenvoices," in Proc. IEEE ICASSP, 2008.
[7] W.-H. Tsai, S.-S. Cheng, and H.-M. Wang, "Automatic speaker clustering using a voice characteristic reference space and maximum purity estimation," IEEE Trans. ASLP, vol. 15, no. 4, 2007.
