Fast Clustering of Gaussians and the Virtue of Representing Gaussians in Exponential Model Format Peder A. Olsen and Karthik Visweswariah IBM T.J. Watson Research Center {pederao,kv1}@us.ibm.com

Abstract This paper aims to show the power and versatility of exponential models by focusing on exponential model representations of Gaussian Mixture Models (GMMs). In a recent series of papers by several authors, GMMs of varying structure and complexity have been considered. These GMMs can all be readily represented as exponential models and oftentimes favorably so. This paper shows how the exponential model representation can offer useful insight even in the case of diagonal and full covariance GMMs! The power of the exponential model is illustrated by proving the concavity of the log det function and also by discovering how to speed up diagonal covariance gaussian clustering.

1. Introduction

The exponential model can be represented in several different forms. The form preferred in this paper is the one promoted for discrete probability distributions in [1, 2]. The exponential model with parameters θ ∈ R^D and features f : R^d → R^D is written

P(x; θ) = (1/Z(θ)) exp(θ^T f(x)),    (1)

where Z(θ) = ∫_{R^d} exp(θ^T f(x)) dx is the normalizer of the exponential distribution. Sometimes we shall drop the normalizer entirely and merely write

P(x; θ) = exp(θ̃^T f̃(x)),    (2)

with the understanding that the features and model parameters have been extended as follows:

f̃(x) = (1, f(x)^T)^T   and   θ̃ = (−log Z(θ), θ^T)^T.

When the exponential model is represented using (2), the computation of the likelihood is just an inner product between the extended feature vector f̃(x) and the extended model parameters θ̃. For programming in C++ or a similar language this representation is quite convenient. The implementation of the likelihood is simple, code can be shared between different exponential models, and library routines can be used to compute the vector inner product. The last point is particularly interesting if we are concerned with the efficiency of implementing the likelihood computation.
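As a small illustration of the preceding paragraph, the sketch below evaluates a log-likelihood as a single dot product between extended parameters and extended features. It uses diagonal-covariance Gaussian features; the helper names and the cross-check against scipy are our own choices, not part of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal  # only used to cross-check

def extended_features(x):
    # f~(x) = (1, x, -1/2 x^2) for a diagonal-covariance Gaussian
    return np.concatenate(([1.0], x, -0.5 * x**2))

def extended_params(mu, v):
    # theta~ = (-log Z, psi, p) with p = 1/v and psi = mu / v
    p = 1.0 / v
    psi = mu * p
    log_z = 0.5 * np.sum(np.log(2.0 * np.pi * v) + mu**2 / v)
    return np.concatenate(([-log_z], psi, p))

d = 3
rng = np.random.default_rng(0)
mu, v = rng.normal(size=d), rng.uniform(0.5, 2.0, size=d)
x = rng.normal(size=d)

loglik = extended_params(mu, v) @ extended_features(x)   # one dot product
reference = multivariate_normal(mean=mu, cov=np.diag(v)).logpdf(x)
assert np.isclose(loglik, reference)
```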

1.1. Diagonal and Full Covariance Models

To see some examples of normal distributions represented in exponential model format we consider the full and diagonal covariance Gaussian. The full covariance model

N(x; µ, Σ) = exp(−½ (x − µ)^T Σ^{-1} (x − µ)) / ((2π)^{d/2} det(Σ)^{1/2})

can be written in exponential model format by introducing linear and quadratic exponential model parameters. The quadratic parameters are the precision parameters P = Σ^{-1} and the linear parameters are ψ = Pµ. Write vec(S) for the entries of the upper triangular portion of a symmetric matrix S in some fixed order, with the off-diagonal elements multiplied by √2. The features and exponential model parameters for the full covariance model are

f_f(x) = (x^T, −½ vec(xx^T)^T)^T   and   θ_f = (ψ^T, vec(P)^T)^T.

The corresponding normalizer is given by

Z(θ) = (2π)^{d/2} det(P)^{-1/2} exp(½ ψ^T P^{-1} ψ).
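The conversion from (µ, Σ) to the exponential parameters (ψ, vec(P)) and the evaluation of log Z(θ) are a few lines of code. The sketch below is one possible implementation under the √2 off-diagonal convention described above; the function names are illustrative and the scipy call is only a cross-check.

```python
import numpy as np
from scipy.stats import multivariate_normal  # only for the cross-check

def sym_vec(S):
    # vec(S): upper-triangular entries in a fixed order, off-diagonals scaled by
    # sqrt(2), so that sym_vec(A) @ sym_vec(B) = trace(A @ B) for symmetric A, B.
    iu = np.triu_indices(S.shape[0])
    return np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0)) * S[iu]

def full_cov_exponential_params(mu, Sigma):
    # (psi, vec(P)) and log Z(theta) for a full covariance Gaussian.
    P = np.linalg.inv(Sigma)
    psi = P @ mu
    d = len(mu)
    log_z = 0.5 * (d * np.log(2.0 * np.pi) - np.linalg.slogdet(P)[1] + psi @ Sigma @ psi)
    return psi, sym_vec(P), log_z

def full_cov_features(x):
    return np.concatenate((x, sym_vec(-0.5 * np.outer(x, x))))

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
mu, Sigma = rng.normal(size=d), A @ A.T + np.eye(d)
x = rng.normal(size=d)

psi, vecP, log_z = full_cov_exponential_params(mu, Sigma)
theta = np.concatenate((psi, vecP))
loglik = theta @ full_cov_features(x) - log_z
assert np.isclose(loglik, multivariate_normal(mean=mu, cov=Sigma).logpdf(x))
```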

For a diagonal covariance model, where Σ_ij = 0 for i ≠ j, we simply omit the off-diagonal parameters, so that

f_d(x) = (x^T, −½ diag(xx^T)^T)^T   and   θ_d = (ψ^T, diag(P)^T)^T.

When diagonal covariance models are not represented in exponential form we write v = diag(Σ) for the variances.

1.2. Other Gaussians that are Exponential Models in Disguise

A number of papers on speech recognition have considered various forms of GMMs with varying restrictions on the covariance models.

Each of these may be represented in exponential model format, though some may be slightly disguised. A number of models have linear features x, but different quadratic features that correspond to some tying of the covariance parameters. Here is a list of some exponential models that correspond to some form of gaussian model tying:

1. The semi-tied covariance model, [3], also known as the Maximum Likelihood Linear Transforms (MLLT) model, [4], has quadratic features (a_k^T x)², k = 1, . . . , d, where a_k ∈ R^d are parameters shared among all gaussians in a possibly large GMM.

2. The Extended Maximum Likelihood Linear Transforms (EMLLT) model considers a larger number of features (a_k^T x)², k = 1, . . . , D, D ≥ d, [5, 6].

3. The Subspace Precision And Mean (SPAM) model allows for general quadratic features x^T S_k x, k = 1, . . . , D, S_k ∈ R^{d×d}, 1 ≤ D ≤ d², [7, 8]. The SPAM model also allows for more general linear features Lx, L ∈ R^{M×d}, 1 ≤ M ≤ d.

4. The most general exponential model approximating a full covariance model known to us at the moment is the Subspace Covariance Gaussian Mixture Model (SCGMM). In the SCGMM model the features lie in a general subspace of the full covariance features; f_S(x) = B f_f(x), B ∈ R^{D×d(d+3)/2}, [7, 9].

Finally we mention that some authors have considered even more general features in the setting of GMMs. By considering full covariance features f_f(z) with z = Ay, A ∈ R^{d×d}, and

y_1 = x_1
y_2 = x_2 + h_2(x_1)
y_3 = x_3 + h_3(x_1, x_2)
...
y_d = x_d + h_d(x_1, x_2, . . . , x_{d−1}),

we can explicitly model the data with an exponential model whose features are more general than linear and quadratic features. With h_j, j = 2, . . . , d, being quadratic functions we are in effect using an exponential model with features that are quartic polynomials, [10]. [11] considers features that are not polynomials in x.

For all of these exponential models the art of the modeling lies in choosing the parameters that determine the exponential features; an optimization package will generally do the trick. If the models are represented in the exponential model format, then much of the source code for maximum likelihood estimation, gaussian clustering and likelihood evaluation can be shared across all the exponential models, as the sketch below suggests. We will show how clustering works in the case of diagonal covariance modeling. The exponential model formulation readily generalizes to all the other models, whereas the standard variance and mean formulation does not.
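To make the code-sharing point concrete, here is a minimal sketch of one way such sharing could be organized: every model variant supplies only its extended feature map, while likelihood evaluation is written once against the extended parameters. The interface and class names are our own illustration, not the authors' implementation.

```python
import numpy as np

class ExponentialModel:
    """Shared machinery: the log-likelihood is theta~ . f~(x) for every variant."""
    def __init__(self, theta_ext):
        self.theta_ext = np.asarray(theta_ext)   # (-log Z, theta)

    def features_ext(self, x):                   # (1, f(x)); supplied by subclasses
        raise NotImplementedError

    def log_likelihood(self, x):
        return self.theta_ext @ self.features_ext(x)

class DiagonalGaussian(ExponentialModel):
    def features_ext(self, x):
        return np.concatenate(([1.0], x, -0.5 * x**2))

class FullCovarianceGaussian(ExponentialModel):
    def features_ext(self, x):
        iu = np.triu_indices(len(x))
        scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
        quad = -0.5 * scale * np.outer(x, x)[iu]
        return np.concatenate(([1.0], x, quad))
```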

2. Concavity of log det

Before commencing our discussion of clustering we take a detour to discuss one of the most important properties of exponential models, namely the convexity of log Z(θ) on its natural domain D_Z = {θ : Z(θ) < ∞}. This property makes optimization in a maximum likelihood context particularly simple. The same is true for the optimization in the M-step of the EM algorithm for mixtures of exponential models. To prove convexity it suffices to show that the Hessian is positive semidefinite. But this follows immediately from the observation that the Hessian of log Z(θ) equals the covariance of f(x) with respect to the exponential model density, and any covariance matrix is positive semidefinite. As a particular example, we may consider a full covariance model with µ = ψ = 0. For this model we have 2 log Z(θ) = −log det P + d log(2π). From this it follows that the log det function is concave on the domain of positive definite symmetric matrices. A direct proof of this fact usually requires quite a bit of finesse, [12]. We hope that this example, along with the clustering example in the next section, will convince the reader of the power and versatility of the exponential model. Though the concavity of log det is well known, the corresponding convexity results for other constrained Gaussian models such as EMLLT, SPAM and SCGMM are less trivial to arrive at without the exponential model viewpoint.
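The concavity claim is easy to probe numerically; the snippet below checks the midpoint inequality log det((A+B)/2) ≥ ½(log det A + log det B) on random symmetric positive definite matrices. This is only a sanity check of the statement above, not a substitute for the proof.

```python
import numpy as np

def random_spd(d, rng):
    # Random symmetric positive definite matrix.
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)

rng = np.random.default_rng(1)
for _ in range(1000):
    A, B = random_spd(5, rng), random_spd(5, rng)
    lhs = np.linalg.slogdet(0.5 * (A + B))[1]
    rhs = 0.5 * (np.linalg.slogdet(A)[1] + np.linalg.slogdet(B)[1])
    assert lhs >= rhs - 1e-10   # concavity of log det at the midpoint
```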

3. Gaussian Clustering

The objective of Gaussian clustering is to discover a smaller set of Gaussians, C, to represent a larger set of Gaussians, G. This is particularly useful for fast evaluation of gaussians in a speech recognition system. A system with 100,000 gaussians can, for example, be represented by 1000 gaussians that are rapidly evaluated and used as guidance to choose a subset of the full 100K gaussians for further inspection. A clustering consists of learning

• A clustering map c : G → C.

• Exponential model parameters θ_c, c ∈ C, for each of the cluster gaussians.

We shall measure the goodness of the clustering in terms of the average Kullback-Leibler divergence D(g‖c(g)), [13], between a gaussian in G and its associated cluster gaussian c(g) ∈ C:

D(G, C) = Σ_{g∈G} π_g D(g‖c(g)),

where π_g are some averaging weights. We choose π_g to be the GMM priors of the gaussians in G. If these are unavailable, uniform weights π_g = 1/|G| seem to work just as well in practice. To minimize D(G, C) we use a hill climbing technique commonly applied in the K-means algorithm, [14]. The algorithm proceeds as follows:

1. For each g ∈ G choose c(g) such that D(g‖c(g)) ≤ D(g‖c) for all c ∈ C.

2. For a fixed mapping c : G → C choose θ_c so as to minimize D(G, C).

3. Repeat the previous two steps until some convergence criterion is satisfied. A sketch of this alternating procedure is given below.
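A minimal sketch of the alternation, assuming helper routines kl_divergence(g, c) and reestimate(members) that implement the formulas of the following subsections; the function names and the representation of gaussians as opaque objects are our own.

```python
import numpy as np

def cluster_gaussians(gaussians, clusters, priors, kl_divergence, reestimate, iters=10):
    """K-means style alternation between assignment and reestimation."""
    for _ in range(iters):
        # Step 1: assign each gaussian to its closest cluster in KL divergence.
        assign = [int(np.argmin([kl_divergence(g, c) for c in clusters]))
                  for g in gaussians]
        # Step 2: reestimate each cluster from its members (prior-weighted).
        for j in range(len(clusters)):
            members = [(priors[i], gaussians[i]) for i, a in enumerate(assign) if a == j]
            if members:
                clusters[j] = reestimate(members)
    return assign, clusters
```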

3.1. Estimation of Gaussian Cluster Parameters

We remind the reader that for diagonal covariance models

N(x; µ, v) = (2π)^{-d/2} (Π_{i=1}^d v_i)^{-1/2} exp(−Σ_{i=1}^d (x_i − µ_i)²/(2 v_i)),

so the Kullback-Leibler divergence is given by

D(g‖c) = ∫_{R^d} N(x; µ_g, v_g) log[ N(x; µ_g, v_g) / N(x; µ_c, v_c) ] dx
       = ½ Σ_{i=1}^d [ log(v_ci/v_gi) − 1 + (v_gi + (µ_gi − µ_ci)²)/v_ci ].    (3)
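Equation (3) translates directly into a few lines of vectorized code; the transcription below assumes means and variances are stored as numpy arrays.

```python
import numpy as np

def kl_diag(mu_g, v_g, mu_c, v_c):
    # Equation (3): KL divergence between diagonal-covariance Gaussians g and c.
    return 0.5 * np.sum(np.log(v_c / v_g) - 1.0 + (v_g + (mu_g - mu_c)**2) / v_c)
```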

If we fix the mapping c : G → C, then the gaussian model parameters that minimize D(G, C) are given by

µ̂_c = Σ_{g:c(g)=c} π_g µ_g / Σ_{g:c(g)=c} π_g   and
v̂_c = Σ_{g:c(g)=c} π_g (v_g + µ_g²) / Σ_{g:c(g)=c} π_g − µ̂_c²,

where µ_g² is to be understood as elementwise squaring. In the exponential model representation the parameters are θ̂_c = (ψ̂_c^T, p̂_c^T)^T, where p̂_c = 1/v̂_c and ψ̂_c = µ̂_c/v̂_c, and the division operator is to be taken elementwise. We see that reestimation of θ̂_c is no more computationally costly than reestimation of µ̂_c and v̂_c. Note that since, for a fixed cluster map, θ̂_c minimizes D(G, C), the value of D(G, C) cannot increase under the map θ_c → θ̂_c.

3.2. Computing the Cluster Map

For a fixed value of θ_c the cluster map that minimizes D(G, C) satisfies c(g) = argmin_{c∈C} D(g‖c). Ignoring terms in (3) that are specific to the gaussian g, and precomputing terms that depend only on the cluster c, it appears that at least d subtractions, multiplications and divisions and 2d additions are needed, which is of the order of 5d operations. We shall now see that the exponential model formulation simplifies the computation. We have

D(g‖c) = ∫_{R^d} (θ̃_g − θ̃_c)^T f̃_d(x) N(x; µ_g, v_g) dx = (θ̃_g − θ̃_c)^T ⟨f̃_d⟩_g,

where

⟨f̃_d⟩_g = (1, µ_g^T, −½ (v_g + µ_g²)^T)^T.

Moreover, since θ̃_g^T ⟨f̃_d⟩_g does not depend on the cluster gaussian c, it suffices to compare −θ̃_c^T ⟨f̃_d⟩_g to determine the value of c that minimizes D(g‖c). Note that the computation of this inner product only requires 2d + 1 multiply and add operations; in essence d subtractions are saved relative to the representation above. In addition, since the operation is a straightforward inner product, fast library routines for vector dot products can be taken advantage of. This representation could of course be discovered directly from (3), but the exponential model formulation gives it to us for free!
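The sketch below puts sections 3.1 and 3.2 together for the diagonal case: cluster assignment reduces to one matrix product of extended cluster parameters against expected extended features, and reestimation is a prior-weighted moment match. Array layouts (one gaussian per row), function names, and the lack of handling for empty clusters are our own simplifications.

```python
import numpy as np

def extended_params_diag(mu, v):
    # theta~_c = (-log Z, psi, p) with p = 1/v and psi = mu/v.
    p, psi = 1.0 / v, mu / v
    log_z = 0.5 * np.sum(np.log(2.0 * np.pi * v) + mu**2 / v)
    return np.concatenate(([-log_z], psi, p))

def expected_features_diag(mu, v):
    # <f~_d>_g = (1, mu_g, -1/2 (v_g + mu_g^2)).
    return np.concatenate(([1.0], mu, -0.5 * (v + mu**2)))

def assign(mu_g, v_g, mu_c, v_c):
    # One matrix product scores every (gaussian, cluster) pair:
    # minimizing D(g||c) over c equals maximizing theta~_c . <f~_d>_g.
    F = np.stack([expected_features_diag(m, v) for m, v in zip(mu_g, v_g)])  # |G| x (2d+1)
    T = np.stack([extended_params_diag(m, v) for m, v in zip(mu_c, v_c)])    # |C| x (2d+1)
    return np.argmax(F @ T.T, axis=1)

def reestimate(assign_idx, priors, mu_g, v_g, n_clusters):
    # Moment matching: average the expected features with weights pi_g.
    mu_c, v_c = [], []
    for j in range(n_clusters):
        idx = np.where(assign_idx == j)[0]
        w = priors[idx] / priors[idx].sum()
        m = w @ mu_g[idx]
        s = w @ (v_g[idx] + mu_g[idx]**2)   # second moments
        mu_c.append(m)
        v_c.append(s - m**2)
    return np.array(mu_c), np.array(v_c)
```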

3.3. Other Exponential Models

We finish the discussion of clustering by adding some notes on implementation. There are two parts of the clustering computation that are specific to the type of exponential model under consideration. First, we need to implement code that computes ⟨f̃⟩_g. The expected features for a particular gaussian take the place of the statistics f̃(x_t) needed in the EM algorithm. The second model specific computation is the update of the model parameters θ_c. The function we need to minimize for a particular cluster gaussian c may be written

−θ̃_c^T Σ_{g:c(g)=c} π_g ⟨f̃⟩_g.

The optimization of this function is already at hand if we have implemented the EM algorithm for the specific model. In the EM algorithm the target function that is maximized is

θ̃_c^T Σ_t γ_c(x_t) f̃(x_t),

where the counts γ_c(x_t) are the a posteriori probabilities of gaussian c at time t. We see that if the statistics Σ_t γ_c(x_t) f̃(x_t) are replaced with Σ_{g:c(g)=c} π_g ⟨f̃⟩_g, then the two functions are identical.

As an example consider the full covariance model. We have

⟨f̃_f⟩_g = (1, µ_g^T, −½ vec(Σ_g + µ_g µ_g^T)^T)^T.

The updates are given by

µ_c = Σ_{g:c(g)=c} π_g µ_g / Σ_{g:c(g)=c} π_g   and
Σ_c = Σ_{g:c(g)=c} π_g (Σ_g + µ_g µ_g^T) / Σ_{g:c(g)=c} π_g − µ_c µ_c^T,

and P_c = Σ_c^{-1} and ψ_c = P_c µ_c.

For a second example we can consider the EMLLT model. The expected values of the quadratic features are

E_g[(a_k^T x)²] = a_k^T ( ( Σ_{i=d+1}^{d+D} θ_gi a_{i−d} a_{i−d}^T )^{-1} + µ_g µ_g^T ) a_k,   k = 1, . . . , D,

where {θ_gi}_{i=d+1}^{d+D} are the quadratic model parameters for gaussian g and ψ_g = P_g µ_g are the linear model parameters. The optimization of the final function is discussed at length in [5] and is readily available to us if we have already implemented the EM algorithm for the EMLLT model. The same procedure can easily be applied to the SPAM and SCGMM models as well. The authors arrived at the exponential model view slowly and only after working on the EMLLT and SPAM models.
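As an illustration of how small the model-specific part is, the following sketch computes the expected extended features of a full covariance gaussian and accumulates the prior-weighted statistic Σ_{g:c(g)=c} π_g ⟨f̃_f⟩_g that existing EM M-step code could then consume. The helper names and the √2 vec convention follow the earlier sketches and are our own.

```python
import numpy as np

def sym_vec(S):
    # Upper-triangular entries, off-diagonals scaled by sqrt(2) (same convention as above).
    iu = np.triu_indices(S.shape[0])
    return np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0)) * S[iu]

def expected_features_full(mu, Sigma):
    # <f~_f>_g = (1, mu_g, -1/2 vec(Sigma_g + mu_g mu_g^T)).
    return np.concatenate(([1.0], mu, -0.5 * sym_vec(Sigma + np.outer(mu, mu))))

def cluster_statistic(members):
    # members: list of (pi_g, mu_g, Sigma_g) assigned to one cluster.
    # Returns the statistic that stands in for sum_t gamma_c(x_t) f~(x_t);
    # its first component is the total weight, playing the role of the count.
    return sum(pi * expected_features_full(mu, Sigma) for pi, mu, Sigma in members)
```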

4. Conclusions

The exponential model can be used to represent many Gaussian models with varying degrees of parameter tying. Representing these models in exponential model format allows structure to be shared across them, making for efficient source code organisation when implementing maximum likelihood optimization, clustering and fast likelihood evaluation. We have shown that the exponential model is powerful and versatile, and it is our hope that the speech recognition community as a whole will recognize this.

5. Acknowledgements The authors would like to thank Dr. Ramesh Gopinath and Dr. Satya Dharanipragada for many insightful comments and discussions.

6. References

[1] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, pp. 39–71, 1996.

[2] Lawrence D. Brown, Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, vol. 9 of Lecture Notes Monograph Series, Institute of Mathematical Statistics, Hayward, California, 1986.

[3] M. J. F. Gales, "Semi-tied covariance matrices for Hidden Markov Models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272–281, 1999.

[4] R. A. Gopinath, "Maximum likelihood modeling with gaussian distributions for classification," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, USA, 1998, vol. II, pp. 661–664.

[5] Peder A. Olsen and Ramesh A. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 1, pp. 37–46, January 2004.

[6] P. Olsen and R. A. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, Florida, May 2002, vol. I, pp. 945–948.

[7] Scott Axelrod, Vaibhava Goel, Ramesh A. Gopinath, Peder A. Olsen, and Karthik Visweswariah, "Constrained gaussian mixture models for speech recognition," IEEE Transactions on Speech and Audio Processing, 2003, submitted.

[8] Scott Axelrod, Ramesh Gopinath, and Peder Olsen, "Modeling with a subspace constraint on inverse covariance matrices," in Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, September 2002, vol. 3, pp. 2177–2180.

[9] Karthik Visweswariah, Scott Axelrod, and Ramesh Gopinath, "Acoustic modeling with mixtures of subspace constrained exponential model," in Proceedings of Eurospeech 2003, Geneva, Switzerland, September 2003, vol. 3, pp. 2613–2616.

[10] Peder A. Olsen, Scott Axelrod, Karthik Visweswariah, and Ramesh Gopinath, "Gaussian mixture modeling with volume preserving nonlinear feature space transforms," in Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, US Virgin Islands, December 2003, vol. 1, pp. 285–290.

[11] M. K. Omar and M. Hasegawa-Johnson, "Nonlinear maximum likelihood feature transformation for speech recognition," in Proc. Eurospeech, Geneva, Switzerland, September 2003, vol. 4, pp. 2497–2500.

[12] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley & Sons, West Sussex, England, 1999.

[13] S. Kullback, Information Theory and Statistics, Dover Publications Inc., Mineola, New York, 1968.

[14] Richard O. Duda and Peter E. Hart, Pattern Classification, John Wiley & Sons, Inc., second edition, 2001.
