Ten Steps of EM Suffice for Mixtures of Two Gaussians

Constantinos Daskalakis
EECS and CSAIL, MIT
[email protected]

Christos Tzamos
EECS and CSAIL, MIT
[email protected]

Manolis Zampetakis
EECS and CSAIL, MIT
[email protected]

Abstract

We provide global convergence guarantees for the expectation-maximization (EM) algorithm applied to mixtures of two Gaussians with known covariance matrices. We show that EM converges geometrically to the correct mean vectors, and we provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at +∞ estimate the means to within 1% error.

1 Introduction

The Expectation-Maximization (EM) algorithm [DLR77, Wu83, RW84] is one of the most widely used heuristics for maximizing likelihood in statistical models with latent variables. Consider a probability distribution p_λ sampling (X, Z), where X is a vector of observable random variables, Z a vector of non-observable random variables, and λ ∈ Λ a vector of parameters. Given independent samples x_1, ..., x_n of the observed random variables, the goal of maximum likelihood estimation is to select λ ∈ Λ maximizing the log-likelihood of the samples, namely Σ_i log p_λ(x_i). Unfortunately, computing p_λ(x_i) involves summing p_λ(x_i, z_i) over all possible values of z_i, which commonly results in a log-likelihood function that is non-convex with respect to λ and therefore hard to optimize. In this context, the EM algorithm proposes the following heuristic:

• Start with an initial guess λ^(0) of the parameters.

• For all t ≥ 0, until convergence:

  – (E-Step) For each sample i, compute the posterior Q_i^(t)(z) := p_{λ^(t)}(Z = z | X = x_i).

  – (M-Step) Set λ^(t+1) := arg max_λ Σ_i Σ_z Q_i^(t)(z) log [ p_λ(x_i, z) / Q_i^(t)(z) ].

Intuitively, the E-step of the algorithm uses the current guess of the parameters, λ^(t), to form beliefs, Q_i^(t), about the state of the (non-observable) Z variables for each sample i. Then the M-step uses the new beliefs about the state of Z for each sample to maximize, with respect to λ, a lower bound on Σ_i log p_λ(x_i). Indeed, by the concavity of the log function, the objective function used in the M-step of the algorithm is a lower bound on the true log-likelihood for all values of λ, and it equals the true log-likelihood for λ = λ^(t). From these observations, it follows that the above alternating procedure improves the true log-likelihood until convergence. Despite its wide use and practical significance, little is known about whether and under what conditions EM converges to the true maximum likelihood estimator. A few works establish local convergence of the algorithm to stationary points of the log-likelihood function [Wu83, Tse04, CH08], and even fewer establish local convergence to the MLE [RW84, BWY14]. Besides local convergence to the MLE, it is also known that badly initialized EM may settle far from the MLE, both in parameter and in likelihood distance [Wu83].

The lack of theoretical understanding of the convergence properties of EM is intimately related to the non-convex nature of the optimization it performs. Our paper aims to illuminate why EM works well in practice and to develop techniques for understanding its behavior. We do so by analyzing one of the most basic and natural, yet still challenging, statistical models EM may be applied to, namely balanced mixtures of two multi-dimensional Gaussians with equal and known covariance matrices. In particular, the family of parameterized density functions we will be considering is

    p_{µ1,µ2}(x) = 0.5 · N(x; µ1, Σ) + 0.5 · N(x; µ2, Σ),

where Σ is a known covariance matrix, (µ1, µ2) are unknown parameters, and N(x; µ, Σ) denotes the Gaussian density with mean µ and covariance matrix Σ, i.e.

    N(x; µ, Σ) = (1 / √((2π)^d det Σ)) · exp( −0.5 (x − µ)^T Σ^{−1} (x − µ) ).
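To make the E- and M-steps above concrete for this model, the following sketch (ours, not the paper's code; it assumes numpy and scipy are available and all names are ours) runs sample-based EM for the balanced two-Gaussian mixture with a known, shared covariance matrix, updating only the two mean vectors.

```python
# Minimal sketch, assuming numpy/scipy: sample-based EM for
# 0.5 N(mu1, Sigma) + 0.5 N(mu2, Sigma) with Sigma known.
import numpy as np
from scipy.special import expit  # numerically stable logistic function

def em_two_gaussians(x, Sigma, mu1, mu2, iters=100):
    """x: (n, d) array of samples; mu1, mu2: initial mean guesses."""
    Sigma_inv = np.linalg.inv(Sigma)

    def log_kernel(mu):
        # log N(x; mu, Sigma) up to an additive constant that cancels below.
        diff = x - mu
        return -0.5 * np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)

    for _ in range(iters):
        # E-step: posterior probability that each sample came from component 1
        # (the mixing weights 0.5 cancel because the mixture is balanced).
        w = expit(log_kernel(mu1) - log_kernel(mu2))
        # M-step: means are the posterior-weighted averages of the samples.
        mu1 = (w[:, None] * x).sum(0) / w.sum()
        mu2 = ((1 - w)[:, None] * x).sum(0) / (1 - w).sum()
    return mu1, mu2
```

The rest of the paper studies the idealized "population" limit of exactly this update, in which the empirical averages are replaced by expectations.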

To elucidate the optimization nature of the algorithm and avoid analytical distractions arising in the finite sample regime, it has been standard practice in the literature of theoretical analyses of EM to consider the “population version” of the algorithm, where the EM iterations are performed assuming access to infinitely many samples from a distribution p_{µ1,µ2} as above. With infinitely many samples, we can identify the mean, (µ1 + µ2)/2, of p_{µ1,µ2}, and re-parametrize the density around the mean as follows:

    p_µ(x) = 0.5 · N(x; µ, Σ) + 0.5 · N(x; −µ, Σ).   (1)
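In the finite-sample world, this re-parametrization simply amounts to centering the data at its empirical mean before running EM on the symmetric model (1). A small sketch of ours (assuming numpy; the helper name is hypothetical):

```python
# Sketch: re-parametrize around the mixture mean by centering the samples.
# After centering, the data is (approximately) distributed as
# 0.5 N(mu, Sigma) + 0.5 N(-mu, Sigma) with mu = (mu1 - mu2) / 2.
import numpy as np

def center_samples(x):
    mixture_mean = x.mean(axis=0)   # estimates (mu1 + mu2) / 2
    return x - mixture_mean, mixture_mean
```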

We will study the convergence of EM when we perform iterations with respect to the parameter µ of p_µ(x) in (1). Starting with an initial guess λ^(0) for the unknown mean vector µ, the t-th iteration of EM amounts to the following update:

    λ^(t+1) = M(λ^(t), µ) := E_{x∼p_µ}[ (0.5 · N(x; λ^(t), Σ) / p_{λ^(t)}(x)) · x ] / E_{x∼p_µ}[ 0.5 · N(x; λ^(t), Σ) / p_{λ^(t)}(x) ],   (2)

where we have compacted both the E- and M-step of EM into one update. To illuminate the EM update formula: we take expectations with respect to x ∼ p_µ because we are studying the population version of EM, where we assume access to infinitely many samples from p_µ. For each sample x, the ratio 0.5 · N(x; λ^(t), Σ) / p_{λ^(t)}(x) is our belief, at step t, that x was sampled from the first Gaussian component of p_µ, namely the one for which our current estimate of its mean vector is λ^(t). (The complementary probability is our present belief that x was sampled from the other Gaussian component.) Given these beliefs for all vectors x, the update (2) is the result of the M-step of EM. Intuitively, our next guess λ^(t+1) for the mean vector of the first Gaussian component is a weighted combination over all samples x ∼ p_µ, where the weight of every x is our belief that it came from the first Gaussian component. Our main result is the following:

Informal Theorem. Whenever the initial guess λ^(0) is not equidistant from µ and −µ, EM converges geometrically to either µ or −µ, with a convergence rate that improves as t → ∞. We provide a simple, closed-form expression for the convergence rate as a function of λ^(t) and µ.

A formal statement is provided as Theorem 2 in Section 4. We start with the proof of the single-dimensional version, presented as Theorem 1 in Section 3.

As a simple illustration of our result, we show in Section 5 that, in one dimension, when our original guess is λ^(0) = +∞ and the signal-to-noise ratio is µ/σ = 1, 10 steps of the EM algorithm result in 1% error. Despite the simplicity of the case we consider, no global convergence results were known prior to our work. Balakrishnan, Wainwright and Yu [BWY14] studied the same setting proving only local convergence, i.e. convergence only when the initial guess is close to the true parameters. In this work, we study the problem under arbitrary starting points and completely characterize the fixed points of EM. We show that, outside a measure-zero subset of initializations, the EM algorithm converges in a few steps to the true parameters of the Gaussians, and we provide explicit bounds on the convergence rate. To achieve this, we follow an approach orthogonal to that of [BWY14]: instead of trying to directly compute the number of steps required to reach convergence for a specific instance of the problem, we study the sensitivity of the EM iteration as the instance varies. This enables us to relate the behavior of EM on all instances of the Gaussian mixture problem and gain a handle on the convergence rate of EM on all instances at once.
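The population update (2) can also be visualized numerically by replacing the exact expectations over p_µ with averages over a large sample. The sketch below is ours (not the paper's code), assumes numpy and scipy, and uses names of our own choosing.

```python
# Monte Carlo sketch of the population EM update (2): expectations over
# p_mu = 0.5 N(mu, Sigma) + 0.5 N(-mu, Sigma) are approximated by sample averages.
import numpy as np
from scipy.stats import multivariate_normal

def population_em_update(lam, mu, Sigma, n=200_000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    d = len(mu)
    # Draw x ~ p_mu by picking a component sign and adding Gaussian noise.
    signs = rng.choice([1.0, -1.0], size=n)
    x = signs[:, None] * mu + rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    # Belief that each x came from the component whose current mean estimate is lam.
    p_plus = multivariate_normal.pdf(x, mean=lam, cov=Sigma)
    p_minus = multivariate_normal.pdf(x, mean=-lam, cov=Sigma)
    w = 0.5 * p_plus / (0.5 * p_plus + 0.5 * p_minus)
    # Update (2): weighted average of the samples, normalized by the total weight.
    return (w[:, None] * x).sum(0) / w.sum()
```

Iterating this map from any λ^(0) that is not equidistant from µ and −µ illustrates the geometric convergence to ±µ described by the informal theorem above.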

1.1 Related Work on Learning Mixtures of Gaussians

We have already outlined the literature on the Expectation-Maximization algorithm: several results study its local convergence properties and, as discussed above, there are known cases where badly initialized EM fails to converge. There is also a large body of literature on learning mixtures of Gaussians. A long line of work initiated by Dasgupta [Das99, AK01, VW04, AM05, KSV05, DS07, CR08, BV08, CDV09] provides rigorous guarantees on recovering the parameters of Gaussians in a mixture under separability assumptions, while later work [KMV10, MV10, BS10] has established guarantees under minimal information-theoretic assumptions. More recent work [HP15] provides tight bounds on the number of samples necessary to recover the parameters of the Gaussians as well as improved algorithms, while another strand of the literature studies proper learning with improved running times and sample sizes [SOAJ14, DK14]. Finally, there has been work on methods exploiting general position assumptions or performing smoothed analysis [HK13, GHK15]. In practice, the most common algorithm for learning mixtures of Gaussians is the Expectation-Maximization algorithm, and practical experience suggests it performs well in a broad range of scenarios despite the lack of theoretical guarantees. In recent work, Balakrishnan, Wainwright and Yu [BWY14] studied the convergence of EM in the case of an equal-weight mixture of two Gaussians with the same and known covariance matrix, showing local convergence guarantees. In particular, they show that when EM is initialized close enough to the actual parameters, then it converges. In this work, we revisit the same setting considered by [BWY14] but establish global convergence guarantees. We show that, for any initialization of the parameters outside a measure-zero set, the EM algorithm converges geometrically to the true parameters. We also provide a simple and explicit formula for the rate of convergence. Concurrent and independent work by Xu, Hsu and Maleki [XHM16] has also provided global and geometric convergence guarantees for the same setting, as well as a slightly more general setting where the mean of the mixture is unknown, but they do not provide explicit convergence rates.

2 Preliminary Observations

In this section we illustrate some simple properties of the EM update (2) and simplify the formula. First, it is easy to see that plugging the values λ ∈ {−µ, 0, µ} into M(λ, µ) results in

    M(−µ, µ) = −µ ;    M(0, µ) = 0 ;    M(µ, µ) = µ.   (3)

In particular, for all µ, these values are certainly fixed points of the EM iteration. Next, we rewrite M(λ, µ) as follows:

    M(λ, µ) = [ (1/2) E_{x∼N(µ,Σ)}[ (0.5 N(x; λ, Σ) / p_λ(x)) x ] + (1/2) E_{x∼N(−µ,Σ)}[ (0.5 N(x; λ, Σ) / p_λ(x)) x ] ]
              / [ (1/2) E_{x∼N(µ,Σ)}[ 0.5 N(x; λ, Σ) / p_λ(x) ] + (1/2) E_{x∼N(−µ,Σ)}[ 0.5 N(x; λ, Σ) / p_λ(x) ] ].

It is easy to observe that, by symmetry, this simplifies to

    M(λ, µ) = E_{x∼N(µ,Σ)}[ ((0.5 N(x; λ, Σ) − 0.5 N(x; −λ, Σ)) / (0.5 N(x; λ, Σ) + 0.5 N(x; −λ, Σ))) x ]
              / E_{x∼N(µ,Σ)}[ (0.5 N(x; λ, Σ) + 0.5 N(x; −λ, Σ)) / (0.5 N(x; λ, Σ) + 0.5 N(x; −λ, Σ)) ]
            = E_{x∼N(µ,Σ)}[ ((N(x; λ, Σ) − N(x; −λ, Σ)) / (N(x; λ, Σ) + N(x; −λ, Σ))) x ].

Simplifying common terms in the density functions N(x; λ, Σ), we get that

    M(λ, µ) = E_{x∼N(µ,Σ)}[ ((exp(λ^T Σ^{−1} x) − exp(−λ^T Σ^{−1} x)) / (exp(λ^T Σ^{−1} x) + exp(−λ^T Σ^{−1} x))) x ].

We thus get the following expression for the EM iteration:

    M(λ, µ) = E_{x∼N(µ,Σ)}[ tanh(λ^T Σ^{−1} x) x ].   (4)
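As a quick sanity check of (3) and (4), the following sketch (ours; it assumes numpy) approximates the expectation in (4) by Monte Carlo and confirms that −µ, 0 and µ are (numerically) fixed points.

```python
# Sketch: Monte Carlo evaluation of M(lam, mu) = E_{x~N(mu,Sigma)}[tanh(lam^T Sigma^{-1} x) x],
# used here only to check the fixed points in (3).
import numpy as np

def M(lam, mu, Sigma, n=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu, Sigma, size=n)
    t = np.tanh(x @ np.linalg.solve(Sigma, lam))   # tanh(lam^T Sigma^{-1} x) per sample
    return (t[:, None] * x).mean(0)

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
mu = np.array([1.0, -2.0])
print(M(mu, mu, Sigma))            # approximately  mu
print(M(-mu, mu, Sigma))           # approximately -mu
print(M(np.zeros(2), mu, Sigma))   # exactly 0, since tanh(0) = 0
```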

3 Single-dimensional Convergence

In the single-dimensional case, the EM algorithm takes the following form according to (4):

    λ^(t+1) = M(λ^(t), µ) = E_{x∼N(µ,σ²)}[ tanh(λ^(t) x / σ²) x ].   (5)

Observe that the function M(λ, µ) is increasing with respect to λ. Indeed, the partial derivative of M with respect to λ is

    ∂M(λ, µ)/∂λ = E_{x∼N(µ,σ²)}[ tanh′(λ x / σ²) x²/σ² ],

which is strictly greater than zero since tanh′ is strictly positive. We will show next that the fixed points we identified in (3) are the only fixed points of M(·, µ). When initialized with λ^(0) > 0 (resp. λ^(0) < 0), the EM algorithm converges to µ > 0 (resp. to −µ < 0). The point λ = 0 is an unstable fixed point.

Theorem 1. In the single-dimensional case, when λ^(0), µ > 0, the parameters λ^(t) satisfy

    |λ^(t+1) − µ| ≤ κ^(t) |λ^(t) − µ|,   where κ^(t) = exp( − min(λ^(t), µ)² / (2σ²) ).

Moreover, κ^(t) is a decreasing function of t.

Proof. For simplicity we will write λ for λ^(t), λ′ for λ^(t+1), and we will assume that X ∼ N(0, σ²). By a simple change of variables we can see that

    M(λ, µ) = E[ tanh( λ(X + µ)/σ² ) (X + µ) ].

The main idea is to use the Mean Value Theorem with respect to the second coordinate of the function M on the interval [λ, µ]:

    (M(λ, µ) − M(λ, λ)) / (µ − λ) = ∂M(λ, y)/∂y |_{y=ξ}   for some ξ ∈ (λ, µ).

But we know that M(λ, λ) = λ and M(λ, µ) = λ′, and therefore we get

    λ′ − λ ≥ ( min_{ξ∈[λ,µ]} ∂M(λ, y)/∂y |_{y=ξ} ) (µ − λ),

which is equivalent to

    |λ′ − µ| ≤ ( 1 − min_{ξ∈[λ,µ]} ∂M(λ, y)/∂y |_{y=ξ} ) |λ − µ|,

where we have used the fact that λ′ < µ, which comes from the fact that M(λ, µ) is increasing with respect to λ and that M(µ, µ) = µ. The only thing that remains to complete our proof is to prove a lower bound on the partial derivative of M with respect to its second argument:

    ∂M(λ, y)/∂y |_{y=ξ} = E[ (λ/σ²) tanh′( λ(X + ξ)/σ² ) (X + ξ) + tanh( λ(X + ξ)/σ² ) ].

The first term is non-negative by Lemma 1. The second term is at least 1 − exp( − min(ξ, λ)·ξ / (2σ²) ) by Lemma 2. Since ξ lies between λ and µ, we have min(ξ, λ)·ξ ≥ min(λ, µ)², and the theorem follows.

Lemma 1. Let α, β > 0 and X ∼ N(α, σ²). Then E[ tanh′(βX/σ²) X ] ≥ 0.

Proof. We have

    E[ tanh′(βX/σ²) X ] = (1/(√(2π) σ)) ∫_{−∞}^{∞} tanh′(βy/σ²) y exp( −(y − α)²/(2σ²) ) dy.

But now we can see that, since tanh′ is an even function and since for any y > 0 we have exp( −(y − α)²/(2σ²) ) ≥ exp( −(−y − α)²/(2σ²) ),

    −(1/(√(2π) σ)) ∫_{−∞}^{0} tanh′(βy/σ²) y exp( −(y − α)²/(2σ²) ) dy ≤ (1/(√(2π) σ)) ∫_{0}^{∞} tanh′(βy/σ²) y exp( −(y − α)²/(2σ²) ) dy,

which means that E[ tanh′(βX/σ²) X ] ≥ 0.

Lemma 2. Let α, β > 0 and X ∼ N(α, σ²). Then E[ tanh(βX/σ²) ] ≥ 1 − exp( − min(α, β)·α / (2σ²) ).

Proof. Note that E[ tanh(βX/σ²) ] is increasing as a function of β, as its derivative with respect to β is positive by Lemma 1. It thus suffices to show that E[ tanh(βX/σ²) ] ≥ 1 − exp( −αβ/(2σ²) ) when β ≤ α. We have that

    E[ 1 − tanh(βX/σ²) ] = E[ 2 / (exp(2βX/σ²) + 1) ] ≤ E[ 1 / exp(βX/σ²) ]
        = (1/(√(2π) σ)) ∫_{−∞}^{∞} exp( −(x − α)²/(2σ²) ) / exp(βx/σ²) dx
        = exp( ((α − β)² − α²)/(2σ²) ) · (1/(√(2π) σ)) ∫_{−∞}^{∞} exp( −(x − α + β)²/(2σ²) ) dx
        = exp( ((α − β)² − α²)/(2σ²) ) ≤ exp( −αβ/(2σ²) ),

which completes the proof.
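The contraction in Theorem 1 is easy to observe numerically. The sketch below (ours; it assumes numpy and sets σ = 1) evaluates the population update (5) by numerical integration and prints the observed one-step contraction next to the bound κ^(t).

```python
# Sketch: compare the observed contraction |lam' - mu| / |lam - mu| of the
# one-dimensional population EM update (5) with the rate kappa from Theorem 1.
import numpy as np

def M1(lam, mu, sigma=1.0):
    # E_{x ~ N(mu, sigma^2)}[ tanh(lam * x / sigma^2) * x ] via a Riemann sum.
    x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return np.sum(np.tanh(lam * x / sigma**2) * x * dens) * (x[1] - x[0])

mu, lam = 1.0, 0.05                                # a poor (but positive) initial guess
for t in range(12):
    kappa = np.exp(-min(lam, mu) ** 2 / 2.0)       # Theorem 1 with sigma = 1
    new_lam = M1(lam, mu)
    print(t, abs(new_lam - mu) / abs(lam - mu), "<=", kappa)
    lam = new_lam
```

The printed ratios stay below the corresponding κ^(t), which itself decreases with t, as the theorem predicts.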

4 Multi-dimensional Convergence

In the multi-dimensional case, the EM algorithm takes the form of (4). In this case, we will quantify our approximation guarantees using the Mahalanobis distance ‖·‖_Σ between vectors with respect to the matrix Σ, defined as follows:

    ‖x − y‖_Σ = √( (x − y)^T Σ^{−1} (x − y) ).

We will show that the fixed points identified in (3) are the only fixed points of M(·, µ). When initialized with λ^(0) such that ‖λ^(0) − µ‖_Σ < ‖λ^(0) + µ‖_Σ (resp. ‖λ^(0) − µ‖_Σ > ‖λ^(0) + µ‖_Σ), the EM algorithm converges to µ (resp. to −µ). The algorithm converges to λ = 0 when initialized with ‖λ^(0) − µ‖_Σ = ‖λ^(0) + µ‖_Σ. In particular:



Theorem 2. Whenever ‖λ^(0) − µ‖_Σ < ‖λ^(0) + µ‖_Σ, i.e. the initial guess is closer to µ than to −µ, the estimates λ^(t) of the EM algorithm satisfy

    ‖λ^(t+1) − µ‖_Σ ≤ κ^(t) ‖λ^(t) − µ‖_Σ,   where κ^(t) = exp( − min( (λ^(t))^T Σ^{−1} λ^(t), µ^T Σ^{−1} λ^(t) )² / ( 2 (λ^(t))^T Σ^{−1} λ^(t) ) ).

Moreover, κ^(t) is a decreasing function of t. The symmetric statements hold when ‖λ^(0) − µ‖_Σ > ‖λ^(0) + µ‖_Σ. When the initial guess is equidistant from µ and −µ, then λ^(t) = 0 for all t > 0.

Proof. For simplicity we will write λ for λ^(t) and λ′ for λ^(t+1). By applying the change of variables λ ← Σ^{−1/2} λ and µ ← Σ^{−1/2} µ we may assume that Σ = I, where I is the identity matrix. Therefore the iteration of EM becomes

    M(λ, µ) = E_{x∼N(µ,I)}[ tanh(⟨λ, x⟩) x ] = E_{x∼N(0,I)}[ tanh(⟨λ, x⟩ + ⟨λ, µ⟩)(x + µ) ].

Let λ̂ be the unit vector in the direction of λ, let λ̂⊥ be the unit vector that belongs to the plane of µ and λ and is perpendicular to λ̂, and let {v_1 = λ̂, v_2 = λ̂⊥, v_3, ..., v_d} be a basis of R^d. We have:

    ⟨v_i, λ′⟩ = E_{x∼N(0,I)}[ tanh(⟨λ, x⟩ + ⟨λ, µ⟩)(⟨v_i, x⟩ + ⟨v_i, µ⟩) ].   (6)

Since the normal distribution is rotation invariant, we can equivalently write

    ⟨v_i, λ′⟩ = E_{α_1,...,α_d∼N(0,1)}[ tanh(⟨λ, Σ_j α_j v_j⟩ + ⟨λ, µ⟩)(⟨v_i, Σ_j α_j v_j⟩ + ⟨v_i, µ⟩) ],

which simplifies to

    ⟨v_i, λ′⟩ = E_{α_1,...,α_d∼N(0,1)}[ tanh(α_1 ‖λ‖ + ⟨λ, µ⟩)(α_i + ⟨v_i, µ⟩) ]
              = E_{α_1∼N(0,1)}[ tanh(α_1 ‖λ‖ + ⟨λ, µ⟩) · ( E_{α_2,...,α_d∼N(0,1)}[α_i] + ⟨v_i, µ⟩ ) ].   (7)

We now consider the different cases for i to further simplify Equation (7).

– When i = 1, we have that ⟨λ̂, λ′⟩ = E_{y∼N(0,1)}[ tanh( ‖λ‖ (y + ⟨λ̂, µ⟩) ) (y + ⟨λ̂, µ⟩) ]. This is equivalent to an iteration of EM in one dimension, and thus from Theorem 1 we get that

    |⟨λ̂, µ⟩ − ⟨λ̂, λ′⟩| ≤ κ |⟨λ̂, µ⟩ − ⟨λ̂, λ⟩|,   where κ = exp( − min(⟨λ̂, λ⟩, ⟨λ̂, µ⟩)² / 2 ) = exp( − min(⟨λ, λ⟩, ⟨λ, µ⟩)² / (2⟨λ, λ⟩) ).   (8)

– When i = 2, E_{α_2,...,α_d∼N(0,1)}[α_i] + ⟨v_i, µ⟩ = ⟨λ̂⊥, µ⟩, and thus

    ⟨λ̂⊥, λ′⟩ = ⟨λ̂⊥, µ⟩ · E_{y∼N(0,1)}[ tanh( ‖λ‖ (y + ⟨λ̂, µ⟩) ) ].

With κ as defined above and using Lemma 2, we get that

    ⟨λ̂⊥, µ⟩ ≥ ⟨λ̂⊥, λ′⟩ ≥ (1 − κ) ⟨λ̂⊥, µ⟩.   (9)

– When i ≥ 3, E_{α_2,...,α_d∼N(0,1)}[α_i] + ⟨v_i, µ⟩ = 0, and thus ⟨v_i, λ′⟩ = 0.

We can now bound the distance of λ′ from µ:

    ‖λ′ − µ‖ = √( Σ_i ⟨v_i, λ′ − µ⟩² ) = √( ⟨λ̂, λ′ − µ⟩² + ⟨λ̂⊥, λ′ − µ⟩² )
             ≤ √( κ² ⟨λ̂, λ − µ⟩² + κ² ⟨λ̂⊥, λ − µ⟩² ) = κ ‖λ − µ‖,

where the inequality follows from (8) and (9).

We now have to prove that this convergence rate κ decreases as the iterations increase. This is implied by the following lemmas, which show that min(⟨λ̂, λ⟩, ⟨λ̂, µ⟩) ≤ min(⟨λ̂′, λ′⟩, ⟨λ̂′, µ⟩).

Lemma 3. If ‖λ‖ ≥ ⟨λ̂, µ⟩, then ⟨λ̂, µ⟩ ≤ ‖λ′‖ and ⟨λ̂, µ⟩ ≤ ⟨λ̂′, µ⟩.

Proof. The analysis above implies that λ′ can be written in the form λ′ = α·λ̂ + β·λ̂⊥, where ⟨λ̂, µ⟩ ≤ α ≤ ‖λ‖ and 0 ≤ β ≤ ⟨λ̂⊥, µ⟩. It is easy to see that the first inequality holds, since ‖λ′‖ ≥ α ≥ ⟨λ̂, µ⟩. For the second, we write ⟨λ̂′, µ⟩ as:

    ⟨λ̂′, µ⟩ = ⟨λ′, µ⟩ / ‖λ′‖ = ( α⟨λ̂, µ⟩ + β⟨λ̂⊥, µ⟩ ) / √(α² + β²)
             = ⟨λ̂, µ⟩ · ( 1 + (β/α)·(⟨λ̂⊥, µ⟩/⟨λ̂, µ⟩) ) / √( 1 + (β/α)² )
             ≥ ⟨λ̂, µ⟩ · ( 1 + (β/α)² ) / √( 1 + (β/α)² )  ≥ ⟨λ̂, µ⟩,

where we used the fact that ⟨λ̂⊥, µ⟩/⟨λ̂, µ⟩ ≥ β/α, which follows from the bounds on α and β.

Lemma 4. If ‖λ‖ ≤ ⟨λ̂, µ⟩, then ‖λ‖ ≤ ‖λ′‖ ≤ ⟨λ̂′, µ⟩.

Proof. We have that λ′ = α·λ̂ + β·λ̂⊥, where ‖λ‖ ≤ α ≤ ⟨λ̂, µ⟩ and 0 ≤ β ≤ ⟨λ̂⊥, µ⟩. We also have ⟨λ′, µ⟩ = α⟨λ̂, µ⟩ + β⟨λ̂⊥, µ⟩ ≥ α² + β² = ‖λ′‖² ≥ α² ≥ ‖λ‖², so the lemma follows.

Finally, substituting back into the basis that we started with before changing coordinates to make the covariance matrix the identity, we get the result as stated in the theorem.
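Theorem 2's rate can be observed numerically. The sketch below (ours; it assumes numpy) treats the Σ = I case in two dimensions, which is the case the proof reduces to, evaluating the expectation in (4) on a dense grid and printing the distance to µ together with κ^(t).

```python
# Sketch: two-dimensional population EM with Sigma = I, compared against the
# rate kappa^(t) from Theorem 2 (with Sigma = I the Mahalanobis norm is Euclidean).
import numpy as np

def M2(lam, mu):
    # E_{x ~ N(mu, I)}[ tanh(<lam, x>) x ] by a Riemann sum on a grid around mu.
    g = np.linspace(-9.0, 9.0, 601)
    X, Y = np.meshgrid(g + mu[0], g + mu[1], indexing='ij')
    dens = np.exp(-0.5 * ((X - mu[0])**2 + (Y - mu[1])**2)) / (2 * np.pi)
    t = np.tanh(lam[0] * X + lam[1] * Y)
    dA = (g[1] - g[0])**2
    return np.array([np.sum(t * X * dens), np.sum(t * Y * dens)]) * dA

mu = np.array([2.0, 2.0])
lam = np.array([0.5, -0.2])            # closer to mu than to -mu
for t in range(8):
    kappa = np.exp(-min(lam @ lam, mu @ lam)**2 / (2 * (lam @ lam)))
    new_lam = M2(lam, mu)
    print(t, np.linalg.norm(lam - mu), kappa)
    lam = new_lam
```

The printed distances decay geometrically while κ^(t) decreases, as the theorem predicts; starting closer to −µ instead drives the iterates to −µ.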

5 An Illustration of the Speed of Convergence

Using our results in the previous sections we can calculate explicit speeds of convergence of EM to its fixed points. In this section, we present some results of this flavor. For simplicity, we focus on the single-dimensional case, but our calculations easily extend to the multi-dimensional case.

Let us consider a mixture of two single-dimensional Gaussians whose signal-to-noise ratio η = µ/σ is equal to 1. There is nothing special about the value 1, except that it is a difficult case to consider, since the Gaussian components are not separated, as shown in Figure 1. When the SNR is larger, the numbers presented below still hold and in reality the convergence is even faster. When the SNR is even smaller than one, the numbers change, but gracefully, and they can be calculated in a similar fashion. We will also assume a completely agnostic initialization of EM, setting λ^(0) → +∞.¹

Figure 1: The density of (1/2)·N(x; 1, 1) + (1/2)·N(x; −1, 1).

To analyze the speed of convergence of EM to its fixed point µ, we first make the observation that in one step we already get to λ^(1) ≤ µ + σ. To see this, we can plug λ^(0) → ∞ into equation (5) to get:

    λ^(1) = E_{x∼N(µ,σ²)}[ sign(x)·x ] = E_{x∼N(µ,σ²)}[ |x| ],

which equals the mean of the folded normal distribution. A well-known bound for this mean is µ + σ. Therefore the distance from the true mean after one step is |λ^(1) − µ| ≤ σ. Now, using Theorem 1, we conclude that in all subsequent steps the distance to µ shrinks by a factor of at least e^{1/2}. This means that, if we want to estimate µ to within additive error 1%·σ, then we need to run EM for at most 2·ln 100 steps. That is, 10 iterations of the EM algorithm suffice to get to within error 1%, even when our initial guess of the mean is infinitely far from the true value!

In Figure 2 we illustrate the speed of convergence of EM in multiple dimensions, as implied by Theorem 2. The plot was generated for a Gaussian mixture with µ = (2, 2) and Σ = I, but the behavior illustrated in this figure is generic (up to a transformation of the space by Σ^{−1/2}). As implied by Theorem 2, the rate of convergence depends on the distance of λ^(t) from the origin 0 and on the inner product ⟨λ^(t), µ⟩. The figure shows the directions of the EM updates for every point, and the factor by which the distance to the fixed point decays, with deeper colors corresponding to faster decay. There are three fixed points. Any point that is equidistant from µ and −µ is updated to 0 in one step and stays there thereafter. Points that are closer to µ are pushed towards µ, while points that are closer to −µ are pushed towards −µ.

Figure 2: Illustration of the speed of convergence of EM in multiple dimensions, as implied by Theorem 2.

¹ In the multi-dimensional setting, this would correspond to a very large magnitude λ^(0) chosen in a random direction.
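The back-of-the-envelope calculation above is easy to reproduce. The sketch below (ours; it assumes numpy, sets µ = σ = 1, and uses a very large finite value standing in for λ^(0) = +∞) evaluates the population update (5) by numerical integration and prints the error after each of the first ten steps.

```python
# Sketch: ten population-EM steps in one dimension with SNR mu/sigma = 1,
# starting from a huge initial guess that stands in for lambda^(0) = +infinity.
import numpy as np

def M1(lam, mu=1.0, sigma=1.0):
    # E_{x ~ N(mu, sigma^2)}[ tanh(lam * x / sigma^2) * x ] via a Riemann sum.
    x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return np.sum(np.tanh(lam * x / sigma**2) * x * dens) * (x[1] - x[0])

lam = 1e12                                 # "lambda^(0) -> +infinity"
for step in range(1, 11):
    lam = M1(lam)
    print(step, abs(lam - 1.0))            # error drops below 0.01 within ten steps
```

In this run the error after the first step is about 0.17σ (the folded-normal mean minus µ), and it falls below 1% of σ well within ten iterations.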

Acknowledgements

We thank Sham Kakade for suggesting the problem to us, and for initial discussions. The authors were supported by NSF Awards CCF-0953960 (CAREER), CCF-1551875, CCF-1617730, and CCF-1650733, ONR Grant N00014-12-1-0999, and a Microsoft Faculty Fellowship.

References

[AK01] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing (STOC), pages 247–257. ACM, 2001.

[AM05] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458–469. Springer, 2005.

[BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 103–112. IEEE, 2010.

[BV08] S. Charles Brubaker and Santosh S. Vempala. Isotropic PCA and affine-invariant clustering. In 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2008.

[BWY14] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.

[CDV09] Kamalika Chaudhuri, Sanjoy Dasgupta, and Andrea Vattani. Learning mixtures of Gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086, 2009.

[CH08] Stéphane Chrétien and Alfred O. Hero. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics, 12:308–326, 2008.

[CR08] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In 21st International Conference on Computational Learning Theory (COLT), 2008.

[Das99] Sanjoy Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 634–644. IEEE, 1999.

[DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Proceedings of the 27th Conference on Learning Theory (COLT), pages 1183–1213, 2014.

[DLR77] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[DS07] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203–226, 2007.

[GHK15] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of Gaussians in high dimensions. In 47th Annual ACM Symposium on Theory of Computing (STOC), 2015.

[HK13] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In 4th Conference on Innovations in Theoretical Computer Science (ITCS), 2013.

[HP15] Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two Gaussians. In 47th Annual ACM Symposium on Theory of Computing (STOC), pages 753–760. ACM, 2015.

[KMV10] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two Gaussians. In 42nd Annual ACM Symposium on Theory of Computing (STOC), pages 553–562. ACM, 2010.

[KSV05] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. In 18th International Conference on Computational Learning Theory (COLT), 2005.

[MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 93–102. IEEE, 2010.

[RW84] Richard A. Redner and Homer F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.

[SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems, pages 1395–1403, 2014.

[Tse04] Paul Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research, 29(1):27–44, 2004.

[VW04] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.

[Wu83] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983.

[XHM16] Ji Xu, Daniel Hsu, and Arian Maleki. Global analysis of Expectation Maximization for mixtures of two Gaussians. In 30th Annual Conference on Neural Information Processing Systems (NIPS), 2016.
