U-Q2: Compare and contrast maximum likelihood and MAP estimation in the presence of correlated noise.

K. Krishanth
Supervisor: Dr. T. Kirubarajan

July 28, 2014

Maximum Likelihood Estimate (MLE)

The maximum likelihood estimate of the parameter θ is the value of θ that maximizes the likelihood L(θ):

\hat{\theta}_{ML}(y) = \arg\max_\theta \, p(y|\theta)    (1)

Consider a sample of N observations x_1, ..., x_N on X distributed according to N(µ, Σ); here the distribution is p-variate. Goal: estimate the mean vector µ and the covariance matrix Σ. The likelihood function is

L = \frac{1}{(2\pi)^{pN/2} |\Sigma|^{N/2}} \exp\left[ -\frac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu) \right]    (2)


Maximum Likelihood Estimate (MLE)

Since log is an increasing function, its maximum is at the same point in the space of µ and Σ as the maximum of L:

\log L = -\frac{1}{2} pN \log(2\pi) - \frac{1}{2} N \log|\Sigma| - \frac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu)    (3)

It can be shown that

\hat{\mu} = \frac{1}{N} \sum_\alpha x_\alpha    (4)

\hat{\Sigma} = \frac{1}{N} \sum_\alpha (x_\alpha - \bar{x})(x_\alpha - \bar{x})'    (5)

= \frac{1}{N} \left( \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N \bar{x}\bar{x}' \right)    (6)
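As a quick numerical sketch (the data and dimensions below are illustrative, not from the slides), the estimators (4)-(6) translate directly into numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 3                                   # sample size and dimension (illustrative)
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=N)  # rows are the x_alpha

mu_hat = X.mean(axis=0)                         # (4): sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N           # (5): 1/N * sum of centered outer products
# (6): the equivalent uncentered form
Sigma_hat_alt = (X.T @ X - N * np.outer(mu_hat, mu_hat)) / N
assert np.allclose(Sigma_hat, Sigma_hat_alt)
```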


MLE (correlated noise)

When the noise is correlated, i.e., σ_ij = σ_i σ_j ρ_ij (with ρ_ii = 1),

\hat{\mu} = \frac{1}{N} \sum_\alpha x_\alpha    (7)

\hat{\sigma}_i^2 = \frac{1}{N} \sum_\alpha (x_{i\alpha} - \bar{x}_i)^2 = \frac{1}{N} \left( \sum_\alpha x_{i\alpha}^2 - N \bar{x}_i^2 \right)    (8)

where x_{iα} is the i-th component of x_α and x̄_i is the i-th component of x̄. The maximum likelihood estimate of ρ_ij is

\hat{\rho}_{ij} = \frac{\sum_\alpha (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_\alpha (x_{i\alpha} - \bar{x}_i)^2} \, \sqrt{\sum_\alpha (x_{j\alpha} - \bar{x}_j)^2}}    (9)

= \frac{\sum_\alpha x_{i\alpha} x_{j\alpha} - N \bar{x}_i \bar{x}_j}{\sqrt{\sum_\alpha x_{i\alpha}^2 - N \bar{x}_i^2} \, \sqrt{\sum_\alpha x_{j\alpha}^2 - N \bar{x}_j^2}}    (10)
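Numerically, (8)-(10) are just the per-component sample variances and the sample correlation matrix; a minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 500, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=N)   # rows are the x_alpha

x_bar = X.mean(axis=0)
var_hat = ((X - x_bar) ** 2).mean(axis=0)                 # (8): per-component variances
S = (X - x_bar).T @ (X - x_bar)                           # centered cross-product sums
rho_hat = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))   # (9)
# Sanity check against numpy's built-in sample correlation:
assert np.allclose(rho_hat, np.corrcoef(X, rowvar=False))
```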


Problem statement

Y = Ax + ε    (11)

Y is a p × n matrix (p sensors, n time-steps)
A is a p × q matrix (known)
x is a q × n matrix
ε is a p × n matrix with distribution ε_i ∼ N(0, Σ), where Σ is unknown

Parameters to be estimated:
Estimate of x
Estimate of Σ, which is uncorrelated/correlated
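To make the dimensions concrete, here is a hypothetical way to simulate one draw of model (11); all sizes and the covariance construction are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n = 4, 2, 200                       # sensors, parameters, time-steps (illustrative)
A = rng.standard_normal((p, q))           # known mixing matrix
x_true = rng.standard_normal((q, n))      # signal to be estimated

B = rng.standard_normal((p, p))
Sigma_true = B @ B.T / p + 0.1 * np.eye(p)   # an unknown, correlated noise covariance
eps = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n).T   # columns iid N(0, Sigma)
Y = A @ x_true + eps                      # the p x n observation matrix of (11)
```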


Maximum Likelihood Estimate (MLE)

Now consider equation (11). The log-likelihood function can be written as

L(x, \Sigma) = -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\{\Sigma^{-1} (Y - Ax)(Y - Ax)'\}    (12)

Using the results proved above, we can find that

\hat{x} = (A'A)^{-1} A'Y    (13)

\hat{\Sigma} = \frac{1}{n} (Y - A\hat{x})(Y - A\hat{x})'    (14)

For the correlated condition,

\hat{x} = (A'A)^{-1} A'Y    (15)

\hat{\sigma}_k^2 = \frac{1}{n} (Y - A\hat{x})_k (Y - A\hat{x})_k'    where k denotes the k-th row    (16)

\hat{\rho}_{kl} = \frac{(Y - A\hat{x})_k (Y - A\hat{x})_l'}{\sqrt{\sum ((Y - A\hat{x})_k)^2} \, \sqrt{\sum ((Y - A\hat{x})_l)^2}}    (17)
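A sketch of (13)-(17) on simulated data (same illustrative setup as above; lstsq is the numerically stable way to apply (A'A)^{-1}A'):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n = 4, 2, 200
A = rng.standard_normal((p, q))
B = rng.standard_normal((p, p))
Sigma_true = B @ B.T / p + 0.1 * np.eye(p)
Y = A @ rng.standard_normal((q, n)) + rng.multivariate_normal(np.zeros(p), Sigma_true, size=n).T

x_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)    # (13)/(15): equals (A'A)^{-1} A'Y
R = Y - A @ x_hat                                # p x n residual matrix
Sigma_hat = R @ R.T / n                          # (14)
sigma2_hat = (R * R).sum(axis=1) / n             # (16): k-th row variance
norms = np.sqrt((R * R).sum(axis=1))
rho_hat = (R @ R.T) / np.outer(norms, norms)     # (17): residual correlations
```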


Introduction: Maximum A Posteriori Estimate

In the MAP estimation approach, we choose to maximize the posterior PDF:

\hat{\theta}_{MAP}(y) = \arg\max_\theta \frac{p(y|\theta)\, p(\theta)}{\int p(y|\vartheta)\, p(\vartheta)\, d\vartheta}    (18)

= \arg\max_\theta \, p(y|\theta)\, p(\theta)    (19)

The denominator of the posterior does not depend on θ. Equivalently,

\hat{\theta}_{MAP}(y) = \arg\max_\theta \left[ \ln p(y|\theta) + \ln p(\theta) \right]    (20)
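A one-dimensional grid sketch (all numbers illustrative) makes the equivalence of (18) and (19) concrete: dividing by the evidence rescales the posterior but never moves its maximizer:

```python
import numpy as np

theta = np.linspace(-3.0, 3.0, 601)
y = 1.2                                           # a single observation (illustrative)
lik = np.exp(-0.5 * (y - theta) ** 2)             # p(y|theta): Gaussian likelihood
prior = np.exp(-0.5 * theta ** 2)                 # p(theta): standard normal prior
unnorm = lik * prior                              # numerator of (18), i.e. (19)
evidence = unnorm.sum() * (theta[1] - theta[0])   # crude quadrature of the denominator
posterior = unnorm / evidence
assert np.argmax(unnorm) == np.argmax(posterior)  # same maximizer either way
```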


MAP (Σ is known)

Let the prior distribution of x be given by N(µ_0, Σ_0). The log function to maximize can be written as

\log F = -\frac{1}{2} pN \log(2\pi) - \frac{1}{2} N \log|\Sigma| - \frac{1}{2} (Y - Ax)' \Sigma^{-1} (Y - Ax)
+ \left[ -\frac{1}{2} pN \log(2\pi) - \frac{1}{2} N \log|\Sigma_0| - \frac{1}{2} (x - \mu_0)' \Sigma_0^{-1} (x - \mu_0) \right]    (21)

Assume Σ is known; then we can find the maximum of the above function by differentiating with respect to x.
Need to estimate x.
For the sample considered on Slide 1, the estimate of x is a convex combination of the sample mean and the prior mean.
For the above model, \hat{x} = (A' \Sigma^{-1} A + \Sigma_0^{-1})^{-1} (A' \Sigma^{-1} Y + \Sigma_0^{-1} \mu_0).
If the prior is zero-mean, then it differs from the corresponding \hat{x}_{MLE} by the \Sigma_0^{-1} term.
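A sketch of this closed form; the prior mean, prior precision, and all sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, n = 4, 2, 200
A = rng.standard_normal((p, q))
Sigma = 0.5 * np.eye(p)                        # noise covariance, assumed known here
Y = (A @ rng.standard_normal((q, n))
     + rng.multivariate_normal(np.zeros(p), Sigma, size=n).T)

mu0 = np.zeros((q, n))                         # prior mean (illustrative)
Sigma0_inv = 2.0 * np.eye(q)                   # prior precision (illustrative)
Si = np.linalg.inv(Sigma)
x_map = np.linalg.solve(A.T @ Si @ A + Sigma0_inv,
                        A.T @ Si @ Y + Sigma0_inv @ mu0)
# With mu0 = 0 this is a ridge-like shrinkage of the MLE toward the prior mean.
```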

MAP (Σ is unknown)

Assume Σ is unknown as well. Differentiating equation (21) with respect to x and Σ:

\hat{x} = (A'A + \Sigma_0^{-1})^{-1} (A'Y + \Sigma_0^{-1} \mu_0)

If the prior is zero-mean, then it differs from the corresponding \hat{x}_{MLE} by the \Sigma_0^{-1} term.

\hat{\Sigma} = \frac{1}{n} (Y - A\hat{x})(Y - A\hat{x})'; this equation is the same as the corresponding MLE.


MAP (Σ is unknown)

Σ is unknown and correlated. Differentiating equation (21) with respect to x and Σ:

\hat{x} = (A'A + \Sigma_0^{-1})^{-1} (A'Y + \Sigma_0^{-1} \mu_0)    (22)

\hat{\sigma}_k^2 = \frac{1}{n} (Y - A\hat{x})_k (Y - A\hat{x})_k'    (23)

\hat{\rho}_{kl} = \frac{(Y - A\hat{x})_k (Y - A\hat{x})_l'}{\sqrt{\sum ((Y - A\hat{x})_k)^2} \, \sqrt{\sum ((Y - A\hat{x})_l)^2}}    (24)
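The same pattern in code (illustrative sizes; the prior quantities are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, n = 4, 2, 200
A = rng.standard_normal((p, q))
Y = A @ rng.standard_normal((q, n)) + rng.standard_normal((p, n))

mu0 = np.zeros((q, n))                    # prior mean (illustrative)
Sigma0_inv = np.eye(q)                    # prior precision (illustrative)
x_map = np.linalg.solve(A.T @ A + Sigma0_inv, A.T @ Y + Sigma0_inv @ mu0)   # (22)
R = Y - A @ x_map
sigma2_hat = (R * R).sum(axis=1) / n                                        # (23)
norms = np.sqrt((R * R).sum(axis=1))
rho_hat = (R @ R.T) / np.outer(norms, norms)                                # (24)
```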


MAP (prior is uniform)

If the prior is uniform,

p(x) = \begin{cases} \frac{1}{b-a}, & a \le x \le b \\ 0, & x < a \text{ or } x > b \end{cases}    (25)

The function to maximize is zero when x < a or x > b. When a ≤ x ≤ b, the function to maximize is

\log F = -\frac{1}{2} pN \log(2\pi) - \frac{1}{2} N \log|\Sigma| - \frac{1}{2} (Y - Ax)' \Sigma^{-1} (Y - Ax)

If a \le \hat{x}_{MLE} \le b, then \hat{x}_{MAP} = \hat{x}_{MLE}.
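In the scalar (or componentwise-independent) case this is just a clip: inside the support the MAP and ML estimates coincide, and since the Gaussian log-likelihood decreases away from the MLE, an out-of-range MLE is pulled to the nearest boundary. A sketch with illustrative values:

```python
import numpy as np

a, b = -1.0, 1.0                       # support of the uniform prior (illustrative)
x_mle = np.array([0.3, 1.7, -2.2])     # hypothetical componentwise MLE values
x_map = np.clip(x_mle, a, b)           # unchanged inside [a, b], clipped outside
```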


MAP (prior is exponential)

If the prior is an exponential PDF,

p(x) = \begin{cases} \lambda \exp(-\lambda x), & x \ge 0 \\ 0, & x < 0 \end{cases}    (26)

the function to maximize is

\log F = -\frac{1}{2} pN \log(2\pi) - \frac{1}{2} N \log|\Sigma| - \frac{1}{2} (Y - Ax)' \Sigma^{-1} (Y - Ax) + \left[ \ln\lambda - \lambda x \right]    (27)

The estimates are \hat{x} = (A'A)^{-1}(A'Y - \lambda I). The equation for the estimate of Σ is the same as in the corresponding MLE case.
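A sketch of that closed form, under two explicit assumptions: the λ term of (27) is read elementwise and Σ = I, and the nonnegativity constraint of the prior is ignored (so this is only valid when the unconstrained solution lands in x ≥ 0):

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, n = 4, 2, 100
A = rng.standard_normal((p, q))
Y = A @ rng.random((q, n)) + 0.1 * rng.standard_normal((p, n))
lam = 0.5

# Setting the gradient A'(Y - Ax) - lam * 1 of (27) to zero (with Sigma = I)
# gives least squares with an elementwise lambda shift on A'Y.
x_map = np.linalg.solve(A.T @ A, A.T @ Y - lam * np.ones((q, n)))
```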


Summary

MLE (correlated):

\hat{x} = (A'A)^{-1} A'Y    (28)

\hat{\sigma}_k^2 = \frac{1}{n} (Y - A\hat{x})_k (Y - A\hat{x})_k'    where k denotes the k-th row    (29)

\hat{\rho}_{kl} = \frac{(Y - A\hat{x})_k (Y - A\hat{x})_l'}{\sqrt{\sum ((Y - A\hat{x})_k)^2} \, \sqrt{\sum ((Y - A\hat{x})_l)^2}}    (30)

MAP (correlated):

\hat{x} = (A'A + \Sigma_0^{-1})^{-1} (A'Y + \Sigma_0^{-1} \mu_0)    (31)

\hat{\sigma}_k^2 = \frac{1}{n} (Y - A\hat{x})_k (Y - A\hat{x})_k'    (32)

\hat{\rho}_{kl} = \frac{(Y - A\hat{x})_k (Y - A\hat{x})_l'}{\sqrt{\sum ((Y - A\hat{x})_k)^2} \, \sqrt{\sum ((Y - A\hat{x})_l)^2}}    (33)




Thank you. Questions?


Appendix: MLE (derivation)

Lemma: Let x_1, ..., x_N be N vectors (p components each) and let x̄ be defined by (4). Then for any vector b,

\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N (\bar{x} - b)(\bar{x} - b)'    (34)

When we let b = µ, we have

\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N (\bar{x} - \mu)(\bar{x} - \mu)'

= A + N (\bar{x} - \mu)(\bar{x} - \mu)'

where A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'.
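The lemma is easy to confirm numerically; a minimal check with random data:

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 50, 3
X = rng.standard_normal((N, p))     # rows are the x_alpha
b = rng.standard_normal(p)          # an arbitrary vector b
x_bar = X.mean(axis=0)

lhs = (X - b).T @ (X - b)           # sum over alpha of (x - b)(x - b)'
rhs = (X - x_bar).T @ (X - x_bar) + N * np.outer(x_bar - b, x_bar - b)
assert np.allclose(lhs, rhs)        # the decomposition (34) holds
```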


Appendix: MLE (derivation)

Using this result and the trace property \mathrm{tr}(CD) = \sum_{i,j} c_{ij} d_{ji} = \mathrm{tr}(DC), we have

\sum_{\alpha=1}^{N} (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu) = \mathrm{tr} \sum_{\alpha=1}^{N} (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu)

= \mathrm{tr} \sum_{\alpha=1}^{N} \Sigma^{-1} (x_\alpha - \mu)(x_\alpha - \mu)'

= \mathrm{tr}(\Sigma^{-1} A) + \mathrm{tr}\left( \Sigma^{-1} N (\bar{x} - \mu)(\bar{x} - \mu)' \right)

= \mathrm{tr}(\Sigma^{-1} A) + N (\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu)    (35)

Thus we can write (3) as

\log L = -\frac{1}{2} pN \log(2\pi) + \frac{1}{2} N \log|\Sigma^{-1}| - \frac{1}{2} \mathrm{tr}(\Sigma^{-1} A) - \frac{1}{2} N (\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu)    (36)

Appendix: MLE (derivation)

Since Σ is positive definite, N (\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) \ge 0, with equality if and only if µ = x̄. To maximize the second and third terms, use the following lemma:

Lemma: Let

f(C) = \frac{1}{2} N \log|C| - \frac{1}{2} \sum_{i,j=1}^{p} c_{ij} d_{ij}    (37)

where C and D are positive definite. The maximum of f(C) is attained at C = N D^{-1}.
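A numerical spot-check of this lemma (random positive definite D, random symmetric perturbations of the claimed maximizer):

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 40, 3
B = rng.standard_normal((p, p))
D = B @ B.T + p * np.eye(p)               # a positive definite D

def f(C):
    # (37), using sum_ij c_ij d_ij = tr(CD) for symmetric C, D
    return 0.5 * N * np.linalg.slogdet(C)[1] - 0.5 * np.trace(C @ D)

C_star = N * np.linalg.inv(D)             # the claimed maximizer
for _ in range(100):
    E = rng.standard_normal((p, p))
    C = C_star + 0.1 * (E + E.T)          # random symmetric perturbation
    if np.all(np.linalg.eigvalsh(C) > 0): # stay inside the positive definite cone
        assert f(C) <= f(C_star) + 1e-9   # no perturbation beats C_star
```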


Appendix: MLE (derivation)

Applying this lemma to (36), with the last term zero, we find that the maximum occurs at

\Sigma = \frac{1}{N} A    (38)

Thus the maximum likelihood estimates of µ and Σ are \hat{\mu} = \bar{x} and \hat{\Sigma} = \frac{1}{N} A:

\hat{\mu} = \bar{x} = \frac{1}{N} \sum_\alpha x_\alpha    (39)

\hat{\Sigma} = \frac{1}{N} \sum_\alpha (x_\alpha - \bar{x})(x_\alpha - \bar{x})'    (40)

= \frac{1}{N} \left( \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N \bar{x}\bar{x}' \right)    (41)


Appendix: Lemma

Lemma: Let x_1, ..., x_N be N vectors (p components each) and let x̄ be defined by (4). Then for any vector b,

\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N (\bar{x} - b)(\bar{x} - b)'    (42)

Proof of Lemma:

\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x}) + (\bar{x} - b)][(x_\alpha - \bar{x}) + (\bar{x} - b)]'    (43)

= \sum_{\alpha=1}^{N} \left[ (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + (x_\alpha - \bar{x})(\bar{x} - b)' + (\bar{x} - b)(x_\alpha - \bar{x})' + (\bar{x} - b)(\bar{x} - b)' \right]    (44)

Appendix: Lemma

= \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + \left[ \sum_{\alpha=1}^{N} (x_\alpha - \bar{x}) \right] (\bar{x} - b)' + (\bar{x} - b) \left[ \sum_{\alpha=1}^{N} (x_\alpha - \bar{x}) \right]' + N (\bar{x} - b)(\bar{x} - b)'    (45)

The second and third terms are 0 because \sum_{\alpha=1}^{N} (x_\alpha - \bar{x}) = 0.

