Sampling Methods

Sara Wade, University of Cambridge

Charles University, 8-19 April 2013, Prague


Numerical Integration

We want to compute $G = \mathbb{E}_p[g(X)] = \int g(x)\,p(x)\,dx$.

Solution 1: approximate the integral by
$$\hat{G} = \sum_{t=1}^{T} g(x_t)\,p(x_t)\,\Delta x,$$
where the $x_t$ lie on an equidistant grid.

Problem: the number of grid points needed grows exponentially with the dimension of $X$.
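A minimal R sketch of this grid approximation, using an assumed target $p = \mathrm{N}(0,1)$ and $g(x) = x^2$ (so the true value is $G = 1$); the target, grid range, and spacing are illustrative choices, not from the slides:

```r
# Grid approximation of G = E_p[g(X)] for p = N(0,1), g(x) = x^2 (true value 1)
g <- function(x) x^2
delta_x <- 0.01
grid <- seq(-10, 10, by = delta_x)             # equidistant grid covering most of p's mass

G_hat <- sum(g(grid) * dnorm(grid) * delta_x)  # Riemann-sum approximation
G_hat                                          # should be close to 1
```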


Monte Carlo

Solution 2: approximate the integral by
$$\hat{G} = \frac{1}{T} \sum_{t=1}^{T} g(x_t), \quad \text{where } X_t \overset{iid}{\sim} p.$$
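A minimal R sketch of the Monte Carlo estimator, again with the assumed target $p = \mathrm{N}(0,1)$ and $g(x) = x^2$ for illustration:

```r
set.seed(1)

T_samples <- 10000
g <- function(x) x^2           # function whose expectation we want

x <- rnorm(T_samples)          # iid draws from p = N(0, 1)
G_hat <- mean(g(x))            # Monte Carlo estimate of E_p[g(X)]

G_hat                          # should be close to the true value 1
```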


Monte Carlo

Properties:
- unbiased: $\mathbb{E}[\hat{G}] = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[g(X_t)] = G$.
- convergence: $\hat{G} \to \mathbb{E}_p[g(X)]$ as $T \to \infty$ (a.s.) under mild conditions (SLLN).
- variance: $\mathbb{V}(\hat{G}) = \frac{1}{T}\mathbb{V}(g(X))$ ← does not depend on the dimension of $x$!


But how do we generate random samples from $p(x)$?


Probability integral transformation

Solution 1: the probability integral transformation. If $X \sim F$, then $U = F(X) \sim \mathrm{Unif}(0,1)$, so $X = F^{-1}(U) \sim F$.

Note: $F^{-1}$ is the generalized inverse $F^{-1}(u) = \inf_x \{x : F(x) \geq u\}$.

We can obtain a sample $x$ from $F$ by
1. sampling $u \sim \mathrm{Unif}(0,1)$,
2. setting $x = F^{-1}(u)$.

Problem: we need to compute $F^{-1}$, which can be expensive or unavailable.
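As an illustration (not from the slides), a minimal R sketch of inverse-CDF sampling for an assumed Exponential(rate) target, where $F^{-1}(u) = -\log(1-u)/\lambda$:

```r
set.seed(1)

rate <- 2                       # assumed rate parameter lambda
n <- 5000

u <- runif(n)                   # U ~ Unif(0, 1)
x <- -log(1 - u) / rate         # x = F^{-1}(u) for the Exponential(rate) cdf

mean(x)                         # should be close to 1 / rate = 0.5
```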


Probability integral transformation

Ex. truncated normal: $p(x) \propto \mathrm{N}(x \mid \mu, \sigma^2)\,\mathbb{1}(x \geq 0)$.
1. Sample $u \sim \mathrm{Unif}(\Phi(-\mu/\sigma), 1)$,
2. set $x = \Phi^{-1}(u)\,\sigma + \mu$.

[Figure: truncated normal density $p(x)$ with the resulting samples marked on the $x$-axis.]
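A minimal R sketch of this inverse-CDF construction; the values $\mu = 0.5$, $\sigma = 1$ are assumed for illustration:

```r
set.seed(1)

mu <- 0.5; sigma <- 1; n <- 5000

u <- runif(n, min = pnorm(-mu / sigma), max = 1)   # U ~ Unif(Phi(-mu/sigma), 1)
x <- qnorm(u) * sigma + mu                          # x = Phi^{-1}(u) * sigma + mu

all(x >= 0)                                         # samples respect the truncation
```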


Rejection sampling

Let $q(x)$ be a density (that can be easily sampled from) such that
$$p(x) \leq \frac{q(x)}{\alpha} \quad \forall\, x,$$
for some constant $\alpha \in (0,1)$.

Solution 2: rejection sampling algorithm. Obtain a sample $x$ from $p(x)$ by
1. sample $x^*$ from $q(x)$,
2. accept $x = x^*$ with probability $\alpha(x^*) = \min\!\left(1, \frac{p(x^*)}{q(x^*)}\,\alpha\right)$; otherwise go to step 1.

Problem: we need to find a $q(x)$ that can be easily sampled from and such that the number of rejections is small.


Rejection sampling

Ex. truncated normal: $p(x) \propto \mathrm{N}(x \mid \mu, \sigma^2)\,\mathbb{1}(x \geq 0)$, with proposal $q(x) = \mathrm{N}(x \mid \mu, \sigma^2)$.
1. Sample $x^*$ from $\mathrm{N}(\mu, \sigma^2)$,
2. set $x = x^*$ if $x^* \geq 0$; otherwise go to step 1.

[Figure: truncated normal density $p(x)$ with the accepted samples marked on the $x$-axis.]
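A minimal R sketch of this rejection sampler (vectorised for convenience; the values $\mu = 0.5$, $\sigma = 1$ are assumed for illustration):

```r
set.seed(1)

mu <- 0.5; sigma <- 1; n <- 5000

x <- numeric(0)
while (length(x) < n) {
  x_star <- rnorm(n, mean = mu, sd = sigma)   # propose from q = N(mu, sigma^2)
  x <- c(x, x_star[x_star >= 0])              # accept proposals in the support
}
x <- x[1:n]

min(x) >= 0                                   # all accepted samples satisfy the constraint
```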


Markov chain Monte Carlo

A third solution is to generate a Markov chain $\{X_t\}_{t \geq 1}$ with an appropriately defined transition probability such that, for large enough $t$, $X_t$ is approximately sampled from $p(x)$.

Markov chain (first order): a sequence of random variables $X_1, X_2, \ldots$ over the state space $\mathcal{X}$ such that
$$\pi(x_{t+1} \mid x_1, \ldots, x_t) = \pi(x_{t+1} \mid x_t).$$

Homogeneous: a Markov chain is homogeneous if the transition probability $\pi(x_{t+1} \mid x_t)$ does not depend on $t$.


Markov chain Monte Carlo

Stationary distribution: $p(x)$ is stationary with respect to a homogeneous Markov chain if
$$p(x) = \sum_{x_t} \pi(x \mid x_t)\, p(x_t).$$

A sufficient condition to ensure that $p(x)$ is a stationary distribution is the detailed balance condition:
$$p(x)\,\pi(x' \mid x) = p(x')\,\pi(x \mid x').$$


Markov chain Monte Carlo

If $\{X_t\}_{t \geq 1}$ is a homogeneous Markov chain such that
- its state space is the support of $p(x)$,
- it is irreducible and aperiodic,
- it has stationary distribution $p(x)$,

then, from ergodic theory, $X_t \overset{d}{\to} X$, where $X$ has density $p(x)$, and
$$\frac{1}{T - T_0} \sum_{t=T_0}^{T} g(x_t) \to \mathbb{E}_p[g(X)] \quad \text{a.s.}$$

Thus, our Markov chain produces (dependent) samples from $p$ that can be used to approximate $G$.

→ We generate a long Markov chain and discard samples to obtain nearly independent samples from $p(x)$.


Metropolis-Hastings

How do we construct such a chain? It can be obtained by introducing a proposal density $q(x \mid x_t)$ and choosing a starting value $x_0$. At each iteration:
- generate $x^*$ from $q(x \mid x_t)$,
- with acceptance probability
$$\alpha(x^* \mid x_t) = \min\!\left(1, \frac{p(x^*)}{p(x_t)} \frac{q(x_t \mid x^*)}{q(x^* \mid x_t)}\right),$$
set $x_{t+1} = x^*$; otherwise set $x_{t+1} = x_t$.

The transition probability
$$\pi(x_{t+1} \mid x_t) = q(x_{t+1} \mid x_t)\,\alpha(x_{t+1} \mid x_t) + r(x_t)\,\delta_{x_t}(x_{t+1}),$$
where $r(x_t) = 1 - \int q(x \mid x_t)\,\alpha(x \mid x_t)\,dx$ is the probability of rejection, satisfies the detailed balance condition with respect to $p(x)$. For large enough $T_0$, $(x_t)_{t \geq T_0}$ are approximate samples from $p(x)$.
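A minimal, generic Metropolis-Hastings sketch in R (not from the slides). The target, proposal, and all names (`mh_step`, `log_p`, `r_q`, `d_q`) are illustrative assumptions; the proposal here happens to be symmetric, but the step keeps the full $q$-ratio:

```r
set.seed(1)

# One Metropolis-Hastings step with an arbitrary (possibly asymmetric) proposal.
# log_p: log of the (unnormalised) target density
# r_q  : function(x_t) drawing one proposal x* ~ q(. | x_t)
# d_q  : function(x, x_t) giving log q(x | x_t)
mh_step <- function(x_t, log_p, r_q, d_q) {
  x_star <- r_q(x_t)
  log_alpha <- log_p(x_star) - log_p(x_t) + d_q(x_t, x_star) - d_q(x_star, x_t)
  if (log(runif(1)) < log_alpha) x_star else x_t
}

# Illustrative target: standard normal; illustrative proposal: N(x_t, 0.5^2)
log_p <- function(x) dnorm(x, log = TRUE)
r_q   <- function(x_t) rnorm(1, mean = x_t, sd = 0.5)
d_q   <- function(x, x_t) dnorm(x, mean = x_t, sd = 0.5, log = TRUE)

chain <- numeric(5000)
chain[1] <- 0                                   # starting value x_0
for (t in 2:5000) chain[t] <- mh_step(chain[t - 1], log_p, r_q, d_q)

mean(chain); sd(chain)                          # should be near 0 and 1
```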


Independent Metropolis-Hastings

The proposal density $q(x)$ is independent of $x_t$ and such that
$$p(x) \leq \frac{q(x)}{\alpha},$$
for some constant $\alpha \in (0,1)$. The acceptance probability reduces to
$$\alpha(x^* \mid x_t) = \min\!\left(1, \frac{p(x^*)}{p(x_t)} \frac{q(x_t)}{q(x^*)}\right).$$

This algorithm is similar to rejection sampling but produces dependent samples; however, it can be shown that, on average, independent MH accepts more often.


Independent MH

Ex. truncated normal: $p(x) \propto \mathrm{N}(x \mid \mu, \sigma^2)\,\mathbb{1}(x \geq 0)$, with proposal $q(x) = \mathrm{N}(x \mid \mu, \sigma^2)$.
1. Sample $x^*$ from $\mathrm{N}(\mu, \sigma^2)$,
2. set $x_{t+1} = x^*$ if $x^* \geq 0$; otherwise set $x_{t+1} = x_t$.

[Figure: truncated normal density $p(x)$ with the chain's samples marked on the $x$-axis.]
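A minimal R sketch of this independent MH sampler (assumed values $\mu = 0.5$, $\sigma = 1$); note that rejected proposals repeat the previous value, so the samples are dependent:

```r
set.seed(1)

mu <- 0.5; sigma <- 1; T_iter <- 5000

x <- numeric(T_iter)
x[1] <- abs(rnorm(1, mu, sigma))                     # any starting value in the support
for (t in 2:T_iter) {
  x_star <- rnorm(1, mean = mu, sd = sigma)          # propose from q = N(mu, sigma^2)
  x[t] <- if (x_star >= 0) x_star else x[t - 1]      # accept, else repeat previous value
}

mean(x >= 0)                                         # all samples lie in the truncated support
```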


Random walk

The candidate value is obtained by perturbing the previous value: $x^* = x_t + \epsilon$, where $\epsilon$ is independent of $x_t$ with a symmetric distribution centered at 0; for example, $\epsilon \sim \mathrm{N}(0, \sigma^2)$. The acceptance probability reduces to
$$\alpha(x^* \mid x_t) = \min\!\left(1, \frac{p(x^*)}{p(x_t)}\right).$$

The scale of the proposal should be as large as possible without incurring high rejection rates.


Random walk

Ex. bivariate normal: $p(x) = \mathrm{N}(x \mid \mu, \Sigma)$, with proposal $q(x \mid x_t) = \mathrm{N}(x_1 \mid x_{t,1}, \sigma_q^2)\,\mathrm{N}(x_2 \mid x_{t,2}, \sigma_q^2)$.
1. Sample $x^*$ from $\mathrm{N}(x_1 \mid x_{t,1}, \sigma_q^2)\,\mathrm{N}(x_2 \mid x_{t,2}, \sigma_q^2)$,
2. sample $u \sim \mathrm{Unif}(0,1)$ and set
$$x_{t+1} = \begin{cases} x^* & \text{if } u < \min\!\left(1,\; \mathrm{N}(x^* \mid \mu, \Sigma)/\mathrm{N}(x_t \mid \mu, \Sigma)\right) \\ x_t & \text{otherwise.} \end{cases}$$
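A minimal R sketch of this random-walk sampler; the values of $\mu$, $\Sigma$, and the proposal scale $\sigma_q$ are assumed for illustration, and the log target is coded by hand (up to a constant, which cancels in the ratio):

```r
set.seed(1)

# Illustrative target and proposal scale (assumed values, not from the slides)
mu      <- c(0, 0)
Sigma   <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
sigma_q <- 0.5
Sigma_inv <- solve(Sigma)

# Log target up to a constant
log_p <- function(x) as.numeric(-0.5 * t(x - mu) %*% Sigma_inv %*% (x - mu))

T_iter <- 5000
x <- matrix(0, nrow = T_iter, ncol = 2)          # chain initialised at the origin
for (t in 2:T_iter) {
  x_star <- x[t - 1, ] + rnorm(2, sd = sigma_q)  # random-walk proposal
  log_alpha <- log_p(x_star) - log_p(x[t - 1, ])
  x[t, ] <- if (log(runif(1)) < log_alpha) x_star else x[t - 1, ]
}

colMeans(x); cor(x[, 1], x[, 2])                 # should be near mu and 0.8
```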

Sampling Methods

16 / 30

Gibbs sampling

Assume $x = (x_1, \ldots, x_d)$. The Gibbs sampling algorithm iteratively samples $x_i$ given $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d)$. Let $p(x_i \mid x_{-i})$ be the full conditional density. The chain is initialized at $x_0$ and the Gibbs sampling algorithm proceeds by
1. sampling $x_1^{t+1}$ from $p(x_1 \mid x_2^t, \ldots, x_d^t)$,
2. sampling $x_2^{t+1}$ from $p(x_2 \mid x_1^{t+1}, x_3^t, \ldots, x_d^t)$,
   $\vdots$
d. sampling $x_d^{t+1}$ from $p(x_d \mid x_1^{t+1}, \ldots, x_{d-1}^{t+1})$.

The Gibbs sampling procedure is a special case of MH where the proposal is always accepted.


Gibbs sampling

Ex. bivariate normal $p(x) = \mathrm{N}(x \mid \mu, \Sigma)$:
1. sample $x_1^{t+1}$ from $\mathrm{N}(x_1 \mid \beta_{0,1} + \beta_{1,1} x_2^t,\; \sigma_{1|2}^2)$,
2. sample $x_2^{t+1}$ from $\mathrm{N}(x_2 \mid \beta_{0,2} + \beta_{1,2} x_1^{t+1},\; \sigma_{2|1}^2)$.

Strong correlations can slow down Gibbs sampling. If groups of variables can be updated jointly, we can improve mixing; this is known as blocked Gibbs sampling.
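A minimal R sketch of this Gibbs sampler; the conditional-mean coefficients and conditional standard deviations are computed from assumed illustrative values of $\mu$ and $\Sigma$:

```r
set.seed(1)

# Illustrative target (assumed values, not from the slides)
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)

# Conditional-distribution parameters of a bivariate normal
beta_1_1 <- Sigma[1, 2] / Sigma[2, 2]                 # slope of x1 | x2
beta_0_1 <- mu[1] - beta_1_1 * mu[2]
sd_1_2   <- sqrt(Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2])
beta_1_2 <- Sigma[1, 2] / Sigma[1, 1]                 # slope of x2 | x1
beta_0_2 <- mu[2] - beta_1_2 * mu[1]
sd_2_1   <- sqrt(Sigma[2, 2] - Sigma[1, 2]^2 / Sigma[1, 1])

T_iter <- 5000
x <- matrix(0, nrow = T_iter, ncol = 2)               # chain initialised at the origin
for (t in 2:T_iter) {
  x[t, 1] <- rnorm(1, beta_0_1 + beta_1_1 * x[t - 1, 2], sd_1_2)
  x[t, 2] <- rnorm(1, beta_0_2 + beta_1_2 * x[t, 1],     sd_2_1)
}

colMeans(x); cor(x[, 1], x[, 2])                      # should be near mu and 0.8
```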

Sampling Methods

18 / 30

Convergence of MCMC

Samples need to be discarded so that $(x_t)_{t=1}^T$ are approximately independent samples from $p(x)$:
- burn-in: to overcome poor initialization, we may need to throw away the first $T_0$ samples.
- thinning: to reduce dependency, we may need to keep only every $k$-th sample.

It is difficult to monitor convergence, but we can detect lack of convergence via trace plots, autocorrelations, and comparison of chains with different initializations.
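A brief R sketch of these checks on a generic numeric vector `chain` (assumed to hold MCMC draws of one parameter; the placeholder below just stands in for real output). Base R's `plot` and `acf` suffice for trace and autocorrelation plots:

```r
set.seed(1)
chain <- cumsum(rnorm(2000)) * 0.01 + rnorm(2000)   # stand-in for real MCMC output

burn_in <- 500
thin    <- 5
kept    <- chain[seq(burn_in + 1, length(chain), by = thin)]  # discard burn-in, keep every k-th

plot(chain, type = "l", xlab = "Iteration", ylab = "Sample")  # trace plot
acf(kept)                                                     # autocorrelation of kept draws
```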


Summary of MCMC

Advantages: general, applicable to many problems, easy to implement, theoretical guarantees as $T \to \infty$.

Disadvantages: can be slow and expensive to compute, difficult to assess convergence.


MCMC in Bayesian inference

In Bayesian inference, a closed form for the posterior distribution is often unavailable. MCMC can be used to obtain samples $(\theta_t)_{t=1}^T$ from the posterior $p(\theta \mid x)$, and
$$\mathbb{E}[g(\theta) \mid x] \approx \frac{1}{T} \sum_{t=1}^{T} g(\theta_t).$$

For example, the posterior distribution can be approximated by
$$P(A \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} \delta_{\theta_t}(A),$$
for any measurable set $A \subseteq \Theta$, and the predictive density at $x^*$ is approximately
$$p(x^* \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} p(x^* \mid \theta_t, x).$$


Probit regression

Model:
$$Y_i \mid x_i, w \overset{ind}{\sim} \mathrm{Bern}(p(x_i)), \quad p(x_i) = p(Y_i = 1 \mid x_i, w) = \Phi(w_0 + w_1 x_i).$$

Prior: $w \sim \mathrm{N}(\mu_0, \Sigma_0)$.

Latent model: introduce a latent variable $\tilde{Y}_i$ where
$$\tilde{Y}_i \mid x_i, w \overset{ind}{\sim} \mathrm{N}(w_0 + w_1 x_i, 1), \quad y_i \mid \tilde{y}_i, x_i, w = \mathbb{1}(\tilde{y}_i \geq 0).$$

Note: if we integrate out $\tilde{Y}_i$, we recover the original model.


Probit regression - Gibbs sampler

Gibbs sampling algorithm: initialize $w, \tilde{y}$, then iterate:
- sample $w \mid \tilde{y}, x \sim \mathrm{N}(\hat{w}, \hat{\Sigma})$, where
$$\hat{\Sigma} = (\Sigma_0^{-1} + X'X)^{-1}, \quad \hat{w} = \hat{\Sigma}(\Sigma_0^{-1}\mu_0 + X'\tilde{y}).$$
- sample $\tilde{Y}_i \mid w, x_i, y_i$ independently with density
$$\frac{1}{Z}\, \mathrm{N}(\tilde{y}_i \mid w_0 + w_1 x_i, 1)\,\big(\mathbb{1}(\tilde{y}_i \geq 0)\big)^{y_i}\,\big(\mathbb{1}(\tilde{y}_i < 0)\big)^{1-y_i}.$$
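A minimal R sketch of this data-augmentation Gibbs sampler on simulated data; the true weights, prior settings, and sample size are illustrative assumptions. The truncated-normal draws reuse the inverse-CDF trick from earlier:

```r
set.seed(1)

# Simulated data (illustrative assumptions)
n <- 100
x <- rnorm(n)
X <- cbind(1, x)                                  # design matrix with intercept
w_true <- c(-0.2, 0.4)
y <- rbinom(n, 1, pnorm(X %*% w_true))

# Prior w ~ N(mu0, Sigma0)
mu0 <- c(0, 0); Sigma0 <- diag(2) * 10
Sigma0_inv <- solve(Sigma0)

# Posterior covariance of w given the latent y_tilde (constant across iterations)
Sigma_hat <- solve(Sigma0_inv + t(X) %*% X)
Sigma_hat_chol <- chol(Sigma_hat)

T_iter <- 2000
W <- matrix(0, nrow = T_iter, ncol = 2)
y_tilde <- ifelse(y == 1, 0.5, -0.5)              # crude initialisation of the latent variables

for (t in 1:T_iter) {
  # 1. sample w | y_tilde, x
  w_hat <- Sigma_hat %*% (Sigma0_inv %*% mu0 + t(X) %*% y_tilde)
  W[t, ] <- as.vector(w_hat + t(Sigma_hat_chol) %*% rnorm(2))

  # 2. sample y_tilde_i | w, x_i, y_i from a truncated normal via the inverse cdf
  m <- as.vector(X %*% W[t, ])
  lower <- ifelse(y == 1, pnorm(0, m, 1), 0)       # truncate to [0, Inf) when y_i = 1
  upper <- ifelse(y == 1, 1, pnorm(0, m, 1))       # truncate to (-Inf, 0) when y_i = 0
  u <- runif(n, lower, upper)
  y_tilde <- qnorm(u, m, 1)
}

colMeans(W[-(1:500), ])                            # posterior means after burn-in
```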


Probit regression

[Figure: trace plots of the first 2000 samples for $w_0$ (left) and $w_1$ (right).]

[Figure: autocorrelations for $w_0$ (left) and $w_1$ (right).]


Probit regression

[Figure: posterior mean of $p(Y^* = 1 \mid x^*, w)$ at a grid of new values $x^*$ with 95% pointwise credible intervals (and the true curve in red). The $n = 100$ data points are represented as crosses for $y_i = 0$ and circles for $y_i = 1$.]


Logit regression

Model:
$$Y_i \mid x_i, w \overset{ind}{\sim} \mathrm{Bern}(p(x_i)), \quad p(x_i) = p(Y_i = 1 \mid x_i, w) = \frac{\exp(w_0 + w_1 x_i)}{1 + \exp(w_0 + w_1 x_i)}.$$

Prior: $w \sim \mathrm{N}(\mu_0, \Sigma_0)$.


Logit regression - MH random walk

MH random walk algorithm: initialize $w$, then at each iteration:
- sample $w^* \mid w^t$ from the proposal density
$$q(w \mid w^t) = \mathrm{N}\!\left(w \mid w^t,\; V(\Sigma_0^{-1} + \hat{C}^{-1})^{-1} V\right),$$
where $\hat{C}$ is the variance-covariance matrix of the MLEs and $V$ is a diagonal matrix of tuning parameters.
- accept $w^{t+1} = w^*$ with probability
$$\alpha(w^* \mid w^t) = \min\!\left(1,\; \frac{\mathrm{N}(w^* \mid \mu_0, \Sigma_0)}{\mathrm{N}(w^t \mid \mu_0, \Sigma_0)} \prod_{i=1}^{n} \left(\frac{\exp(w_0^* + w_1^* x_i)}{\exp(w_0^t + w_1^t x_i)}\right)^{y_i} \frac{1 + \exp(w_0^t + w_1^t x_i)}{1 + \exp(w_0^* + w_1^* x_i)}\right),$$
otherwise set $w^{t+1} = w^t$.

Implemented in the MCMClogit function of the R package MCMCpack.
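A brief usage sketch; argument names follow my reading of the MCMCpack documentation, so check `?MCMClogit` before relying on them. Here `b0` and `B0` are the prior mean and prior precision of the multivariate normal prior on $w$, and the data are simulated for illustration:

```r
# install.packages("MCMCpack")   # if not already installed
library(MCMCpack)

# Illustrative data (assumed, not the slides' data set)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.2 + 0.4 * x))
df <- data.frame(y = y, x = x)

fit <- MCMClogit(y ~ x, data = df,
                 burnin = 1000, mcmc = 10000,
                 b0 = 0, B0 = 0.01)   # N(0, 100 I) prior expressed as mean and precision

summary(fit)                          # posterior summaries for (Intercept) and x
plot(fit)                             # trace and density plots
```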


Logit regression

[Figure: trace plots of the first 2000 samples for $w_0$ ("(Intercept)", left) and $w_1$ ("x", right).]

[Figure: autocorrelations for $w_0$ (left) and $w_1$ (right).]


Logit regression

[Figure: posterior mean of $p(Y^* = 1 \mid x^*, w)$ at a grid of new values $x^*$ with 95% pointwise credible intervals (and the true curve in red). The $n = 100$ data points are represented as crosses for $y_i = 0$ and circles for $y_i = 1$.]


References

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327-335.

Casella, G. and George, E.I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3):167-174.

