Sampling Methods
Sara Wade, University of Cambridge
Charles University, 8-19 April 2013, Prague
Numerical Integration

We want to compute
\[ G = \mathbb{E}_p[g(X)] = \int g(x)\, p(x)\, dx. \]
Solution 1: approximate the integral by
\[ \hat{G} = \sum_{t=1}^{T} g(x_t)\, p(x_t)\, \Delta x, \]
where the x_t lie on an equidistant grid.

Problem: the number of grid points needed grows exponentially with the dimension of X.
Monte Carlo

Solution 2: approximate the integral by
\[ \hat{G} = \frac{1}{T} \sum_{t=1}^{T} g(x_t), \quad \text{where } X_t \overset{iid}{\sim} p. \]
Monte Carlo

Properties:
- unbiased: \( \mathbb{E}[\hat{G}] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[g(X_t)] = G \).
- convergence: \( \hat{G} \to \mathbb{E}_p[g(X)] \) as T → ∞ (a.s.) under mild conditions (SLLN).
- variance: \( \mathbb{V}(\hat{G}) = \frac{1}{T} \mathbb{V}(g(X)) \) ← does not depend on the dimension of x!

But how do we generate random samples from p(x)?
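For instance, here is a minimal R sketch of the Monte Carlo estimate; the target p = N(0, 1) and g(x) = x² (true value G = 1) are illustrative choices, not from the slides.

```r
# Monte Carlo estimate of G = E_p[g(X)], with p = N(0, 1) and g(x) = x^2.
set.seed(1)
n.iter <- 10000                    # number of samples T
x <- rnorm(n.iter)                 # x_1, ..., x_T iid from p
g <- function(x) x^2
G.hat <- mean(g(x))                # Monte Carlo estimate of G (true value 1)
mc.se <- sd(g(x)) / sqrt(n.iter)   # its standard error, sqrt(V(g(X)) / T)
c(estimate = G.hat, std.error = mc.se)
```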
Probability integral transformation
Solution 1: the probability integral transformation.

If X ∼ F, then U = F(X) ∼ Unif(0, 1), and conversely X = F^{-1}(U) ∼ F.
Note: F^{-1} is the generalized inverse, F^{-1}(u) = inf{x : F(x) ≥ u}.

We can obtain a sample x from F by:
1. sample u ∼ Unif(0, 1),
2. set x = F^{-1}(u).

Problem: we need to compute F^{-1}, which can be expensive or unavailable.
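A minimal R sketch of this recipe, using the Exp(rate) distribution as an illustrative example (not from the slides), for which F^{-1}(u) = -log(1 - u)/rate:

```r
# Inverse-CDF sampling for X ~ Exp(rate): F(x) = 1 - exp(-rate * x).
set.seed(1)
rate <- 2
u <- runif(10000)          # step 1: u ~ Unif(0, 1)
x <- -log(1 - u) / rate    # step 2: x = F^{-1}(u)
mean(x)                    # close to the true mean 1 / rate = 0.5
```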
Probability integral transformation

Ex. truncated normal: p(x) ∝ N(x | µ, σ²) 1(x ≥ 0).
1. sample u ∼ Unif(Φ(−µ/σ), 1),
2. set x = Φ^{-1}(u) σ + µ.

[Figure: density p(x) of the truncated normal, with the sampled values marked along the x-axis from −3 to 3.]
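A minimal R sketch of this recipe (µ = 0 and σ = 1 are illustrative choices):

```r
# Inverse-CDF sampling for p(x) proportional to N(x | mu, sigma^2) 1(x >= 0).
set.seed(1)
mu <- 0; sigma <- 1
u <- runif(1000, min = pnorm(-mu / sigma), max = 1)  # u ~ Unif(Phi(-mu/sigma), 1)
x <- qnorm(u) * sigma + mu                           # x = Phi^{-1}(u) * sigma + mu
all(x >= 0)                                          # every sample respects the truncation
```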
Rejection sampling

Let q(x) be a density (that can be easily sampled from) such that
\[ p(x) \le \frac{q(x)}{\alpha} \quad \forall\, x, \]
for some constant α ∈ (0, 1).

Solution 2: rejection sampling. Obtain a sample x from p(x) by:
1. sample x* from q(x),
2. accept x = x* with probability α(x*) = min(1, α p(x*)/q(x*)); otherwise go to step 1.

Problem: we need to find a q(x) that can be easily sampled from and for which the number of rejections is small.
Rejection sampling

Ex. truncated normal: p(x) ∝ N(x | µ, σ²) 1(x ≥ 0); q(x) = N(x | µ, σ²).
1. sample x* from N(µ, σ²),
2. set x = x* if x* ≥ 0; otherwise go to step 1.

[Figure: density p(x) of the truncated normal, with accepted and rejected proposals marked along the x-axis from −3 to 3.]
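A minimal R sketch of this rejection sampler (µ = 0, σ = 1 for illustration; with this choice roughly half of the proposals are rejected):

```r
# Rejection sampling for p(x) proportional to N(x | mu, sigma^2) 1(x >= 0),
# with proposal q(x) = N(x | mu, sigma^2).
set.seed(1)
mu <- 0; sigma <- 1
sample.trunc <- function() {
  repeat {
    x.star <- rnorm(1, mean = mu, sd = sigma)  # step 1: propose from q
    if (x.star >= 0) return(x.star)            # step 2: accept iff x* >= 0
  }
}
x <- replicate(1000, sample.trunc())
range(x)  # all accepted samples lie in the support [0, Inf)
```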
Markov chain Monte Carlo

A third solution is to generate a Markov chain {X_t}_{t≥1} with an appropriately defined transition probability such that, for large enough t, X_t is approximately a sample from p(x).

Markov chain (first order): a sequence of random variables X_1, X_2, ... over the state space X such that
\[ \pi(x_{t+1} \mid x_1, \ldots, x_t) = \pi(x_{t+1} \mid x_t). \]

Homogeneous: a Markov chain is homogeneous if the transition probabilities do not depend on t, i.e. π(x_{t+1} | x_t) does not depend on t.
Markov chain Monte Carlo
Stationary distribution: p(x) is stationary with respect to a homogeneous Markov chain if
\[ p(x) = \sum_{x_t} \pi(x \mid x_t)\, p(x_t). \]
A sufficient condition to ensure that p(x) is a stationary distribution is the detailed balance condition:
\[ p(x)\, \pi(x' \mid x) = p(x')\, \pi(x \mid x'). \]
Markov chain Monte Carlo

If {X_t}_{t≥1} is a homogeneous Markov chain whose state space is the support of p(x), which is irreducible and aperiodic, and which has stationary distribution p(x), then, from ergodic theory, \( X_t \overset{d}{\to} X \), where X has density p(x), and
\[ \frac{1}{T - T_0} \sum_{t=T_0}^{T} g(x_t) \to \mathbb{E}_p[g(X)] \quad \text{a.s.} \]
Thus, our Markov chain produces (dependent) samples from p that can be used to approximate G.
→ We generate a long Markov chain and discard samples to obtain nearly independent samples from p(x).
Metropolis-Hastings

How do we construct such a chain? Introduce a proposal density q(x | x_t) and choose a starting value x_0. At each iteration:
- generate x* from q(x | x_t),
- with acceptance probability
\[ \alpha(x^* \mid x_t) = \min\left(1,\; \frac{p(x^*)}{p(x_t)}\, \frac{q(x_t \mid x^*)}{q(x^* \mid x_t)}\right) \]
set x_{t+1} = x*; otherwise set x_{t+1} = x_t.

The transition probability
\[ \pi(x_{t+1} \mid x_t) = q(x_{t+1} \mid x_t)\, \alpha(x_{t+1} \mid x_t) + r(x_t)\, \delta_{x_t}(x_{t+1}), \]
where \( r(x_t) = \int q(x \mid x_t)\,\bigl(1 - \alpha(x \mid x_t)\bigr)\, dx \) is the rejection probability, satisfies the detailed balance condition with respect to p(x). For large enough T_0, (x_t)_{t ≥ T_0} are approximate samples from p(x).
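A minimal R sketch of a generic Metropolis-Hastings sampler; the function and argument names are illustrative, and the target density p, proposal sampler q.sample, and proposal density q.density are supplied by the user. Working on the log scale avoids numerical underflow in the ratio.

```r
# Generic Metropolis-Hastings sampler (illustrative names; q.density(a, b) = q(a | b)).
mh.sample <- function(n.iter, x0, p, q.sample, q.density) {
  x <- numeric(n.iter + 1)
  x[1] <- x0
  for (t in 1:n.iter) {
    x.star <- q.sample(x[t])                        # propose x* ~ q(. | x_t)
    log.alpha <- log(p(x.star)) - log(p(x[t])) +    # log acceptance ratio
      log(q.density(x[t], x.star)) - log(q.density(x.star, x[t]))
    if (log(runif(1)) < log.alpha) x[t + 1] <- x.star else x[t + 1] <- x[t]
  }
  x
}

# Example: a Gaussian random-walk proposal targeting a standard normal.
set.seed(1)
chain <- mh.sample(5000, x0 = 0,
                   p = function(x) dnorm(x),
                   q.sample  = function(x) rnorm(1, mean = x, sd = 1),
                   q.density = function(a, b) dnorm(a, mean = b, sd = 1))
c(mean(chain), var(chain))  # roughly 0 and 1 once the chain has mixed
```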
Independent Metropolis-Hastings

The proposal density q(x) is independent of x_t and such that
\[ p(x) \le \frac{q(x)}{\alpha} \]
for some constant α ∈ (0, 1). The acceptance probability reduces to
\[ \alpha(x^* \mid x_t) = \min\left(1,\; \frac{p(x^*)}{p(x_t)}\, \frac{q(x_t)}{q(x^*)}\right). \]
This algorithm is similar to rejection sampling but produces dependent samples; however, it can be shown that, on average, independent MH accepts more often.
Independent MH

Ex. truncated normal: p(x) ∝ N(x | µ, σ²) 1(x ≥ 0); q(x) = N(x | µ, σ²).
- sample x* from N(µ, σ²),
- set x_{t+1} = x* if x* ≥ 0; otherwise set x_{t+1} = x_t.

[Figure: density p(x) of the truncated normal, with the chain's values marked along the x-axis from −3 to 3.]
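A minimal R sketch of this independent MH chain (µ = 0, σ = 1 for illustration):

```r
# Independent MH for p(x) proportional to N(x | mu, sigma^2) 1(x >= 0),
# with proposal q(x) = N(x | mu, sigma^2).
set.seed(1)
mu <- 0; sigma <- 1
n.iter <- 5000
x <- numeric(n.iter + 1)
x[1] <- 1                                     # starting value inside the support
for (t in 1:n.iter) {
  x.star <- rnorm(1, mean = mu, sd = sigma)   # propose from q, independently of x_t
  if (x.star >= 0) x[t + 1] <- x.star         # acceptance probability reduces to 1(x* >= 0)
  else             x[t + 1] <- x[t]           # rejected: the chain repeats x_t
}
mean(x)  # compare with dnorm(0) / (1 - pnorm(0)) ~= 0.80, the true mean for mu = 0, sigma = 1
```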
Random walk

The candidate value is obtained by perturbing the previous value:
\[ x^* = x_t + \epsilon, \]
where ε is independent of x_t with a symmetric distribution centered at 0; for example, ε ∼ N(0, σ²). The acceptance probability reduces to
\[ \alpha(x^* \mid x_t) = \min\left(1,\; \frac{p(x^*)}{p(x_t)}\right). \]
The scale of the proposal should be as large as possible without incurring high rejection rates.
Random walk

Ex. bivariate normal: p(x) = N(x | µ, Σ), q(x | x_t) = N(x_1 | x_{t,1}, σ_q²) N(x_2 | x_{t,2}, σ_q²).
- sample x* from N(x_1 | x_{t,1}, σ_q²) N(x_2 | x_{t,2}, σ_q²),
- sample u ∼ Unif(0, 1) and set
\[ x_{t+1} = \begin{cases} x^* & \text{if } u < \min\bigl(1,\; \mathrm{N}(x^* \mid \mu, \Sigma) / \mathrm{N}(x_t \mid \mu, \Sigma)\bigr), \\ x_t & \text{otherwise.} \end{cases} \]
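A minimal R sketch of this random-walk chain (the values of µ, Σ, and the proposal scale σ_q are illustrative):

```r
# Random-walk MH for a bivariate normal target.
set.seed(1)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
Sigma.inv <- solve(Sigma)
log.p <- function(x) -0.5 * sum((x - mu) * (Sigma.inv %*% (x - mu)))  # log N(x | mu, Sigma) up to a constant
sigma.q <- 0.5
n.iter <- 5000
x <- matrix(0, nrow = n.iter + 1, ncol = 2)
for (t in 1:n.iter) {
  x.star <- x[t, ] + rnorm(2, sd = sigma.q)   # perturb each coordinate
  if (log(runif(1)) < log.p(x.star) - log.p(x[t, ])) x[t + 1, ] <- x.star  # accept
  else                                               x[t + 1, ] <- x[t, ]  # reject
}
colMeans(x)  # roughly mu once the chain has mixed
```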
Gibbs sampling
Assume x = (x_1, ..., x_d). The Gibbs sampling algorithm iteratively samples x_i given x_{-i} = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_d). Let p(x_i | x_{-i}) be the full conditional density. The chain is initialized at x^0, and the Gibbs sampling algorithm proceeds by:
- sample x_1^{t+1} from p(x_1 | x_2^t, ..., x_d^t),
- sample x_2^{t+1} from p(x_2 | x_1^{t+1}, x_3^t, ..., x_d^t),
- ...
- sample x_d^{t+1} from p(x_d | x_1^{t+1}, ..., x_{d-1}^{t+1}).

The Gibbs sampling procedure is a special case of MH where the proposal is always accepted.
Gibbs sampling

Ex. bivariate normal, p(x) = N(x | µ, Σ):
- sample x_1^{t+1} from N(x_1 | β_{0,1} + β_{1,1} x_2^t, σ²_{1|2}),
- sample x_2^{t+1} from N(x_2 | β_{0,2} + β_{1,2} x_1^{t+1}, σ²_{2|1}).

Strong correlations can slow down Gibbs sampling. If groups of variables can be updated jointly, we can improve the mixing; this is known as blocked Gibbs sampling.
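A minimal R sketch of this Gibbs sampler, written directly in terms of the means, standard deviations, and correlation (the values are illustrative); the full conditional of x_1 given x_2 is N(µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2), σ_1²(1 − ρ²)), and symmetrically for x_2.

```r
# Gibbs sampler for a bivariate normal target.
set.seed(1)
mu1 <- 0; mu2 <- 0; s1 <- 1; s2 <- 1; rho <- 0.8
n.iter <- 5000
x <- matrix(0, nrow = n.iter + 1, ncol = 2)
for (t in 1:n.iter) {
  # update x1 from its full conditional given the current x2
  x[t + 1, 1] <- rnorm(1, mu1 + rho * (s1 / s2) * (x[t, 2] - mu2), s1 * sqrt(1 - rho^2))
  # update x2 from its full conditional given the newly drawn x1
  x[t + 1, 2] <- rnorm(1, mu2 + rho * (s2 / s1) * (x[t + 1, 1] - mu1), s2 * sqrt(1 - rho^2))
}
colMeans(x); cor(x[, 1], x[, 2])  # roughly (mu1, mu2) and rho
```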
Convergence of MCMC

Samples need to be discarded so that (x_t)_{t=1}^T are approximately independent samples from p(x):
- burn-in: to overcome poor initialization, we may need to throw away the first T_0 samples.
- thinning: to reduce dependency, we may need to keep only every k-th sample.

It is difficult to monitor convergence, but we can detect lack of convergence via trace plots, autocorrelations, and comparing chains with different initializations.
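A minimal R sketch of these post-processing steps and checks, assuming the draws are stored in a numeric vector chain (the burn-in length and thinning interval below are illustrative choices):

```r
# Burn-in, thinning, and basic convergence checks for a chain of draws.
T0 <- 1000                                   # burn-in: discard the first T0 draws
k  <- 10                                     # thinning: keep every k-th draw
kept <- chain[-(1:T0)]
kept <- kept[seq(1, length(kept), by = k)]
plot(kept, type = "l")                       # trace plot: should look like stationary noise
acf(kept)                                    # autocorrelations: should decay quickly
```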
Summary of MCMC
Advantages: general, applicable to many problems, easy to implement, theoretical guarantees as T → ∞.

Disadvantages: can be slow and expensive to compute, difficult to assess convergence.
MCMC in Bayesian inference

In Bayesian inference, a closed form for the posterior distribution is often unavailable. MCMC can be used to obtain samples (θ_t)_{t=1}^T from the posterior p(θ | x), and
\[ \mathbb{E}[g(\theta) \mid x] \approx \frac{1}{T} \sum_{t=1}^{T} g(\theta_t). \]
For example, the posterior distribution can be approximated by
\[ P(A \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} \delta_{\theta_t}(A) \]
for any measurable set A ⊆ Θ, and the predictive density at x* is approximately
\[ p(x^* \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} p(x^* \mid \theta_t, x). \]
Probit regression

Model:
\[ Y_i \mid x_i, w \overset{ind}{\sim} \mathrm{Bern}(p(x_i)), \quad p(x_i) = p(Y_i = 1 \mid x_i, w) = \Phi(w_0 + w_1 x_i). \]
Prior: w ∼ N(µ_0, Σ_0).

Latent model: introduce a latent variable Ỹ_i, where
\[ \tilde{Y}_i \mid x_i, w \overset{ind}{\sim} \mathrm{N}(w_0 + w_1 x_i, 1), \quad y_i \mid \tilde{y}_i, x_i, w = 1(\tilde{y}_i \ge 0). \]
Note: if we integrate out Ỹ_i, we recover the original model.
Probit regression - Gibbs sampler
Gibbs sampling algorithm: initialize w and ỹ, then iterate:
- sample w | ỹ, x ∼ N(ŵ, Σ̂), where
\[ \hat{\Sigma} = (\Sigma_0^{-1} + X'X)^{-1}, \qquad \hat{w} = \hat{\Sigma}\,(\Sigma_0^{-1} \mu_0 + X'\tilde{y}), \]
- sample Ỹ_i | w, x_i, y_i independently with density
\[ \frac{1}{Z}\, \mathrm{N}(\tilde{y}_i \mid w_0 + w_1 x_i, 1)\, \bigl(1(\tilde{y}_i \ge 0)\bigr)^{y_i} \bigl(1(\tilde{y}_i < 0)\bigr)^{1 - y_i}. \]
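A sketch of this sampler on simulated data (the data, prior values, and burn-in length are illustrative assumptions); the truncated normals are drawn with the inverse-CDF method from the earlier slides.

```r
# Data-augmentation Gibbs sampler for probit regression.
set.seed(1)
n <- 100
x <- runif(n, -4, 4)
X <- cbind(1, x)                               # design matrix with intercept
w.true <- c(0, 1)
y <- rbinom(n, 1, pnorm(X %*% w.true))

mu0 <- c(0, 0); Sigma0 <- diag(100, 2)         # prior w ~ N(mu0, Sigma0)
Sigma.hat <- solve(solve(Sigma0) + t(X) %*% X) # conditional covariance of w given y.tilde
n.iter <- 2000
w <- matrix(0, nrow = n.iter + 1, ncol = 2)
for (t in 1:n.iter) {
  # 1. sample y.tilde_i ~ N(eta_i, 1) truncated to [0, Inf) if y_i = 1, to (-Inf, 0) if y_i = 0
  eta <- as.vector(X %*% w[t, ])
  lo <- ifelse(y == 1, pnorm(0 - eta), 0)      # lower bound of Unif on the CDF scale
  hi <- ifelse(y == 1, 1, pnorm(0 - eta))      # upper bound
  y.tilde <- eta + qnorm(runif(n, lo, hi))
  # 2. sample w ~ N(w.hat, Sigma.hat)
  w.hat <- Sigma.hat %*% (solve(Sigma0) %*% mu0 + t(X) %*% y.tilde)
  w[t + 1, ] <- as.vector(w.hat + t(chol(Sigma.hat)) %*% rnorm(2))
}
colMeans(w[-(1:500), ])  # posterior means of (w0, w1) after burn-in
```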
Probit regression

[Figure: trace plots of the first 2000 samples for w_0 (left) and w_1 (right), and autocorrelation plots for w_0 (left) and w_1 (right).]
Probit regression

[Figure] Posterior mean of p(Y* = 1 | x*, w) at a grid of new values x*, with 95% pointwise credible intervals (and the true curve in red). The n = 100 data points are represented as crosses for y_i = 0 and circles for y_i = 1.
Logit regression
Model:
\[ Y_i \mid x_i, w \overset{ind}{\sim} \mathrm{Bern}(p(x_i)), \quad p(x_i) = p(Y_i = 1 \mid x_i, w) = \frac{\exp(w_0 + w_1 x_i)}{1 + \exp(w_0 + w_1 x_i)}. \]
Prior: w ∼ N(µ_0, Σ_0).
Logit regression - MH random walk

MH random walk algorithm: initialize w, then at each iteration:
- sample w* | w^t with proposal density
\[ q(w \mid w^t) = \mathrm{N}\bigl(w \mid w^t,\; V(\Sigma_0^{-1} + \hat{C}^{-1})^{-1} V\bigr), \]
where Ĉ is the variance-covariance matrix of the MLEs and V is a diagonal matrix of tuning parameters,
- accept w^{t+1} = w* with probability
\[ \alpha(w^* \mid w^t) = \min\left(1,\; \frac{\mathrm{N}(w^* \mid \mu_0, \Sigma_0)}{\mathrm{N}(w^t \mid \mu_0, \Sigma_0)} \prod_{i=1}^{n} \left(\frac{\exp(w_0^* + w_1^* x_i)}{\exp(w_0^t + w_1^t x_i)}\right)^{y_i} \frac{1 + \exp(w_0^t + w_1^t x_i)}{1 + \exp(w_0^* + w_1^* x_i)}\right), \]
otherwise set w^{t+1} = w^t. (The ratio of prior densities appears because the target is the posterior; the symmetric random-walk proposal densities cancel.)

This algorithm is implemented in the MCMClogit function of the R package MCMCpack.
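In practice one would call MCMClogit directly; the hand-rolled R sketch below makes the steps explicit (the simulated data, prior values, and the fixed proposal scale are illustrative assumptions; the slide's proposal instead adapts the covariance as written above).

```r
# Random-walk MH for the logit model with a N(mu0, Sigma0) prior.
set.seed(1)
n <- 100
x <- runif(n, -4, 4)
X <- cbind(1, x)
w.true <- c(0.5, 1)
y <- rbinom(n, 1, plogis(X %*% w.true))

mu0 <- c(0, 0); Sigma0 <- diag(100, 2)
log.post <- function(w) {                      # log prior + log likelihood, up to a constant
  eta <- as.vector(X %*% w)
  sum(y * eta - log(1 + exp(eta))) - 0.5 * sum((w - mu0) * solve(Sigma0, w - mu0))
}
n.iter <- 5000
w <- matrix(0, nrow = n.iter + 1, ncol = 2)
for (t in 1:n.iter) {
  w.star <- w[t, ] + rnorm(2, sd = 0.3)        # symmetric random-walk proposal
  if (log(runif(1)) < log.post(w.star) - log.post(w[t, ])) w[t + 1, ] <- w.star
  else                                                     w[t + 1, ] <- w[t, ]
}
colMeans(w[-(1:1000), ])                       # posterior means of (w0, w1) after burn-in
```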
Logit regression

[Figure: trace plots of the first 2000 samples for w_0 (left) and w_1 (right), and autocorrelation plots for w_0 (left) and w_1 (right).]
Logit regression

[Figure] Posterior mean of p(Y* = 1 | x*, w) at a grid of new values x*, with 95% pointwise credible intervals (and the true curve in red). The n = 100 data points are represented as crosses for y_i = 0 and circles for y_i = 1.