Criteria-Based Shrinkage and Forecasting
Peter Reinhard Hansen Stanford University
SITE Conference, July 2006
QRINKAGE
Outline
❏ Theory: Derive the distribution of the Out-of-Sample LR statistic... and related statistics.
❏ Implications for Data Mining
❍ Harder to mine OS... OS results are more credible.
❏ Qrinkage: OS distribution motivates Criteria-Based Shrinkage
❍ Very simple in the regression model... soft AIC/BIC
❍ Implications/Insight...
❋ Diffusion Indices
❋ Forecast Combination
❏ Simple Fix of Weak/Many Instruments
❏ Empirical results using Stock & Watson macro data
P. R. Hansen, 2006 (p.1)
QRINKAGE
A Simple Example

Suppose $X_i \sim \text{iid } N(\theta_0, 1)$ where $\theta_0$ is unknown.

In-sample: a single observation, $X_1$. The MLE is $\hat\theta_1 = X_1$, and the log-likelihood (at $\hat\theta_1$ and $\theta_0$) is
$$\ell_1(\hat\theta_1) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(X_1 - \hat\theta_1)^2 = -\tfrac{1}{2}\log(2\pi),$$
$$\ell_1(\theta_0) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(X_1 - \theta_0)^2.$$
So the relative fit (in terms of 2×log-likelihood) is
$$2\{\ell_1(\hat\theta_1) - \ell_1(\theta_0)\} = (X_1 - \theta_0)^2 \sim \chi^2_1,$$
such that the expected “over-fit” is 1.
Out-of-sample: a single observation, $X_2$. (Think of $\hat\theta_1$ as a prediction of $X_2$.)
$$\ell_2(\hat\theta_1) = c - \tfrac{1}{2}(X_2 - \hat\theta_1)^2 = c - \tfrac{1}{2}\{(X_2 - \theta_0) - (X_1 - \theta_0)\}^2,$$
$$\ell_2(\theta_0) = c - \tfrac{1}{2}(X_2 - \theta_0)^2.$$
So the relative fit is now
$$2\{\ell_2(\hat\theta_1) - \ell_2(\theta_0)\} = 2(X_1 - \theta_0)(X_2 - \theta_0) - (X_1 - \theta_0)^2 = 2Z_1'Z_2 - Z_1'Z_1,$$
where $Z_i = (X_i - \theta_0) \sim \text{iid } N(0,1)$. The expected relative fit is now −1!
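The ±1 over/under-fit in this example is easy to check by simulation. A minimal sketch in Python (the Monte Carlo setup, seed, and replication count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, R = 0.0, 200_000  # true mean and number of Monte Carlo replications

X1 = rng.normal(theta0, 1.0, R)  # in-sample observation; the MLE is theta_hat = X1
X2 = rng.normal(theta0, 1.0, R)  # out-of-sample observation

# Relative fit in terms of 2*(log-likelihood difference)
overfit_is = (X1 - theta0) ** 2                                     # chi2(1), mean +1
relfit_os = 2 * (X1 - theta0) * (X2 - theta0) - (X1 - theta0) ** 2  # mean -1

print(overfit_is.mean(), relfit_os.mean())
```

The two printed averages sit close to +1 and −1, matching the expected over-fit and under-fit.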
Fully consistent with existing intuition...
• Fixed scheme... MSE loss: Clark and McCracken.
• The true model may yield worse forecasts than a parsimonious model. E.g. Clark and West (JoE, 200?).
• Parameters may be too small to be useful for forecasting... E.g. Diebold & Nason (JIE, 1990): “Very slight [...] nonlinearities might be truly present [...], while nevertheless yielding negligible ex ante forecast improvement.”
• Expectation: $E(2Z_1'Z_2 - Z_1'Z_1)$ is a term in the Giacomini–Rossi decomposition.
(Correct Specification + Regularity + IS size = OS size):
$$2\begin{pmatrix} \ell_1(\hat\theta_1) - \ell_1(\theta_0) \\ \ell_2(\hat\theta_1) - \ell_2(\theta_0) \end{pmatrix} \overset{A}{\sim} \begin{pmatrix} +Z_1'Z_1 \\ 2Z_1'Z_2 - Z_1'Z_1 \end{pmatrix} \quad \begin{matrix}\text{(In-Sample)}\\ \text{(Out-of-Sample)}\end{matrix},$$
where $Z_1, Z_2 \sim \text{iid } N_k(0, I_k)$, $k$ being the dimension of $\theta_0$.

❏ Misspecification... other criteria (GMM, loss functions, etc.): we get QMLE-type results.
❏ The more unnecessary parameters estimated... the worse the out-of-sample fit.
Theoretical Framework:

Truth: Let $\{Y_i\}$ be iid random variables in $\mathbb{R}^p$ with density $g(y)$.
Model: Let $\{f(y;\theta)\}_{\theta\in\Theta}$ be a family of densities and suppose that
$$g(y) = f(y;\theta_0) \quad \text{a.e., for some } \theta_0 \in \Theta.$$
We have
$$\underbrace{Y_1,\dots,Y_n}_{\text{In-Sample}},\ \underbrace{Y_{n+1},\dots,Y_{n+m}}_{\text{Out-of-Sample}},$$
and define the two log-likelihoods
$$\ell_1(\theta) \equiv \sum_{i=1}^{n} \log f(Y_i;\theta) \quad\text{and}\quad \ell_2(\theta) \equiv \sum_{i=n+1}^{n+m} \log f(Y_i;\theta).$$
Theorem 1: Given Regularity Conditions, define $\hat\theta_1 \equiv \arg\max_\theta \ell_1(\theta)$, and the in-sample and out-of-sample likelihood ratios,
$$LR_{is} \equiv 2[\ell_1(\hat\theta_1) - \ell_1(\theta_0)] \quad\text{and}\quad LR_{os} \equiv 2[\ell_2(\hat\theta_1) - \ell_2(\theta_0)].$$
Then
$$\begin{pmatrix} LR_{is} \\ LR_{os} \end{pmatrix} \overset{d}{\to} \begin{pmatrix} Z_1'Z_1 \\ 2\sqrt{\tfrac{m}{n}}\, Z_2'Z_1 - \tfrac{m}{n}\, Z_1'Z_1 \end{pmatrix}, \quad\text{as } n, m \to \infty,\ \tfrac{m}{n} \to \pi \in \mathbb{R},$$
where $Z_1, Z_2 \sim \text{iid } N_k(0, I_k)$.
Proof: Define Score and Hessian:
$$S_1(\theta) \equiv \frac{\partial \ell_1(\theta)}{\partial\theta} \quad\text{and}\quad H_1(\theta) \equiv \frac{\partial^2 \ell_1(\theta)}{\partial\theta\,\partial\theta'}.$$
Given regularity conditions the MLE, $\hat\theta_1$, is given by the FOC:
$$0 = S_1(\hat\theta_1) = S_1(\theta_0) + H_1(\tilde\theta)(\hat\theta_1 - \theta_0), \quad \tilde\theta \in [\theta_0, \hat\theta_1],$$
such that
$$\hat\theta_1 - \theta_0 = \big[-H_1(\tilde\theta)\big]^{-1} S_1(\theta_0).$$
The out-of-sample score: $S_2(\hat\theta_1) \neq 0$ almost surely. Taylor-expand the OS log-likelihood:
$$\ell_2(\hat\theta_1) = \ell_2(\theta_0) + S_2(\theta_0)'(\hat\theta_1 - \theta_0) + \tfrac{1}{2}(\hat\theta_1 - \theta_0)' H_2(\bar\theta)(\hat\theta_1 - \theta_0),$$
where $\bar\theta \in [\theta_0, \hat\theta_1]$, such that
$$LR_{os} = 2 S_{2,\theta_0}' \big[-H_{1,\tilde\theta}\big]^{-1} S_{1,\theta_0} - S_{1,\theta_0}' \big[-H_{1,\tilde\theta}\big]^{-1} \big[-H_{2,\bar\theta}\big] \big[-H_{1,\tilde\theta}\big]^{-1} S_{1,\theta_0}.$$
Under regularity conditions,
$$\hat\theta_1 \overset{p}{\to} \theta_0 \ \Rightarrow\ H_j(\hat\theta_1)\big[H_j(\theta_0)\big]^{-1} \overset{p}{\to} I_k, \quad j = 1, 2.$$
Define
$$s_i(\theta) \equiv \frac{\partial \log f(Y_i;\theta)}{\partial\theta} \quad\text{and}\quad h_i(\theta) \equiv \frac{\partial^2 \log f(Y_i;\theta)}{\partial\theta\,\partial\theta'}.$$
We note that the scores and the Hessians can be expressed as
$$S_1(\theta) = \sum_{i=1}^{n} s_i(\theta) \quad\text{and}\quad S_2(\theta) = \sum_{i=n+1}^{n+m} s_i(\theta),$$
and
$$H_1(\theta) = \sum_{i=1}^{n} h_i(\theta) \quad\text{and}\quad H_2(\theta) = \sum_{i=n+1}^{n+m} h_i(\theta).$$
Correct specification ensures that
$$\Sigma_s \equiv E[s_{i,\theta_0} s_{i,\theta_0}'] = -E[h_{i,\theta_0}]$$
(information matrix equality), and by regularity conditions
$$-\frac{H_{1,\tilde\theta}}{n} = -\frac{1}{n}\sum_{i=1}^{n} h_{i,\tilde\theta} \overset{p}{\to} -E[h_{i,\theta_0}] = \Sigma_s.$$
Thus if we define
$$Z_{1,n} = \Sigma_s^{-1/2} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} s_{i,\theta_0} \quad\text{and}\quad Z_{2,m} = \Sigma_s^{-1/2} \frac{1}{\sqrt{m}} \sum_{i=n+1}^{n+m} s_{i,\theta_0},$$
it follows that $Z_{1,n} \overset{d}{\to} Z_1$ as $n\to\infty$, and $Z_{2,m} \overset{d}{\to} Z_2$ as $m\to\infty$, where $(Z_1', Z_2')' \sim N_{2k}(0, I_{2k})$.
Applying this to the out-of-sample log-likelihood,
$$
\begin{aligned}
LR_{os} &= 2 S_{2,\theta_0}' \big[-H_{1,\tilde\theta}\big]^{-1} S_{1,\theta_0} - S_{1,\theta_0}' \big[-H_{1,\tilde\theta}\big]^{-1}\big[-H_{2,\bar\theta}\big]\big[-H_{1,\tilde\theta}\big]^{-1} S_{1,\theta_0} \\
&= 2\sqrt{\tfrac{m}{n}}\,\frac{S_{2,\theta_0}'}{\sqrt{m}} \Big[-\tfrac{H_{1,\tilde\theta}}{n}\Big]^{-1} \frac{S_{1,\theta_0}}{\sqrt{n}} - \tfrac{m}{n}\,\frac{S_{1,\theta_0}'}{\sqrt{n}} \Big[-\tfrac{H_{1,\tilde\theta}}{n}\Big]^{-1}\Big[-\tfrac{H_{2,\bar\theta}}{m}\Big]\Big[-\tfrac{H_{1,\tilde\theta}}{n}\Big]^{-1} \frac{S_{1,\theta_0}}{\sqrt{n}} \\
&= 2\sqrt{\tfrac{m}{n}}\, Z_{2,m}' Z_{1,n} - \tfrac{m}{n}\, Z_{1,n}' Z_{1,n} + o_p(1) \\
&= 2\sqrt{\tfrac{m}{n}}\, Z_2' Z_1 - \tfrac{m}{n}\, Z_1' Z_1 + o_p(1),
\end{aligned}
$$
as postulated.
Corollary 2: Let $m = n$. Then
$$E(Z_1'Z_1) = +k, \qquad \mathrm{var}(Z_1'Z_1) = 2k, \qquad E\big[(Z_1'Z_1)^2\big] = k^2 + 2k,$$
$$E(2Z_2'Z_1 - Z_1'Z_1) = -k, \qquad \mathrm{var}(2Z_2'Z_1 - Z_1'Z_1) = 6k, \qquad E\big[(2Z_2'Z_1 - Z_1'Z_1)^2\big] = k^2 + 6k.$$
Practice is typically different from $m = n$, e.g. the recursive scheme. The resulting distribution will be different (integrals/stochastic integrals of Brownian motions). Under the recursive scheme, the average under-fit ($m = h = 1$) is
$$\frac{1}{T-R}\sum_{n=R+1}^{T} \frac{kT}{n} \simeq k\,\frac{T}{T-R}\int_{R/T}^{1}\frac{1}{u}\,du \to \frac{k}{1-\lambda}\int_{\lambda}^{1}\frac{1}{u}\,du = k\,\frac{-\log(\lambda)}{1-\lambda} > k,$$
where $\lambda = \lim R/T$.
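The $m = n$ moments in Corollary 2 can be checked directly by simulating the limit variables; a minimal sketch (dimension $k$ and replication count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
k, R = 5, 200_000  # dimension of theta and Monte Carlo replications

Z1 = rng.standard_normal((R, k))
Z2 = rng.standard_normal((R, k))

lr_is = (Z1 ** 2).sum(axis=1)              # Z1'Z1 ~ chi2(k): mean k, variance 2k
lr_os = 2 * (Z2 * Z1).sum(axis=1) - lr_is  # 2 Z2'Z1 - Z1'Z1: mean -k, variance 6k

print(lr_is.mean(), lr_is.var())
print(lr_os.mean(), lr_os.var())
```

With $k = 5$ the printed moments come out near $(5, 10)$ and $(-5, 30)$, i.e. mean $\pm k$ with variances $2k$ and $6k$.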
Implications for Data Mining/Model Mining:
In-Sample: Estimated models always do better than the true model. “Mining” over several models will yield something much better than the truth.
Out-of-Sample: The true model has the edge... Still, an estimated model could spuriously do better than the true model. “Mining” over many models out-of-sample will presumably yield something better than the truth. How much better?
Simulation example:
$$Y_i = \varepsilon_i, \quad i = 1,\dots,n.$$
We estimate
$$Y_i = \beta_{(j)}' X_{(j),i} + u_i, \quad j = 1,\dots,\binom{K}{k},$$
where $X_{(j),i}$ consists of $k$ regressors from a pool of $K$ ($X'X = I_K$). Gaussian errors and fixed regressors:
$$LR_j^{(is)} = RSS(\beta_0) - RSS(\hat\beta_{(j)}) \sim \chi^2_k,$$
$$LR_j^{(os)} = RSS(\beta_0) - RSS(\tilde\beta_{(j)}) \sim 2Z_1'Z_2 - \chi^2_k,$$
where $\tilde\beta_{(j)}$ is obtained from an independent data set.
Simulation Design:
• $n = m = 50$ — sample size
• $K = 1,\dots,25$ — pool of regressors
• $k = 1, 2, 3$ — # included regressors
• $cr^{k}_{0.05}$ — 5% critical value of $\chi^2_k$

Rejection probabilities (not accounting for ‘mining’):
In-Sample: $P\big(\max_j LR_j^{(is)} > cr^{k}_{0.05}\big)$.
Out-of-Sample: $P\big(\max_j LR_j^{(os)} > cr^{k}_{0.05}\big)$.
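A small Monte Carlo in this design illustrates the mining gap. With $X'X = I_K$ and $Y_i = \varepsilon_i$ (known unit error variance), the subset LR statistics reduce to sums of per-regressor score terms, so the maximum over all $\binom{K}{k}$ subsets is simply the sum of the $k$ largest terms. A sketch under those assumptions (seed and replication count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
K, k, R = 25, 2, 20_000
cr = 5.991  # 5% critical value of chi2(2)

# With X'X = I_K and Y_i = eps_i, the per-regressor scores s_j = X_j'Y are
# iid N(0,1), and subset LR statistics are sums of per-regressor terms.
s_is = rng.standard_normal((R, K))  # in-sample scores
s_os = rng.standard_normal((R, K))  # scores from the independent (OS) sample

term_is = s_is ** 2                    # in-sample contribution of regressor j
term_os = 2 * s_os * s_is - s_os ** 2  # OS contribution (beta estimated OS)

# max over all (K choose k) subsets = sum of the k largest per-regressor terms
max_is = np.sort(term_is, axis=1)[:, -k:].sum(axis=1)
max_os = np.sort(term_os, axis=1)[:, -k:].sum(axis=1)

print((max_is > cr).mean())  # mined in-sample: rejects far more often than 5%
print((max_os > cr).mean())  # mined out-of-sample: rejects far less often
```

The in-sample mined rejection rate is far above the nominal 5%, while the out-of-sample mined rate stays close to it, which is the slide's point: mining is much harder out-of-sample.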
Some Interpretations:
Data mining is more damaging in-sample than out-of-sample.
Naturally, one should always account for the mining... à la White (EMA, 2000), Hansen (JBES, 2005), Hansen, Nason, Lunde (WP, 2005).
Such methods require knowledge of the full set of models... often unknown.
Beating a simple model out-of-sample is more impressive than beating it in-sample... in particular if one suspects unaccounted mining.
Qrinkage: Criteria-Based Shrinkage

Idea: Adjust the estimated parameters to offset the overfit... hoping to reduce the out-of-sample underfit. $\tilde\theta_1$ solves
$$\ell_1(\tilde\theta_1) = \ell_1(\hat\theta_1) - k/2.$$
The solution is typically not unique... In which direction do we ‘shrink’ $\hat\theta_1$? Introduce a gravity model, $f(y;\theta^\star)$, and shrink $\hat\theta_1$ in the direction of $\theta^\star$. The gravity model is motivated by...
❏ Economic theory... (similar to Schorfheide).
❏ Standard practice... (parsimony principle).
The gravity point is $\theta^\star$. A gravity set $\Theta^\star \subset \Theta$ is also a possibility.
Simple linear regression model: $Y = X\beta + \varepsilon$. 2×log-likelihood is
$$2\ell(\sigma_\varepsilon^2, \beta) \propto -\frac{n}{\sigma_\varepsilon^2}\left(S_{yy} - \beta'S_{xy} - S_{yx}\beta + \beta'S_{xx}\beta\right),$$
where
$$S_{yy} = Y'Y/n, \qquad S_{xy} = X'Y/n, \qquad S_{yx} = S_{xy}', \qquad S_{xx} = X'X/n.$$
Shrink the OLS estimator, $\hat\beta = S_{xx}^{-1}S_{xy}$, towards the gravity point $\beta^\star = 0$.
Diagonalize $S_{xx} = Q\Lambda Q'$ where $Q'Q = I_k$, and define $\gamma = Q'\beta$:
$$Y = \underbrace{XQ}_{Z}\,\underbrace{Q'\beta}_{\gamma} + \varepsilon = Z_1\gamma_1 + \dots + Z_k\gamma_k + \varepsilon.$$
We have $Z'Z = Q'X'XQ = nQ'Q\Lambda Q'Q = n\Lambda$ (i.e. orthogonal regressors), such that
$$\hat\gamma_i = \frac{Y'Z_i}{Z_i'Z_i} = \frac{\delta_i}{\lambda_i},$$
a ratio of ‘signal’ and ‘regressor variation’, where $\delta_i \equiv Y'Z_i/n$. Now
$$-2\ell(\sigma_\varepsilon^2, \gamma) \propto \frac{n}{\sigma_\varepsilon^2}\left(S_{yy} - 2\delta'\gamma + \gamma'\Lambda\gamma\right) = \frac{n}{\sigma_\varepsilon^2}\Big(S_{yy} - \sum_{i=1}^{k}\big(2\delta_i\gamma_i - \lambda_i\gamma_i^2\big)\Big).$$
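The rotation identities above are easy to verify numerically; a quick sketch (the dimensions and data below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 100, 4
X = rng.standard_normal((n, k)) * np.array([2.0, 1.0, 0.5, 0.25])  # unequal scales
Y = rng.standard_normal(n)

Sxx = X.T @ X / n
lam, Q = np.linalg.eigh(Sxx)  # Sxx = Q diag(lam) Q'
Z = X @ Q

# Z'Z = n*Lambda: the rotated regressors are orthogonal
assert np.allclose(Z.T @ Z, n * np.diag(lam))

# gamma_hat_i = Y'Z_i / (Z_i'Z_i) = delta_i / lambda_i, with delta_i = Y'Z_i / n
delta = Z.T @ Y / n
assert np.allclose(delta / lam, (Z.T @ Y) / np.diag(Z.T @ Z))
print("rotation identities hold")
```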
Shrink $\gamma_i$ towards 0 to reduce $2\ell(\cdot)$ by one unit (if possible). Let $\tilde\gamma_i = \kappa_i\hat\gamma_i$. For each $i$ we seek the solution to
$$\frac{n}{\sigma_\varepsilon^2}\left(-2\delta_i\kappa_i\hat\gamma_i + \lambda_i(\kappa_i\hat\gamma_i)^2\right) = \frac{n}{\sigma_\varepsilon^2}\left(\lambda_i\hat\gamma_i^2 - 2\delta_i\hat\gamma_i\right) + 1,$$
where $\kappa_i \in [0,1)$ (if equality cannot be achieved, we set $\kappa_i = 0$). Substitute $\delta_i = \hat\gamma_i\lambda_i$ and
$$t_i^2 = \frac{\hat\gamma_i^2}{\sigma_\varepsilon^2/(n\lambda_i)}$$
to get
$$-2\kappa_i t_i^2 + \kappa_i^2 t_i^2 = t_i^2 - 2t_i^2 + 1 \ \Leftrightarrow\ \kappa_i^2 - 2\kappa_i + 1 - 1/t_i^2 = 0,$$
which leads to the simple solution $\kappa_i = 1 \pm 1/|t_i|$; we seek
$$\kappa_i^\star = \max(0,\, 1 - 1/|t_i|).$$
Thus if $|t_i| \leq 1$ we shrink all the way to zero ($\kappa_i^\star = 0$), while $\kappa_i^\star \to 1$ as $|t_i| \to \infty$.
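Putting the rule together: the hypothetical helper `qrinkage_ols` below rotates to the eigenbasis of $S_{xx}$, forms the $t$-ratios, applies $\kappa_i^\star = \max(0, 1 - 1/|t_i|)$, and rotates back. The plug-in estimate of $\sigma_\varepsilon^2$ is my assumption (the slides treat it as given):

```python
import numpy as np

def qrinkage_ols(y, X, sigma2=None):
    """Sketch of the slide's criteria-based shrinkage for the linear model."""
    n, k = X.shape
    lam, Q = np.linalg.eigh(X.T @ X / n)   # Sxx = Q diag(lam) Q'
    Z = X @ Q                              # orthogonal regressors, Z'Z = n*diag(lam)
    gamma_hat = (Z.T @ y) / (n * lam)      # gamma_i = Y'Z_i / (Z_i'Z_i)
    if sigma2 is None:                     # plug-in error variance (an assumption)
        resid = y - Z @ gamma_hat
        sigma2 = resid @ resid / (n - k)
    t = gamma_hat / np.sqrt(sigma2 / (n * lam))
    kappa = np.maximum(0.0, 1.0 - 1.0 / np.abs(t))  # kappa_i* = max(0, 1 - 1/|t_i|)
    return Q @ (kappa * gamma_hat)         # back to the beta coordinates

# Small check: shrinkage never increases the coefficient norm
rng = np.random.default_rng(9)
X = rng.standard_normal((80, 6))
y = X[:, 0] * 0.5 + rng.standard_normal(80)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_q = qrinkage_ols(y, X)
print(np.linalg.norm(beta_q) <= np.linalg.norm(beta_ols))
```

Since $Q$ is orthogonal, $\|\tilde\beta\| = \|\tilde\gamma\| \leq \|\hat\gamma\| = \|\hat\beta\|$, so the check prints `True`.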
The Qrinkage estimator is:
$$\tilde\gamma_i = \begin{cases} 0, & |t_i| \leq 1, \\ \hat\gamma_i\,(1 - |t_i|^{-1}), & |t_i| > 1. \end{cases}$$
The adjustment term is (apart from sign)
$$\frac{\hat\gamma_i}{t_i} = \frac{\hat\gamma_i}{\hat\gamma_i\Big/\dfrac{\sigma_\varepsilon}{\sigma_{z_i}\sqrt{n}}} = \frac{\sigma_\varepsilon}{\sigma_{z_i}\sqrt{n}}.$$
• $n \uparrow$ More observations – less shrinkage.
• $\sigma_\varepsilon^2 \uparrow$ More noise – more shrinkage.
• $\lambda_i = \sigma_{z_i}^2 \uparrow$ More regressor variation – less shrinkage.
Relation to the Borr-estimator:
$$\tilde\gamma_i = \hat\gamma_i \times \max\Big(0,\, 1 - \frac{\sigma_{\hat\gamma_i}}{|\hat\gamma_i|}\Big) = \mathrm{sign}(\hat\gamma_i)\,\max\big(0,\, |\hat\gamma_i| - \sigma_{\hat\gamma_i}\big),$$
where $\sigma_{\hat\gamma_i}^2 = \sigma_\varepsilon^2/(n\lambda_i)$.

With the parameterization $\tilde\alpha_i = \tilde\gamma_i/\sigma_{\hat\gamma_i}$ and $\hat\alpha_i = \hat\gamma_i/\sigma_{\hat\gamma_i}$, we have
$$\tilde\alpha_i = \mathrm{sign}(\hat\alpha_i)\,\max\{0,\, |\hat\alpha_i| - 1\},$$
which is the Borr-estimator (soft-thresholding at one standard error).
Related Shrinkage Estimators:
❏ Pretest estimator... as advocated by Hendry and coauthors.
❏ Bayesian estimators, e.g. Laplace by Magnus (1999).
❏ Bagging by Breiman (1996), see also Inoue & Kilian (WP).
❏ Empirical Bayes, see Knox, Stock, Watson (WP, 2004).
❏ Imposing restrictions on $\hat\gamma_1,\dots,\hat\gamma_K$.
❍ Mallows’ $\max(0,\, 1 - 1/t_i^2)\,\hat\gamma_i$, see Obenchain (1975).
❍ Non-Neg. Garrote / Lasso: Breiman (1995) / Tibshirani (1996).
❏ Sliced Inverse Regression by Li (JASA, 1991).
❏ Stock & Watson (WP), Miller (2002, ‘Subset... Regression’).
An Argument in Favor of Shrinkage

Example: Suppose we have standardized variables, $\mathrm{var}(Y_i) = \mathrm{var}(X_{1,i}) = \dots = \mathrm{var}(X_{k,i}) = 1$. Consider the regression model
$$Y_i = \beta_1 X_{1,i} + \dots + \beta_k X_{k,i} + \sigma_\varepsilon\varepsilon_i, \quad i = 1,\dots,n,$$
where we normalize $\mathrm{var}(\varepsilon_i)$ to be unity. The parameter space for $\beta$ is not unrestricted, as
$$\beta_1^2 + \dots + \beta_k^2 \leq 1 \quad (= 1 - \sigma_\varepsilon^2).$$
Thus we know something about the regression coefficients without a prior.
How is OLS going to compare to Qrinkage? Consider the case where $n = 100$ and $k = 10$ or $20$. Let $0 \leq c < 1$ and allocate
$$\sigma_\varepsilon^2 = 1 - c^2 \quad\text{and}\quad \beta_1^2 + \dots + \beta_k^2 = c^2.$$
We set
$$\beta_i^2 = c^2\,\frac{\rho^{i-1}}{(1-\rho^k)/(1-\rho)}, \quad\text{for some } |\rho| \leq 1,$$
such that
$$\sum_{i=1}^{k}\frac{\rho^{i-1}}{(1-\rho^k)/(1-\rho)} = 1.$$
Qrinkage Applied to Principal Components: Consider the PC regression model
$$Y = \gamma_1 Z_1 + \dots + \gamma_K Z_K + \varepsilon,$$
where $Z_i'Z_j = 0$ for $i \neq j$, and
$$\hat\sigma_{z_i}^2 \equiv Z_i'Z_i/n = \widehat{\mathrm{var}}(Z_i), \qquad \hat\sigma_{z_1}^2 \geq \hat\sigma_{z_2}^2 \geq \dots \geq \hat\sigma_{z_K}^2.$$
Stock & Watson proposed to include the first $k \leq K$ PCs. At first sight, this seems a bit absurd.
❏ PCs are chosen without regard to the $Y_i$. One could imagine $Y = \gamma_K Z_K + \varepsilon$... why start with $Z_1, Z_2, \dots$?
❏ PCs are not invariant to rotation of the original $X$-regressors.
Can this approach be justified? Qrinkage yields
$$\hat Y_i = \tilde\gamma_1 Z_{1i} + \dots + \tilde\gamma_K Z_{Ki}, \quad\text{where}\quad \tilde\gamma_i = \max\Big(0,\, 1 - \frac{1}{|t_i|}\Big)\hat\gamma_i,$$
$$\hat\gamma_i = \frac{\hat\sigma_{y,z_i}}{\hat\sigma_{z_i}^2} \quad\text{and}\quad t_i = \hat\gamma_i\Big/\sqrt{\frac{\hat\sigma_\varepsilon^2}{n\,\hat\sigma_{z_i}^2}}.$$
So $\tilde\gamma_i = 0$ or $\tilde\gamma_i = \hat\gamma_i - \hat\gamma_i/|t_i|$. The ‘shrinkage term’ is
$$\frac{\hat\gamma_i}{t_i} = \frac{\hat\sigma_\varepsilon}{\sqrt{n}\,\hat\sigma_{z_i}}.$$
If $\hat\sigma_{z_i}^2$ is large (first principal components): $\tilde\gamma_i \approx \hat\gamma_i$.
If $\hat\sigma_{z_i}^2$ is small (last principal components): $|\tilde\gamma_i| \ll |\hat\gamma_i|$, possibly zero.
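A sketch of this effect in Python, using simulated data with a single common factor (the design, sample sizes, and plug-in $\hat\sigma_\varepsilon^2$ are my assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 200, 8

# One common factor drives all K regressors (a stylized diffusion-index setting)
f = rng.standard_normal(n)
X = np.outer(f, rng.uniform(0.5, 1.5, K)) + rng.standard_normal((n, K))
X -= X.mean(axis=0)
y = f + 0.5 * rng.standard_normal(n)

# Principal components, ordered by variance (largest first)
lam, Q = np.linalg.eigh(X.T @ X / n)
order = np.argsort(lam)[::-1]
lam, Q = lam[order], Q[:, order]
Z = X @ Q

gamma_hat = (Z.T @ y) / (n * lam)
resid = y - Z @ gamma_hat
sigma2 = resid @ resid / (n - K)
t = gamma_hat / np.sqrt(sigma2 / (n * lam))
kappa = np.maximum(0.0, 1.0 - 1.0 / np.abs(t))  # shrinkage factor per PC

print(kappa)  # near 1 for the high-variance PC, typically 0 for several others
```

The high-variance (first) PC is barely shrunk, while the low-variance PCs are shrunk hard, often exactly to zero: a data-driven, soft version of keeping only the first $k$ PCs.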
Hold that thought... Forecast Combination

Consider competing forecasts of $Y_{t+h}$: $\hat Y_{1,t},\dots,\hat Y_{M,t}$. We seek a good linear combination of these forecasts. MSE objective... seek the regression coefficients
$$Y_{t+h} = \omega_1\hat Y_{1,t} + \dots + \omega_M\hat Y_{M,t} + \mu + u_{t+h,t}.$$
Now rotate $\hat Y_1,\dots,\hat Y_M$ into their principal components,
$$Y_{t+h} = \gamma_1 Z_{1,t} + \dots + \gamma_M Z_{M,t} + \mu + u_{t+h,t}.$$
Conjecture: “First principal component almost proportional to the simple average”:
$$Z_{1,t} \propto \frac{1}{M}\hat Y_{1,t} + \dots + \frac{1}{M}\hat Y_{M,t} \quad\text{(approximately)}.$$
Because $\hat Y_{j,t} = \text{MeanForecast}_t + \text{noise}_{j,t}$, $\text{MeanForecast}_t \simeq E_t(Y_{t+h})$, and $\mathrm{var}(E_t(Y_{t+h})) \gg \mathrm{var}(\text{noise}_{j,t})$ (plausible... e.g. inflation). So the 1st PC $\simeq$ the simple average.
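The conjecture is easy to illustrate: simulate $M$ unbiased forecasts sharing a high-variance predictable component and compare the first PC with the simple average (all numbers below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, M = 300, 6

mu = rng.standard_normal(n)  # E_t(Y_{t+h}): the common, high-variance component
Yhat = mu[:, None] + 0.3 * rng.standard_normal((n, M))  # M noisy unbiased forecasts

lam, Q = np.linalg.eigh(Yhat.T @ Yhat / n)
z1 = Yhat @ Q[:, np.argmax(lam)]  # first principal component
avg = Yhat.mean(axis=1)           # simple average of the forecasts

# |correlation| close to 1: the 1st PC is nearly proportional to the average
print(abs(np.corrcoef(z1, avg)[0, 1]))
```

(The absolute value handles the arbitrary sign of the eigenvector.) When the common component dominates the idiosyncratic noise, the printed correlation is very close to 1.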
Qrinkage Approach: For a collection of unbiased forecasts $X = (\hat Y_{1,t},\dots,\hat Y_{M,t})$, decompose $X'X/n = Q\Lambda Q'$, and normalize the $Q$-vectors... $Q'\iota = \iota$. Every column vector of
$$(Z_1,\dots,Z_M) = Z = XQ$$
is a convex combination of $\hat Y_{1,t},\dots,\hat Y_{M,t}$. Shrink $\gamma_2,\dots,\gamma_K$ (e.g. $K = 5$), and set (for convexity)
$$\tilde\gamma_1 = 1 - \sum_{j>1}\tilde\gamma_j.$$
Ordered Qrinkage: Sometimes there is a natural ordering of the regressors, for instance the AR(p) model
$$Y_t = \varphi_1 Y_{t-1} + \dots + \varphi_p Y_{t-p} + \mu + \varepsilon_t, \qquad X_{t,\cdot} = (Y_{t-1} - \bar Y_{-1},\dots,Y_{t-p} - \bar Y_{-p}).$$
We can rotate the regressors, $Z = X(C')^{-1}$, using the Cholesky decomposition $X'X = CC'$. So $Z_1 = X_1$, $Z_2 = \big[I_n - X_1(X_1'X_1)^{-1}X_1'\big]X_2$, and
$$Z_{k+1} = \big[I_n - X_{1:k}(X_{1:k}'X_{1:k})^{-1}X_{1:k}'\big]X_{k+1},$$
here $Z_{k+1}$ is made orthogonal to $X_1,\dots,X_k$.
Estimate the parameters in the transformed model,
$$\hat\gamma_i = \frac{Y'Z_i}{Z_i'Z_i} \quad\text{and}\quad \tilde\gamma_i = \hat\gamma_i\,\max\Big(0,\, 1 - \frac{1}{|t_i|}\Big).$$
Invert the transformation:
$$\begin{pmatrix}\hat\varphi_1\\ \vdots\\ \hat\varphi_p\end{pmatrix} = (C')^{-1}\begin{pmatrix}\hat\gamma_1\\ \vdots\\ \hat\gamma_p\end{pmatrix} \quad\text{and}\quad \begin{pmatrix}\tilde\varphi_1\\ \vdots\\ \tilde\varphi_p\end{pmatrix} = (C')^{-1}\begin{pmatrix}\tilde\gamma_1\\ \vdots\\ \tilde\gamma_p\end{pmatrix}.$$
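A sketch for the AR(p) case, with an AR(1) truth inside an AR(4) fit so the later-orthogonalized lags should be shrunk hard (the data-generating numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
T, p = 400, 4

y = np.zeros(T)  # AR(1) truth, phi = 0.6
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()

Y = y[p:] - y[p:].mean()
X = np.column_stack([y[p - j:T - j] for j in range(1, p + 1)])  # lags 1..p
X -= X.mean(axis=0)
n = len(Y)

C = np.linalg.cholesky(X.T @ X)       # X'X = C C'
Z = np.linalg.solve(C, X.T).T         # Z = X (C')^{-1}, so Z'Z = I
gamma_hat = Z.T @ Y
sigma2 = (Y - Z @ gamma_hat) @ (Y - Z @ gamma_hat) / (n - p)
t_stat = gamma_hat / np.sqrt(sigma2)  # Z'Z = I, so se(gamma_i) = sigma
gamma_til = np.maximum(0.0, 1.0 - 1.0 / np.abs(t_stat)) * gamma_hat

phi_hat = np.linalg.solve(C.T, gamma_hat)  # phi = (C')^{-1} gamma
phi_til = np.linalg.solve(C.T, gamma_til)
print(phi_hat)
print(phi_til)
```

The lag-1 direction has a large $t$-statistic and survives nearly intact, while the superfluous higher-lag directions are typically zeroed out before rotating back to the $\varphi$ coordinates.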
Partial Qrinkage: We only want to shrink $\beta$ in
$$Y = W\xi + X\beta + \varepsilon.$$
Define the “residuals”
$$R_0 = (I - W(W'W)^{-1}W')Y \quad\text{and}\quad R_1 = (I - W(W'W)^{-1}W')X,$$
and estimate $\beta$ from $R_0 = R_1\beta + \tilde\varepsilon$. Then $\hat\beta = S_{11}^{-1}S_{10}$, where
$$S_{10} = \frac{R_1'R_0}{n}, \qquad S_{01} = S_{10}', \qquad S_{11} = \frac{R_1'R_1}{n} = Q\Lambda Q'.$$
We shrink
$$\hat\gamma = (Q'R_1'R_1Q)^{-1}Q'R_1'R_0 = Q'S_{11}^{-1}S_{10},$$
as usual. Finally, $\tilde\beta = Q\tilde\gamma$, and $\tilde\xi$ is found by regressing $Y - X\tilde\beta$ on $W$.
Instrumental Variables Regression: Weak/Many Instruments.
$$Y = X\beta + u \quad\text{and}\quad X = Z\pi + v, \qquad \mathrm{cov}(u,v) \neq 0.$$
2SLS: Estimate $\hat X = Z\hat\pi$ and regress $Y$ on $\hat X$:
$$\hat\beta_{2SLS} = \big(\hat X'\hat X\big)^{-1}\hat X'Y = \big(X'Z(Z'Z)^{-1}Z'X\big)^{-1}X'Z(Z'Z)^{-1}Z'Y.$$
Issue: The first-stage regression yields too good a fit. Consequently:
❏ Too much variation in $\hat X$.
❏ $\hat X'\hat X$ too large.
❏ $\widehat{\mathrm{var}}(\hat\beta_{2SLS}) = \hat\sigma_u^2\big(\hat X'\hat X\big)^{-1}$ too small... poor inference.
The noise-to-signal ratio increases as: (i) $\pi \downarrow 0$, (ii) $\dim(Z) \uparrow$.
Simple Qrinkage Fix: Shrink $\hat\pi = (Z'Z)^{-1}Z'X$ towards zero, and use $\tilde X = Z\tilde\pi$ in the second-stage regression. Writing the first stage in standardized, rotated form (with $v_i = X_i - \pi'Z_i$),
$$I = \mathrm{var}\big\{\Omega_v^{-1/2}v_i\big\} = \mathrm{var}\big\{\Omega_v^{-1/2}(X_i - \pi'Z_i)\big\} = \mathrm{var}\Big\{\underbrace{\Omega_v^{-1/2}X_i}_{\tilde X_i} - \underbrace{\Omega_v^{-1/2}\pi'Q}_{\Gamma'}\,\underbrace{Q'Z_i}_{\tilde Z_i}\Big\}.$$
Now $\hat\Gamma = (\tilde Z'\tilde Z)^{-1}\tilde Z'\tilde X$, with $\widehat{\mathrm{avar}}(\hat\Gamma - \Gamma) = (\tilde Z'\tilde Z)^{-1}\otimes\Omega_{\tilde v} = (n\Lambda)^{-1}\otimes I$. Shrink one-by-one using the individual t-stats, then rotate back: $\tilde\pi = Q\tilde\Gamma\,\Omega_v^{1/2}$.
Chamberlain & Imbens, “Hierarchical Bayes Models with Many Instrumental Variables”.
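A simplified sketch of the fix with a single endogenous regressor. Since the simulated instruments are (nearly) orthonormal, the first-stage coefficients are shrunk with their individual t-statistics directly, skipping the rotation/standardization step; all design numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n, L = 200, 20  # many instruments, only one of which is relevant

Z = rng.standard_normal((n, L))
pi = np.zeros(L); pi[0] = 0.5
v = rng.standard_normal(n)
u = 0.8 * v + 0.6 * rng.standard_normal(n)  # cov(u, v) != 0: endogeneity
x = Z @ pi + v
y = 1.0 * x + u  # true beta = 1

pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ x)
s2v = (x - Z @ pi_hat) @ (x - Z @ pi_hat) / (n - L)
se = np.sqrt(s2v * np.diag(np.linalg.inv(Z.T @ Z)))
pi_til = np.maximum(0.0, 1.0 - 1.0 / np.abs(pi_hat / se)) * pi_hat  # qrink pi

x_hat, x_til = Z @ pi_hat, Z @ pi_til
beta_2sls = (x_hat @ y) / (x_hat @ x)  # standard 2SLS
beta_til = (x_til @ y) / (x_til @ x)   # shrunken first stage, used as instrument
print((pi_til == 0).sum(), beta_2sls, beta_til)
```

Most of the irrelevant first-stage coefficients are zeroed. For the second stage I use $\tilde X$ as an instrument, $(\tilde X'Y)/(\tilde X'X)$, because that form is invariant to the overall shrinkage scale; the slide's "regress $Y$ on $\tilde X$" form differs by a scale factor, so this is a variation, not the slide's exact estimator.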
Summary
❏ Out-of-Sample LR distribution: generalizes to other criteria...
❏ Implications: Harder to mine OS than IS.
❏ OS theory motivates Qrinkage:
❍ Soft Information Criteria
❍ Diffusion Indices
❍ Forecast Combination
❏ Simple Fix of Weak/Many Instruments