Model Selection Criterion for Instrumental Variable Models

Naoya Sueishi∗
Kobe University†

This Version: April 2012

Abstract

This paper proposes a model selection criterion for instrumental variable (IV) models. The criterion is similar to Mallows' Cp. We address the issue of selecting a best approximating model under a certain loss function rather than identifying a correct model. Our main result is an extension of Li (1987) to the case of two-stage least squares estimation. We show that our criterion is asymptotically optimal in the sense of selecting the pair of model and set of instruments that achieves the lowest loss. Moreover, we consider a method for selecting the smoothing parameters of the sieve IV estimator of Newey and Powell (2003) and Blundell, Chen, and Kristensen (2007). The results of a Monte Carlo study show that our simple criterion works satisfactorily.

Keywords: Instrumental variables; Mallows criterion; model selection; nonparametric IV models; optimality.
JEL Classification: C14, C26, C52.



∗ This paper is based on my dissertation project at the University of Wisconsin. I am grateful to Bruce Hansen and Jack Porter for their guidance. I also thank Qingfeng Liu, Ryo Okui, Jun Shao, Christopher Taber and seminar participants at Kyoto University for their comments.
† Graduate School of Economics, 2-1 Rokkodai-cho, Nada-ku, Kobe, 657-8501, Japan. Email: [email protected]


1 Introduction

This paper develops a simple model selection method for instrumental variable (IV) models on the basis of the two-stage least squares (TSLS) estimator. We address the issue of selecting a finite dimensional approximating model when the true model is infinite dimensional. We first introduce a loss function to evaluate the goodness of IV models. Then, we propose a selection criterion that is similar to Mallows' Cp (Mallows (1973)). We show that our criterion is asymptotically optimal in the sense that the selected pair of model and set of instruments achieves the minimum loss among all candidates. Our result parallels that of Li (1987), who shows the asymptotic optimality of the Cp criterion in the regression model.

Numerous model selection methods for regression models have been advocated in the literature. These methods include the Akaike information criterion (AIC; Akaike (1973)), the Bayesian information criterion (BIC; Schwarz (1978)), Mallows' Cp, cross-validation (CV; Stone (1974), Shao (1993)), generalized CV (GCV; Craven and Wahba (1979)), and the final prediction error (Akaike (1970), Shibata (1984)). On the other hand, there have been few studies on model selection for IV models. Many studies deal with the instrument selection problem under the assumption that the model (structural equation) is correctly specified (e.g., Donald and Newey (2001), Hall and Peixe (2003), and Donald, Imbens, and Newey (2009)). Existing model selection methods for IV models include Pesaran and Smith (1994) and Andrews and Lu (2001). Pesaran and Smith (1994) develop an R²-type criterion for IV models. Andrews and Lu (2001) propose model and moment (instrument) selection criteria in the context of GMM estimation. Andrews and Lu (2001) extend the result of Andrews (1999) and show that their BIC-like criterion is consistent in the sense that it asymptotically selects a correct model and all correct moments. A similar result is obtained by Hong, Preston, and Shum (2003) in the case of generalized empirical likelihood estimation.

Although our criterion is similar to that of Andrews and Lu (2001), our goal is different from theirs. We propose a model selection method that is applicable when the true model is infinite dimensional and/or nonparametric. We view finite dimensional models as approximations to the true data generating process (DGP) and deal with the issue of selecting a best approximating model rather than identifying the correct model.

As an extension of our method, we also consider a method for selecting the smoothing parameters of the sieve IV estimator of Newey and Powell (2003) and Blundell, Chen, and Kristensen (2007). Newey and Powell (2003) propose a nonparametric analog of the TSLS estimator. An important problem in nonparametric IV estimation is that the estimator is highly variable because of the so-called ill-posed inverse problem (Carrasco, Florens, and Renault (2007)). Newey and Powell (2003) and Blundell, Chen, and Kristensen (2007) utilize a regularization technique to stabilize the estimator, which requires researchers to determine some smoothing parameters. Although selection methods for the regularization parameter have been considered in the nonparametric regression literature (Craven and Wahba (1979), Li (1986)), there is no optimality result for the case of endogenous regressors. We propose a simple criterion for selecting the smoothing parameters simultaneously and show that it is asymptotically optimal under a certain loss.
The results of the Monte Carlo study show that our simple method works satisfactorily.

There is a growing literature on nonparametric IV estimation. Some contributions include Hall and Horowitz (2005), Darolles, Fan, Florens, and Renault (2011), Chen and Pouzo (2012), and Gagliardini and Scaillet (2012) in addition to Newey and Powell (2003) and Blundell, Chen, and Kristensen (2007). Although all of these estimation methods involve selecting smoothing parameters, theoretically justified selection criteria have not been fully investigated. Recently, Horowitz (2011) proposes a selection method for the sieve estimator of Horowitz (2012) and shows that the regularized estimator converges in probability at a rate that is at least as fast as the asymptotically optimal rate multiplied by (log n)^{1/2}.

We now review the literature on the asymptotic optimality of model selection. The study of asymptotic optimality was originated by Shibata (1980) in the context of order selection for linear time series models. Shibata (1981) shows the optimality of the AIC procedure in the regression model with Gaussian errors. Li (1987) obtains the optimality of the Cp criterion, CV, and GCV for homoskedastic regression models. Andrews (1991) extends the result of Li (1987) to heteroskedastic errors. Shao (1997) provides a nice overview. Hansen (2007) proposes a Mallows criterion for least squares averaging estimators and shows its optimality. To the best of my knowledge, however, there is no optimality theory for IV models.

The remainder of this paper is organized as follows. In Section 2 we introduce a loss function to evaluate IV models and describe our selection criterion. Section 3 discusses the optimality of the criterion. In Section 4 we discuss how to select the smoothing parameters of the sieve IV estimator. Section 5 makes some remarks on implementation. Section 6 presents the results of the Monte Carlo study. Section 7 concludes. Proofs are presented in the Appendix.

2 Approximating models, loss function and selection criterion

2.1 Approximating models

Let {(y_i, x_i, z_i), i = 1, …, n} be a random sample. We consider the infinite dimensional IV model:
$$y_i = \mu(x_i) + e_i, \qquad (2.1)$$
$$\mu(x_i) = \sum_{j=1}^{\infty} \beta_j x_{ji}, \qquad (2.2)$$
$$E[e_i \mid z_i] = 0, \qquad (2.3)$$
$$E[e_i^2 \mid z_i] = \sigma^2, \qquad (2.4)$$
where y_i is a scalar, x_i = (x_{1i}, x_{2i}, …) is an infinite dimensional vector of explanatory variables, and z_i = (z_{1i}, z_{2i}, …) is an infinite dimensional vector of exogenous variables. In vector notation, we write y = µ + e, where y = (y_1, …, y_n)′, µ = (µ(x_1), …, µ(x_n))′, and e = (e_1, …, e_n)′. The model is similar to that of Hansen (2007), but we allow the possibility that a subvector of x_i is correlated with e_i.


The error term e_i is assumed to be homoskedastic in the sense that Var[e_i | z_i] is constant, though Var[e_i | x_i] may depend on x_i. Because the elements of x_i may be terms in a series expansion, the model includes nonparametric models. However, the case where the nonparametric function depends on the endogenous variables will be discussed separately in Section 4.

We consider a sequence of approximating models and sets of instruments to estimate µ(·). Let h be the index of a pair of model and set of instruments, and let H_n be the set of all pairs being considered. The subscript n indicates that the set may depend on the sample size. The pair h ∈ H_n is represented as
$$y_i = x_{hi}'\beta_h + u_{hi}, \qquad (2.5)$$
$$u_{hi} = b_{hi} + e_i, \qquad (2.6)$$
$$E[z_{hi} u_{hi}] = 0, \qquad (2.7)$$

where $x_{hi} = (x_{j_1 i}, \ldots, x_{j_{p_h} i})'$ is a p_h × 1 vector of explanatory variables and $\beta_h = (\beta_{j_1}, \ldots, \beta_{j_{p_h}})'$ is a p_h × 1 parameter vector. The term $b_{hi} = \mu(x_i) - x_{hi}'\beta_h$ is the approximation error. The vector $z_{hi} = (z_{j_1 i}, \ldots, z_{j_{l_h} i})'$ is an l_h × 1 vector of instruments. We assume l_h > p_h for h ∈ H_n. The vectors x_{hi} and z_{hi} may or may not be nested for different h. The problem we consider is how to select h in a data-driven way.

Notice that the set of instruments is valid only for the "true" error e_i, and hence the orthogonality condition (2.7) is not satisfied for the finite dimensional models. Thus, all finite dimensional models are misspecified. We exclude the case where the infinite dimensional model is misspecified, that is, the case in which there is no function µ(·) such that E[y_i − µ(x_i) | z_i] = 0 almost surely. For detecting such a misspecification, we can use, for instance, the specification test of Horowitz (2012) for the nonparametric IV problem.

If the finite dimensional model is correctly specified, that is, if b_{hi} = 0, then β_h can be estimated by TSLS:
$$\hat\beta_h = (X_h' Z_h (Z_h' Z_h)^{-1} Z_h' X_h)^{-1} X_h' Z_h (Z_h' Z_h)^{-1} Z_h' y, \qquad (2.8)$$

where $X_h = (x_{h1}, \ldots, x_{hn})'$ and $Z_h = (z_{h1}, \ldots, z_{hn})'$. However, if the true DGP is infinite dimensional, all finite dimensional models are only approximations. We can formally estimate β_h by TSLS, but estimates of individual coefficients are not of particular concern. Although the least squares estimator has an interpretation as a projection even if the conditional expectation is misspecified, the TSLS estimator does not have such an interpretation. Thus, once we view models as approximations, we need to answer the following questions: (i) What does TSLS estimate? and (ii) How do we evaluate the goodness of IV models?

To answer the first question, we need to characterize the function µ(·). The true model implies that µ(·) is characterized as the solution to the following MSE minimization problem:
$$\min_{\{\gamma_1, \gamma_2, \ldots\}} E\left[\left\{E[y_i \mid z_i] - \sum_{j=1}^{\infty} \gamma_j E[x_{ji} \mid z_i]\right\}^2\right]. \qquad (2.9)$$
Also, (2.2) implies that
$$E[y_i \mid z_i] \simeq \sum_{k=1}^{p_h} \beta_{j_k} E[x_{j_k i} \mid z_i].$$
This form suggests that we can estimate β_h by regressing E[y_i | z_i] on E[x_{hi} | z_i]. Although the conditional expectations are unknown, they can be estimated by regressing y_i and x_{hi} on z_{hi}. Let $P(h) = Z_h(Z_h'Z_h)^{-1}Z_h'$ be a projection matrix. Then the estimator of β_h is obtained by solving the empirical counterpart of (2.9):
$$\hat\beta_h = \arg\min_{\gamma_h} \frac{1}{n}\|P(h)y - P(h)X_h\gamma_h\|^2, \qquad (2.10)$$
where ∥·∥ is the Euclidean norm. The estimator (2.10) is the same as (2.8). Therefore, even though the TSLS estimator µ̂(h) ≡ X_h β̂_h does not have a best approximation interpretation, P(h)µ̂(h) can be interpreted as the best approximation of P(h)y in terms of the sample L² norm. The two-step estimator (2.10) was suggested by Newey and Powell (2003) in the context of nonparametric IV models.
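For concreteness, the following minimal sketch (Python with NumPy) computes the two-step estimator (2.10) and checks numerically that it coincides with the TSLS formula (2.8). The arrays y, Xh, Zh are assumed to be given, with more instrument columns than regressor columns; the simulated data at the bottom are purely illustrative and are not part of the paper's design.

```python
import numpy as np

def tsls(y, Xh, Zh):
    """Two-step estimator (2.10): regress P(h)y on P(h)X_h,
    where P(h) = Z_h (Z_h'Z_h)^{-1} Z_h' projects onto the instrument space."""
    P = Zh @ np.linalg.solve(Zh.T @ Zh, Zh.T)            # projection matrix P(h)
    PX, Py = P @ Xh, P @ y
    beta_hat = np.linalg.solve(PX.T @ PX, PX.T @ Py)     # least squares of P(h)y on P(h)X_h
    return beta_hat, P

# Illustration with simulated data (dimensions are arbitrary choices for the example).
rng = np.random.default_rng(0)
n, ph, lh = 200, 3, 6
Zh = rng.normal(size=(n, lh))
Xh = Zh[:, :ph] + rng.normal(size=(n, ph))
y = Xh @ np.ones(ph) + rng.normal(size=n)

beta_hat, P = tsls(y, Xh, Zh)
# Equivalence with the usual TSLS formula (2.8).
beta_tsls = np.linalg.solve(Xh.T @ P @ Xh, Xh.T @ P @ y)
assert np.allclose(beta_hat, beta_tsls)
```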

2.2 Loss function and selection criterion

To evaluate the performance of a model selection criterion, we need a measure of the goodness of models. A commonly used loss function for regression is the squared error loss:
$$\tilde L(h) = \frac{1}{n}\|\mu - \hat\mu(h)\|^2. \qquad (2.11)$$

In the case of the IV model, however, it is difficult to find a selection rule that is optimal with respect to (2.11). This is because, unlike OLS, TSLS does not minimize the sum of squared residuals, so the objective function of TSLS does not correspond to the squared error loss. If the conditional distribution of x_i given z_i is known, then the so-called predictive mean squared error
$$\mathrm{PMSE}_n(h) = \frac{1}{n}\sum_{i=1}^{n}\left\{E[\mu(x_i) \mid z_i] - E[x_{hi}'\hat\beta_h \mid z_i]\right\}^2$$
is sometimes used as a loss function in the statistical literature to address the deconvolution problem (e.g., Nychka, Wahba, Goldfarb, and Pugh (1984), O'Sullivan (1986)). In our case, the conditional distribution is not specified. A possible loss function for evaluating IV models is then
$$L_n(h) = \frac{1}{n}\|P(h)(\mu - \hat\mu(h))\|^2. \qquad (2.12)$$

That is, we evaluate the goodness of the model by the projection of µ̂(h) onto the space spanned by the instruments Z_h rather than by µ̂(h) itself. As we will see later, (2.12) is mathematically convenient to work with. A possible criticism is that (2.12) is not of intrinsic interest when µ itself is of interest and (2.11) is the relevant loss function. However, the best model is sometimes insensitive to the choice of the loss function, and a good selection rule with respect to (2.12) might give good guidance even when models are evaluated by other loss functions. Note that if x_i is exogenous and x_{hi} is a subset of z_{hi}, then (2.11) and (2.12) differ little.


We can rewrite (2.12) as
$$L_n(h) = \frac{1}{n}b(h)'(P(h) - \tilde W(h))b(h) + \frac{1}{n}e'\tilde W(h)e, \qquad (2.13)$$
where $b(h) = \mu - X_h\beta_h$ and $\tilde W(h) = P(h)X_h(X_h'P(h)X_h)^{-1}X_h'P(h)$ is a projection matrix. The first term of (2.13) can be interpreted as the squared bias term, while the second term can be interpreted as the variance term. For fixed l_h, the bias term is decreasing in p_h while the variance term is increasing in p_h. Hence, there is a usual trade-off between the bias and the variance.

We propose the following model selection criterion:
$$C_n(h) = \frac{1}{n}\|P(h)(y - \hat\mu(h))\|^2 - \frac{\sigma^2}{n}(l_h - 2p_h). \qquad (2.14)$$

The criterion depends on the unknown parameter σ², which can be replaced with a consistent estimator. For now, we suppose that σ² is known.

The idea behind (2.14) is similar to that of Mallows' Cp: we consider an unbiased estimator of the risk. By expanding $n^{-1}\|P(h)(y - \hat\mu(h))\|^2$, we obtain
$$\frac{1}{n}\|P(h)(y - \hat\mu(h))\|^2 = L_n(h) + \frac{1}{n}e'P(h)e - \frac{2}{n}e'\tilde W(h)e + \frac{2}{n}e'(P(h) - \tilde W(h))b(h).$$
Let X = {x_1, …, x_n} and Z = {z_1, …, z_n}. If x_i is exogenous in the sense that E[e_i | x_i, z_i] = 0 and if E[e_i^2 | x_i, z_i] = σ², then we have $E[e'P(h)e \mid X, Z] = l_h\sigma^2$, $E[e'\tilde W(h)e \mid X, Z] = p_h\sigma^2$, and $E[e'(P(h) - \tilde W(h))b(h) \mid X, Z] = 0$. Thus, C_n(h) is an unbiased estimator of E[L_n(h)]. Unfortunately, this is not true when x_i is endogenous. However, even when x_i is endogenous, C_n(h) is approximately an unbiased estimator of E[L_n(h)] under certain conditions. The idea is the following. The problem arises because $\tilde W(h)$ and b(h) depend on X. However, $\tilde W(h)$ depends on X only through the form P(h)X_h. Since P(h) is a projection matrix, P(h)X_h well approximates P(h)E[X_h | Z] for large n. Similarly, P(h)b(h) approximates P(h)E[b(h) | Z]. Therefore, we have
$$E[e'\tilde W(h)e \mid Z] \approx E\big[e'P(h)E[X_h|Z](E[X_h|Z]'P(h)E[X_h|Z])^{-1}E[X_h|Z]'P(h)e \,\big|\, Z\big] = p_h\sigma^2.$$
Moreover, $E[e'(P(h) - \tilde W(h))b(h) \mid Z] \approx 0$. Hence, we have $E[C_n(h) \mid Z] \approx E[L_n(h) \mid Z]$ for sufficiently large n.

Another justification for (2.14) is asymptotic optimality. Let $\hat h = \arg\min_{h \in H_n} C_n(h)$. In the next section, we show that
$$\frac{L_n(\hat h)}{\inf_{h \in H_n} L_n(h)} \xrightarrow{p} 1 \qquad (2.15)$$
as n → ∞. This means that our criterion asymptotically selects the pair of model and set of instruments that attains the minimum loss among all candidates. This result is parallel to the result that the Cp criterion is asymptotically optimal with respect to the squared error loss in the regression model.

Finally, we note the relationship between our criterion and the criteria proposed by Andrews and Lu (2001). Notice that if the error term is homoskedastic, then the efficient GMM estimator is equivalent to the TSLS estimator. Thus, in our notation, their AIC- and BIC-like criteria can be written as


$$\mathrm{MMSC\text{-}AIC} = \frac{1}{n}\|P(h)(y - \hat\mu(h))\|^2 - \frac{2\sigma^2}{n}(l_h - p_h),$$
$$\mathrm{MMSC\text{-}BIC} = \frac{1}{n}\|P(h)(y - \hat\mu(h))\|^2 - \frac{\sigma^2}{n}(l_h - p_h)\log n.$$
The only difference between our criterion and their AIC-like criterion is whether l_h − 2p_h or 2(l_h − p_h) is used in the penalty term. However, the penalty term of MMSC-AIC is specified in an ad hoc manner and does not have a theoretical justification.
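As an illustration of the selection step, a minimal sketch (assuming each candidate pair is given as lists of regressor and instrument column indices, and that σ² is known or has been replaced by a consistent estimate) evaluates C_n(h) of (2.14) for every candidate and returns the minimizer:

```python
import numpy as np

def mallows_iv(y, X_all, Z_all, candidates, sigma2):
    """Evaluate C_n(h) of (2.14) for each candidate pair h = (x_cols, z_cols)
    and return the pair with the smallest criterion value."""
    n = len(y)
    best, best_val = None, np.inf
    for x_cols, z_cols in candidates:
        Xh, Zh = X_all[:, x_cols], Z_all[:, z_cols]
        ph, lh = Xh.shape[1], Zh.shape[1]
        P = Zh @ np.linalg.solve(Zh.T @ Zh, Zh.T)             # P(h)
        beta = np.linalg.solve(Xh.T @ P @ Xh, Xh.T @ P @ y)   # TSLS estimate
        resid = y - Xh @ beta
        # ||P(h)(y - mu_hat)||^2 = resid' P(h) resid since P(h) is a projection.
        crit = (resid @ P @ resid) / n - sigma2 * (lh - 2 * ph) / n
        if crit < best_val:
            best, best_val = (x_cols, z_cols), crit
    return best, best_val
```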

3 Regularity conditions and optimality

Our proof of (2.15) is an application of Li (1987). Let $x_{hi} = f_{hi} + \eta_{hi} \equiv E[x_{hi} \mid z_i] + \eta_{hi}$. Then, E[η_{hi} | z_i] = 0 by construction. Let $F_h = (f_{h1}, \ldots, f_{hn})'$ and $W(h) = P(h)F_h(F_h'P(h)F_h)^{-1}F_h'P(h)$. Also, let g(h) = E[b(h) | Z] and ν_h = b(h) − g(h). Define
$$R_n(h) = \frac{1}{n}g(h)'(P(h) - W(h))g(h) + \frac{1}{n}E[e'W(h)e \mid Z]. \qquad (3.1)$$
For large n, we have P(h)X_h ≈ P(h)F_h and P(h)b(h) ≈ P(h)g(h). Thus, (3.1) can be interpreted as an approximation of E[L_n(h) | Z].

To obtain the desired result, we impose some assumptions. In what follows, C denotes a generic positive constant which may differ across uses. Let λ_min(A) denote the minimum eigenvalue of a matrix A. Also, let ν_{hi} be the i-th element of ν_h.

Assumption 3.1 For some natural number m, the following conditions hold almost surely.
(i) $E[e_i^{4m} \mid z_i] < C$;
(ii) $\sum_{h \in H_n} (nR_n(h))^{-m} \to 0$;
(iii) $\sup_{h \in H_n} E[\nu_{hi}^{4m} \mid z_i] < C$;
(iv) $\sum_{h \in H_n} p_h E[\nu_{hi}^2 \mid z_i] < C$.

Assumption 3.2 $\lim_{n \to \infty} \lambda_{\min}(n^{-1}F_h'P(h)F_h) > C$ uniformly in h ∈ H_n.

Assumption 3.3 (i) l_h/p_h < C for all h ∈ H_n; (ii) $\max_{h \in H_n} p_h = o(n^{1/5})$.

Now we discuss our assumptions. Assumptions 3.1 (i) and (ii) are similar to Assumptions (A.2) and (A.3) of Li (1987). Assumption 3.1 (iv) requires that $E[\{\mu(x_i) - \sum_{j=1}^{k}\beta_j x_{ji}\}^2 \mid z_i]$ converges to zero at a certain rate as k → ∞. The cardinality of H_n may be bounded or may diverge to infinity as n → ∞. If the cardinality of H_n is bounded, then a necessary and sufficient condition for Assumption 3.1 (ii) is
$$\inf_{h \in H_n} nR_n(h) \to \infty. \qquad (3.2)$$
This condition is not satisfied if the first term on the right-hand side of (3.1) is equal to zero for some h ∈ H_n. Thus, it must be the case that there exists no finite dimensional correct model h_0 ∈ H_n such that b(h_0) = 0. Assumption 3.2 excludes the case where the instruments are weak; the matrix could be nearly singular when E[x_{ji} | z_i] is not variable enough. Assumption 3.3 restricts the complexity of the models. This assumption is required so that $b(h)'(P(h) - \tilde W(h))b(h)$ and $e'\tilde W(h)e$ are well approximated by $g(h)'(P(h) - W(h))g(h)$ and $e'W(h)e$, respectively, uniformly over h ∈ H_n.

The following theorem establishes the optimality of our selection criterion.

Theorem 3.1 Suppose that Assumptions 3.1-3.3 hold. Then the criterion (2.14) is asymptotically optimal, i.e., (2.15) holds.

4 Nonparametric IV estimation and ill-posed inverse problem

In this section we consider the nonparametric IV estimation problem. We extend the result of the previous section and propose a method for selecting the smoothing parameters of the sieve IV estimator of Newey and Powell (2003) and Blundell, Chen, and Kristensen (2007). The model is
$$y_i = \mu(x_i) + e_i, \qquad (4.1)$$
$$E[e_i \mid z_i] = 0, \qquad (4.2)$$
$$E[e_i^2 \mid z_i] = \sigma^2. \qquad (4.3)$$

Now, x_i and z_i denote finite dimensional random vectors. The functional form of µ(·) is not specified. We assume that the function µ(·) is identified; Newey and Powell (2003) and Chen, Chernozhukov, Lee, and Newey (2011) provide some results on identification. The difficulty of estimating the structural function µ(·) stems from the fact that consistency of the reduced form estimator does not necessarily imply consistency of the structural form estimator. Notice that (4.1) and (4.2) yield the integral equation
$$E[y \mid z] = \int \mu(x)\,dF(x \mid z),$$
where F denotes the conditional distribution of x_i given z_i. The function µ(·) is characterized as the solution to this Fredholm integral equation of the first kind, which causes the ill-posed inverse problem (Kress (1999)).

Let {ψ_1(x), ψ_2(x), …} and {q_1(z), q_2(z), …} be sequences of basis functions of x_i and z_i. Let $x_{hi} = \psi^h(x_i) = (\psi_1(x_i), \ldots, \psi_{p_h}(x_i))'$ be a p_h × 1 vector. Also, let $z_{hi} = q^h(z_i) = (q_1(z_i), \ldots, q_{l_h}(z_i))'$ be an l_h × 1 vector. Suppose that there exists β_h such that
$$E\Big[E[\mu(x_i) - \psi^h(x_i)'\beta_h \mid z_i]^2\Big] \to 0 \qquad (4.4)$$
as p_h → ∞. Blundell, Chen, and Kristensen (2007) recommend solving the following minimization problem:
$$\min_{\gamma_h} \|P(h)(y - X_h\gamma_h)\|^2 + \lambda\gamma_h'\Omega_h\gamma_h, \qquad (4.5)$$

where Ω_h = Ω_{0h} + Ω_{rh} with
$$\Omega_{0h} = \frac{1}{n}\sum_{i=1}^{n}\psi^h(x_i)\psi^h(x_i)', \qquad (4.6)$$
$$\Omega_{rh} = \int \left\{\frac{d^r\psi^h(x)}{dx^r}\right\}\left\{\frac{d^r\psi^h(x)}{dx^r}\right\}' dx \qquad (4.7)$$
for some integer r and λ ≥ 0. The second term of (4.5) works as a penalty for a large Sobolev norm of the estimator. The unconstrained minimization problem (4.5) is equivalent to a constrained minimization problem that solves $\min_{\gamma_h}\|P(h)(y - X_h\gamma_h)\|^2$ subject to $\gamma_h'\Omega_h\gamma_h \le B$ for some constant B. Newey and Powell (2003) suggest solving the constrained minimization problem. Problem (4.5) is also a special case of the penalized sieve minimum distance estimator of Chen and Pouzo (2012).

The minimization problem (4.5) has the analytical solution
$$\hat\beta_{h,\lambda} \equiv (X_h'P(h)X_h + \lambda\Omega_h)^{-1}X_h'P(h)y.$$
The estimator of µ(x) is $\hat\mu_{\lambda h}(x) = \psi^h(x)'\hat\beta_{h,\lambda}$. In an actual implementation, we need to specify the smoothing parameters λ and (p_h, l_h) as well as the basis functions (ψ(x), q(z)). In particular, the choice of λ and p_h has a significant effect on the performance of the estimator. Blundell, Chen, and Kristensen (2007) show a certain interdependence between λ and p_h by simulation; for instance, a large p_h should be paired with a slightly larger λ to control the wiggly behavior of the estimator due to over-fitting. However, there is no existing criterion for selecting λ and p_h simultaneously in a data-driven way.

Again, it is difficult to find an optimal selection rule with respect to the squared error loss
$$\tilde L_n(h,\lambda) = \frac{1}{n}\|\mu - \hat\mu_\lambda(h)\|^2, \qquad (4.8)$$

where $\hat\mu_\lambda(h) = X_h\hat\beta_{h,\lambda}$. The reason is that the convergence rate of the estimator with respect to (4.8) depends not only on the smoothing parameters but also on the unknown degree of ill-posedness of the inverse problem (see, e.g., Blundell, Chen, and Kristensen (2007) and Chen and Pouzo (2012)). The loss function we use is
$$L_n(h,\lambda) = \frac{1}{n}\|P(h)(\mu - \hat\mu_\lambda(h))\|^2, \qquad (4.9)$$
which is the same as (2.12) when λ = 0. We will discuss the relationship between (4.8) and (4.9) in Section 5. Also, in the Monte Carlo study, we investigate whether the smoothing parameters selected on the basis of (4.9) provide a good estimator with respect to (4.8).

By a simple manipulation, we obtain
$$\frac{1}{n}\|P(h)(y - \hat\mu_\lambda(h))\|^2 = L_n(h,\lambda) + \frac{1}{n}e'P(h)e - \frac{2}{n}e'\hat W_\lambda(h)e + \frac{2}{n}e'(P(h) - \hat W_\lambda(h))\mu,$$
where $\hat W_\lambda(h) = P(h)X_h(X_h'P(h)X_h + \lambda\Omega_h)^{-1}X_h'P(h)$. A similar reasoning as in Section 2 implies the following selection criterion:
$$C_n(h,\lambda) = \frac{1}{n}\|P(h)(y - \hat\mu_\lambda(h))\|^2 - \frac{\sigma^2}{n}\big(\mathrm{tr}(P(h)) - 2\,\mathrm{tr}(\hat W_\lambda(h))\big). \qquad (4.10)$$
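A sketch of the penalized estimator and the criterion (4.10), under the simplifying assumptions that σ² is known and that the sieve matrix Xh, the instrument matrix Zh, and the penalty matrix Omega have already been constructed:

```python
import numpy as np

def penalized_sieve_iv(y, Xh, Zh, Omega, lam, sigma2):
    """Penalized sieve IV estimator (4.5) and the selection criterion (4.10)."""
    n = len(y)
    P = Zh @ np.linalg.solve(Zh.T @ Zh, Zh.T)                  # P(h)
    A = Xh.T @ P @ Xh + lam * Omega
    beta = np.linalg.solve(A, Xh.T @ P @ y)                    # beta_hat_{h,lambda}
    mu_hat = Xh @ beta
    W = P @ Xh @ np.linalg.solve(A, Xh.T @ P)                  # W_hat_lambda(h)
    resid = y - mu_hat
    crit = (resid @ P @ resid) / n \
           - sigma2 / n * (np.trace(P) - 2.0 * np.trace(W))    # C_n(h, lambda)
    return beta, crit

# In practice one would evaluate crit over a finite grid of (p_h, lambda) values,
# e.g. lam in {l_h / n**rho for rho in (0.5, 1.2, 1.9, 2.6)} plus lam = 0,
# and keep the minimizing pair, as in Theorem 4.1.
```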

Again, if λ = 0 then C_n(h,λ) is the same as (2.14).

Before proving the optimality of C_n(h,λ), we introduce some notation. Let b(h) = µ − X_hβ_h with β_h satisfying (4.4). Define
$$R_n(h,\lambda) = \frac{1}{n}g(h)'(P(h) - W_\lambda(h))(P(h) - W_\lambda(h))g(h) + \frac{1}{n}E[e'W_\lambda(h)W_\lambda(h)e \mid Z],$$
where $W_\lambda(h) = P(h)F_h(F_h'P(h)F_h + \lambda\Omega_h)^{-1}F_h'P(h)$. Let Λ_n(h) be the set of possible values of λ for a given h ∈ H_n. We restrict Λ_n(h) to be a discrete set. The discreteness assumption is a little restrictive, but not harmful in an actual implementation. For simplicity, we treat Ω_h as a non-random matrix in our proof. However, strictly speaking, Ω_h is random if we use (4.6) as the penalizing smoothing matrix. We can also prove the result in the case of a random matrix under additional assumptions.

We impose the following assumptions.

Assumption 4.1 For some natural number m, the following conditions hold almost surely.
(i) $E[e_i^{4m} \mid z_i] < C$;
(ii) $\sum_{h \in H_n}\sum_{\lambda \in \Lambda_n(h)} (nR_n(h,\lambda))^{-m} \to 0$;
(iii) $\sup_{h \in H_n} E[\nu_{hi}^{4m} \mid z_i] < C$;
(iv) $\sum_{h \in H_n} p_h E[\nu_{hi}^2 \mid z_i] < C$.

Assumption 4.2 $\lim_{n \to \infty}\lambda_{\min}(n^{-1}F_h'P(h)F_h + n^{-1}\lambda\Omega_h) > C$ uniformly in h ∈ H_n and λ ∈ Λ_n(h).

Assumption 4.3 (i) l_h/p_h < C for all h ∈ H_n; (ii) $\sup_{h \in H_n}\sup_{\lambda \in \Lambda_n(h)} p_h/\mathrm{tr}(W_\lambda(h)W_\lambda(h)) < C$; (iii) $\max_{h \in H_n} p_h = o(n^{1/5})$.

Assumption 4.4 (i) Λ_n(h) is bounded for all h ∈ H_n; (ii) $\sup_{h \in H_n}\sup_{\lambda \in \Lambda_n(h)} \|\lambda\Omega_h\| = O(n)$; (iii) $\sup_{h \in H_n} \beta_h'\Omega_h\beta_h < C$.

Assumptions 4.1-4.3 are similar to Assumptions 3.1-3.3 in Section 3. A necessary condition for Assumption 4.1 (ii) is
$$\inf_{h \in H_n}\inf_{\lambda \in \Lambda_n(h)} nR_n(h,\lambda) \to \infty.$$

Assumption 4.1 (iv) requires that the function µ(·) is sufficiently smooth. Assumption 4.3 (ii) is satisfied if λ is not too large. If λ = 0, then p_h/tr(W_λ(h)W_λ(h)) = 1, so Assumption 4.3 (ii) is trivially satisfied when λ = 0. The term tr(W_λ(h)W_λ(h)) can be made arbitrarily close to zero by choosing a very large λ. In that case, the smoothing matrix λΩ_h dominates X_h'P(h)X_h, which induces an extremely large bias. In most cases, Assumption 4.3 (ii) will be satisfied under Assumption 4.4. Assumption 4.4 (iii) states that the function ψ^h(x)'β_h is bounded with respect to the Sobolev norm.

Under these assumptions, we have the following theorem.

Theorem 4.1 Let $(\hat h, \hat\lambda) = \arg\min_{h \in H_n,\, \lambda \in \Lambda_n(h)} C_n(h,\lambda)$. Suppose that Assumptions 4.1-4.4 hold. Then we have
$$\frac{L_n(\hat h, \hat\lambda)}{\inf_{h \in H_n}\inf_{\lambda \in \Lambda_n(h)} L_n(h,\lambda)} \xrightarrow{p} 1.$$
Theorem 4.1 shows that the selection criterion (4.10) is asymptotically optimal: the selected combination of smoothing parameters is asymptotically equivalent to the best infeasible combination among h ∈ H_n and λ ∈ Λ_n(h).

5 Implementation

5.1 Relationship between two loss functions

We consider the relationship between our loss function and the squared error loss in the case of nonparametric IV estimation. For simplicity, we consider the case where λ = 0. Let $\mu_h^*(x_i) = \psi^h(x_i)'\beta_h$ and suppose that $E\big[E[\mu(x_i) - \mu_h^*(x_i) \mid z_i]^2\big] = O(b_{p_h}^2)$ for some $b_{p_h} = o(1)$. Since L_n(h) is asymptotically equivalent to R_n(h) (see the proof of Theorem 3.1), (3.1) implies that
$$\sqrt{\frac{1}{n}\|P(h)(\mu - \hat\mu(h))\|^2} = O_p\left(b_{p_h} + \sqrt{\frac{p_h}{n}}\right),$$
which is a usual nonparametric rate of convergence. On the other hand, the convergence rate of µ̂(h) with respect to the squared error loss depends on the measure of ill-posedness of the inverse problem. Suppose that $E[\{\mu(x_i) - \mu_h^*(x_i)\}^2] = O_p(p_h^{-2\alpha})$ for some α > 0. Following Blundell, Chen, and Kristensen (2007), we have
$$\sqrt{\frac{1}{n}\|\mu - \hat\mu(h)\|^2} \le \sqrt{\frac{1}{n}b(h)'b(h)} + \hat\tau_h\sqrt{\frac{1}{n}\|P(h)(\mu^*(h) - \hat\mu(h))\|^2} = O_p\left(p_h^{-\alpha} + \hat\tau_h \times \left(b_{p_h} + \sqrt{\frac{p_h}{n}}\right)\right),$$
where $\mu^*(h) = (\mu_h^*(x_1), \ldots, \mu_h^*(x_n))'$ and
$$\hat\tau_h = \sup_{\gamma_h}\frac{\sqrt{n^{-1}\gamma_h'X_h'X_h\gamma_h}}{\sqrt{n^{-1}\gamma_h'X_h'P(h)X_h\gamma_h}}.$$
Note that τ̂_h is a sample analog of the sieve measure of ill-posedness defined by Blundell, Chen, and Kristensen (2007). In our case, the sieve measure of ill-posedness is given by
$$\tau_h = \sup_{\gamma_h}\frac{\sqrt{\gamma_h'E[\psi^h(x_i)\psi^h(x_i)']\gamma_h}}{\sqrt{\gamma_h'E[E[\psi^h(x_i) \mid z_i]E[\psi^h(x_i) \mid z_i]']\gamma_h}}.$$
Therefore, roughly speaking, the order of $n^{-1}\|\mu - \hat\mu(h)\|^2$ is approximately equal to the order of $\tau_h^2\, n^{-1}\|P(h)(\mu - \hat\mu(h))\|^2$. Whether a small value of (4.9) implies a small value of (4.8) depends on the behavior of τ_h. Since the conditional expectation is a contraction, τ_h ≥ 1, and τ_h = 1 when x_i is exogenous. Blundell, Chen, and Kristensen (2007) show that the sieve measure of ill-posedness is closely related to the eigenvalues of the conditional expectation operator.

Although τ_h is unknown, τ̂_h can be computed as
$$\hat\tau_h^2 = \lambda_{\max}\left(\left(\frac{1}{n}X_h'X_h\right)\left(\frac{1}{n}X_h'P(h)X_h\right)^{-1}\right),$$
where λ_max(A) is the maximum eigenvalue of a matrix A; see the Appendix for the derivation. We can use τ̂_h to check the robustness of our selection criterion. If τ̂_h takes a large value even after selection, this suggests that our selection method given in Section 2 may work poorly in terms of the squared error loss and that we should use penalization to stabilize the estimator.
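A sketch of this diagnostic; Xh and Zh are assumed to be the sieve and instrument matrices of the selected specification:

```python
import numpy as np

def ill_posedness(Xh, Zh):
    """Sample sieve measure of ill-posedness tau_hat_h of Section 5.1."""
    n = Xh.shape[0]
    P = Zh @ np.linalg.solve(Zh.T @ Zh, Zh.T)     # P(h)
    Q = Xh.T @ Xh / n                             # Q_h
    H = Xh.T @ P @ Xh / n                         # H_h
    # tau_hat_h^2 = lambda_max(Q_h H_h^{-1}); the product is not symmetric,
    # but it is similar to a symmetric PSD matrix, so its eigenvalues are
    # real and nonnegative; the real part is taken as a numerical guard.
    eigvals = np.linalg.eigvals(Q @ np.linalg.inv(H))
    return float(np.sqrt(np.max(eigvals.real)))
```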

5.2 Estimation of σ²

Theorems 3.1 and 4.1 remain valid if σ² is replaced with a consistent estimator. However, it is rather difficult to obtain a consistent estimator of σ² in the nonparametric IV case because of the ill-posed inverse problem. If we estimate σ² by $\hat\sigma^2 = n^{-1}\|y - \hat\mu_\lambda(h)\|^2$, then consistency of σ̂² depends on consistency of µ̂_λ(h) with respect to the squared error loss.

The theory does not tell us how to estimate σ². In nonparametric regression, model independent methods are sometimes used to estimate σ² (see, e.g., Gasser, Sroka, and Jennen-Steinmetz (1986) and Hall, Kay, and Titterington (1990)). However, these methods are not readily applicable in our case. Also, it is not necessarily clear whether the estimator should be model independent or not. The results of the Monte Carlo study suggest that the following criterion works sufficiently well in practice:
$$C_n(h,\lambda) = \frac{1}{n}\|P(h)(y - \hat\mu_\lambda(h))\|^2 - \frac{\hat\sigma_\lambda^2(h)}{n}\,\mathrm{tr}\big(P(h) - 2\hat W_\lambda(h)\big), \qquad (5.1)$$
where $\hat\sigma_\lambda^2(h) = n^{-1}\|y - \hat\mu_\lambda(h)\|^2$. That is, we estimate σ² with a different estimate for each pair of h and λ. When λ = 0, we have $C_n(h) = n^{-1}\|P(h)(y - \hat\mu(h))\|^2 - n^{-1}\hat\sigma^2(h)(l_h - 2p_h)$. This criterion is closer to the AIC than to the Mallows criterion.

One may think that a GCV-type method can be used to avoid estimating σ². Unfortunately, a GCV-type method does not work in our case. Since $P(h)\hat\mu_\lambda(h) = \hat W_\lambda(h)y$, the "hat" matrix is $\hat W_\lambda(h)$. Thus, a natural criterion would be
$$\mathrm{GCV}(h,\lambda) = \frac{n^{-1}\|P(h)(y - \hat\mu_\lambda(h))\|^2}{\big(1 - n^{-1}\mathrm{tr}(\hat W_\lambda(h))\big)^2}.$$
If $n^{-1}\mathrm{tr}(\hat W_\lambda(h))$ is small, a Taylor expansion yields
$$\mathrm{GCV}(h,\lambda) \approx \frac{1}{n}\|P(h)(y - \hat\mu_\lambda(h))\|^2 + \frac{2}{n^2}\|P(h)(y - \hat\mu_\lambda(h))\|^2\,\mathrm{tr}(\hat W_\lambda(h)).$$
Since the second term converges to zero much faster than the first term, GCV essentially puts no penalty on model complexity. Hence, unlike in the case of OLS estimation, the GCV criterion is not asymptotically optimal.
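A sketch of the feasible criterion (5.1), in which σ² is replaced by the pair-specific estimate σ̂²_λ(h); the inputs are the same as in the sketch of Section 4:

```python
import numpy as np

def feasible_criterion(y, Xh, Zh, Omega, lam):
    """Feasible criterion (5.1): sigma^2 is replaced by the pair-specific
    estimate sigma_hat_lambda^2(h) = n^{-1} ||y - mu_hat_lambda(h)||^2."""
    n = len(y)
    P = Zh @ np.linalg.solve(Zh.T @ Zh, Zh.T)
    A = Xh.T @ P @ Xh + lam * Omega
    mu_hat = Xh @ np.linalg.solve(A, Xh.T @ P @ y)
    W = P @ Xh @ np.linalg.solve(A, Xh.T @ P)
    sigma2_hat = np.sum((y - mu_hat) ** 2) / n
    resid = y - mu_hat
    return (resid @ P @ resid) / n \
        - sigma2_hat / n * (np.trace(P) - 2.0 * np.trace(W))
```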


6 Monte Carlo study

To evaluate the performance of our selection method, we conduct a small Monte Carlo simulation. The design is similar to that of Newey and Powell (2003). The DGP is
$$y_i = \mu(x_i) + e_i = \log(|x_i - 1| + 1)\,\mathrm{sgn}(x_i - 1) + e_i, \qquad x_i = z_i + v_i,$$
where the errors e_i and v_i and the instrument z_i are generated as
$$\begin{pmatrix} e_i \\ v_i \\ z_i \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\right).$$
We estimate the function µ(·) by the sieve IV estimator. As stated by Blundell, Chen, and Kristensen (2007), if (x_i, z_i) has a bivariate normal distribution, then the inverse problem is severely ill-posed. The sieve measure of ill-posedness τ_h grows at a rate O(exp(p_h)). The optimal convergence rate of the sieve IV estimator with respect to the L² norm will be very slow (logarithmic rate).

To approximate µ(·), we use Hermite and power series:
$$\mu(x) \simeq \beta_0 x + \sum_{j=1}^{p_h - 1}\beta_j x^{j-1}\exp\left\{-\frac{(x - \nu_1)^2}{2\nu_2^2}\right\} \qquad (6.1)$$
and
$$\mu(x) \simeq \sum_{j=1}^{p_h}\beta_j x^{j-1}$$

with p_h = 4, 5, 6, 7. The parameters ν_1 and ν_2² in (6.1) are chosen as the sample mean and sample variance of x_i. Following Newey and Powell (2003), we include β_0 x in the approximation (6.1). The candidate values of λ are λ = l_h/n^ρ with ρ = 0.5, 1.2, 1.9, 2.6, and λ = 0. These values are selected so that there is a notable difference in RMSE between estimators using different λ. The set of instruments is fixed: we choose z_h = q^h(z) to be a cubic spline with 4 knots, where the knots are the 20%, 40%, 60% and 80% sample quantiles of z_i. Thus we consider the problem of selecting the pair (p_h, λ) for the given l_h. We obtain the estimator by solving (4.5). The smoothing penalization matrix is Ω_h = Ω_{0h} + Ω_{2h} with
$$\Omega_{2h} = \int\left\{\frac{d^2\psi^h(x)}{dx^2}\right\}\left\{\frac{d^2\psi^h(x)}{dx^2}\right\}' dx.$$



mean{∥P (h)(µ − µ ˆλ (h))∥2 /n},

13

where mean{·} denotes the average over 1000 repetitions. We also report the root mean squared √ error (RMSE): mean{∥µ − µ ˆλ (h)∥2 /n}. The results are summarized in Tables 1-4. To evaluate the performance of the selection criterion, we compare our criterion with infeasible best selections. Mallows denotes the estimator based on the selection criterion (5.1). Oracle 1 denotes the estimator based on the pair of ph and λ which minimizes WRMSE, while Oracle 2 denotes the estimator which minimizes RMSE. Ratio 1 refers to the ratio of WRMSE of an estimator relative to that of Oracle 1, whereas Ratio 2 is the ratio of RMSE of an estimator to that of Oracle 2. Table 1 shows that the results of Hermite series and power series are quite similar. This suggests that the sieve IV estimator is relatively insensitive to the choice of basis functions if we select the smoothing parameters properly. Our selection criterion performs remarkably well both in terms of WRMSE and RMSE. The Ratio 1 is close to one for all sample sizes. Since RMSE is more volatile than WRMSE (see Table 4), Ratio 2 is relatively larger than Ratio 1. But it is still quite close to one for all cases. Both WRMSE and RMSE of Mallows decrease as the sample size increases, but RMSE declines slowly, which would be expected for the severely ill-posed inverse problem. Table 1 also reveals that the pair of (ph , λ) which minimizes WRMSE is quite close to the pair which minimizes RMSE. Ratio 1 of Oracle 2 and Ratio 2 of Oracle 1 are very close to one. This result also confirms that our selection criterion has a certain robustness property. For comparison, we also report the result of no smoothing. We only select ph for fixed lh and λ = 0. The result is summarized in Table 2. Table 2 shows that the criterion successfully selects the number of sieve associated with small loss. However, actual values of WRMSE and RMSE are rather large compared to the case of smoothing. This result verifies the importance of smoothing for nonparametric IV estimation. Tables 3 and 4 report WRMSE and RMSE for all pairs of (ph , λ). Table 4 shows that RMSE is very sensitive to the choice of λ and is rather insensitive to the choice of ph . However, without smoothing, RMSE is sensitive to the choice of ph . Although actual values are quite different, WRMSE and RMSE show a similar pattern of movement. WRMSE and RMSE attain their minimum around λ = lh /n1.2 and increase as λ decreases for λ ≤ lh /n1.2 . Table 4 also shows that if we do not select ph and λ properly, RMSE of the estimator does not improve even if the sample size increases. For example, in the case of Hermite series with ph = 7 and λ = lh /n1.9 , RMSE actually increases as the sample size increases. In our design, the sieve IV estimator performs best around λ = ln /n1.2 . However this is not always true. The author conducted simulations under different designs. Although the results are not reported, the results reveal that the optimal choice of λ depends on the DGP. These results manifest the importance of selecting smoothing parameters in a data-driven way. In summary, our selection criterion performs very well in our design. Even when the sample size is small, the selection criterion successfully selects the smoothing parameters associated with small loss. The simulation result suggests that our selection procedure is robust even if we evaluate the selected model by using other loss functions.

14

Table 1: Simulation Result

                         WRMSE   Ratio 1   RMSE    Ratio 2
  n = 100
  Hermite   Oracle 1     0.143   1.000     0.205   1.034
            Oracle 2     0.145   1.009     0.199   1.000
            Mallows      0.163   1.142     0.285   1.436
  Power     Oracle 1     0.150   1.000     0.215   1.012
            Oracle 2     0.151   1.008     0.212   1.000
            Mallows      0.171   1.146     0.314   1.481
  n = 400
  Hermite   Oracle 1     0.080   1.000     0.148   1.017
            Oracle 2     0.081   1.012     0.146   1.000
            Mallows      0.089   1.110     0.230   1.582
  Power     Oracle 1     0.084   1.000     0.147   1.017
            Oracle 2     0.085   1.010     0.145   1.000
            Mallows      0.095   1.128     0.237   1.635
  n = 1000
  Hermite   Oracle 1     0.057   1.000     0.127   1.052
            Oracle 2     0.058   1.024     0.120   1.000
            Mallows      0.061   1.081     0.212   1.764
  Power     Oracle 1     0.058   1.000     0.117   1.026
            Oracle 2     0.059   1.011     0.114   1.000
            Mallows      0.066   1.128     0.179   1.570

Table 2: Simulation Result (λ = 0)

                         WRMSE   Ratio 1   RMSE    Ratio 2
  n = 100
  Hermite   Oracle 1     0.200   1.000     0.442   1.024
            Oracle 2     0.202   1.006     0.431   1.000
            Mallows      0.207   1.032     0.478   1.108
  Power     Oracle 1     0.200   1.000     0.438   1.027
            Oracle 2     0.201   1.004     0.427   1.000
            Mallows      0.208   1.036     0.475   1.114
  n = 400
  Hermite   Oracle 1     0.100   1.000     0.240   1.020
            Oracle 2     0.100   1.003     0.236   1.000
            Mallows      0.104   1.039     0.296   1.257
  Power     Oracle 1     0.099   1.000     0.222   1.019
            Oracle 2     0.100   1.002     0.218   1.000
            Mallows      0.106   1.066     0.322   1.477
  n = 1000
  Hermite   Oracle 1     0.064   1.000     0.158   1.023
            Oracle 2     0.064   1.003     0.154   1.000
            Mallows      0.068   1.070     0.257   1.667
  Power     Oracle 1     0.064   1.000     0.144   1.017
            Oracle 2     0.064   1.004     0.142   1.000
            Mallows      0.069   1.076     0.223   1.570

Table 3: WRMSE

  λ:                  l_h/n^0.5   l_h/n^1.2   l_h/n^1.9   l_h/n^2.6       0
  n = 100
  Hermite  p_h = 7      0.400       0.149       0.194       0.234     0.265
           p_h = 6      0.400       0.148       0.193       0.230     0.244
           p_h = 5      0.412       0.147       0.189       0.217     0.223
           p_h = 4      0.412       0.146       0.186       0.200     0.201
  Power    p_h = 7      0.386       0.158       0.182       0.199     0.263
           p_h = 6      0.387       0.157       0.182       0.197     0.243
           p_h = 5      0.387       0.157       0.177       0.191     0.222
           p_h = 4      0.387       0.154       0.176       0.191     0.201
  n = 400
  Hermite  p_h = 7      0.283       0.084       0.109       0.127     0.131
           p_h = 6      0.283       0.084       0.107       0.120     0.121
           p_h = 5      0.293       0.082       0.104       0.110     0.111
           p_h = 4      0.293       0.081       0.099       0.100     0.100
  Power    p_h = 7      0.274       0.088       0.098       0.108     0.131
           p_h = 6      0.275       0.088       0.097       0.103     0.121
           p_h = 5      0.275       0.088       0.094       0.102     0.111
           p_h = 4      0.275       0.087       0.094       0.099     0.100
  n = 1000
  Hermite  p_h = 7      0.212       0.059       0.074       0.083     0.083
           p_h = 6      0.212       0.059       0.073       0.077     0.077
           p_h = 5      0.219       0.058       0.069       0.071     0.071
           p_h = 4      0.219       0.057       0.064       0.064     0.064
  Power    p_h = 7      0.207       0.060       0.066       0.071     0.083
           p_h = 6      0.208       0.060       0.064       0.070     0.077
           p_h = 5      0.208       0.061       0.063       0.070     0.071
           p_h = 4      0.209       0.061       0.064       0.065     0.065

Table 4: RMSE

  λ:                  l_h/n^0.5   l_h/n^1.2   l_h/n^1.9   l_h/n^2.6       0
  n = 100
  Hermite  p_h = 7      0.503       0.207       0.376       0.822     3.011
           p_h = 6      0.503       0.206       0.375       0.787     1.173
           p_h = 5      0.511       0.202       0.366       0.618     0.739
           p_h = 4      0.511       0.201       0.352       0.434     0.442
  Power    p_h = 7      0.499       0.231       0.292       0.407     2.831
           p_h = 6      0.499       0.227       0.291       0.391     1.188
           p_h = 5      0.499       0.227       0.273       0.337     0.699
           p_h = 4      0.499       0.221       0.269       0.337     0.438
  n = 400
  Hermite  p_h = 7      0.372       0.150       0.384       1.153     2.681
           p_h = 6      0.372       0.150       0.382       0.790     0.880
           p_h = 5      0.378       0.149       0.319       0.506     0.525
           p_h = 4      0.378       0.148       0.234       0.239     0.239
  Power    p_h = 7      0.376       0.155       0.203       0.347     2.731
           p_h = 6      0.377       0.155       0.203       0.253     0.820
           p_h = 5      0.377       0.156       0.178       0.251     0.440
           p_h = 4      0.378       0.154       0.180       0.221     0.223
  n = 1000
  Hermite  p_h = 7      0.288       0.123       0.405       1.248     1.682
           p_h = 6      0.288       0.122       0.373       0.575     0.589
           p_h = 5      0.291       0.127       0.285       0.362     0.364
           p_h = 4      0.291       0.129       0.157       0.157     0.157
  Power    p_h = 7      0.297       0.123       0.160       0.251     1.714
           p_h = 6      0.299       0.125       0.143       0.227     0.562
           p_h = 5      0.299       0.128       0.139       0.221     0.253
           p_h = 4      0.302       0.128       0.142       0.147     0.147

7 Conclusion

This paper proposes a new model selection method for IV models. Rather than identifying a correct model, we address the issue of selecting a best approximating model. We introduce a loss function for evaluating IV models and propose a selection criterion that is similar to Mallows' Cp. We establish the asymptotic optimality of our selection criterion.

There has been a long controversy over whether AIC-like or BIC-like criteria are preferable; in general there is a conflict between efficiency and consistency. In regression, the Cp criterion is asymptotically efficient but not consistent, whereas BIC is consistent but not asymptotically efficient. This paper establishes a similar result in the case of IV estimation. Our selection criterion is asymptotically efficient but not consistent in the sense of Andrews and Lu (2001). In contrast, the BIC-like criterion of Andrews and Lu (2001) is consistent but not asymptotically efficient.

We also propose a selection method for determining the smoothing parameters of the sieve IV estimator of Newey and Powell (2003) and Blundell, Chen, and Kristensen (2007). A drawback of the sieve IV estimator is that its performance is highly sensitive to the choice of smoothing parameters. However, there has been no theoretical guidance for selecting the smoothing parameters, which makes it hard to apply the sieve IV estimator in empirical work. This paper is a first attempt to address this difficulty.

Admittedly, this paper is limited in some respects. In particular, we do not provide an optimal selection rule with respect to the squared error loss. This important issue should be addressed in future research.


A Appendix

Let C denote a generic constant which may take distinct values in different contexts. Let CSI refer to the Cauchy-Schwarz inequality and MI to the Markov inequality. Also, let $p_H = \max_{h \in H_n} p_h$ and $l_H = \max_{h \in H_n} l_h$. The qualifier "with probability approaching one" is abbreviated as w.p.a.1.

Lemma A.1 Suppose that Assumptions 3.2 and 3.3 hold. Then $\lambda_{\min}(n^{-1}X_h'P(h)X_h) > C$ w.p.a.1 uniformly over h ∈ H_n.

Proof. Let ηh = (ηh1 , . . . , ηhn )′ . Then, E[tr(ηn′ P (h)ηh )|Z] = tr(P (h)E[ηh ηh′ |Z]) ≤ Clh ph . Thus we have tr(n−1 ηn′ P (h)ηh ) = Op (lh ph /n) by MI. Also, a law of large numbers implies ′ ′ ∥n−1 Fh′ Fh − E[fhi fhi ]∥ = op (1) with ∥E[fhi fhi ]∥ = O(ph ) if p2h /n → 0. Hence, we have





1 ′



Xh P (h)Xh − 1 Fh′ P (h)Fh ≤ 1 ηh′ P (h)ηh + 2 1 ηn′ P (h)Fh

n



n n n √ (

)√ ( )

1 ′

1 ′ 1 ′

≤ ηh P (h)ηh + 2 tr F Fh tr η P (h)ηh n n h n h √ 1/2 = Op (lh ph /n) + Op (lh ph / n).

Since |λmin (A) − λmin (B)| ≤ ∥A − B∥ for any matrices A and B, we obtain the result. 2 Lemma A.2 Suppose that Assumptions 3.2 and 3.3 hold. Then e′ W ˜ (h)e − e′ W (h)e p sup → 0. nRn (h) h∈Hn

Proof. Observe that

1 ′ ˜ (h)e − 1 e′ W (h)e eW n n 1 ′ 1 −1 −1 ′ ′ ′ ′ = e P (h)Xh (Xh P (h)Xh ) Xh P (h)e − eP (h)Fh (Fh P (h)Fh ) Fh P (h)e n n )−1 ( 1 1 ′ 1 ′ ′ ≤ e P (h)ηh Xh P (h)Xh ηh P (h)e n n n ( ) −1 1 1 ′ 1 ′ ′ + 2 e P (h)Fh Xh P (h)Xh ηh P (h)e n n n [( )−1 ( )−1 ] 1 1 ′ 1 ′ 1 ′ ′ Xh P (h)Xh Fh P (h)Fh Fh P (h)e + e P (h)Fh − n n n n ≡ A1h + A2h + A3h ,

say.

Since E[e′ P (h)e|Z] = tr(P (h)E[ee′ |Z]) = σ 2 lh , e′ P (h)e/n = Op (lh /n) by MI. Thus, by CSI and Lemma A.1,

( A1h ≤ Ctr

1 ′ η P (h)ηh n h

)

1 ′ e P (h)e = Op (lh2 ph /n2 ). n


Also, by CSI,

2 ( )

1 ′

e P (h)Fh ≤ tr 1 Fh′ Fh 1 e′ P (h)e = Op (lh ph /n).

n

n n Hence, A2h



1 ′

1 ′



≤ C e P (h)Fh ηh P (h)e

n n 1/2 1/2

1/2

3/2

= Op (lh ph /n1/2 )Op (lh ph /n) = Op (lh ph /n3/2 ). Moreover, we have A3h

( )−1 ( ) 1 1 ′ 1 ′ 1 ′ ′ X P (h)Xh X P (h)Xh − Fh P (h)Fh = e P (h)Fh n n h n h n ( )−1 1 ′ 1 ′ × F P (h)Fh F P (h)e n h n h

2

1 ′

1 ′

2 ′

≤ C Fh P (h)e ηh P (h)ηh + Fh P (h)ηh

n n n 1/2

= Op (lh ph /n)Op (lh ph /n) + Op (lh ph /n)Op (lh ph /n1/2 ) ( ) ( ) 3/2 = Op lh2 p2h /n2 + Op lh p2h /n3/2 . Because nRn (h) ≥ E[e′ W (h)e|Z] = ph σ 2 , we obtain 3/2 3/2 3/2 e′ W l p2 lh p2h l pH ˜ (h)e − e′ W (h)e ≤ C h1/2 h ≤ C H 1/2 ≤ C 3/2 nRn (h) n Rn (h) n ph n for all h ∈ Hn w.p.a.1. Thus the result follows from Assumption 3.3. 2 Lemma A.3 Suppose that Assumptions 3.2 and 3.3 hold. Then b(h)′ W ˜ (h)b(h) − b(h)′ W (h)b(h) p sup → 0. nRn (h) h∈Hn

Proof. We see that

1 ˜ (h)b(h) − 1 b(h)′ W (h)b(h) b(h)′ W n n 1 1 −1 −1 ′ ′ ′ ′ ′ ′ = b(h) P (h)Xh (Xh P (h)Xh ) Xh P (h)b(h) − b(h) P (h)Fh (Fh P (h)Fh ) Fh P (h)b(h) n n ( ) −1 1 1 ′ 1 ′ Xh P (h)Xh ηh P (h)b(h) ≤ b(h)′ P (h)ηh n n n ( )−1 1 1 ′ 1 ′ ′ + 2 b(h) P (h)Fh Xh P (h)Xh ηh P (h)b(h) n n n [( ) ( )−1 ] −1 1 1 1 1 Xh′ P (h)Xh − Fh′ P (h)Fh Fh′ P (h)b(h) . + b(h)′ P (h)Fh n n n n


Also, observe that 1 b(h)′ P (h)b(h) n

( ≤ C =

1 1 g(h)′ P (h)g(h) + νh′ P (h)νh n n

)

C g(h)′ P (h)g(h) + Op (lh /n). n

Then, by doing a similar calculation as in the proof of Lemma A.2, we obtain 1/2 3/2 b(h)′ W ˜ (h)b(h) − b(h)′ W (h)b(h) n−1 g(h)′ P (h)g(h)lh p2h lh p2h + C ≤C nRn (h) n1/2 Rn (h) n3/2 Rn (h) for all h ∈ Hn w.p.a.1. Now we consider two possible cases. Since nRn (h) = g(h)′ (P (h)−W (h))g(h)+ph σ 2 , nRn (h) is asymptotically dominated by either g(h)′ (P (h)−W (h))g(h) or ph σ 2 . If limn→∞ (g(h)′ (P (h)− W (h))g(h))/nRn (h) > 0 a.s., then 1/2 1/2 n−1 g(h)′ P (h)g(h)l1/2 p2 lh p2h lH p2H h h ≤ C ≤ C . n1/2 Rn (h) n1/2 n1/2 On the other hand, if limn→∞ (g(h)′ (P (h) − W (h))g(h))/nRn (h) = 0 a.s., then 1/2 1/2 n−1 g(h)′ P (h)g(h)l1/2 p2 lh p2h lH p2H h h ≤ C . ≤ C n1/2 Rn (h) n1/2 n1/2 Therefore, we have 3/2 1/2 b(h)′ W ˜ (h)b(h) − b(h)′ W (h)b(h) l pH l p2 ≤ C H 1/2H + C H 1/2 nRn (h) n n for all h ∈ Hn w.p.a.1. The result follows from Assumption 3.3. 2 Lemma A.4 Suppose that Assumptions 3.1 (ii)-(iv) and 3.3 (i) hold. Then ′ ν (P (h) − W (h))νh p → 0. sup h nRn (h) h∈Hn

Proof. By the triangular inequality,

′ ′ ′ νh (P (h) − W (h))νh ≤ νh (P (h) − W (h))νh − E[νh (P (h) − W (h))νh |Z] nRn (h) nRn (h) ′ E[νh (P (h) − W (h))νh |Z] . + nRn (h)

By Theorem 2 of Whittle (1960), Assumption 3.1 (iii) gives that [ ] 2m E |νh′ (P (h) − W (h))νh − E[νh′ (P (h) − W (h))νh |Z]| |Z ≤ Ctr(P (h) − W (h))m = C(lh − ph )m . Also, by Assumption 3.3 (i), lh − ph lh − ph ≤ < C. nRn (h) ph σ 2


Thus, by MI and Assumption 3.1 (ii), for δ > 0, ′ ( ) νh (P (h) − W (h))νh − E[νh′ (P (h) − W (h))νh |Z] > δ|Z P sup nRn (h) h∈Hn ( ) ∑ ν ′ (P (h) − W (h))νh − E[νh′ (P (h) − W (h))νh |Z] > δ|Z ≤ P h nRn (h) h∈Hn [ ] 2m ∑ E |νh′ (P (h) − W (h))νh − E[νh′ (P (h) − W (h))νh |Z]| |Z ≤ δ −2m (nRn (h))2m h∈Hn



C ∑ (nRn (h))−m → 0. δ 2m h∈Hn

Also, E[νh′ (P (h) − W (h))νh |Z] sup nRn (h) h∈Hn



E[νh′ (P (h) − W (h))νh |Z] inf h∈Hn nRn (h) ( )−1 ∑ 2 ≤ C ph E[νhi |zi ] inf nRn (h) , h∈Hn



h∈Hn

h∈Hn

which converges to zero by Assumptions 3.1 (ii) and (iv). Hence we obtain the result. 2 Lemma A.5 p p ˆ = arg min Cn (h). If sup ˆ Let h h∈Hn |Cn (h)/Ln (h) − 1| → 0, then Ln (h)/ inf h∈Hn Ln (h) → 1. ˆ and Proof. Let h∗ be the minimizer of Ln (h). Then by construction, Ln (h∗ ) ≤ Ln (h) ˆ Thus we have Cn (h∗ ) ≥ Cn (h). 0

≤ 1−

ˆ − Ln (h∗ ) ˆ − Ln (h∗ ) Cn (h) ˆ − Ln (h) ˆ Ln (h∗ ) Ln (h) Cn (h) = = − ˆ ˆ ˆ ˆ Ln (h) Ln (h) Ln (h) Ln (h)

ˆ − Ln (h) ˆ Cn (h∗ ) − Ln (h∗ ) Cn (h) − ∗ ˆ Ln (h ) L (h) n Cn (h) − Ln (h) p → 0, ≤ 2 sup Ln (h) h∈H ≤

n

p ˆ → which shows inf h∈Hn Ln (h)/Ln (h) 1. The conclusion follows from the continuous mapping

theorem. 2

Proof of Theorem 3.1. First, we show that Cn (h) − Ln (h) p → 0. sup Rn (h) h∈Hn

(A.1)

˜ (h)Xh , we have (P (h) − W ˜ (h))µ = (P (h) − W ˜ (h))b(h). Thus, Since P (h)Xh = W Cn (h)

σ2 1 2 ∥P (h)(y − µ ˆ(h))∥ − (lh − 2ph ) n n 1 2 σ2 = Ln (h) + e′ P (h)e + e′ P (h)(µ − µ ˆ(h)) − (lh − 2ph ) n n n ) ( ( ) 2 ˜ (h))b(h) + 1 e′ P (h)e − lh σ 2 − 2 e′ W ˜ (h)e − ph σ 2 . = Ln (h) + e′ (P (h) − W n n n

=


Thus we have e′ (P (h) − W e′ W ˜ (h))b(h) e′ P (h)e − lh σ 2 ˜ (h)e − ph σ 2 Cn (h) − Ln (h) ≤ 2 + 2 + Rn (h) nRn (h) nRn (h) nRn (h) e′ (P (h) − W ˜ (h))b(h) − e′ (P (h) − W (h))b(h) ≤ 2 nRn (h) ′ e (P (h) − W (h))b(h) + 2 nRn (h) ′ e′ W ˜ (h)e − ph σ 2 e P (h)e − lh σ 2 + 2 + (A.2) . nRn (h) nRn (h) It follows from CSI and Lemmas A2 and A3 that e′ (P (h) − W ˜ (h))b(h) − e′ (P (h) − W (h))b(h) sup nRn (h) h∈Hn √ √ ˜ (h))e ˜ (h))b(h) p e′ (W (h) − W b(h)′ (W (h) − W ≤ sup sup → 0. nRn (h) nRn (h) h∈Hn h∈Hn Also,

′ e (P (h) − W (h))b(h) e′ (P (h) − W (h))g(h) e′ (P (h) − W (h))νh ≤ + . nRn (h) nRn (h) nRn (h)

By Assumption 3.1 (i), Theorem 2 of Whittle (1960) implies that [ ] 2m m E |e′ (P (h) − W (h))g(h)| |Z ≤ C (g(h)′ (P (h) − W (h))g(h)) ≤ C(nRn (h))m . Therefore, by MI, for any δ > 0, ′ ( ) e (P (h) − W (h))g(h) P sup ≤ > δ|Z nRn (h) h∈Hn

( ′ ) e (P (h) − W (h))g(h) P > δ|Z nRn (h) h∈Hn [ ] 2m E |e′ (P (h) − W (h))g(h)| |Z ∑ ≤ δ −2m (nRn (h))2m ∑

h∈Hn



C ∑ (nRn (h))−m → 0, δ 2m h∈Hn

which shows

′ e (P (h) − W (h))g(h) p → 0. sup nRn (h) h∈H n

Next, by CSI, ′ e (P (h) − W (h))νh nRn (h) Also, by triangular inequality, ′ e (P (h) − W (h))e ≤ nRn (h)

√ ≤

e′ (P (h) − W (h))e nRn (h)



νh′ (P (h) − W (h))νh . nRn (h)

′ e (P (h) − W (h))e − E[e′ (P (h) − W (h))e|Z] nRn (h) E[e′ (P (h) − W (h))e|Z] . + nRn (h)


It follows from Whittle’s inequality and MI that ′ e (P (h) − W (h))e − E[e′ (P (h) − W (h))e|Z] p → 0. sup nRn (h) h∈Hn Also by Assumption 3.3 (i), sup h∈Hn

E[e′ (P (h) − W (h))e|Z] (lh − ph )σ 2 ≤ sup < C. nRn (h) ph σ 2 h∈Hn

Thus by Lemma A.4, we have ′ e (P (h) − W (h))νh p → 0, sup nRn (h) h∈Hn and thus

′ e (P (h) − W (h))b(h) p → 0. sup nRn (h) h∈Hn

Then take the third term of (A.2). By Whittle’s inequality and Assumption 3.3 (i), [ 2m ] E e′ P (h)e − lh σ 2 |Z ≤ Ctr(P (h))m < Cpm h . By applying MI again, we have ′ e P (h)e − lh σ 2 p → 0. sup nRn (h) h∈Hn Finally, it follows from the Whittle’s inequality, MI and Lemma A.2 that e′ W ˜ (h)e − ph σ 2 p sup → 0, nRn (h) h∈Hn and hence (A.1) follows. Next, we prove Ln (h) − Rn (h) p → 0. sup Rn (h) h∈H

(A.3)

n

By a straightforward calculation, e′ W ˜ (h)e − e′ W (h)e Ln (h) − Rn (h) ≤ Rn (h) nRn (h) ′ e W (h)e − E[e′ W (h)e|Z] + nRn (h) b(h)′ (P (h) − W ˜ (h))b(h) − b(h)′ (P (h) − W (h))b(h) + nRn (h) ′ ′ b(h) (P (h) − W (h))b(h) − g(h) (P (h) − W (h))g(h) . + nRn (h) By Lemmas A2-A3 and the Whittle’s inequality, the supremums of first three terms are of order op (1). The supremum of the fourth term is bounded by ′ ′ ν (P (h) − W (h))νh + 2 sup νh (P (h) − W (h))g(h) . sup h nRn (h) nRn (h) h∈Hn h∈Hn


The first term is op (1) by Lemma A.4. Also, by the Whittle’s inequality and the MI, we can show that the second term is also op (1), and so (A.3) holds. Finally, by (A.1) and (A.3), we obtain Cn (h) − Ln (h) p → 0. sup Ln (h) h∈Hn Thus the desired result follows from Lemma A.5. 2 Lemma A.6 Suppose that Assumptions 4.2 and 4.3 hold. Then e′ W ˆ λ (h)e − e′ Wλ (h)e p sup sup →0 nRn (h, λ) h∈Hn λ∈Λn (h) and

e′ W ˆ λ (h)e − e′ Wλ (h)Wλ (h)e p ˆ λ (h)W sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h)

Proof. By Assumption 4.3 (ii), nRn (h, λ) ≥ σ 2 tr(Wλ (h)Wλ (h)) > Cph . Then by using a similar argument as in the proof of Lemma A.2, we obtain 3/2 3/2 eW l p2 l pH ˜ (h)e − e′ W (h)e ≤ C H 1/2 ≤ C 3/2h h nRn (h, λ) n Rn (h, λ) n for all h ∈ Hn and λ ∈ Λn (h). Thus the first result follows from Assumptions 4.3 (i) and (iii). Also, we have 1 ′ ˆ λ (h)W ˆ λ (h)e − 1 e′ Wλ (h)Wλ (h)e eW n n ( ) ( ) 1 ˆ λ (h) W ˆ λ (h) − Wλ (h) e + 1 e′ Wλ (h) W ˆ λ (h) − Wλ (h) e ≤ e′ W n n ) 1 ( ˆ λ (h) − Wλ (h) e , ≤ 2 e′ W n ˆ λ (h)) ≤ 1 and λmax (Wλ (h)) ≤ 1. where the second inequality follows from the fact that λmax (W Thus we obtain the second result. 2 Lemma A.7 Suppose that Assumptions 4.2 and 4.3 hold. Then b(h)′ W ˆ λ (h)b(h) − b(h)′ Wλ (h)b(h) p sup sup →0 nRn (h, λ) h∈Hn λ∈Λn (h) and

b(h)′ W ˆ λ (h)W ˆ λ (h)b(h) − b(h)Wλ (h)Wλ (h)b(h) p sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h)


Proof. Similarly to the proof of Lemma A.2, we obtain

1/2 3/2 b(h)′ W ˆ λ (h)b(h) − b(h)′ Wλ (h)b(h) n−1 g(h)′ P (h)g(h)lh p2h lh p2h + C . ≤ C nRn (h, λ) n1/2 Rn (h, λ) n3/2 Rn (h, λ)

Using a similar argument as in the proof of Lemma A.2, we can show that 1/2 n−1 g(h)′ P (h)g(h)l1/2 p2 lH p2H h h ≤ C n1/2 Rn (h, λ) n1/2 for all h ∈ H and λ ∈ Λn (h). Thus the first result follows from Assumption 4.3. Also,

1 ˆ λ (h)W ˆ λ (h)b(h) − 1 b(h)Wλ (h)Wλ (h)b(h) b(h)′ W n n ( ) ( ) 1 ˆ λ (h) W ˆ λ (h) − Wλ (h) b(h) + 1 b(h)′ Wλ (h) W ˆ λ (h) − Wλ (h) b(h) ≤ b(h)′ W n n ( ) 1 ˆ λ (h) − Wλ (h) b(h) . ≤ 2 b(h)′ W n

Thus we obtain the second result. 2 Lemma A.8 Suppose that Assumptions 4.1 (ii)-(iv) and 4.3 (i)-(ii) hold. Then ′ ν (P (h) − Wλ (h))νh p → 0. sup sup h nRn (h, λ) h∈Hn λ∈Λn (h)

Proof. The proof is similar to that of Lemma A.3. By Whittles inequality,

[ ] 2m E |νh′ (P (h) − Wλ (h))νh − E[νh′ (P (h) − Wλ (h))νh |Z]| |Z m

≤ C (tr(P (h) − Wλ (h))(P (h) − Wλ (h))) . Also, by Assumptions 4.3 (i) and (ii) tr((P (h) − Wλ (h))(P (h) − Wλ (h))) ≤ Ctr(Wλ (h)Wλ (h)) ≤ C(nRn (h, λ)). Thus we obtain ( P

) ′ νh (P (h) − Wλ (h))νh − E[νh′ (P (h) − Wλ (h))νh |Z] > δ|Z sup sup nRn (h, λ) h∈Hn λ∈Λn (h) [ ] 2m ∑ ∑ E |νh′ (P (h) − Wλ (h))νh − E[νh′ (P (h) − Wλ (h))νh |Z]| |Z

≤ δ −2m

(nRn (h, λ))2m

h∈Hn λ∈Λn (h)



C ∑ δ 2m



(nRn (h, λ)) → 0.

h∈Hn λ∈Λn (h)

Also, E[νh′ (P (h) − Wλ (h))νh |Z] nRn (h, λ) h∈Hn λ∈Λn (h) sup

sup

suph∈Hn supλ∈Λn (h) E[νh′ (P (h) − Wλ (h))νh |Z] inf h∈Hn inf λ∈Λn (h) nRn (h, λ) ( )−1 ∑ 2 ≤ C ph E[νhi |zi ] inf inf nRn (h, λ) ,



h∈Hn

h∈Hn λ∈Λn (h)


which converges to zero by Assumptions 4.1 (iii) and (iv). 2 Lemma A.9 Suppose that Assumptions 4.2, 4.3 (i)-(ii) and 4.4 hold. Then β ′ X ′ (P (h) − W ˆ λ (h))Xh βh p h h sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h) ˜ (h)Xh . Hence, w.p.a.1, we have Proof. Notice that P (h)Xh = W

= = = ≤

1 ′ ′ ˆ λ (h))Xh βh βh Xh (P (h) − W n 1 ′ ′ ˜ (h) − W ˆ λ (h))Xh βh βh Xh (W n (( )−1 ( )−1 ) 1 ′ ′ 1 ′ 1 ′ 1 1 ′ βh Xh P (h)Xh X P (h)Xh − X P (h)Xh + λΩh X P (h)Xh βh n n h n h n n h ( )−1 1 ′ ′ 1 βh Xh P (h)Xh 1 Xh′ P (h)Xh + 1 λΩh λΩh βh n n n n C ′ C λβ Ωh βh ≤ λ. n h n

Hence, by Assumptions 4.3 (i), (ii) and 4.4 (i) β ′ X ′ (P (h) − W ˆ λ (h))Xh βh h h sup sup ≤ nRn (h, λ) h∈Hn λ∈Λn (h)

C suph supλ λ inf h∈Hn nRn (h, λ)

Hence we obtain the result. 2 Lemma A.10 Suppose that Assumptions 4.2, 4.3 (i)-(ii) and 4.4 hold. Then β ′ X ′ (P (h) − W ˆ λ (h))(P (h) − W ˆ λ (h))Xh βh p h h sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h)

Proof. Observe that

= =

= ≤

1 ′ ′ ˆ λ (h))(P (h) − W ˆ λ (h))Xh βh βh Xh (P (h) − W n 1 ′ ′ ˜ (h) − W ˆ λ (h))(W ˜ (h) − W ˆ λ (h))Xh βh βh Xh (W n (( )−1 ( )−1 ) 1 ′ ′ 1 1 ′ ′ βh Xh P (h)Xh X P (h)Xh − X P (h)Xh + λΩh n n h n h (( )−1 ( )−1 ) 1 ′ 1 ′ 1 ′ 1 ′ × Xh P (h)Xh Xh P (h)Xh − Xh P (h)Xh + λΩh Xh P (h)Xh βh n n n n ( )−1 )−1 ( 1 ′ ′ 1 1 ′ βh Xh P (h)Xh 1 Xh′ P (h)Xh + λΩh X P (h)X + λΩ λΩ λΩ β h h h h h h n n n n Cλ ||λΩh ∥ βh′ Ωh βh ≤ Cλ.


Hence the result follows from Assumptions 4.3 (i), (ii) and 4.4 (i). 2 Lemma A.11 Suppose that Assumptions 4.2, 4.3 (i), (iii) and 4.4 (ii) hold. Then tr(W ˆ λ (h)) − tr(Wλ (h)) p sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h)

Proof. Observe that { } −1 tr (Xh′ P (h)Xh + nλΩh ) Xh P (h)Xh {( } )−1 = ph − tr n−1 Xh′ P (h)Xh + λΩh λΩh .

ˆ λ (h)) = tr(W

Similarly, tr(Wλ (h)) = ph − tr

{(

n−1 Fh′ P (h)Fh + λΩh

)−1

} λΩh .

Thus, we have {( ) } 1 ′ 1 ′ ˆ Fh P (h)Fh − Xh P (h)Xh λΩh tr(Wλ (h)) − tr(Wλ (h)) ≤ C tr n n ( ( ) ) 1 1 ′ ′ ≤ C tr ηh P (h)ηh + C tr Fh P (h)ηh n n = Op (lh ph /n) + Op (lp1/2 ph /n1/2 ). Thus the result follows from Assumption 4.3 (i) and (iii). 2

Proof of Theorem 4.1 The outline of the proof is similar to that of Theorem 3.1. Observe that Cn (h, λ)

) 1 2 1( ′ Ln (h, λ) + e′ P (h)e + e′ P (h) (µ − µ e P (h)e − tr(P (h))σ 2 ˆ(h)) + n n n ) 2( ′ˆ 2 ˆ λ (h))σ − e Wλ (h)e − tr(W n ) ( ) 2 ( ˆ λ (h) Xh βh + 2 e′ P (h) − W ˆ λ (h) b(h) = Ln (h, λ) + e′ P (h) − W n n ) ) 2( ′ 1( ′ 2 ˆ λ (h)e − tr(W ˆ λ (h))σ 2 . e P (h)e − tr(P (h))σ − eW + n n

=

By the CSI, √ √ e′ (P (h) − W ′ (P (h) − W ˆ λ (h))Xh βh ˆ ˆ βh′ Xh′ (P (h) − W (h))X β e (h))e λ h h λ . ≤ nRn (h, λ) nRn (h, λ) nRn (h, λ) Moreover, e′ (P (h) − W ˆ λ (h))e nRn (h, λ)

′ e (P (h) − Wλ (h))e − E[e′ (P (h) − Wλ (h))e|Z] ≤ nRn (h, λ) ˆ λ (h))e E[e′ (P (h) − Wλ (h))e|Z] e′ (Wλ (h) − W + + . nRn (h, λ) nRn (h, λ)


By the Whittle’s inequality, [ ] 2m E |e′ (P (h) − Wλ (h))e − E[e′ (P (h) − Wλ (h))e|Z]| |Z m

≤ C (tr ((P (h) − Wλ (h))(P (h) − Wλ (h)))) . Thus, by the MI, (

) ′ e (P (h) − Wλ (h))e − E[e′ (P (h) − Wλ (h))e|Z] ≥ δ|Z sup P nRn (h, λ) h∈Hn ,λ∈Λn (h) C ∑ ∑ (tr((P (h) − Wλ (h))(P (h) − Wλ (h))))m ≤ → 0. δ2 (nRn (h, λ))2m h∈Hn λ∈Λn (h)

Also, tr(P (h) − Wλ (h))σ 2 E[e′ (P (h) − Wλ (h))e|Z] < C, sup sup sup sup ≤ h∈H nRn (h, λ) tr(Wλ (h)Wλ (h))σ 2 h∈Hn λ∈Λn (h) n λ∈Λn (h) which shows

′ e (P (h) − Wλ (h))e = Op (1). sup nRn (h, λ)

sup

(A.4)

h∈Hn λ∈Λn (h)

Hence, by Lemmas A.5 and A.8, we have e′ (P (h) − W ˆ λ (h))Xh βh p sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h) Next we have e′ (P (h) − W ˆ λ (h))b(h) nRn (h, λ)

e′ (P (h) − W ˆ λ (h))b(h) − e′ (P (h) − Wλ (h))b(h) ≤ nRn (h, λ) ′ ′ e (P (h) − Wλ (h))g(h) e (P (h) − Wλ (h))νh + . + nRn (h, λ) nRn (h, λ)

The supremum of the first term is op (1) by the CSI and Lemmas A.5 and A.6. By the Whittle’s inequality, [ ] 2m E |e′ (P (h) − Wλ (h))g(h)| |Z

≤ C (g(h)′ (P (h) − Wλ (h))(P (h) − Wλ (h))g(h)) ≤ C(nRn (h, λ))m .

Thus, the MI and Assumption 4.1 (iii) imply that ( ) ′ e (P (h) − Wλ (h))g(h) P sup sup > δ|Z nRn (h, λ) h∈Hn λ∈Λn (h) [ ′ ] 2m ∑ ∑ |Z −2m E |e (P (h) − Wλ (h))g(h)| ≤ δ (nRn (h, λ))2m h∈Hn λ∈Λn (h)



C ∑ δ 2m



(nRn (h, λ))−m → 0.

h∈Hn λ∈Λn (h)


m

Also, by CSI, (A.4) and Lemma A.7, ′ e (P (h) − Wλ (h))νh sup sup nRn (h, λ) h∈Hn λ∈Λn (h) √ √ νh′ (P (h) − Wλ (h))νh e′ (P (h) − Wλ (h))e ≤ sup sup sup sup nRn (h, λ) nRn (h, λ) h∈Hn λ∈Λn (h) h∈Hn λ∈Λn (h) p



0.

Combining these results, we obtain

e′ (P (h) − W ˆ λ (h))b(h) p sup sup → 0. nRn (h, λ) h∈Hn λ∈Λn (h)

Thus we have sup

Cn (h, λ) − Ln (h, λ) p → 0. sup Rn (h, λ)

h∈Hn λ∈Λn (h)

by Lemmas A.5 and A.10. Next, we have Ln (h, λ) − Rn (h, λ) Rn (h, λ) e′ W ˆ λ (h)e − e′ Wλ (h)Wλ (h)e e′ Wλ (h)Wλ (h)e − E[e′ Wλ (h)Wλ (h)e|Z] ˆ λ (h)W ≤ + nRn (h, λ) nRn (h, λ) β ′ X ′ (P (h) − W ˆ λ (h))(P (h) − W ˆ λ (h))Xh βh + h h nRn (h, λ) b(h)′ (P (h) − W ˆ λ (h))(P (h) − W ˆ λ (h))b(h) b(h)′ (P (h) − Wλ (h))(P (h) − Wλ (h))b(h) − + nRn (h) nRn (h, λ) b(h)′ (P (h) − W (h))(P (h) − W (h))b(h) g(h)′ (P (h) − W (h))(P (h) − W (h))g(h) λ λ λ λ − + nRn (h, λ) nRn (h, λ) µ′ (P (h) − W ˆ λ (h))W ˆ λ (h)e (A.5) + 2 . nRn (h, λ) The above inequality holds even if we take the supremum of each term. We suppress the symbol sup for notational simplicity. The first and third terms of (A.5) are op (1) by Lemmas A.5 and A.9. By Whittle’s theorem, we have [ ] 2m E |e′ Wλ (h)Wλ (h)e − E[e′ Wλ (h)Wλ (h)e| |Z] ≤ Ctr(Wλ (h)Wλ (h)Wλ (h)Wλ (h))m ≤ C(λmax (Wλ (h)))2m tr(Wλ (h)Wλ (h))m ≤ C(nRn (h, λ))m . Hence the second term of (A.5) converges to zero in probability. The forth term is op (1) by Lemma A.6. The fifth term is decomposed as b(h)′ (P (h) − W (h))(P (h) − W (h))b(h) g(h)′ (P (h) − W (h))(P (h) − W (h))g(h) λ λ λ λ − nRn (h, λ) nRn (h, λ) ′ νh (P (h) − Wλ (h))(P (h) − Wλ (h))νh + 2 νh (P (h) − Wλ (h))(P (h) − Wλ (h))g(h) . ≤ nRn (h, λ) nRn (h, λ)


By Whittle's inequality, we have
\[
E\Bigl[\bigl|\nu_h'\bigl(P(h)-W_\lambda(h)\bigr)\bigl(P(h)-W_\lambda(h)\bigr)g(h)\bigr|^{2m} \,\Big|\, Z\Bigr]
\le C\Bigl(g(h)'\bigl(P(h)-W_\lambda(h)\bigr)\bigl(P(h)-W_\lambda(h)\bigr)g(h)\Bigr)^{m}
\le C\bigl(nR_n(h,\lambda)\bigr)^{m}.
\]
Thus, by Lemma A.7, the fifth term is $o_p(1)$. The sixth term of (A.5) is bounded by
\[
\begin{aligned}
\frac{\mu'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e}{nR_n(h,\lambda)}
&\le \frac{b(h)'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e - b(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)}
 + \frac{b(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)} \\
&\quad + \frac{\beta_h'X_h'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e}{nR_n(h,\lambda)} \\
&\le \frac{b(h)'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e - b(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)}
 + \frac{\nu_h'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)} \\
&\quad + \frac{g(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)}
 + \frac{\beta_h'X_h'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e}{nR_n(h,\lambda)}.
\end{aligned}
\]
By the CSI and Lemmas A.5-A.6, we obtain
\[
\sup_{h\in\mathcal{H}_n}\sup_{\lambda\in\Lambda_n(h)}
\frac{b(h)'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e - b(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)} \xrightarrow{p} 0.
\]
Also, by the CSI and Lemma A.7,
\[
\sup_{h\in\mathcal{H}_n}\sup_{\lambda\in\Lambda_n(h)} \frac{\nu_h'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)} \xrightarrow{p} 0.
\]

Moreover, Whittle's inequality gives
\[
E\Bigl[\bigl|g(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e\bigr|^{2m} \,\Big|\, Z\Bigr]
\le C\operatorname{tr}\Bigl(g(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)W_\lambda(h)\bigl(P(h)-W_\lambda(h)\bigr)g(h)\Bigr)^{m}
\le C\bigl(nR_n(h,\lambda)\bigr)^{m}.
\]
Thus we have
\[
\sup_{h\in\mathcal{H}_n}\sup_{\lambda\in\Lambda_n(h)} \frac{g(h)'\bigl(P(h)-W_\lambda(h)\bigr)W_\lambda(h)e}{nR_n(h,\lambda)} \xrightarrow{p} 0.
\]

Finally, by the CSI and Lemma A.9, we have
\[
\sup_{h\in\mathcal{H}_n}\sup_{\lambda\in\Lambda_n(h)} \frac{\beta_h'X_h'\bigl(P(h)-\hat{W}_\lambda(h)\bigr)\hat{W}_\lambda(h)e}{nR_n(h,\lambda)} \xrightarrow{p} 0,
\]
and thus the sixth term of (A.5) is $o_p(1)$. Combining the above results, we have
\[
\sup_{h\in\mathcal{H}_n}\sup_{\lambda\in\Lambda_n(h)} \frac{L_n(h,\lambda) - R_n(h,\lambda)}{R_n(h,\lambda)} \xrightarrow{p} 0,
\]
and hence the result follows. $\Box$


Derivation of $\tau_h$. Let $Q_h = X_h'X_h/n$ and $H_h = X_h'P(h)X_h/n$. Define $\bar{\gamma}_h = H_h^{1/2}\gamma_h$. Then we have
\[
\tau_h^2
= \sup_{\gamma_h} \frac{\gamma_h'Q_h\gamma_h}{\gamma_h'H_h\gamma_h}
= \sup_{\gamma_h} \frac{\gamma_h'H_h^{1/2}H_h^{-1/2}Q_hH_h^{-1/2}H_h^{1/2}\gamma_h}{\gamma_h'H_h^{1/2}H_h^{1/2}\gamma_h}
= \sup_{\bar{\gamma}_h} \frac{\bar{\gamma}_h'H_h^{-1/2}Q_hH_h^{-1/2}\bar{\gamma}_h}{\bar{\gamma}_h'\bar{\gamma}_h}
= \lambda_{\max}\bigl(H_h^{-1/2}Q_hH_h^{-1/2}\bigr)
= \lambda_{\max}\bigl(Q_hH_h^{-1}\bigr). \qquad \Box
\]
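As an illustration only, the last identity, $\tau_h^2 = \lambda_{\max}(Q_hH_h^{-1})$, can be checked numerically. The sketch below uses simulated data with hypothetical dimensions (nothing here comes from the paper's design); it compares the eigenvalue with a crude Monte Carlo approximation of $\sup_{\gamma_h} (\gamma_h'Q_h\gamma_h)/(\gamma_h'H_h\gamma_h)$.

```python
import numpy as np

# Minimal numerical check of tau_h^2 = lambda_max(Q_h H_h^{-1}).
# The data-generating step is purely illustrative (hypothetical sizes),
# not the design used anywhere in the paper.
rng = np.random.default_rng(0)
n, k, q = 500, 3, 6                        # observations, regressors, instruments
Z = rng.normal(size=(n, q))                # instruments
X = Z @ rng.normal(size=(q, k)) + 0.5 * rng.normal(size=(n, k))  # regressors correlated with Z

P = Z @ np.linalg.solve(Z.T @ Z, Z.T)      # projection P(h) onto the column space of Z
Q = X.T @ X / n                            # Q_h = X_h' X_h / n
H = X.T @ P @ X / n                        # H_h = X_h' P(h) X_h / n

tau_sq = np.max(np.linalg.eigvals(Q @ np.linalg.inv(H)).real)   # lambda_max(Q_h H_h^{-1})

# Monte Carlo approximation of sup_gamma (gamma' Q gamma) / (gamma' H gamma)
gammas = rng.normal(size=(100000, k))
num = np.einsum("ij,jk,ik->i", gammas, Q, gammas)
den = np.einsum("ij,jk,ik->i", gammas, H, gammas)
print(tau_sq, (num / den).max())           # the ratio maximum approaches tau_sq from below
```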

References

Akaike, H. (1970): "Statistical Predictor Identification," Annals of the Institute of Statistical Mathematics, 22, 203-217.

Akaike, H. (1973): "Information Theory and an Extension of the Maximum Likelihood Principle," in Second International Symposium on Information Theory, ed. by B. N. Petrov, and F. Csaki, pp. 267-281. Akademiai Kiado.

Andrews, D. W. (1991): "Asymptotic Optimality of Generalized CL, Cross-Validation, and Generalized Cross-Validation in Regression with Heteroskedastic Errors," Journal of Econometrics, 47, 359-377.

Andrews, D. W. (1999): "Consistent Moment Selection Procedures for Generalized Method of Moments Estimation," Econometrica, 67, 543-564.

Andrews, D. W., and B. Lu (2001): "Consistent Model and Moment Selection Procedures for GMM Estimation with Application to Dynamic Panel Data Models," Journal of Econometrics, 101, 123-164.

Blundell, R., X. Chen, and D. Kristensen (2007): "Semi-Nonparametric IV Estimation of Shape-Invariant Engel Curves," Econometrica, 75, 1613-1669.

Carrasco, M., J.-P. Florens, and E. Renault (2007): "Linear Inverse Problems in Structural Econometrics: Estimation Based on Spectral Decomposition and Regularization," in Handbook of Econometrics, ed. by J. Heckman, and E. Leamer, vol. 6. Elsevier.

Chen, X., V. Chernozhukov, S. Lee, and W. K. Newey (2011): "Local Identification of Nonparametric and Semiparametric Models," Discussion Paper 1795, Cowles Foundation.

Chen, X., and D. Pouzo (2012): "Estimation of Nonparametric Conditional Moment Models with Possibly Nonsmooth Generalized Residuals," Econometrica, 80, 277-321.

Craven, P., and G. Wahba (1979): "Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.


Darolles, S., Y. Fan, J. P. Florens, and E. Renault (2011): "Nonparametric Instrumental Regression," Econometrica, 79, 1541-1565.

Donald, S. G., G. W. Imbens, and W. K. Newey (2009): "Choosing Instrumental Variables in Conditional Moment Restriction Models," Journal of Econometrics, 152, 28-36.

Donald, S. G., and W. K. Newey (2001): "Choosing the Number of Instruments," Econometrica, 69, 1161-1191.

Gagliardini, P., and O. Scaillet (2012): "Tikhonov Regularization for Nonparametric Instrumental Variable Estimators," Journal of Econometrics, 167, 61-75.

Gasser, T., L. Sroka, and C. Jennen-Steinmetz (1986): "Residual Variance and Residual Pattern in Nonlinear Regression," Biometrika, 73, 625-633.

Hall, A. R., and F. P. M. Peixe (2003): "A Consistent Method for the Selection of Relevant Instruments," Econometric Reviews, 22, 268-287.

Hall, P., and J. L. Horowitz (2005): "Nonparametric Methods for Inference in the Presence of Instrumental Variables," Annals of Statistics, 33, 2904-2929.

Hall, P., J. W. Kay, and D. M. Titterington (1990): "Asymptotically Optimal Difference-Based Estimation of Variance in Nonparametric Regression," Biometrika, 77, 521-528.

Hansen, B. E. (2007): "Least Squares Model Averaging," Econometrica, 75, 1175-1189.

Hong, H., B. Preston, and M. Shum (2003): "Generalized Empirical Likelihood-Based Model Selection Criteria for Moment Condition Models," Econometric Theory, 19, 923-943.

Horowitz, J. L. (2011): "Adaptive Nonparametric Instrumental Variable Estimation: Empirical Choice of the Regularization Parameter," Unpublished Manuscript, Northwestern University.

Horowitz, J. L. (2012): "Specification Testing in Nonparametric Instrumental Variable Estimation," Journal of Econometrics, 167, 383-396.

Kress, R. (1999): Linear Integral Equations. Springer.

Li, K.-C. (1986): "Asymptotic Optimality of CL and Generalized Cross-Validation in Ridge Regression with Application to Spline Smoothing," Annals of Statistics, 14, 1101-1112.

Li, K.-C. (1987): "Asymptotic Optimality for Cp, CL, Cross-Validation, and Generalized Cross-Validation: Discrete Index Set," Annals of Statistics, 15, 958-975.

Mallows, C. L. (1973): "Some Comments on Cp," Technometrics, 15, 661-675.

Newey, W. K., and J. L. Powell (2003): "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71, 1565-1578.


Nychka, D., G. Wahba, S. Goldfarb, and T. Pugh (1984): "Cross-Validated Spline Methods for the Estimation of Three-Dimensional Tumor Size Distributions from Observations on Two-Dimensional Cross Sections," Journal of the American Statistical Association, 79, 832-846.

O'Sullivan, F. (1986): "A Statistical Perspective on Ill-Posed Inverse Problems," Statistical Science, 1, 502-518.

Pesaran, M. H., and R. J. Smith (1994): "A Generalized R2 Criterion for Regression Models Estimated by the Instrumental Variables Method," Econometrica, 62, 705-710.

Schwarz, G. (1978): "Estimating the Dimension of a Model," Annals of Statistics, 6, 461-464.

Shao, J. (1993): "Linear Model Selection by Cross-Validation," Journal of the American Statistical Association, 88, 486-494.

Shao, J. (1997): "An Asymptotic Theory for Linear Model Selection," Statistica Sinica, 7, 221-264.

Shibata, R. (1980): "Asymptotically Efficient Selection of the Order of the Model for Estimating Parameters of a Linear Process," Annals of Statistics, 8, 147-164.

Shibata, R. (1981): "An Optimal Selection of Regression Variables," Biometrika, 68, 45-54.

Shibata, R. (1984): "Approximate Efficiency of a Selection Procedure for the Number of Regression Variables," Biometrika, 71, 43-49.

Stone, M. (1974): "Cross-Validatory Choice and Assessment of Statistical Prediction," Journal of the Royal Statistical Society, Series B, 36, 111-147.

Whittle, P. (1960): "Bounds for the Moments of Linear and Quadratic Forms in Independent Variables," Theory of Probability and Its Applications, 5, 302-305.

