Principal Components for Regression: a conditional point of view

Liliana Forzani

January 16, 2007


Contents

Introduction

1 A connection
  1.1 Principal Components and Sufficient Dimension Reduction in Regression
  1.2 Principal Component Analysis in Regression
  1.3 Sufficient Dimension Reduction in Regression
  1.4 A connection between PCR and Sufficient Dimension Reduction
  1.5 Conclusions

2 Some results
  2.1 Introduction
  2.2 The results
  2.3 Inference
  2.4 Some simulations
  2.5 Conclusions

3 Further work
  3.1 Discrimination
  3.2 Subpopulations
  3.3 Exact distribution theory for testing
  3.4 Diagonal case without fitting νy
  3.5 Correlation matrix theory of principal component analysis
  3.6 Other possible questions
  3.7 Infinite dimensional case

A Proofs of the results from Chapter 2

B Algorithm for the Diagonal case

C The generalized inverse

Bibliography

Introduction

Many statistical applications deal with high dimensional data. The difficulty with this kind of data lies in visualization and computation, areas that have become more important with the advent of computing and graphics technology and the availability of more data. Having more data should work in our favor; nevertheless, well known problems arise in ordinary least squares for the linear regression of a response Y onto a set of predictors X = (X_1, . . . , X_p)^T when the number of predictors p is large and the predictors are highly correlated. In some ways, having more information makes the problem harder. Because of this, a great deal of work has been done on reducing the dimensionality of the data. Generally speaking, that means substituting some function of the predictors for the set X = (X_1, . . . , X_p)^T.

There are many methods to reduce the dimension of the predictors. A partial list is Principal Component Regression, Partial Least Squares, Continuum Regression, the LASSO, Factor Analysis, latent variables, and the inverse regression methods such as SIR (Sliced Inverse Regression; Li, 1991) and SAVE (Sliced Average Variance Estimation; Cook and Weisberg, 1991). The focus of this proposal is reduction of the dimension of the predictors from the inverse regression point of view. Assuming that the inverse regression X|Y follows a normal distribution with covariance independent of Y, Cook (2007) was able to find the minimal sufficient reduction for the regression of Y|X, that is, the linear combinations of the predictors X that carry everything relevant to the regression of Y|X, in the sense that Y|X ∼ Y|β^T X for β ∈ R^{p×d} with d ≤ p the smallest possible number. In this proposal we present the maximum likelihood estimators for such combinations (Chapter 2).

Chapter 1 is about Principal Component Regression, the setting for this new way of studying reduction in regression, and a connection between Principal Component Regression and this new setting. Chapter 2 presents the maximum likelihood estimators for the dimension reduction combinations (the proofs are in Appendix A). Chapter 3 is about further work and some extensions to the functional data case, when the predictors are functions instead of vectors.


Chapter 1

A connection

1.1 Principal Components and Sufficient Dimension Reduction in Regression

When the Google search engine is used to obtain information about Principal Component Analysis (PCA), almost 1,400,000 entries appear (the exact number depends on the browser, computer and internet connection), and these are almost always associated with techniques to reduce dimensions. Therefore it seems difficult to add new knowledge about PCA for dimension reduction. Nevertheless, Cook (2007) has introduced a connection between PCA and dimension reduction in the context of regression. In Section 1.2 I will give the definition of PCA and explain the use of PCA in regression. Section 1.3 is about dimension reduction in regression, as defined by Li (1991) and Cook (1998). Finally, in Section 1.4 I will present a new connection between these two theories.

1.2 Principal Component Analysis in Regression

In multiple regression, when least squares estimation is used, one of the major difficulties is multicollinearity. In most cases it shows up as high correlation between the predictor variables, and it is generally associated with unstable coefficient estimates with large variances. When such an issue is present there are different ways to deal with it. One possible approach is to use only a subset of covariates that does not exhibit multicollinearity. Several selection methods have been proposed, and some of them are based on Principal Components (PCs).

Principal Component Regression (PCR) for the regression of Y on X = (X_1, . . . , X_p)^T is a general technique that replaces the set of p variables X_1, . . . , X_p by some linear combinations that satisfy specific properties. The first linear combination considered is α_1^T X such that ||α_1||_2 = 1 and α_1^T X maximizes var[α_1^T X] = α_1^T Σ α_1, where Σ is the covariance or correlation matrix of X. The constraint is needed, since otherwise this maximum would be infinite. It is a straightforward computation


to prove that the maximum is achieved at α_1 equal to the eigenvector corresponding to the largest eigenvalue of Σ. The first principal component (PC) is Z_1 = α_1^T X. Once Z_1, . . . , Z_k (k < p) are defined, the (k+1)-st PC of X is Z_{k+1} = α_{k+1}^T X, where α_{k+1} maximizes var[α_{k+1}^T X] = α_{k+1}^T Σ α_{k+1} subject to ||α_{k+1}||_2 = 1 and α_{k+1}^T α_j = 0 for j = 1, . . . , k; that is, α_{k+1} is the eigenvector of Σ corresponding to the (k+1)-st eigenvalue. The constraint α_i^T α_j = 0 for i ≠ j is equivalent to requiring that Z_i and Z_j be uncorrelated. Indeed, if λ_j is the eigenvalue of Σ corresponding to the eigenvector α_j, we have cov(Z_k, Z_j) = α_k^T Σ α_j = λ_j α_k^T α_j = 0.

One of the simplest and most popular uses of PCs in regression is to substitute the first k principal components Z_1, . . . , Z_k, with k < p, for the p covariates. After the principal components are obtained, the regression of Y on Z_1, . . . , Z_k is carried out. Sometimes an analogous procedure uses the correlation matrix of X instead of the covariance matrix. The algorithm can be summarized as

1. Transform the data as Z = (diag(Cov(X)))^{-1/2} X.

2. Compute the first k eigenvectors A = (a_1, . . . , a_k) of the covariance matrix of Z (the correlation matrix of X).

3. Regress Y on ((diag(Cov(X)))^{-1/2} A)^T X.

Once the data are observed, the sample covariance or correlation matrix is used to estimate the PCs. Since the covariance matrix of the marginal distribution of X is used, the reduction is done without taking into account that the ultimate goal is to predict Y as a function of X. The rationale behind this use of PCs in regression is that, because the principal components are orthogonal, the problem of multicollinearity disappears completely, and the regression equation will always involve all the variables in X. Deletion based on variance is an attractive and simple strategy, and the next property adds some respectability to it.

Property 1.2.1 (Jolliffe, 2002) Suppose that X consists of n observations on p predictor variables measured about their sample means, and that the corresponding regression equation is

Y = Xβ + ε,   (1.1)

where Y is the vector of n observations on the dependent variable, again measured about the sample mean. Suppose X is transformed by the equation Z = XB, where B is a p × p orthogonal matrix.


The regression equation can be rewritten as

Y = Zγ + ε,   (1.2)

where γ = B^{-1}β. The usual least squares estimator of γ is γ̂ = (Z^T Z)^{-1} Z^T y. Then the elements of γ̂ have, successively, the smallest possible variances if B = A, the matrix whose k-th column is the k-th eigenvector of X^T X, and hence the k-th eigenvector of the sample covariance matrix. Thus Z consists of the values of the sample principal components of X.

This property seems to imply that replacing the predictor variables in the regression analysis by their first few principal components is a good idea, since the omitted principal components have coefficients that are estimated without much precision. The problem with this argument is that in regression analysis we should take into account the mean squared error; that is, not only the variance but also the bias of the estimator. Considering the regression of Y on the first principal components takes care of the variance part. Nevertheless, since it does not take into account the relation between the dependent variable and the predictors, nothing can be said about the bias in the regression, as we now show (Jolliffe, 2002). If we denote by l_k the k-th eigenvalue, with corresponding eigenvector a_k, of the covariance matrix of the predictors, it is not difficult to prove that the ordinary least squares estimator of β is

β̂ = Σ_{k=1}^{p} l_k^{-1} a_k a_k^T X^T y   (OLS regression),

while if we consider the regression on the first m PCs we get the estimator (in the X scale)

β̂_m = Σ_{k=1}^{m} l_k^{-1} a_k a_k^T X^T y   (PC regression).

From these two equations we get that

E(β̂) = β,   var(β̂) = σ² Σ_{k=1}^{p} l_k^{-1} a_k a_k^T,

and

E(β̂_m) = β − Σ_{k=m+1}^{p} a_k a_k^T β,   var(β̂_m) = σ² Σ_{k=1}^{m} l_k^{-1} a_k a_k^T.

From these equations it is clear that if we throw away the last principal components in the regression we may get a great deal of reduction in the variance of the estimation. But we may also be introducing bias. There are other strategies to delete PCs based essentially on values of t-statistics measuring


the marginal contribution of each PC to the regression equation, but there are two issues with this approach. The first is that the principal components were built without taking the response into account, and the second is that the power of the t-test for the last principal components is small compared with that for the high-variance components.

Some authors (Mansfield, 1977; Gunst and Mason, 1980; Mosteller and Tukey, 1977) argue that the components with small variance are unlikely to be important in regression, namely in the bias part. However, it is not difficult to find real data examples where the last components, i.e. the ones with smallest variance, are the only ones needed. Jolliffe (1982) gave four real data sets where he showed that the first principal components contain almost no information about the regression. Instead of presenting those data sets once again, I will try to explain his idea with an easy example. Suppose we have a model like Y = α + β(X_1 − X_2) + ε, where X_1 and X_2 have equal variance and are highly correlated. That means the first eigenvector will be in the direction of X_1 + X_2 and the second one in the direction of X_1 − X_2. A regression based on only the first principal component throws away all the information we have, since in this case the second component is clearly the one of interest. In this example the bias we add by throwing away the second principal component is much bigger than the reduction in variance gained by its elimination. The problem here is clear: we ignore the response, which is our ultimate goal.

Jolliffe (1982) demonstrated with some examples that components with small variance can be as important as those with large variance. He pointed out that the idea of selection based solely on the size of the variance should have been buried forever. Nevertheless, his suggestion has gone unheeded, and practitioners use principal components now even more than before. Bair et al. (2006) introduced Supervised Principal Components. The idea is first to reduce the number of predictors, throwing away those with low correlation with the response, and then to apply principal component analysis to the remaining predictors. It is clear from the example above that we can add predictors without any correlation with the response that will be thrown away in the first screening, leaving only X_1 and X_2 to survive it. But the first linear combination of these two remaining predictors again loses all the information.
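To make the toy example concrete, the following short simulation is a minimal sketch (assuming numpy is available; the correlation, coefficients and sample size are hypothetical choices) showing the first principal component carrying essentially no information about the response while the second carries almost all of it.

```python
# Hypothetical illustration of the toy model Y = alpha + beta*(X1 - X2) + error
# with highly correlated X1, X2 (numpy assumed; all constants are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n = 500
cov = np.array([[1.0, 0.95], [0.95, 1.0]])      # equal variances, high correlation
X = rng.multivariate_normal(np.zeros(2), cov, size=n)
y = 2.0 + 3.0 * (X[:, 0] - X[:, 1]) + 0.1 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]                 # largest eigenvalue first
Z = Xc @ evecs[:, order]                        # principal component scores

for j in range(2):
    r2 = np.corrcoef(Z[:, j], y)[0, 1] ** 2
    print(f"PC{j + 1}: variance {evals[order][j]:.2f}, R^2 with Y {r2:.3f}")
# The first PC (direction of X1 + X2) has the largest variance but essentially no
# correlation with Y; the second PC (direction of X1 - X2) carries the signal.
```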

1.3 Sufficient Dimension Reduction in Regression

Regression of a response Y on a vector of predictors X = (X_1, . . . , X_p)^T is the study of the conditional distribution of Y|X. When p is small it is possible, with the help of graphs, to study this type of regression in a straightforward way. When p is large, additional methodology is required to make progress. A model-free approach is through dimension reduction (Li, 1991, and Cook, 1998), in which X is replaced by a function R(X) of X in such a way that there is no loss of information.


R(X) will be a sufficient reduction if

Y | X ∼ Y | R(X).   (1.3)

One approach to dimension reduction focuses on linear combinations of the predictors, so that S = R(R^p) is a subspace of R^p of dimension d ≤ p, called a dimension reduction subspace (DRS). There is special interest in the subspace with the smallest dimension. Such a subspace does not necessarily exist, but under mild conditions the minimum subspace is uniquely defined and coincides with the intersection of all subspaces that are the image of a linear function R of X with property (1.3) (Cook, 1998). This intersection is called the central subspace and is denoted by S_{Y|X}. Let d = dim(S_{Y|X}) and let η be a p × d matrix whose columns span the central subspace. Then S_{Y|X} can be represented by S_η, the subspace of R^p spanned by the columns of η. The central subspace is a tool to reduce high dimensional regressions since it compresses all the information into a few combinations of the predictors. This facilitates model building, and when the dimension d of S_{Y|X} is small the regression can be managed entirely by regression graphics, as discussed by Cook (1998).

Many methods have been proposed to estimate the central subspace. The first one, and the one most discussed in the literature, is SIR (Sliced Inverse Regression; Li, 1991). Under mild conditions (linearity of the predictors), SIR estimates a portion of the central subspace:

Σ^{-1} S_{E(X|Y)} = Σ^{-1} span{E(X|y) − E(X) : y ∈ domain of Y} ⊂ S_{Y|X},

where Σ is the marginal covariance of X. The idea is that, in order to say something about the regression of Y on X, the inverse regression of X on Y is studied. The idea of considering the inverse regression had appeared before in discriminant analysis and case-control studies, but Li was the first to introduce it in the context of regression. In the population setting, the portion of the central subspace that is estimated is spanned by Σ^{-1} times the first dim(S_{E(X|Y)}) eigenvectors of cov(E(X|Y)). For the estimation, these quantities are replaced by their sample versions. Although SIR did not become a second standard method for regression, as Chen and Li (1998) suggested it could, a great amount of work has been done around SIR. Nevertheless, the basic concept behind SIR did not change.
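As an illustration of the SIR estimator just described, here is a minimal sketch (assuming numpy; the function name, the slicing scheme and the default number of slices are our own illustrative choices, not part of Li's proposal).

```python
# A sketch of SIR: slice the response, average X within slices, and take
# Sigma^{-1} times the leading eigenvectors of the covariance of the slice means.
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Estimate a basis of Sigma^{-1} span{E(X|y) - E(X)} by slicing y."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    slices = np.array_split(np.argsort(y), n_slices)   # roughly equal-size slices
    M = np.zeros((p, p))                               # estimate of cov(E(X|Y))
    for idx in slices:
        m = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    evals, evecs = np.linalg.eigh(M)
    lead = evecs[:, np.argsort(evals)[::-1][:d]]
    return np.linalg.solve(Sigma, lead)                # columns span the estimate
```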

1.4 A connection between PCR and Sufficient Dimension Reduction

In Section 1.2 we said that PCA is used in regression to reduce the dimension, replacing the p predictors by some linear combinations of them. Specifically, instead of regressing on X we use


the linear combinations α_1^T X, . . . , α_k^T X, where k < p and the α_i are the eigenvectors of the covariance matrix of X. In terms of subspaces, we hope that S_η with η = (α_1, . . . , α_k) is, at least approximately, a DRS. However, there is no guarantee that these vectors or this subspace will have any connection with the regression. They only indicate the combinations of X with the largest variance, and there is no theory that says S_η is a DRS. Although there are many differences between PCR and sufficient dimension reduction, the one I would like to highlight is that in PC regression the covariance or correlation matrix of the marginal X is used, while in sufficient reduction Y is involved.

The first question that Cook (2007) aims to answer is whether there is a case where the central subspace and the space spanned by the first PCs coincide. He has an answer if the following inverse regression model is assumed. The Principal Component (PC) model is

X_y = µ + Γν_y + σε,   (1.4)

where µ = E(X), Γ ∈ R^{p×d} with d < p, Γ^T Γ = I_d, σ ≥ 0, and d is assumed to be known. The coordinate vector ν_y ∈ R^d is unknown, with a positive definite covariance matrix Σ_ν and E(ν_Y) = 0. The error vector ε ∈ R^p is taken to be independent of Y and normally distributed, with mean 0 and the identity as covariance operator. This model places no restrictions on the mean but has a very restrictive covariance: conditional on the response, the predictors not only have to be independent but must also have the same variance.

The following proposition connects the Principal Component Model (1.4) with the forward regression of Y on X and with Principal Component Regression.

Lemma 1.4.1 (Cook, 2007) Under the Principal Component Model (1.4), the distribution of Y|X is the same as the distribution of Y|Γ^T X for all values of X. Moreover, S_Γ is the central subspace and it is spanned by the first d eigenvectors of the marginal covariance matrix of X.

Proof of Lemma 1.4.1: The proof of the first part of Lemma 1.4.1 can be found in Cook (2007). The second part follows directly from the fact that the marginal covariance of X is Σ = Γ(σ²I_d + Σ_ν)Γ^T + σ²Γ_0Γ_0^T, where Γ_0 ∈ R^{p×(p−d)} denotes an orthogonal completion of Γ; that is, (Γ, Γ_0) ∈ R^{p×p} is an orthogonal matrix.

The lemma says that the central subspace for the regression of Y given X is S_Γ, where Γ can be taken as the first d eigenvectors of the covariance matrix of the marginal distribution of X. Therefore, under this model, Γ^T X can be used instead of X for the regression of Y|X without loss of information.


An estimate of Γ is needed for the central subspace S_Γ, and by Lemma 1.4.1 the first d eigenvectors of the marginal sample covariance matrix of X can be used. It follows from Cook (2007) that the first d eigenvectors of the sample covariance matrix of X are the maximum likelihood estimators of a basis of the subspace generated by Γ. The maximum likelihood estimator of σ² is σ̂² = Σ_{j=d+1}^{p} λ_j / p, with λ_1 ≥ · · · ≥ λ_p the ordered eigenvalues of Σ̂.

In the Principal Component Model (1.4) the response is not used explicitly. But if Y is observed it does not make sense to throw that information away, and we should be able to use it when we estimate our reduction. One way to do this is to model ν_y. This can be facilitated, as suggested by Cook (2007), by graphical analysis investigating E(X_j|y), where X_j denotes the j-th predictor of X. Alternatively, one can use any convenient basis to model ν_y = βf_y, where f_y ∈ R^r is a known function of y with E(f_Y) = 0, and β ∈ R^{d×r}, d ≤ min(r, p), has rank d and is unknown. Substituting ν_y into Model (1.4) we get the Principal Fitted Component model (PFC; Cook, 2007):

X_y = µ + Γβf_y + σε.   (1.5)

In this model Γ^T X is still a sufficient reduction. It was proved in the same paper that a maximum likelihood estimate of a basis of S_Γ is given by the span of the first d eigenvectors of Σ̂_fit, the covariance matrix of the fitted values from the multivariate linear regression of X_y on f_y, including an intercept.

As we said before, the only restriction in the PC and PFC models is their covariance structure. We now consider an extension of Models (1.4) and (1.5) where the covariance of the conditional error ε is a general positive definite matrix ∆ that is independent of y. We call those models PC_∆ and PFC_∆, respectively.
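For the isotropic-error model (1.5), the estimator just described is easy to compute: take the span of the first d eigenvectors of Σ̂_fit, the covariance matrix of the fitted values. A minimal sketch (numpy assumed; the function name and interface are ours):

```python
# PFC with isotropic error: basis of S_Gamma from the first d eigenvectors of the
# covariance matrix of the fitted values of the regression of X on f_y.
import numpy as np

def pfc_isotropic_basis(X, Fy, d):
    """X: n x p predictors; Fy: n x r matrix whose rows are f_y; d: dimension."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Fc = Fy - Fy.mean(axis=0)
    B = np.linalg.lstsq(Fc, Xc, rcond=None)[0]     # r x p regression coefficients
    Xhat = Fc @ B                                  # centered fitted values
    Sigma_fit = Xhat.T @ Xhat / n
    evals, evecs = np.linalg.eigh(Sigma_fit)
    return evecs[:, np.argsort(evals)[::-1][:d]]   # estimated basis of S_Gamma
```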

Lemma 1.4.2 (Cook, 2007) Under Models PC_∆ and PFC_∆, the distribution of Y|X is the same as the distribution of Y|Γ^T∆^{-1}X for all values of X. Moreover, ∆^{-1}S_Γ is the central subspace for the regression of Y on X.

If ∆ is known, the maximum likelihood estimate of ∆^{-1}S_Γ is the span of ∆^{-1/2} times the first d eigenvectors of ∆^{-1/2}Σ̂∆^{-1/2} for Model PC_∆, and the span of ∆^{-1/2} times the first d eigenvectors of ∆^{-1/2}Σ̂_fit∆^{-1/2} for Model PFC_∆. If ∆ is unknown, there is not enough information in Model PC_∆ to estimate ∆ by maximum likelihood, and therefore Model PFC_∆ should be used. In the following chapter we give a formula for the maximum likelihood estimator of ∆ and for a basis of the central subspace ∆^{-1}S_Γ under Model PFC_∆, as well as a connection between Model PFC_∆ and Principal Component Regression.

1.5 Conclusions

In practice, when principal components are used in the regression context, there is some discussion over the proper scaling to use when estimating them; usually the discussion is whether the sample correlation or the sample covariance matrix should be used. As we stated before, when the predictors are independent given the response and have equal variance, the first d population principal components give a sufficient reduction, and a maximum likelihood estimator of the sufficient reduction is given by the first d sample principal components. In contrast, if the variances are not equal or conditional independence is not assumed, the scaling to be used should be different. We will show that under Model PFC_∆, to get a maximum likelihood estimator of the central subspace for the regression of Y|X, we need to take the first d principal components of the standardized predictors (standardized using ∆̂^{-1/2}) and then re-scale them back to the original scale.

Chapter 2

Some results

2.1 Introduction

In this chapter we state and prove some results about Model PFC_∆:

X_y = µ + Γν_y + ε,   (2.1)

where ν_y = βf_y, with f_y ∈ R^r a known function of y with E(f_Y) = 0, and β ∈ R^{d×r} of rank d (d ≤ min(r, p)) unknown. The error ε is normal with mean 0 and covariance matrix a general positive definite matrix ∆ that is independent of y. As we said in Chapter 1, in this model ∆^{-1}S_Γ is the central subspace. When ∆ is a constant times the identity matrix, the maximum likelihood estimate of a basis of the central subspace is given by the span of the first d eigenvectors of Σ̂_fit, the covariance matrix of the fitted values from the multivariate linear regression of X_y on f_y, including an intercept.

In this chapter we derive the maximum likelihood estimator of ∆^{-1}S_Γ for Model PFC_∆ with ∆ a general positive definite matrix. After that we present a connection between this model and principal component regression. We also present ways of testing the dimension of the central subspace, testing predictors, and testing conditional independence of the predictors. At the end we show some simulations and revisit Fearn's data (Fearn, 1983).

2.2 The results

First we show the maximum likelihood estimators of the parameters (µ, Γ, ∆) of Model PFC_∆. After that we show how we need to scale the predictors so that the principal components give a sufficient reduction for the regression of Y|X under Model PFC_∆. Let X denote the n × p matrix with rows (X_y − X̄)^T, let F denote the n × r matrix with rows (f_y − f̄)^T, and let X̂ = P_F X denote the n × p matrix of centered fitted values from the multivariate linear regression of X on f_y,


including an intercept. Holding Γ and ∆ fixed at G and D, and maximizing the likelihood over µ and β, we obtain µ̂ = X̄ and Gβ̂ = P_{G(D^{-1})} X^T F(F^T F)^{-1}, and the partially maximized log likelihood

f(G, D) = −(n/2) log|D| − (n/2) trace(D^{-1}Σ̂) + (n/2) trace(D^{-1}P_{G(D^{-1})}Σ̂_fit),   (2.2)

where P_{G(D^{-1})} = G(G^T D^{-1}G)^{-1}G^T D^{-1} and Σ̂_fit = X̂^T X̂/n is the sample covariance matrix of the fitted values. The rank of Σ̂_fit is at most r and typically rank(Σ̂_fit) = r. We assume that rank(Σ̂_fit) ≥ d.

The log likelihood in equation (2.2) is a function of possible values D and G for ∆ and Γ. For fixed D this function is maximized by choosing D^{-1/2}G to be a basis for the span of the first d eigenvectors of D^{-1/2}Σ̂_fit D^{-1/2}, yielding another partially maximized log likelihood f(D):

f(D) = −(n/2) log|D| − (n/2) trace(D^{-1}Σ̂_res) − (n/2) Σ_{i=d+1}^{p} λ_i(D^{-1}Σ̂_fit),   (2.3)

where Σ̂_res = Σ̂ − Σ̂_fit and λ_i(A) indicates the i-th eigenvalue of the matrix A.

The following theorem, whose proof can be found in Appendix A, gives the maximum over D of the function given in (2.3). In what follows, S_p^+ indicates the space of p × p positive definite matrices and S_p^{+,0} that of positive semidefinite ones.

Theorem 2.2.1 Suppose that Σ̂_res ∈ S_p^+ and Σ̂_fit ∈ S_p^{+,0} are known, and that the rank of Σ̂_fit is r. Let d be an integer no bigger than min(r, p). Then the maximum of

f(D) = − log|D| − trace(D^{-1}Σ̂_res) − Σ_{i=d+1}^{p} λ_i(D^{-1}Σ̂_fit)   (2.4)

for D ∈ S_p^+ is attained at

∆̂ = Σ̂_res + Σ̂_res^{1/2} V K V^T Σ̂_res^{1/2},   (2.5)

with K = diag(0, . . . , 0, λ_{d+1}, . . . , λ_p), where V and Λ = diag(λ_1, . . . , λ_p) are the matrices of the ordered eigenvectors and eigenvalues, respectively, of Σ̂_res^{-1/2} Σ̂_fit Σ̂_res^{-1/2}. We assume the λ_i, i = 1, . . . , min(r, p), are all distinct.

The following corollary, whose proof can be found in Appendix A, gives the value of the log likelihood function at its maximum.

Corollary 2.2.2 The log likelihood function attains its maximum at ∆̂ given by formula (2.5), and the maximum value is

− np/2 − (np/2) log(2π) − (n/2) log|Σ̂_res| − (n/2) Σ_{i=d+1}^{p} log(1 + λ_i),   (2.6)

where the λ_i indicate the ordered eigenvalues of Σ̂_res^{-1/2} Σ̂_fit Σ̂_res^{-1/2}, λ_1 ≥ λ_2 ≥ · · · ≥ λ_p.

The following results, whose proofs can be found in Appendix A, tell us that even though the maximum likelihood estimator of ∆ is neither Σ̂_res nor Σ̂, for maximum likelihood estimation of the central subspace Σ̂_res or Σ̂ can be used in place of ∆̂.

Corollary 2.2.3 The maximum likelihood estimator of ∆^{-1}S_Γ is the span of Σ̂_res^{-1/2} times the first d eigenvectors of Σ̂_res^{-1/2} Σ̂_fit Σ̂_res^{-1/2}.

Corollary 2.2.4 The maximum likelihood estimator of ∆^{-1}S_Γ is the span of Σ̂^{-1/2} times the first d eigenvectors of Σ̂^{-1/2} Σ̂_fit Σ̂^{-1/2}.

Cook (2007) proved that when Model PFC_∆ holds and Y is categorical, the SIR estimator of the central subspace is the span of Σ̂^{-1/2} times the first d eigenvectors of Σ̂^{-1/2} Σ̂_fit Σ̂^{-1/2}. In combination with Corollary 2.2.4 this says that the SIR estimator under Model PFC_∆ with Y categorical is the maximum likelihood estimator of ∆^{-1}S_Γ. That makes a connection between Model PFC_∆ and the model-free approach given by SIR.

The following corollary, whose proof can be found in Appendix A, gives the right scaling for using principal component analysis in the regression of Y|X when PFC_∆ is assumed.

Corollary 2.2.5 The maximum likelihood estimator of ∆^{-1}S_Γ is the span of Σ̂_res^{-1/2} times the first d eigenvectors of Σ̂_res^{-1/2} Σ̂ Σ̂_res^{-1/2}. Equivalently, it is the span of ∆̂^{-1/2} times the first d eigenvectors of ∆̂^{-1/2} Σ̂ ∆̂^{-1/2}.

From this corollary we can say that, under Model PFC_∆, the maximum likelihood estimator of the central subspace for the regression of Y|X is a re-scaling of the principal components of the data normalized using Σ̂_res^{-1/2}.
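Since the estimators of Theorem 2.2.1 and Corollary 2.2.3 are in closed form, they can be computed directly. The following is a hedged sketch (numpy/scipy assumed; the function name and interface are ours, not the thesis') of the maximum likelihood estimators of ∆ and of a basis of ∆^{-1}S_Γ under Model PFC_∆:

```python
# Maximum likelihood estimators under Model PFC_Delta: Delta_hat from (2.5) and a
# basis of the central subspace from Corollary 2.2.3.
import numpy as np
from scipy.linalg import sqrtm

def pfc_delta_mle(X, Fy, d):
    """Return (Delta_hat, basis of the estimated central subspace)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Fc = Fy - Fy.mean(axis=0)
    Xhat = Fc @ np.linalg.lstsq(Fc, Xc, rcond=None)[0]   # centered fitted values
    Sigma = Xc.T @ Xc / n
    Sigma_fit = Xhat.T @ Xhat / n
    Sigma_res = Sigma - Sigma_fit

    R = np.real(sqrtm(Sigma_res))                        # Sigma_res^{1/2}
    Rinv = np.linalg.inv(R)                              # Sigma_res^{-1/2}
    lam, V = np.linalg.eigh(Rinv @ Sigma_fit @ Rinv)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]

    K = np.diag(np.concatenate([np.zeros(d), lam[d:]]))  # diag(0,...,0,lam_{d+1},...)
    Delta_hat = Sigma_res + R @ V @ K @ V.T @ R          # equation (2.5)
    basis = Rinv @ V[:, :d]                              # Corollary 2.2.3
    return Delta_hat, basis
```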

2.3 Inference

We will consider Model PFC_∆, given by X_y = µ + Γβf_y + ε, where µ = E(X), Γ ∈ R^{p×d}, d ≤ min(p, r), and the error vector ε ∈ R^p is considered to be independent of Y and has a normal distribution with mean 0 and ∆ ∈ S_p^+ as covariance operator.

2.3.1 Inference about Conditional Independence of the Predictors

When the predictors are conditionally independent given the response, the number of parameters to be estimated in Model PFC_∆ decreases from p(p+1)/2 + rd + d(p−d) to p + rd + d(p−d), giving us more degrees of freedom for further tests. The statistic to be used is the log likelihood ratio. Because of the normality assumption, −2 times the log likelihood ratio for comparing the model with diagonal ∆ to the full model has an approximate chi-square distribution. The degrees of freedom for the chi-square distribution are the difference in the numbers of parameters to be estimated, in this case p(p−1)/2, independent of d. To compute the likelihood ratio test we need the maximum likelihood estimator of ∆ under the full model (given in Theorem 2.2.1) and under the reduced one. To obtain the latter we compute ∂f(D)/∂D, with f given in (2.3) and D diagonal, and solve the equation ∂f(D)/∂D = 0 iteratively in D. The starting point used by the algorithm is D_0 = diag(Σ̂_res^{-1/2}). A sketch of the algorithm is given in Appendix B.
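The thesis' own iterative algorithm for the diagonal case is given in Appendix B and is not reproduced here. Purely as an illustration of the test, the sketch below instead maximizes the partially maximized log likelihood (2.3) over diagonal D by generic numerical optimization (numpy/scipy assumed; an assumption-laden stand-in, not the author's implementation).

```python
# Likelihood ratio test of H0: Delta diagonal. The unrestricted maximum of (2.3) is
# available in closed form; the diagonal maximum is obtained numerically.
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import minimize
from scipy.stats import chi2

def f_of_D(D_diag, Sigma_res, Sigma_fit, d, n):
    """Equation (2.3) evaluated at a diagonal D with diagonal entries D_diag."""
    W = np.diag(1.0 / np.sqrt(D_diag))
    lam = np.sort(np.linalg.eigvalsh(W @ Sigma_fit @ W))[::-1]
    return (n / 2.0) * (-np.sum(np.log(D_diag))
                        - np.sum(np.diag(Sigma_res) / D_diag)
                        - np.sum(lam[d:]))

def lrt_diagonal_delta(Sigma, Sigma_fit, d, n):
    """Return the -2 log likelihood ratio statistic and its chi-square p-value."""
    p = Sigma.shape[0]
    Sigma_res = Sigma - Sigma_fit

    # Unrestricted model: plugging the maximizer (2.5) into (2.3) gives
    # -(n/2) * (log|Sigma_res| + sum_{i>d} log(1 + lam_i) + p).
    Rinv = np.linalg.inv(np.real(sqrtm(Sigma_res)))
    lam = np.sort(np.linalg.eigvalsh(Rinv @ Sigma_fit @ Rinv))[::-1]
    f_full = -(n / 2.0) * (np.linalg.slogdet(Sigma_res)[1]
                           + np.sum(np.log1p(lam[d:])) + p)

    # Diagonal model: maximize (2.3) over log-diagonal entries for positivity.
    obj = lambda w: -f_of_D(np.exp(w), Sigma_res, Sigma_fit, d, n)
    w_hat = minimize(obj, np.log(np.diag(Sigma_res)), method="BFGS").x
    f_diag = f_of_D(np.exp(w_hat), Sigma_res, Sigma_fit, d, n)

    stat = 2.0 * (f_full - f_diag)
    return stat, chi2.sf(stat, p * (p - 1) // 2)
```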

2.3.2 Inference about d

The dimension d of the reductive subspace was assumed known for all the models we consider here, but inference on d will normally be needed in practice. The number d is a model selection parameter and there are at least two ways to estimate it. The first is to test among nested models using log likelihood ratio tests and the second one is to use model selection techniques like AIC or BIC. We will call the model with d = min(r, p) the full model.

Testing

Our goal is to find a d such that, compared with the full model, there is little or no loss of information. Therefore we test sequentially d = j vs d = min(r, p) for j = 0, . . . , min(r, p) − 1. We first test the hypothesis d = 0 vs d = min(r, p) at the 0.05 level and, if we reject the null hypothesis, we continue the process, incrementing d by 1. We finish when the null hypothesis is not rejected or when the null hypothesis d = min(r, p) − 1 is rejected, whichever occurs first. We again use the log likelihood ratio test. Because of the normality assumption, −2 times the log likelihood ratio for comparing a model with d = 0, . . . , min(r, p) − 1 to the full model will, once more, have an approximate chi-square distribution, with degrees of freedom equal to the difference in the numbers of parameters to be estimated. This difference is (r − d)(p − d). Even if it is not consistent, this estimation method has an asymptotic probability of 5 percent of choosing a d larger than the true one; and clearly, if we get a larger d, we are not losing information but only, at worst, using more information than necessary.
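Because Σ̂_res, Σ̂_fit and the eigenvalues λ̂_i do not depend on d, the maximized log likelihood (2.6) makes the sequential procedure simple: the statistic for testing d = j against the full model reduces to n Σ_{i=j+1}^{min(r,p)} log(1 + λ̂_i). A minimal sketch (numpy/scipy assumed; the function name is ours):

```python
# Sequential likelihood ratio choice of d, testing d = j against d = min(r, p).
import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import chi2

def choose_d_by_lrt(Sigma, Sigma_fit, n, r, level=0.05):
    p = Sigma.shape[0]
    Sigma_res = Sigma - Sigma_fit
    Rinv = np.linalg.inv(np.real(sqrtm(Sigma_res)))
    lam = np.sort(np.linalg.eigvalsh(Rinv @ Sigma_fit @ Rinv))[::-1]
    d_full = min(r, p)
    for j in range(d_full):
        stat = n * np.sum(np.log1p(lam[j:d_full]))   # n * sum_{i>j} log(1 + lam_i)
        dof = (r - j) * (p - j)
        if chi2.sf(stat, dof) > level:               # fail to reject d = j
            return j
    return d_full
```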


AIC, BIC

Since we have normal distributions, BIC is consistent in selecting the true model, and AIC is minimax-rate optimal for estimating the regression. The application of these two methods is straightforward. Let us denote by Log[d] the log-likelihood function evaluated at the maximum likelihood estimators of µ, Γ, β and ∆ for a fixed d. For d ∈ {0, . . . , min(r, p)} we need to evaluate the function IC(d) = −2 Log[d] + h(n)g(d), where g(d) is the number of parameters to be estimated as a function of d (in our case, p(p+3)/2 + rd + d(p − d)) and h(n) is equal to log n for BIC and to 2 for AIC. If AIC and BIC give different values of d, we choose the larger one.
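The information criteria can reuse the same eigenvalues. A companion sketch (numpy assumed; lam and log|Σ̂_res| are computed as in the previous sketches, and the function name is ours):

```python
# IC(d) = -2 Log[d] + h(n) g(d), with Log[d] from (2.6) and
# g(d) = p(p+3)/2 + r*d + d*(p - d).
import numpy as np

def choose_d_by_ic(lam, logdet_Sigma_res, n, p, r, criterion="bic"):
    h = np.log(n) if criterion == "bic" else 2.0
    d_full = min(r, p)
    best_d, best_ic = 0, np.inf
    for d in range(d_full + 1):
        loglik = (-n * p / 2.0 - (n * p / 2.0) * np.log(2 * np.pi)
                  - (n / 2.0) * logdet_Sigma_res
                  - (n / 2.0) * np.sum(np.log1p(lam[d:d_full])))
        ic = -2.0 * loglik + h * (p * (p + 3) / 2.0 + r * d + d * (p - d))
        if ic < best_ic:
            best_d, best_ic = d, ic
    return best_d
```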

2.3.3 Testing predictors

In this section we develop tests of the hypothesis of no effect for selected predictors when we are under Model PFC_∆. Let us consider in Model PFC_∆ the partition

X = (X_1^T, X_2^T)^T,   Γ = (Γ_1^T, Γ_2^T)^T   and   ∆^{-1} = ( ∆_{11}  ∆_{12} ; ∆_{21}  ∆_{22} ),

where the semicolon separates block rows. The kind of hypotheses we would like to test here are of the form

Y ⫫ X_2 | X_1.   (2.7)

In order to develop statistics to test this kind of hypothesis we need the following result, whose proof can be found in Appendix A.

Lemma 2.3.1 Consider Model PFC_∆ with the partition described above. Then Y ⫫ X_2 | X_1 if and only if

Γ_2 = −∆^{−22} ∆_{21} Γ_1.   (2.8)

Here ∆^{−ii} indicates the inverse of ∆_{ii}.

In order to test (2.7) we will use the log likelihood ratio test. Again, due to normality, the statistic will have an approximate chi-square distribution. The degrees of freedom to be used are the difference between the number of parameters to be estimated for the full model (no restrictions on Γ) and for the reduced model with Γ restricted as in Lemma 2.3.1. The maximum of the likelihood for the full model was given in Corollary 2.2.2. The following lemma, whose proof can be found in Appendix A, gives a formula for the maximum likelihood estimators for the reduced model.

Lemma 2.3.2 Assume Γ_2 = −∆^{−22} ∆_{21} Γ_1 and d ≤ min(r, p_1). Then the maximum likelihood estimator of ∆ is given by

∆̂_{11} = Σ̂_{11,res}^{1/2} V (I + K) V^T Σ̂_{11,res}^{1/2},   (2.9)

with K = diag(0, . . . , 0, λ_{d+1}, . . . , λ_{p_1}) and V and λ_1, . . . , λ_{p_1} the ordered eigenvectors and eigenvalues, respectively, of Σ̂_{11,res}^{-1/2} Σ̂_{11,fit} Σ̂_{11,res}^{-1/2};

∆̂_{12} = ∆̂_{11} Σ̂_{11}^{-1} Σ̂_{12}

and

∆̂_{22} = Σ̂_{22.1} + Σ̂_{21} Σ̂_{11}^{-1} ∆̂_{11} Σ̂_{11}^{-1} Σ̂_{12}.

The maximum likelihood estimator for a basis of the reductive subspace ∆^{-1}Γ is (Σ̂_{11,res}^{-1/2} G_1, 0)^T, with G_1 the first d eigenvectors of Σ̂_{11,res}^{-1/2} Σ̂_{11,fit} Σ̂_{11,res}^{-1/2}. Here Σ̂ is partitioned as Σ̂ = (Σ̂_{11} Σ̂_{12}; Σ̂_{21} Σ̂_{22}), the same partition is considered for Σ̂_res and Σ̂_fit, and for a matrix A, with i, j = 1, 2, we denote A_{ii.j} = A_{ii} − A_{ij} A_{jj}^{-1} A_{ji}.

By setting d = min(r, p_1) we can test predictors without first inferring about d. In the same way, we can set d = min(r, p) to test conditional independence.
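The estimators in Lemma 2.3.2 can be transcribed mechanically. The sketch below does exactly that, assuming Σ̂ and Σ̂_fit have already been computed and that X_1 occupies the first p_1 coordinates (numpy/scipy assumed; the function name and interface are ours):

```python
# Reduced-model maximum likelihood estimators of Lemma 2.3.2.
import numpy as np
from scipy.linalg import sqrtm

def reduced_model_mle(Sigma, Sigma_fit, p1, d):
    """Return the blocks of Delta_hat and the basis (Sigma_11,res^{-1/2} G1, 0)^T."""
    p = Sigma.shape[0]
    Sigma_res = Sigma - Sigma_fit
    S11, S12 = Sigma[:p1, :p1], Sigma[:p1, p1:]
    S21, S22 = Sigma[p1:, :p1], Sigma[p1:, p1:]
    S11_res, S11_fit = Sigma_res[:p1, :p1], Sigma_fit[:p1, :p1]
    S22_1 = S22 - S21 @ np.linalg.solve(S11, S12)          # Sigma_22.1

    R = np.real(sqrtm(S11_res))
    Rinv = np.linalg.inv(R)
    lam, V = np.linalg.eigh(Rinv @ S11_fit @ Rinv)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]

    K = np.diag(np.concatenate([np.zeros(d), lam[d:]]))
    D11 = R @ V @ (np.eye(p1) + K) @ V.T @ R               # equation (2.9)
    D12 = D11 @ np.linalg.solve(S11, S12)
    D22 = S22_1 + S21 @ np.linalg.solve(S11, D11) @ np.linalg.solve(S11, S12)
    basis = np.vstack([Rinv @ V[:, :d], np.zeros((p - p1, d))])
    return D11, D12, D22, basis
```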

2.4 Some simulations

We present here some simulations using the testing procedure explained above.

2.4.1 Inference about conditional independence

We want to see how the performance of the test for conditional independence (that is, of the hypothesis that ∆ is a diagonal matrix) depends on p. As stated before, we use the asymptotic distribution of the statistic. To see the rate of convergence with n, we study the behavior of the test as the sample size


increases. The model considered here is

X_y = Γy + ε,   (2.10)

with y ∼ N(0, 1), Γ ∈ R^p, Γ = (1, . . . , 1)^T/√p, and ε ∼ N(0, ∆), where ∆ is a diagonal matrix with entry (i, i) equal to 10^{i−1}. In this example d = 1, and we consider p = 6 and p = 15. The null hypothesis to be tested is that ∆ is diagonal. To do this we fit the model using d = r for different values of r and f_y = (y − ȳ, . . . , y^r − \overline{y^r})^T. Testing was done at the 5 percent level and the number of repetitions was 500. Figure 2.1 gives a graph of the percentage of times the null hypothesis was accepted versus sample size (from 40 to 200).

As expected, when the sample size increases, the percentage of times the null hypothesis is accepted converges to 95 percent. The convergence is faster for p = 6 than for p = 15, reflecting the fact that the number of predictors affects the asymptotic behavior of the likelihood ratio test. For p = 15 the convergence is faster if we use a smaller d = r when fitting the models, suggesting that when the number of predictors is large it may not be a good idea to use d = r = p.

[Figure 2.1: Inference about conditional independence. Panels: (a) p = 6, (b) p = 15. x-axis: sample size; y-axis: percentage of times the null hypothesis is accepted; one curve per value of r.]
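The data-generating mechanism (2.10) used in this simulation is straightforward to reproduce. A minimal sketch (numpy assumed; the function name is ours):

```python
# Generate (X, y) from model (2.10): X_y = Gamma*y + eps, Delta = diag(10^(i-1)).
import numpy as np

def generate_model_2_10(n, p, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    y = rng.standard_normal(n)                      # y ~ N(0, 1)
    Gamma = np.ones(p) / np.sqrt(p)                 # Gamma = (1, ..., 1)^T / sqrt(p)
    Delta = np.diag(10.0 ** np.arange(p))           # entry (i, i) equal to 10^(i-1)
    eps = rng.multivariate_normal(np.zeros(p), Delta, size=n)
    return np.outer(y, Gamma) + eps, y
```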

2.4.2 Inference about d

Here we present some simulations about inference on d, using the likelihood ratio test, AIC and BIC. We generated the random variable Y as N(0, σ_y²), and then generated X_y as

X_y = Γg(y) + ε,   (2.11)

with ε ∼ N(0, ∆). For p, Γ, g(y) and ∆ we consider different choices, as follows.


Figures 2.2a and 2.2b are for σ_y equal to 0.1 and 1, respectively, with p = 10, d = 1, Γ = (1, 1, . . . , 1)^T/√10, g(y) = y, and ∆ generated as ∆ = A^T A, where A is a p × p matrix of independent standard normal random variables, yielding predictor variances of about 10 and correlations ranging between −0.67 and 0.75. We fit the model using f_y = (y − ȳ, . . . , y^r − \overline{y^r})^T for different values of r. The x-axis represents r = 1, 2, 3, 4, 5 and the y-axis the percentage of times d = 1 was chosen using the log-likelihood ratio test, AIC or BIC. The sample size is n = 100 and the number of repetitions is 500.


From the graphs we can say that increasing the signal in y improves the results. This is not surprising, because increasing the signal in y increases the importance of Γy, letting the model choose d = 1 instead of d = 0. If the y signal is small we need to overfit (increase r) to be able to choose d = 1. For the case σ_y = 1, overfitting makes the model choose a bigger d.

[Figure 2.2: Inference about d, for d = 1, vs r. Panels: (a) σ_y = 0.1, (b) σ_y = 1.]

Figures 2.3a and 2.3b are for σ_y equal to 1 and 5, respectively, with p = 5, d = 2, Γ = (Γ_1, Γ_2), Γ_1 = (1, 1, −1, −1, 0)^T/√4, Γ_2 = (−1, 0, 1, 0, 1)^T/√3, g(y) = (y, |y|), and ∆ generated as ∆ = A^T A, where A is a p × p matrix of independent standard normal random variables, yielding predictor variances of about 5 and correlations ranging between −0.73 and 0.65. We fit the model using f_y = (y − ȳ, |y| − \overline{|y|}, y³ − \overline{y³}, . . . , y^r − \overline{y^r})^T for different values of r. The x-axis represents r from 2 to 10 and the y-axis the percentage of times d = 2 was chosen using the log-likelihood ratio test, AIC or BIC. The sample size is n = 50 and the number of repetitions is 500. Again, a larger y signal lets us choose the right d more frequently. Here overfitting seems not to have much effect, especially in the case of a large signal.

[Figure 2.3: Inference about d, for d = 2, vs r. Panels: (a) σ_y = 1, (b) σ_y = 5.]

Figures 2.4a, b, c and d are for σ_y equal to 0.5, 1, 2 and 5, respectively, with p = 5, d = 2, Γ = (Γ_1, Γ_2), Γ_1 = (1, 1, −1, −1, 0)^T/√4, Γ_2 = (−1, 0, 1, 0, 1)^T/√3, g(y) = (y, |y|), and ∆ generated as ∆ = A^T A, where A is a p × p matrix of independent standard normal random variables, yielding predictor variances of about 5 and correlations ranging between −0.4 and 0.6. We fit the model using f_y = (y − ȳ, |y| − \overline{|y|}, y³ − \overline{y³})^T. The x-axis represents the sample size n (from 20 to 400) and the y-axis the percentage of times that the likelihood ratio test, AIC or BIC chose d = 2. The number of repetitions is 500. As expected, all the testing procedures improve when we increase the sample size: BIC and AIC get close to 100 percent and the LRT to 95 percent. From the figures we again see that the size of the y signal matters for the testing procedures; smaller y signals need larger samples to give good answers than bigger signals.

[Figure 2.4: Inference about d, for d = 2, vs n. Panels: (a) σ_y = 0.5, (b) σ_y = 1, (c) σ_y = 2, (d) σ_y = 5.]

Figure 2.5 is for σ_y equal to 2 and sample size equal to 200, with d = 2, Γ = (Γ_1, Γ_2), Γ_1 = (1, 1, −1, −1, 0, 0, . . . , 0)^T/√4, Γ_2 = (−1, 0, 1, 0, 1, 0, . . . , 0)^T/√3, g(y) = (y, |y|), and ∆ generated as ∆ = A^T A, where A is a p × p matrix of independent standard normal random variables. The x-axis in the figure represents p (from 6 to 84) and the y-axis the percentage of times the likelihood ratio test, AIC or BIC chose d = 2. The sample size is n = 200 and the number of repetitions is 500. For the fitting we used f_y = (y − ȳ, |y| − \overline{|y|}, y³ − \overline{y³})^T. This is a sliced view of Figure 2.4c. The graph shows how BIC and the likelihood ratio test start failing as p increases with n fixed. Note that the number of parameters we estimate when we have p predictors is of order p²; nevertheless the tests are still good for p² > n. Note also that AIC is quite good for all values of p considered in this simulation. If we compare this figure with Figure 2.4c we can see that for n = 200 and p = 5, BIC gives better answers than AIC. Nevertheless this reverses after p ≈ 10, where AIC starts giving better results, and those results remain good for large p.

[Figure 2.5: Inference about d vs p.]

2.4.3 Testing predictors

For Figure 2.6 we consider the model X_y = Γy + ε, with ε ∼ N(0, ∆) and y ∼ N(0, σ_y²), where ∆ is generated as ∆ = A^T A, with A a p × p matrix of independent standard normal random variables, yielding predictor variances of about 10 and correlations ranging between −0.67 and 0.75. Here Γ = c(Γ_1, Γ_2)^T ∈ R^{10}, where Γ_1 ∈ R^7, Γ_1 = (1, . . . , 1)^T, Γ_2 = −∆^{−22}∆_{21}Γ_1, with the ∆_{ij} defined as in Section 2.3, and c is a normalizing constant. We want to test Y ⫫ X_2 | X_1, where X = (X_1^T, X_2^T)^T and dim(X_1) = 7. The graphs are for


σ_y taking the values 0.5, 1 and 3. The x-axis is the sample size (between 20 and 300) and the y-axis the percentage of times the null hypothesis is accepted. The number of replications is 500. From the graph we can see again that larger y signals give better results. The test performs better for bigger sample sizes because in those cases the asymptotic distribution of the statistic is close to the chi-square distribution; for smaller sample sizes, the poorer results should be read as indicating that the statistic does not closely follow a chi-square distribution.

[Figure 2.6: Getting rid of predictors. x-axis: sample size; y-axis: percentage of times the null hypothesis is accepted; one curve for each of σ_y = 0.5, 1, 3.]


2.4.4 Study of Fearn's data

Fearn’s (1983, see also Cook, 1998, p. 175) calibration data is the basis for this example. The response is the protein content of a sample of ground wheat, and the predictors are -log(reflectance) of NIR radiation at p = 6 wavelengths. The predictors are highly correlated in these data, with pairwise sample correlations ranging between 0.92 and 0.9991. fy equal to (y − y, y 2 − y 2 , y 3 − y 3 )T seemed to be a reasonable choice. Fitting Model PFC∆ with this fy made AIC, BIC and loglikehood ratio test all choose d = 2. Consequently, we infer that the sufficient reduction is comprised of two linear combinations of the predictors, which can be viewed in a 3D plot with the response. For this data there is no clear relationship between the sufficient reduction and the principal components. The test for Y X|(X2 , X3 , X5 ) gave a statistic of 9.20737 with 6 degrees of freedom, giving a p-value of 0.1622472. Therefore, for further studies with this data we need to consider only the second, third and fifth variable.

2.5 Conclusions

As we stated before, principal component regression (using the correlation matrix) goes as follows:

• Normalize the data using Z = (diag(Ĉov(X)))^{-1/2} X.

• Regress Y on ((diag(Ĉov(X)))^{-1/2} V)^T X, with V the first d eigenvectors of cov(Z).

\ for principal components The difference between these approaches is the scaling; diag((Cov(X)) b res for PFC (Model PFC∆ ). Now, as I showed in the toy example, if we use principal and Σ components we do not necessarily get, even in the linear regression context, a close approximation to the sufficient dimension reduction subspace. On the other hand, if we are under Model PFC∆ , b res gives the maximum likelihood estimator for the central subspace. the scaling Σ As general conclusion we can say that Model PFC∆ gives a framework where principal components with this new scaling has a support theory. If we use that model, not only we know what is the central subspace for the regression of Y |X but we can also make inferences about the dimension of the central subspace and about predictors to see which of them are important for the regression of Y |X. The simulations presented in this chapter corroborate the theory, as developed in our work.

Chapter 3

Further work

3.1 Discrimination

The kind of discriminant analysis I consider here can be summarized by saying that, given an object known to come from one of g groups G_i (i = 1, . . . , g) in a population P, we wish to assign it, with the least possible error, to one of these groups on the basis of p measured characteristics X associated with it. In general it is assumed that the random vector X associated with each observation has a particular distribution depending on its group. Information about the frequencies of occurrence of each group may be available. The idea is that, after studying some available set of data X_i (the training data) for which the group membership is known, we should be able to give a rule to assign future observations to one of the g groups.

The best known case in the theory is when there are g populations and X_i for i = 1, . . . , g is assumed to have a multivariate normal distribution that differs in the mean but not in the covariance matrix. For this case, Fisher's sample linear discriminant rule is based on finding the linear combinations of X such that the quotient between the sum of the squared distances from the population means of the linear combinations to their overall mean and the variance of the linear combinations is as large as possible. One chooses the s = min(g − 1, p) vectors Σ̂_pooled^{-1/2} e_i, where the e_i are the s eigenvectors with non-zero eigenvalues of the matrix Σ̂_pooled^{-1/2} Σ̂_f Σ̂_pooled^{-1/2}; here Σ̂_pooled is the pooled covariance matrix and Σ̂_f, usually called the sample between-groups matrix, is the difference between the usual sample covariance matrix and the pooled covariance matrix.

The hypothesis of normality with equal variance can be written using Model PFC_∆, where f_y has g − 1 coordinates and the i-th coordinate equals 1 − n_i/n if y represents population i and −n_i/n otherwise. Here n_i is the sample size for population i and n is the total sample size. Under these conditions, Σ̂_res from Chapter 2 is Σ̂_pooled and therefore Σ̂_f is Σ̂_fit. From the theory developed in Chapter 2, if the means of the populations live in a space of dimension d ≤ min(g − 1, p), the central subspace for the regression of Y on X, and therefore for discrimination of the g populations, will be spanned by Σ̂_pooled^{-1/2} e_i, where the e_i are the first d eigenvectors of the matrix Σ̂_pooled^{-1/2} Σ̂_f Σ̂_pooled^{-1/2}.


As stated, the maximum likelihood estimators for the discrimination of the g populations are the first d ≤ s = min(g − 1, p) linear combinations of X used in the standard way. The theory developed in Chapter 2 should allow us to choose d, to test predictors, and to test conditional independence. Further study should be done to see how these linear combinations can be used to classify new observations. We know these linear combinations are sufficient for discrimination; nevertheless, it is not clear that we should use linear distance to classify observations. Another interesting problem would be to consider not necessarily equal variances among populations.
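The connection with Fisher's rule can be checked numerically: with the centered group-indicator f_y described above, Σ̂_res equals the pooled within-groups covariance and Σ̂_fit the between-groups matrix, so the PFC directions span the same subspace as Fisher's discriminant directions. A sketch of such a check (numpy/scipy assumed; group sizes, means and the common covariance are arbitrary illustrative choices):

```python
# Numerical check: PFC with a group-indicator f_y recovers Fisher's LDA subspace.
import numpy as np
from scipy.linalg import sqrtm, subspace_angles

rng = np.random.default_rng(1)
p, g, n_per = 5, 3, 200
cov = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)         # common within-group covariance
means = rng.normal(size=(g, p))
X = np.vstack([rng.multivariate_normal(m, cov, size=n_per) for m in means])
labels = np.repeat(np.arange(g), n_per)
n = X.shape[0]

# PFC: f_y has g - 1 coordinates, the i-th equal to 1 - n_i/n for group i, -n_i/n otherwise
F = np.column_stack([(labels == i).astype(float) - n_per / n for i in range(g - 1)])
Xc = X - X.mean(axis=0)
Xhat = F @ np.linalg.lstsq(F, Xc, rcond=None)[0]
Sigma_fit = Xhat.T @ Xhat / n
Sigma_res = Xc.T @ Xc / n - Sigma_fit                 # equals the pooled covariance here

Rinv = np.linalg.inv(np.real(sqrtm(Sigma_res)))
lam, V = np.linalg.eigh(Rinv @ Sigma_fit @ Rinv)
pfc_dirs = Rinv @ V[:, np.argsort(lam)[::-1][:g - 1]]

# Fisher: eigenvectors of Sigma_pooled^{-1} Sigma_between
Sigma_pooled = sum((X[labels == i] - X[labels == i].mean(axis=0)).T
                   @ (X[labels == i] - X[labels == i].mean(axis=0)) for i in range(g)) / n
Sigma_between = Xc.T @ Xc / n - Sigma_pooled
w, U = np.linalg.eig(np.linalg.solve(Sigma_pooled, Sigma_between))
fisher_dirs = np.real(U[:, np.argsort(np.real(w))[::-1][:g - 1]])

print("largest principal angle:", subspace_angles(pfc_dirs, fisher_dirs).max())
```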

3.2 Subpopulations

Consider again Model PFC_∆, but suppose now that we have two or more different subpopulations. Such a regression can be seen as having both continuous and categorical predictors. In this case we would like to get a reduction conditional not only on the continuous predictors but also on the categorical ones. That is, if W indicates a categorical predictor with values w = 1, . . . , g, we are interested in finding a function R(X) such that Y ⫫ X | (R(X), W). For the case of sliced inverse regression, Chiaromonte, Cook and Li (2002) introduced a way of estimating a subset of a DRS for this kind of regression, called partial sliced inverse regression. To be able to estimate and test, they need the condition that the covariance of the different populations (conditional on W) be the same.

Let us consider the normality assumption of Model PFC_∆ and assume

X | (Y, W = w) = µ_w + Γ_w β_w f_{y,w} + ε,   (3.1)

with ε ∼ N(0, ∆) and ∆ independent of Y and w. It is not difficult to see that X | (Y, Γ_w^T∆^{-1}X, W = w) and X | (Γ_w^T∆^{-1}X, W = w) have the same distribution, and therefore a sufficient reduction for the regression of Y | (X, W = w) is R(X, W = w) = Γ_w^T∆^{-1}X. Using a result from Chiaromonte, Cook and Li (2002) we can say that R(X) = Σ_{w=1}^{g} R(X, W = w) is the partial sufficient reduction subspace for Model (3.1) (here Σ_{i=1}^{g} A_i = {Σ_{i=1}^{g} v_i : v_i ∈ A_i} for subspaces A_i). Further work should be done to determine a maximum likelihood estimator of that reductive subspace, as well as to develop procedures for testing dimension, predictors and conditional independence. Rewritten in terms of subspaces, Model (3.1) can be stated as X | (Y, W = w) = µ_w + Γβ_w f_{y,w} + ε, with Γ a basis for Σ_{i=1}^{g} S_{Γ_i}.

Later, Ni and Cook (2006) extended partial sliced inverse regression by removing the restrictive homogeneous covariance condition among subpopulations. They estimated a subset of the reductive subspace via minimization of a non-linear objective function. The first problem to solve in this context is to find the central subspace. After that, to obtain the maximum likelihood estimator


we may need an iterative algorithm in which, for fixed and supposedly known covariances, we estimate the reductive subspace, then estimate the covariances, and then return to the estimation of the reductive subspace, and so on.

3.3 Exact distribution theory for testing

When the likelihood ratio test was used for testing, we worked with its asymptotic distribution. Since we were able to obtain the maximum likelihood estimators explicitly, further study can be done to obtain the exact distributions of the statistics used in the testing procedures. The starting point will be equation (2.6), which gives the value of the likelihood function under Model PFC_∆ for given d and a general positive definite ∆.

3.4 Diagonal case without fitting νy

As we said before, if ∆ is a general matrix, the maximum likelihood estimator of ∆ cannot be found under Model PC_∆. This is due to the fact that all the information was used to estimate ν_y. It can be seen from the fact that in this case we need to maximize over D the function

f(D) = − log|D| − Σ_{i=d+1}^{p} λ_i(D^{-1}Σ̂),

where Σ̂ is the sample covariance matrix of the marginal X. The non-existence of the maximum can be seen by considering Σ̂ = I_p: we can take D = diag(1/t, . . . , 1/t, 1, . . . , 1), with the first d entries equal to 1/t and t > 1, and the value of the function at such a D goes to infinity as t goes to infinity. This says that the maximum does not exist even when ∆ is diagonal, which reduces the number of unknowns by p(p−1)/2. But the non-fitting case (using ν_y without modeling it) can be important in situations where there are predictors without influence on the regression of Y on X, which can be the case when we do not know how to model ν_y. For these cases we need to study under which situations the maximum likelihood estimator exists for some particular classes of diagonal matrices. This problem is connected with the problem I will present in the following section.

3.5 Correlation matrix theory of principal component analysis

Cook (2007) gives the connection between principal component regression and Model (1.4). For that model, the first eigenvectors of the covariance matrix span the central subspace for the regression of Y on X. When principal component regression is used, practitioners sometimes need to scale the predictors by their variances in order to get good estimates; this means using the first eigenvectors of the correlation matrix. In the theory presented here the correlation matrix does not


appear. That should be related to ∆ being a diagonal matrix but, as we saw above, the maximum likelihood estimator does not exist for the general diagonal case if we use Model (1.4). If we want to estimate a general diagonal matrix we need to use fitting. In practice, when the correlation matrix is used for the principal components, it is because the predictors show different variances; but these usually have a patterned relation, such as exponentially decreasing or increasing, or the last k eigenvalues of ∆ (or Σ) being constant. In such a case, even if the matrix is not a constant times the identity, we have fewer parameters to estimate. Further work should be done to see whether in those cases we can estimate the ∆ matrix for Model PC_∆ by maximum likelihood.

3.6 Other possible questions

Other possible research lines are to extend Models (1.5) and PFC_∆ to the case where the conditional covariance of the predictors given the response depends on the response, or to non-normal situations. Another possibility is to consider an intermediate model between Model (1.5) and Model PFC_∆ in which some of the ν_y are modeled by βf_y and some are left as they are.

3.7 Infinite dimensional case

3.7.1 What is Functional Data Analysis?

Most statistical analyses involve one or more observations taken on each of a number of individuals in a sample, with the aim of making inferences about the general population from which the sample is drawn. In an increasing number of statistics papers the observations considered are curves. Curves are examples of functions; but, as everybody knows, data are numbers, vectors or matrices, never functions. Why, then, do we say that we observe curves as data? Because, even if we know a curve only at a finite number of points, the way we think about and treat this kind of data is as functions (which means, for example, that we can talk about smoothness) rather than as strings of numbers. Suppose, for example, that the data are recorded densely over time, often by machine. They are then typically termed functional or curve data, with one observed curve (or function) per subject. This is often the case even when the data are observed with experimental error, since the operation of smoothing data recorded at closely spaced time points can greatly reduce the effects of noise. In such cases we may regard the entire curve for the i-th subject, represented by the graph of a function X_i(t), as being observed in the continuum, even though in reality the recording times are discrete. For this reason we say that we observe curves, and we call them functional data. Statistical methods for analyzing such data are described by the term functional data analysis (FDA from now on), coined by Ramsay and Dalzell (1991).

The goals of functional data analysis are essentially the same as those of any other branch


of statistics: exploration, confirmation and/or prediction. The books of Ramsay and Silverman (2002, 2005) describe in a systematic way some techniques for the exploration of functional data. In the same books some confirmatory theory is given and some issues about prediction are discussed. Since those books (the first edition appeared in 1997), a great deal about functional data analysis has been discussed in the literature.

3.7.2 The assumptions

The theory developed here can be found in the books of Ramsay and Silverman (2005) or Bogachev (1998). Let us take a Hilbert space H ⊂ L²[0, 1], equipped with a norm ||·||, where the immersion is continuous with respect to the usual norm ||·||₂ in L²[0, 1], defined through the following inner product: for f, g ∈ L²[0, 1],

⟨f, g⟩ = ∫₀¹ f(t) g(t) dt.   (3.2)

The stochastic processes we will consider here are Z X ∈ {Xt : Ω → R, E||X. ||2 = ||X. (ω)||2 dP (ω) < ∞} Ω

where (Ω, F, P ) denotes a complete probability space. When we use the ||.||2 norm in L2 [0, 1], the space will be denoted by L2 (Ω × [0, 1]) We will work in general with H = L2 [0, 1] or a Sobolev space H m immersed in L2 [0, 1], defined as H m = {f ∈ L2 [0, 1] : f (m) ∈ L2 [0, 1]}, where f (m) indicates the m-th weak derivative of f ; but in general the results are true for any H ⊂ L2 [0, 1] Hilbert space. The mean of the process X is defined as µ(t) = E(Xt )

(3.3)

and the covariance as K(s, t) = E((Xt − µ(t))(Xs − µ(s))).

(3.4)

Because of the hypothesis imposed on the process, E(Xt ) belongs to L2 [0, 1] and K(t, s) ∈ L2 ([0, 1] × [0, 1]). Therefore, we can define the covariance operator as a linear, continuous operator

28

Chapter 3

from L2 [0, 1] to L2 [0, 1]: K : L2 [0, 1] → L2 [0, 1] Z 1 K(f )(s) = f (t)K(t, s)dt.

(3.5) (3.6)

0

Moreover, this operator is a positive, symmetric, compact operator. Let us note that the same is true if we consider K : H → L2 [0, 1] with the respective norms. Since K is a linear, positive, symmetric, compact operator, there exists an orthonormal basis 2 {φi }∞ i=1 of L [0, 1] such that each φi is an eigenvector of K with eigenvalue λi . That means that K(φi ) = λi φi ,

(3.7)

and K(s, t) =

∞ X

λi φi (s)φi (t).

(3.8)

i=1

P 2 Moreover, ∞ i=1 λi < ∞ and λi ≥ 0. We denote by N (K) the kernel of K, that is, the functions in L2 [0, 1] such that K(f ) = 0. If N (K) = {0}, K is positive definite, injective and all its eigenvalues are positive. From (3.7) and (3.8) we have that in the L2 sense (means in the norm sense) K(f ) =

∞ X

λi ci φi ,

i=1

where ci is the Fourier coefficient of f with respect to the {φi }i basis, i.e. ci = hf, φi i. If the process satisfies the above characteristics, we have the Karhunen-Lo´evethe expansion of the process Xt − µ(t): Xt − µ(t) =

∞ p X

λi ψi φi (t)

(3.9)

i=1

where ψi are independent real random variables with mean 0 and variance λi . Let us note that the generalization of the multivariate normal distribution with covariance matrix the identity does not have a place here since the kernel of such operator will be the delta function (not a function and not in L2 [0, 1]). Moreover, to define such a stochastic process, we need to work with stochastic processes in Banach spaces. More precisely, we need to consider the domain to be C[0, 1] and since this is not a Hilbert space (as in the case of L2 [0, 1]) we have to see everything in a completely different way to continue with the theory.

Further work

29

Gaussian measures on L2 [0, 1] (An example of the above) Gaussian processes defined in Hilbert spaces (here in particular in L2 [0, 1]) are the natural generalization of normal random variables in Rp . Definition: A stochastic process X ∈ L2 (Ω × [0, 1]) is said to be Gaussian if any finite linear combination of the real variables Xt , t ∈ [0, 1], is a real Gaussian random variable. Some equivalent definitions: • X ∈ L2 (Ω×[0, 1]) with mean E(X) and covariance operator Σ is Gaussian iff Xh = hX(ω, .), hi R1 is a real random variable for each h ∈ L2 [0, 1] where hX(ω, .), hi = 0 X(ω, t)h(t)dt. The mean of Xh will be hE(X), hi and the variance hh, Σhi. • X ∈ L2 (Ω × [0, 1]) with mean E(X) and covariance operator Σ is Gaussian iff the Fourier transform of its associated meassure is 1

Fµ (h) = eihE(X),hi− 2 hΣh,hi . In the finite dimensional case we know that any linear transformation of a normal random variable is a random variable. Here we have a similar result: Proposition: If X ∈ L2 (Ω × [0, 1]) is Gaussian with mean µ and covariance operator H, and T : L2 [0, 1] → Rd is linear and bounded (with the usual norms) then H(X) is a normal distribution with mean H(µ) and covariance matrix given by the matrix generate by the operator T HT T where T T represent the adjoint operator of T T defined from Rd to L2 [0, 1]. An example of a Gaussian Process in L2 [0, 1] The standard Wiener process is an example of a Gaussian Process in L2 [0, 1]. (It can be defined in a more general Banach space too). The covariance operator will be R(s, t) = min(s, t). In this case, the covariance operator maps L2 [0, 1] into his range R(K) characterized by Z R(K) = {m : m(t) =

t

γ(s)ds γ ∈ L2 [0, 1]}.

0

The space R(K) plays a fundamental roll in Gaussian Processes and R(K) is itself a Hilbert space. If f ∈ R(K), then f = K(u) for some u ∈ L2 [0, 1] and therefore we can define ||f ||RKHS = ||u||2 . Since there can be more than one element u in L2 [0, 1] with such property, we need to define the Hilbert space as equivalent classes, or we can consider N (K) = {0} or restrict our attention to the orthogonal complement of N (K).

30

Chapter 3 In our case, R(K) is the Sobolev space: R(K) = {f ∈ L2 [0, 1] : f (1) ∈ L2 [0, 1] and f (0) = 0}.

One special characteristic of these particular Hilbert spaces is the reproductive property: there exists a function that generates all the other functions. In this example, if we consider R(s, t) = min(s, t) then R(., t) = R(t, .) ∈ R(K) such that for any f ∈ R(K), Z f (t) =

1

R(s, t)f (s)ds. 0

In particular R(s, t) =

R1 0

R(s, h)R(h, t)dh.

More about Gaussian processes In the finite dimensional context, covariance operators needs to be inverted in order to consider, for example, density functions. Sometimes it will be the case that we need to consider the inverse of such operator, but in the functional case this task is more complicated since the inverse of a compact operator is not in general a bounded, well defined operator. In Appendix C we define the generalized inverse of a compact operator and later if N (K) = {0} we extend the operator K to a bigger set H + ⊃ L2 [0, 1] in such a way that the inverse can be defined now for any f ∈ L2 [0, 1]. It should be clear than in that case K † (the general generalized inverse) is not a function anymore, but a distribution.

3.7.3

Some models for functional data

The classical techniques of linear regression, analysis of variance, and linear modeling, all investigate the way in which variability in observed data can be accounted for by other known or observed variables. They can all be placed within the framework of the general linear model y = Zβ + ² where y can be a vector or a matrix and Z can be observed covariates or independent variables. In the case of functional data, this can be given in the independent variable Z or/and in the dependent variable y. The best known case is the one where the response is a scalar and the covariates are functions. For this case the model presents the form y = α + hZ, βi + ²; here y, α and ² (the error) are numbers, and Z and β are functions on L2 [0, 1] and h, i represent the usual interior product in L2 [0, 1].

Further work

31

In general, given a vector of responses y = (y1 , . . . , yn )T and a vector of functions Z = (Z1 (t), . . . , Zn (t))T , the goal will be to find a number α b and a function βb ∈ L2 [0, 1] such that LM SS(α, β) =

n X b 2 (yi − α b − hZi , βi) i=1

is minimum. The results: • Order of convergence for βˆ to β. The best result was obtained by Hall and Horowitz (2007), and the order of convergence (in L2 [0, 1] norm and under specific conditions) is nγ with γ < 21 . This means that the estimator is not square root of n consistent in L2 norm. Moreover, they show some examples where these are the best possible results. ˆ x • Once we estimate β, the estimation of α follows from the equation α ˆ = y¯ − hβ, ¯i. Therefore, since the interior product smooths things, the convergence of α ˆ to α cannot be worse than ˆ the convergence of β to β. • Order of convergence of a new prediction. Again, since the process of taking an interior product smooths things, the order of convergence of a new prediction is better than the order of convergence of the parameters, as it was showed by Cai and Hall (2006). But to get square root of n consistency they need to predict a new y from a curve x that has more smoothness than the sample curves they consider to estimate β and α. Clearly, for the exposition above, there are a lot of unknowns in the theory of functional data analysis even for the most elemental case: the linear case with scalar response. In the context of dimension reduction with functional predictors and scalar response, Sliced inverse regression (SIR) was introduced by Dauxois, Ferr´e and Yao, A. F. (2001) and later developed by Ferr´e and Yao (2003, 2005). There are several recent papers about Principal Component, SIR and other extensions. (Amato, Antoniadis and De Feis, 2006; Ren, 2005 among others) Other linear models have the response being functional with multivariate covariates, or both the response and the covariates as functions. Those cases are less known and a lot of their theories are to be done yet.

3.7.4

The proposal (new stuff )

I will consider here the functional version of the model consider by Cook (2007). In this model one has a real response variable Y to be predicted by curve X ∈ L2 [0, 1]. The idea of dimension reduction without loosing information will be to find a function T : L2 [0, 1] → Rd such that the distribution of Y given Y |X ∼ Y |T (X), where Y |X means the conditional distribution of Y given the curve X.

32

Chapter 3

Assumptions R1 1. X ∈ L2 (Ω × [0, 1]) = {X(t) : Ω → R, t ∈ [0, 1], E 0 X 2 (t)dt < ∞} where (Ω, F, P ) denotes R1 R R1 a complete probability space and E 0 X 2 (t)dt = Ω 0 X 2 (t)dtdP .

2. The pair (Y, X) is a random variable that takes values in R × L2 [0, 1]

3. Y is a random variable that chooses one of h populations with probability fy = P (Y = y), y = 1, . . . , h.

4. Given the population y, the stochastic process Xy ∼ X|(Y = y) for each value of Y = y has a Hilbert-Smith covariance operator K with kernel equal to {0}. For the mean of Xy we assume that for each t ∈ [0, 1], E(Xy )(t) = Γ(vy )(t).

(3.10)

• Assumptions for Γ:

Γ : Rd → L2 [0, 1] Γ(u1 , . . . , ud )(t) =

(3.11) d X

ui G(t, i),

i=1

for some functions G : [0, 1] × {1, . . . , d} → R with G(., i) ∈ L2 [0, 1], for i = 1, . . . , d. We assume that the operator Γ has an adjoint ΓT , where we consider L2 [0, 1] and Rd with their usual inner products. ΓT : L2 [0, 1] → Rd , is defined by Z T

Γ (v)i =

1

G(s, i)v(s)ds. 0

Proof: Let us check that ΓT is the adjoint of Γ. For any u ∈ Rd and for all v ∈ L2 [0, 1]

Further work

33

we have T

(u, Γ v) =

d X

Z ui

i=1

Z =

d 1X

0

Z =

1

G(s, i)v(s)ds 0

ui G(s, i)v(s)ds

i=1 1

Γu(s)v(s)ds 0

= hΓ(u), vi. We need to ask an additional condition on Γ that is trivial in the finite dimensional case and that always appears in the functional data case. The condition is that R(Γ) ⊂ R(K) where R(H) indicates the range of the operator H. (See Appendix C for details). This implies that for each i = 1, . . . , d we have G(., i) ∈ R(K). P • Assumptions for vy : vy ∈ Rd for each y = y1 , . . . , yh and hy=1 vy = 0. We ask that the variance matrix of vy be positive definite, which means that Var(vY ) > 0, but is otherwise unconstrained. This model specifies that the conditional means of X|y fall in the d-dimensional subspace of spanned by R(Γ). The vector vy contains the coordinates of the translated conditional mean E(X|Y = y) relative to a basis of R(Γ). The mean function is quite flexible, even when d is small. Aside from the subscript y, nothing on the right side of this model is observable. L2 [0, 1]

The condition in the covariance operator is restricted only in the sense that the covariance operator of the curves X|y has to be independent of y. An example During the growth of grain crops it is important to determine which parts of the land need nutrients. The process of getting that information can be very expensive and slow since the study of samples of the soil in different parts of the land is needed. Some times the nutrients are added without study, with the consequent problems of money and overuse. With the advance of satellite images it is possible to get pictures of the grain crops in different steps of the process. After processing the pictures it is possible to get for a fine grid of points, a curve that represents the spectra of the colors. The colors of the soil are connected with the amount of nutrients that the soil has. The goal is to get rid of the lab work and to be able to predict from the pictures, that makes possible to do the curves with the spectra of the colors, where the soil needs some kind of nutrients. A model that can be good for these kind of data is to consider that the curves X are the spectral of the colors and Y represent percentage of some nutrient (for example nitrogen). If the curve given the percentage of nutrient where Y can be thought categorical has covariance operator independent of the percentage of the nutrient and the mean lives in a space of finite dimension, the model can be

34

Chapter 3

written as there exists an operator Γ : Rd → L2 [0, 1], such that E(Xy )(t) = Γ(vy )(t), and the stochastic curves Xy have a Hilbert-Smith covariance operator K with kernel equal to {0}. From the theory of graphical devices, may be is possible to model νy . We will prove below that for the regression of Y |X, i.e., to predict nitrogen from the colors we only need ΓT ∆−1 X, so that instead of the whole curve we will only need d vectors.

Some results

Proposition 1: Under assumptions (1), (2), (3), and (4) we have Y |X ∼ Y |ΓT K † X ∼ Y |AΓT K † X

(3.12)

for all A invertible operator (independent of Y ) from Rd to Rd (i.e. an invertible matrix). Here we interpret K † as the general generalized inverse of K (See Appendix C for details). Note: For X ∈ L2 [0, 1] we have K † (X) ∈ L(R(K)). Now, by the assumptions over Γ, we can extend the operator ΓT from L2 [0, 1] to L(R(K)) in the following way: since span(Γ) ⊂ R(K) we have that for each i = 1, . . . , d, G(t, i) = KH(t, i) for H(., i) ∈ L2 [0, 1]. Now, since L2 [0, 1] is dense in L(R(K) we have that for each g ∈ L(R(K)) there exists a sequence gn ∈ L2 [0, 1] such that ||gn ||− = ||K(gn )|| → ||g||− , as n → ∞. Therefore we define for g ∈ L(R(K)) and for each i = 1, . . . , d, Z T

Γ (g)i = lim

n→∞ 0

1

gn G(t, i)dt

The definition is correct since Z | 0

Z

1

gn G(t, i)dt| = |

0

1

Kgn H(t, i)dt|

≤ ||K(gn )||||H(., i)|| ≤ ||K(gn )||||H(., i)||.

Further work

35

The new operator is a linear, bounded operator from L(R(K)) to Rd . The operator ΓT K † : L2 [0, 1] → Rd is then linear and bounded from L2 [0, 1] into Rd . In fact, for X ∈ L2 [0, 1], |(ΓT K † X)i | ≤ ||K † (X)||− = ||X||. Therefore, since Xy ∈ L2 [0, 1] is a Gaussian Process with mean Γvy and covariance operator K, we have that ΓT K † Xy is a Gaussian in Rd with mean ΓT K † Γvy and covariance matrix given by the kernel of the operator ΓT K † Γ, defined from Rd to Rd . Now, the transformation ΓT K † Γ is an invertible operator and we can then define the operator Γ(ΓT K † Γ)−1 from Rd to Rd . From the properties of Γ this operator is one to one and onto; this means that it is an invertible operator that can be identify with a d × d matrix. Now we consider in R(K) the norm defined as follows (see Appendix C): for f ∈ L(R(K)), there exists an unique u ∈ L2 [0, 1] such that Ku = f ; then ||f ||− = ||u||2 . The operator defined from L2 [0, 1] onto span ΓT K † as T = Γ(ΓT K † Γ)−1 ΓT K † is the orthogonal projection onto span ΓT K † with this norm ||.||− . Proof of Proposition 1: Let us decompose X onto two random variables ξ and λ where ξ is the orthogonal projection onto the span of ΓT K † using the norm − and λ on the orthogonal complement of span of ΓT K † . Then X|y = ξ|y + λ|y. Now we prove that the random variable λ|y does not depend on y. In fact, T Γvy = Γvy and therefore λ|y is a Gaussian random variable with mean 0 and covariance operator independent of y. Therefore λ|y does not depend on y, which means that λ is independent of y. Now, y|X = y|ξ + λ = y|ξ, since λ is independent of y. Since ξ = BΓT K † X with B = Γ(ΓT K † Γ)−1 is an invertible matrix independent of y, therefore y|ξ ∼ y|ΓT K † X ∼ y|AΓT K † X, for any A invertible d × d matrix. We propose to study the estimators for the parameters under this setting. There are two points to take into account. One is that the maximum likelihood estimators usually do not exist in this context. In general they have to be replaced by sieve estimators, Beder (1988). In our case, since we assume that Γ has finite rank, we will be able to estimate it using maximum likelihood estimators;

36

Chapter 3

but for µ and ∆ we should use sieve estimators. The other point is that in general from what we said above in the case of functional data analysis, root-n consistency of the estimators is not expected, since estimation involves the inverse of the covariance operator that has infinite rank. For functional sliced inverse regression (SIR; Li, 1991), Ferr´e and Yao (2005) stated that estimation can be done in such a way that the estimator is root-n consistent. Forzani and Cook (2006) proved that their results contains an implicit assumption that the covariance operator Σ of the covariable X has a structured form. More specifically, under that assumption, Σ includes eigenfunctions of the covariance operator Γe of E(X|Y ) or of the orthogonal complement of Γe . The sufficient subspace can then be estimated as the span of the eigenfunctions of Γe , and therefore root-n consistency follows from the root-n consistency of Principal Component Analysis for functional data (Dauxois, Pousse, and Romain, 1982). This says that, in this context, we do not expect root-n consistency for the estimations, and the asymptotic results are much more complex than in the finite dimensional case, even for the Gaussian case.

Appendix A

Proofs of the results from Chapter 2 To be able to prove Theorem 2.2.1 we need a couple of lemmas. q1 q2 + + Lemma A.0.1 (Eaton, 1972, proposition 5.41) Consider the spaces S+ p and Sq1 × Sq2 × R where q1 + q2 = p. Consider the transformation

à S=

S11 S12 S21 S22

!

à =

V11 (V12 V11 )T

V12 V11 T V VT V22 + V12 11 12

!

−1 + q1 q2 . Thus S where V11 ∈ S+ 11 = V11 , V12 = S11 S12 and V22 = q1 , V22 ∈ Sq2 and V12 ∈ R + + q1 q2 → S+ is 1-1 and onto. S22 − S21 S−1 p 11 S12 so that the map g : Sq1 × Sq2 × R

A modification of the following lemma can be found on Anderson (1983) Theorem A.4.7. Lemma A.0.2 Let A a diagonal positive definite matrix with eigenvalues 0 < a1 ≤ · · · ≤ ar and B a symmetric positive definite matrix with eigenvalues b1 ≥ · · · ≥ br > 0 and let H be a r × r orthogonal matrix. Then min trace(HBHT A) = H

r X

ai bi .

i=1

From the proof of that lemma it can be shown that if ai and bi are all different the minimum will occur only when H = Ir . Lemma A.0.3 Let S, J ∈ S+,0 and we consider the partitions p à S=

S11 S12 S21 S22

!

à , J=

J11 J12 J21 J22

! ,

+ where S11 , J11 ∈ S+ q1 with q1 ≤ p. Suppose J11 ∈ Sq1 , J12 = 0 and J22 = 0, then the eigenvalues of S11 J11 are the same than the nonzero eigenvalues of SJ.

37

38

Appendix A

Proof of Lemma A.0.3: Let v be an eigenvector with eigenvalue λ 6= 0 of SJ. Let V1 be the first q1 rows of v. Then S11 J11 V1 = λV1 and S21 J11 V1 = λV2 . Thus, V1 is eigenvector of S11 J11 with eigenvalue λ.

Proof of Theorem 2.2.1: From what follows f is used as a generic function whose definition changes and is given in context. We will make a series of changes of variables to rewrite the problem, the first one being S = D−1 . This transforms (2.4) into b res ) − f (S) = log |S| − trace(SΣ

p X

b fit ). λi (SΣ

(A.1)

i=d+1 1/2

1/2

b res SΣ b res so that the problem becomes that of maximizing Now let U = Σ f (U) = log |U| − trace(U) −

p X

b fit Σ b −1/2 ). b −1/2 Σ λi (UΣ res res

(A.2)

i=d+1

b −1/2 b b −1/2 has rank τ = min(r, p) and is symmetric and semipositive definite, we can Since Σ res Σfit Σres use the singular value decomposition and write it as b −1/2 = VΛτ VT b fit Σ b −1/2 Σ Σ res res where V ∈ Rp×p is an orthogonal matrix and Λτ = diag(λ1 , . . . , λτ , 0, . . . , 0), with λ1 ≥ λ2 ≥ b res and Σ b fit are known, so are the matrices V and Λτ . Using this notation, . . . λτ > 0. Since both Σ (A.2) becomes f (U) = log |U| − trace(U) −

τ X

λi (UVΛτ VT ).

(A.3)

i=d+1

Let us change variables once more, to H = VT UV ∈ S+ p . Problem (2.4) is then equivalent to maximizing f (H) = log |H| − trace(H) −

τ X

λi (HΛτ ).

(A.4)

i=d+1

We now partition H as à H=

H11 H12 HT12 H22

!

+ with H11 ∈ S+ τ , H22 ∈ Sp−τ (for p = τ we take H = H11 and go directly to (A.7)). Consider the + τ ×(p−τ ) given by V + transformation in Lemma A.0.1 from S+ 11 = H11 , p to the space Sτ × Sp−τ × R −1 −1 T V22 = H22 − H12 H11 H12 and V12 = H11 H12 . This transformation is one to one and onto. As a

Proofs of the results from Chapter 2

39

function of V11 , V22 and V12 , using Lemma A.0.3, (A.4) can be written as f (V11 , V12 , V22 ) = log |V11 ||V22 | − trace(V11 ) − trace(V22 ) τ X T ˜ τ ), −trace(V12 V11 V12 ) − λi (V11 Λ

(A.5)

i=d+1

˜ τ = diag(λ1 , . . . , λτ ). where Λ

In (A.5), the only term where V12 appears is T −trace(V12 V11 V12 ). T V V Since V11 is positive definite, V12 11 12 is semipositive definite. This, the maximum occurs when b b V12 = 0. This implies that H12 = 0, H11 = V11 and H22 = V22 ; therefore we can come back to the expression on H and our problem reduces to maximizing the function

f (H11 , H22 ) = log |H11 | + log |H22 | − trace(H11 ) − trace(H22 ) −

τ X

˜τ) λi (H11 Λ

(A.6)

i=d+1

This function is the sum of two parts, one involving only H11 , the other only H22 , that we can therefore maximize separately.

b 22 = Ip−τ . As for The maximum of log |H22 | − trace(H22 ) is reached at H f (H11 ) = log |H11 | − trace(H11 ) −

τ X

˜ τ ), λi (H11 Λ

(A.7)

i=d+1

˜ 1/2 ˜ 1/2 leads us to maximize calling Z = Λ τ H11 Λτ ˜ −1 ) − f (Z) = log |Z| − trace(ZΛ τ

τ X

λi (Z).

(A.8)

i=d+1

Since Z ∈ S+ τ , there exists an F = diag(f1 , . . . , fτ ) with fi > 0 in decreasing order and an orthogonal matrix W in Rτ ×τ such that Z = WT FW. As a function of W and F, (A.8) can be rewritten as ˜ −1 ) − f (F, W) = log |F| − trace(W FWΛ τ T

˜ −1 WT ) − = log |F| − trace(FWΛ τ

τ X i=d+1 τ X i=d+1

fi fi .

40

Appendix A

Now, using Lemma A.0.2, τ X

˜ −1 WT ) = min trace(FWΛ τ W

fi λ−1 i .

(A.9)

i=1

Knowing this, we can rewrite the problem one last time, as that of maximizing in (f1 , . . . , fτ ), all greater than zero, the function f (f1 , . . . , fτ ) =

τ X

log fi −

τ X i=1

i=1

τ X

fi λ−1 i −

fi .

(A.10)

i=d+1

Clearly the maximum will occur at fbi = λi for i = 1, . . . , d

fbi =

λi for i = d + 1, . . . , τ. λi + 1

Since λi are positive and decreasing order, fbi are positive and decreasing in order. Since all the λi c = Iτ and therefore are different, the fbi are different, and the minimum on (A.9) will occur when W b = F. b Collecting all the results, the value of D that maximizes (2.4) is Z b = S b −1 ∆ b 1/2 U b −1 Σ b 1/2 = Σ res res b 1/2 VH b −1 VT Σ b 1/2 = Σ res res à =

b 1/2 V Σ res

¯ 1/2 Z b −1 Λτ ˜ 1/2 Λτ O(p−τ )×τ

 

Id     0      ... b 1/2 V  = Σ  res  ...   

Oτ ×(p−τ ) Ip−τ ×(p−τ )

!

0

0 ... λd+1 + 1 0 ... .. . ... ... 0 0 λτ + 1

b 1/2 VT Σ res 



    

     T b 1/2  V Σres    

O(p−τ )×τ 

0d   0  b res + Σ b 1/2 V  = Σ  ... res   0  0

0 λd+1 ... 0 0

0 0 .. . λτ 0

... ... ... 0 0p−τ ×(p−τ )

Oτ ×(p−τ )

Ip−τ ×(p−τ )      T b 1/2  V Σres .   

Proofs of the results from Chapter 2

41

Proof of Corollary 2.2.2: The log likelihood function is proportional to the function f b given by (2.5). defined on (2.4) therefore the maximum of the log likelihood function is attend at ∆ Denoting again f the log likelihood function we get b f (∆) = −

np n b − n trace(∆ b −1 Σ b res ) log(2π) − log |∆| 2 2 2 τ n X b −1 Σ b fit ). − λi (∆ 2

(A.11)

i=d+1

Now, the trace and the eigenvalues are cyclic operations therefore b −1/2 ) b −1 Σ b res ) = trace(∆ b −1/2 Σ b res ∆ trace(∆ = trace(V(Ip + K)−1 VT ) τ X 1 = d+ + (p − τ ), 1 + λi

(A.12)

i=d+1

and τ X

b −1 Σ b fit ) = λi (∆

i=d+1

= = =

τ X i=d+1 τ X i=d+1 τ X i=d+1 τ X i=d+1

b −1/2 Σ b fit Σ b −1/2 ) λi (V(I + K)−1 VT Σ res res λi (V(I + K)−1 VT VKVT ) λi ((I + K)−1 K) λi . 1 + λi

(A.13)

b res is an invertible operator we have Since Σ b = log |Σ b 1/2 V(I + K)VT Σ b 1/2 | log |∆| res res b res ||V(I + K)VT | = log |Σ b res ||(I + K)| = log |Σ τ X b = log |Σres | + log(1 + λi ). i=d+1

(A.14)

42

Appendix A

Plugging (A.12), (A.13) and (A.14) on (A.11) we get τ np n n n X −1 b b b b b −1 Σ b fit ) f (∆) = − log(2π) − log |∆| − trace(∆ Σres ) − λi (∆ 2 2 2 2 i=d+1

τ X np n n b res | − n = − log(2π) − log |Σ log(1 + λi ) − (p − τ + d) 2 2 2 2 i=d+1



n 2

τ X i=d+1

n 1 − 1 + λi 2

τ X i=d+1

λi 1 + λi

τ np n n X n n b = − log(2π) − log |Σres | − log(1 + λi ) − (p − τ + d) − (τ − d) 2 2 2 2 2 i=d+1

= −

τ X np np n b res | − n − log(2π) − log |Σ log(1 + λi ). 2 2 2 2 i=d+1

To prove Corollary 2.2.3 we need a lemma. 1/2 where M = (I +K)−1 , with V and K as in Theorem 2.2.1. ˜ =Σ b −1/2 Lemma A.0.4 Let V res VM p 1/2 −1/2 b ˜ b b fit ∆ b −1/2 Then ∆ V are the normalized eigenvectors of ∆ Σ

Proof of Lemma A.0.4: b = Σ b res + Σ b 1/2 VKVT Σ b 1/2 ∆ res res b 1/2 V(Ip + K)VT Σ b 1/2 . = Σ res res Then, b −1 = Σ b −1/2 V(Ip + K)−1 VT Σ b −1/2 ∆ res res b −1/2 VMVT Σ b −1/2 . = Σ res res

b b −1/2 we get b −1/2 Using the fact that V are the eigenvectors of Σ res Σfit Σres b −1 Σ b fit V ˜ = Σ b −1/2 VMVT Σ b −1/2 Σ b fit Σ b −1/2 VM1/2 ∆ res res res b −1/2 VM∆τ M1/2 = Σ res ˜ 1/2 ∆τ M1/2 = VM ˜ = VM∆ τ

Proofs of the results from Chapter 2

43

λd+1 λτ ˜ where M∆τ = diag(λ1 , . . . , λd , λd+1 +1 , . . . , λτ +1 , 0, . . . , 0). Therefore V are the eigenvectors of b −1 Σ b fit , with eigenvalues λ1 , . . . , λd , λd+1 , . . . , λτ , 0, . . . , 0 and ∆ λd+1 +1

λτ +1

˜T∆ bV ˜ = M1/2 VT Σ b −1/2 ∆ bΣ b −1/2 VM1/2 V res res = M1/2 VT V(Ip + K)VT VM1/2 = M1/2 M−1 M1/2 = Ip . b fit ∆ b −1/2 ˜ are the normalized eigenvectors of ∆ b −1/2 Σ b 1/2 V Summarizing ∆ Proof of Corollary 2.2.3: As we said before the maximum likelihood estimator for a basis b fit ∆ b −1/2 . Now, from b −1/2 times the first d eigenvectors of ∆ b −1/2 Σ of span(∆−1 Γ) is the span of ∆ b −1/2 ∆ b 1/2 V ˜ =V ˜ is the maximum likelihood estimator Lemma A.0.4, span of the first d columns of ∆ −1/2 ˜ =Σ b res VM1/2 and M is diagonal full rank with the first d elements for span(∆−1 Γ). Since V ˜ is the same of the first d columns of Σ b −1/2 equal 1, the span of the first d columns of V res V where −1/2 b b −1/2 b V are the eigenvectors of Σres Σfit Σres . Proof of Corollary 2.2.4: A proof of this corollary can be found in Cook (2007) and if follows from the fact that the eigenvectors of Σ−1 Σfit and Σ−1 res Σfit are identically, with corresponding λi eigenvalues λi and 1−λi . The corollary follows now from the relation between the eigenvectors of 1

1

the product of the symmetric matrices AB and the eigenvectors of A 2 BA 2 . b =Σ b res + Σ b fit . Proof of Corollary 2.2.5: Follows from Corollary 2.2.3 and from the fact that Σ

Proof of Lemma 2.3.1: Y X2 |X1 if and only if Y X|X1 . Suppose it is true that Y X|X1 . From Lemma 1.4.2, ΓT ∆−1 X is the minimal sufficient reductive subspace for the regression of Y on X, thus ΓT ∆−1 X should not depend on X2 . Now, Ã T

−1

Γ ∆

X=

(ΓT1 ∆11 + ΓT2 ∆21 )X1 (ΓT1 ∆12 + ΓT2 ∆22 )X2

! (A.15)

will not depend on X2 if and only if ΓT1 ∆12 + ΓT2 ∆22 = 0 equivalently Γ2 = −∆−22 ∆21 Γ1 . The reciprocal follows directly if we replace Γ2 by −∆−22 ∆21 Γ1 on equation (A.15). To proof Lemma 2.3.2 we need the following result

Lemma A.0.5 Proposition 5.21 (Eaton, 1972) Let à ∆=

∆11 ∆12 ∆21 ∆22

!

44

Appendix A

be a p × p matrix where ∆ii is square for i = 1, 2. Let à −1



=

∆11 ∆12 ∆21 ∆22

! .

Then, −1 ∆11 = (∆11 − ∆12 ∆−1 22 ∆21 ) −1 ∆12 = −(∆11 − ∆12 ∆22 ∆21 )−1 ∆12 ∆−1 22 −1 ∆22 = (∆22 − ∆21 ∆−1 11 ∆12 ) −1 −1 ∆21 = −(∆22 − ∆21 ∆−1 11 ∆12 ) ∆21 ∆11 .

Remark A.0.6 From Lemma A.0.5 condition (2.8) from Lemma 2.3.1 is equivalent to Γ2 = ∆21 ∆−1 11 Γ1 .

(A.16)

Proof of Lemma 2.3.2: The function we need to maximize on G and D−1 (see (2.2)) is b + trace(D−1 PG(D−1 ) Σ b fit ). f (G, D−1 ) = log |D−1 | − trace(D−1 Σ)

(A.17)

From the hypotheses on Γ, we have ΓT ∆−1 = (ΓT1 ∆11.2 , 0) where ∆11.2 = ∆11 − ∆12 ∆−22 ∆21 . Then ΓT ∆−1 Γ = ΓT1 ∆11.2 Γ1 . We called G1 the matrix form for the first p1 rows of G. For fixed D the function on G1 b 11,fit ), trace(D11.2 G1 (G1 T D11.2 G1 )−1 G1 T D11.2 Σ is maximized by choosing (D11.2 )1/2 G1 to be a basis for the span the first d eigenvectors of b 11,fit (D11.2 )1/2 , yielding another partially maximized log likelihood (D11.2 )1/2 Σ b + f (D−1 ) = log |D−1 | − trace(D−1 Σ)

d X

b 11,fit (D11.2 )1/2 ). λi ((D11.2 )1/2 Σ

(A.18)

i=1

Let us take the one to one, onto transformation from Lemma A.0.1 defined by L11 = ∆11 − ∆12 ∆−22 ∆21 , L22 = ∆22 and L12 = ∆12 ∆−22 . On L11 , L22 , L12 we get the function b 22 + L22 LT Σ b f (L11 , L22 , L12 ) = log |L11 | + log |L22 | − trace(L22 Σ 12 12 ) b 11 + L12 L22 Σ b 21 ) + −trace((L11 + L12 L22 LT12 )Σ

d X i=1

1/2 b 1/2 λi (L11 Σ 11,fit L11 )

Proofs of the results from Chapter 2

45

Now, differentiating with respect to L12 in the last expression, we get that ∂f ∂L12 ∂2f ∂L212

b 12 L22 − 2Σ b 11 L12 L22 = −2Σ b 11 ⊗ L22 = −2Σ

∂f b −1 Σ b = 0 for L12 the maximum occurs when L12 = −Σ 11 12 . Replacing this in ∂L12 the last log likelihood function we get to the problem of maximizing on L11 and L22 the function

Therefore solving

b 22 ) − trace(L11 Σ b 11 ) f (L11 , L22 ) = log |L11 | + log |L22 | − trace(L22 Σ b 12 Σ b −1 Σ b +trace(L22 Σ 11 12 ) +

d X

1/2 b 1/2 λi (L11 Σ 11,fit L11 ),

(A.19) (A.20)

i=1

b −1 Σ b since for L12 = −Σ 11 12 , b 12 ) − trace(L12 L22 LT Σ b b b −1 b −2trace(L22 LT12 Σ 12 11 ) = trace(L22 Σ12 Σ11 Σ12 ). b −1 so that we need to maximize on L11 the function The maximum on L22 is at L22 = Σ 22.1 b 11 ) + f (L11 ) = log |L11 | − trace(L11 Σ

d X

1/2 b 1/2 λi (L11 Σ 11,fit L11 ).

(A.21)

i=1

From Theorem 2.2.1 the maximum likelihood estimator for L11 will be such that T b 1/2 b 1/2 L−1 11 = Σ11,res V(Id + K)V Σ11,res ,

with K = diag(0, . . . , 0, λd+1 , . . . , λτ1 , 0, . . . , 0) and V and λ1 , . . . , λτ1 , 0, . . . , 0 the eigenvectors and b −1/2 b −1/2 Σ b eigenvalues respectively of Σ 11,res 11,fit Σ11,res and τ1 = min(r, p1 ).

Now, L11 = ∆11 − ∆12 ∆−22 ∆21 = ∆−1 11 and we get as the maximum likelihood estimator for ∆11

b 11 = Σ b 1/2 V(Id + K)VT Σ b 1/2 , ∆ 11,res 11,res

(A.22)

with K = diag(0, . . . , 0, λd+1 , . . . , λτ1 , 0, . . . , 0) and V and λ1 , . . . , λτ1 , 0, . . . , 0 the eigenvectors and b −1/2 Σ b b −1/2 eigenvalues respectively of Σ 11,res 11,fit Σ11,res .

46

Appendix A

b −1 and −Σ b −1 Σ b b −1 The maximum likelihood estimators for ∆22 and ∆12 will be Σ 22.1 11 12 Σ22.1 , respectively. For the ∆ scale we get for ∆12 b 12 = ∆ b 11 Σ b −1 Σ b ∆ 11 12 and for ∆22 , b 22 = Σ b 22.1 + Σ b 21 Σ b −1 ∆ b 11 Σ b −1 Σ b ∆ 11 11 12 . Now, the maximum likelihood estimator for a basis of ∆−1 Γ = (∆11.2 Γ1 , 0)T b −1/2 Γ1 , 0)T with Γ1 the first d eigenvectors of ∆ b −1/2 Σ b 11,fit ∆ b −1/2 . will be span (∆ 11 11 11 Using the same logic than in Corollary 2.2.3 it can be proved that the maximum likelihood estib −1/2 Γ1 , 0)T , with Γ1 the first d eigenvectors mator for a base of span(∆−1 Γ) will in this case be (Σ 11,res −1/2 b −1/2 b b of Σ Σ11,fit Σ . 11,res

11,res

Appendix B

Algorithm for the Diagonal case The function to be maximize on D, with D ∈ S+ p diagonal is given by b res ) − f (D) = − log |D| − trace(D−1 Σ

p X

b fit ) λi (D−1 Σ

(B.1)

i=d+1

b res and Σ b fit positive definite, symmetric and known, and λi (A) indicate the i−th eigenvalue where Σ of the matrix A. Now since D is an structured matrix, its derivative should be calculated using the rule of change given by · ¸ ∂f ∂f 0 ∂D = trace [ ] , ∂Dij ∂D ∂Dij ∂f = 0 and denoting Jii the matrix with all entries equal 0 except for the ii then for i 6= j, ∂D ij element equal to 1 we have

∂f ∂Dii

·

∂f 0 ∂D = trace [ ] ∂D ∂Dij ¸ · ∂f 0 ii ]J = trace [ ∂D ∂f = ∂Dii

¸

We denote S = D−1 . Taking derivatives we get p X ∂(f (S)) −1 b fit )Σ b fit b b fit )u0 (SΣ = S − diag(Σres ) − diagvi (SΣ i ∂S

(B.2)

i=d+1

where ui and vi indicate respectively the right and left eigenvectors corresponding to the eigenvalue 47

48

Appendix B

b fit normalized so that v uj = δij . Therefore we have λi of SΣ i 0

∂(f (S)) =0 ∂S if and only if −1

S

p X

b res ) + = diag(Σ

³ ´ b fit )u0 (SΣ b fit )Σ b fit diag vi (SΣ i

(B.3)

i=d+1

b fit D−1/2 corresponding b fit S1/2 = D−1/2 Σ Now, let us denote by u ¯i the i−th eigenvector of S1/2 Σ to the λi eigenvalue (in decreasing order) normalized with unit norm. We get the relations • ui = S1/2 u ¯i = D−1/2 u ¯i • vi = S−1/2 u ¯i = D1/2 u ¯i b fit = λi u • u0i Σ ¯0i D1/2 • vi0 uj = δij and we can write (B.3) as

b res ) + D = diag(Σ

p X

³ ´ λi diag D1/2 u ¯i u ¯0i D1/2

i=d+1

The algorithm will be:

b res ) • D0 = diag(Σ ¯ k = D−1/2 • D k b fit D ¯ kΣ ¯k • u ¯ki and λki , i = d + 1, . . . , r the i eigenvector and eigenvalue respectively of D • Dk+1 = diag(Σres ) +

Pp

1/2 k 0 k k u ¯i u ¯i Dk 1/2 ) i=d+1 λi diag(Dk

(B.4)

Appendix C

The generalized inverse Here, we provide the basic theory of best-approximate solutions of linear operator equations and of the (Moore-Penrose) generalized inverse of linear operators between Hilbert spaces. Definition: Let K ∈ L(X, Y ) where X and Y are Hilbert spaces. 1. x ∈ X is called least-squares solution of Kx = y if ||Kx − y|| = inf{||Kz − y||, x ∈ X} 2. x ∈ X is called best-approximate solution of Kx = y if x is a least-squares solution of Kx = y and ||x|| = inf{||z|| : z is least-squares solution of Kx = y}

(C.1)

holds. The best-approximate solution (which is unique) is defined as the least-squares solution of minimal norm. The notion of a best-approximate solution is closely related to the Moore-Penrose (generalized ) inverse of K, which will turn out to be the solution operator mapping y onto the best-approximate solution of Kx = y. The way to define the Moore-Penrose inverse is to restrict the domain and range of K in such a way that the resulting restricted operator is invertible; its inverse will then be extended to its maximal domain: Definition: The Moore-Penrose generalized inverse K † of K ∈ L(X, Y ) is defined as the unique ˜ −1 to linear extension of K D(K † ) := R(K) ⊕ R(K)⊥

(C.2) 49

50

Appendix C

with N (K † ) = R(K)† ,

(C.3)

where ˜ := K|N (K)⊥ : N (K)⊥ → R(K). K

(C.4)

˜ = {0} and R(K) ˜ = R(K), K ˜ −1 exists. Due to (C.3) and the K † is well defined since N (K) requirement that K † is linear, for any y ∈ D(K † ) with the unique representation y = y1 + y2 , ˜ −1 y1 . y2 ∈ R(K), y1 ∈ R(K)⊥ , T † y has to be K Properties or the generalized inverse: (a proof can be found in [? ]). Let P and Q be the orthogonal projectors onto N (K) and R(K), respectively. Then, R(K † ) = N (K)⊥ , and the MoorePenrose equations hold: KK † K = K

(C.5)

K † KK † = K †

(C.6)

K †K = I − P

(C.7)

KK † = Q|D(K † )

(C.8)

Connection between least-square estimator and generalized inverse: For y ∈ D(K † ), Kx = y has a unique best-approximate solution, which is given by x† := K † y. The set of all least-squares solutions is x† + N (K). If y ∈ / D(K † ), no least-squares solution of Kx = y exists. (Note the difference with the finite dimensional case). More properties of the generalized inverse: 1. The Moore-Penrose K † is bounded if and only if R(K) is closed. 2. If R(K) is closed then Y



= R(K) ⊕ R(K)

= R(K) ⊕ R(K)⊥ = D(K † ) Note that when R(K) is finite dimensional (operators with finite rank) the generalized inverse exists in the whole space Y and therefore the least square solution of Kx = y exists for every y ∈Y.

The generalized inverse

51

3. Given K : X → Y linear and compact operator. • If dimR(K) = ∞, K † is a densely defined unbounded linear operator. Therefore approximations to such a generalized inverse by bounded operators must necessarily converge only in the pointwise sense at best. • If dimR(K) < ∞, K † is defined in all Y and is a bounded linear operator. 4. Let (σn ; vn , un ) be a singular system for the compact linear operator K, y ∈ Y . Then we have †

• y ∈ D(K ) ⇔

∞ X |hy, un i|2

σn2

n=1

< ∞.

• For y ∈ D(K † ), K †y =

∞ X hy, un i n=1

C.0.5

σn

vn .

(C.9)

How the extend K † to the whole space

We assume in this section that N (K) = {0} and K is a Hilbert-Smith operator. And we define in R(K) a new norm generated by the interior product: for f, g ∈ R(K), (f, g)k = (K −1 f, K −1 g). It can be proved that R(K) with such a norm is a Banach space. Now we define L(R(K)) = { continuous, antilinear functionals on R(K)}. For each f ∈ L2 [0, 1] we can define the operator Tf : R(K) → R, Tf (u) = (u, f ), and this operator belong to L(R(K)), therefore we get a triple of spaces R(K) ⊂ L2 ⊂ L(R(K)), which is called a Hilbert-Schmidt Rigging. Proposition: L(R(K)) is obtained from L2 [0, 1] by taking the completion with respect to the norm defined by (f, g)= (Kf, Kg), f, g ∈ L2 [0, 1] From here we can extend K : L(R(K)) → L2 [0, 1] as an isometry. And therefore we can define K † : L2 [0, 1] → L(R(K)) as the general generalized inverse of K.

52

References Anderson, T. W. (1971). An Introduction to Multivariate Statistical Analysis. New York: Wiley. Amato, U., Antoniadis, A and De Feis, I. (2006). Dimension Reduction in Functional Regression with Applications. Computational Satistics and Data Analysis 50, 2422-2446. Bair, E., Hastie, T., Debashis,P . and Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association 101, 119-137 Beder, J. H. (1988). A sieve estimator for the covariance of a Gaussian process. The Annals of Statistics 16, 648-660. Berk, K. (1984). Validating Regression Procedures with New Data. Technometrics 26, 331-338 Bogachev, V. I. (1998). Gaussian Measures (Mathematical Surveys and Monographs). American Mathematical Society. Cai, T., Hall, P. (2006). Prediction in functional linear regression. The Annals of Statistics 34, to appear Chen, C. H., and Li, K. C. (1998). Can SIR be as popular as multiple linear regression?. Statistica Sinica 8, 289-316. Chiaromonte, F. Cook, R. D. and Li , B. (2002). Partial dimension reduction with categorical predictors. The Annals of Statistics 30, 475-497. Conway, J. B. (1990). A Course in Functional Analysis, 2nd ed. New York: Springer. Cook, D. (2007). Fisher Lecture: Dimension Reduction in Regression. With discussion. Statistical Science. To appear. Cook. D. (1998). Regression Graphics. Ideas for studying Regressions through Graphics. New York: Wiley. Cook, R. D. (1998). Principle Hessian directions revisited (with discussion). Journal of the American Statistical Association 93, 84-100. Cook, R. D and Forzani, L. (2007). Work in progress. 53

Cook, R. D., Li, B. and Chiaromonte, F. (2006). Reductive Multivariate Linear Models. Preprint. Dauxois, J., Ferre, L, Yao, A. (2001). Un mod´ele semi-param´atrique pour variables al´atoires. C. R. Acad Paris 333, 947-952. Dauxois, J., Pousse, A. and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: Some applications to statistical inference. Journal of Multivariate Analysis 12, 136-154. Eaton, M. (1972). Multivariate Statistical Analysis. Institute of mathematical Statistics. University of Copenhagen. Engl, H., Hanke, M. and Neubaier, A. (2001). Regularization of Inverse Problems. New York: Springer Fearn, T. (1983). A misuse of ridge regression in the calibration of a near infrared reflectance instrument. Journal of Applied Statistics 32, 73-79. Ferre, L., Yao, A. F. (2003). Functional sliced inverse regression analysis. Statistics 37, 475-488. Ferre, L., Yao, A. F. (2005). Smoothed Functional Inverse Regression. Statistica Sinica 15, 665-683. Forzani, L and Cook, R. D. (2007). A note on Smoothed Functional Inverse Regression. Statistica Sinica. To appear. Groetsch, C.W. (1977). Generalized inverses of Linear Operators. New York and Basel: Dekker Inc.. Gunst, R. F. and Mason, R. L. (1980). Regression Analysis and Its Application: A Data-Oriented Approach. Technometrics 23, 136-137. Hadi, A. and Ling, R. (1998). Some cautionary notes on the use of principal components regression. American Statistician 52, 15-19. Hall, P., Horowitz, J.L. (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics 35, to appear. Jolliffe, I. (1982). A note on the use of Principal Components in Regression. Applied Statistics 31, 300-303. Jolliffe, I. (2002). Principal Components Analysis. New York: Springer, 2nd edition. Li, K. C. (1991). Sliced inverse regression for dimension reduction with discussion. Journal of the American Statistical Association 86, 316-342. Mansfield, E.R., Webster, J. T. and Gunst R.F. (1977). An Analytical Variable Selection Technique for Principal Component Regression. Applied Statistics 26, 34-40. 54

Mosteller, F. and Tukey, J. (1977). Data analysis and regression: a second course in statistics. Reading, Mass: Addison-Wesley. Ni, L and Cook, R. D. (2006). Sufficient dimension reduction in regressions across heterogeneous subpopulations. Journal of the Royal Statistical Society. Series B, statistical methodology 68, 89-107 Ramsay, J.O., Dalzell, C.J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B 53, 539-572 Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. New York: Springer. Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. New York: Springer, 2nd. edition. Ren, H. (2005). Functional Inverse Regression and Reproducing Kernel Hilbert Space. Ph.D thesis. Texas A and M University

55

Principal Components for Regression: a conditional ...

Jan 16, 2007 - internet connection) and these are almost always associated with techniques to reduce dimensions. ...... With the advance of satellite images.

349KB Sizes 1 Downloads 227 Views

Recommend Documents

Steerable Principal Components for Space-Frequency ...
December 13, 2016 ... accounting for all (infinitely many) rotations. We also mention [35] ...... As for the NFFT implementation, we used the software package [12],.

Robust Speaker Verification with Principal Pitch Components
Abstract. We are presenting a new method that improves the accuracy of text dependent speaker verification systems. The new method exploits a set of novel speech features derived from a principal component analysis of pitch synchronous voiced speech

Performance of conditional Wald tests in IV regression ...
Aug 10, 2006 - bDepartment of Economics, Harvard University, USA .... that holds l constant, weak instrument asymptotics provide a good approxima-.

CONDITIONAL MEASURES AND CONDITIONAL EXPECTATION ...
Abstract. The purpose of this paper is to give a clean formulation and proof of Rohlin's Disintegration. Theorem (Rohlin '52). Another (possible) proof can be ...

Causal Conditional Reasoning and Conditional ...
judgments of predictive likelihood leading to a relatively poor fit to the Modus .... Predictive Likelihood. Diagnostic Likelihood. Cummins' Theory. No Prediction. No Prediction. Probability Model. Causal Power (Wc). Full Diagnostic Model. Qualitativ

SCARF: A Segmental Conditional Random Field Toolkit for Speech ...
into an alternative approach to speech recognition, based from the ground up on the combination of multiple, re- dundant, heterogeneous knowledge sources [4] ...

A Hierarchical Conditional Random Field Model for Labeling and ...
the building block for the hierarchical CRF model to be in- troduced in .... In the following, we will call this CRF model the ... cluster images in a semantically meaningful way, which ..... the 2004 IEEE Computer Society Conference on Computer.

A Conditional Likelihood Ratio Test for Structural Models
4 (July, 2003), 1027–1048. A CONDITIONAL LIKELIHOOD RATIO TEST. FOR STRUCTURAL MODELS. By Marcelo J. Moreira1. This paper develops a general method for constructing exactly similar tests based on the conditional distribution of nonpivotal statistic

Towards Middleware Components for Distributed ...
May 31, 2006 - bases, web services, and messaging systems may also be included as .... 1 Underlying actuators may be realized as hardware or software- based entities. ... constraints while also accounting for variation due to network ...

BPSC-Principal-Vice-Principal-Advt.pdf
kZ dj pqd s gksa] mPprj. o srueku dh l sok@lEoxZ e .... BPSC-Principal-Vice-Principal-Advt.pdf. BPSC-Principal-Vice-Principal-Advt.pdf. Open. Extract. Open with.

Components of a Data Warehouse
for new industries, including health care, telecommunications, and electronic .... to be half DBA (database administrator) and half MBA (business analyst) as ..... is based on multidimensional database or online analytic processing (OLAP).

Pricelist for Microwave Components -
Power Splitter, Center-freq.1296 MHz, +/- 30 MHz, typ.3.1dB, use to combine two power ampl. ... Amplifiers for the 13 cm-Band 2320-2450 MHz, center frequency 2350 MHz (2304 MHz ...... Call sign.........................................................

Pricelist for Microwave Components -
Internet: http://www.kuhne-electronic.de. Reference frequency amplifier ..... GaAs FET power amplifier, 3400 - 3600 MHz, in 1,2 W, 25 Watt out Psat (CW) on request NEW! ..... 38 - PCB Speed control of KR400 Rotor series. UKW Berichte 2/99.

(EBMAL) for Regression
†Human Research and Engineering Directorate, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD USA ... ¶Brain Research Center, National Chiao-Tung University, Hsinchu, Taiwan ... Administration (NHTSA) [36], 2.5% of fatal motor vehicle ..

Conditional Marginalization for Exponential Random ...
But what can we say about marginal distributions of subgraphs? c Tom A.B. Snijders .... binomial distribution with binomial denominator N0 and probability ...

Technology Development Board Recruitment 2017 for Principal ...
Technology Development Board Recruitment 2017 for Principal Advisor (Finance).pdf. Technology Development Board Recruitment 2017 for Principal Advisor ...

A Conditional Approach to Dispositional Constructs - PDFKUL.COM
Research Support Grant BS603342 from Brown University to Jack C. Wright. We would like to thank the administration, staff, and children of Wed- iko Children's Services, whose cooperation made this research possible. We are especially grateful to Hugh