Regularizing Priors for Linear Inverse Problems∗

Jean-Pierre Florens
Toulouse School of Economics
21, allée de Brienne, 31000 Toulouse (France)
e-mail: [email protected]

and

Anna Simoni
Toulouse School of Economics
21, allée de Brienne, 31000 Toulouse (France)
e-mail: [email protected]

Abstract: We consider models described by a functional equation in a Hilbert space of the type Yˆ = Kx + U. We study estimation of the functional parameter x when the function Yˆ is observed. In several applications this problem is ill-posed because the operator K is compact, so that its inverse is not continuous on the whole space of reference and the estimator of x is inconsistent. We specify a prior distribution on x that is an extension of Zellner's g-prior and we show that, under mild assumptions, this prior is able to correct the ill-posedness even in infinite dimensional problems. The posterior distribution is then consistent in the sampling sense. Moreover, the prior-to-posterior transformation can be interpreted as a Tikhonov regularization in the Hilbert scale induced by the prior covariance operator. Finally, we propose an empirical Bayes procedure for adaptively choosing the regularization parameter, which is a hyperparameter of the prior distribution.

AMS 2000 subject classifications: Primary 62C10, 45Q05; secondary 60G15, 62G05.
Keywords and phrases: Hilbert scale, g-prior, posterior consistency, adaptive estimator.

1. Introduction

We consider the problem of estimating the solution x of the noisy functional equation

Yˆ = Kx + U,    x ∈ X, Yˆ ∈ Y,    (1.1)

∗ We thank Sébastien Van Bellegem for interesting discussions and the participants of the workshops "Semiparametric and Nonparametric Methods in Econometrics" (Banff, 2009) and "Bayesian Inference in Stochastic Processes" (Bressanone, 2009) for helpful comments.


where X and Y are infinite dimensional separable Hilbert spaces over R, supposed to be Polish, with inner product <·,·> and norm ||·||. We require that these spaces be Polish because this guarantees, even in infinite dimensional problems, the existence of a regular version of the posterior distribution, see (23). For instance, X and Y could be L² spaces, see (11). The residual U is a stochastic measurement error and K : X → Y is a known, bounded, linear operator with infinite dimensional range. The operator K∗ denotes the adjoint of K, i.e. K∗ is such that <Kϕ, ψ> = <ϕ, K∗ψ>, ∀ϕ ∈ X and ψ ∈ Y.

Very often, in inverse problems of the form (1.1), the operator K is Hilbert-Schmidt and hence compact. Compact operators are particularly suitable when they are unknown and need to be estimated, since they can be approximated by a sequence of finite dimensional operators. On the other hand, the compactness of the operator K and the infinite dimension of the space Y make the inverse K^{-1} not always defined and not always continuous on the whole of Y, so that some regularization of this inverse is required.

The model in (1.1) is classical in the inverse problem literature and it is encountered in many applications. For instance, in statistics, applications include density estimation, regression estimation and the Gaussian white noise model. In signal and image processing, examples of applications are image deblurring or extrapolation of a band- or time-limited signal. The operator K is frequently characterized by a spectrum that decreases to 0, even when it is not compact; the inverse problem (1.1) is then ill-posed. Classical regularization techniques that solve the ill-posedness include spectral cut-off regularization, Tikhonov regularization and Landweber-Fridman regularization, among others, see (15).

In this paper we focus instead on Bayesian methodologies for solving inverse problems, which propose the posterior distribution of x as the solution of (1.1), overcoming in this way the problem of ill-posedness. However, since the dimension of the problem is infinite, the posterior distribution is in general not well-defined, in the sense that it is not consistent in the frequentist sense, that is, with respect to the sampling distribution; see (3) for frequentist inconsistency in nonparametric Bayes estimation. To solve this problem of inconsistency, we proposed in (6) to regularize the posterior distribution and we defined a new object, called the regularized posterior distribution, that plays the role of the posterior distribution. In (17) and (21) it is proposed to regularize through a restriction of the space of definition of Yˆ. Therefore, in infinite dimensional problems, the prior distribution does not have the regularization power that it usually has in finite dimensional problems; see for instance the example of ridge regression and its Bayesian interpretation, (13).

The main result of this paper is to provide the assumptions that a signal-noise model and the prior distribution must satisfy in order for the regularization to be automatically performed by the prior-to-posterior transformation, and to prove that the posterior distribution obtained in this way is consistent in the frequentist sense. Therefore, no ad-hoc regularization needs to be introduced.
The prior distribution that we specify depends on the regularization parameter and on the degree of penalization chosen for measuring the variability of the solution (as, for instance, the highest order of derivatives in a Sobolev penalization).


In the next subsections, the prior and sampling distributions associated to model (1.1) are specified to be gaussian. Gaussian processes for functional estimation and frequentist properties of the posterior distribution have been extensively studied in (25), (26), (28). In Section 2 we construct the posterior distribution of x, we interpret it as a classical regularization scheme and we compute its rate of contraction. In Section 3, we analyze two cases that are particularly useful in real applications. We provide an adaptive method for optimally choosing the hyperparameter of the prior covariance operator in Section 4. All the proofs are in Section 5.

1.1. Sampling distribution and examples

Let the measurement error U in model (1.1) induce a gaussian process (GP in the following) on Y. This implies that the sampling distribution of Yˆ is gaussian:

Yˆ|x ∼ GP(Kx, δΣ)    (1.2)

with δ = δ(n) a function of the sample size n such that δ → 0 as n → ∞. We principally have in mind statistical and econometric applications of model (1.1) where Yˆ is a function obtained as a transformation of an n-sample of finite dimensional objects (like the empirical cumulative distribution function, the empirical characteristic function, or the Nadaraya-Watson estimator of the regression function), or models where Yˆ is the mean of functional data (like many examples in signal and image processing). For general inverse problems, we may interpret δ as an increasing function of the measurement error in Yˆ.
The covariance operator Σ : Y → Y is assumed to be a fixed and given operator. It follows that it is linear, bounded, nonnegative, self-adjoint, compact and trace-class. We stress the fact that the trace-class property excludes a covariance operator proportional to the identity operator. This is a very important fact since it implies that, in general, the posterior mean cannot be interpreted as the outcome of a Tikhonov regularization (i.e. ridge regression) as in the finite-dimensional case.

Example 1. Inverse problems in image science. Suppose that we observe n curves Y˜_i independently generated from the model

Y˜_i = Kx + U_i,    U_i ∼ GP(0, Σ).

The empirical mean Yˆ = (1/n) ∑_i Y˜_i is then a sufficient statistic for doing inference on x. We compute Yˆ and rewrite Yˆ = Kx + U, with U = (1/n) ∑_i U_i and Var(U) = (1/n) Σ. This example is often encountered in image science. The covariance operator is usually known in this field, but there exist cases (like the following example) where Σ is unknown. In this situation we can either estimate Σ in a frequentist way and use the asymptotic theory developed in (7), or specify an Inverse-Wishart prior distribution on the covariance operator and develop a fully Bayesian estimation procedure (this point is in our research agenda).


Example 2. Density estimation. Let X = Y = L²([0, 1]) be the space of square integrable functions on [0, 1], integrable with respect to the uniform distribution. We suppose that we observe an i.i.d. sample (ξ_1, ..., ξ_n) drawn from a distribution F admitting a density f(ξ). The function f(ξ) ∈ X is characterized as the solution of the inverse problem

Fˆ_n(ξ̄) = ∫_0^1 f(u) 1{u ≤ ξ̄} du + U,

with Fˆ_n(ξ̄) = (1/n) ∑_{i=1}^n 1{ξ_i ≤ ξ̄}. The sampling distribution is inferred from the asymptotic properties of the empirical distribution function, so that it is asymptotically a Gaussian measure with mean F and covariance operator Σ_n = (1/n) ∫ [F(t_j ∧ t_l) − F(t_j)F(t_l)] · dt_j. Even if the error term is only asymptotically gaussian, the estimation method that we propose in this paper remains valid, since gaussianity is only used to construct the estimator and not to prove our result of consistency of the Bayes estimator.

Example 3. Deconvolution. Let (X, Y, Z) be a random vector in R³ such that Y = X + Z, with X independent of Z, and let ϕ(·), f(·), g(·) be the marginal density functions of X, Y and Z, respectively. The density f(y) is defined to be the convolution of ϕ and g:

f(y) = (ϕ ∗ g)(y) := ∫ ϕ(x) g(y − x) dx.

The density g(·) is usually supposed to be known, x is not observable, f(·) is estimated nonparametrically and the object we want to recover is the density ϕ(·). The corresponding statistical model is

fˆ(y) = Kϕ(y) + U,

where K· = ∫ · g(y − x) dx is a non-compact operator for general Hilbert spaces X and Y. However, as shown in (1), these spaces of reference can be constructed in such a way that K becomes compact. The sampling distribution must be inferred from the asymptotic properties of U. The nonparametric estimation of f(y) prevents U from converging weakly towards a Gaussian process. To solve this problem it is necessary to transform the model through an operator able to smooth the trajectories; for instance, it could be an integral operator A = ∫ a(y, t) · dy between Hilbert spaces.

1.2. Prior measure and main assumptions

In the following, we introduce three assumptions. Some of these assumptions, like Assumptions 1 (a) and 3 (a) below, are particular and not very common in the inverse problems literature. However, they are necessary in order for the prior


distribution to be able to regularize and for the resulting posterior distribution to be consistent. The positive results of this paper hold under these assumptions; when these assumptions are not verified, we are in the general case of posterior inconsistency, for which a regularized posterior distribution has been proposed in (6).
Let R(·) denote the range of an operator and D(·) its domain. We make the following assumption:

Assumption 1.
(a) R(K) ⊂ D(Σ^{−1/2});
(b) there exists an unbounded, densely defined operator L that is self-adjoint and positive, such that ||L^{−a} x|| ∼ ||Σ^{−1/2} K x|| and L^{−1} is trace-class.

Part (a) of Assumption 1 ensures that the operator Σ^{−1/2}K, used in part (b) of the assumption, is well-defined; it amounts to demanding a compatibility between the sampling covariance operator Σ and the operator K in the sampling mechanism. This is very common in practical examples where the covariance operator is of the form Σ = (KK∗)^r, for some r ≤ 1. We develop this particular case in Section 3.
For all s ∈ R, the operator L in Assumption 1 (b) induces the Hilbert scale (X_s)_{s∈R}, where X_s is a Hilbert space defined as the completion of ∩_{s∈R} D(L^s) with respect to the norm ||x||_s := ||L^s x||, see (14), (4), (22). The parameter a is the degree of ill-posedness in the statistical experiment. It is usually different from the degree of ill-posedness in the classical problem Yˆ = Kx, since it is determined by the rate of decrease of the spectrum of the operator Σ^{−1/2}K and not by that of K. Therefore, in the Bayesian framework there is less ill-posedness than in the classical framework for inverse problems.
We assume that the functional parameter of interest x is characterized by the following gaussian distribution:

x|g, s ∼ GP( x0, (1/g) L^{−2s} ),    (1.3)

with x0 ∈ X and g = g(n) a function of n such that g → ∞ with n. The remark we made for δ is also valid for g, so that g can be interpreted as a decreasing function of the measurement error in Yˆ. The two conditioning parameters g and s are for the moment treated as fixed. In Section 4 we partially relax this assumption and treat g as a hyperparameter. The operator L^{−2s} plays the role of the prior covariance operator; following the notation in (6), Ω0 = L^{−2s}, where Ω0 : X → X is a covariance operator that is linear, bounded, nonnegative, self-adjoint, compact and trace-class. This choice of the prior covariance aims to link the prior distribution with the operator K and the sampling model. Such a link is evident from Assumption 1 (b) and it is a natural idea in linear regression models, see for instance Zellner's g-prior (29). Our prior (1.3) is an extension of Zellner's g-prior, so we call it the extended g-prior.
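To make the extended g-prior concrete, the following minimal Python sketch (our own illustration, not from the paper) simulates draws from x|g, s ∼ GP(x0, (1/g)L^{−2s}), assuming that L is diagonalized by a cosine basis of L²[0, 1] with eigenvalues j; the basis, truncation level and grid are hypothetical choices made only to obtain a runnable example.

```python
import numpy as np

# Illustrative sketch (not from the paper): simulate from the extended g-prior
# x | g, s ~ GP(x0, (1/g) L^{-2s}), assuming L is diagonalized by a cosine
# basis of L^2[0,1] with eigenvalues j.
def draw_extended_g_prior(x0, g, s, n_basis=200, seed=None):
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, x0.shape[0])
    j = np.arange(1, n_basis + 1)
    sd = j**(-s) / np.sqrt(g)                                  # sqrt of eigenvalues of (1/g) L^{-2s}
    coef = sd * rng.standard_normal(n_basis)
    basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(j, grid))   # cosine basis on [0, 1]
    return x0 + coef @ basis

# Example: with s = 1.5 (so that 2s > 1 and L^{-2s} is trace-class) and a large g,
# the draws are smooth and concentrate around x0.
x_draw = draw_extended_g_prior(x0=np.zeros(201), g=50.0, s=1.5, seed=0)
```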


The marginal distribution on the sample, obtained by integrating out x with respect to the prior distribution, is

Yˆ|g, s ∼ GP( Kx0, δΣ + (1/g) KΩ0K∗ ).

From a frequentist point of view, there exists a true value of the parameter of interest x having generated the data Yˆ. We denote this value by x∗; it will be used in the asymptotic analysis since we are interested in the weak convergence of the posterior distribution of x towards a point mass at x∗ as n → ∞. This is a convergence with respect to the sampling probability and it is known as posterior consistency, see (9).
Let {λ_j^{Ω0}, ϕ_j^{Ω0}} be the eigensystem associated to Ω0. We introduce a regularity assumption about the deviation of the true value x∗ from x0.

Assumption 2. For some β ≥ s, we assume that (x∗ − x0) ∈ X_β, i.e. there exists a ρ∗ ∈ X such that (x∗ − x0) = L^{−β} ρ∗ (≡ Ω0^{β/(2s)} ρ∗).

Because β ≥ s, it follows that R(Ω0^{β/(2s)}) ⊂ R(Ω0^{1/2}) and Assumption 2 implies that there exists a ξ∗ such that (x∗ − x0) = Ω0^{1/2} ξ∗, with ξ∗ = Ω0^{(β−s)/(2s)} ρ∗. Moreover, by Proposition 3.6 in (2), we can write R(Ω0^{1/2}) = H(Ω0), where H(Ω0) denotes the Reproducing Kernel Hilbert Space (R.K.H.S. in the following) associated to Ω0 and embedded in X, i.e.

H(Ω0) = { ϕ : ϕ ∈ X and ||ϕ||_{Ω0} := ∑_{j=1}^∞ |<ϕ, ϕ_j^{Ω0}>|² / λ_j^{Ω0} < ∞ }.

Hence, Assumption 2 implies that (x∗ − x0) ∈ H(Ω0). The R.K.H.S. is a subset of X that gives the geometry of the distribution of x. The support of a centered Gaussian process taking its values in a Hilbert space X is the closure in X of the R.K.H.S. associated with the covariance operator of this process (denoted H(Ω0) in our case). Then, under our prior distribution, (x − x0) belongs to the closure of H(Ω0) with probability 1, but (x − x0) is not in H(Ω0) with probability 1, see (26). More properties of the R.K.H.S. associated to a gaussian measure can be found in (27).
The parameter β refers to the regularity of the true function x∗ that we want to estimate. For instance, if the Hilbert scale X_β is given by the Sobolev spaces, then Assumption 2 is equivalent to assuming that x∗ − x0 has at least β square integrable derivatives, see (12).
Hereafter, we use the notation α = δg, B = Σ^{−1/2} K Ω0^{1/2}, B̃ = Σ^{−1/2} K. The operator B̃ is well defined under Assumption 1 (a). A further assumption needs to be introduced in order for the operator B to be well-defined.

Assumption 3.
(a) R(KΩ0^{1/2}) ⊂ D(Σ^{−1});


(b) a, s and β are three real parameters satisfying the inequalities 0 < a ≤ s ≤ β ≤ 2s + a;
(c) there exists γ ∈ ]0, 1] such that the operator (B∗B)^γ is trace class, i.e. if {λ_j²} denotes the eigenvalues of B∗B, then ∑_j λ_j^{2γ} < ∞ must be verified.

Under Assumption 3 (a), R(KΩ0^{1/2}) ⊂ D(Σ^{−1}) and, since D(Σ^{−1}) ⊂ D(Σ^{−1/2}), the operator B is well-defined. This assumption concerns the degree of regularity (i.e. the differentiability) of the prior covariance operator with respect to the sampling covariance operator. The restriction s ≤ β in Assumption 3 (b) means that x∗ − x0 has to be at least an element of X_s and it guarantees that the norm ||L^s x|| exists ∀x ∈ X_β. The upper bound (2s + a) on β is the qualification of the regularization scheme: it says that we can at most exploit a regularity of x∗ equal to (2s + a). The last assumption will be exploited for computing the speed of convergence of the posterior distribution. When γ = 1, Assumption 3 (c) is the classical Hilbert-Schmidt assumption on the operator Σ^{−1/2}KΩ0^{1/2}. For γ < 1 this assumption is more demanding.
The parameter α will be used as the index for the family of posterior distributions; it plays the role of a regularization parameter and it is linked to the error δ in the observations. It must satisfy the two classical properties required for a regularization parameter: α → 0 and, for some d > 1, α^d n → ∞ as n → ∞. If δ ∝ 1/n, this implies that g/n ∼ o_p(1) and g n^{−(d−1)/d} → ∞, i.e. g must increase faster than n^{(d−1)/d} but slower than n. The larger d is, the more demanding this condition. In particular, for d = 2, g must increase faster than √n.
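As a quick, purely numerical illustration of these conditions (our own check, not part of the paper), take δ = 1/n and g = n^κ with (d−1)/d < κ < 1; then α = δg → 0 while α^d n → ∞:

```python
# Purely numerical check (illustrative): with delta = 1/n and g = n**kappa,
# alpha = delta * g = n**(kappa - 1) -> 0, while alpha**d * n -> infinity as soon
# as kappa > (d - 1)/d; for d = 2 this means g growing faster than sqrt(n).
d, kappa = 2, 0.7
for n in [1e2, 1e4, 1e6, 1e8]:
    delta = 1.0 / n
    g = n**kappa
    alpha = delta * g
    print(f"n = {n:.0e}:  alpha = {alpha:.2e},  alpha^d * n = {alpha**d * n:.2e}")
```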

2. Main results

The Bayesian solution of the inverse problem (1.1) is the posterior distribution of x, denoted by µ^Y, see (5). As shown in (6) and (21), µ^Y is a conditional probability on X that exists and is gaussian. It has mean function A(Yˆ − Kx0) + x0 and covariance operator (1/g)[Ω0 − AKΩ0], where A : Y → X is an operator such that its adjoint is defined as the solution of the functional equation:

( δΣ + (1/g) KΩ0K∗ ) A∗ ϕ = (1/g) KΩ0 ϕ,    ∀ϕ ∈ X.    (2.1)

This equation can be transformed through the following steps:

(αΣ + KΩ0K∗) A∗ = KΩ0
⇔ Σ^{1/2}( αI + Σ^{−1/2} KΩ0K∗ Σ^{−1/2} ) Σ^{1/2} A∗ = KΩ0
⇔ (αI + BB∗) Σ^{1/2} A∗ = B Ω0^{1/2}
   Σ^{1/2} A∗ = (αI + BB∗)^{−1} B Ω0^{1/2}.


With a slight abuse of notation (the identity operator I in the lines above and the one in the lines below, although written in the same way, are not the same operator because they act on two different spaces), we get

⇔ Σ^{1/2} A∗ = B(αI + B∗B)^{−1} Ω0^{1/2}
   A∗ = Σ^{−1/2} B(αI + B∗B)^{−1} Ω0^{1/2}.

Hence, A∗ is well-defined under Assumption 3 (a) since R(KΩ0^{1/2}) ⊂ D(Σ^{−1}). We have therefore obtained the following expression for the operator A defining the posterior distribution:

A = Ω0^{1/2} (αI + B∗B)^{−1} (Σ^{−1/2}B)∗.    (2.2)

A is continuous and defined everywhere. In general, it is not guaranteed that the inverse of the operator B∗B exists: if B∗B is compact, its eigenvalues are countable and accumulate only at zero, so that (B∗B)^{−1} explodes. However, this possible problem is solved by the presence of the operator αI, which translates the eigenvalues sufficiently far from zero or, equivalently, extends the range of B∗B to the whole space X. In other words, when Assumption 3 holds, the prior-to-posterior transformation is equivalent to applying a Tikhonov regularization scheme to the inverse of B∗B, i.e. to regularizing the solution of the equation Bϕ = r, with ϕ ∈ X and r ∈ Y.

Remark 2.1. The construction of the posterior mean can be interpreted as a regularization in the Hilbert scale induced by L^s. Take for simplicity x0 = 0; then

E(x|Yˆ, g, s) = A Yˆ
             = L^{−s}( αI + L^{−s} K∗ Σ^{−1} K L^{−s} )^{−1} L^{−s} K∗ Σ^{−1/2} Σ^{−1/2} Yˆ
             = ( αL^{2s} + B̃∗B̃ )^{−1} B̃∗ Σ^{−1/2} Yˆ.

This quantity is the regularization, in the prior variance Hilbert scale (i.e. the Hilbert scale induced by the prior covariance operator), of the classical solution of the model

Σ^{−1/2} Yˆ = B̃ x + Σ^{−1/2} U = B ξ∗ + Σ^{−1/2} U,

where the last equality is obtained by exploiting the assumption that x∗ ∈ H(Ω0), so that x∗ = Ω0^{1/2} ξ∗. This model is the transformation of (1.1) through the operator Σ^{−1/2}. We remark that there is no reason why the quantities Σ^{−1/2}Yˆ and Σ^{−1/2}U should exist, so that this model is per se incorrect, but it is useful for interpreting the prior-to-posterior transformation as a Hilbert scale regularization.
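The following Python sketch (an illustration under assumed, toy choices of K, Σ and L, not the authors' implementation) computes the posterior mean in its Hilbert-scale form x0 + (αL^{2s} + K∗Σ^{−1}K)^{−1} K∗Σ^{−1}(Yˆ − Kx0) on a discretized problem.

```python
import numpy as np

# Finite-dimensional sketch (illustration only): on a grid, the posterior mean
# x0 + A(Y - K x0) of Remark 2.1 becomes the Tikhonov solution in the Hilbert
# scale induced by L:
#   x0 + (alpha * L^{2s} + K' Sigma^{-1} K)^{-1} K' Sigma^{-1} (Y - K x0).
# The choices of K (an integration operator), Sigma = KK' (r = 1) and
# L^{2s} = (K'K)^{-1} (s = 1) are assumptions made only to get a runnable example.
def posterior_mean(Y, K, Sigma, L2s, x0, alpha):
    Si = np.linalg.inv(Sigma)
    lhs = alpha * L2s + K.T @ Si @ K
    rhs = K.T @ Si @ (Y - K @ x0)
    return x0 + np.linalg.solve(lhs, rhs)

m = 100
t = np.linspace(0.0, 1.0, m)
K = np.tril(np.ones((m, m))) / m            # (K x)(t) ~ integral of x on [0, t]
Sigma = K @ K.T                             # assumed sampling covariance (r = 1)
L2s = np.linalg.inv(K.T @ K)                # assumed prior scale, L^{2s} = (K'K)^{-s}, s = 1
rng = np.random.default_rng(0)
x_true = np.sin(2.0 * np.pi * t)
Y = K @ x_true + 0.05 * rng.multivariate_normal(np.zeros(m), Sigma)
x_hat = posterior_mean(Y, K, Sigma, L2s, x0=np.zeros(m), alpha=1e-4)
```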


Remark 2.2. In the specification of the prior distribution we may wish to stay as general as possible by choosing a prior variance of the form (1/g)Ω0 = (1/g) Q L^{−2s} Q∗, for some bounded operator Q not necessarily compact. The previous case is then a particular case of this one for Q = I. The operator A takes the form

A = Q L^{−s} (αI + B_Q∗ B_Q)^{−1} (Σ^{−1/2} B_Q)∗,

for B_Q = Σ^{−1/2} K Q L^{−s}. Hence, L^s is the Hilbert scale for Σ^{−1/2}KQ and Assumption 1 (a) is replaced by the weaker assumption R(KQ) ⊂ D(Σ^{−1/2}). Moreover, the operator B_Q is well-defined if R(KQL^{−s}) ⊂ D(Σ^{−1}), which is also less demanding than Assumption 3 (a). In order to obtain the same rate of convergence of the posterior distribution we have to replace Assumption 2 with the assumption that there exists an element ξ̃∗ ∈ R(L^{−(β−s)}) such that (x∗ − x0) = Q L^{−s} ξ̃∗.

2.1. Asymptotic analysis

The posterior distribution µ^Y is a function of the sample size n, or equivalently of the measurement error in Yˆ, i.e. of the level of information we have from the data. We stress this fact by adding the index n in the notation: µ_n^Y. The study of the asymptotic behavior of µ_n^Y allows us to judge the quality of the posterior distribution for constructing estimators. We perform the asymptotic analysis from a frequentist perspective, so that if posterior consistency is verified, the estimators constructed from µ_n^Y can be used by both Bayesian and classical statisticians. We say that the posterior distribution is consistent at x∗ with respect to the sampling probability if it weakly converges towards the Dirac measure δ_{x∗} concentrated at x∗, i.e. if for every bounded and continuous functional a : X → X

|| ∫_X a(x) µ_n^Y(dx) − ∫_X a(x) δ_{x∗}(dx) || → 0

as n → ∞, see (3), (9). In this subsection we study posterior consistency and we recover an upper bound on the speed of convergence. We are able to avoid the posterior inconsistency that is stressed in (3) as a typical aspect of Bayesian nonparametric estimation. For the gaussian signal-noise problem (1.1), the study of posterior consistency reduces to the study of the consistency of the posterior mean and the convergence to zero of the posterior variance. Posterior consistency is then proved by using a Chebyshev inequality, see (6). We start by analyzing the posterior mean. The following theorem gives its asymptotic behavior.


Theorem 2.1. Consider the probability specification in (1.2) and (1.3) and let α = δg. Under Assumptions 1, 2 and 3 the posterior mean of x is consistent in the sense that ||E(x|Yˆ, g, s) − x∗||² converges to zero with respect to the sampling probability. The associated MISE is of order

E_{x∗} ||E(x|Yˆ, g, s) − x∗||² ∼ O_p( α^{β/(a+s)} + δ α^{−(γ(a+s)+a)/(a+s)} ),

where E_{x∗} denotes the expectation taken with respect to the sampling distribution. Moreover, if α = c1 δ^{(a+s)/(β+a+γ(a+s))}, for some constant c1,

δ^{−β/(β+a+γ(a+s))} E_{x∗} ||E(x|Yˆ) − x∗||² ∼ O_p(1).

A proof of Theorem 2.1 is given in Section 5.1.
The value α = c1 δ^{(a+s)/(β+a+γ(a+s))}, given in the last part of the theorem, is the optimal one in the sense that it gives the fastest rate of convergence of the posterior mean and of its risk. We denote this value by α∗. When g is treated as fixed and not as a hyperparameter, the optimal value g∗ for g is obtained through the relationship α = δg:

g∗ ∝ α∗ δ^{−1} = c2 δ^{−(β−s+γ(a+s))/(β+a(1+γ)+sγ)},

with c2 some constant. The requirement that g must go to infinity slower than n is satisfied if −a < s, which is always true under Assumption 3 (b). In addition, in order for g to converge to +∞ faster than n^{(d−1)/d}, we demand that β > ds + (d−1)a − γ(a+s), which is compatible with Assumption 3 (b) for values of d such that ds + (d−1)a − γ(a+s) < 2s + a.
In general, the rate of convergence of the classical solution of an inverse problem is δ^{β/(β+ã)}, where ã denotes the degree of ill-posedness in a classical inverse problem. The rate of convergence that we obtain for our Bayes estimator is not directly comparable to this one, and hence it is not necessarily slower, as it might seem. In fact, the degrees of ill-posedness a and ã in the denominators are not the same. Take for instance a sampling covariance operator Σ equal to L^{−2m}, for some m > 0. Hence, under the Hilbert scale assumption (for which L defines the Hilbert scale for K in the classical problem and for Σ^{−1/2}K in the Bayes problem), ã = a + m and the ill-posedness is larger in the classical inverse problem than in the Bayesian one.
The asymptotic behavior of the posterior variance is given in the following theorem. The rate given is the rate of the MISE of the posterior variance.

Theorem 2.2. Consider the probability specification in (1.2) and (1.3). Under Assumptions 1 and 3 the posterior variance of x converges to zero in X-norm with respect to the sampling probability: ||Var(x|Yˆ, g, s)φ|| → 0, ∀φ ∈ X. For every φ ∈ X such that φ ∈ R(Ω0^{β/(2s)}),

||Var(x|Yˆ, g, s)φ||² ∼ O_p( (1/g²) α^{(β+2s)/(a+s)} ).

A proof of Theorem 2.2 is given in Section 5.2.
When the optimal α∗ and g∗ are used, the MISE of the posterior variance converges at the speed δ^{(3β+2γ(a+s))/(β+a+γ(a+s))}. The rate of ||Var(x|Yˆ, g, s)φ|| is faster than the optimal rate at which the posterior mean converges towards the true x∗ in squared norm. A simple Chebyshev inequality allows us to conclude that convergence of the posterior distribution towards the point mass δ_{x∗} is implied by the convergence of the posterior mean towards x∗ and of the posterior variance to 0 in norm, see (6). Since the rate of contraction of ||Var(x|Yˆ, g, s)φ|| is fast enough, it does not affect the rate of contraction of the posterior distribution, which is completely determined by the rate of the MISE associated to the posterior mean.
Instead of demanding that the function φ, at which the posterior covariance operator is applied, belong to R(Ω0^{β/(2s)}), we could impose a weaker condition. For instance, we could require that φ ∈ X be such that Ω0^{1/2} φ ∈ R(Ω0^{(β−s)/(2s)}); then

||Var(x|Yˆ, g, s)φ||² ∼ O_p( (1/g²) α^{β/(a+s)} )

and, by replacing g and α with their optimal values, we would obtain ||Var(x|Yˆ, g, s)φ||² ∝ δ^{(3β−2s+2γ(a+s))/(β+a+γ(a+s))}. The price to pay for demanding a weaker condition is that the rate of convergence is slower than in the previous case and that the variance term does not affect the rate of contraction of the posterior distribution only if β + 2γ(a+s) ≥ 2s. This condition is trivially verified if β ≥ 2s; in general it is verified ∀γ ∈ [(2s−β)/(2(a+s)), 1].

3. Particular cases

In Section 3.1 we consider geometric spectra for the operators K, Σ, L and in Section 3.2 we consider models where Σ and L are proportional to the operator K of the sampling mechanism.

3.1. Operators with geometric spectra

We address the case where the spectra of the operators K, Σ, L in the Bayes experiment are geometric. For some constants a0, c0 ∈ R+ we denote by j^{−a0} the eigenvalues of K, by j^{−c0} the eigenvalues of Σ and by j the eigenvalues of L. The corresponding eigenfunctions ϕ_j are trigonometric functions, while the j-th Fourier coefficient of x is of order j^{−b0}, for some b0 ∈ R+.
Next, we translate Assumptions 1, 2 and 3 into conditions on the parameters a0, b0 and c0. Assumption 1 (a) implies that a0 ≥ c0/2, which means that the operator K must be smoother than Σ^{1/2}, or in other words, that the spectrum


of K must decrease faster than that of Σ^{1/2}. Assumption 1 (b) demands that Σ^{−1/2}K and L^{−a} have the same norm and hence the same eigenvalues; this is true when a = a0 − c0/2. Moreover, in order for the prior covariance Ω0 = L^{−2s} to be trace-class, we must have 2s > 1.
The regularity assumption on the true value x∗ requires that, for β ≥ s, x∗ ∈ R(L^{−β}). This is equivalent to saying that

||L^β x∗||² = ∑_{j=1}^∞ j^{2β} <x∗, ϕ_j>² < ∞,

and since <x∗, ϕ_j> ∼ j^{−b0}, the norm ||L^β x∗||² is finite only if 2b0 − 2β > 1. By abuse of notation we write β = b0 − 1/2; in reality this equality holds up to a small additive term ǫ > 0.
Assumption 3 (a) implies that Σ^{−1}KΩ0^{1/2} is well-defined, which translates into the condition a0 + s − c0 ≥ 0. This condition is implied by the previously found condition a0 ≥ c0/2 together with the condition s ≥ c0/2. Assumption 3 (b), and in particular the strict inequality a > 0, requires the strict inequality a0 > c0/2. Lastly, Assumption 3 (c) demands that ∑_j j^{−2γ(a0+s)+c0γ} be finite, which is satisfied if γ > 1/(2(a0+s)−c0), i.e. for some ε > 0, γ = 1/(2(a0+s)−c0) + ε. Summarizing, we have

a0 > c0/2,    (3.1)
a = a0 − c0/2,    (3.2)
β = b0 − 1/2,    (3.3)
s ≥ c0/2,    (3.4)
γ = 1/(2(a0 + s) − c0).    (3.5)

Therefore, the rate of convergence of the MISE of the posterior mean (and of the posterior distribution) obtained in Theorem 2.1 becomes

n^{−(2b0−1)/(2b0+2a0−c0)}.

This rate makes explicit the degree of ill-posedness a of the Bayesian problem in terms of the decreasing rate of the spectra of K and Σ^{1/2}. The posterior rate is then decreasing in the ill-posedness (a0 − c0/2).

3.2. Covariance operators proportional to K

We consider the particular case where L is chosen to be the canonical Hilbert scale L = (K∗K)^{−1/2}, i.e. L is chosen according to the sampling model, and where, for some r, s, σ² ∈ R+,

δ = σ²/n,    Σ = (KK∗)^r,    Ω0 = (K∗K)^s.


In this case the operator K is required to be Hilbert-Schmidt in order to guarantee that the covariance operators are compact and trace-class. This example is motivated by a class of applications that can be generalized in the following way. Let M be an operator from X to a finite dimensional space R^k and let the inverse problem defining x be

y = Mx + u,    Var(u) = σ² I_k,

with y, u ∈ R^k. We can think of it as a discretization of an infinite dimensional model, or as a model in R, with x of infinite dimension, for which we have a vector of k discrete observations. Hence, it is usual to transform this model into a functional model by using the adjoint operator M∗ : R^k → X. Then, Yˆ = Kx + U, with Yˆ = M∗y, K = M∗M and U = M∗u. The operator K is self-adjoint and the sampling covariance is Var(U) = σ²K = σ²(KK)^{1/2}, which is not invertible since it has only k eigenvalues different from 0. The philosophy of this example is close to that of Support Vector Machines. This particular case may also be motivated by the example of instrumental variable estimation in econometrics, where M would be the conditional expectation operator and x the instrumental regression function. Then,

Yˆ|x ∼ GP( Kx, (σ²/n)(KK∗)^r ),
x|g, s ∼ GP( x0, (σ²/g)(K∗K)^s ),
Yˆ|g, s ∼ GP( Kx0, σ²[ (1/n)(KK∗)^r + (1/g) K(K∗K)^s K∗ ] ).    (3.6)

The prior distribution is in the extended Zellner's g-prior form, but when s = 1 we exactly have the Zellner's g-prior. In this case, Assumptions 1 (a) and (b) hold for r ≤ 1 and a = 1 − r, respectively. Assumption 3 (a) holds for s ≥ 2r − 1. Hence, we replace Assumptions 1 and 3 by

Assumption 4.
(a) a, s and β are three real parameters satisfying the inequalities 0 < a ≤ s ≤ β ≤ 2s + a;
(b) r ≤ 1 and s ≥ 2r − 1;
(c) a = 1 − r, so that ||(K∗K)^{a/2} x|| = ||(KK∗)^{−r/2} K x||;
(d) there exists a γ ∈ ]0, 1] such that the operator (B∗B)^γ is trace class, i.e. if {λ_j²} denotes the eigenvalues of B∗B, then ∑_j λ_j^{2γ} < ∞.

Assumption 2 remains valid. It should be noticed that, if (κ_j) denote the eigenvalues of K, then λ_j² = κ_j^{2(s+a)} under Assumption 4 (c). The results of Theorems 2.1 and 2.2 trivially apply to this particular case.
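For this particular case, the posterior mean can be computed componentwise in the singular system of K. The sketch below is an illustrative implementation (ignoring the σ² scaling in (3.6)); the spectral filter κ_j^{2(s+a)−1}/(α + κ_j^{2(s+a)}), with a = 1 − r, follows from plugging Σ = (KK∗)^r and Ω0 = (K∗K)^s into (2.2), and the toy operator K is an assumption made only to have a runnable example.

```python
import numpy as np

# Illustrative sketch: with Sigma = (KK*)^r and Omega_0 = (K*K)^s, all operators
# share the singular system of K, so A in (2.2) acts componentwise.  With
# singular values kappa_j and a = 1 - r, the coefficient of the posterior mean
# (x0 = 0) on the j-th right singular vector is
#   kappa_j^{2(s+a)-1} <Y, u_j> / (alpha + kappa_j^{2(s+a)}).
def posterior_mean_svd(Y, K, r, s, alpha):
    U, kappa, Vt = np.linalg.svd(K, full_matrices=False)
    a = 1.0 - r
    filt = kappa**(2.0 * (s + a) - 1.0) / (alpha + kappa**(2.0 * (s + a)))
    return Vt.T @ (filt * (U.T @ Y))

# toy usage with an assumed integration-type operator K
m = 80
t = np.linspace(0.0, 1.0, m)
K = np.tril(np.ones((m, m))) / m
rng = np.random.default_rng(1)
Y = K @ np.cos(np.pi * t) + 1e-3 * rng.standard_normal(m)
x_hat = posterior_mean_svd(Y, K, r=0.5, s=1.0, alpha=1e-4)
```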


Remark 3.2.1. An example of application of this particular case is the functional regression estimation studied in (10). The operator K would then be the covariance operator of the functional covariates and r = 1/2. Suppose that the spectrum of K has a geometric decline rate, as in Subsection 3.1, and that it is the same as in (10); then j^{−a0} = σ² j^{−c0} and a0 = c0. The rate of convergence of the MISE of the posterior mean is

n^{−(2b0−1)/(2b0+a0)},

which is the same as the minimax rate in (10). Therefore, when the eigenvalues have a geometric rate of decay, our rate is optimal in the minimax sense. Furthermore, our rate is minimax not only in the framework of functional regression estimation, as found in (10), but for general inverse problems.

4. An adaptive selection of g through an empirical Bayes approach

In the preceding sections we have treated the parameter g in the prior distribution as a fixed parameter which we have to choose optimally in order to get the right rate of contraction of the posterior distribution. Now, we want to consider g as a hyperparameter and to express our degree of ignorance about it through a prior distribution ν on g. We then propose an adaptive method, based on an empirical Bayes approach, for selecting g when it is unknown. Adaptive methods for choosing regularization parameters in a classical framework have been widely discussed, see for instance (18) and (19).
The distributional scheme is the following:

g ∼ ν,
x|g ∼ µ^g,
Yˆ|x, g ∼ P^x.

The indices g and x mean that the prior and the sampling distributions are conditioned on g and x, respectively. Hence, implicitly we are saying that, conditionally on x, Yˆ is independent of g, in symbols Yˆ ⊥ g | x. The specification of P^x and µ^g remains as in (1.2) and (1.3), respectively, i.e. P^x ∼ GP(Kx, δΣ) and µ^g ∼ GP(x0, (1/g) L^{−2s}). We integrate the sampling distribution P^x with respect to µ^g and we get the marginal distribution P^g, conditional on g. Therefore, we use the following model to recover a posterior estimator for g:

g ∼ ν,
Yˆ|g ∼ P^g,

with P^g ∼ GP(Kx0, δΣ + (1/g) KΩ0K∗). A result of Kuo (1975) (16) shows that it is possible to define a density for P^g with respect to another measure different


from the Lebesgue measure. We restate this result by applying it to our case in the following theorem.

Theorem 4.1. Let P^g be a gaussian measure on Y with mean Kx0 and covariance operator S2 = δΣ + (1/g) KΩ0K∗, and let P^∞ be another gaussian measure on the same space with mean Kx0 and covariance operator S1 = δΣ. If there exists a positive definite, bounded, invertible operator T such that S2 = S1^{1/2} T S1^{1/2} and T − I is Hilbert-Schmidt, then P^g is equivalent to P^∞. Moreover, the Radon-Nikodym derivative is given by

dP^g/dP^∞ = ∏_{j=1}^∞ √( α/(λ_j² + α) ) exp( λ_j² z_j² / (2(λ_j² + α)) ),    (4.1)

with λ_j²/α the eigenvalues of T − I, z_j² = <Yˆ − Kx0, ϕ_j>² / (δ l_j²), and {l_j², ϕ_j} the eigensystem associated to Σ.

We refer to (16) for a proof of this theorem. It is possible to notice that

δΣ + (1/g) KΩ0K∗ = √δ Σ^{1/2} [ I + (1/(g√δ)) Σ^{−1/2} KΩ0K∗ Σ^{−1/2} (1/√δ) ] Σ^{1/2} √δ,

so that T = I + (1/(gδ)) Σ^{−1/2} KΩ0K∗ Σ^{−1/2} = I + (1/α) Σ^{−1/2} KΩ0K∗ Σ^{−1/2}. All the properties of T required in the theorem are trivially satisfied. Assumption 3 (c) guarantees that T − I is Hilbert-Schmidt, since it guarantees that ∑_j λ_j² < ∞, which implies that ∑_j λ_j^4 < ∞, where {λ_j²} are the eigenvalues of Σ^{−1/2} KΩ0K∗ Σ^{−1/2}.
The density in (4.1) has been expressed as a function of α instead of g. This allows us to directly select the regularization parameter α = δg. We put a noninformative prior distribution on α (or equivalently on g) and we select the regularization parameter that maximizes the posterior distribution of α. Clearly, the posterior distribution of α is proportional to the density in (4.1), so that it is enough to maximize the latter with respect to α. The nice result that we get is that the value of α maximizing the posterior distribution, which we denote by αMAP, is of the same order as the optimal one defined in Theorem 2.1.
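A minimal sketch of this empirical Bayes rule (our own illustration; the helper name and its inputs are hypothetical): maximize the logarithm of (4.1) over a grid of α values, given the eigenvalues λ_j² of Σ^{−1/2}KΩ0K∗Σ^{−1/2}, the eigenvalues l_j² of Σ, the coefficients of Yˆ − Kx0 in the eigenbasis of Σ, and the noise level δ.

```python
import numpy as np

# Minimal sketch (illustration only): maximize the log of the density (4.1)
# over a grid of alpha values.  lam2: eigenvalues of Sigma^{-1/2} K Omega_0 K* Sigma^{-1/2};
# l2: eigenvalues of Sigma; y_coef: coefficients of Y - K x0 in the eigenbasis
# of Sigma; delta: noise level.  How these are obtained depends on the application.
def alpha_map(lam2, l2, y_coef, delta, grid=None):
    z2 = y_coef**2 / (delta * l2)
    if grid is None:
        grid = np.logspace(-10, 0, 400)
    def log_density(alpha):
        return np.sum(0.5 * np.log(alpha / (lam2 + alpha))
                      + lam2 * z2 / (2.0 * (lam2 + alpha)))
    values = np.array([log_density(a) for a in grid])
    return grid[np.argmax(values)]
```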

Lemma 1. Under Assumptions 1, 2, 3 and if Σ^{−1/2} and KΩ0K∗ commute, then

∂ log(dP^g/dP^∞)/∂α ∼ O_p( α^{(a+β)/(a+s)} + δ α^{−γ} ).

The Maximum a Posteriori (MAP) estimator for α is of order αMAP ∝ δ^{(a+s)/(a+β+γ(a+s))}.

A proof of this lemma can be found in Section 5.3. This lemma gives a practical rule for selecting, from the data, a value for α that has the optimal rate of convergence. The hypothesis of this lemma could seem rarely satisfied. However, we stress that at least for all the applications entering in the examples given in Section 3


it is trivially verified.
Once we have fixed α equal to αMAP, ĝ = δ^{−1} αMAP and g equal to ĝ, several issues could be investigated. For instance, we plan to study the sampling properties of E(x|Yˆ, ĝ, s) and of the whole posterior distribution of x|Yˆ, s when g is fixed, the sampling and asymptotic properties of the posterior distribution of x|Yˆ, s when g is a hyperparameter that has been integrated out, and those of x|ĝ, Yˆ, s. Moreover, oracle inequalities remain to be studied. This can be done relatively easily through simulations with the following steps: (i) draw B independent samples and for each one obtain a value for αMAP; (ii) compute the empirical risk by using the bootstrapped values α^b_MAP; (iii) compute the risk associated to the oracle; (iv) compute the difference of these two risks. For numerical simulations of the posterior mean estimator showing the ability of the prior distribution to regularize, we refer to (24).

Remark 4.1. Instead of doing empirical Bayes estimation by plugging ĝ into the prior distribution for x, we could recover the posterior distribution of g|Yˆ and then draw a sample from it through methods like the rejection method or Metropolis-Hastings. This sample would be used for integrating out g, by Monte Carlo integration, in the distribution of x|g, Yˆ.

Remark 4.2. The specification of a prior distribution on the hyperparameter g that yields a posterior for g|Yˆ in closed form is not an easy task. For the finite dimensional linear regression model, a prior in the gamma form has been proposed in (29). In the Bayesian variable selection problem, different prior specifications for g have been proposed in (20). For our model (4.1) we propose a natural conjugate prior which, for some parameters ν0, µ0 > 0 and a sequence (a_j)_j with values in R, has kernel

α^{−ν0/2} [ det( α(αI + BB∗)^{−1} ) ]^{(µ0+1)/2} exp{ (1/2) ∑_j λ_j² a_j² / (α + λ_j²) }.

This prior depends on the operator K and we suggest that it could be used for selecting the operator itself (like model selection in finite dimensional regression models, see (20), (8)). Moreover, this prior can be thought of as the posterior distribution resulting from a "conceptual" sample and a non-informative prior, both the sampling and prior distributions being in the same family. A non-informative prior in the same family requires a_j = 0, ∀j.

5. Proofs

In order to prove Theorems 2.1 and 2.2, we need Corollary 8.22 in (4). We give here a simplified version of it:

Corollary 1. Let X_s, s ∈ R, be a Hilbert scale induced by L and let Σ^{−1/2}K : X → Y be a bounded operator satisfying ||L^{−a}x|| ∼ ||Σ^{−1/2}Kx||, ∀x ∈ X and for some a > 0. Then, for B = Σ^{−1/2}KL^{−s}, s ≥ 0 and |ν| ≤ 1,


||(B∗B)^{ν/2} x|| ∼ ||L^{−ν(a+s)} x||

and R((B∗B)^{ν/2}) = X_{ν(a+s)} ≡ D(L^{ν(a+s)}). We refer to (4) for its proof.

5.1. Proof of Theorem 2.1

The posterior bias E(x|Yˆ, g, s) − x∗ is rewritten as

E(x|Yˆ, g, s) − x∗ = −(I − AK)(x∗ − x0) + AU =: C + D,

where A is defined as in (2.2) and the MISE is E_{x∗}||E(x|Yˆ, g, s) − x∗||² = ||C||² + E_{x∗}||D||². Let ρ∗ ∈ X be such that (x∗ − x0) = L^{−β} ρ∗; then

||C||² = ||[ I − Ω0^{1/2}(αI + B∗B)^{−1}(Σ^{−1/2}B)∗ K ] L^{−β} ρ∗||²
       = ||Ω0^{1/2}[ I − (αI + B∗B)^{−1}(Σ^{−1/2}B)∗ KΩ0^{1/2} ] L^{s−β} ρ∗||²
       = ||(B∗B)^{s/(2(a+s))}[ I − (αI + B∗B)^{−1} B∗B ](B∗B)^{(β−s)/(2(a+s))} ṽ||²
       = ||α(αI + B∗B)^{−1}(B∗B)^{β/(2(a+s))} ṽ||²
       = O_p( α^{β/(a+s)} ).

The third equality is obtained by applying Corollary 1, and ṽ is an element of X such that L^{s−β} ρ∗ = (B∗B)^{(β−s)/(2(a+s))} ṽ.
Next, we address the second term of the MISE: E_{x∗}||D||² = tr(A Var(U) A∗). Application of Corollary 1 implies that R(Ω0^{1/2}) ≡ D(L^s) is equal to R((B∗B)^{s/(2(a+s))}), so that A = (B∗B)^{s/(2(a+s))}(αI + B∗B)^{−1}(Σ^{−1/2}B)∗ and then

tr(A Var(U) A∗) = tr[ (B∗B)^{s/(2(a+s))}(αI + B∗B)^{−1}(Σ^{−1/2}B)∗ δΣ Σ^{−1/2}B (αI + B∗B)^{−1}(B∗B)^{s/(2(a+s))} ]
                = δ tr[ (B∗B)^{s/(2(a+s))}(αI + B∗B)^{−1} B∗B (αI + B∗B)^{−1}(B∗B)^{s/(2(a+s))} ]

after simplification. Denoting by {λ_j²} the sequence of eigenvalues associated to BB∗, or equivalently to B∗B, we have

tr(A Var(U) A∗) = δ ∑_j λ_j^{2s/(a+s)+2} / (α + λ_j²)²
               ≤ δ ∑_j [ λ_j^{2s/(a+s)+2−2γ} / (α + λ_j²)² ] λ_j^{2γ}
               ≤ δ sup_j [ λ_j^{2s/(a+s)+2−2γ} / (α + λ_j²)² ] ∑_j λ_j^{2γ}
               = O_p( δ α^{−(γ(a+s)+a)/(a+s)} ),

where we have exploited Assumption 3 (c). In choosing α we find the usual trade-off: while ||C||² is increasing in α, E_{x∗}||D||² is decreasing in α. The optimal α, denoted by α∗, is the value for which ||C||² and E_{x∗}||D||² are of the same order:

α^{β/(a+s)} = δ α^{−(γ(a+s)+a)/(a+s)}   ⇔   α∗ = c1 δ^{(a+s)/(β+a+γ(a+s))},

with c1 some constant. The fastest speed of convergence of the posterior mean, obtained by substituting the optimal α∗, is of order δ^{β/(β+a+γ(a+s))}, which is decreasing in sγ.

5.2. Proof of Theorem 2.2

The asymptotic behavior of the posterior variance is similar to that of the term C considered in the proof of Theorem 2.1, scaled by the factor 1/g:

Var(x|Yˆ, g, s)φ = (1/g)[ Ω0 − Ω0^{1/2}(αI + B∗B)^{−1}(Σ^{−1/2}B)∗ KΩ0 ]φ,

for any φ ∈ X. The MISE of the posterior variance applied to an element φ ∈ R(Ω0^{β/(2s)}) has rate:

||Var(x|Yˆ, g, s)φ||² = ||(1/g) Ω0^{1/2}[ Ω0^{1/2} − (αI + B∗B)^{−1}(Σ^{−1/2}B)∗ KΩ0 ]φ||²
                      = ||(1/g) (B∗B)^{s/(2(a+s))}[ I − (αI + B∗B)^{−1} B∗ Σ^{−1/2} KΩ0^{1/2} ] Ω0^{1/2} φ||²
                      ≤ (1/g²) ||(B∗B)^{s/(2(a+s))}[ I − (αI + B∗B)^{−1} B∗B ](B∗B)^{(β+s)/(2(a+s))} υ||²
                      ∼ O_p( (1/g²) α^{(β+2s)/(a+s)} ),

where υ ∈ X is such that φ = (B∗B)^{β/(2(a+s))} υ.


5.3. Proof of Lemma 1

We first consider the density dP^g/dP^∞ in (4.1) with the product truncated at J < ∞. Its logarithm is proportional to

∑_{j=1}^J log[ α/(α + λ_j²) ] + ∑_{j=1}^J [ <K(x∗ − x0), ϕ_j>² + <U, ϕ_j>² + 2<K(x∗ − x0), ϕ_j><U, ϕ_j> ] λ_j² / [ δ l_j² (α + λ_j²) ],

after having replaced Yˆ by its expression. Then, we equate to zero the derivative with respect to α and we multiply it by δα:

I_J := (δ/α) ∑_{j=1}^J α λ_j² / (α + λ_j²)
     = α ∑_{j=1}^J <K(x∗ − x0), ϕ_j>² λ_j² / [ l_j²(α + λ_j²)² ] + α ∑_{j=1}^J <U, ϕ_j>² λ_j² / [ l_j²(α + λ_j²)² ]
       + 2α ∑_{j=1}^J <K(x∗ − x0), ϕ_j><U, ϕ_j> λ_j² / [ l_j²(α + λ_j²)² ]
     =: II_J + III_J + IV_J.

We take the limit for J → ∞ of each term:

lim_J I_J = (δ/α) lim_J ∑_{j=1}^J [ αλ_j^{2(1−γ)}/(α + λ_j²) ] λ_j^{2γ}
          ≤ (δ/α) sup_j [ αλ_j^{2(1−γ)}/(α + λ_j²) ] lim_J ∑_{j=1}^J λ_j^{2γ}
          = O_p( (δ/α^γ) lim_J ∑_{j=1}^J λ_j^{2γ} ),

and the limit of the sum is finite under Assumption 3 (c).
To analyze the term II_J, we notice that the fact that Σ^{−1/2} and KΩ0K∗ commute implies that they have the same eigenfunctions. Then, there exists {b_j} such that KΩ0K∗ ϕ_j = b_j ϕ_j. Moreover, {ϕ_j} are also the eigenfunctions of BB∗ since BB∗ ϕ_j = Σ^{−1/2} KΩ0K∗ Σ^{−1/2} ϕ_j = (b_j/l_j²) ϕ_j. Hence,


lim_J II_J = α lim_J ∑_{j=1}^J <KΩ0^{1/2} ξ∗, Σ^{−1/2} ϕ_j>² λ_j² / (α + λ_j²)²
           = α lim_J ∑_{j=1}^J <Ω0^{(β−s)/(2s)} ρ∗, ψ_j>² λ_j^4 / (α + λ_j²)²
           = α lim_J ∑_{j=1}^J <(B∗B)^{(β−s)/(2(a+s))} v, ψ_j>² λ_j^4 / (α + λ_j²)²
           = α lim_J ∑_{j=1}^J <v, ψ_j>² λ_j^{2(2+(β−s)/(a+s))} / (α + λ_j²)²
           ≤ α sup_j [ λ_j^{2(2+(β−s)/(a+s))} / (α + λ_j²)² ] lim_J ∑_{j=1}^J <v, ψ_j>²
           ∼ O_p( α^{(a+β)/(a+s)} ||v||² ).

By using the Markov inequality it is possible to show that the term III_J is negligible with respect to the term I_J and that the term IV_J is equal to zero in probability. Then, αMAP is such that α^{(a+β)/(a+s)} = δ α^{−γ} and the result follows.

References

[1] Carrasco, M. and Florens, J.P. (2007), 'Spectral Method for Deconvolving a Density', preprint.
[2] Carrasco, M., Florens, J.P. and Renault, E. (2005), 'Estimation Based on Spectral Decomposition and Regularization', Handbook of Econometrics, J.J. Heckman and E. Leamer, eds., 6, Elsevier, North Holland.
[3] Diaconis, P. and Freedman, D. (1986), 'On the Consistency of Bayes Estimates', Annals of Statistics, 14, 1-26.
[4] Engl, H.W., Hanke, M. and Neubauer, A. (2000), 'Regularization of Inverse Problems', Kluwer Academic, Dordrecht.
[5] Franklin, J.N. (1970), 'Well-posed Stochastic Extension of Ill-posed Linear Problems', Journal of Mathematical Analysis and Applications, 31, 682-716.
[6] Florens, J.P. and Simoni, A. (2008), 'Regularized Posteriors in Linear Ill-posed Inverse Problems', preprint.
[7] Florens, J.P. and Simoni, A. (2008), 'Regularized Posteriors in Linear Ill-posed Inverse Problems: Extensions', preprint.
[8] George, E.I. and Foster, D.P. (2000), 'Calibration and Empirical Bayes Variable Selection', Biometrika, 87, 4, 731-747.
[9] Ghosh, J.K. and Ramamoorthi, R.V. (2003), 'Bayesian Nonparametrics', Springer Series in Statistics.


[10] Hall, P. and Horowitz, J.L. (2007), 'Methodology and Convergence Rates for Functional Linear Regression', Annals of Statistics, 35, 70-91.
[11] Hiroshi, S. and Yoshiaki, O. (1975), 'Separabilities of a Gaussian Measure', Annales de l'I.H.P., Section B, 11, 3, 287-298.
[12] Johannes, J., Van Bellegem, S. and Vanhems, A. (2009), 'Convergence Rates for Ill-posed Inverse Problems with an Unknown Operator', preprint.
[13] Kaipio, J. and Somersalo, E. (2004), 'Statistical and Computational Inverse Problems', Applied Mathematical Series, vol. 160, Springer, Berlin.
[14] Krein, S.G. and Petunin, J.I. (1966), 'Scales of Banach Spaces', Russian Math. Surveys, 21, 85-160.
[15] Kress, R. (1999), 'Linear Integral Equations', Springer.
[16] Kuo, H.H. (1975), 'Gaussian Measures in Banach Spaces', Springer.
[17] Lehtinen, M.S., Päivärinta, L. and Somersalo, E. (1989), 'Linear Inverse Problems for Generalised Random Variables', Inverse Problems, 5, 599-612.
[18] Lepskii, O.V. (1990), 'A Problem of Adaptive Estimation in Gaussian White Noise', Theory Probab. Appl., 35, No. 3, 454-466. Translated from Teor. Veroyatnost. i Primenen., 35 (1990), No. 3, 459-470.
[19] Lepskii, O.V., Mammen, E. and Spokoiny, V.G. (1997), 'Optimal Spatial Adaptation to Inhomogeneous Smoothness: An Approach Based on Kernel Estimates with Variable Bandwidth Selectors', Annals of Statistics, 25, No. 3, 929-947.
[20] Liang, F., Paulo, R., Molina, G., Clyde, M. and Berger, J. (2005), 'Mixtures of g-priors for Bayesian Variable Selection', ISDS Technical Report, Duke University.
[21] Mandelbaum, A. (1984), 'Linear Estimators and Measurable Linear Transformations on a Hilbert Space', Z. Wahrscheinlichkeitstheorie, 3, 385-398.
[22] Neubauer, A. (1988), 'When do Sobolev Spaces Form a Hilbert Scale?', Proc. Amer. Math. Soc., 103, 557-562.
[23] Neveu, J. (1965), 'Mathematical Foundations of the Calculus of Probability', San Francisco: Holden-Day.
[24] Simoni, A. (2009), 'Bayesian Analysis of Linear Inverse Problems with Applications in Economics and Finance', PhD Dissertation, Université de Sciences Sociales, Toulouse.
[25] Van der Vaart, A.W. and Van Zanten, J.H. (2007), 'Bayesian Inference with Rescaled Gaussian Process Priors', Electronic Journal of Statistics, 1, 433-448.
[26] Van der Vaart, A.W. and Van Zanten, J.H. (2008), 'Rates of Contraction of Posterior Distributions Based on Gaussian Process Priors', Annals of Statistics, 36.
[27] Van der Vaart, A.W. and Van Zanten, J.H. (2008), 'Reproducing Kernel Hilbert Spaces of Gaussian Priors', IMS Collections, vol. 3, pp. 220-222, Institute of Mathematical Statistics.
[28] Van der Vaart, A.W. and Van Zanten, J.H., 'Adaptive Bayesian Estimation Using a Gaussian Random Field with Inverse Gamma Bandwidth', Annals of Statistics, to appear.
[29] Zellner, A. (1986a), 'On Assessing Prior Distributions and Bayesian


Regression Analysis with g-prior Distribution', in: Goel, P.K. and Zellner, A. (Eds), Bayesian Inference and Decision Techniques: Essays in Honour of Bruno de Finetti, pp. 233-243, Amsterdam: North Holland.
